[jira] Resolved: (NUTCH-816) Add zip target to build.xml

2010-05-08 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-816.
-

Resolution: Fixed

- fixed in r942427

> Add zip target to build.xml
> ---
>
> Key: NUTCH-816
> URL: https://issues.apache.org/jira/browse/NUTCH-816
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.0.0
> Environment: indep. of env.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> Just like we have an ant tar target (pun intended) we should have an ant zip 
> target. I'd like to have this ready for the release and future releases.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (NUTCH-816) Add zip target to build.xml

2010-05-08 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-816 started by Chris A. Mattmann.

> Add zip target to build.xml
> ---
>
> Key: NUTCH-816
> URL: https://issues.apache.org/jira/browse/NUTCH-816
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.0.0
> Environment: indep. of env.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> Just like we have an ant tar target (pun intended) we should have an ant zip 
> target. I'd like to have this ready for the release and future releases.




[jira] Created: (NUTCH-816) Add zip target to build.xml

2010-04-27 Thread Chris A. Mattmann (JIRA)
Add zip target to build.xml
---

 Key: NUTCH-816
 URL: https://issues.apache.org/jira/browse/NUTCH-816
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.0.0
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.1


Just like we have an ant tar target (pun intended) we should have an ant zip 
target. I'd like to have this ready for the release and future releases.




[jira] Commented: (NUTCH-814) SegmentMerger bug

2010-04-27 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861401#action_12861401
 ] 

Chris A. Mattmann commented on NUTCH-814:
-

Hey Andrzej,

After you commit this, should I cut a new RC (rc #3)?

Cheers,
Chris

> SegmentMerger bug
> -
>
> Key: NUTCH-814
> URL: https://issues.apache.org/jira/browse/NUTCH-814
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.1
>Reporter: Dennis Kubes
>Assignee: Andrzej Bialecki 
> Fix For: 1.1
>
> Attachments: merger.patch
>
>
> Dennis reported:
> {quote}
> In the SegmentMerger.java file about line 150 we have this:
>final SequenceFile.Reader reader =
>  new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(),
> job);
> Then about line 166 in the record reader we have this:
> boolean res = reader.next(key, w);
> If I am reading that right, that would mean that the map task would loop
> over all records for a given file and not just a given split.
> {quote}
> Right, this should instead use SequenceFileRecordReader that already has the 
> logic to handle splits. Patch coming shortly - thanks for spotting this! This 
> could be the reason for "out of disk space" errors that many users reported.
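
A minimal sketch of the difference Dennis describes, written in plain Java rather than against the Hadoop API so that it runs stand-alone. The Rec type, the byte offsets and both method names are assumptions made for illustration; none of this is taken from SegmentMerger.java or from the attached merger.patch. Real SequenceFile splits are aligned on sync markers, which this sketch ignores. A reader that ignores the split bounds returns every record in the file for every split, while a split-aware reader returns only the records that start inside its own split.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitReadSketch {
    /** Hypothetical record: a key plus the byte offset at which it starts in the file. */
    static final class Rec {
        final long offset;
        final String key;
        Rec(long offset, String key) { this.offset = offset; this.key = key; }
    }

    /** Illustrative file with four records at offsets 0, 50, 100 and 150. */
    static List<Rec> sampleFile() {
        List<Rec> file = new ArrayList<>();
        file.add(new Rec(0, "a"));
        file.add(new Rec(50, "b"));
        file.add(new Rec(100, "c"));
        file.add(new Rec(150, "d"));
        return file;
    }

    /** Behaviour described in the report: start and length are ignored, the whole file is read. */
    static List<String> readIgnoringSplit(List<Rec> file, long start, long length) {
        List<String> out = new ArrayList<>();
        for (Rec r : file) {
            out.add(r.key);
        }
        return out;
    }

    /** Split-aware behaviour: only records that start in [start, start + length) are returned. */
    static List<String> readWithinSplit(List<Rec> file, long start, long length) {
        List<String> out = new ArrayList<>();
        long end = start + length;
        for (Rec r : file) {
            if (r.offset >= start && r.offset < end) {
                out.add(r.key);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> file = sampleFile();
        // Two splits of 100 bytes each cover the whole file.
        int ignoring = readIgnoringSplit(file, 0, 100).size() + readIgnoringSplit(file, 100, 100).size();
        int within = readWithinSplit(file, 0, 100).size() + readWithinSplit(file, 100, 100).size();
        System.out.println("records processed when ignoring splits: " + ignoring);
        System.out.println("records processed when honouring splits: " + within);
    }
}
```

With two splits over the same file, the first variant processes each record twice, which matches the kind of output growth that could lead to the "out of disk space" reports mentioned above.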




[jira] Resolved: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-812.
-

Fix Version/s: 1.1
   Resolution: Fixed

- fixed in r935453. Thanks, Phil and Andrzej!

> Crawl.java incorrectly uses the Generator API resulting in NPE
> --
>
> Key: NUTCH-812
> URL: https://issues.apache.org/jira/browse/NUTCH-812
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.1
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
>Priority: Critical
> Fix For: 1.1
>
>
> As reported by Phil Barnett on nutch-user:
> {quote}
> The Fix.
> In line 131 of Crawl.java
> Generate no longer returns segments like it used to. Now it returns segs.
> line 131 needs to read
>  if (segs == null)
>  Instead of the current
> if (segments == null)
> After that change and a recompile, crawl is working just fine.
> {quote}
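
A minimal sketch of the pattern behind Phil's fix, assuming only what the report states: the generate step hands back a value (called segs here) that is null when there is nothing left to fetch, and the loop must test that returned value. The generate method below is a hypothetical stand-in that returns plain strings; it is not the Nutch Generator API, and the class is not Crawl.java.

```java
final class CrawlLoopSketch {
    /** Hypothetical stand-in for the generate step: returns null once nothing is left to fetch. */
    static String[] generate(int round) {
        return round < 2 ? new String[] { "segment-" + round } : null;
    }

    /** Runs up to depth rounds and stops as soon as generate returns null. */
    static int crawl(int depth) {
        int processed = 0;
        for (int i = 0; i < depth; i++) {
            String[] segs = generate(i);
            if (segs == null) {
                // Test the value generate just returned; testing some other
                // variable lets a null slip through and fail later with an NPE.
                break;
            }
            processed += segs.length;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("segments processed: " + crawl(5));
    }
}
```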




[jira] Assigned: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-812:
---

Assignee: Chris A. Mattmann

> Crawl.java incorrectly uses the Generator API resulting in NPE
> --
>
> Key: NUTCH-812
> URL: https://issues.apache.org/jira/browse/NUTCH-812
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.1
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
>Priority: Critical
>
> As reported by Phil Barnett on nutch-user:
> {quote}
> The Fix.
> In line 131 of Crawl.java
> Generate no longer returns segments like it used to. Now it returns segs.
> line 131 needs to read
>  if (segs == null)
>  Instead of the current
> if (segments == null)
> After that change and a recompile, crawl is working just fine.
> {quote}




[jira] Work started: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-812 started by Chris A. Mattmann.

> Crawl.java incorrectly uses the Generator API resulting in NPE
> --
>
> Key: NUTCH-812
> URL: https://issues.apache.org/jira/browse/NUTCH-812
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.1
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
>Priority: Critical
>
> As reported by Phil Barnett on nutch-user:
> {quote}
> The Fix.
> In line 131 of Crawl.java
> Generate no longer returns segments like it used to. Now it returns segs.
> line 131 needs to read
>  if (segs == null)
>  Instead of the current
> if (segments == null)
> After that change and a recompile, crawl is working just fine.
> {quote}




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854767#action_12854767
 ] 

Chris A. Mattmann commented on NUTCH-570:
-

Hi Otis:

I think your logic is perfectly rational here. Maybe you could leave it open for 
another 48 hrs, and then close it out if you don't get any feedback from the 
original reporter, or those that were interested.

Cheers,
Chris


> Improvement of URL Ordering in Generator.java
> -
>
> Key: NUTCH-570
> URL: https://issues.apache.org/jira/browse/NUTCH-570
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Reporter: Ned Rockson
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
> (50-100M at a time).  I found that the URLs generated are not optimal because 
> they are simply randomized by a hash comparator.  In one crawl on 24 machines 
> it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
> had set with regular Fetcher.java this was at least 3 fold more time.
> Anyway, I realized that the best situation for ordering can be approached by 
> randomization, but in order to get optimal ordering, urls from the same host 
> should be as far apart in the list as possible.  So I wrote a series of 2 
> map/reduces to optimize the ordering and for a list of 25M documents it takes 
> about 10 minutes on our cluster.  Right now I have it in its own class, but I 
> figured it can go in Generator.java and just add a flag in nutch-default.xml 
> determining if the user wants to use it.
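
A rough single-machine sketch of the ordering idea, assuming a simple round-robin over per-host queues. This is not the reporter's two-step map/reduce implementation (that is in the attached GeneratorDiff files), and the class and method names are made up for illustration. Round-robin keeps URLs from the same host apart while several hosts still have URLs left; a host with far more URLs than the others will still cluster at the tail of the list.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HostSpreadSketch {
    /** Reorders urls so that consecutive entries come from different hosts where possible. */
    static List<String> spreadByHost(List<String> urls) {
        // Group URLs by host, keeping first-seen host order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String u : urls) {
            String host = URI.create(u).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(u);
        }
        // Take one URL from each host in turn until every queue is empty.
        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : byHost.values()) {
                String next = queue.poll();
                if (next != null) {
                    ordered.add(next);
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        urls.add("http://a.com/1");
        urls.add("http://a.com/2");
        urls.add("http://b.com/1");
        System.out.println(spreadByHost(urls));
    }
}
```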




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853285#action_12853285
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Hey Julien, Tika 0.7 is available from Maven central:

http://repo1.maven.org/maven2/org/apache/tika/tika-parsers/

Cheers,
Chris


> Improvements to Tika parser
> ---
>
> Key: NUTCH-789
> URL: https://issues.apache.org/jira/browse/NUTCH-789
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
> Environment: reported by Sami, in NUTCH-766
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.1
>
> Attachments: NutchTikaConfig.java, TikaParser.java
>
>
> As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
> Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853212#action_12853212
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Hey Julien -- okey dok, Tika 0.7 has been released. Feel free to upgrade, and 
close this one out...after that, I'll cut the Nutch 1.1 RC.

Thanks!

Cheers,
Chris


> Improvements to Tika parser
> ---
>
> Key: NUTCH-789
> URL: https://issues.apache.org/jira/browse/NUTCH-789
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
> Environment: reported by Sami, in NUTCH-766
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.1
>
> Attachments: NutchTikaConfig.java, TikaParser.java
>
>
> As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
> Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852101#action_12852101
 ] 

Chris A. Mattmann commented on NUTCH-794:
-

Hey Julien, yepper, I posted an RC of Tika 0.7, see: http://bit.ly/c7FZRc. If 
the VOTE passes on that in say the next 72 hours, I will push out a Tika 0.7 
release to the mirrors. If everyone is OK with that, we can release Nutch 1.1 
after...thoughts?

> Language Identification must use check the parse metadata for language values 
> --
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> document 1 titlejotain 
> suomeksi
> is rendered as the following xhtml by Tika : 
>  xmlns="http://www.w3.org/1999/xhtml">document 1 
> titlejotain suomeksi
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852048#action_12852048
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Folks, I'm going to put together an RC for Tika 0.7 and take care of JIRA now. 
Once I do that, we can try and close out this issue for 1.1. I should be able 
to do this before the 48 hr deadline I threw up for Nutch 1.1...

> Improvements to Tika parser
> ---
>
> Key: NUTCH-789
> URL: https://issues.apache.org/jira/browse/NUTCH-789
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
> Environment: reported by Sami, in NUTCH-766
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.1
>
> Attachments: NutchTikaConfig.java, TikaParser.java
>
>
> As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
> Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852047#action_12852047
 ] 

Chris A. Mattmann commented on NUTCH-673:
-

Folks: if you get time to put together a patch for 1.1 or feel that this should 
go into 1.1, please see:  http://bit.ly/c7tBv9 and comment in the next 48 hrs...

> Upgrade the Carrot2 plug-in to release 3.0
> --
>
> Key: NUTCH-673
> URL: https://issues.apache.org/jira/browse/NUTCH-673
> Project: Nutch
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: 0.9.0
> Environment: All Nutch deployments.
>Reporter: Sean Dean
>Priority: Minor
>
> Release 3.0 of the Carrot2 plug-in was released recently.
> We currently have version 2.1 in the source tree and upgrading it to the 
> latest version before 1.0-release might make sense.
> Details on the release can be found here: 
> http://project.carrot2.org/release-3.0-notes.html
> One major change in requirements is for JDK 1.5 to be used, but this is also 
> now required for Hadoop 0.19 so this wouldn't be the only reason for the 
> switch.




[jira] Updated: (NUTCH-771) Add WebGraph classes to the bin/nutch script

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-771:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Add WebGraph classes to the bin/nutch script
> 
>
> Key: NUTCH-771
> URL: https://issues.apache.org/jira/browse/NUTCH-771
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.1
> Environment: All, shell script
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
>
> Currently the webgraph jobs are called on the command line by calling main 
> methods on their classes.  I propose to upgrade the bin/nutch shell script to 
> allow calling these jobs as well.  This would include the webgraphdb, 
> linkrank, scoreupdater, and nodedumper jobs.




[jira] Updated: (NUTCH-475) Adaptive crawl delay

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-475:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Adaptive crawl delay
> 
>
> Key: NUTCH-475
> URL: https://issues.apache.org/jira/browse/NUTCH-475
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Doğacan Güney
> Attachments: adaptive-delay_draft.patch
>
>
> Current fetcher implementation waits a default interval before making another 
> request to the same server (if crawl-delay is not specified in robots.txt). 
> IMHO, an adaptive implementation will be better. If the server is under 
> little load and can serve requests quickly, then the fetcher can ask for more 
> pages in a given interval. Similarly, if the server is suffering from heavy 
> load, the fetcher can slow down (w.r.t. that host), easing the load on the server.
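
One way the adaptive idea could look, as a hedged sketch: halve the per-host delay after a fast response, double it after a slow one, and clamp the result. All thresholds, bounds and names below are assumptions chosen for illustration; they are not taken from adaptive-delay_draft.patch or from the Nutch fetcher. As in the description, this would only apply when robots.txt does not specify a crawl-delay.

```java
final class AdaptiveDelaySketch {
    static final long MIN_DELAY_MS = 1_000;   // assumed lower bound on politeness delay
    static final long MAX_DELAY_MS = 30_000;  // assumed upper bound
    static final long FAST_RESPONSE_MS = 500; // assumed "server is lightly loaded" threshold
    static final long SLOW_RESPONSE_MS = 2_000; // assumed "server is struggling" threshold

    /** Returns the delay to use before the next request to the same host. */
    static long nextDelay(long currentDelayMs, long lastResponseMs) {
        long next = currentDelayMs;
        if (lastResponseMs < FAST_RESPONSE_MS) {
            next = currentDelayMs / 2;      // server answered quickly: fetch more often
        } else if (lastResponseMs > SLOW_RESPONSE_MS) {
            next = currentDelayMs * 2;      // server answered slowly: back off
        }
        return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, next));
    }

    public static void main(String[] args) {
        long delay = 5_000;
        long[] responseTimes = { 100, 100, 3_000, 1_000 };
        for (long rt : responseTimes) {
            delay = nextDelay(delay, rt);
            System.out.println("response " + rt + " ms, next delay " + delay + " ms");
        }
    }
}
```

The clamp keeps a consistently fast server from driving the delay to zero and keeps a slow one from stalling the fetch queue indefinitely.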




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


Patch Info: [Patch Available]

> Analysis plugins for multiple language and new Language Identifier Tool
> ---
>
> Key: NUTCH-666
> URL: https://issues.apache.org/jira/browse/NUTCH-666
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.1
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch
>
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
> Russian, and Thai.  Also includes a new Language Identifier tool that uses 
> the new indexing framework in NUTCH-646.




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


 Due Date: 27/Nov/08  (was: 27/Nov/08)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Analysis plugins for multiple language and new Language Identifier Tool
> ---
>
> Key: NUTCH-666
> URL: https://issues.apache.org/jira/browse/NUTCH-666
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.1
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch
>
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
> Russian, and Thai.  Also includes a new Language Identifier tool that uses 
> the new indexing framework in NUTCH-646.




[jira] Updated: (NUTCH-583) FeedParser empty links for items

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-583:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> FeedParser empty links for items
> 
>
> Key: NUTCH-583
> URL: https://issues.apache.org/jira/browse/NUTCH-583
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
>
> FeedParser in the feed plugin just discards the item if it does not have a 
> <link> element. However, RSS 2.0 does not require the <link> element for each 
> <item>. 
> Moreover, sometimes the link is given in the <guid> element, which is a 
> globally unique identifier for the item. I think we can search for the item's 
> url first, then, if it is still not found, we can use the feed's url, but 
> with merging all the parse texts into one Parse object. 
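
The fallback order proposed above can be sketched as follows, assuming the "globally unique identifier" refers to the RSS guid element and that a guid is only usable when it looks like an http(s) URL. The method names and the URL check are assumptions for illustration and are not taken from the feed plugin; merging the parse texts into one Parse object is not shown.

```java
final class FeedItemUrlSketch {
    /** Picks the url for a feed item: link first, then a url-like guid, then the feed's own url. */
    static String resolveItemUrl(String link, String guid, String feedUrl) {
        if (link != null && !link.trim().isEmpty()) {
            return link.trim();
        }
        if (guid != null && looksLikeUrl(guid.trim())) {
            return guid.trim();
        }
        return feedUrl;
    }

    /** Assumed heuristic: treat only http and https identifiers as fetchable urls. */
    static boolean looksLikeUrl(String s) {
        return s.startsWith("http://") || s.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(resolveItemUrl(null, "http://example.org/post/2", "http://example.org/feed"));
    }
}
```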




[jira] Updated: (NUTCH-628) Host database to keep track of host-level information

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-628:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Host database to keep track of host-level information
> -
>
> Key: NUTCH-628
> URL: https://issues.apache.org/jira/browse/NUTCH-628
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Otis Gospodnetic
> Attachments: domain_statistics_v2.patch, 
> NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch
>
>
> Nutch would benefit from having a DB with per-host/domain/TLD information.  
> For instance, Nutch could detect hosts that are timing out, store information 
> about that in this DB.  Segment/fetchlist Generator could then skip such 
> hosts, so they don't slow down the fetch job.  Another good use for such a DB 
> is keeping track of various host scores, e.g. spam score.
> From the recent thread on nutch-u...@lucene:
> Otis asked:
> > While we are at it, how would one go about implementing this DB, as far as 
> > its structures go?
> Andrzej said:
> The easiest I can imagine is to use something like .
> This way you could store arbitrary information under arbitrary keys.
> I.e. a single database then could keep track of aggregate statistics at
> different levels, e.g. TLD, domain, host, ip range, etc. The basic set
> of statistics could consist of a few predefined gauges, totals and averages.
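
A small in-memory sketch of the "arbitrary information under arbitrary keys" idea, assuming string keys with a level prefix (for example "host:" or "tld:") and named long counters. The class, the key scheme and the skip threshold are all hypothetical; the actual storage structure Andrzej had in mind is not recoverable from the text above, and the attached patches may differ.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class HostStatsSketch {
    // Outer key: an aggregation level plus name, e.g. "host:www.example.com" or "tld:com".
    // Inner map: named counters such as "timeouts" or "fetched".
    private final Map<String, Map<String, Long>> stats = new HashMap<>();

    /** Adds delta to the named counter stored under key. */
    void increment(String key, String counter, long delta) {
        stats.computeIfAbsent(key, k -> new HashMap<>()).merge(counter, delta, Long::sum);
    }

    /** Returns the counter value, or 0 if nothing has been recorded. */
    long get(String key, String counter) {
        return stats.getOrDefault(key, Collections.<String, Long>emptyMap()).getOrDefault(counter, 0L);
    }

    /** A generator could consult this to leave hosts that keep timing out off the fetchlist. */
    boolean shouldSkip(String hostKey, long maxTimeouts) {
        return get(hostKey, "timeouts") >= maxTimeouts;
    }

    public static void main(String[] args) {
        HostStatsSketch db = new HostStatsSketch();
        db.increment("host:slow.example.com", "timeouts", 3);
        db.increment("tld:com", "timeouts", 3);
        System.out.println("skip slow.example.com: " + db.shouldSkip("host:slow.example.com", 3));
    }
}
```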




[jira] Updated: (NUTCH-650) Hbase Integration

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-650:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Hbase Integration
> -
>
> Key: NUTCH-650
> URL: https://issues.apache.org/jira/browse/NUTCH-650
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.0.0
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
> malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, 
> NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch
>
>
> This issue will track nutch/hbase integration




[jira] Updated: (NUTCH-716) Make subcollection index filed multivalued

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-716:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Make subcollection index filed multivalued
> --
>
> Key: NUTCH-716
> URL: https://issues.apache.org/jira/browse/NUTCH-716
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.0.0
>Reporter: Dmitry Lihachev
> Attachments: NUTCH-716_multivalued_subcollection.patch
>
>





[jira] Updated: (NUTCH-541) Index url field untokenized

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-541:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Index url field untokenized
> ---
>
> Key: NUTCH-541
> URL: https://issues.apache.org/jira/browse/NUTCH-541
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
>Affects Versions: 1.0.0
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
>
> The url field is indexed as Store.YES, Index.TOKENIZED. We also need the 
> untokenized version of the url field in some contexts: 
> 1. For deleting duplicates by url (at search time). See NUTCH-455.
> 2. For restricting the search to a certain url (may be used in the case of 
> RSS search, where each entry in the RSS feed is added as a distinct document 
> with (possibly) the same url). 
>query-url extends FieldQueryFilter so: 
> Query: url:http://www.apache.org/
> Parsed: url:"http http-www http-www-apache www www-apache apache org"
> Translated: +url:"http-http-www http-www-http-www-apache 
> http-www-apache-www www-www-apache www-apache apache org"
> 3. For accessing document(s) in the search servers (using a query plugin).
> I suggest we add url as in index-basic and implement a query-url-untoken 
> plugin. 
> doc.add(new Field("url", url.toString(), Field.Store.YES, 
> Field.Index.TOKENIZED));
> doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
> Field.Index.UN_TOKENIZED));




[jira] Updated: (NUTCH-717) Make Nutch Solr integration easier

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-717:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Make Nutch Solr integration easier
> --
>
> Key: NUTCH-717
> URL: https://issues.apache.org/jira/browse/NUTCH-717
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Sami Siren
>
> Erik Hatcher proposed we should provide a full solr config dir to be used 
> with Nutch-Solr. Now we only provide the index schema. It would be considerably 
> easier to set up nutch-solr if we provided the whole conf dir that you could 
> use with solr like:
> java -Dsolr.solr.home= -jar start.jar




[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-573:


Fix Version/s: (was: 1.1)

> Multiple Domains - Query Search
> ---
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
>  Issue Type: Improvement
>  Components: searcher
>Affects Versions: 0.9.0
> Environment: All
>Reporter: Rajasekar Karthik
>Assignee: Enis Soztutar
> Attachments: multiTermQuery_v1.patch
>
>
> Searching multiple domains can be done on Lucene - but not that efficiently 
> on nutch.
> Query:
> +content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")
> works on lucene but the same concept does not work on nutch.
> In Lucene, it works with 
> org.apache.lucene.analysis.KeywordAnalyzer
> org.apache.lucene.analysis.standard.StandardAnalyzer 
> but NOT on
> org.apache.lucene.analysis.SimpleAnalyzer 
> Is Nutch analyzer based on SimpleAnalyzer? In this case, is there a 
> workaround to make this work? Is there an option to change what analyzer 
> nutch is using? 
> Just FYI, another solution (inefficient I believe) which seems to be working 
> on nutch
>  -site:"ccc.com" -site:"ddd.com" 




[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-573:



- pushing this out per http://bit.ly/c7tBv9

> Multiple Domains - Query Search
> ---
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
>  Issue Type: Improvement
>  Components: searcher
>Affects Versions: 0.9.0
> Environment: All
>Reporter: Rajasekar Karthik
>Assignee: Enis Soztutar
> Attachments: multiTermQuery_v1.patch
>
>
> Searching multiple domains can be done on Lucene - but not that efficiently 
> on nutch.
> Query:
> +content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")
> works on lucene but the same concept does not work on nutch.
> In Lucene, it works with 
> org.apache.lucene.analysis.KeywordAnalyzer
> org.apache.lucene.analysis.standard.StandardAnalyzer 
> but NOT on
> org.apache.lucene.analysis.SimpleAnalyzer 
> Is Nutch analyzer based on SimpleAnalyzer? In this case, is there a 
> workaround to make this work? Is there an option to change what analyzer 
> nutch is using? 
> Just FYI, another solution (inefficient I believe) which seems to be working 
> on nutch
>  -site:"ccc.com" -site:"ddd.com" 




[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-729:


 Due Date: 26/Mar/09  (was: 26/Mar/09)
   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> NPE in FieldIndexer when BasicFields url doesn't exist
> --
>
> Key: NUTCH-729
> URL: https://issues.apache.org/jira/browse/NUTCH-729
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 0.9.0, 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Attachments: NUTCH-729-1-20090235.patch
>
>
> There is a NullPointerException during a logging call in FieldIndexer when 
> there isn't a url for a document.  Documents shouldn't be without urls but 
> since the FieldIndexer doesn't validate fields it is possible for it to 
> occur.  Most often this happens when BasicFields is run with the wrong 
> segments directory and doesn't complain.  It could also occur if using the 
> FieldIndexer to index things other than basic fields.




[jira] Updated: (NUTCH-460) RDF parser plugin

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Patch Info: [Patch Available]

- pushing this out per http://bit.ly/c7tBv9

> RDF parser plugin
> -
>
> Key: NUTCH-460
> URL: https://issues.apache.org/jira/browse/NUTCH-460
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Ricardo J. Méndez
> Attachments: rubyspider-rdf.zip
>
>
> I've written a couple plugins that I'd like to contribute.  
> RDFLinkParseFilter looks for links on the pages that point towards RDF 
> information, and tags the pages with metadata about the type of links they 
> hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
> information from several possible formats using Jena, and extracts the links 
> that the file points to as Outlinks so that they can be fetched as well.




[jira] Updated: (NUTCH-460) RDF parser plugin

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> RDF parser plugin
> -
>
> Key: NUTCH-460
> URL: https://issues.apache.org/jira/browse/NUTCH-460
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Ricardo J. Méndez
> Attachments: rubyspider-rdf.zip
>
>
> I've written a couple plugins that I'd like to contribute.  
> RDFLinkParseFilter looks for links on the pages that point towards RDF 
> information, and tags the pages with metadata about the type of links they 
> hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
> information from several possible formats using Jena, and extracts the links 
> that the file points to as Outlinks so that they can be fetched as well.




[jira] Updated: (NUTCH-774) Retry interval in crawl date is set to 0

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-774:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Retry interval in crawl date is set to 0
> 
>
> Key: NUTCH-774
> URL: https://issues.apache.org/jira/browse/NUTCH-774
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Reinhard Schwab
>Assignee: Andrzej Bialecki 
> Attachments: NUTCH-774.patch, NUTCH-774_2.patch
>
>
> When I fetch and parse a feed with the feed plugin,
> http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
> another crawl date is generated:
> http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
> After fetching a second round,
> the dump of the crawl db still shows a retry interval with value 0:
> http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ 
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Wed Dec 02 12:48:22 CET 2009
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.084
> Signature: db9ab2193924cd2d0b53113a500ca604
> Metadata: _pst_: success(1), lastModified=0
> A check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in 
> the setFetchSchedule method.
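
The check requested above could look roughly like the following sketch. It does not use the real Nutch FetchSchedule API; the class `RetryIntervalGuard`, its methods, and the 30-day default are assumptions made for illustration only.

```java
/** Hypothetical sketch, not the actual Nutch FetchSchedule classes. */
final class RetryIntervalGuard {
  /** Assumed default of 30 days, in seconds; in Nutch the default is configurable. */
  static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

  /** Returns a usable interval: non-positive stored values fall back to the default. */
  static int effectiveInterval(int storedIntervalSeconds) {
    return storedIntervalSeconds > 0 ? storedIntervalSeconds : DEFAULT_INTERVAL_SECONDS;
  }

  /** Computes the next fetch time in milliseconds from the last fetch time. */
  static long nextFetchTime(long lastFetchMillis, int storedIntervalSeconds) {
    // Multiply by a long constant so the product does not overflow int.
    return lastFetchMillis + effectiveInterval(storedIntervalSeconds) * 1000L;
  }
}
```

With such a guard, an entry stored with a retry interval of 0 would be scheduled with the default interval instead of becoming due again immediately.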




[jira] Updated: (NUTCH-677) Segment merge filtering based on segment content

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-677:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Segment merge filtering based on segment content
> ---
>
> Key: NUTCH-677
> URL: https://issues.apache.org/jira/browse/NUTCH-677
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
>Reporter: Marcin Okraszewski
> Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
> SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, 
> SegmentMergeFilters.java
>
>
> I needed segment filtering based on metadata detected during the parse phase. 
> Unfortunately the current URL-based filtering does not allow for this. So I have 
> created a new SegmentMergeFilter extension which receives the segment entry 
> being merged and decides whether it should be included. Even though I 
> needed only ParseData for my purpose, I have made it more general, 
> so the filter receives all merged data.
> The attached patch is for version 0.9, which I use. Unfortunately I didn't 
> have time to check how it fits the trunk version. Sorry :(




[jira] Updated: (NUTCH-479) Support for OR queries

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-479:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Support for OR queries
> --
>
> Key: NUTCH-479
> URL: https://issues.apache.org/jira/browse/NUTCH-479
> Project: Nutch
>  Issue Type: Improvement
>  Components: searcher
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: nutch_0.9_OR.patch, or.patch, or.patch
>
>
> There have been many requests from users to extend Nutch query syntax to add 
> support for OR queries, in addition to the implicit AND and NOT queries 
> supported now.




[jira] Updated: (NUTCH-747) inject&Index metadatas and inherit these metadatas to all matching suburls

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-747:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> inject&Index metadatas and inherit these metadatas to all matching suburls
> --
>
> Key: NUTCH-747
> URL: https://issues.apache.org/jira/browse/NUTCH-747
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, injector
>Reporter: Marko Bauhardt
> Attachments: index-metadata.patch, metadata.patch
>
>
> Hi.
> The following two patches support:
> + injecting metadata for urls into a metadatadb
> url.com   :  : 
>  ...
> ...
> + updating the parse_data metadata of a shard and writing the metadata to all 
> fetched urls that start with a url from the metadatadb
> + inheritance: the metadata is passed on to all matching suburls
> The second patch implements an index-metadata plugin.
> + this plugin extracts all metadata from the parse_data of a shard and indexes 
> it. Which metadata is indexed can be configured in plugin.properties.
> + to index, for example, the lang you have to configure plugin.properties: 
> lang=STORE,UNTOKENIZED
> + that means the index plugin extracts metadata values with the key "lang". If 
> they exist, all values are indexed, stored and untokenized
> Example
> Create start urls in "/tmp/urls/start/urls.txt"
> http://lucene.apache.org/nutch/apidocs-1.0/index.html
> http://lucene.apache.org/nutch/apidocs-0.9/index.html
> Create metadata urls in "/tmp/urls/metadata/urls.txt"
> http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0
> http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9
> Inject Urls
> bin/nutch inject crawldb /tmp/urls/start/
> bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb 
> /tmp/urls/metadata/
> Fetch & Parse & Update
> bin/nutch generate crawldb segments
> bin/nutch fetch segments/20090806105717/
> bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb 
> segments/20090806105717
> bin/nutch updatedb crawldb/ segments/20090806105717/
> Fetch & Parse & Update Again
> ...
> Index
> bin/nutch invertlinks linkdb -dir segments/
> bin/nutch index index crawldb/ linkdb/ segments/20090806105717 
> segments/20090806110127
> Check your Index
> All urls starting with "http://lucene.apache.org/nutch/apidocs-1.0/ " are 
> indexed with "version:1.0".
> All urls starting with "http://lucene.apache.org/nutch/apidocs-0.9/ " are 
> indexed with "version:0.9".
> This issue is somewhat related to http://issues.apache.org/jira/browse/NUTCH-655
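
The prefix-based inheritance described in the example can be sketched as follows. This is an illustration only, not code from the attached patches; the class `PrefixMetadata` and its method are hypothetical, and an in-memory map stands in for the metadatadb.

```java
import java.util.Map;

/** Hypothetical sketch of metadata inheritance by url prefix, not the patch itself. */
final class PrefixMetadata {
  /**
   * Returns the metadata of the first registered prefix that the url starts with,
   * or null when no prefix matches.
   */
  static String inherited(String url, Map<String, String> metadataByPrefix) {
    for (Map.Entry<String, String> entry : metadataByPrefix.entrySet()) {
      if (url.startsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    return null;
  }
}
```

With the two metadata urls from the example registered as prefixes, every fetched url under apidocs-1.0/ would resolve to "version:1.0" and every url under apidocs-0.9/ to "version:0.9".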




[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-455:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> dedup on tokenized fields is faulty
> ---
>
> Key: NUTCH-455
> URL: https://issues.apache.org/jira/browse/NUTCH-455
> Project: Nutch
>  Issue Type: Bug
>  Components: searcher
>Affects Versions: 0.9.0
>Reporter: Enis Soztutar
> Attachments: IndexSearcherCacheWarm.patch
>
>
> (From LUCENE-252) 
> Nutch uses several index servers, and the search results from these servers 
> are merged using a dedup field for deleting duplicates. The values from 
> this field are cached by Lucene's FieldCacheImpl. The default is the site 
> field, which is indexed and tokenized. However, for a tokenized field (for 
> example "url" in Nutch), FieldCacheImpl returns an array of terms rather than 
> an array of field values, so dedup'ing becomes faulty. The current FieldCache 
> implementation does not respect tokenized fields and, as described above, 
> caches only terms. 
> So when we search using "url" as the dedup field and a Hit is constructed in 
> IndexSearcher, the dedupValue becomes a token of the url (such as "www" or 
> "com") rather than the whole url. This prevents using tokenized fields as 
> the dedup field. 
> I have written a patch for lucene and attached it in 
> http://issues.apache.org/jira/browse/LUCENE-252, this patch fixes the 
> aforementioned issue about tokenized field caching. However building such a 
> cache for about 1.5M documents takes 20+ secs. The code in 
> IndexSearcher.translateHits() starts with
> if (dedupField != null) 
>   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
> and for the first call of search in IndexSearcher, cache is built. 
> Long story short, I have written a patch against IndexSearcher which warms up 
> the caches of the wanted fields (configurable) in the constructor. I think we 
> should vote for LUCENE-252, and then commit the above patch with the latest 
> version of Lucene.
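
To illustrate why a token-valued dedup key is faulty, here is a small self-contained sketch. It does not use Lucene or Nutch classes; `DedupSketch` is hypothetical, and `lastToken` is a simplification that stands in for a cache returning a single term of the url instead of the whole value.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: dedup by whole field value versus by a single token. */
final class DedupSketch {
  /** Keeps the first hit for each distinct dedup key, preserving order. */
  static List<String> dedup(List<String> urls, Function<String, String> keyOf) {
    Set<String> seen = new HashSet<>();
    List<String> kept = new ArrayList<>();
    for (String url : urls) {
      if (seen.add(keyOf.apply(url))) {  // add() returns false for a repeated key
        kept.add(url);
      }
    }
    return kept;
  }

  /** Simulates a term-level cache entry: only the last alphanumeric token survives. */
  static String lastToken(String url) {
    String[] tokens = url.split("[^A-Za-z0-9]+");
    return tokens.length == 0 ? "" : tokens[tokens.length - 1];
  }
}
```

Deduplicating "http://www.a.com" and "http://www.b.com" by the whole url keeps both hits, while deduplicating by `lastToken` collapses them because both map to "com".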




[jira] Updated: (NUTCH-540) some problem about the Nutch cache

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-540:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> some problem about the Nutch cache
> --
>
> Key: NUTCH-540
> URL: https://issues.apache.org/jira/browse/NUTCH-540
> Project: Nutch
>  Issue Type: Bug
>  Components: searcher
>Affects Versions: 0.9.0
> Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
>Reporter: crossany
> Attachments: 1.gif, 1186733525.jpg
>
>
> I am Chinese.
> I tested searching for Chinese words in Nutch. I installed Nutch 0.9 in 
> Tomcat 5 on Linux. The Tomcat charset is UTF-8, and I used Nutch to crawl a 
> Chinese website whose charset is also UTF-8. When I use Nutch on Tomcat to 
> search for a Chinese word, the title and description of the search results 
> are displayed correctly, but when I click the cache link, the cached page is 
> displayed with the wrong character encoding, even though the cached page's 
> charset is also UTF-8. I found a website that uses Nutch, 
> http://www.synoo.com:8080/zh/, and tested searching for a Chinese word 
> there. It shows the same error.
> I used Luke to inspect the segments, and it can display the Chinese words, 
> so I think this may be a bug.




[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-578:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> URL fetched with 403 is generated over and over again
> -
>
> Key: NUTCH-578
> URL: https://issues.apache.org/jira/browse/NUTCH-578
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.0.0
> Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
> have checked out the most recent version of the trunk as of Nov 20, 2007
>Reporter: Nathaniel Powell
>Assignee: Dennis Kubes
> Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
> NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, 
> regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL which is on the site that I'm crawling, 
> www.teachertube.com, which keeps being generated over and over again for 
> almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
> url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.




[jira] Resolved: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-794.
-

Resolution: Fixed

@julien -- I think this issue has been fixed in Tika right? If not, feel free 
to reopen, or better yet, re-file the issue against a post 1.1 Nutch release. 
Thanks!

> Language Identification must use check the parse metadata for language values 
> --
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> document 1 titlejotain 
> suomeksi
> is rendered as the following xhtml by Tika : 
>  xmlns="http://www.w3.org/1999/xhtml";>document 1 
> titlejotain suomeksi
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-609:


 Due Date: 13/Feb/08  (was: 13/Feb/08)
   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Allow Plugins to be Loaded from Jar File(s)
> ---
>
> Key: NUTCH-609
> URL: https://issues.apache.org/jira/browse/NUTCH-609
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
>Priority: Minor
> Attachments: NUTCH-609-1-20080212.patch
>
>
> Currently plugins cannot be loaded from a jar file.  Plugins must be unzipped 
> in one or more directories specified by the plugin.folders config.  I have 
> been thinking about an extension to PluginRepository or PluginManifestParser 
> (or both) that would allow plugins to be packaged into multiple independent 
> jar files and placed on the classpath.  The system would search the classpath 
> for resources with the correct folder name and would load any plugins in 
> those jars.
> This functionality would be very useful in making the nutch core more 
> flexible in terms of packaging.  It would also help with web applications 
> where we don't want to have a plugins directory included in the webapp.
> My thoughts so far are to unzip those plugin jars into a common temp 
> directory before loading.  Another option is using something like Commons VFS 
> to interact with the jar files.  VFS essentially uses a disk-based temporary 
> cache for jar files, so it is pretty much the same solution.  What are 
> everyone else's thoughts on this?




[jira] Updated: (NUTCH-251) Administration GUI

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-251:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9 (comment from me: would be nice to 
get this into 1.2)

> Administration GUI
> --
>
> Key: NUTCH-251
> URL: https://issues.apache.org/jira/browse/NUTCH-251
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Stefan Groschupf
>Priority: Minor
> Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, 
> nutch_gui_plugins_v1.zip, nutch_gui_v1.patch
>
>
> Having a web based administration interface would help to make nutch 
> administration and management much more user friendly.




[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-477:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Extend URLFilters to support different filtering chains
> ---
>
> Key: NUTCH-477
> URL: https://issues.apache.org/jira/browse/NUTCH-477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.1
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Attachments: urlfilters.patch
>
>
> I propose to make the following changes to URLFilters:
> * extend URLFilters so that they support different filtering rules depending 
> on the context where they are executed. This functionality mirrors the one 
> that URLNormalizers already support.
> * change their return value to an int code, in order to support early 
> termination of long filtering chains.
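
The second proposal, an int return code that allows early termination, can be sketched as below. The constants, the class `FilterChainSketch`, and the rule that an undecided url is accepted are all assumptions for illustration; they are not taken from urlfilters.patch or the existing URLFilters API.

```java
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical sketch of a filter chain driven by int return codes. */
final class FilterChainSketch {
  static final int CONTINUE = 0;  // no decision, ask the next filter
  static final int ACCEPT = 1;    // accept the url and stop the chain
  static final int REJECT = 2;    // reject the url and stop the chain

  /** Runs the filters in order and stops at the first ACCEPT or REJECT. */
  static boolean accepts(String url, List<ToIntFunction<String>> filters) {
    for (ToIntFunction<String> filter : filters) {
      int code = filter.applyAsInt(url);
      if (code == ACCEPT) {
        return true;
      }
      if (code == REJECT) {
        return false;
      }
    }
    return true;  // assumption: a url that no filter decided on is accepted
  }
}
```

Compared with a chain where every filter returns a url or null, the explicit codes let a cheap filter short-circuit the chain in either direction before any expensive filter runs.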




[jira] Updated: (NUTCH-564) External parser supports encoding attribute

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-564:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> External parser supports encoding attribute
> ---
>
> Key: NUTCH-564
> URL: https://issues.apache.org/jira/browse/NUTCH-564
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 0.9.0
> Environment: All
>Reporter: Antony Bowesman
>Priority: Minor
> Attachments: ExtParser_0.9.0.patch, ExtParser_1.0.0.patch
>
>
> When an external component generates text and returns it to the external 
> parser, the parser always converts the text using the default character set 
> (os.toString()).  For example, the returned text may be UTF-8, but it will 
> not be converted to a String correctly.
> I added the attribute  to the  XML in plugin.xml 
> and this is then used to convert the text.
> I have tested my original fix on my local 0.9 and include a patch, but have 
> also made an untested patch for trunk.
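
The difference between decoding with the platform default and decoding with a configured charset can be sketched as follows. This is not the attached patch; the class `ExternalOutputDecoding`, the `decode` method, and the UTF-8 fallback when no encoding is configured are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: decode captured external-command output with an explicit charset. */
final class ExternalOutputDecoding {
  /**
   * Decodes the captured bytes with the configured encoding.
   * Falls back to UTF-8 when no encoding is configured (an assumption of this sketch).
   * Charset.forName throws an unchecked exception for unknown charset names.
   */
  static String decode(ByteArrayOutputStream os, String encoding) {
    Charset charset = (encoding == null || encoding.isEmpty())
        ? StandardCharsets.UTF_8
        : Charset.forName(encoding);
    return new String(os.toByteArray(), charset);
  }
}
```

Unlike `os.toString()`, the result no longer depends on the default character set of the machine running the parser.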




[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-750:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> HtmlParser plugin - page title extraction
> -
>
> Key: NUTCH-750
> URL: https://issues.apache.org/jira/browse/NUTCH-750
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0.0
>Reporter: Alexey Torochkov
>Priority: Minor
> Attachments: SkipBody.patch
>
>
> A small improvement that tries to extract the <title> tag from the body if 
> it doesn't exist in the head.
> In the current version, DOMContentUtils just skips everything after <body> 
> in the getTitle() method.
> The attached patch allows changing this behavior (by default it doesn't 
> change anything) and can cope with webmasters' mistakes.




[jira] Updated: (NUTCH-664) Possibility to update already stored documents.

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-664:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Possibility to update already stored documents.
> ---
>
> Key: NUTCH-664
> URL: https://issues.apache.org/jira/browse/NUTCH-664
> Project: Nutch
>  Issue Type: Wish
>Reporter: Sergey Khilkov
>Priority: Minor
>
> We have a huge index of stored documents. It is a costly procedure to fetch 
> pages and merge indexes every time we update some information about a page. 
> The information can change 1-3 times per day. At the moment we have to store 
> the changed info in a database, but in this case we have lots of problems 
> with sorting, search restrictions and so on. Lucene itself allows deleting a 
> single document and adding a new one to an existing index. But there is a 
> problem with Hadoop... As I understand it, the Hadoop filesystem cannot write 
> at random positions. But it would be a great feature if Nutch were able to 
> update an existing index.




[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-673:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Upgrade the Carrot2 plug-in to release 3.0
> --
>
> Key: NUTCH-673
> URL: https://issues.apache.org/jira/browse/NUTCH-673
> Project: Nutch
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: 0.9.0
> Environment: All Nutch deployments.
>Reporter: Sean Dean
>Priority: Minor
>
> Release 3.0 of the Carrot2 plug-in was released recently.
> We currently have version 2.1 in the source tree and upgrading it to the 
> latest version before 1.0-release might make sense.
> Details on the release can be found here: 
> http://project.carrot2.org/release-3.0-notes.html
> One major change in requirements is for JDK 1.5 to be used, but this is also 
> now required for Hadoop 0.19, so this wouldn't be the only reason for the 
> switch.




[jira] Updated: (NUTCH-310) Review Log Levels

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-310:


Fix Version/s: (was: 1.1)
 Assignee: Chris A. Mattmann  (was: Jerome Charron)

- pushing this out per http://bit.ly/c7tBv9 (and assign to me, I think this can 
be closed but will wait until after 1.1 to revisit)

> Review Log Levels
> -
>
> Key: NUTCH-310
> URL: https://issues.apache.org/jira/browse/NUTCH-310
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Jerome Charron
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> Review of log content and log levels (see Commons Logging Best Practices: 
> http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)




[jira] Updated: (NUTCH-577) Use explicit tika-config.xml file to enable mime magic detection to be turned on and off

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-577:


 Due Date: 30/Nov/07  (was: 30/Nov/07)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Use explicit tika-config.xml file to enable mime magic detection to be turned 
> on and off
> 
>
> Key: NUTCH-577
> URL: https://issues.apache.org/jira/browse/NUTCH-577
> Project: Nutch
>  Issue Type: Improvement
>  Components: mime_type_detector
>Affects Versions: 1.0.0
> Environment: MacBook Pro Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS 
> X 10.4, although improvement is indep. of env.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> Currently, there is a configuration file for Tika (which the trunk in Nutch 
> uses for its mime type detection) called "tika-config.xml" left unexposed (a 
> default one lives in the tika-0.1-dev.jar file). Tika's mime system has two 
> config files it relies on: tika-mimetypes.xml (which Nutch has its own 
> version of, that overrides the version that comes with the tika jar file), 
> and tika-config.xml (to turn on or off magic char detection). We should 
> probably have a nutch version of tika-config.xml, so that Nutch users can 
> employ magic char mime detection. I'll get going on this in the next day or 
> so.




[jira] Updated: (NUTCH-763) Separate configuration files from resources to be included in the job file

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-763:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Separate configuration files from resources to be included in the job file
> --
>
> Key: NUTCH-763
> URL: https://issues.apache.org/jira/browse/NUTCH-763
> Project: Nutch
>  Issue Type: Wish
>Reporter: Julien Nioche
>Priority: Minor
>
> One of the things I found confusing when I was learning Nutch was the fact 
> that the conf/ directory contains at the same time : 
> - configuration files for Hadoop / Nutch which are put in the jar files but 
> not used there
> - resource files (e.g. filtering rules) which MUST be up to date in the job 
> file
> I would separate the conf/ directory from say a resources/ directory which 
> would contain the rule files and other things to put in the job file. Unless 
> I am mistaken none of the configuration files need to be in the job file. I 
> know it is a very minor point, but that would probably simplify things and 
> make it easier for beginners to understand what has to be modified where. 




[jira] Updated: (NUTCH-309) Uses commons logging Code Guards

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-309:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Uses commons logging Code Guards
> 
>
> Key: NUTCH-309
> URL: https://issues.apache.org/jira/browse/NUTCH-309
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Jerome Charron
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> "Code guards are typically used to guard code that only needs to execute in 
> support of logging, that otherwise introduces undesirable runtime overhead in 
> the general case (logging disabled). Examples are multiple parameters, or 
> expressions (e.g. string + " more") for parameters. Use the guard methods of 
> the form log.is<Priority>() to verify that logging should be performed, 
> before incurring the overhead of the logging method call. Yes, the logging 
> methods will perform the same check, but only after resolving parameters."
> (description extracted from 
> http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
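
A short sketch of the code-guard pattern quoted above. To stay self-contained it uses java.util.logging, where the guard is `isLoggable(Level)`; with Commons Logging the equivalent guard would be a method such as `log.isDebugEnabled()`. The class `GuardedLogging` and the counter are hypothetical and exist only to show that the message is not built when logging is disabled.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of a logging code guard, using java.util.logging as a stand-in. */
final class GuardedLogging {
  private static final Logger LOG = Logger.getLogger(GuardedLogging.class.getName());

  /** Counts how often the (comparatively expensive) message is actually built. */
  static int messagesBuilt = 0;

  /** Builds the log message; the string concatenation is the cost the guard avoids. */
  static String expensiveMessage(String url, int outlinks) {
    messagesBuilt++;
    return "parsed " + url + " with " + outlinks + " outlinks";
  }

  /** Logs at FINE level, building the message only when FINE is enabled. */
  static void logParsed(String url, int outlinks) {
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine(expensiveMessage(url, outlinks));
    }
  }
}
```

Without the guard, `expensiveMessage` would run on every call even when the FINE level is switched off and the result is discarded.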




[jira] Updated: (NUTCH-249) black- white list url filtering

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-249:


Fix Version/s: (was: 1.1)

- push out per http://bit.ly/c7tBv9

> black- white list url filtering
> ---
>
> Key: NUTCH-249
> URL: https://issues.apache.org/jira/browse/NUTCH-249
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Stefan Groschupf
>Assignee: Dennis Kubes
>Priority: Trivial
> Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch
>
>
> Existing url filter mechanisms need to process each url against each filter 
> pattern. For very large filter sets this may not scale very well.
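
One way to avoid matching every url against every pattern is a hash lookup on the host, sketched below. This is an assumption about how such a filter could work, not a description of the attached patches; the class `HostListFilter` and its method are hypothetical.

```java
import java.net.URI;
import java.util.Set;

/** Hypothetical sketch: black/white list filtering by host with set lookups. */
final class HostListFilter {
  /**
   * Accepts a url when its host is whitelisted and not blacklisted.
   * URI.create throws IllegalArgumentException for syntactically invalid urls.
   */
  static boolean accept(String url, Set<String> whiteList, Set<String> blackList) {
    String host = URI.create(url).getHost();
    if (host == null) {
      return false;  // no host component, nothing to look up
    }
    return !blackList.contains(host) && whiteList.contains(host);
  }
}
```

With hash-based sets, the cost per url is a constant number of lookups regardless of how many hosts are listed, whereas a pattern list is scanned entry by entry.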




[jira] Commented: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843576#action_12843576
 ] 

Chris A. Mattmann commented on NUTCH-801:
-

+1 on this from me, Julien. Sounds good.

> Remove RTF and MP3 parse plugins
> 
>
> Key: NUTCH-801
> URL: https://issues.apache.org/jira/browse/NUTCH-801
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
> Fix For: 1.1
>
>
> *Parse-rtf* and *parse-mp3* are not built by default  due to licensing 
> issues. Since we now have *parse-tika* to handle these formats I would be in 
> favour of removing these 2 plugins altogether to keep things nice and simple. 
> The other plugins will probably be phased out only after the release of 1.1  
> when parse-tika will have been tested a lot more.
> Any reasons not to?
> Julien




[jira] Commented: (NUTCH-790) Some external javadoc links are broken

2010-02-14 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833591#action_12833591
 ] 

Chris A. Mattmann commented on NUTCH-790:
-

+1 to commit this. Thanks, Sami!

> Some external javadoc links are broken
> --
>
> Key: NUTCH-790
> URL: https://issues.apache.org/jira/browse/NUTCH-790
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Sami Siren
>Assignee: Sami Siren
>Priority: Trivial
> Attachments: NUTCH-790.patch
>
>
> Nutch javadoc links for lucene and hadoop are broken.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832866#action_12832866
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

- forgot to add in dep libs, added in r909269. Thanks!

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates parsing to Tika but can still coexist with the 
> existing parsing plugins, which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress; your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers); in the work described here we decided 
> to put the libs in 2 different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its MimeType functionality, we only 
> need to put tika-core at the main lib level, whereas the tika plugin 
> needs tika-parsers.jar plus all the jars used internally by Tika.
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling. It also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore, as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and making the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser: 
> (library list not preserved in this archive)
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com
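
A minimal sketch of the SAX-to-DOM step described above, using only JDK classes. It is not the code from the attached TikaParser.java: the class name SaxToDomSketch and the toDom helper are hypothetical, and a plain JAXP SAX parser stands in for a Tika parser as the source of events. An identity TransformerHandler receives the SAX events and appends the resulting nodes to an empty DOM Document, after which DOM-based utilities such as link detection can walk the tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDomSketch {

    /** Replays SAX events into a DOM tree (hypothetical helper, JDK only). */
    public static Document toDom(String xhtml) throws Exception {
        // Empty DOM document that will receive the nodes.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // An identity TransformerHandler is a SAX ContentHandler that
        // writes every event it receives into the Result it was given.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new DOMResult(doc));

        // Assumption: a plain SAX parser stands in for the Tika parser,
        // which would emit XHTML SAX events to the same kind of handler.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));

        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom(
                "<html><body><a href=\"/x\">x</a><a href=\"/y\">y</a></body></html>");
        // Link detection on the DOM side reduces to walking the <a> elements.
        System.out.println("links found: "
                + doc.getElementsByTagName("a").getLength());
    }
}
```

Once the events are in a DOM, the same tree can be handed to any filter that expects a DOM node, which is the property the description relies on for reusing the HTMLParseFilters.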

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-789) Improvements to Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-789:


Attachment: NutchTikaConfig.java
TikaParser.java

- updates contributed by Sami. I'll generate a diff and then re-attach.

> Improvements to Tika parser
> ---
>
> Key: NUTCH-789
> URL: https://issues.apache.org/jira/browse/NUTCH-789
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
> Environment: reported by Sami, in NUTCH-766
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.1
>
> Attachments: NutchTikaConfig.java, TikaParser.java
>
>
> As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
> Tika parser. We'll track that progress here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-789) Improvements to Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)
Improvements to Tika parser
---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1


As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
Tika parser. We'll track that progress here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-766) Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-766.
-

Resolution: Fixed

- committed in r909268. Added comments to nutch-default.xml near the 
plugin.includes block on enabling parse-tika. Sami, I'll create a new issue now to 
track your proposed updates to the Tika parser. I ran unit tests with the patch 
I committed, and they all passed.

Thanks, Julien!

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832588#action_12832588
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

@Julien:

Sigh, no I didn't! :(

That's probably why! Thanks for the help. I'll try it later today. If that 
passes, my +1 to commit. 

@Sami, regarding your updates, would you be OK with me creating another issue 
to track them, attaching your diffs as patches against this issue, once 
committed to the trunk? That way we'll make sure they get into 1.1, but we 
won't block this issue anymore from getting in. Let me know what you think, 
thanks.

Cheers,
Chris


> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832565#action_12832565
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

Hi Julien:

{quote}
@Chris : I just did a fresh co from svn, applied the patch v3 and unzipped 
sample.tar.gz onto the directory parse-tika and ran the test just as you did 
but could not reproduce the problem. Could there be a difference between your 
version and the trunk? 
{quote}

I tried this process last night:

1. SVN up to r908832
2. download patch v3
3. download sample.tar.gz
4. apply patch v3 to r908832
5. untar sample.tar.gz into src/plugin/parse-tika, creating a sample folder in 
that dir
6. ant clean compile-core test

Any idea why I'm seeing the error?

Cheers,
Chris


> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832398#action_12832398
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

I'm going to hold off on committing this tonight. I've updated the docs per 
Andrzej, and I've also updated CHANGES.txt, but when running:

{code}
ant clean compile-core test
{code}

I'm seeing these messages during plugin testing for parse-tika:

{noformat}
2010-02-10 22:39:16,593 ERROR tika.TikaParser (TikaParser.java:getParse(63)) - 
Can't retrieve Tika parser for mime-type application/pdf
-  ---

Testcase: testIt took 2.684 sec
FAILED
null
junit.framework.AssertionFailedError
at org.apache.nutch.tika.TestPdfParser.testIt(TestPdfParser.java:79)
{noformat}

It seems that the TikaConfig is not being found? I was looking at 
TikaParser#setConf and it seems that a default config is being created for 
Tika, but maybe not being loaded correctly? I need to look into this more...

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832255#action_12832255
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

{quote}
+1 to commit this...
{quote}

Awesome, Andrzej. Will do so tonight, PST, if I don't hear any objections 
between now and then...

Thanks!

Cheers,
Chris


> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-01-25 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804546#action_12804546
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

Hi Sami:

{quote}
Chris, can you please explain me how keeping two components doing identical 
work would be more backwards compatible than having only 1?
{quote}

Sure, it's more of a configuration backwards-compat issue. Some folks have gone 
to the trouble of customizing their Nutch configuration (nutch-site.xml, 
nutch-default.xml, or even parse-plugins.xml). If we remove the parsing plugins 
(i.e., basically say they don't exist anymore and that you must update your 
deployed configuration to use the tika-plugin), this patch would require a 
configuration update in their deployed environments. Because of that, why don't 
we ease them into that upgrade with at least one released version before the 
plugins go away? It would make it easier from a configuration backwards-compat 
perspective.

HTH,
Chris


> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (NUTCH-766) Tika parser

2010-01-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803709#action_12803709
 ] 

Chris A. Mattmann edited comment on NUTCH-766 at 1/22/10 2:38 PM:
--

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does 
not support such or such format as well as the original Nutch plugins. As I 
explained above we can configure which parser to use for which mimetype and use 
the Tika-plugin by default. Hopefully the Tika implementation will get better 
and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have 
convinced me of the need for backwards compat and unobtrusiveness when bringing 
in new functionality or results. +1 to at least in Nutch 1.1 leaving the old 
plugins (perhaps mentioning they should be deprecated and replaced by the Tika 
functionality) and then removing them in 1.2 or 1.3.

I got bogged down with my paid job, but I found some Apache time recently so 
this is tops on my list to tackle.

Cheers,
Chris



  was (Author: chrismattmann):
{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does 
not support such or such format as well as the original Nutch plugins. As I 
explained above we can configure which parser to use for which mimetype and use 
the Tika-plugin by default. Hopefully the Tika implementation will get better 
and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have 
convinced me of the need for backwards compat and unobtrusiveness when bringing 
in new functionality or results. +1 to at least in Nutch 1.1 leaving the old 
plugins (perhaps mentioning they should be deprecated and replace by the Tika 
functionality) and then removing them in 1.2 or 1.3.

I got bogged down with my paid job, but I found some Apache time recently so 
this is tops on my list to tackle.

Cheers,
Chris


  
> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only 
> need to put tika-core at the main lib level whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser i.e. link detection,  metatag handling but also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required i

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803709#action_12803709
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does 
not support such or such format as well as the original Nutch plugins. As I 
explained above we can configure which parser to use for which mimetype and use 
the Tika-plugin by default. Hopefully the Tika implementation will get better 
and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have 
convinced me of the need for backwards compat and unobtrusiveness when bringing 
in new functionality or results. +1 to at least in Nutch 1.1 leaving the old 
plugins (perhaps mentioning they should be deprecated and replaced by the Tika 
functionality) and then removing them in 1.2 or 1.3.

I got bogged down with my paid job, but I found some Apache time recently so 
this is tops on my list to tackle.

Cheers,
Chris



> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only 
> need to put tika-core at the main lib level whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser i.e. link detection,  metatag handling but also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser:
>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com
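
The parser selection rule described in the issue (explicit mime-type associations first, the "*" parser as a candidate for everything) might be sketched as follows. This is an assumption-laden illustration, not the modified ParserFactory.java: the class ParserSelectionSketch, its methods, and the parser ids used in the example are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not the real Nutch ParserFactory.
class ParserSelectionSketch {
    static final String WILDCARD = "*";

    // mime type -> parser ids in preference order
    private final Map<String, List<String>> mappings = new LinkedHashMap<>();

    /** Registers a parser id for a mime type, or for every type when mimeType is "*". */
    void map(String mimeType, String parserId) {
        mappings.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(parserId);
    }

    /** Explicit associations come first; wildcard parsers are appended as fallbacks. */
    List<String> candidatesFor(String mimeType) {
        List<String> out = new ArrayList<>(
            mappings.getOrDefault(mimeType, Collections.<String>emptyList()));
        for (String id : mappings.getOrDefault(WILDCARD, Collections.<String>emptyList())) {
            if (!out.contains(id)) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ParserSelectionSketch selection = new ParserSelectionSketch();
        selection.map(WILDCARD, "tika-parser");                 // handles any mime type
        selection.map("application/pdf", "legacy-pdf-parser");  // hypothetical override
        System.out.println(selection.candidatesFor("application/pdf"));
        System.out.println(selection.candidatesFor("text/html"));
    }
}
```

Under this scheme an explicit association is only needed where a mime type should go to a parser other than Tika, which matches the behaviour the description attributes to parse-plugins.xml.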

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-01-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798718#action_12798718
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

Hi Julien:

I have had a look and was trying to test it out but got sidetracked. Give me 
this week to try and put together a final reviewable/commitable patch, 
otherwise, it's all yours.

Cheers,
Chris


> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only 
> need to put tika-core at the main lib level whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser i.e. link detection,  metatag handling but also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser:
>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-766) Tika parser

2009-12-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-766:


Fix Version/s: 1.1

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only 
> need to put tika-core at the main lib level whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser i.e. link detection,  metatag handling but also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser:
>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-777) Upgrading to jetty6 broke unit tests

2009-12-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-777.
-

Resolution: Fixed

- fixed in r892350

> Upgrading to jetty6 broke unit tests
> 
>
> Key: NUTCH-777
> URL: https://issues.apache.org/jira/browse/NUTCH-777
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: My MacBook pro, JDK 1.6.0.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> It seems that somewhere down the line, there was an upgrade to jetty6, which 
> broke unit tests, specifically TestFetcher and CrawlDBTestUtil. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-777) Upgrading to jetty6 broke unit tests

2009-12-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792579#action_12792579
 ] 

Chris A. Mattmann commented on NUTCH-777:
-

Okay with the changes I'm about to commit, we have:

{noformat}
copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-regex
[junit] Running 
org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.286 sec
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 6.802 sec

test:

BUILD SUCCESSFUL
Total time: 5 minutes 52 seconds
[chipotle:~/src/nutch] mattmann% 
{noformat}

Yay!

> Upgrading to jetty6 broke unit tests
> 
>
> Key: NUTCH-777
> URL: https://issues.apache.org/jira/browse/NUTCH-777
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: My MacBook pro, JDK 1.6.0.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> It seems that somewhere down the line, there was an upgrade to jetty6, which 
> broke unit tests, specifically TestFetcher and CrawlDBTestUtil. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-777) Upgrading to jetty6 broke unit tests

2009-12-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792566#action_12792566
 ] 

Chris A. Mattmann commented on NUTCH-777:
-

I found this page, which shows the mapping from Jetty5 (which the Nutch test 
code used to depend on), to Jetty6:

http://docs.codehaus.org/display/JETTY/Porting+to+jetty6

> Upgrading to jetty6 broke unit tests
> 
>
> Key: NUTCH-777
> URL: https://issues.apache.org/jira/browse/NUTCH-777
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: My MacBook pro, JDK 1.6.0.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> It seems that somewhere down the line, there was an upgrade to jetty6, which 
> broke unit tests, specifically TestFetcher and CrawlDBTestUtil. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-777) Upgrading to jetty6 broke unit tests

2009-12-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792565#action_12792565
 ] 

Chris A. Mattmann commented on NUTCH-777:
-

Here is what I was getting with the latest Nutch trunk:

{noformat}
compile:

job:
  [jar] Building jar: /Users/mattmann/src/nutch/build/nutch-1.0.job

compile-core-test:
[javac] Compiling 43 source files to 
/Users/mattmann/src/nutch/build/test/classes
[javac] 
/Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:33:
 package org.mortbay.http does not exist
[javac] import org.mortbay.http.HttpContext;
[javac]^
[javac] 
/Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:34:
 package org.mortbay.http does not exist
[javac] import org.mortbay.http.SocketListener;
[javac]^
[javac] 
/Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:35:
 package org.mortbay.http.handler does not exist
[javac] import org.mortbay.http.handler.ResourceHandler;
[javac]^
[javac] 
/Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:134:
 cannot find symbol
[javac] symbol  : class SocketListener
[javac] location: class org.apache.nutch.crawl.CrawlDBTestUtil
[javac] SocketListener listener = new SocketListener();
[javac] ^
[javac] 
/Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:134:
 cannot find symbol
[javac] symbol  : class SocketListener
[javac] location: class org.apache.nutch.crawl.CrawlDBTestUtil
[javac] SocketListener listener = new SocketListener();
[javac]   ^
[javac] 
/Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:138:
 cannot find symbol
[javac] symbol  : class HttpContext
[javac] location: class org.apache.nutch.crawl.CrawlDBTestUtil
[javac] HttpContext staticContext = new HttpContext();
[javac] ^
..snip...
[javac] 
/Users/mattmann/src/nutch/src/test/org/apache/nutch/fetcher/TestFetcher.java:167:
 cannot find symbol
[javac] symbol  : method getListeners()
[javac] location: class org.mortbay.jetty.Server
[javac] urls.add("http://127.0.0.1:" + 
server.getListeners()[0].getPort() + "/" + page);
[javac]  ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 9 errors

BUILD FAILED
/Users/mattmann/src/nutch/build.xml:229: Compile failed; see the compiler error 
output for details.

Total time: 37 seconds
{noformat}

> Upgrading to jetty6 broke unit tests
> 
>
> Key: NUTCH-777
> URL: https://issues.apache.org/jira/browse/NUTCH-777
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: My MacBook pro, JDK 1.6.0.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> It seems that somewhere down the line, there was an upgrade to jetty6, which 
> broke unit tests, specifically TestFetcher and CrawlDBTestUtil. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (NUTCH-777) Upgrading to jetty6 broke unit tests

2009-12-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-777 started by Chris A. Mattmann.

> Upgrading to jetty6 broke unit tests
> 
>
> Key: NUTCH-777
> URL: https://issues.apache.org/jira/browse/NUTCH-777
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: My MacBook pro, JDK 1.6.0.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> It seems that somewhere down the line, there was an upgrade to jetty6, which 
> broke unit tests, specifically TestFetcher and CrawlDBTestUtil. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-777) Upgrading to jetty6 broke unit tests

2009-12-18 Thread Chris A. Mattmann (JIRA)
Upgrading to jetty6 broke unit tests


 Key: NUTCH-777
 URL: https://issues.apache.org/jira/browse/NUTCH-777
 Project: Nutch
  Issue Type: Bug
  Components: build
 Environment: My MacBook pro, JDK 1.6.0.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.1


It seems that somewhere down the line, there was an upgrade to jetty6, which 
broke unit tests, specifically TestFetcher and CrawlDBTestUtil. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (NUTCH-766) Tika parser

2009-12-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-766 started by Chris A. Mattmann.

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only 
> need to put tika-core at the main lib level whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser i.e. link detection,  metatag handling but also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser:
>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-766) Tika parser

2009-12-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-766:
---

Assignee: Chris A. Mattmann

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only 
> need to put tika-core at the main lib level whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser i.e. link detection,  metatag handling but also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser:
>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com




[jira] Resolved: (NUTCH-185) XMLParser is configurable xml parser plugin.

2009-11-25 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-185.
-

   Resolution: Won't Fix
Fix Version/s: 1.1

See comments related to NUTCH-767 in this issue's comments section. Once we 
address NUTCH-767, we get this functionality for free...

> XMLParser is configurable xml parser plugin.
> 
>
> Key: NUTCH-185
> URL: https://issues.apache.org/jira/browse/NUTCH-185
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, indexer
>Affects Versions: 0.7.2, 0.8, 0.8.1
> Environment: OS Independent
>Reporter: Rida Benjelloun
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip
>
>
> The XML parser is a configurable plugin. It uses XPath and namespaces to do the 
> mapping between the XML elements and Lucene fields. 
> Information:
> 1- Copy "xmlparser-conf.xml" to the nutch/conf dir.
> 2- To index your custom XML file, you have to modify 
> "xmlparser-conf.xml". 
> This parser uses namespaces and XPath to parse XML content.
> The config file does the mapping between the XML nodes (using XPath) and Lucene 
> fields. 
> Example :  
> 3- The xmlIndexerProperties encapsulate a set of fields associated with a 
> namespace. 
> If the namespace is found in the XML document, the fields represented by the 
> namespace will be indexed.
> Example : 
> http://purl.org/dc/elements/1.1/";>
>
>
> 
> 4- It is possible to define a default namespace that will be applied when the 
> parser 
> does not find any namespace in the document or when the namespace found in the 
> XML document doesn't match the namespace defined in the 
> xmlIndexerProperties. 
> Example :
> 
>
> 
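
The mapping idea described above (XPath expressions bound to a namespace, each producing one index field) can be illustrated with the JDK's javax.xml.xpath API. This is a sketch under stated assumptions and not the plugin's code: the class XPathFieldMapper, the field names, and the sample document are made up, and only the Dublin Core namespace URI comes from the description.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: evaluates one XPath expression per field name. */
final class XPathFieldMapper {

    static Map<String, String> extract(String xml, final String prefix,
                                       final String namespaceUri,
                                       Map<String, String> fieldToXPath) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? namespaceUri : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return namespaceUri.equals(uri) ? prefix : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return namespaceUri.equals(uri)
                            ? Collections.singletonList(prefix).iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : fieldToXPath.entrySet()) {
                String value = xpath.evaluate(e.getValue(), doc);
                if (!value.isEmpty()) { // skip fields whose nodes are absent
                    fields.put(e.getKey(), value);
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("XPath field extraction failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<record xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>Nutch</dc:title></record>";
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("title", "//dc:title");
        mapping.put("creator", "//dc:creator");
        System.out.println(extract(xml, "dc", "http://purl.org/dc/elements/1.1/", mapping));
    }
}
```

Fields whose expression matches nothing are left out, which mirrors the statement that fields are indexed only when their namespace is present in the document.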




[jira] Commented: (NUTCH-767) Update version of Tika for the MimeType detection

2009-11-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779476#action_12779476
 ] 

Chris A. Mattmann commented on NUTCH-767:
-

Hi Julien,

Thanks for pushing this forward. I'll take a look at this patch...

Cheers,
Chris


> Update version of Tika for the MimeType detection
> -
>
> Key: NUTCH-767
> URL: https://issues.apache.org/jira/browse/NUTCH-767
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Attachments: NUTCH-767.patch
>
>
> The latest version of Tika requires a few changes to the MimeType 
> implementation. Tika is now split into several jars, so we need to place 
> tika-core.jar in the main nutch lib.




[jira] Assigned: (NUTCH-767) Update version of Tika for the MimeType detection

2009-11-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-767:
---

Assignee: Chris A. Mattmann

> Update version of Tika for the MimeType detection
> -
>
> Key: NUTCH-767
> URL: https://issues.apache.org/jira/browse/NUTCH-767
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Attachments: NUTCH-767.patch
>
>
> The latest version of Tika requires a few changes to the MimeType 
> implementation. Tika is now split into several jars, so we need to place 
> tika-core.jar in the main nutch lib.




[jira] Assigned: (NUTCH-714) Need a SFTP and SCP Protocol Handler

2009-03-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-714:
---

Assignee: Chris A. Mattmann

> Need a SFTP and SCP Protocol Handler
> 
>
> Key: NUTCH-714
> URL: https://issues.apache.org/jira/browse/NUTCH-714
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Sanjoy Ghosh
>Assignee: Chris A. Mattmann
> Fix For: 0.8.2
>
>
> An SFTP and SCP Protocol handler is needed to fetch intranet content on an 
> SFTP or SCP server.




[jira] Commented: (NUTCH-714) Need a SFTP and SCP Protocol Handler

2009-03-09 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680348#action_12680348
 ] 

Chris A. Mattmann commented on NUTCH-714:
-

Hi Sanjoy,

When you get a patch, let me know and I will work to integrate it. For 
reference, you were intending this as an upgrade for 0.8.2? I think we should 
probably do this as a post 1.0 upgrade (maybe 1.1)?

Cheers,
Chris


> Need a SFTP and SCP Protocol Handler
> 
>
> Key: NUTCH-714
> URL: https://issues.apache.org/jira/browse/NUTCH-714
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Sanjoy Ghosh
>Assignee: Chris A. Mattmann
> Fix For: 0.8.2
>
>
> An SFTP and SCP Protocol handler is needed to fetch intranet content on an 
> SFTP or SCP server.




[jira] Commented: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException

2009-02-17 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674219#action_12674219
 ] 

Chris A. Mattmann commented on NUTCH-631:
-

Sami, +1. Sorry I didn't have time to get to this. 

Thanks for whipping it up.

> MoreIndexingFilter fails with NoSuchElementException
> 
>
> Key: NUTCH-631
> URL: https://issues.apache.org/jira/browse/NUTCH-631
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.0.0
> Environment: Verified on CentOS and OSX
>Reporter: Stefan Will
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: NUTCH-631.patch
>
>
> I did a simple crawl and started the indexer with the index-more plugin 
> activated. The index job fails with the following stack trace in the task log:
> java.util.NoSuchElementException
> at java.util.TreeMap.key(TreeMap.java:433)
> at java.util.TreeMap.firstKey(TreeMap.java:287)
> at java.util.TreeSet.first(TreeSet.java:407)
> at 
> java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
> at 
> org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
> at 
> org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
> at 
> org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
> I traced this down to the part in MoreIndexingFilter where the mime type is 
> split into primary type and subtype for indexing:
> contentType = mimeType.getName();
> String primaryType = mimeType.getSuperType().getName();
> String subType = mimeType.getSubTypes().first().getName();
> Apparently Tika does not have a subtype for text/html. Furthermore, the 
> supertype for text/html is set as application/octet-stream, which I doubt is 
> what we want indexed. Don't we want primaryType to be "text" and subType to 
> be "html" ?
> So I changed the code to:
> contentType = mimeType.getName();
> String[] split = contentType.split("/");
> String primaryType = split[0];
> String subType = (split.length>1)?split[1]:null;
> 
> This does what I think it should do, but perhaps I'm missing something ? 
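
The string-based split proposed in the report can be written as a small self-contained helper. This is a sketch in the spirit of the reporter's change, not the committed patch: the class MimeTypeSplitter is hypothetical, and stripping parameters such as "; charset=..." before splitting is an added assumption that the report does not mention.

```java
import java.util.TreeSet;

/** Hypothetical helper mirroring the split suggested in the report. */
final class MimeTypeSplitter {

    /** Returns {primaryType, subType}; subType is null when there is no "/". */
    static String[] split(String contentType) {
        // Assumption beyond the report: drop parameters like "; charset=UTF-8".
        int semi = contentType.indexOf(';');
        String bare = (semi >= 0 ? contentType.substring(0, semi) : contentType).trim();

        String[] parts = bare.split("/", 2);
        String primaryType = parts[0];
        String subType = parts.length > 1 ? parts[1] : null;
        return new String[] { primaryType, subType };
    }

    public static void main(String[] args) {
        String[] parts = split("text/html");
        System.out.println(parts[0] + " / " + parts[1]);

        // The original failure mode: first() on an empty sorted set throws.
        try {
            new TreeSet<String>().first();
        } catch (java.util.NoSuchElementException e) {
            System.out.println("empty set: NoSuchElementException");
        }
    }
}
```

Unlike a call to first() on a set of subtypes that may be empty, the split never throws for a type without a "/"; it simply leaves the subtype null.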




[jira] Work started: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException

2009-02-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-631 started by Chris A. Mattmann.

> MoreIndexingFilter fails with NoSuchElementException
> 
>
> Key: NUTCH-631
> URL: https://issues.apache.org/jira/browse/NUTCH-631
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.0.0
> Environment: Verified on CentOS and OSX
>Reporter: Stefan Will
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I did a simple crawl and started the indexer with the index-more plugin 
> activated. The index job fails with the following stack trace in the task log:
> java.util.NoSuchElementException
> at java.util.TreeMap.key(TreeMap.java:433)
> at java.util.TreeMap.firstKey(TreeMap.java:287)
> at java.util.TreeSet.first(TreeSet.java:407)
> at 
> java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
> at 
> org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
> at 
> org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
> at 
> org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
> I traced this down to the part in MoreIndexingFilter where the mime type is 
> split into primary type and subtype for indexing:
> contentType = mimeType.getName();
> String primaryType = mimeType.getSuperType().getName();
> String subType = mimeType.getSubTypes().first().getName();
> Apparently Tika does not have a subtype for text/html. Furthermore, the 
> supertype for text/html is set as application/octet-stream, which I doubt is 
> what we want indexed. Don't we want primaryType to be "text" and subType to 
> be "html" ?
> So I changed the code to:
> contentType = mimeType.getName();
> String[] split = contentType.split("/");
> String primaryType = split[0];
> String subType = (split.length>1)?split[1]:null;
> 
> This does what I think it should do, but perhaps I'm missing something ? 




[jira] Assigned: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException

2009-02-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-631:
---

Assignee: Chris A. Mattmann

> MoreIndexingFilter fails with NoSuchElementException
> 
>
> Key: NUTCH-631
> URL: https://issues.apache.org/jira/browse/NUTCH-631
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.0.0
> Environment: Verified on CentOS and OSX
>Reporter: Stefan Will
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I did a simple crawl and started the indexer with the index-more plugin 
> activated. The index job fails with the following stack trace in the task log:
> java.util.NoSuchElementException
> at java.util.TreeMap.key(TreeMap.java:433)
> at java.util.TreeMap.firstKey(TreeMap.java:287)
> at java.util.TreeSet.first(TreeSet.java:407)
> at 
> java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
> at 
> org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
> at 
> org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
> at 
> org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
> I traced this down to the part in MoreIndexingFilter where the mime type is 
> split into primary type and subtype for indexing:
> contentType = mimeType.getName();
> String primaryType = mimeType.getSuperType().getName();
> String subType = mimeType.getSubTypes().first().getName();
> Apparently Tika does not have a subtype for text/html. Furthermore, the 
> supertype for text/html is set as application/octet-stream, which I doubt is 
> what we want indexed. Don't we want primaryType to be "text" and subType to 
> be "html" ?
> So I changed the code to:
> contentType = mimeType.getName();
> String[] split = contentType.split("/");
> String primaryType = split[0];
> String subType = (split.length>1)?split[1]:null;
> 
> This does what I think it should do, but perhaps I'm missing something ? 




[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-621:


Affects Version/s: 0.7
   0.7.1
   0.7.2
   0.8
   0.8.1
   0.9.0
Fix Version/s: 1.0.0

> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.9.0
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: NUTCH-621.Mattmann.091008.step3.txt, 
> NUTCH-621.step1.Mattmann.090408.patch.txt, 
> NUTCH-621.step1.Mattmann.091008.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Resolved: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-621.
-

Resolution: Fixed

- resolved in r699866

> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.9.0
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Attachments: NUTCH-621.Mattmann.091008.step3.txt, 
> NUTCH-621.step1.Mattmann.090408.patch.txt, 
> NUTCH-621.step1.Mattmann.091008.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635241#action_12635241
 ] 

Chris A. Mattmann commented on NUTCH-621:
-

Folks,

Based on Jukka's comments, I've gone ahead and updated Nutch's README file and 
completed step 4/4 of the crypto usage for Nutch:

http://svn.apache.org/viewvc?rev=699866&view=rev

Nutch is now fully compliant with Apache crypto requirements!

Grant, if this is satisfactory, and you are +1, I will go ahead and close this 
issue. Thanks for everyone's help!

Cheers, 
Chris


> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Attachments: NUTCH-621.Mattmann.091008.step3.txt, 
> NUTCH-621.step1.Mattmann.090408.patch.txt, 
> NUTCH-621.step1.Mattmann.091008.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630445#action_12630445
 ] 

Chris A. Mattmann commented on NUTCH-621:
-

Grant:

Great, thanks. Okay, once you get the reply from the government (which hopefully 
we will see too, since they may CC nutch-dev@ on it), I will proceed 
with step 4:

http://www.apache.org/dev/crypto.html#inform

And update the appropriate Nutch README file here:

http://svn.apache.org/repos/asf/lucene/nutch/trunk/README.txt

with the crypto notice and then I think we're done!

Cheers, 
Chris


> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Attachments: NUTCH-621.Mattmann.091008.step3.txt, 
> NUTCH-621.step1.Mattmann.090408.patch.txt, 
> NUTCH-621.step1.Mattmann.091008.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-10 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-621:


Attachment: NUTCH-621.Mattmann.091008.step3.txt

Hey Doug:

I've attached a text file containing an email template that we need you to send 
as the PMC Chair for Lucene, regarding Nutch's crypto status. Could you send it 
ASAP to the TO: addresses in the attached txt file, using the attached 
email body, and then let me know when this has been completed?

At that point, I'll move onto step 4.

Thanks!

Cheers,
 Chris


> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Attachments: NUTCH-621.Mattmann.091008.step3.txt, 
> NUTCH-621.step1.Mattmann.090408.patch.txt, 
> NUTCH-621.step1.Mattmann.091008.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-10 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-621:


Attachment: NUTCH-621.step1.Mattmann.091008.patch.txt

Hey Grant:

Sorry about this, but I put Nutch in the wrong place on the original patch you 
committed (I put it under the Incubator project -- which is incorrect).

This new patch:

1. creates an entry for Apache Lucene as a top-level project with crypto 
Products
2. lists Nutch as one of those products, with 2 versions (dev, and releases 0.7 
and later)

Your commit mojo is appreciated! ^_^

Cheers,
 Chris


> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Attachments: NUTCH-621.step1.Mattmann.090408.patch.txt, 
> NUTCH-621.step1.Mattmann.091008.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-09 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629611#action_12629611
 ] 

Chris A. Mattmann commented on NUTCH-621:
-

Thanks, Grant!

I will begin step 3 in a few hours...

> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Attachments: NUTCH-621.step1.Mattmann.090408.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-04 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-621:


Attachment: NUTCH-621.step1.Mattmann.090408.patch.txt

Hi:

Okey dok, could someone with site-dev karma commit the attached patch to:

https://svn.apache.org/repos/asf/infrastructure/site/trunk/xdocs/licenses/exports/index.xml

as specified in step 2 of http://www.apache.org/dev/crypto.html ?

This will get us started. Once that's complete, I'll begin step 3, notifying 
the U.S. govt.

Thanks,
Chris

> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Attachments: NUTCH-621.step1.Mattmann.090408.patch.txt
>
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Work started: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-09-04 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-621 started by Chris A. Mattmann.

> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-06-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604409#action_12604409
 ] 

Chris A. Mattmann commented on NUTCH-621:
-

Hi Grant:

Thanks. The code does exist in nutch, in the parse-pdf plugin. It seems to be 
using PDFBox's decrypt functionality:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java?view=markup

Judging by your comment, it sounds like this makes Nutch have to declare its 
crypto usage. I will work to move Nutch towards this. Thanks for the 
clarification.

Cheers,
 Chris


> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-06-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603884#action_12603884
 ] 

Chris A. Mattmann commented on NUTCH-621:
-

Hi Grant:

Thanks for the poke on this. I was speaking with Jukka Zitting about this. Tika 
requires the crypto declaration because its parsing framework has transitive 
Maven dependencies on the BouncyCastle libraries. Nutch, on the other hand, 
is using Tika at this point for mime detection only, and Nutch achieves its 
usage of Tika (0.1-incubating) by CM'ing only the Apache Tika 0.1 jar, and not 
making use of any of its transitive dependencies (which are inherently Parsing 
specific, and not Mime Detection specific). In addition, there was a similar 
thread discussed here:

http://markmail.org/message/u7sjfzt7naknsv34

where the consensus was you don't need crypto notifications if you don't 
include any crypto libraries or use the related functionality in an included 
other library that has an optional dependency on a crypto library. So, I think 
that Nutch falls within that category. Would you agree?

Thanks for your help and guidance.

Cheers,
 Chris


> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-618) Tika error "Media type alias already exists"

2008-06-04 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-618.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

- patch applied to trunk:

http://svn.apache.org/viewvc?rev=663092&view=rev

> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
> Fix For: 1.0.0
>
> Attachments: NUTCH-618.Mattmann.patch.060108.2.txt, 
> NUTCH-618.Mattmann.patch.060108.txt
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.
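
Why loading the same definitions twice produces this warning can be shown with a toy registry. The code below is a hypothetical model of the duplicate-alias check only; it is not Tika's MimeTypes implementation, and the class AliasRegistrySketch and the canonical type paired with text/xml are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry that mimics a duplicate-alias check; not Tika's code. */
final class AliasRegistrySketch {
    private final Map<String, String> aliasToType = new HashMap<>();

    /** Registers an alias, refusing one that is already known. */
    void addAlias(String canonicalType, String alias) {
        if (aliasToType.containsKey(alias)) {
            throw new IllegalStateException("Media type alias already exists: " + alias);
        }
        aliasToType.put(alias, canonicalType);
    }

    /** Loads {canonicalType, alias} pairs and returns the warnings produced. */
    List<String> load(String[][] definitions) {
        List<String> warnings = new ArrayList<>();
        for (String[] def : definitions) {
            try {
                addAlias(def[0], def[1]);
            } catch (IllegalStateException e) {
                warnings.add("Invalid media type alias: " + def[1]);
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Illustrative pairing; the report only names the alias text/xml.
        String[][] definitions = { { "application/xml", "text/xml" } };
        AliasRegistrySketch registry = new AliasRegistrySketch();
        System.out.println(registry.load(definitions)); // first copy of the file
        System.out.println(registry.load(definitions)); // second copy of the file
    }
}
```

The first load registers every alias cleanly and the second load rejects each one, which is consistent with the proposal to keep only one copy of tika-mimetypes.xml.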




[jira] Assigned: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-06-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-621:
---

Assignee: Chris A. Mattmann

> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-06-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602100#action_12602100
 ] 

Chris A. Mattmann commented on NUTCH-621:
-

Grant,

Will do. Thanks.

Cheers, 
 Chris


> Nutch needs to declare it's crypto usage
> 
>
> Key: NUTCH-621
> URL: https://issues.apache.org/jira/browse/NUTCH-621
> Project: Nutch
>  Issue Type: Task
>Reporter: Grant Ingersoll
>Assignee: Chris A. Mattmann
>Priority: Blocker
>
> Per the ASF board direction outlined at 
> http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of 
> crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
> See TIKA-118.




[jira] Commented: (NUTCH-618) Tika error "Media type alias already exists"

2008-06-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601519#action_12601519
 ] 

Chris A. Mattmann commented on NUTCH-618:
-

Dennis Kubes tested this patch for me. According to Dennis, there were 2 
lingering log warnings that still came up:

1. For alias:



removing the ;exe removed one of the errors

2. removing the subclass from:






removes the second of the errors.

I am going to attach an updated patch that addresses these issues.

Thanks,
 Chris


> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
> Attachments: NUTCH-618.Mattmann.patch.060108.2.txt, 
> NUTCH-618.Mattmann.patch.060108.txt
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.
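
To illustrate the failure mode in the trace above, here is a minimal sketch of how registering an alias a second time can raise this kind of error. It assumes a hypothetical AliasRegistrySketch class; it is not Tika's actual MimeTypes implementation, and the type and alias values are only illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for a MIME alias registry; not a Tika class. */
final class AliasRegistrySketch {

    // Maps each alias to the canonical media type that owns it.
    private final Map<String, String> aliasToType = new HashMap<>();

    /** Registers alias for type, rejecting an alias that is already taken. */
    void addAlias(String type, String alias) {
        String existing = aliasToType.putIfAbsent(alias, type);
        if (existing != null) {
            throw new IllegalStateException(
                "Media type alias already exists: " + alias);
        }
    }

    public static void main(String[] args) {
        AliasRegistrySketch registry = new AliasRegistrySketch();
        // First declaration succeeds.
        registry.addAlias("application/xml", "text/xml");
        try {
            // A second declaration of the same alias is rejected, whether it
            // comes from a second copy of the file or a repeated entry.
            registry.addAlias("application/xml", "text/xml");
        } catch (IllegalStateException e) {
            System.out.println("WARN " + e.getMessage());
        }
    }
}
```

Under this assumption, any alias that is declared more than once, for whatever reason, would produce one warning per extra declaration.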




[jira] Updated: (NUTCH-618) Tika error "Media type alias already exists"

2008-06-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-618:


Attachment: NUTCH-618.Mattmann.patch.060108.2.txt

Updated patch that includes the updates to tika-mimetypes.xml identified by 
Dennis Kubes. Thanks, Dennis!

Dennis tested this on his testbed environment and it ran through great. So, I'd 
like to call for 24-48 hr review on the patch, and then if no objections, I'd 
like to commit it.

Thanks!

Cheers,
 Chris


> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
> Attachments: NUTCH-618.Mattmann.patch.060108.2.txt, 
> NUTCH-618.Mattmann.patch.060108.txt
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.




[jira] Work logged: (NUTCH-618) Tika error "Media type alias already exists"

2008-06-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#action_10651
 ]

Chris A. Mattmann logged work on NUTCH-618:
---

Author: Chris A. Mattmann
Created on: 01/Jun/08 06:23 PM
Start Date: 01/Jun/08 06:23 PM
Worklog Time Spent: 2h 
  Work Description: produced candidate patch for review

Issue Time Tracking
---

Time Spent: 2h
Remaining Estimate: 0h

> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
> Attachments: NUTCH-618.Mattmann.patch.060108.txt
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.




[jira] Updated: (NUTCH-618) Tika error "Media type alias already exists"

2008-06-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-618:


Attachment: NUTCH-618.Mattmann.patch.060108.txt

Hey Guys:

Okey dok: here's a candidate patch. Could someone who already has an 
environment set up in which these types of errors were manifesting please try 
this patch out and see if it makes them go away? I'm thinking that the root of 
the issue is not so much that the MimeTypes object was being re-instantiated 
many, many times as that it wasn't being cached in the ObjectCache. We'll see.

This attached patch passes all unit tests. So, please let me know what you 
think.

Thanks!

Cheers,
 Chris


> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
> Attachments: NUTCH-618.Mattmann.patch.060108.txt
>
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.




[jira] Commented: (NUTCH-618) Tika error "Media type alias already exists"

2008-05-29 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600909#action_12600909
 ] 

Chris A. Mattmann commented on NUTCH-618:
-

Hey Andrzej:

Sorry, I haven't made much progress on the issue. My time has dwindled a bit in 
the past few months. If someone else has time and wants to reassign the issue, 
please feel free. Otherwise, I just returned from vacation and will have some 
free time this weekend, so if it can wait until then, I can at least prepare a 
draft patch and submit it for review.

Cheers,
 Chris


> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.




[jira] Commented: (NUTCH-618) Tika error "Media type alias already exists"

2008-03-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576051#action_12576051
 ] 

Chris A. Mattmann commented on NUTCH-618:
-

Hey Andrzej:

bq. I noticed also another problem: o.a.n.u.MimeUtil doesn't use ObjectCache, 
so it instantiates MimeTypes over and over again. It should do this once for a 
given Configuration, and then use ObjectCache to store this object.

Yikes :/ Okay, I will get working on this right away. In addition, I will 
investigate the cause of the doubly loaded media types -- I'm not positive that 
it's due to the mime xml file being present inside the tika jar file too -- 
that's a default one, that we should have the capability to override (like 
we're doing in Nutch), if we need to.
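
The "build once per Configuration, then reuse" idea quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: Config and MimeRegistry are hypothetical stand-ins for the Configuration and MimeTypes objects, and the static map stands in for ObjectCache, whose real API is not shown in this thread.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; these are stand-ins, not Nutch or Tika classes. */
final class MimeCacheSketch {

    /** Hypothetical stand-in for a configuration object. */
    static final class Config { }

    /** Hypothetical stand-in for an expensive-to-build MIME type registry. */
    static final class MimeRegistry {
        // Counts constructions so the effect of caching can be observed.
        static final AtomicInteger BUILDS = new AtomicInteger();

        MimeRegistry() {
            BUILDS.incrementAndGet(); // pretend the MIME XML is parsed here
        }
    }

    // One registry per configuration; weak keys allow an entry to be
    // collected once its Config is no longer referenced elsewhere.
    private static final Map<Config, MimeRegistry> CACHE = new WeakHashMap<>();

    /** Returns the cached registry for conf, building it on first request. */
    static synchronized MimeRegistry registryFor(Config conf) {
        return CACHE.computeIfAbsent(conf, c -> new MimeRegistry());
    }

    public static void main(String[] args) {
        Config conf = new Config();
        MimeRegistry first = registryFor(conf);
        MimeRegistry second = registryFor(conf);
        System.out.println("same instance: " + (first == second));
        System.out.println("builds: " + MimeRegistry.BUILDS.get());
    }
}
```

With a cache like this, repeated lookups for the same configuration share one registry instead of re-reading the MIME definitions each time.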

Thanks!

Cheers,
 Chris


> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.




[jira] Work started: (NUTCH-618) Tika error "Media type alias already exists"

2008-03-06 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-618 started by Chris A. Mattmann.

> Tika error "Media type alias already exists"
> 
>
> Key: NUTCH-618
> URL: https://issues.apache.org/jira/browse/NUTCH-618
> Project: Nutch
>  Issue Type: Bug
>  Components: mime_type_detector
>Affects Versions: 1.0.0
>Reporter: Andrzej Bialecki 
>Assignee: Chris A. Mattmann
>
> After the upgrade to the latest Tika jar we see a lot of errors like this:
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid 
> media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: 
> text/xml
>   at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>   at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>   at 
> org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>   at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>   at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
>   at org.apache.nutch.protocol.Content.<init>(Content.java:85)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>   at 
> org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> This is caused most likely by the duplicate tika-mimetypes.xml file - one 
> copy is embedded inside the Tika jar, the other is found in Nutch conf/ 
> directory. The one inside the jar seems to be more recent, so I propose to 
> simply remove the one we have in conf.



