[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063380#comment-13063380 ] Julien Nioche commented on NUTCH-1037: -- IIRC it stores a byte array for each

[jira] [Created] (NUTCH-1043) Add pattern for filtering .js in default url filters

2011-07-12 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Priority: Minor Fix For: 1.4, 2.0 The Javascript parser is not used by default as it is extremely noisy, however the default URL filters do not filter out URLs ending in .js and the default parser (Tika) can't parse them. In a nut
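
The gap described in NUTCH-1043 (URLs ending in .js pass the default URL filters although no enabled parser handles them) can be illustrated with a minimal sketch. It is written in Python rather than Nutch's Java purely for illustration; the suffix list and the `is_filtered` helper are assumptions made for this example, not the contents of the attached patch or of Nutch's default filter file.

```python
import re

# Hypothetical exclusion rule: reject URLs that end in one of these suffixes.
# The suffix list is illustrative only, not Nutch's actual default list.
EXCLUDED_SUFFIXES = re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)


def is_filtered(url: str) -> bool:
    """Return True if the URL would be dropped by the suffix rule."""
    return EXCLUDED_SUFFIXES.search(url) is not None
```

Anchoring the pattern at the end of the URL means a path segment such as `/js/` or a suffix such as `.json` is not affected; only URLs that actually end in `.js` are rejected.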

[jira] [Commented] (NUTCH-1043) Add pattern for filtering .js in default url filters

2011-07-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064065#comment-13064065 ] Julien Nioche commented on NUTCH-1043: -- My point here is not to add all the suff

[jira] [Created] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-13 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Fix For: 2.0 We currently provide conf/tika-mimetypes.xml despite the fact that it is identical to the one found in tika-core.jar. Having a mechanism for specifying a custom tika-mimetypes.xml is good, but if the user hasn't spec

[jira] [Updated] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1045: - Attachment: NUTCH-1045-1.4.patch > MimeUtil to rely on default config provided by T

[jira] [Updated] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1045: - Priority: Minor (was: Major) Affects Version/s: 2.0 Fix Version/s: 1.4

[jira] [Updated] (NUTCH-1043) Add pattern for filtering .js in default url filters

2011-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1043: - Attachment: NUTCH-1043.patch > Add pattern for filtering .js in default url filt

[jira] [Commented] (NUTCH-987) Support HTTP auth for Solr communication

2011-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064541#comment-13064541 ] Julien Nioche commented on NUTCH-987: - don't forget to add the param

[jira] [Created] (NUTCH-1046) Add tests for indexing to SOLR

2011-07-13 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Fix For: 1.4, 2.0 We currently have no tests for checking that the indexing to SOLR works as expected. Running an embedded SOLR Server within the tests would be good. -- This message is automatically generated by JIRA. For more information on JIRA, see

[jira] [Created] (NUTCH-1047) Pluggable indexing backends

2011-07-13 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Fix For: 1.4 One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to

[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064596#comment-13064596 ] Julien Nioche commented on NUTCH-1037: -- Looks OK apart from the indentation +

[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064618#comment-13064618 ] Julien Nioche commented on NUTCH-1037: -- * indentation : not that bad indeed -

[jira] [Commented] (NUTCH-987) Support HTTP auth for Solr communication

2011-07-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065792#comment-13065792 ] Julien Nioche commented on NUTCH-987: - Hi Markus, will this be committed to trun

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065806#comment-13065806 ] Julien Nioche commented on NUTCH-1045: -- Does not pass the tests - will investi

[jira] [Created] (NUTCH-1053) Parsing of RSS feeds fails

2011-07-15 Thread Julien Nioche (JIRA)
Nioche Assignee: Julien Nioche Fix For: 1.4 See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html Have been able to reproduce the problem and will look into it -- This message is automatically generated by JIRA. For more

[jira] [Created] (NUTCH-1054) Make linkDB optional during indexing

2011-07-15 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Fix For: 1.4, 2.0 Having a linkDB is currently mandatory for indexing, however not all users are interested in using the anchors. The linkDB should be optional while indexing -- This message is automatically generated by JIRA. For more information on

[jira] [Updated] (NUTCH-1054) Make linkDB optional during indexing

2011-07-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1054: - Attachment: NUTCH-1054-1.4.patch Patch which prevents getting an exception when the linkDB

[jira] [Updated] (NUTCH-1054) Make linkDB optional during indexing

2011-07-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1054: - Attachment: NUTCH-1054-1.4.v2.patch New patch which implements the change in the syntax i.e

[jira] [Commented] (NUTCH-1054) Make linkDB optional during indexing

2011-07-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066063#comment-13066063 ] Julien Nioche commented on NUTCH-1054: -- good catch - hadn't tried with

[jira] [Updated] (NUTCH-1054) Make linkDB optional during indexing

2011-07-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1054: - Attachment: NUTCH-1054-1.4.v2.patch New patch with check on number of args fixed > Make lin

[jira] [Commented] (NUTCH-657) Estonian N-gram profile has wrong name

2011-07-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066446#comment-13066446 ] Julien Nioche commented on NUTCH-657: - Hi Ken I think this has been done for 2.0

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-07-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066484#comment-13066484 ] Julien Nioche commented on NUTCH-1047: -- {quote} My interest in your last point

[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2011-07-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1047: - Description: One possible feature would be to add a new endpoint for indexing-backends and make

[jira] [Updated] (NUTCH-1054) Make linkDB optional during indexing

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1054: - Fix Version/s: (was: 2.0) > Make linkDB optional during index

[jira] [Resolved] (NUTCH-1054) Make linkDB optional during indexing

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1054. -- Resolution: Fixed Committed revision 1147794 Thanks for testing and reviewing > Make lin

[jira] [Resolved] (NUTCH-1043) Add pattern for filtering .js in default url filters

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1043. -- Resolution: Fixed Committed revision 1147796 -> 1.4 Committed revision 1147798 -> 2.0

[jira] [Commented] (NUTCH-1059) Remove convdb command from /bin/nutch

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066874#comment-13066874 ] Julien Nioche commented on NUTCH-1059: -- +1 thanks. Don't forget to ad

[jira] [Commented] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066982#comment-13066982 ] Julien Nioche commented on NUTCH-1044: -- I can confirm the issue. The solutio

[jira] [Assigned] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-1044: Assignee: Julien Nioche > Redirected URLs and possibly all of their outlinked URLs h

[jira] [Updated] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1044: - Priority: Critical (was: Major) Fix Version/s: 1.4 > Redirected URLs and possibly

[jira] [Commented] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066993#comment-13066993 ] Julien Nioche commented on NUTCH-1048: -- The 'view list archive' link

[jira] [Issue Comment Edited] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066982#comment-13066982 ] Julien Nioche edited comment on NUTCH-1044 at 7/18/11 12:5

[jira] [Commented] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067066#comment-13067066 ] Julien Nioche commented on NUTCH-1048: -- That would be great, thanks > Buste

[jira] [Resolved] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1048. -- Resolution: Fixed > Busted links on http://nutch.apache.org/mailing_lists.h

[jira] [Closed] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1048. Changes to page done. Thanks to Eric for reporting and Lewis for committing the changes > Bus

[jira] [Reopened] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reopened NUTCH-1048: -- Assignee: Lewis John McGibbney Lewis - when changing the content of the site you must commit

[jira] [Commented] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067309#comment-13067309 ] Julien Nioche commented on NUTCH-1048: -- {quote} I've been committing

[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067668#comment-13067668 ] Julien Nioche commented on NUTCH-1037: -- +1. Maybe add to the description somet

[jira] [Commented] (NUTCH-1050) Add segmentDir option to WebGraph

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067674#comment-13067674 ] Julien Nioche commented on NUTCH-1050: -- +1 looks good to me. Thanks Markus &

[jira] [Commented] (NUTCH-1057) Make fetcher thread time out configurable

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067678#comment-13067678 ] Julien Nioche commented on NUTCH-1057: -- Apart from the part related to NUTCH-

[jira] [Commented] (NUTCH-865) Format source code in unique style

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067684#comment-13067684 ] Julien Nioche commented on NUTCH-865: - That's not very complex nor huge. All

[jira] [Created] (NUTCH-1063) OutlinkExtractor test generates an exception but does not fail

2011-07-19 Thread Julien Nioche (JIRA)
Versions: 1.4 Reporter: Julien Nioche Fix For: 1.4 Testsuite: org.apache.nutch.parse.TestOutlinkExtractor Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.043 sec - Standard Output --- 2011-07-19 15:06:36,073 ERROR parse.OutlinkExtractor

[jira] [Updated] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1045: - Attachment: NUTCH-1045-1.4-v2.patch New version of the patch which passes the tests. Any

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067773#comment-13067773 ] Julien Nioche commented on NUTCH-1045: -- you should see a message in the logs at

[jira] [Created] (NUTCH-1064) o.a.n.util.MimeUtil uses deprecated Tika methods

2011-07-19 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Fix For: 1.4 This class is in serious need of refactoring as the underlying Tika API has changed a lot. The logic around which strategies to use (e.g. trust the metadata returned by the server? trust Tika's detection?) should be reimplemented

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067794#comment-13067794 ] Julien Nioche commented on NUTCH-1045: -- {quote} May be because the empty fil

[jira] [Closed] (NUTCH-1048) Busted links on http://nutch.apache.org/mailing_lists.html

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1048. Resolution: Fixed you are welcome. thanks for committing the changes > Busted links on h

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067912#comment-13067912 ] Julien Nioche commented on NUTCH-1045: -- Great, seems to be working fine then. Th

[jira] [Commented] (NUTCH-865) Format source code in unique style

2011-07-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067914#comment-13067914 ] Julien Nioche commented on NUTCH-865: - Maybe have a look at the diffs after appl

[jira] [Commented] (NUTCH-920) Project Metadata

2011-07-21 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069150#comment-13069150 ] Julien Nioche commented on NUTCH-920: - Guys, what about adding one in the trunk

[jira] [Commented] (NUTCH-1066) trivial correction of domain-urlfilter.txt

2011-07-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069499#comment-13069499 ] Julien Nioche commented on NUTCH-1066: -- trivial indeed - feel free to commit, th

[jira] [Commented] (NUTCH-717) Make Nutch Solr integration easier

2011-07-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070452#comment-13070452 ] Julien Nioche commented on NUTCH-717: - Maybe we could make the indexing back

[jira] [Commented] (NUTCH-1065) New mvn.template

2011-07-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070453#comment-13070453 ] Julien Nioche commented on NUTCH-1065: -- +1 thanks > New mvn.t

[jira] [Resolved] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1045. -- Resolution: Fixed Assignee: Julien Nioche 1.4 : Committed revision 1150669 trunk

[jira] [Updated] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-07-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1044: - Attachment: NUTCH-1044-1.4.patch Fixes the score of redirections by giving them the same score

[jira] [Created] (NUTCH-1071) Crawldb update to total counts per status

2011-07-28 Thread Julien Nioche (JIRA)
: Julien Nioche Assignee: Julien Nioche Priority: Trivial Fix For: 1.4 The reduce phase of the crawldb update outputs all the entries that will be found in the updated crawldb. We can use the counters to summarise the number of URLs per status, which is a bit
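
The idea in NUTCH-1071 (summarising the number of URLs per status as the reduce phase emits the entries of the updated crawldb) can be sketched as a simple tally. This is a minimal Python illustration, not the Java/Hadoop counter code from the patch; the `count_by_status` helper and the sample status labels are assumptions made for the example.

```python
from collections import Counter


def count_by_status(entries):
    """Tally (url, status) pairs by status, one increment per emitted entry."""
    counts = Counter()
    for _url, status in entries:
        counts[status] += 1
    return counts


# Illustrative entries; the status labels are sample values only.
sample = [
    ("http://example.com/a", "db_fetched"),
    ("http://example.com/b", "db_unfetched"),
    ("http://example.com/c", "db_fetched"),
]
summary = count_by_status(sample)
```

Because each entry is counted at the moment it is emitted, the totals describe the updated crawldb without a separate pass over it.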

[jira] [Updated] (NUTCH-1071) Crawldb update to total counts per status

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1071: - Attachment: NUTCH-1071.patch > Crawldb update to total counts per sta

[jira] [Closed] (NUTCH-1071) Crawldb update to total counts per status

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1071. > Crawldb update to total counts per sta

[jira] [Resolved] (NUTCH-1071) Crawldb update to total counts per status

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1071. -- Resolution: Fixed Committed revision 1151852. > Crawldb update to total counts per sta

[jira] [Created] (NUTCH-1072) Display number and size of queues in Fetcher status

2011-07-28 Thread Julien Nioche (JIRA)
: fetcher Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Priority: Trivial Fix For: 1.4 Knowing about the number of queues gives a better idea of the distribution of the fetch list per [host|domain|IP]. As for the size of the

[jira] [Updated] (NUTCH-1072) Display number and size of queues in Fetcher status

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1072: - Attachment: NUTCH-1072.patch > Display number and size of queues in Fetcher sta

[jira] [Closed] (NUTCH-1072) Display number and size of queues in Fetcher status

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1072. Resolution: Fixed Committed revision 1151884. > Display number and size of queues in Fetc

[jira] [Closed] (NUTCH-919) Logos and Graphics

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-919. --- Resolution: Fixed Thanks Lewis > Logos and Graphics > -- > >

[jira] [Closed] (NUTCH-920) Project Metadata

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-920. --- Resolution: Fixed Thanks > Project Metadata > > > Ke

[jira] [Commented] (NUTCH-917) Website Navigation Links

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072469#comment-13072469 ] Julien Nioche commented on NUTCH-917: - Lewis, what is the commit number for

[jira] [Reopened] (NUTCH-917) Website Navigation Links

2011-07-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reopened NUTCH-917: - > Website Navigation Links > > > Ke

[jira] [Created] (NUTCH-1073) Rename parameters 'fetcher.threads.per.host.by.ip' and 'fetcher.threads.per.host'

2011-07-29 Thread Julien Nioche (JIRA)
Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Priority: Minor Fix For: 1.4 The parameters 'fetcher.threads.per.host.by.ip'

[jira] [Commented] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-08-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076043#comment-13076043 ] Julien Nioche commented on NUTCH-1044: -- Will commit soon if there aren&#

[jira] [Created] (NUTCH-1075) Delegate language identification to Tika

2011-08-01 Thread Julien Nioche (JIRA)
: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.4 In 2.0 the language identification is delegated to Tika and is done as part of the parsing step (and not during the indexing as done currently). The patch attached is a backport from trunk

[jira] [Updated] (NUTCH-1075) Delegate language identification to Tika

2011-08-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1075: - Attachment: NUTCH-1075.patch Passes the tests but requires some testing > Delegate langu

[jira] [Updated] (NUTCH-1073) Rename parameters 'fetcher.threads.per.host.by.ip' and 'fetcher.threads.per.host'

2011-08-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1073: - Attachment: NUTCH-1073.patch Patch implementing the change of parameter names and allows

[jira] [Commented] (NUTCH-1028) Log parser keys

2011-08-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081585#comment-13081585 ] Julien Nioche commented on NUTCH-1028: -- You can see the progression of the par

[jira] [Issue Comment Edited] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-08-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081663#comment-13081663 ] Julien Nioche edited comment on NUTCH-623 at 8/9/11 2:3

[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-08-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081663#comment-13081663 ] Julien Nioche commented on NUTCH-623: - The functionality being delegated to Tika

[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-08-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081693#comment-13081693 ] Julien Nioche commented on NUTCH-623: - Lewis, Again this is a separate issue

[jira] [Closed] (NUTCH-463) Nutch powerpoint parser plugin fails to parse ppt with images

2011-08-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-463. --- Resolution: Won't Fix Parsing delegated to Tika > Nutch powerpoint parser plugin fails to p

[jira] [Closed] (NUTCH-537) TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java compile

2011-08-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-537. --- Resolution: Won't Fix > TestMP3Parser.java, TestRTFParser.java, TestMSWordParser.java

[jira] [Commented] (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2011-08-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082298#comment-13082298 ] Julien Nioche commented on NUTCH-258: - Lewis - this issue is closed and I am not

[jira] [Closed] (NUTCH-917) Website Navigation Links

2011-08-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-917. --- Resolution: Fixed That's great, thanks Lewis > Website Navigati

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

2011-08-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082610#comment-13082610 ] Julien Nioche commented on NUTCH-1075: -- Hi Lewis, One way of testing would b

[jira] [Resolved] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-08-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1044. -- Resolution: Fixed Committed revision 1156342. Thanks for reporting it > Redirected URLs

[jira] [Reopened] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-08-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reopened NUTCH-623: - Lewis, If you haven't opened the issue yourself, then don't close it unless the p

[jira] [Commented] (NUTCH-1079) StringBuffer converted to StringBuilder

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085667#comment-13085667 ] Julien Nioche commented on NUTCH-1079: -- Can we expect to get any signifi

[jira] [Commented] (NUTCH-940) static field plugin

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085670#comment-13085670 ] Julien Nioche commented on NUTCH-940: - Still needs a description of the parameter

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085671#comment-13085671 ] Julien Nioche commented on NUTCH-1075: -- Any more testers for this issue? Shal

[jira] [Created] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
ParserChecker implements Tool - Key: NUTCH-1083 URL: https://issues.apache.org/jira/browse/NUTCH-1083 Project: Nutch Issue Type: Improvement Components: parser Reporter: Julien Nioche

[jira] [Updated] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1083: - Attachment: NUTCH-1083.patch > ParserChecker implements T

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085687#comment-13085687 ] Julien Nioche commented on NUTCH-1075: -- Markus - have a look at h

[jira] [Resolved] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1083. -- Resolution: Fixed Committed revision 1158269. > ParserChecker implements T

[jira] [Closed] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1083. > ParserChecker implements Tool > - > > Key

[jira] [Commented] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085708#comment-13085708 ] Julien Nioche commented on NUTCH-1083: -- [big sigh] indeed. Having 2 branches

[jira] [Resolved] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1083. -- Resolution: Fixed Fix Version/s: 2.0 Trunk - Committed revision 1158277

[jira] [Closed] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1083. > ParserChecker implements Tool > - > > Key

[jira] [Reopened] (NUTCH-1083) ParserChecker implements Tool

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reopened NUTCH-1083: -- > ParserChecker implements Tool > - > >

[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085729#comment-13085729 ] Julien Nioche commented on NUTCH-623: - Lewis - have you reverted the changes in 1.

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085728#comment-13085728 ] Julien Nioche commented on NUTCH-1075: -- See https://issues.apache.org/jira/br

[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085771#comment-13085771 ] Julien Nioche commented on NUTCH-1078: -- slf4j has a library that can do

[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085773#comment-13085773 ] Julien Nioche commented on NUTCH-1067: -- Markus - please assign this issue t

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085786#comment-13085786 ] Julien Nioche commented on NUTCH-1075: -- ah, sorry. it looked a lot like the erro

[jira] [Commented] (NUTCH-1051) Export WebGraph node scores for solr.ExternalFileField

2011-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085800#comment-13085800 ] Julien Nioche commented on NUTCH-1051: -- +1 Haven't tested it but it looks O
