[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-09 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323001 ] 

Dawid Weiss commented on NUTCH-88:
--

Hi.

I share your opinion -- this is an important issue. If I may add my few cents, 
the crawler should try to mimic a browser in handling mime types. This, of 
course, gets quite complex since Internet Explorer has a very confusing and 
unnecessarily complex mime type handling heuristic... which happens to change 
from version to version as well. Anyway, if you care to look, there are a few 
articles that explain the steps performed by IE to resolve a mime type of a Web 
page --

http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp
http://msdn.microsoft.com/workshop/networking/moniker/overview/mime_handling.asp

D.

> Enhance ParserFactory plugin selection policy
> -
>
>  Key: NUTCH-88
>  URL: http://issues.apache.org/jira/browse/NUTCH-88
>  Project: Nutch
> Type: Improvement
>   Components: indexer
> Versions: 0.7, 0.8-dev
> Reporter: Jerome Charron
>  Fix For: 0.8-dev

>
> The ParserFactory chooses the Parser plugin to use based on the content-types 
> and path-suffix defined in the parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType" 
> attribute matches the beginning of the content's type is used. 
> If none match, then the first whose "pathSuffix" attribute matches the end of 
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the 
> empty string is used.
> This policy causes a lot of problems when no match is found, because a random 
> parser is used (and there is a good chance this parser can't handle the 
> content).
> On the other hand, the content-type associated with a parser plugin is 
> specified in the plugin.xml of each plugin (this is the value used by the 
> ParserFactory), AND the code of each parser itself checks whether the 
> content-type is ok (it uses a hard-coded content-type value rather than 
> the value specified in plugin.xml => possibility of mismatches between 
> the hard-coded content-type and the content-type declared in plugin.xml).
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
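
The three-step selection policy quoted above can be sketched roughly as follows. This is not the actual ParserFactory code: ParserInfo, select() and the plugin ids used with it are hypothetical stand-ins, and the sketch assumes that an empty string means "attribute not declared" and that plugins are tried in list order.

```java
import java.util.List;
import java.util.Optional;

class ParserSelectionSketch {

    /** Hypothetical stand-in for the attributes a parser declares in plugin.xml. */
    static final class ParserInfo {
        final String id;
        final String contentType;
        final String pathSuffix;

        ParserInfo(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Applies the three rules in order; the order of the list decides ties. */
    static Optional<ParserInfo> select(List<ParserInfo> plugins,
                                       String contentType, String urlPath) {
        // Rule 1: the declared contentType is a prefix of the actual content type.
        for (ParserInfo p : plugins) {
            if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                return Optional.of(p);
            }
        }
        // Rule 2: the declared pathSuffix matches the end of the URL path.
        for (ParserInfo p : plugins) {
            if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                return Optional.of(p);
            }
        }
        // Rule 3: fall back to the first plugin with an empty pathSuffix,
        // whether or not it can actually handle the content.
        for (ParserInfo p : plugins) {
            if (p.pathSuffix.isEmpty()) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }
}
```

With a plain-text parser registered under an empty pathSuffix, a request for image/png content at /logo.png falls through to rule 3 and gets the text parser, which is the kind of mismatch the issue complains about.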

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-09 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323009 ] 

Dawid Weiss commented on NUTCH-88:
--

Yep, I know about the byte-magic MIME detector. I'm just pointing out that 
Internet Explorer doesn't use it... or at least, it doesn't always use it the 
way you would expect it to. Whether Nutch should mimic IE in this behaviour is 
another question.




[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-20 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332559 ] 

Dawid Weiss commented on NUTCH-82:
--

I personally disagree that Perl is a better alternative to Cygwin... Most 
people familiar with Unix/Windows development will have no problems modifying 
a bash script, whereas a Perl script... hmm.. Perl is Perl :) 

As for a pure Java solution, I agree this would be handy. However, Java is 
quite a pain to invoke, especially with multiple JVM switches such as -Xmx... 
So you'd probably have to fall back to a 'boot' script anyway at some point. 
The only pure Java thing that comes to my mind is using ANT to spawn a JVM and 
then writing commons-cli equivalents of the command line tools... but this, as 
much as I hate to have platform-dependent scripts, seems like overkill compared 
to the bash solution.
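
To make the "pure Java" option a bit more concrete, here is a minimal sketch of a launcher that spawns a child JVM with a heap switch. It uses java.lang.ProcessBuilder rather than Ant, purely to stay self-contained; JvmLauncherSketch, buildCommand() and launch() are hypothetical names, and the main class passed in would be whatever tool you want to run.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class JvmLauncherSketch {

    /** Builds the command line for a child JVM; nothing is started here. */
    static List<String> buildCommand(String maxHeap, String classpath,
                                     String mainClass, List<String> args) {
        // Reuse the JVM that runs the launcher itself.
        String javaBin = System.getProperty("java.home")
            + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.add("-Xmx" + maxHeap);   // JVM switch, e.g. -Xmx512m
        command.add("-cp");
        command.add(classpath);
        command.add(mainClass);
        command.addAll(args);
        return command;
    }

    /** Starts the child JVM, shares the console with it, returns its exit code. */
    static int launch(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Note that something still has to start the launcher JVM in the first place, which is the 'boot' script problem mentioned above.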

> Nutch Commands should run on Windows without external tools
> ---
>
>  Key: NUTCH-82
>  URL: http://issues.apache.org/jira/browse/NUTCH-82
>  Project: Nutch
> Type: New Feature
>  Environment: Windows 2000
> Reporter: AJ Banck
>  Attachments: nutch.bat, nutch.bat, nutch.pl
>
> Currently there is only a shell script to run the Nutch commands. This should 
> be platform independent.
> Best would be Ant tools, or scripts generated by a template tool to avoid 
> replication.




[jira] Created: (NUTCH-217) InstantiationException when deserializing Query (no parameterless constructor)

2006-02-26 Thread Dawid Weiss (JIRA)
InstantiationException when deserializing Query (no parameterless constructor)
--

 Key: NUTCH-217
 URL: http://issues.apache.org/jira/browse/NUTCH-217
 Project: Nutch
Type: Bug
  Components: searcher  
Versions: 0.8-dev
Reporter: Dawid Weiss


I've been playing with the trunk. The distributed searcher complains with an 
instantiation exception when deserializing Query. A quick code inspection shows 
that Query doesn't have any parameterless constructor.
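
For illustration, here is a minimal sketch of why a missing parameterless constructor breaks reflective instantiation. It assumes the deserializer creates an empty object reflectively before filling in its fields; WithoutDefault and WithDefault are hypothetical classes, not Nutch's Query. The sketch uses getDeclaredConstructor(), which reports the problem as NoSuchMethodException; the older Class.newInstance() reports the same condition as InstantiationException.

```java
import java.lang.reflect.Constructor;

class NoArgConstructorSketch {

    /** Hypothetical class that only offers a constructor with parameters. */
    static class WithoutDefault {
        final String text;

        WithoutDefault(String text) {
            this.text = text;
        }
    }

    /** Hypothetical class that also offers a parameterless constructor. */
    static class WithDefault {
        String text = "";

        WithDefault() {
        }
    }

    /** Returns true if the type can be created reflectively with no arguments. */
    static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            constructor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // No parameterless constructor, or it could not be invoked.
            return false;
        }
    }
}
```

Adding a parameterless constructor (it may be non-public if the framework sets it accessible) is the usual way out.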




[jira] Created: (NUTCH-228) Clustering plugin descriptor broken (fix included)

2006-03-12 Thread Dawid Weiss (JIRA)
Clustering plugin descriptor broken (fix included)
--

 Key: NUTCH-228
 URL: http://issues.apache.org/jira/browse/NUTCH-228
 Project: Nutch
Type: Bug
Reporter: Dawid Weiss
Priority: Minor


The plugin descriptor for clustering-carrot2 is currently broken (points to a 
missing JAR). I'm adding a patch fixing this to this issue in a minute.




[jira] Updated: (NUTCH-228) Clustering plugin descriptor broken (fix included)

2006-03-12 Thread Dawid Weiss (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-228?page=all ]

Dawid Weiss updated NUTCH-228:
--

Attachment: clustering.patch

This patch fixes the plugin descriptor and a typo in cluster.jsp that caused 
the wrong number of milliseconds to be written to the output log file.




[jira] Created: (NUTCH-234) Clustering extension code cleanups and a real JUnit test case for the current implementation.

2006-03-17 Thread Dawid Weiss (JIRA)
Clustering extension code cleanups and a real JUnit test case for the current 
implementation.
-

 Key: NUTCH-234
 URL: http://issues.apache.org/jira/browse/NUTCH-234
 Project: Nutch
Type: Test
Reporter: Dawid Weiss
Priority: Minor


I've cleaned up the code a bit and added a real test case for the clustering 
extension. This is in preparation for upgrading to the most recent Carrot2 
codebase and I didn't want to mix these two patches together. I'd appreciate it 
if somebody could review this patch so that I can integrate the newest C2 code 
this weekend. Thanks.




[jira] Updated: (NUTCH-234) Clustering extension code cleanups and a real JUnit test case for the current implementation.

2006-03-17 Thread Dawid Weiss (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-234?page=all ]

Dawid Weiss updated NUTCH-234:
--

Attachment: patch.diff

The patch adds:
- a JUnit test case for the clustering extension,
- minor code cleanups,
- the ".settings" entry to svn:ignore on the main Nutch folder -- this is 
Eclipse's project settings folder.




[jira] Created: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-23 Thread Dawid Weiss (JIRA)
Carrot2 clustering plugin upgrade.
--

 Key: NUTCH-237
 URL: http://issues.apache.org/jira/browse/NUTCH-237
 Project: Nutch
Type: Improvement
Reporter: Dawid Weiss
Priority: Trivial


This is an upgrade of the clustering plugin to the newest release (1.0.2).




[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-23 Thread Dawid Weiss (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
--

Attachment: c2.patch
svn-stat.txt

Note the two deleted files (I attached the output of svn stat). I didn't know 
how to include this info in the diff file; I don't think it's possible with 
plain svn.




[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-23 Thread Dawid Weiss (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
--

Attachment: libs.zip

Libraries that need to be replaced.




[jira] Commented: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-24 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-237?page=comments#action_12371687 ] 

Dawid Weiss commented on NUTCH-237:
---

Yes and no. I removed the "support" for foreign languages from the constructor 
code:

// We initialize Lingo with English stemming and stopwords. Lingo has
// a simple language detection filter, but you'll be better off hardcoding
// the language according to your needs. If you have bilingual indices,
// then there is a possibility of creating a more complex process that assigns
// a language tag before the clustering is actually started.
return new LingoLocalFilterComponent(
  new Language[] { new English() },
  defaults);
  }

Language detection is not really brilliant in the open source Lingo so I 
thought it wouldn't make sense to give people false hopes. Now, all the 
stemmers and stopword lists are still included in the release (look inside 
carrot2-util-tokenizer.jar$/com/dawidweiss/carrot/util/tokenizer/languages/...) 
so you can freely switch to another language by changing the instantiated 
language. 

I have a better idea though -- how about you apply this patch (because I've 
tested it and know it works) and I'll make the language configurable via ISO 
codes set in the Nutch configuration? The default would be English and you 
could set your own language in there if you wanted to. All right?




[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-04-04 Thread Dawid Weiss (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]

Dawid Weiss updated NUTCH-237:
--

Attachment: NUTCH-237.DWEISS.patch.zip

Hi Andrzej. The ZIP file contains a patch and svn stat with the improved code:

- The primary language for hits without explicit langid and a list of enabled 
languages in the clustering component can be specified in the configuration 
file (readme.txt gives the details).

- by default all languages in Carrot2 (except for Polish) are enabled. English 
is the default.

- I removed the dependency on Neko in favor of the simpler routine we have in 
Carrot2 codebase anyway. The change shouldn't affect the results (I checked on 
my local installation and it seems to be fine).

I haven't played with the language identifier yet because I don't have a crawl 
with documents containing langid codes. The code should work without problems 
though -- details.getValue("lang") is converted to Carrot2's property 
RawDocument.PROPERTY_LANGUAGE and this is taken into account when clustering.

I couldn't delete the previously attached files. This ZIP file contains only 
the patch and the svn stat output -- you'll have to remove a few JARs manually 
and replace others with their new counterparts from the ZIP file I attached to 
this issue earlier (they haven't changed). Let me know if you need anything.





[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ] 

Dawid Weiss commented on NUTCH-134:
---

(back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a 
plain-text-only summarizer is ideal for clustering, for example. HTML is quite 
uncomfortable to work with.

> Summarizer doesn't select the best snippets
> ---
>
>  Key: NUTCH-134
>  URL: http://issues.apache.org/jira/browse/NUTCH-134
>  Project: Nutch
> Type: Bug

>   Components: searcher
> Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
> Reporter: Andrzej Bialecki 
>  Attachments: summarizer.060506.patch
>
> Summarizer.java tries to select the best fragments from the input text, where 
> the frequency of query terms is the highest. However, the logic in line 223 
> is flawed in that the excerptSet.add() operation will add new excerpts only 
> if they are not already present - the test is performed using the Comparator 
> that compares only the numUniqueTokens. This means that if there are two or 
> more excerpts, which score equally high, only the first of them will be 
> retained, and the rest of equally-scoring excerpts will be discarded, in 
> favor of other excerpts (possibly lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To 
> keep the relative position of excerpts in the original order the Excerpt 
> class should be extended with an "int order" field, and the collected 
> excerpts should be sorted in that order prior to adding them to the summary.
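
The effect described in the issue can be reproduced with a small sketch. This is not the Summarizer code: Excerpt is a simplified, hypothetical class, and the sorted set is assumed to be a java.util.TreeSet. A set ordered by a score-only comparator treats equally scoring excerpts as duplicates, while a list plus a sort keeps them all and lets the "order" field restore the original sequence.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class ExcerptSelectionSketch {

    /** Simplified excerpt: its score and its position in the source text. */
    static final class Excerpt {
        final String text;
        final int numUniqueTokens;
        final int order;

        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
            this.order = order;
        }
    }

    /** Compares the score only, as the comparator in the issue does. */
    static final Comparator<Excerpt> BY_SCORE =
        (a, b) -> Integer.compare(a.numUniqueTokens, b.numUniqueTokens);

    /** Flawed approach: the set silently drops excerpts with an equal score. */
    static int countKeptBySet(List<Excerpt> excerpts) {
        TreeSet<Excerpt> set = new TreeSet<>(BY_SCORE);
        set.addAll(excerpts);
        return set.size();
    }

    /** Proposed approach: list + sort by score, then restore document order. */
    static List<String> selectBest(List<Excerpt> excerpts, int max) {
        List<Excerpt> sorted = new ArrayList<>(excerpts);
        sorted.sort(BY_SCORE.reversed());   // stable sort: ties keep input order
        List<Excerpt> best =
            new ArrayList<>(sorted.subList(0, Math.min(max, sorted.size())));
        best.sort((a, b) -> Integer.compare(a.order, b.order));
        List<String> texts = new ArrayList<>();
        for (Excerpt e : best) {
            texts.add(e.text);
        }
        return texts;
    }
}
```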




[jira] Commented: (NUTCH-265) Getting Clustered results in better form.

2006-05-08 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ] 

Dawid Weiss commented on NUTCH-265:
---

The clustering interface is very simple in Nutch because it usually needs to be 
adjusted to the needs of a particular application. Maintaining a complex user 
interface is not among Nutch's objectives, so I doubt it's possible. 
Carrot2, which Nutch uses internally, has a JavaScript-powered interface which 
could be added to Nutch if there are folks who really think it is worth the 
effort.

See this one:
http://carrot.cs.put.poznan.pl/carrot2-remote-controller/newsearch.do?query=nutch&processingChain=carrot2.process.lingo-yahooapi&resultsRequested=100

> Getting Clustered results in better form.
> -
>
>  Key: NUTCH-265
>  URL: http://issues.apache.org/jira/browse/NUTCH-265
>  Project: Nutch
> Type: Improvement

>   Components: searcher
> Versions: 0.7.2
> Reporter: Kris K

>
> The cluster results currently come with a title and a link to the URL. As an 
> improvement, they should be shown as clustered keyword phrases (Vivisimo 
> style). Anyone is welcome to share their views on this. 




[jira] Commented: (NUTCH-265) Getting Clustered results in better form.

2006-05-23 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413072 ] 

Dawid Weiss commented on NUTCH-265:
---

Chris, the current clusterer in Nutch _does_ discover phrases for clusters, so 
I don't know what you really mean. Did you take a look at my previous post? 
Would that kind of user interface make you happy?




[jira] Commented: (NUTCH-265) Getting Clustered results in better form.

2006-05-24 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413220 ] 

Dawid Weiss commented on NUTCH-265:
---

If you just mean the user interface, then you can simply take the XSLT 
stylesheet from Carrot2 and reuse it in Nutch with the opensearch XML -- I 
believe there is even an example in Carrot2 of using opensearch, so you 
shouldn't have much trouble.

Now, the phrases you wish to see on your screen won't always be so beautiful 
because search results clustering works on snippets extracted from search 
results. If you want clean and accurate labels then you'd need to use a 
predefined ontology or something -- I can't help you with that. 

Try playing around with the Carrot2 demo and see if the results satisfy your 
needs. If so, then rewriting Nutch's user interface to suit your needs 
shouldn't be a problem. If your expectations are more demanding then you'll 
need to think of some other solution.





[jira] Commented: (NUTCH-294) Topic-maps of related searchwords

2006-06-06 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414960 ] 

Dawid Weiss commented on NUTCH-294:
---

Ehm, sorry I'm so late with this -- tons of work.

1) Stefan, if you can't get it working, tell us what exactly is not working 
(exceptions? anything else?). The only thing you need to do is enable the 
clustering plugin in your configuration -- there should be a checkbox next to 
your search box; tick that and you should be able to see clustered results when 
you perform a query.

2) Now, having said that, I don't think that's what you're after. Carrot2 
performs clustering of search results based solely on the information contained 
in snippets retrieved from documents (in other words, there is NO ontology and 
NO predefined information, everything is constructed dynamically). If you're 
looking for topic-maps then I guess you're after a certain type of 
classification engine that could pick relevant categories and display them 
along with search results. It's not what (the open source) Carrot2 does.

> Topic-maps of related searchwords
> -
>
>  Key: NUTCH-294
>  URL: http://issues.apache.org/jira/browse/NUTCH-294
>  Project: Nutch
> Type: New Feature

>   Components: searcher
> Reporter: Stefan Neufeind

>
> Would it be possible to offer a user  "topic-maps"? It's when you search for 
> something and get topic-related words that might also be of interest for you. 
> I wonder if that's somehow possible with the ngram-index for "did you mean" 
> (see separate feature-enhancement-bug for this), but we'd need to have a 
> relation between words (in what context do they occur).
> For the webfrontend usually trees are used  - which for some users offer 
> quite impressive eye-candy :-) E.g. see this advertisement by Novell where 
> I've just seen a similar "topic-map" as well:
> http://www.novell.com/de-de/company/advertising/defineyouropen.html




[jira] Commented: (NUTCH-294) Topic-maps of related searchwords

2006-06-07 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12415094 ] 

Dawid Weiss commented on NUTCH-294:
---

Well, you certainly have something wrong in your configuration then. I just 
tried with the head revision. My nutch-site looks like this:

[...]
<property>
  <name>plugin.includes</name>
  <value>clustering-carrot2|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  [...]
  </description>
</property>
[...]

Start Tomcat and issue any query that returns results. Look in the log files 
for:

2006-06-07 09:29:35 org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Online Search Results Clustering using Carrot2's Lingo component (clustering-carrot2)

2006-06-07 09:29:35 org.apache.nutch.clustering.OnlineClustererFactory getOnlineClusterer
INFO: Using the first clustering extension found: Carrot2-Lingo

Ok, the results page should show a "clustering" option next to the "Search" 
button (it does on my installation). Select it and rerun the query. On the 
right side you'll have clusters (titles and three sample documents from each 
cluster are shown).

As for your idea, I still don't think Lingo is what you need... Of course you 
can try feeding it with unrelated keywords and then see what comes out, but I 
don't think it's the right approach.





[jira] Commented: (NUTCH-309) Uses commons logging Code Guards

2006-06-28 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12418396 ] 

Dawid Weiss commented on NUTCH-309:
---

Painful job, Jerome, but in most cases (non-critical loops) the gain will not 
be significant, and a proliferation of if statements makes the code harder to 
read. Wrapping logging statements with code guards is a perfect candidate for 
an aspect -- I'm sure it'd be possible to postprocess the binaries and do it 
automatically (with AspectJ or even a simple implementation of an observer in 
asmlib). Just a thought.

> Uses commons logging Code Guards
> 
>
>  Key: NUTCH-309
>  URL: http://issues.apache.org/jira/browse/NUTCH-309
>  Project: Nutch
> Type: Improvement

> Versions: 0.8-dev
> Reporter: Jerome Charron
> Assignee: Jerome Charron
> Priority: Minor
>  Fix For: 0.8-dev

>
> "Code guards are typically used to guard code that only needs to execute in 
> support of logging, that otherwise introduces undesirable runtime overhead in 
> the general case (logging disabled). Examples are multiple parameters, or 
> expressions (e.g. string + " more") for parameters. Use the guard methods of 
> the form log.is<Priority>() to verify that logging should be performed, 
> before incurring the overhead of the logging method call. Yes, the logging 
> methods will perform the same check, but only after resolving parameters."
> (description extracted from 
> http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
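
A minimal sketch of what a code guard buys, for readers who have not seen one. To stay self-contained it uses java.util.logging and its isLoggable() check instead of commons-logging, where the equivalent guard would be log.isDebugEnabled(); CodeGuardSketch, describe() and the call counter are hypothetical and exist only to make the cost visible.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

class CodeGuardSketch {

    static final Logger LOG = Logger.getLogger(CodeGuardSketch.class.getName());

    /** Counts how often the costly message construction actually ran. */
    static int expensiveCalls = 0;

    /** Stands in for any costly expression used only to build a log message. */
    static String describe(int[] values) {
        expensiveCalls++;
        return Arrays.toString(values);
    }

    /** Unguarded: the message is built even when FINE logging is disabled. */
    static void unguarded(int[] values) {
        LOG.fine("values: " + describe(values));
    }

    /** Guarded: the message is built only if it will actually be logged. */
    static void guarded(int[] values) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("values: " + describe(values));
        }
    }
}
```

With the logger level set to INFO, unguarded() still pays for describe() while guarded() does not. Whether that saving justifies the extra if statement outside hot loops is exactly the trade-off discussed in the comment above.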




[jira] Commented: (NUTCH-300) Clustering API improvements

2006-07-07 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419708 ] 

Dawid Weiss commented on NUTCH-300:
---

Hi. I just took a look at it -- I don't see anything wrong with the code and 
Andrzej has used Carrot2 before. We're in the middle of a major refactoring to 
simplify things within Carrot2 -- the internals won't change much, but we're 
dropping obsolete APIs etc. The new web application has a new shiny user 
interface (at the moment XSLT-filtered from XMLs, so not suitable for huge user 
loads, but very convenient to work with on customizations). Stay tuned.

> Clustering API improvements
> ---
>
>  Key: NUTCH-300
>  URL: http://issues.apache.org/jira/browse/NUTCH-300
>  Project: Nutch
> Type: Improvement

> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
> Priority: Minor
>  Attachments: patch.txt
>
> This patch adds support for retrieving original document scores (from 
> NutchBean), as well as cluster-level relevance scores (from Clusterer). Both 
> methods may improve visual representation of the clusters, where individual 
> items may be visually differentiated depending on their query relevance and 
> cluster relevance. A modified cluster.jsp illustrates this feature.




[jira] Commented: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0

2006-11-15 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-397?page=comments#action_12450146 ] 

Dawid Weiss commented on NUTCH-397:
---

I'll review this patch and commit all the necessary code as soon as possible 
(it may be around the end of the week though since I have a few urgent papers 
to review).

> porting clustering-carrot2 plugin to carrot2 v2.0
> -
>
> Key: NUTCH-397
> URL: http://issues.apache.org/jira/browse/NUTCH-397
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Doğacan Güney
>Priority: Trivial
> Attachments: carrot2-nutch-plugin.patch, 
> clustering-carrot2-lib.tar.gz, clustering.patch
>
>
> A rather trivial port of clustering-carrot2 to new carrot2. I also added the 
> necessary jars for Polish, so that nutch will not give the annoying 
> exceptions when it is initializing clustering-carrot2. 
> There is a small problem, though. AFAICS, a small patch has to be applied to 
> carrot2, otherwise nutch can not start the plugin. (I am also attaching that 
> here.)





[jira] Created: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
Upgrade Carrot2 clustering plugin to the newest stable release (2.1)


 Key: NUTCH-544
 URL: https://issues.apache.org/jira/browse/NUTCH-544
 Project: Nutch
  Issue Type: Improvement
Reporter: Dawid Weiss
Priority: Minor


This issue upgrades Carrot2 search results clustering plugin to the newest 
stable version.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521784
 ] 

Dawid Weiss commented on NUTCH-544:
---

I've started working on this -- will send a patch for revision soon (tested 
against the current trunk -- didn't know which version to set for "Affects 
version", please feel free to edit this field on this issue).

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521791
 ] 

Dawid Weiss commented on NUTCH-544:
---

Yes, absolutely -- it's actually my fault I didn't notice these tasks, 
apologies.

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521792
 ] 

Dawid Weiss commented on NUTCH-544:
---

Doğacan, would it be a problem if we threw in BeanShell and Dom4j JARs? We have 
been talking about this with Staszek -- this would allow us to instantiate 
clustering algorithms dynamically and would effectively provide alternatives 
for Nutch users to use Lingo, STC or Lingo3G (our commercial clusterer).

I'm asking because I remember at the beginning there were concerns about the 
size of Nutch when compiled with all plugin dependencies etc. 


> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-544:
--

Attachment: clustering-upgrade-2.1.patch

svn diff of the patch. Binary files are not included (is there a way to do it 
with Subversion?), I'll post them in a separate bundle.

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-544:
--

Attachment: libs-packed.tar.gz

lib folder (binary files to be replaced).

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521842
 ] 

Dawid Weiss commented on NUTCH-544:
---

Ok, this patch does the following:

- upgrades Carrot2 libs to 2.1 (the most recent stable version),
- fixes issues with tests not being run properly,
- fixes some multiple-initialization issues.

It is ready for review/commit.

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521843
 ] 

Dawid Weiss commented on NUTCH-544:
---

Not exactly; the initialization issue is still present, but I'll create another 
JIRA entry for it and fix it there (it's not related to the upgrade, but rather 
to the webapp).

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Created: (NUTCH-545) Configuration and OnlineClusterer get initialized in every request.

2007-08-22 Thread Dawid Weiss (JIRA)
Configuration and OnlineClusterer get initialized in every request.
---

 Key: NUTCH-545
 URL: https://issues.apache.org/jira/browse/NUTCH-545
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Dawid Weiss


The initialization code block in search.jsp is invoked in every request (it's 
part of the request block). This is unnecessary and actually slows down the 
request cycle -- Configuration and OnlineClusterer can (and should) be reused.

The attached patch moves the initialization code to jspInit().
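
A rough plain-Java sketch of the difference, written without the servlet/JSP API so that it runs on its own. ExpensiveResource is a hypothetical stand-in for objects like Configuration and OnlineClusterer, and init()/service() only mirror the roles of jspInit() and the per-request block; none of these names come from the actual patch.

```java
public class InitOnceDemo {
    /** Hypothetical stand-in for an object that is costly to construct. */
    static final class ExpensiveResource {
        static int created = 0;
        ExpensiveResource() { created++; }
        String handle(String query) { return "results for " + query; }
    }

    /** Before: the resource is rebuilt on the per-request path. */
    static final class PerRequestPage {
        String service(String query) {
            ExpensiveResource resource = new ExpensiveResource();
            return resource.handle(query);
        }
    }

    /** After: the resource is built once in init() and reused by every request. */
    static final class InitOncePage {
        private ExpensiveResource resource;
        void init() { resource = new ExpensiveResource(); }
        String service(String query) { return resource.handle(query); }
    }

    public static void main(String[] args) {
        PerRequestPage before = new PerRequestPage();
        for (int i = 0; i < 3; i++) before.service("q" + i);
        System.out.println("per-request: " + ExpensiveResource.created + " instances");

        ExpensiveResource.created = 0;
        InitOncePage after = new InitOncePage();
        after.init();
        for (int i = 0; i < 3; i++) after.service("q" + i);
        System.out.println("init-once: " + ExpensiveResource.created + " instance");
    }
}
```

One caveat with this design: an instance shared across requests has to be safe for concurrent use, which the sketch does not address.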




[jira] Updated: (NUTCH-545) Configuration and OnlineClusterer get initialized in every request.

2007-08-22 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-545:
--

Attachment: search.jsp.patch

Patch of search.jsp that moves initialization code to jspInit().

> Configuration and OnlineClusterer get initialized in every request.
> ---
>
> Key: NUTCH-545
> URL: https://issues.apache.org/jira/browse/NUTCH-545
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Dawid Weiss
> Attachments: search.jsp.patch
>
>
> The initialization code block in search.jsp is invoked in every request (it's 
> part of the request block). This is unnecessary and actually slows down the 
> request cycle -- Configuration and OnlineClusterer can (and should) be reused.
> The attached patch moves the initialization code to jspInit().




[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-544:
--

Attachment: (was: clustering-upgrade-2.1.patch)

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-544:
--

Attachment: clustering-upgrade-2.1.patch

Same patch, but I added an optional parameter that allows custom clustering 
processes to be used.

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-23 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522047
 ] 

Dawid Weiss commented on NUTCH-544:
---

This parameter is in the code. It is specific to the plugin, not the extension 
point, so I didn't add it to nutch-default.xml. I'll write up the 
configuration/process switching info on the Wiki -- I guess it makes more sense 
to have it there.

http://wiki.apache.org/nutch/ClusteringPlugin

Switching clustering algorithms isn't very intuitive because they come with 
their own JARs and Nutch's plugin system requires all JARs to be explicitly 
defined in the plugin's descriptor. I finally decided to go for a workaround: 
there is a default clustering algorithm embedded with the clustering plugin 
(Lingo). If another clustering process is to be used, all its required classes 
must be present on the classpath (for example by placing them in the 
container's shared classes). This worked quite well for me since you don't have 
to modify Nutch's WAR at all. As I said, I'll write a longer explanation of 
this on the Wiki.
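
The general shape of such a workaround might look like the sketch below: use an embedded default unless a class name is configured, and load a configured class reflectively from whatever is on the classpath. This is not the plugin's code; the method name, the parameters, and the choice to fail with an IllegalStateException are assumptions, and JDK collection classes stand in for clustering processes.

```java
public class ProcessLoaderDemo {
    /**
     * Instantiates the configured class if a name is given, otherwise the default.
     * Both names are hypothetical inputs; the class must have a no-arg constructor.
     */
    static Object instantiate(String configuredName, String defaultName) {
        String name = (configuredName == null || configuredName.isEmpty())
                ? defaultName
                : configuredName;
        try {
            // Resolved against the current classpath, e.g. a container's shared classes.
            return Class.forName(name).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Cannot load class '" + name + "'; is it on the classpath?", e);
        }
    }

    public static void main(String[] args) {
        // No override configured: the default is used.
        System.out.println(instantiate(null, "java.util.ArrayList").getClass().getName());
        // Override configured and present on the classpath.
        System.out.println(
                instantiate("java.util.LinkedList", "java.util.ArrayList").getClass().getName());
    }
}
```

Failing loudly when the configured class is missing is one possible design choice; silently falling back to the default would hide a misconfigured classpath.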

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-27 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522992
 ] 

Dawid Weiss commented on NUTCH-544:
---

Hey Doğacan, will you find a spare minute to commit this patch some time this 
week? Thanks a bunch.

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-27 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-544:
--

Attachment: clustering-upgrade-2.1.patch2

The same patch, one extra line of logging info added (specifying the clustering 
algorithm used).

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch2, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-27 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-544:
--

Attachment: (was: clustering-upgrade-2.1.patch)

> Upgrade Carrot2 clustering plugin to the newest stable release (2.1)
> 
>
> Key: NUTCH-544
> URL: https://issues.apache.org/jira/browse/NUTCH-544
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: clustering-upgrade-2.1.patch2, libs-packed.tar.gz
>
>
> This issue upgrades Carrot2 search results clustering plugin to the newest 
> stable version.




[jira] Created: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Dawid Weiss (JIRA)
Proper (?) handling of URIs in TagSoup.
---

 Key: NUTCH-567
 URL: https://issues.apache.org/jira/browse/NUTCH-567
 Project: Nutch
  Issue Type: Improvement
Reporter: Dawid Weiss
Priority: Minor
 Attachments: uri-entities.patch

Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
discussion on the list and at TagSoup's mailing list.

http://tech.groups.yahoo.com/group/tagsoup-friends/message/838

I looked at the sources of TagSoup because I'm using it myself (although the 
URIs are not relevant for me). It seems like you can implement a naive 
workaround by remembering the parsing state and just avoiding entity 
resolution. Attached is the patch that does this.




[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-567:
--

Attachment: uri-entities.patch

A patch against tagsoup-1.1.3 that hopefully fixes the entities-in-URIs 
problem; I didn't test it much.

You'll have to fix paths in the patch file to apply it locally.

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: uri-entities.patch
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-567:
--

Attachment: tagsoup-1.1.3-uripatched.jar 

Binary of tagsoup with the patch compiled in.

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-18 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535853
 ] 

Dawid Weiss commented on NUTCH-567:
---

Don't mention it. Happy birthday and I hope it'll work for you. If you take a 
look at the patch (source) you'll see it's really a trivial change to the 
source... I actually looked at how browsers handle such "illegal" URIs (because 
all URIs should have &amp; in them to separate parameters, not just an 
ampersand) and it seems they use some heuristics to determine what is an entity 
and what is not. Look at the test case -- it shows such a nasty situation. The 
current patch attempts to resolve URIs in a way similar to how my Firefox does 
it.

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-31 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539025
 ] 

Dawid Weiss commented on NUTCH-567:
---

Hi Doğacan. I have sent an e-mail to TagSoup's mailing list 
(http://tech.groups.yahoo.com/group/tagsoup-friends/), but it seems like the 
project has been inactive for some time. I guess we could patch TagSoup locally 
so that people can use it with Nutch. I didn't do any extensive tests though, 
so if Doug has done some testing this would be valuable. If this patch were to 
be integrated with Nutch I can prepare some more tests to cover border cases.

D.

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-31 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539162
 ] 

Dawid Weiss commented on NUTCH-567:
---

I agree. What we used to do in Carrot2 was to include the patch (against the 
original version of the sources) along with the recompiled binary. This way you 
had a record of what had been changed locally compared to the publicly 
available version.

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: tagsoup-1.1.3-uripatched.jar , uri-entities.patch
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541074
 ] 

Dawid Weiss commented on NUTCH-567:
---

I didn't put the feather because I wasn't sure about licensing; I'll look into 
it, but I have to leave in a minute -- it'll be tomorrow. If you want to go 
ahead with it, just check the license and, if it conforms to Apache's, 
re-submit it on your own. I'll do it tomorrow if it's not done by then.

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: README-tagsoup-patched.txt, tagsoup-1.1.3-uripatched.jar 
> , uri-entities.patch
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-567:
--

Attachment: (was: uri-entities.patch)

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: README-tagsoup-patched.txt
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-567:
--

Attachment: (was: tagsoup-1.1.3-uripatched.jar )

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: README-tagsoup-patched.txt
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-567:
--

Attachment: tagsoup-1.1.3-uripatched.jar

Attached is a patched version of TagSoup. TagSoup's Web site states that:

"TagSoup is free and Open Source software, licensed under the Academic Free 
License version 3.0, a cleaned-up and patent-safe BSD-style license which 
allows proprietary re-use."

I haven't found any information about incompatibilities between the Apache and 
AFL licenses.

> Proper (?) handling of URIs in TagSoup.
> ---
>
> Key: NUTCH-567
> URL: https://issues.apache.org/jira/browse/NUTCH-567
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: README-tagsoup-patched.txt, tagsoup-1.1.3-uripatched.jar
>
>
> Doug Cook reported that TagSoup incorrectly handles some URI parameters. More 
> discussion on the list and at TagSoup's mailing list.
> http://tech.groups.yahoo.com/group/tagsoup-friends/message/838
> I looked at the sources of TagSoup because I'm using it myself (although the 
> URIs are not relevant for me). It seems like you can implement a naive 
> workaround by remembering the parsing state and just avoiding entity 
> resolution. Attached is the patch that does this.




[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2008-01-05 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556261#action_12556261
 ] 

Dawid Weiss commented on NUTCH-567:
---

John Cowan apparently released a fixed version of TagSoup (1.2). This is good 
news for several reasons (quoting):

- As noted above, I have changed the license to Apache 2.0.

- The processing of entity references in attribute values has finally been 
fixed to do what browsers do. That is, a reference is only recognized if it is 
properly terminated by a semicolon;  otherwise it is treated as plain text. 
This means that URIs like "foo?cdown=32&cup=42" are no longer seen as 
containing an instance of the cup character.

I guess this issue is no longer applicable and an upgrade to the newer TagSoup 
would be appropriate.
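
A minimal sketch of the semicolon rule quoted above, for illustration only. This 
is not TagSoup's code: the class name, the method name and the tiny entity table 
are all assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not TagSoup's implementation.
final class EntityRuleSketch {
    // Tiny hand-picked table (an assumption); a real parser knows the full entity set.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("cup", "\u222A"); // the "cup" (union) character
    }

    /** Decodes named entity references, but only those terminated by a semicolon. */
    static String decodeAttribute(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                // Scan the candidate entity name that follows the ampersand.
                int end = i + 1;
                while (end < value.length() && Character.isLetterOrDigit(value.charAt(end))) {
                    end++;
                }
                boolean terminated = end < value.length() && value.charAt(end) == ';';
                String replacement = terminated ? ENTITIES.get(value.substring(i + 1, end)) : null;
                if (replacement != null) {
                    out.append(replacement);
                    i = end + 1; // skip past the semicolon
                    continue;
                }
            }
            // Not a properly terminated, known reference: keep the character as plain text.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeAttribute("foo?cdown=32&cup=42"));
        System.out.println(decodeAttribute("a &lt; b &amp; c"));
    }
}
```

With this rule the URI "foo?cdown=32&cup=42" comes back unchanged, because 
"&cup" is followed by "=" rather than ";", while "&lt;" and "&amp;" are still 
decoded.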




[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-02-05 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830051#action_12830051
 ] 

Dawid Weiss commented on NUTCH-673:
---

Hi guys. I'd be willing to proceed with this and upgrade to the Carrot2 3.x 
line. The first issue I encountered is the Lucene incompatibility between 2.9 
(currently in Nutch) and 3.0 (currently in Carrot2). Are there any plans to 
upgrade to Lucene 3.0, or reasons not to? It's been with us for quite a while. 
If there are no objections, I can prepare a patch replacing Lucene 2.9 with 
Lucene 3.0 (as a separate issue).

> Upgrade the Carrot2 plug-in to release 3.0
> --
>
> Key: NUTCH-673
> URL: https://issues.apache.org/jira/browse/NUTCH-673
> Project: Nutch
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: 0.9.0
> Environment: All Nutch deployments.
>Reporter: Sean Dean
>Priority: Minor
> Fix For: 1.1
>
>
> Release 3.0 of the Carrot2 plug-in was released recently.
> We currently have version 2.1 in the source tree and upgrading it to the 
> latest version before 1.0-release might make sense.
> Details on the release can be found here: 
> http://project.carrot2.org/release-3.0-notes.html
> One major change in requirements is for JDK 1.5 to be used, but this is also 
> now required for Hadoop 0.19 so this wouldn't be the only reason for the 
> switch.




[jira] Created: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-05 Thread Dawid Weiss (JIRA)
Upgrade Lucene to 3.0.0.


 Key: NUTCH-787
 URL: https://issues.apache.org/jira/browse/NUTCH-787
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Dawid Weiss
Priority: Trivial







[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-02-05 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830078#action_12830078
 ] 

Dawid Weiss commented on NUTCH-673:
---

O.K., I'll look into the complexity of upgrading to Lucene 3.0 first, then. 
Filing a separate issue.




[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-05 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830085#action_12830085
 ] 

Dawid Weiss commented on NUTCH-787:
---

Just did an initial check -- this should be doable, although it will result in 
a sizeable patch due to API changes and removed deprecations. I think it still 
makes sense to try and push the 3.0 version of Lucene into Nutch, so I will 
keep working on this and seek help in reviewing the patch (and incompatible 
changes) once it's ready.

> Upgrade Lucene to 3.0.0.
> 
>
> Key: NUTCH-787
> URL: https://issues.apache.org/jira/browse/NUTCH-787
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Dawid Weiss
>Priority: Trivial
>





[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-06 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-787:
--

Attachment: NUTCH-787.patch

Text-patch of changes porting the code to Lucene 3.0.0.

> Upgrade Lucene to 3.0.0.
> 
>
> Key: NUTCH-787
> URL: https://issues.apache.org/jira/browse/NUTCH-787
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Dawid Weiss
>Priority: Trivial
> Attachments: NUTCH-787.patch
>
>





[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-06 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830534#action_12830534
 ] 

Dawid Weiss commented on NUTCH-787:
---

Definitely not an easy thing to do. I need to finish for today; the code 
compiles, and here's a brief summary of the changes:

- modified all filters and streams to use token attributes instead of raw 
Tokens. In many places I tried to be least intrusive so that the patch can be 
easily reviewed and accepted; improvements resulting from the new API can 
follow,

- replaced deprecated constants with their new equivalents (UN_TOKENIZED, etc.),

- there are no compressed fields any more, so this stuff is commented out.

It would be great if as many people with Lucene/Nutch knowledge as possible 
could go through the patch and point out potential problems. At the moment one 
core test fails for me -- TestIndexSorter. I don't know whether the difference 
in boosts is a result of Lucene changes or of a bug I introduced somewhere 
along the way.






[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830900#action_12830900
 ] 

Dawid Weiss commented on NUTCH-787:
---

The failing test in TestIndexSorter is caused by a change of implementation 
inside Lucene. In Lucene 2.9, SegmentMerger calls IndexReader#document(int, 
FieldSelector), but in 3.0 this has been changed to a call to document(int):

Document doc = reader.document(docCount);

Now, IndexSorter in Nutch overrides both methods and delegates to the 
superclass (IndexReader) with a mapping from old IDs to new IDs, but 
IndexReader re-delegates back to the overridden method, so the mapping is 
applied twice and IDs are effectively remapped back to their original values.
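
The effect described above can be reproduced with a simplified model. This is a 
sketch under stated assumptions, not the Lucene or Nutch code: all class names 
are hypothetical, documents are plain strings, and the base class is assumed to 
forward its one-argument method to the two-argument one.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the delegation problem; not the Lucene or Nutch classes.
final class DoubleRemapSketch {

    /** Stand-in for the superclass: the one-argument method forwards to the two-argument one. */
    static class BaseReader {
        private final List<String> docs;

        BaseReader(List<String> docs) { this.docs = docs; }

        String document(int n) {
            return document(n, null); // virtual call: dispatches to a subclass override
        }

        String document(int n, Object selector) {
            return docs.get(n);
        }
    }

    /** Overrides both methods and remaps in each, so document(int) remaps twice. */
    static class BuggySortingReader extends BaseReader {
        private final int[] newToOld;

        BuggySortingReader(List<String> docs, int[] newToOld) {
            super(docs);
            this.newToOld = newToOld;
        }

        @Override String document(int n) {
            return super.document(newToOld[n]);
        }

        @Override String document(int n, Object selector) {
            return super.document(newToOld[n], selector);
        }
    }

    /** One way to avoid the problem in this model: remap only in the two-argument method. */
    static class FixedSortingReader extends BaseReader {
        private final int[] newToOld;

        FixedSortingReader(List<String> docs, int[] newToOld) {
            super(docs);
            this.newToOld = newToOld;
        }

        @Override String document(int n, Object selector) {
            return super.document(newToOld[n], selector);
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c");
        int[] reversed = {2, 1, 0}; // new id -> old id
        System.out.println(new BuggySortingReader(docs, reversed).document(0));
        System.out.println(new FixedSortingReader(docs, reversed).document(0));
    }
}
```

With the reversed mapping, the buggy reader returns "a" for new id 0 (the id is 
mapped to 2 and then back to 0), while the fixed variant returns "c" as 
intended.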





[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-787:
--

Attachment: (was: NUTCH-787.patch)




[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated NUTCH-787:
--

Attachment: NUTCH-787.patch

This patch moves Nutch from Lucene 2.9.1 to Lucene 3.0.0. All tests pass. The 
patch does not contain binary files (Lucene JARs); these changes should be 
applied manually:

D   src/plugin/summary-lucene/lib/lucene-highlighter-2.9.1.jar
A   src/plugin/summary-lucene/lib/lucene-highlighter-3.0.0.jar
D   src/plugin/lib-lucene-analyzers/lib/lucene-analyzers-2.9.1.jar
A   src/plugin/lib-lucene-analyzers/lib/lucene-analyzers-3.0.0.jar
D   lib/lucene-misc-2.9.1.jar
A   lib/lucene-core-3.0.0.jar
D   lib/lucene-core-2.9.1.jar
A   lib/lucene-misc-3.0.0.jar





[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830902#action_12830902
 ] 

Dawid Weiss commented on NUTCH-787:
---

O.K. I think this is ready for review, testing, and integration. All built-in 
tests pass; it would be good if people could test it against their indexes.




[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-03-17 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846434#action_12846434
 ] 

Dawid Weiss commented on NUTCH-787:
---

I'll be happy to help if I can. I admit I only ran the build tests -- some 
empirical crawls and other types of jobs would be more than desirable, but I 
don't have the infrastructure to do it.

> Upgrade Lucene to 3.0.0.
> 
>
> Key: NUTCH-787
> URL: https://issues.apache.org/jira/browse/NUTCH-787
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Dawid Weiss
>Priority: Trivial
> Fix For: 1.1
>
> Attachments: NUTCH-787.patch
>
>





[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.1.

2010-03-19 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847325#action_12847325
 ] 

Dawid Weiss commented on NUTCH-787:
---

Thanks Andrzej.

> Upgrade Lucene to 3.0.1.
> 
>
> Key: NUTCH-787
> URL: https://issues.apache.org/jira/browse/NUTCH-787
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Dawid Weiss
>Assignee: Andrzej Bialecki 
>Priority: Trivial
> Fix For: 1.1
>
> Attachments: NUTCH-787.patch
>
>

