[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.1.

2010-03-19 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847325#action_12847325 ] Dawid Weiss commented on NUTCH-787: --- Thanks Andrzej. > Upgrade Lucene t

[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-03-17 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846434#action_12846434 ] Dawid Weiss commented on NUTCH-787: --- I'll be happy to help if I can. I admit I

[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830902#action_12830902 ] Dawid Weiss commented on NUTCH-787: --- O.K. I think this is ready for review/ testing

[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-787: -- Attachment: NUTCH-787.patch This patch moves Nutch from Lucene 2.9.1 to Lucene 3.0.0. All tests pass

[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-787: -- Attachment: (was: NUTCH-787.patch) > Upgrade Lucene to 3.

[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830900#action_12830900 ] Dawid Weiss commented on NUTCH-787: --- The failing test in TestIndexSorter is caused by

[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-06 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830534#action_12830534 ] Dawid Weiss commented on NUTCH-787: --- Definitely not an easy thing to do. I need to fi

[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-06 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-787: -- Attachment: NUTCH-787.patch Text-patch of changes porting the code to Lucene 3.0.0. > Upgrade Luc

[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-05 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830085#action_12830085 ] Dawid Weiss commented on NUTCH-787: --- Just did an initial check -- this should be do

[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-02-05 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830078#action_12830078 ] Dawid Weiss commented on NUTCH-673: --- O.K., I'll look into the complexity of upg

[jira] Created: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-05 Thread Dawid Weiss (JIRA)
Upgrade Lucene to 3.0.0. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Priority

[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-02-05 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830051#action_12830051 ] Dawid Weiss commented on NUTCH-673: --- Hi guys. I'd be willing to proceed with

Re: Nutch crawled results for Clustering with Carrot2

2009-05-07 Thread Dawid Weiss
Gaurang, You can fetch documents from Nutch indexes (which are Lucene indexes) and then feed them to the clustering algorithm directly, as explained in Carrot2 examples here: http://download.carrot2.org/head/manual/index.html#section.integration There are several examples you can choose to

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2008-01-05 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556261#action_12556261 ] Dawid Weiss commented on NUTCH-567: --- John Cowan apparently released a fixed versio

[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: tagsoup-1.1.3-uripatched.jar Attached is a patched version of tagsoup. The Tagsoup'

[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: (was: tagsoup-1.1.3-uripatched.jar ) > Proper (?) handling of URIs in TagS

[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: (was: uri-entities.patch) > Proper (?) handling of URIs in TagS

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-11-08 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541074 ] Dawid Weiss commented on NUTCH-567: --- I didn't put the feather because I wasn't sure about licensing; I

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-31 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539162 ] Dawid Weiss commented on NUTCH-567: --- I agree. What we used to do in Carrot2 was to include the patch (against the

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-31 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539025 ] Dawid Weiss commented on NUTCH-567: --- Hi Doğacan. I have sent an e-mail to Tagsoup's mailing list, but it seems

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-18 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535853 ] Dawid Weiss commented on NUTCH-567: --- Don't mention it. Happy birthday and I hope it'll work for you. If

Re: Anyone looked for a better HTML parser?

2007-10-17 Thread Dawid Weiss
I looked at TagSoup sources and it seems it could be quite easily fixed. See here: https://issues.apache.org/jira/browse/NUTCH-567 D.

[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: tagsoup-1.1.3-uripatched.jar Binary of tagsoup with the patch compiled in. > Pro

[jira] Updated: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-567: -- Attachment: uri-entities.patch A patch against tagsoup-1.1.3 fixing the entities-in-URIs problem

[jira] Created: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Dawid Weiss (JIRA)
Proper (?) handling of URIs in TagSoup. --- Key: NUTCH-567 URL: https://issues.apache.org/jira/browse/NUTCH-567 Project: Nutch Issue Type: Improvement Reporter: Dawid Weiss

[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-27 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: (was: clustering-upgrade-2.1.patch) > Upgrade Carrot2 clustering plugin to the new

[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-27 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: clustering-upgrade-2.1.patch2 The same patch, one extra line of logging info added

[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-27 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522992 ] Dawid Weiss commented on NUTCH-544: --- Hey, Doğacan will you find a spare minute to commit this patch some time this

[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-23 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522047 ] Dawid Weiss commented on NUTCH-544: --- This parameter is in the code. It is specific to the plugin, not the extension

Clustering patches ready for review/ commit.

2007-08-22 Thread Dawid Weiss
Hi guys (Doğacan? :), I finalized the upgrade of Carrot2 libraries and a minor bug fix to the Web application. Both issues should be pretty straightforward; if anyone finds 5 spare minutes to review and commit these patches I'd appreciate it. https://issues.apache.org/jira/browse/NUTCH-544 http

[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: clustering-upgrade-2.1.patch Same patch, but I added an optional parameter that allows

[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: (was: clustering-upgrade-2.1.patch) > Upgrade Carrot2 clustering plugin to the new

[jira] Updated: (NUTCH-545) Configuration and OnlineClusterer get initialized in every request.

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-545: -- Attachment: search.jsp.patch Patch of search.jsp that moves initialization code to jspInit

[jira] Created: (NUTCH-545) Configuration and OnlineClusterer get initialized in every request.

2007-08-22 Thread Dawid Weiss (JIRA)
Components: web gui Reporter: Dawid Weiss The initialization code block in search.jsp is invoked in every request (it's part of the request block). This is unnecessary and actually slows down the request cycle -- Configuration and OnlineClusterer can (and should) be reused.
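
The point of this issue (set up reusable objects once rather than inside the request block) can be sketched in plain Java. This is only an illustration under stated assumptions: `SearchPageSketch`, `ExpensiveResource`, `init()` and `service()` are hypothetical names standing in for the JSP page, for objects such as Configuration or OnlineClusterer, and for the jspInit / per-request split. None of this is Nutch code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical page-like component with a one-time init hook and a
// per-request method, mirroring the jspInit / request-block split.
class SearchPageSketch {
    private ExpensiveResource resource;

    // Called once before any request is served.
    void init() {
        resource = new ExpensiveResource();
    }

    // Called for every request; reuses the shared instance.
    String service(String query) {
        return resource.handle(query);
    }

    public static void main(String[] args) {
        SearchPageSketch page = new SearchPageSketch();
        page.init();
        for (int i = 0; i < 3; i++) {
            System.out.println(page.service("query " + i));
        }
        System.out.println("instances created: " + ExpensiveResource.CREATED.get());
    }
}

// Hypothetical stand-in for an object that is costly to build but can be
// reused across requests.
class ExpensiveResource {
    static final AtomicInteger CREATED = new AtomicInteger();

    ExpensiveResource() {
        CREATED.incrementAndGet(); // count how often the setup cost is paid
    }

    String handle(String query) {
        return "results for " + query;
    }
}
```

One caveat of this design: an instance shared across requests may be used by several threads at once, so it needs to be safe for concurrent use.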

[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521843 ] Dawid Weiss commented on NUTCH-544: --- Not exactly; the initialization issue is still present, but I'll c

[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521842 ] Dawid Weiss commented on NUTCH-544: --- Ok, this patch does the following: - upgrades Carrot2 libs to 2.1 (the most

[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: libs-packed.tar.gz lib folder (binary files to be replaced). > Upgrade Carrot2 cluster

[jira] Updated: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated NUTCH-544: -- Attachment: clustering-upgrade-2.1.patch svn diff of the patch. Binary files are not included (is there

[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521792 ] Dawid Weiss commented on NUTCH-544: --- Doğacan, would it be a problem if we threw in BeanShell and Dom4j JARs? We

[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521791 ] Dawid Weiss commented on NUTCH-544: --- Yes, absolutely -- it's actually my fault I didn't notice t

[jira] Commented: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521784 ] Dawid Weiss commented on NUTCH-544: --- I've started working on this -- will send a patch for revision soon (t

[jira] Created: (NUTCH-544) Upgrade Carrot2 clustering plugin to the newest stable release (2.1)

2007-08-22 Thread Dawid Weiss (JIRA)
: Improvement Reporter: Dawid Weiss Priority: Minor This issue upgrades Carrot2 search results clustering plugin to the newest stable version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0

2006-11-15 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-397?page=comments#action_12450146 ] Dawid Weiss commented on NUTCH-397: --- I'll review this patch and commit all the necessary code as soon as possible (it may be around the end of the week t

Re: Patch: deflate encoding

2006-08-07 Thread Dawid Weiss
svn.sourceforge.net/svnroot/carrot2/trunk/carrot2/components/carrot2-util-gzip/ Dawid Dawid Weiss wrote: I believe both deflate and gzip (as well as zip) are included as servlet filters in: http://sourceforge.net/projects/pjl-comp-filter/ Dawid Pascal Beis wrote: Hi all, I've added su

Re: Patch: deflate encoding

2006-08-07 Thread Dawid Weiss
I believe both deflate and gzip (as well as zip) are included as servlet filters in: http://sourceforge.net/projects/pjl-comp-filter/ Dawid Pascal Beis wrote: Hi all, I've added support for deflate encoding (next to gzip) to nutch. Is there interest in including this in the main source repo

[jira] Commented: (NUTCH-300) Clustering API improvements

2006-07-07 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419708 ] Dawid Weiss commented on NUTCH-300: --- Hi. I just took a look at it -- I don't see anything wrong with the code and Andrzej has used Carrot2 before. We're u

[jira] Commented: (NUTCH-309) Uses commons logging Code Guards

2006-06-28 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12418396 ] Dawid Weiss commented on NUTCH-309: --- Painful job, Jerome, but in most cases (non-critical loops) the gain will not be significant and proliferating if statements makes the
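
The trade-off discussed here can be sketched as follows. The issue concerns commons-logging guards such as `isDebugEnabled()`; to stay self-contained, this sketch swaps in `java.util.logging` and its `isLoggable(Level)` check instead. `LogGuardSketch`, `describe()` and the call counter are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

class LogGuardSketch {
    static final Logger LOG = Logger.getLogger("LogGuardSketch");
    static int describeCalls = 0;

    // Stand-in for a costly message computation.
    static String describe(int[] values) {
        describeCalls++;
        return Arrays.toString(values);
    }

    // Without a guard the message is built even if FINE is disabled.
    static void unguarded(int[] values) {
        LOG.fine("values: " + describe(values));
    }

    // With a guard the message is built only when FINE is enabled,
    // at the price of an extra if statement around the call.
    static void guarded(int[] values) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("values: " + describe(values));
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINE messages are disabled
        guarded(new int[] {1, 2, 3});
        System.out.println("describe calls after guarded: " + describeCalls);
        unguarded(new int[] {1, 2, 3});
        System.out.println("describe calls after unguarded: " + describeCalls);
    }
}
```

The guard only pays off when building the message is expensive or the call sits in a hot loop; elsewhere it mostly adds if statements, which is the concern raised in the comment.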

[jira] Commented: (NUTCH-294) Topic-maps of related searchwords

2006-06-07 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12415094 ] Dawid Weiss commented on NUTCH-294: --- Well, you certainly have something wrong in your configuration then. I just tried with the head revision. My nutch-site looks like this

[jira] Commented: (NUTCH-294) Topic-maps of related searchwords

2006-06-06 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414960 ] Dawid Weiss commented on NUTCH-294: --- Ehm, sorry I'm so late with this -- tons of work. 1) Stefan, if you can't get it working, speak up what is not working (

[jira] Commented: (NUTCH-265) Getting Clustered results in better form.

2006-05-24 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413220 ] Dawid Weiss commented on NUTCH-265: --- If you just mean the user interface, then you can simply take the XSLT stylesheet from Carrot2 and reuse it in Nutch with the opensearch

[jira] Commented: (NUTCH-265) Getting Clustered results in better form.

2006-05-23 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413072 ] Dawid Weiss commented on NUTCH-265: --- Chris, the current clusterer in Nutch _does_ discover phrases for clusters, so I don't know what you really mean. Did you take a lo

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-12 Thread Dawid Weiss
Yes, this should be definitely mentioned somewhere (in the documentation :) At least we left a track on the mailing list so it'll be possible to refer to it. D. Jérôme Charron wrote: You're right -- changing anything with the input (snippets length, number of documents etc) will alter the c

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-12 Thread Dawid Weiss
Hi Jerome, Yes Dawid, but it is already committed => the clustering now uses the plain text version returned by the toString() method. Ugh, yes, sorry about that, it uses Summary.toStrings(summaries) to be specific and that uses toString internally. Actually, the clustering uses the summa

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Dawid Weiss
The reason is that they should not use the same HTML code : 1. OpenSearch should only use around highlights 2. search.jsp should use some more complicated HTML code () Add 3. Clustering would benefit from a plain text version. D.

[jira] Commented: (NUTCH-265) Getting Clustered results in better form.

2006-05-08 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ] Dawid Weiss commented on NUTCH-265: --- The clustering interface is very simple in Nutch because it usually needs to be adjusted to the needs of a particular application

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ] Dawid Weiss commented on NUTCH-134: --- (back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a plain-text-only summarizer is ideal for clusterin

Re: [Proposal] New Lucene sub-project

2006-04-24 Thread Dawid Weiss
I also think it makes sense -- we use a language identifier component in Carrot2 and we'd love to just have a single library for this functionality. As always, some extra managerial effort is unfortunately needed to drive a stand-alone project. D. Chris Mattmann wrote: Hi Otis, This thread s

Re: [ot] binary subversion diffs

2006-04-13 Thread Dawid Weiss
Subversion basically uses plain diff so I believe what you ask for isn't possible. But if somebody knows otherwise I'd also appreciate a note. D. Stefan Groschupf wrote: Hi, does anybody know how to do svn diffs that contain binary content, like jars or images? I was not able to find any

Re: 0.8 release?

2006-04-13 Thread Dawid Weiss
sure I have not applied it wrongly (I think it is correct but I did it so many times that I want to cross check). Regards Piotr Dawid Weiss wrote: What kind of problems? If you need something, let me know. D. Piotr Kosiorowski wrote: I got some problems while applying Dawid clustering patch

Re: 0.8 release?

2006-04-13 Thread Dawid Weiss
What kind of problems? If you need something, let me know. D. Piotr Kosiorowski wrote: I got some problems while applying Dawid clustering patch (my linux environment looks not to be set up correctly) - but I switched to cygwin and it looks ok. I will try to commit it today/tomorrow. Regards Piot

Re: PMD integration

2006-04-07 Thread Dawid Weiss
I do agree with Jérôme - plugins should be checked too. This basically means modifying the fileset in the pmd task. Shouldn't be too difficult to include all plugin sources with a single statement. I will make it a totally separate target (so tests do not depend on it). That was actually

Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Dawid Weiss
Could we have the clustering patch applied before the 0.8.0 release? I know you're way busy with other things, Andrzej, maybe you'll forward it to somebody else? It shouldn't be a difficult patch to review and apply. D. Doug Cutting wrote: TDLN wrote: I mean, how do others keep uptodate wi

Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-07 Thread Dawid Weiss
My feeling was simply that the closer we are to Nutch-1.0, the more we need some Q&A metrics (for us and for nutch users). No? I absolutely agree Jérôme, really. It's just that developers usually tend to hook up dozens of Q&A plugins and never look at what they output (that's the usual scen

Re: PMD integration

2006-04-07 Thread Dawid Weiss
ed rules (in another target or even in the same one). That's again up to you guys. Dawid P.S. Tom Copeland has already fixed the bug I mentioned in the patch. Quite impressive bugfix turnaround, isn't it. :) Piotr Kosiorowski wrote: P. Dawid Weiss wrote: All right, I thoug

Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-06 Thread Dawid Weiss
's perfect. https://sourceforge.net/tracker/?func=detail&atid=479921&aid=1465574&group_id=56262 D. Piotr Kosiorowski wrote: +1 - I offer my help - we can coordinate it and I can do a part of work. I will also try to commit your patches quickly. Piotr On 4/6/06, Dawid Weiss <[EMA

Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-06 Thread Dawid Weiss
> Other options (raised on the Hadoop list) are Checkstyle: PMD seems to be the best choice for an Apache project and they all seem to perform at a similar level. Anything that generates a lot of false positives is bad: it either causes us to skip analysis of lots of files, or ignore the war

Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss
I'm a fan of automated testing and code analysis utilities, but I must say they only make sense if people actually use them and look at their results. So it's not really just about integration -- it's about looking at the results of these tools. PMD is neat because it can simply interrupt you

Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss
Ok, PMD seems like a good idea. I've added it to the build file. Unused code detection shows a few catches (javacc-generated classes need to be ignored because they contain a lot of junk), but unfortunately it also displays false positives such as in: MapWritable.java 429 {Avoid unused p

Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-05 Thread Dawid Weiss
One can presumably disable such minor warnings in Eclipse. Arguably the bug is that Eclipse warns about such things by default, rather than in a 'pedantic' mode. I agree -- some of them are really annoying. Plus, Eclipse has been having notorious problems showing warnings for unused paramet

Re: Search quality evaluation

2006-04-05 Thread Dawid Weiss
In any case, it includes a system to scrape search results from other engines, based on Apple's Sherlock search-engine descriptors. These descriptors are also used by Mozilla: Just a note: we used to have exactly the same mechanism in Carrot2. Unfortunately this format does not make a clea

Re: Search quality evaluation

2006-04-05 Thread Dawid Weiss
I can help by reusing input components from Carrot2 -- they give access to Google (via GoogleAPI), Yahoo (via their REST API) and Nutch (via OpenSearch). Somebody would need to put together the rest of the evaluation framework though :) D. Andrzej Bialecki wrote: Hi, I found this paper, m

Re: Add ".settings" to svn:ignore on root Nutch folder?

2006-04-04 Thread Dawid Weiss
It works fine Doug, thanks. Please tell me if it is correct, since I don't use Eclipse. I'm at the vi (or rather vim) level very often, but emacs is still ahead of me ;) And on a more serious note, Eclipse shows a good few warnings in the present codebase. They are usually minor things like

[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-04-04 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ] Dawid Weiss updated NUTCH-237: -- Attachment: NUTCH-237.DWEISS.patch.zip Hi Andrzej. The ZIP file contains a patch and svn stat with the improved code: - The primary language for hits without

Add ".settings" to svn:ignore on root Nutch folder?

2006-04-04 Thread Dawid Weiss
Would it be a problem to add Eclipse's ".settings" folder to ignored files (since Eclipse project files are already there anyway)? This folder is used when one wants to override default project configuration (code formatting, specific JVM etc). Dawid

[jira] Commented: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-24 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-237?page=comments#action_12371687 ] Dawid Weiss commented on NUTCH-237: --- Yes and no. I removed the "support" for foreign languages from the constructor code: // We initialize Lingo wi

Carrot2 upgrade patch

2006-03-23 Thread Dawid Weiss
Hi, This issue: http://issues.apache.org/jira/browse/NUTCH-237 contains an upgrade of Carrot2 libraries to the newest codebase and a few minor editing operations on the plugin sources. Please review and commit (not urgent, Andrzej :). Thanks, Dawid

[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-23 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ] Dawid Weiss updated NUTCH-237: -- Attachment: libs.zip Libraries that need to be replaced. > Carrot2 clustering plugin upgrade. > -- > > Ke

[jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-23 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ] Dawid Weiss updated NUTCH-237: -- Attachment: c2.patch svn-stat.txt Note the two deleted files (I attached the result of svn stat). I didn't know how to include this info i

[jira] Created: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-23 Thread Dawid Weiss (JIRA)
Carrot2 clustering plugin upgrade. -- Key: NUTCH-237 URL: http://issues.apache.org/jira/browse/NUTCH-237 Project: Nutch Type: Improvement Reporter: Dawid Weiss Priority: Trivial This is an upgrade of the clustering plugin to

[jira] Updated: (NUTCH-234) Clustering extension code cleanups and a real JUnit test case for the current implementation.

2006-03-17 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-234?page=all ] Dawid Weiss updated NUTCH-234: -- Attachment: patch.diff The patch adding: - a JUnit test case to the clustering extension, - minor code cleanups - adds ".settings" file to svn:ignore o

[jira] Created: (NUTCH-234) Clustering extension code cleanups and a real JUnit test case for the current implementation.

2006-03-17 Thread Dawid Weiss (JIRA)
Type: Test Reporter: Dawid Weiss Priority: Minor I've cleaned up the code a bit and added a real test case for the clustering extension. This is in preparation for upgrading to the most recent Carrot2 codebase and I didn't want to mix these two patches together. I'

Re: Much faster RegExp lib needed in nutch?

2006-03-14 Thread Dawid Weiss
The probability of encountering a $ sign somewhere inside URL is not insignificant... I agree that it's very unlikely (perhaps even illegal) to use ^ in URLs, but $ are sometimes used. I'd have to take a look at the spec, but I think both characters should be URL-encoded anyway. Maybe it'd b
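
As a small illustration of what percent-encoding does to these two characters, here is a sketch using the JDK's `java.net.URLEncoder`. Two assumptions: Java 10 or later (for the `encode(String, Charset)` overload), and form-style encoding, which is what `URLEncoder` implements and which is not identical to generic URI escaping. `UrlEscapeSketch` and `escape()` are hypothetical names.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class UrlEscapeSketch {
    // Percent-encodes everything URLEncoder considers unsafe,
    // which includes both '$' and '^'.
    static String escape(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '$' is 0x24 and '^' is 0x5E, so this prints a%24b%5Ec
        System.out.println(escape("a$b^c"));
    }
}
```

Once encoded, neither character appears literally in the string, so a regular expression applied to the encoded form would not need to treat them specially.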

Re: quality of search text

2006-03-12 Thread Dawid Weiss
Hmm... I'm not convinced. How would you generate the best snippet from a relevant, but ignored chunk? Good point... I guess you simply wouldn't generate anything at all (show the title?). I guess structure text should not be relevant enough to actually cause a hit on top of the search result

[jira] Updated: (NUTCH-228) Clustering plugin descriptor broken (fix included)

2006-03-12 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ] Dawid Weiss updated NUTCH-228: -- Attachment: clustering.patch This patch fixed the plugin descriptor and a typo in cluster.jsp that caused the wrong number of milliseconds to be dumped in the output

[jira] Created: (NUTCH-228) Clustering plugin descriptor broken (fix included)

2006-03-12 Thread Dawid Weiss (JIRA)
Clustering plugin descriptor broken (fix included) -- Key: NUTCH-228 URL: http://issues.apache.org/jira/browse/NUTCH-228 Project: Nutch Type: Bug Reporter: Dawid Weiss Priority: Minor The plugin descriptor

Issue can be closed.

2006-03-11 Thread Dawid Weiss
I see this issue: https://issues.apache.org/jira/browse/NUTCH-217 is no longer relevant (a patch has been applied in the trunk). I added a note about it, somebody with more privileges needs to close it when time permits. D.

Re: quality of search text

2006-03-11 Thread Dawid Weiss
It seems to me that there are two separate problems: 1) content parsing to avoid site structure -> influences the index and rankings 2) content parsing for KWIC snippet generation -> influences the user perception of the engine. I'd agree that (2) is quite important for the end user; Richard

[jira] Created: (NUTCH-217) InstantiationException when deserializing Query (no parameterless constructor)

2006-02-26 Thread Dawid Weiss (JIRA)
: searcher Versions: 0.8-dev Reporter: Dawid Weiss I've been playing with the trunk. The distributed searcher complains with an instantiation exception when deserializing Query. A quick code inspection shows that Query doesn't have any parameterless constructor. -- This
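
Why a missing parameterless constructor breaks reflective deserialization can be sketched in plain Java. The classes below are hypothetical stand-ins, not Nutch's Query. The sketch uses `getDeclaredConstructor()`, which reports the problem as `NoSuchMethodException`, whereas the older `Class.newInstance()` reports it as `InstantiationException`, the exception named in the report. Both are subclasses of `ReflectiveOperationException`.

```java
import java.lang.reflect.Constructor;

class NoArgConstructorSketch {
    // Hypothetical class that can only be built with an argument.
    static class WithoutDefault {
        final String text;

        WithoutDefault(String text) {
            this.text = text;
        }
    }

    // Hypothetical class with an explicit parameterless constructor.
    static class WithDefault {
        String text = "";

        WithDefault() {
        }
    }

    // Tries to create an empty instance, as a reflective deserializer
    // would before filling in the fields from the input stream.
    static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor();
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("WithDefault: " + canInstantiate(WithDefault.class));
        System.out.println("WithoutDefault: " + canInstantiate(WithoutDefault.class));
    }
}
```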

Re: duplicate libs

2006-02-16 Thread Dawid Weiss
I just wanted to say we've gone through such problems already in Carrot2 -- many modules depend on each other, some of them have custom build steps. A pure ANT solution is likely to be quite ugly... But back to the point: you can test for existence of a plugin-specific build file and execute

Re: duplicate libs

2006-02-14 Thread Dawid Weiss
Yes, there is an easier way. Implement a custom task to which you'll pass a path to plugin.xml and a name for a path. The task (Java code) will create a named (id) object which can be subsequently used in ant with . This requires a custom ant task, but as you mentioned foreach is also a se

Re: duplicate libs

2006-02-14 Thread Dawid Weiss
log4j-1.2.11.jar src/plugin/clustering-carrot2/lib log4j-1.2.6.jar 1 src/plugin/parse-rss/lib log4j-1.2.9.jar src/plugin/parse-pdf/lib nekohtml-0.9.2.jar src/plugin/clustering-carrot2/lib nekohtml-0.9.4.jar src/plu

Re: Carrot2 v. 1.0.1. [clustering plugin]

2006-02-03 Thread Dawid Weiss
Definitely there is interest! Let's hear all the voices though :) If the interfaces in carrot2 don't change too much, there is not so much work with the adapters, they are quite simple after all. You are right -- they don't change a lot on Carrot2 side. I was concerned mostly with the Nu

Carrot2 v. 1.0.1. [clustering plugin]

2006-02-03 Thread Dawid Weiss
Hi there, We've been quite busy with putting things together at Carrot2. Version 1.0.1 is out -- it is a stable release with a few tweaks and tunings that appeared after 1.0. We also have a Web site ;) http://www.carrot2.org So... I think it's time for reintegrating that code into Nutch cl

Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-12 Thread Dawid Weiss
u try this, self-indulging, query (with filtering enabled): http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=active&start=900 You get: "Results 781 - 782 of about 61,700" Now try disabling filtering: http://www.g

Re: [C2-devel] about the question of clustering-carrot2

2005-12-09 Thread Dawid Weiss
Hi Charlie, Don't cross-post to two lists at once. The question you asked is relevant to C2, not Nutch, so I'll reply to it there. Dawid charlie wrote: Dear all, Currently I’m using the Nutch plug-in “clustering-carrot2” and would like to ask for some help. When I built the search resu

[Slightly off topic] A search interface for the next generation?

2005-11-17 Thread Dawid Weiss
Check this out, guys, I thought some of you might find it amusing: http://www.mex-search.com/ The "full" option gives you an "agent-based search engine". The usability might be questioned (long animations), but it certainly gives strong first impression :) Have fun. Dawid

[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-20 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332559 ] Dawid Weiss commented on NUTCH-82: -- I personally disagree that Perl is a better alternative to Cygwin... Most people familiar with Unix/Windows development will have no problems

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Dawid Weiss
Yes but (I think -- I haven't confirmed) this basic escaping is being done by the DOM streaming. It at least is converting characters like 0xC to . I'd have to look at the code and see how the XML is serialized... Most DOM streaming classes will encode entities somehow, so you shouldn't wo

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Dawid Weiss
> The differences between this method and the patch supplied in NUTCH-110 > are: Take a closer look at the source code -- 1. XMLSerializerHelper#toValidXmlText throws an exception when an invalid character is encountered whereas NUTCH-110 just drops it. Not really, it is governed by a boolean flag. If t
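The two behaviours discussed in this thread (dropping an invalid character versus throwing on it, selected by a boolean flag) can be illustrated with a small sketch. This is not the code of XMLSerializerHelper or of the NUTCH-110 patch: `XmlTextFilter`, `clean` and `failOnInvalid` are hypothetical names, and the validity check simply follows the `Char` production of the XML 1.0 specification.

```java
public class XmlTextFilter {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the input without characters that XML 1.0 forbids.
     * If failOnInvalid is true, the first forbidden character raises
     * an exception instead of being dropped silently.
     */
    public static String clean(String input, boolean failOnInvalid) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else if (failOnInvalid) {
                throw new IllegalArgumentException(
                    "Invalid XML character: 0x" + Integer.toHexString(cp));
            }
            // Otherwise the invalid character is dropped.
        }
        return out.toString();
    }
}
```

Note that this only removes characters that no XML 1.0 document may contain, such as 0xC; escaping of markup characters like `&` and `<` is a separate step that a serializer typically handles on its own.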

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Dawid Weiss
Right, I didn't think about this... somehow I thought this was all about special characters like ' " & <. Oh, believe me: this knowledge came from sour experience not from book wisdom... I know for sure some XML parsers complain about invalid characters, while others don't. Then we should
