[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847325#action_12847325
]
Dawid Weiss commented on NUTCH-787:
---
Thanks Andrzej.
> Upgrade Lucene t
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846434#action_12846434
]
Dawid Weiss commented on NUTCH-787:
---
I'll be happy to help if I can. I admit I
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830902#action_12830902
]
Dawid Weiss commented on NUTCH-787:
---
O.K. I think this is ready for review/ testing
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-787:
--
Attachment: NUTCH-787.patch
This patch moves Nutch from Lucene 2.9.1 to Lucene 3.0.0. All tests pass
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-787:
--
Attachment: (was: NUTCH-787.patch)
> Upgrade Lucene to 3.
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830900#action_12830900
]
Dawid Weiss commented on NUTCH-787:
---
The failing test in TestIndexSorter is caused by
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830534#action_12830534
]
Dawid Weiss commented on NUTCH-787:
---
Definitely not an easy thing to do. I need to fi
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-787:
--
Attachment: NUTCH-787.patch
Text-patch of changes porting the code to Lucene 3.0.0.
> Upgrade Luc
[
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830085#action_12830085
]
Dawid Weiss commented on NUTCH-787:
---
Just did an initial check -- this should be do
[
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830078#action_12830078
]
Dawid Weiss commented on NUTCH-673:
---
O.K., I'll look into the complexity of upg
Upgrade Lucene to 3.0.0.
Key: NUTCH-787
URL: https://issues.apache.org/jira/browse/NUTCH-787
Project: Nutch
Issue Type: Task
Components: build
Reporter: Dawid Weiss
Priority
[
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830051#action_12830051
]
Dawid Weiss commented on NUTCH-673:
---
Hi guys. I'd be willing to proceed with
Gaurang,
You can fetch documents from Nutch indexes (which are Lucene indexes) and then
feed them to the clustering algorithm directly, as explained in Carrot2 examples
here:
http://download.carrot2.org/head/manual/index.html#section.integration
There are several examples you can choose to
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556261#action_12556261
]
Dawid Weiss commented on NUTCH-567:
---
John Cowan apparently released a fixed versio
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-567:
--
Attachment: tagsoup-1.1.3-uripatched.jar
Attached is a patched version of tagsoup. The Tagsoup'
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-567:
--
Attachment: (was: tagsoup-1.1.3-uripatched.jar )
> Proper (?) handling of URIs in TagS
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-567:
--
Attachment: (was: uri-entities.patch)
> Proper (?) handling of URIs in TagS
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541074
]
Dawid Weiss commented on NUTCH-567:
---
I didn't put the feather because I wasn't sure about licensing; I
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539162
]
Dawid Weiss commented on NUTCH-567:
---
I agree. What we used to do in Carrot2 was to include the patch (against the
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539025
]
Dawid Weiss commented on NUTCH-567:
---
Hi Doğacan. I have sent an e-mail to Tagsoup's mailing list, but it seems
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535853
]
Dawid Weiss commented on NUTCH-567:
---
Don't mention it. Happy birthday and I hope it'll work for you. If
I looked at TagSoup sources and it seems it could be quite easily fixed. See
here:
https://issues.apache.org/jira/browse/NUTCH-567
D.
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-567:
--
Attachment: tagsoup-1.1.3-uripatched.jar
Binary of tagsoup with the patch compiled in.
> Pro
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-567:
--
Attachment: uri-entities.patch
A patch against tagsoup-1.1.3 fixing the entities-in-URIs problem
Proper (?) handling of URIs in TagSoup.
---
Key: NUTCH-567
URL: https://issues.apache.org/jira/browse/NUTCH-567
Project: Nutch
Issue Type: Improvement
Reporter: Dawid Weiss
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-544:
--
Attachment: (was: clustering-upgrade-2.1.patch)
> Upgrade Carrot2 clustering plugin to the new
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-544:
--
Attachment: clustering-upgrade-2.1.patch2
The same patch, one extra line of logging info added
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522992
]
Dawid Weiss commented on NUTCH-544:
---
Hey, Doğacan will you find a spare minute to commit this patch some time this
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522047
]
Dawid Weiss commented on NUTCH-544:
---
This parameter is in the code. It is specific to the plugin, not the extension
Hi guys (Doğacan? :),
I finalized the upgrade of Carrot2 libraries and a minor bug fix to the Web
application. Both issues should be pretty straightforward, if anyone finds 5
spare minutes to review and commit these patches I'd appreciate it.
https://issues.apache.org/jira/browse/NUTCH-544
http
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-544:
--
Attachment: clustering-upgrade-2.1.patch
Same patch, but I added an optional parameter that allows
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-544:
--
Attachment: (was: clustering-upgrade-2.1.patch)
> Upgrade Carrot2 clustering plugin to the new
[
https://issues.apache.org/jira/browse/NUTCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-545:
--
Attachment: search.jsp.patch
Patch of search.jsp that moves initialization code to jspInit
Components: web gui
Reporter: Dawid Weiss
The initialization code block in search.jsp is invoked in every request (it's
part of the request block). This is unnecessary and actually slows down the
request cycle -- Configuration and OnlineClusterer can (and should) be reused.
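The point above can be sketched in plain Java. This is only a minimal illustration and assumes nothing about the real search.jsp: ExpensiveResource, PerRequestPage and InitOncePage are hypothetical stand-ins (for an object such as Configuration or OnlineClusterer, and for the page before and after moving the setup into an init method such as jspInit).

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InitOnceSketch {
    // Hypothetical stand-in for an object that is costly to build and safe to reuse.
    public static final class ExpensiveResource {
        static final AtomicInteger CREATED = new AtomicInteger();

        ExpensiveResource() {
            CREATED.incrementAndGet(); // count how many instances get built
        }

        String handle(String query) {
            return "results for " + query;
        }
    }

    // Shape of the problem: the resource is rebuilt inside every request.
    public static final class PerRequestPage {
        String service(String query) {
            return new ExpensiveResource().handle(query);
        }
    }

    // Shape of the fix: build once in init(), reuse in every request.
    public static final class InitOncePage {
        private ExpensiveResource resource;

        void init() {
            resource = new ExpensiveResource();
        }

        String service(String query) {
            return resource.handle(query);
        }
    }

    // Serves the given number of requests and reports how many resources were built.
    public static int countCreations(boolean initOnce, int requests) {
        ExpensiveResource.CREATED.set(0);
        if (initOnce) {
            InitOncePage page = new InitOncePage();
            page.init();
            for (int i = 0; i < requests; i++) {
                page.service("q" + i);
            }
        } else {
            PerRequestPage page = new PerRequestPage();
            for (int i = 0; i < requests; i++) {
                page.service("q" + i);
            }
        }
        return ExpensiveResource.CREATED.get();
    }

    public static void main(String[] args) {
        System.out.println("per request: " + countCreations(false, 3));
        System.out.println("init once:   " + countCreations(true, 3));
    }
}
```

The shared object has to be safe for concurrent use, since one page instance typically serves many requests at once.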
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521843
]
Dawid Weiss commented on NUTCH-544:
---
Not exactly; the initialization issue is still present, but I'll c
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521842
]
Dawid Weiss commented on NUTCH-544:
---
Ok, this patch does the following:
- upgrades Carrot2 libs to 2.1 (the most
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-544:
--
Attachment: libs-packed.tar.gz
lib folder (binary files to be replaced).
> Upgrade Carrot2 cluster
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated NUTCH-544:
--
Attachment: clustering-upgrade-2.1.patch
svn diff of the patch. Binary files are not included (is there
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521792
]
Dawid Weiss commented on NUTCH-544:
---
Doğacan, would it be a problem if we threw in BeanShell and Dom4j JARs? We
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521791
]
Dawid Weiss commented on NUTCH-544:
---
Yes, absolutely -- it's actually my fault I didn't notice t
[
https://issues.apache.org/jira/browse/NUTCH-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521784
]
Dawid Weiss commented on NUTCH-544:
---
I've started working on this -- will send a patch for revision soon (t
: Improvement
Reporter: Dawid Weiss
Priority: Minor
This issue upgrades Carrot2 search results clustering plugin to the newest
stable version.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[
http://issues.apache.org/jira/browse/NUTCH-397?page=comments#action_12450146 ]
Dawid Weiss commented on NUTCH-397:
---
I'll review this patch and commit all the necessary code as soon as possible
(it may be around the end of the week t
svn.sourceforge.net/svnroot/carrot2/trunk/carrot2/components/carrot2-util-gzip/
Dawid
Dawid Weiss wrote:
I believe both deflate and gzip (as well as zip) are included as servlet
filters in:
http://sourceforge.net/projects/pjl-comp-filter/
Dawid
Pascal Beis wrote:
Hi all,
I've added su
I believe both deflate and gzip (as well as zip) are included as servlet
filters in:
http://sourceforge.net/projects/pjl-comp-filter/
Dawid
Pascal Beis wrote:
Hi all,
I've added support for deflate encoding (next to gzip) to nutch. Is there
interest to
include this into the main source repo
[
http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419708 ]
Dawid Weiss commented on NUTCH-300:
---
Hi. I just took a look at it -- I don't see anything wrong with the code and
Andrzej has used Carrot2 before. We're u
[
http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12418396 ]
Dawid Weiss commented on NUTCH-309:
---
Painful job, Jerome, but in most cases (non-critical loops) the gain will not
be significant and proliferating if statements makes the
[
http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12415094 ]
Dawid Weiss commented on NUTCH-294:
---
Well, you certainly have something wrong in your configuration then. I just
tried
with the head revision. My nutch-site looks like this
[
http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414960 ]
Dawid Weiss commented on NUTCH-294:
---
Ehm, sorry I'm so late with this -- tons of work.
1) Stefan, if you can't get it working, speak up what is not working
(
[
http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413220 ]
Dawid Weiss commented on NUTCH-265:
---
If you just mean the user interface, then you can simply take the XSLT
stylesheet from Carrot2 and reuse it in Nutch with the opensearch
[
http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413072 ]
Dawid Weiss commented on NUTCH-265:
---
Chris, the current clusterer in Nutch _does_ discover phrases for clusters, so
I don't know what you really mean. Did you take a lo
Yes, this should be definitely mentioned somewhere (in the documentation
:) At least we left a track on the mailing list so it'll be possible to
refer to it.
D.
Jérôme Charron wrote:
You're right -- changing anything with the input (snippets length,
number of documents etc) will alter the c
Hi Jerome,
Yes Dawid, but it is already committed => the clustering now uses the plain
text version returned by the toString() method.
Ugh, yes, sorry about that, it uses Summary.toStrings(summaries) to be
specific and that uses toString internally.
Actually, the clustering uses the summa
The reason is that they should not use the same HTML code :
1. OpenSearch should only use around highlights
2. search.jsp should use some more complicated HTML code ()
Add 3. Clustering would benefit from a plain text version.
D.
[
http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ]
Dawid Weiss commented on NUTCH-265:
---
The clustering interface is very simple in Nutch because it usually needs to be
adjusted to the needs of a particular application
[
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ]
Dawid Weiss commented on NUTCH-134:
---
(back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a
plain-text only summarizer is ideal for clusterin
I also think it makes sense -- we use a language identifier component in
Carrot2 and we'd love to just have a single library for this
functionality. As always, some extra managerial effort is unfortunately
needed to drive a stand-alone project.
D.
Chris Mattmann wrote:
Hi Otis,
This thread s
Subversion basically uses plain diff so I believe what you ask for isn't
possible. But if somebody knows otherwise I'd also appreciate a note.
D.
Stefan Groschupf wrote:
Hi,
does anybody know how to do svn diffs that contain binary content,
like jars or images?
I was not able to find any
sure I have not applied it wrongly (I think it is
correct but I did it so many times that I want to cross check).
Regards
Piotr
Dawid Weiss wrote:
What kind of problems? If you need something, let me know.
D.
Piotr Kosiorowski wrote:
I got some problems while applying Dawid clustering patch
What kind of problems? If you need something, let me know.
D.
Piotr Kosiorowski wrote:
I got some problems while applying Dawid's clustering patch (my linux
environment does not seem to be set up correctly) - but I switched to cygwin
and it looks ok. I will try to commit it today/tomorrow.
Regards
Piot
I do agree with Jerome - plugins should be checked too.
This basically means modifying the fileset in the pmd task. Shouldn't be
too difficult to include all plugin sources with a single
statement.
I will make it a totally separate target (so tests do not
depend on it).
That was actually
Could we have the clustering patch applied before the 0.8.0 release? I
know you're way busy with other things, Andrzej, maybe you'll forward it
to somebody else? It shouldn't be a difficult patch to review and apply.
D.
Doug Cutting wrote:
TDLN wrote:
I mean, how do others keep up to date wi
My feeling was simply that the closer we are to Nutch-1.0, the more we need
some Q&A metrics (for us and for nutch users). No?
I absolutely agree Jérôme, really. It's just that developers usually
tend to hook up dozens of Q&A plugins and never look at what they output
(that's the usual scen
ed rules
(in another target or even in the same one).
That's again up to you guys.
Dawid
P.S. Tom Copeland has already fixed the bug I mentioned in the patch.
Quite impressive bugfix turnaround, isn't it. :)
Piotr Kosiorowski wrote:
P.
Dawid Weiss wrote:
All right, I thoug
's perfect.
https://sourceforge.net/tracker/?func=detail&atid=479921&aid=1465574&group_id=56262
D.
Piotr Kosiorowski wrote:
+1 - I offer my help - we can coordinate it and I can do a part of work. I
will also try to commit your patches quickly.
Piotr
On 4/6/06, Dawid Weiss <[EMA
> Other options (raised on the Hadoop list) are Checkstyle:
PMD seems to be the best choice for an Apache project and they all seem
to perform at a similar level.
Anything that generates a lot of false positives is bad: it either
causes us to skip analysis of lots of files, or ignore the war
I'm a fan of automated testing and code analysis utilities, but I must
say they only make sense if people actually use them and look at their
results. So it's not really just about integration -- it's about looking
at the results of these tools. PMD is neat because it can simply
interrupt you
Ok, PMD seems like a good idea. I've added it to the build file. Unused
code detection shows a few catches (javacc-generated classes need to be
ignored because they contain a lot of junk), but unfortunately it also
displays false positives such as in:
MapWritable.java 429 {Avoid unused p
One can presumably disable such minor warnings in Eclipse. Arguably the
bug is that Eclipse warns about such things by default, rather than in a
'pedantic' mode.
I agree -- some of them are really annoying. Plus, Eclipse has been
having notorious problems showing warnings for unused paramet
In any case, it includes a system to scrape search results from other
engines, based on Apple's Sherlock search-engine descriptors. These
descriptors are also used by Mozilla:
Just a note: we used to have exactly the same mechanism in Carrot2.
Unfortunately this format does not make a clea
I can help by reusing input components from Carrot2 -- they give access
to Google (via GoogleAPI), Yahoo (via their REST API) and Nutch (via
OpenSearch). Somebody would need to put together the rest of the
evaluation framework though :)
D.
Andrzej Bialecki wrote:
Hi,
I found this paper, m
It works fine Doug, thanks.
Please tell me if it is correct, since I don't use Eclipse.
I'm at the vi (or rather vim) level very often, but emacs is still ahead
of me ;) And on a more serious note, Eclipse shows a good few warnings
in the present codebase. They are usually minor things like
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]
Dawid Weiss updated NUTCH-237:
--
Attachment: NUTCH-237.DWEISS.patch.zip
Hi Andrzej. The ZIP file contains a patch and svn stat with the improved code:
- The primary language for hits without
Would it be a problem to add Eclipse's ".settings" folder to ignored
files (since Eclipse project files are already there anyway). This file
is used when one wants to override default project configuration (code
formatting, specific JVM etc).
Dawid
[
http://issues.apache.org/jira/browse/NUTCH-237?page=comments#action_12371687 ]
Dawid Weiss commented on NUTCH-237:
---
Yes and no. I removed the "support" for foreign languages from the constructor
code:
// We initialize Lingo wi
Hi,
This issue:
http://issues.apache.org/jira/browse/NUTCH-237
contains an upgrade of Carrot2 libraries to the newest codebase and a
few minor editing operations on the plugin sources. Please review and
commit (not urgent, Andrzej :).
Thanks,
Dawid
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]
Dawid Weiss updated NUTCH-237:
--
Attachment: libs.zip
Libraries that need to be replaced.
> Carrot2 clustering plugin upgrade.
> --
>
> Ke
[ http://issues.apache.org/jira/browse/NUTCH-237?page=all ]
Dawid Weiss updated NUTCH-237:
--
Attachment: c2.patch
svn-stat.txt
Note the two deleted files (I attached the result of svn stat). I didn't know
how to include this info i
Carrot2 clustering plugin upgrade.
--
Key: NUTCH-237
URL: http://issues.apache.org/jira/browse/NUTCH-237
Project: Nutch
Type: Improvement
Reporter: Dawid Weiss
Priority: Trivial
This is an upgrade of the clustering plugin to
[ http://issues.apache.org/jira/browse/NUTCH-234?page=all ]
Dawid Weiss updated NUTCH-234:
--
Attachment: patch.diff
The patch adding:
- a JUnit test case to the clustering extension,
- minor code cleanups
- adds ".settings" file to svn:ignore o
Type: Test
Reporter: Dawid Weiss
Priority: Minor
I've cleaned up the code a bit and added a real test case for the clustering
extension. This is in preparation for upgrading to the most recent Carrot2
codebase and I didn't want to mix these two patches together. I'
The probability of encountering a $ sign somewhere inside URL is not
insignificant... I agree that it's very unlikely (perhaps even illegal)
to use ^ in URLs, but $ are sometimes used.
I'd have to take a look at the spec, but I think both characters should
be URL-encoded anyway. Maybe it'd b
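For reference, the percent-encoded forms of the two characters can be checked with a few lines of Java. This is only a sketch: PercentEncodeSketch is a hypothetical name, the overload used needs Java 10 or later, and java.net.URLEncoder implements HTML form encoding rather than generic URI escaping, so it is used here just to show the escapes for these two characters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncodeSketch {
    // Form-style percent-encoding of a string as UTF-8.
    public static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("$")); // %24
        System.out.println(encode("^")); // %5E
    }
}
```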
Hmm... I'm not convinced. How would you generate the best snippet from a
relevant, but ignored chunk?
Good point... I guess you simply wouldn't generate anything at all (show
the title?). I guess structure text should not be relevant enough to
actually cause a hit on top of the search result
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ]
Dawid Weiss updated NUTCH-228:
--
Attachment: clustering.patch
This patch fixed the plugin descriptor and a typo in cluster.jsp that caused
the wrong number of milliseconds to be dumped in the output
Clustering plugin descriptor broken (fix included)
--
Key: NUTCH-228
URL: http://issues.apache.org/jira/browse/NUTCH-228
Project: Nutch
Type: Bug
Reporter: Dawid Weiss
Priority: Minor
The plugin descriptor
I see this issue:
https://issues.apache.org/jira/browse/NUTCH-217
is no longer relevant (a patch has been applied in the trunk). I added a
note about it, somebody with more privileges needs to close it when time
permits.
D.
It seems to me that there are two separate problems:
1) content parsing to avoid site structure -> influences the index and
rankings
2) content parsing for KWIC snippet generation -> influences the user
perception of the engine.
I'd agree that (2) is quite important for the end user; Richard
: searcher
Versions: 0.8-dev
Reporter: Dawid Weiss
I've been playing with the trunk. The distributed searcher complains with an
instantiation exception when deserializing Query. A quick code inspection shows
that Query doesn't have any parameterless constructor.
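The failure mode described above is easy to reproduce outside Nutch. The sketch below is an illustration under assumptions: WithoutDefault and WithDefault are hypothetical classes, not the real Query, and the reflective call merely stands in for whatever the deserialization code does to create an empty instance before filling in its fields.

```java
import java.lang.reflect.Constructor;

public class NoArgConstructorSketch {
    // Hypothetical class with only a one-argument constructor.
    public static final class WithoutDefault {
        final String text;

        public WithoutDefault(String text) {
            this.text = text;
        }
    }

    // Hypothetical class that also offers a parameterless constructor.
    public static final class WithDefault {
        String text = "";

        public WithDefault() {
        }

        public WithDefault(String text) {
            this.text = text;
        }
    }

    // Tries to create an instance the way reflective deserializers commonly do.
    public static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            constructor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // A missing parameterless constructor ends up here (NoSuchMethodException).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(WithoutDefault.class)); // false
        System.out.println(canInstantiate(WithDefault.class));    // true
    }
}
```

The older Class.newInstance() call reports the same situation as an InstantiationException, which matches the symptom mentioned in the report.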
--
This
I just wanted to say we've gone through such problems already in Carrot2
-- many modules depend on each other, some of them have custom build
steps. A pure ANT solution is likely to be quite ugly... But back to the
point: you can test for existence of a plugin-specific build file and
execute
Yes, there is an easier way. Implement a custom task to which you'll
pass a path to plugin.xml and a name for a path. The task (Java code)
will create a named (id) object which can be subsequently used in
ant with .
This requires a custom ant task, but as you mentioned foreach is also a
se
log4j-1.2.11.jar src/plugin/clustering-carrot2/lib
log4j-1.2.6.jar 1 src/plugin/parse-rss/lib
log4j-1.2.9.jar src/plugin/parse-pdf/lib
nekohtml-0.9.2.jar src/plugin/clustering-carrot2/lib
nekohtml-0.9.4.jar src/plu
Definitely there is interest!
Let's hear all the voices though :)
If the interfaces in carrot2 don't change
too much, there is not so much work with the adapters, they are quite
simple after all.
You are right -- they don't change a lot on Carrot2 side. I was
concerned mostly with the Nu
Hi there,
We've been quite busy with putting things together at Carrot2. Version
1.0.1 is out -- it is a stable release with a few tweaks and tunings
that appeared after 1.0. We also have a Web site ;)
http://www.carrot2.org
So... I think it's time for reintegrating that code into Nutch
cl
u try this, self-indulging, query (with filtering enabled):
http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=active&start=900
You get: "Results 781 - 782 of about 61,700"
Now try disabling filtering:
http://www.g
Hi Charlie,
Don't cross-post to two lists at once. The question you asked is
relevant to C2, not Nutch, so I'll reply to it there.
Dawid
charlie wrote:
Dear all,
Currently I’m using the Nutch plug-in “clustering-carrot2” and would
like to ask for some help. When I built the search resu
Check this out, guys, I thought some of you might find it amusing:
http://www.mex-search.com/
The "full" option gives you an "agent-based search engine". The
usability might be questioned (long animations), but it certainly gives
a strong first impression :) Have fun.
Dawid
[
http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332559 ]
Dawid Weiss commented on NUTCH-82:
--
I personally disagree Perl is a better alternative to Cygwin... Most people
familiar with Unix/ Windows development will have no problems
Yes but (I think -- I haven't confirmed) this basic escaping is being
done by the DOM streaming. It at least is converting characters like 0xC
to .
I'd have to look at the code and see how the XML is serialized... Most
DOM streaming classes will encode entities somehow, so you shouldn't
wo
> The differences between this method and the patch supplied in NUTCH-110
> are:
Take a closer look at the source code --
1. XMLSerializerHelper#toValidXmlText throws an exception when it encounters
an invalid character, whereas NUTCH-110 just drops it.
Not really, it is governed by a boolean flag. If t
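A flag-governed filter of the kind discussed here could look roughly like the sketch below. It is not the code of XMLSerializerHelper or of the NUTCH-110 patch: the class name, the method names and the failOnInvalid parameter are all hypothetical, and the validity check simply follows the Char production of the XML 1.0 specification.

```java
public class XmlCharFilterSketch {
    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Drops invalid characters, or throws on the first one when failOnInvalid is set.
    public static String cleanXmlText(String text, boolean failOnInvalid) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (isValidXmlChar(codePoint)) {
                out.appendCodePoint(codePoint);
            } else if (failOnInvalid) {
                throw new IllegalArgumentException(
                        "Invalid XML character 0x" + Integer.toHexString(codePoint));
            }
            // Otherwise the invalid character is silently skipped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0xC (form feed) is not a legal XML 1.0 character.
        System.out.println(cleanXmlText("a\fb", false)); // ab
    }
}
```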
Right, I didn't think about this... somehow I thought this was all about
special characters like ' " & <.
Oh, believe me: this knowledge came from sour experience not from book
wisdom... I know for sure some XML parsers complain about invalid
characters, while others don't.
Then we should