Doug Cutting wrote:
I have committed this, along with the LuceneQueryOptimizer changes.
I could only find one place where I was using numDocs() instead of
maxDoc().
Right, I confused two bugs from different files - the other bug still
exists in the committed version of the
LuceneQueryOpti
During a fetch I have recently started getting these (pretty
consistently).
task_r_5m9ybr 0.15 reduce > copy > java.lang.NullPointerException
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:991)
at java.lang.Float.parseFloat(Float.java:394)
at org.apache.nutch.parse
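The trace above ends in Float.parseFloat, which throws a NullPointerException when it is handed a null string. A minimal sketch of a defensive parse follows, assuming the value comes from a lookup that may return null. The class and the helper parseOrDefault are hypothetical names for illustration, not Nutch code.

```java
class SafeFloatParse {
    // Hypothetical helper: returns the fallback when the raw value is
    // missing or is not a valid float literal.
    static float parseOrDefault(String raw, float fallback) {
        if (raw == null) {
            // Float.parseFloat(null) would throw NullPointerException here.
            return fallback;
        }
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(null, 1.0f));
        System.out.println(parseOrDefault("0.25", 1.0f));
    }
}
```

Whether a silent fallback is appropriate depends on the caller; logging the missing value may be preferable when it signals corrupt input.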
Plain text parser should use parser.character.encoding.default property for
fall back encoding
--
Key: NUTCH-161
URL: http://issues.apache.org/jira/browse/NUTCH-161
Project: Nutch
Andrzej Bialecki wrote:
Sounds like tf/idf might be de-emphasized in scoring. Perhaps
NutchSimilarity.tf() should use log() instead of sqrt() when
field==content?
I don't think it's that simple, the OPIC score is what determined this
behaviour, and it doesn't correspond at all to tf/idf, but
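To make the sqrt() versus log() question above concrete, here is a small sketch comparing the two damping curves for a raw term frequency. It assumes a "1 + ln(freq)" form for the logarithmic variant; that form, the class name, and the method names are illustrative choices, not the actual NutchSimilarity code.

```java
class TfDamping {
    // Square-root damping of the raw term frequency.
    static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Logarithmic damping (assumed form): grows more slowly than sqrt
    // for large frequencies; zero occurrences map to zero.
    static float logTf(float freq) {
        return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
    }

    public static void main(String[] args) {
        for (float freq : new float[] {1, 10, 100, 1000}) {
            System.out.printf("freq=%.0f sqrt=%.2f log=%.2f%n",
                    freq, sqrtTf(freq), logTf(freq));
        }
    }
}
```

With this form, both curves agree at a frequency of 1, and the logarithmic one flattens out much sooner as the frequency grows.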
Doug Cutting wrote:
Andrzej Bialecki wrote:
Using the original index, it was possible for pages with high tf/idf
of a term, but with a low "boost" value (the OPIC score), to outrank
pages with high "boost" but lower tf/idf of a term. This phenomenon
leads quite often to results that are perc
Andrzej Bialecki wrote:
Using the original index, it was possible for pages with high tf/idf of
a term, but with a low "boost" value (the OPIC score), to outrank pages
with high "boost" but lower tf/idf of a term. This phenomenon leads
quite often to results that are perceived as "junk", e.g. p
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361552 ]
KuroSaka TeruHiko commented on NUTCH-138:
-
Sorry, my oversight, useBodyEncodingForURI did not work as I expected. Setting
URIEncoding is the only way. I'll write this
Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
Now users can select their own page signature implementation, possibly
with better properties than the old one.
Two implementations are provided:
* MD5Signature: backward-compatible with the old schema.
* TextProfileSignature: an example implemen
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ]
Piotr Kosiorowski commented on NUTCH-138:
-
BTW - just create a user for yourself in the Nutch Wiki and you should be able to add
a new page with information without problems.
[ http://issues.apache.org/jira/browse/NUTCH-138?page=all ]
Piotr Kosiorowski closed NUTCH-138:
---
Resolution: Invalid
Setting URIEncoding in tomcat config file fixes the problem.
> non-Latin-1 characters cannot be submitted for search
> -
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361546 ]
KuroSaka TeruHiko commented on NUTCH-138:
-
You are right. With this Tomcat config, UTF-8 characters can be passed.
Having useBodyEncodingForURI="true" also works.
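The URIEncoding discussion in this issue comes down to which charset is used to decode the percent-encoded query string. The sketch below is plain Java, not Tomcat code. It assumes the browser sends UTF-8 and shows what happens when the receiving side decodes with ISO-8859-1 instead. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class QueryDecoding {
    // Encodes the query as UTF-8 (what the browser is assumed to send),
    // then decodes it with the given charset (what the server side uses).
    static String roundTrip(String query, String decodeCharset) {
        try {
            String encoded = URLEncoder.encode(query, "UTF-8");
            return URLDecoder.decode(encoded, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + decodeCharset, e);
        }
    }

    public static void main(String[] args) {
        String query = "\u65e5\u672c\u8a9e"; // three non-Latin-1 characters
        System.out.println(query.equals(roundTrip(query, "UTF-8")));
        System.out.println(query.equals(roundTrip(query, "ISO-8859-1")));
    }
}
```

Decoding with the matching charset restores the original query, while the mismatched charset yields a different, garbled string.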
Andrzej Bialecki wrote:
I'm happy to report that further tests performed on a larger index seem
to show that the overall impact of the IndexSorter is definitely
positive: performance improvements are significant, and the overall
quality of results seems at least comparable, if not actually bett
Stefan Groschupf wrote:
I also note this line in client.java
public Writable[] call(Writable[] params, InetSocketAddress[] addresses)
throws IOException {
if (params.length == 0) return new Writable[0];
Do I understand correctly that in case the remote method does not need
any paramet
Andrzej Bialecki wrote:
Gal Nitzan wrote:
this function throws IOException. Why?
public long getPos() throws IOException {
  return (doc*INDEX_LENGTH)/maxDoc;
}
It should be throwing ArithmeticException
The IOException is required by the API of RecordReader.
W
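The getPos() expression quoted above divides by maxDoc, so an empty index (maxDoc == 0) produces the "/ by zero" ArithmeticException regardless of the declared IOException. A minimal sketch of a guarded variant follows. The value of INDEX_LENGTH here is a placeholder, and the class and method names are hypothetical, not the DeleteDuplicates code.

```java
class PositionEstimate {
    // Placeholder value; the real constant is not shown in the thread.
    static final long INDEX_LENGTH = 1000L;

    // Mirrors the quoted expression: throws ArithmeticException when maxDoc is 0.
    static long unguardedGetPos(long doc, long maxDoc) {
        return (doc * INDEX_LENGTH) / maxDoc;
    }

    // Guarded variant: reports position 0 for an empty index.
    static long guardedGetPos(long doc, long maxDoc) {
        if (maxDoc == 0) {
            return 0L;
        }
        return (doc * INDEX_LENGTH) / maxDoc;
    }

    public static void main(String[] args) {
        System.out.println(guardedGetPos(5, 10));
        System.out.println(guardedGetPos(5, 0));
    }
}
```

Returning 0 for an empty index is one possible choice; the right behaviour depends on how the caller uses the position.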
Gal Nitzan wrote:
I am using trunk. While trying to crawl I get the following:
[ ...]
050825 100222 task_m_ns3ehv Error running child
050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
050825 100222 task_m_ns3ehv at org.apache.nutch.indexer.DeleteDuplicates$1.getPos(De
[EMAIL PROTECTED] wrote:
Now users can select their own page signature implementation, possibly
with better properties than the old one.
Two implementations are provided:
* MD5Signature: backward-compatible with the old schema.
* TextProfileSignature: an example implementation of a signature,
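As a rough illustration of what an MD5-based page signature involves, the sketch below hashes raw page content with the JDK's MessageDigest. It assumes the signature is simply the MD5 digest of the content bytes. The class and method names are hypothetical and this is not the committed MD5Signature class.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentSignature {
    // Computes the MD5 digest of the raw content bytes.
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Renders a digest as lowercase hexadecimal.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<html>example</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(md5(page)));
    }
}
```

A digest over the exact bytes changes with any single-character difference, which is the limitation a text-profile approach is meant to relax.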
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361545 ]
byron miller commented on NUTCH-159:
While it's from the mapred trunk, it is a non ndfs/local instance only.
Mapred.temp.dir was left at its defaults.. (which didn't exis
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361541 ]
Doug Cutting commented on NUTCH-159:
mapred.local.dir is the thing to set. If that fails, then there is a bug.
What did you have this set to?
> Specify temp/working dire
Hi Andrzej,
Gal Nitzan wrote:
>> It seems that Trunk is now broken...
>>
DmozParser seems to be broken, too. Its package declaration is still
org.apache.nutch.crawl instead of org.apache.nutch.tools.
TJ
Hi Andrzej,
Gal Nitzan wrote:
> It seems that Trunk is now broken...
>
DmozParser seems to be broken, too. Its package declaration is still
org.apache.nutch.crawl instead of org.apache.nutch.tools.
TJ
Piotr Kosiorowski wrote:
Andrzej Bialecki wrote:
Hi,
I just committed a large patch to clean up the trunk/ of obsolete and
broken classes remaining from the 0.7.x development line. Please test
that things still work as they should ...
Hi,
I am not sure what is wrong but a lot of JUnit test
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ]
Piotr Kosiorowski commented on NUTCH-138:
-
I am not sure but I would suspect it is a problem of bad tomcat configuration.
To handle special characters in query urls one