svn commit: r356257 - in /james/server/trunk/src/java/org/apache/james: transport/mailets/BayesianAnalysis.java util/BayesianAnalyzer.java util/JDBCBayesianAnalyzer.java

2005-12-12 Thread vincenzo
Author: vincenzo
Date: Mon Dec 12 06:26:29 2005
New Revision: 356257

URL: http://svn.apache.org/viewcvs?rev=356257&view=rev
Log:
1) Fixed JAMES-387 (java.lang.ClassCastException: java.lang.Integer).
2) Some enhancements to reduce memory footprint.

Modified:

james/server/trunk/src/java/org/apache/james/transport/mailets/BayesianAnalysis.java
james/server/trunk/src/java/org/apache/james/util/BayesianAnalyzer.java
james/server/trunk/src/java/org/apache/james/util/JDBCBayesianAnalyzer.java

Modified: james/server/trunk/src/java/org/apache/james/transport/mailets/BayesianAnalysis.java
URL: http://svn.apache.org/viewcvs/james/server/trunk/src/java/org/apache/james/transport/mailets/BayesianAnalysis.java?rev=356257&r1=356256&r2=356257&view=diff
==
--- james/server/trunk/src/java/org/apache/james/transport/mailets/BayesianAnalysis.java (original)
+++ james/server/trunk/src/java/org/apache/james/transport/mailets/BayesianAnalysis.java Mon Dec 12 06:26:29 2005
@@ -340,8 +340,10 @@
 try {
 // this is synchronized to avoid concurrent update of the corpus
 synchronized(JDBCBayesianAnalyzer.DATABASE_LOCK) {
+analyzer.tokenCountsClear();
 analyzer.loadHamNSpam(conn);
 analyzer.buildCorpus();
+analyzer.tokenCountsClear();
 }
 
 log("BayesianAnalysis Corpus loaded");

Modified: james/server/trunk/src/java/org/apache/james/util/BayesianAnalyzer.java
URL: http://svn.apache.org/viewcvs/james/server/trunk/src/java/org/apache/james/util/BayesianAnalyzer.java?rev=356257&r1=356256&r2=356257&view=diff
==
--- james/server/trunk/src/java/org/apache/james/util/BayesianAnalyzer.java (original)
+++ james/server/trunk/src/java/org/apache/james/util/BayesianAnalyzer.java Mon Dec 12 06:26:29 2005
@@ -261,14 +261,21 @@
 public void clear() {
 corpus.clear();
 
-hamTokenCounts.clear();
-spamTokenCounts.clear();
+tokenCountsClear();
 
 hamMessageCount = 0;
 spamMessageCount = 0;
 }
 
 /**
+ * Clears token counters.
+ */
+public void tokenCountsClear() {
+hamTokenCounts.clear();
+spamTokenCounts.clear();
+}
+
+/**
  * Public setter for corpus.
  *
  * @param corpus The new corpus.
@@ -289,17 +296,19 @@
  */
 public void buildCorpus() {
 //Combine the known ham & spam tokens.
-corpus.putAll(hamTokenCounts);
-corpus.putAll(spamTokenCounts);
+Set set = new HashSet(hamTokenCounts.size() + spamTokenCounts.size());
+set.addAll(hamTokenCounts.keySet());
+set.addAll(spamTokenCounts.keySet());
+Map tempCorpus = new HashMap(set.size());
 
 //Iterate through all the tokens and compute their new
 //individual probabilities.
-Iterator i = corpus.keySet().iterator();
+Iterator i = set.iterator();
 while (i.hasNext()) {
 String token = (String) i.next();
-
-corpus.put(token, new Double(computeProbability(token)));
+tempCorpus.put(token, new Double(computeProbability(token)));
 }
+setCorpus(tempCorpus);
 }
 
 /**
@@ -335,13 +344,17 @@
 //Build a set of the tokens in the Stream.
 Set tokens = parse(stream);
 
+// Get the corpus to use in this run
+// A new corpus may be being built in the meantime
+Map workCorpus = getCorpus();
+
 //Assign their probabilities from the Corpus (using an additional
 //calculation to determine spamminess).
-SortedSet tokenProbabilityStrengths = getTokenProbabilityStrengths(tokens);
+SortedSet tokenProbabilityStrengths = getTokenProbabilityStrengths(tokens, workCorpus);
 
 //Compute and return the overall probability that the
 //stream is SPAM.
-return computeOverallProbability(tokenProbabilityStrengths);
+return computeOverallProbability(tokenProbabilityStrengths, workCorpus);
 }
 
 /**
@@ -575,9 +588,10 @@
  * The ordering is from the highest strength to the lowest strength.
  *
  * @param tokens
+ * @param workCorpus
  * @return  SortedSet of TokenProbabilityStrength objects.
  */
-private SortedSet getTokenProbabilityStrengths(Set tokens) {
+private SortedSet getTokenProbabilityStrengths(Set tokens, Map workCorpus) {
 //Convert to a SortedSet of token probability strengths.
 SortedSet tokenProbabilityStrengths = new TreeSet();
 
@@ -587,14 +601,15 @@
 
 tps.token = (String) i.next();
 
-if (corpus.containsKey(tps.token)) {
-tps.strength = Math.abs(0.5

[jira] Resolved: (JAMES-387) Exception in BayesianAnalysis

2005-12-12 Thread Vincenzo Gianferrari Pini (JIRA)
 [ http://issues.apache.org/jira/browse/JAMES-387?page=all ]
 
Vincenzo Gianferrari Pini resolved JAMES-387:
-

Fix Version: 2.3.0
 Resolution: Fixed

The corpus reload could run concurrently with an ongoing analysis of 
messages, so an analysis could read the corpus while it was only partly rebuilt.
The reload now builds a new hashmap, which becomes the actual corpus once the 
reload completes. In the meantime, any analysis runs against the old corpus, 
so no conflict occurs. The old corpus will eventually be garbage collected.
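
The build-then-swap approach described above can be sketched roughly as follows. This is a minimal illustration, not the committed code: the class name CorpusSwapSketch, the placeholder probability() formula, the use of generics, and the volatile modifier on the corpus field are all assumptions made for this example (the commit uses raw collections and its own computeProbability method).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "build a new map, then swap the reference".
public class CorpusSwapSketch {
    // Assumption: volatile, so reader threads see the newly published map.
    private volatile Map<String, Double> corpus = new HashMap<>();

    public Map<String, Double> getCorpus() {
        return corpus;
    }

    // Builds the new corpus in a temporary map; readers never see it half-built.
    public void rebuild(Map<String, Integer> ham, Map<String, Integer> spam) {
        Set<String> tokens = new HashSet<>(ham.keySet());
        tokens.addAll(spam.keySet());

        Map<String, Double> temp = new HashMap<>(tokens.size());
        for (String token : tokens) {
            temp.put(token, probability(token, ham, spam));
        }
        // One reference assignment publishes the finished map.
        corpus = temp;
    }

    // Placeholder scoring (NOT the James formula): spam share of occurrences.
    static double probability(String token, Map<String, Integer> ham,
                              Map<String, Integer> spam) {
        int h = ham.getOrDefault(token, 0);
        int s = spam.getOrDefault(token, 0);
        int total = h + s;
        return total == 0 ? 0.5 : (double) s / total;
    }

    public static void main(String[] args) {
        CorpusSwapSketch analyzer = new CorpusSwapSketch();
        Map<String, Integer> ham = new HashMap<>();
        ham.put("the", 3);
        Map<String, Integer> spam = new HashMap<>();
        spam.put("the", 1);
        spam.put("FONT", 4);

        // A reader takes its working reference once and keeps using it.
        Map<String, Double> workCorpus = analyzer.getCorpus();
        analyzer.rebuild(ham, spam);

        System.out.println("old corpus size: " + workCorpus.size());
        System.out.println("new corpus: " + analyzer.getCorpus());
    }
}
```

A reader that fetched its working reference before the rebuild keeps a map containing only Double values, so it never encounters the Integer counts that the old in-place putAll temporarily stored in the shared corpus.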

> Exception in BayesianAnalysis
> -
>
>  Key: JAMES-387
>  URL: http://issues.apache.org/jira/browse/JAMES-387
>  Project: James
> Type: Bug
>   Components: Matchers/Mailets (bundled)
> Versions: 3.0
>  Environment: James from svn-trunk 2005-08-01.
> MySQL 4.0
> Reporter: Stefano Bagnara
> Assignee: Vincenzo Gianferrari Pini
> Priority: Minor
>  Fix For: 2.3.0

>
> Got this exception for every incoming mail:
> 02/08/05 00:39:25 INFO  James.Mailet: BayesianAnalysis: Exception: java.lang.Integer
> java.lang.ClassCastException: java.lang.Integer
> at org.apache.james.util.BayesianAnalyzer.getTokenProbabilityStrengths(BayesianAnalyzer.java:591)
> at org.apache.james.util.BayesianAnalyzer.computeSpamProbability(BayesianAnalyzer.java:340)
> at org.apache.james.transport.mailets.BayesianAnalysis.service(BayesianAnalysis.java:289)
> at org.apache.james.transport.LinearProcessor.service(LinearProcessor.java:407)
> at org.apache.james.transport.JamesSpoolManager.process(JamesSpoolManager.java:460)
> at org.apache.james.transport.JamesSpoolManager.run(JamesSpoolManager.java:369)
> at java.lang.Thread.run(Unknown Source)
> If I clean my spam/ham db the exceptions disappear, but they start again when 
> the spam/ham db becomes large.
> My bayesiananalysis_spam contains 20 rows.
> The following are the spam tokens with higher "occurrences".
> +---------------------------+-------------+
> | token                     | occurrences |
> +---------------------------+-------------+
> | 3D                        |       82151 |
> | a                         |       59953 |
> | the                       |       45295 |
> | FONT                      |       42771 |
> | Content-Type              |       39058 |
> | to                        |       36626 |
> | com                       |       32902 |
> | http                      |       32886 |
> | of                        |       32504 |
> | font                      |       31803 |
> | and                       |       31577 |
> | Content-Transfer-Encoding |       31576 |
> | p                         |       29746 |
> | text                      |       29482 |
> | in                        |       29418 |
> | it                        |       28498 |
> | br                        |       28037 |
> | DIV                       |       27431 |
