Matus UHLAR - fantomas wrote:

Ah. I didn't see that option. That's nice. I'm now using pdftotext instead of pdftohtml here as well. :-)

I've been thinking about it. pdftohtml could provide interesting
information, such as colour information, that could lead to better spam
detection. Any experiences with this?

You're right. It should be useful to extract to HTML when possible, and then use Mail::SpamAssassin::HTML to get and set the properties, just like the rendered method of Mail::SpamAssassin::Message::Node does.

The nice way to do this would IMHO be to make it possible for a plugin to call the "rendered" method of Mail::SpamAssassin::Message::Node passing type and extracted data as parameters.

Something like this (completely untested, and watch for wraps):
---8<---
--- Node.pm     Thu Jun 12 17:40:48 2008
+++ Node-new.pm Mon Jul 13 17:22:20 2009
@@ -411,16 +411,17 @@
 =cut

 sub rendered {
-  my ($self) = @_;
+  my ($self, $type, $text) = @_;

-  if (!exists $self->{rendered}) {
+  if ((defined($type) && defined($text)) || !exists $self->{rendered}) {
     # We only know how to render text/plain and text/html ...
     # Note: for bug 4843, make sure to skip text/calendar parts
     # we also want to skip things like text/x-vcard
     # text/x-aol is ignored here, but looks like text/html ...
+    $type = $self->{'type'} unless (defined($type));
-    return(undef,undef) unless ( $self->{'type'} =~ /^text\/(?:plain|html)$/i );
+    return(undef,undef) unless ( $type =~ /^text\/(?:plain|html)$/i );

-    my $text = $self->_normalize($self->decode(), $self->{charset});
+    $text = $self->_normalize($self->decode(), $self->{charset}) unless (defined($text));
     my $raw = length($text);

     # render text/html always, or any other text|text/plain part as text/html
---8<---

This way, AFAICT, any extracted (or generated) HTML should be treated the same way as normal text/html, making it available to the HTML eval tests, for example.
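
To make the intended calling convention concrete, here is a small stand-alone sketch. DemoNode is a made-up stand-in, not the real Mail::SpamAssassin::Message::Node, and the tag stripping is only a placeholder for real HTML rendering. It just mimics the argument handling and caching that the patched method is meant to have:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# DemoNode is a hypothetical stand-in for Mail::SpamAssassin::Message::Node.
# It only models the argument handling and caching of the patched rendered().
package DemoNode;

sub new {
  my ($class, %args) = @_;
  my $self = { type => $args{type}, body => $args{body}, render_count => 0 };
  return bless $self, $class;
}

sub rendered {
  my ($self, $type, $text) = @_;

  # Render again when the caller supplies both type and text,
  # otherwise only when nothing has been rendered yet.
  if ((defined($type) && defined($text)) || !exists $self->{rendered}) {
    $type = $self->{type} unless defined $type;
    return (undef, undef) unless $type =~ /^text\/(?:plain|html)$/i;

    $text = $self->{body} unless defined $text;

    # Placeholder for the real rendering step: strip tags from HTML.
    my $out = $text;
    $out =~ s/<[^>]*>//g if lc($type) eq 'text/html';

    $self->{rendered_type} = lc $type;
    $self->{rendered}      = $out;
    $self->{render_count}++;
  }

  return ($self->{rendered_type}, $self->{rendered});
}

package main;

# A PDF part cannot be rendered by the node on its own ...
my $node = DemoNode->new(type => 'application/pdf', body => '%PDF-1.4 ...');
my ($t0, $r0) = $node->rendered();

# ... but a plugin can hand in HTML it extracted (for example with pdftohtml).
my ($t1, $r1) = $node->rendered('text/html',
  '<p><font color="#ffffff">hidden</font> offer</p>');

# Later calls without arguments get the cached result.
my ($t2, $r2) = $node->rendered();

print "$t1: $r1\n";
```

The point is only that a call with both arguments replaces whatever was rendered before, while a call without arguments keeps working as it does today.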

Otherwise my plugin could of course use Mail::SpamAssassin::HTML itself.
Unfortunately Mail::SpamAssassin::Message::Node has no nice methods for setting the separate relevant properties, so either the set_rendered method needs to be expanded or complemented to allow this, or my plugin will have to set the relevant properties directly (which makes it depend on Mail::SpamAssassin::Message::Node not being changed too much).

I guess I could do the hack version now, and then update it if/when Mail::SpamAssassin::Message::Node is updated to support this in a nice way. :-)

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
