Matus UHLAR - fantomas wrote:
>> Ah. I didn't see that option. That's nice. I'm now using pdftotext
>> instead of pdftohtml here as well. :-)
> I've been thinking about it. pdftohtml could provide interesting
> information, such as colour information, that could lead to better spam
> detection. Any experience with this?
You're right. It should be useful to extract to HTML when possible, and
then use Mail::SpamAssassin::HTML to parse it and set the same properties
that the rendered method of Mail::SpamAssassin::Message::Node sets.
The nice way to do this would IMHO be to make it possible for a plugin
to call the "rendered" method of Mail::SpamAssassin::Message::Node
passing type and extracted data as parameters.
Something like this (completely untested, and watch for wraps):
---8<---
--- Node.pm Thu Jun 12 17:40:48 2008
+++ Node-new.pm Mon Jul 13 17:22:20 2009
@@ -411,16 +411,17 @@
 =cut

 sub rendered {
-  my ($self) = @_;
+  my ($self, $type, $text) = @_;

-  if (!exists $self->{rendered}) {
+  if ((defined($type) && defined($text)) || !exists $self->{rendered}) {
     # We only know how to render text/plain and text/html ...
     # Note: for bug 4843, make sure to skip text/calendar parts
     # we also want to skip things like text/x-vcard
     # text/x-aol is ignored here, but looks like text/html ...
-    return(undef,undef) unless ( $self->{'type'} =~ /^text\/(?:plain|html)$/i );
+    $type = $self->{'type'} unless (defined($type));
+    return(undef,undef) unless ( $type =~ /^text\/(?:plain|html)$/i );

-    my $text = $self->_normalize($self->decode(), $self->{charset});
+    $text = $self->_normalize($self->decode(), $self->{charset}) unless (defined($text));
     my $raw = length($text);

     # render text/html always, or any other text|text/plain part as text/html
---8<---
This way, AFAICT, any extracted (or generated) HTML should be treated
the same way as a normal text/html part, which makes it available to
HTML eval tests, for example.
Otherwise my plugin could of course use Mail::SpamAssassin::HTML itself.
Unfortunately Mail::SpamAssassin::Message::Node has no nice methods for
setting the separate relevant properties though, so either the
set_rendered method needs to be expanded or complemented to allow this
anyway, or my plugin will have to set the relevant properties directly
(which makes it depend on Mail::SpamAssassin::Message::Node not being
changed too much).
I guess I could do the hack version now, and then update it if/when
Mail::SpamAssassin::Message::Node is updated to support this in a nice
way. :-)
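The hack version could look roughly like the sketch below (also
untested). The helper name set_rendered_html is made up for
illustration, and the hash keys are an assumption based on what the
rendered method of Mail::SpamAssassin::Message::Node sets internally in
the version I looked at. They are not a public interface, and the
arguments accepted by Mail::SpamAssassin::HTML->new may differ between
versions.

```perl
use strict;
use warnings;
use Mail::SpamAssassin::HTML;

# Hypothetical plugin helper: render HTML extracted from an attachment
# and store the results directly on a Message::Node object.
# ASSUMPTION: these hash keys mirror Node.pm's internals and may change.
sub set_rendered_html {
  my ($node, $html_text) = @_;
  return unless defined $html_text && length $html_text;

  my $html = Mail::SpamAssassin::HTML->new();
  $html->parse($html_text);    # parse and render the extracted HTML

  $node->{rendered_type}      = 'text/html';
  $node->{rendered}           = $html->get_rendered_text();
  $node->{visible_rendered}   = $html->get_rendered_text(invisible => 0);
  $node->{invisible_rendered} = $html->get_rendered_text(invisible => 1);
  $node->{html_results}       = $html->get_results();
  return 1;
}
```

A plugin would call this with the node for the PDF part and the output
of pdftohtml. Since it pokes at the node's internals, it is exactly the
fragile dependency mentioned above.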
Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/