Re: notification in python

2010-09-28 Thread harryos
hi Erik
that was food for thought. Content length may not work if
substitutions leave the length unchanged.
I will look into Levenshtein distance. Thanks for the suggestion.
regards
harry

> Content length (which you could also get using the HTTP header 
> "Content-Length") won't necessarily tell you if content has changed. I think 
> your problem is a candidate for 
> http://en.wikipedia.org/wiki/Levenshtein_distance (calculating the 
> "distance" between two strings), for which I think there are Python 
> implementations.
>
> Depending on your requirements, you could add other heuristics to detect 
> major changes, e.g. load the page into an XML parser and only check certain 
> elements. But further suggestions would require more information on your 
> problem.

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-us...@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: notification in python

2010-09-28 Thread Erik Cederstrand
Harryos,

On 28/09/2010 at 09:56, harryos wrote:

> thanks Erik,
> By 'update' I meant a major addition/removal of text (say 100
> characters).
> Initially I thought of making a hash of a page and comparing it to the
> saved hash of the same page at a different moment in time. But this
> would cause even a tiny change to be considered an update. I would like
> to use a filter so that only a change of x characters counts as an update.
> Maybe using f=urllib.urlopen and
> currentsize=len(f.read()) will let me find the number of
> added/removed characters, and set the filter accordingly.


Content length (which you could also get using the HTTP header 
"Content-Length") won't necessarily tell you if content has changed. I think your 
problem is a candidate for http://en.wikipedia.org/wiki/Levenshtein_distance 
(calculating the "distance" between two strings), for which I think there are 
Python implementations.
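
A minimal sketch of the Levenshtein distance in plain Python, added here for illustration. It is not taken from any particular library; the function name `levenshtein` is just a label chosen for this example. It shows that two strings of equal length can still be a non-zero distance apart, which a length comparison would miss.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Same length, different content.
    print(levenshtein("price: 10", "price: 99"))
```

The running time grows with the product of the two string lengths, so for whole pages it may be worth comparing extracted text or individual lines rather than the raw HTML.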

Depending on your requirements, you could add other heuristics to detect major 
changes, e.g. load the page into an XML parser and only check certain elements. 
But further suggestions would require more information on your problem.
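
One possible way to check only certain elements is sketched below. It assumes Python 3 and swaps the XML parser for the standard library's `html.parser`, which tolerates markup that is not well-formed XML. The class `TagTextCollector`, the helper `fingerprint`, and the choice of the `p` tag are all made up for this example.

```python
import hashlib
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while inside the watched tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def fingerprint(html, tag="p"):
    """Hash only the text inside the watched tag (hypothetical helper)."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    # Collapse whitespace so pure layout changes do not alter the hash.
    text = " ".join(" ".join(collector.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this, a change in a banner `div` leaves the fingerprint untouched, while a change inside a watched paragraph alters it. Which tags to watch would have to be decided per site.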

Kind regards,

Erik




Re: notification in python

2010-09-28 Thread harryos

> You could also try looking at the HTTP headers for a request for e.g. 
> "index.htm" using urllib, specifically "Expires" and "Last-Modified".
> Using header values requires that you can trust the site on the header 
> content. Web servers and caching proxies can do all sorts of things with the 
> headers. Otherwise, saving the hash of the raw HTML (without GIFs etc.) as 
> suggested is a good approach, depending on what your definition of "updated" 
> is.
>


thanks Erik,
By 'update' I meant a major addition/removal of text (say 100
characters).
Initially I thought of making a hash of a page and comparing it to the
saved hash of the same page at a different moment in time. But this
would cause even a tiny change to be considered an update. I would like
to use a filter so that only a change of x characters counts as an update.
Maybe using f=urllib.urlopen and
currentsize=len(f.read()) will let me find the number of
added/removed characters, and set the filter accordingly.
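
A rough sketch of such a filter, assuming Python 3 and the standard library's `difflib` rather than a plain length comparison. The names `changed_chars` and `is_update` are invented for this example, and the default threshold of 100 is the figure mentioned above.

```python
import difflib


def changed_chars(old, new):
    """Return (added, removed) character counts between two strings."""
    added = removed = 0
    # autojunk=False turns off the heuristic that ignores very frequent
    # characters in long sequences.
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            removed += i2 - i1
        if op in ("replace", "insert"):
            added += j2 - j1
    return added, removed


def is_update(old, new, threshold=100):
    """True if at least `threshold` characters were added or removed."""
    added, removed = changed_chars(old, new)
    return added + removed >= threshold
```

Unlike comparing `len()` values, this counts a substitution as both a removal and an addition, so a rewrite that leaves the length unchanged is still detected. `SequenceMatcher` can be slow on large inputs, so comparing line by line may be more practical for big pages.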

any other suggestions most welcome
harry





Re: notification in python

2010-09-27 Thread Erik Cederstrand

On 27/09/2010 at 19:02, harryos wrote:

> thanks for the pointer
> I am trying to get something similar to changedetection but with
> hourly updates.
> I need to get updates from a number of sites, so I was wondering how
> to implement an updating utility

You could also try looking at the HTTP headers for a request for e.g. 
"index.htm" using urllib, specifically "Expires" and "Last-Modified". This 
would let you ignore e.g. banners and flash content etc. as they are fetched in 
separate requests. If you want to go really lightweight and fast, do a HEAD 
request instead of a plain GET. It's easy to look at the headers a specific 
site is sending with e.g. the Firebug plugin for Firefox.
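
A minimal sketch of the HEAD approach, assuming Python 3, where the relevant functions live in `urllib.request`. The helper names `head_request`, `fetch_headers`, and `headers_changed` are made up for this example. By default only "Last-Modified" is compared, on the assumption that some servers recompute "Expires" on every response.

```python
import urllib.request

WATCHED_HEADERS = ("Last-Modified", "Expires")


def head_request(url):
    """Build a HEAD request so that no response body is transferred."""
    return urllib.request.Request(url, method="HEAD")


def fetch_headers(url, timeout=10):
    """Send the HEAD request and return the watched headers as a dict."""
    with urllib.request.urlopen(head_request(url), timeout=timeout) as response:
        return {name: response.headers.get(name) for name in WATCHED_HEADERS}


def headers_changed(previous, current, names=("Last-Modified",)):
    """True if any of the named headers differs from the stored value."""
    return any(previous.get(name) != current.get(name) for name in names)
```

A caller would store the dict returned by `fetch_headers` and pass it as `previous` on the next run. A missing header comes back as `None`, so a site that sends neither header never appears to change.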

Using header values requires that you can trust the site on the header 
content. Web servers and caching proxies can do all sorts of things with the 
headers. Otherwise, saving the hash of the raw HTML (without GIFs etc.) as 
suggested is a good approach, depending on what your definition of "updated" is.

Kind regards,
Erik



Re: notification in python

2010-09-27 Thread Shawn Milochik
I did a quick Google search and didn't find anything that was obviously solving 
this problem. I did see companies that sell this service, and probably with 
good reason.

Due to dynamic content such as ads, data from RSS feeds, and simply 
auto-generated content from server-side code, it seems that you'd almost have 
to customize the configuration on a site-by-site basis. It would be difficult 
to distinguish between changes to the content you're interested in monitoring 
and all the other stuff. 

On the other hand, it seems that this functionality is something that a lot of 
people might want, so don't stop looking. If you're sure there's no open-source 
solution out there, maybe you can create it and put out a call for contributors 
on this list.

Shawn




Re: notification in python

2010-09-27 Thread harryos
thanks for the pointer
I am trying to get something similar to changedetection but with
hourly updates.
I need to get updates from a number of sites, so I was wondering how
to implement an updating utility
harry

On Sep 27, 9:16 pm, Shawn Milochik wrote:
> If you're asking for functionality like this: http://www.changedetection.com/
>
> Or are you looking for something to embed in your own code to know when 
> something has happened on your own site?
>
> If the former, you can probably do it by scheduling a urlopen and saving its 
> hash, comparing it each time. If the latter, you can use the logging module.
>
> Shawn




Re: notification in python

2010-09-27 Thread Shawn Milochik
If you're asking for functionality like this: http://www.changedetection.com/

Or are you looking for something to embed in your own code to know when 
something has happened on your own site?

If the former, you can probably do it by scheduling a urlopen and saving its 
hash, comparing it each time. If the latter, you can use the logging module.
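
The urlopen-and-hash idea could look roughly like this in Python 3. The function names `page_hash` and `check` and the choice of SHA-256 are assumptions made for this sketch. Scheduling is left to an external tool such as cron, and storing the hash between runs is left to the caller.

```python
import hashlib
import urllib.request


def page_hash(body):
    """Return a SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(body).hexdigest()


def check(url, saved_hash, timeout=10):
    """Fetch url and return (changed, new_hash)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        new_hash = page_hash(response.read())
    return new_hash != saved_hash, new_hash
```

Any change to the page, however small, produces a different hash, so this detects that something changed but not how much.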

Shawn
