Re: Any interest in an HTML DOM in C / C# / similar? (was Re: State of the AJAX Union)

2006-11-24 Thread John J Lee
Can you explain how wrapping a C/C# DOM implementation and using that from 
Perl is in conflict with using Perl's regex engine?



John

On Thu, 23 Nov 2006, Christopher Hart wrote:


For one particular application, I need the speed of Perl's regex engine and
I have not been able to match it in C# or even with limited attempts using
C++ and Boost's regex library.  So, regardless of feature availability on
other platforms, I'm going to continue pursuing DOM & JS functionality in
Perl extending WWW::Mechanize.

That said, I do a lot of *other* work in C# and what you suggest would be
useful.  If you already have something in the works (you mentioned
publishing some code), I'd be interested in learning more.

On 11/23/06, John J Lee <[EMAIL PROTECTED]> wrote:


On Wed, 22 Nov 2006, Christopher Hart wrote:

> Would an "easier" (yet still monumental) starting point be to tackle the
> DOM implementation independent of a JS engine?
[...]
> This seems like a great open source project - it's way too much to
> handle for most individual developers, but I think could be tackled with
> a moderately organized team of folks with a good design laid out in
> advance.
[...]

So, having pooh-poohed Christopher's proposal, I'm going to make a very
similar proposal of my own.

Give me some slack, I have the excuse of actually having put in some
implementation effort on this, and published the code :-) (but not in
Perl).  And I'm asking with a view to writing some more code myself
(albeit only a slim chance of that happening).

So, my question / semi-serious proposal is: is anybody here interested in
collaborating in writing a portable HTML DOM library and DOM builder in a
language *other* than Perl, with the intent of wrapping it for Perl and
other languages (Python is my own interest)?

I'm afraid I'm not interested in Perl 6's cross-language support, though.
If it were me working on it, it would probably have to be C, C++ or C# (or
possibly Java, or something weirder like Caml).  Cross-language use is one
reason for my thinking about doing it in one of those languages.  Memory
usage and execution speed is another.  I lean towards C or C#.


John






Any interest in an HTML DOM in C / C# / similar? (was Re: State of the AJAX Union)

2006-11-23 Thread John J Lee

On Wed, 22 Nov 2006, Christopher Hart wrote:


Would an "easier" (yet still monumental) starting point be to tackle the DOM
implementation independent of a JS engine?

[...]

This seems like a great open source project - it's way too much to handle
for most individual developers, but I think could be tackled with a
moderately organized team of folks with a good design laid out in advance.

[...]

So, having pooh-poohed Christopher's proposal, I'm going to make a very 
similar proposal of my own.


Give me some slack, I have the excuse of actually having put in some 
implementation effort on this, and published the code :-) (but not in 
Perl).  And I'm asking with a view to writing some more code myself 
(albeit only a slim chance of that happening).


So, my question / semi-serious proposal is: is anybody here interested in 
collaborating in writing a portable HTML DOM library and DOM builder in a 
language *other* than Perl, with the intent of wrapping it for Perl and 
other languages (Python is my own interest)?


I'm afraid I'm not interested in Perl 6's cross-language support, though. 
If it were me working on it, it would probably have to be C, C++ or C# (or 
possibly Java, or something weirder like Caml).  Cross-language use is one 
reason for my thinking about doing it in one of those languages.  Memory 
usage and execution speed is another.  I lean towards C or C#.



John



Re: State of the AJAX Union

2006-11-23 Thread John J Lee

On Wed, 22 Nov 2006, Christopher Hart wrote:


I agree that folks have been talking about JS for a long time, and that it's
frustrating, but what I'm suggesting is that we need to tackle a different
problem first.

[...]

An HTML DOM implementation is a necessary part of JS support, sure (though 
Stefan points out that for web app testing -- as opposed to scraping -- an 
XML DOM may be sufficient for some purposes).


Forgive me for being blunt, but that's not the blinding flash of insight 
that's needed 


What's needed is for somebody to actually write and publish some code -- 
and then for people to *keep on* working on it.  No doubt it sometimes 
happens, but I've never seen a "free-range" open source project (as 
opposed to one started within a company) start up based on one person's 
plan and another's code.


perspiration-not-inspiration-ly y'rs,


John



Re: State of the AJAX Union

2006-11-23 Thread John J Lee

On Wed, 22 Nov 2006, Christopher Hart wrote:


I'm willing to take a crack at laying out a vision, high level objectives
and some implementation requirements based on my experiences and see how

[...]

Everyone who's seriously interested is willing to do that.  Indeed, many 
have surely done that already, including myself.



John



Re: State of the AJAX Union

2006-11-23 Thread John J Lee

On Wed, 22 Nov 2006, apv wrote:


I've also been interested for a long time and tried to work on
this 2 years ago but didn't get far enough to bother trying
to release anything.

[...]

I would gladly throw down if there was a group effort with a
real plan. I'm not the right hacker to lead this project though.


Publish and be damned!

Doesn't matter if you don't want to lead something.  Somebody has to get 
the ball rolling if it's to happen, so why not you?



John



Re: State of the AJAX Union

2006-11-22 Thread John J Lee

On Wed, 22 Nov 2006, Stefan Seifert wrote:
[...]

I too thought about that. Maybe using the JavaScript or
JavaScript::Spidermonkey module and XML::DOM. I will certainly
experiment around with them, as we need it at work. Doesn't seem to be


Sigh, we've had this same little discussion at least five times here.

The browser object model is not the XML DOM.  It is the HTML DOM (which is 
ill-defined in practice, and is not really a superset of the XML DOM), 
plus other stuff.  There is currently no implementation of it outside of 
browsers.  Plus you have to build the damned DOM in the first place :-)




too hard to me, but of course, I'm underestimating that :)


Yes.

As I've said many times before here, getting something working is not too 
hard, getting something useful is harder (how much depends on the 
audience, I guess), getting something good is a lot of work.  Maybe this 
is universally true, but especially so of JS support for LWP :-)



John



Re: State of the AJAX Union

2006-11-22 Thread John J Lee

On Fri, 3 Nov 2006, Christopher Hart wrote:


I know there is a rich history of challenges implementing any kind of
JavaScript interpretation using Mechanize or any other web
scripting/automation utility, but I was wondering if anyone has tried to
focus on "Mechanizing" AJAX?

I realize this would take at least some degree of JavaScript interpretation
and most likely some kind of internal DOM representation to maintain the


No doubt you could profitably concentrate your implementation / bug-fixing 
effort on the DOM features you're interested in, but I don't think there's 
any terribly obvious closed subset of the DOM &c. that you could implement 
and save yourself lots of work as compared with implementing the full 
monty.




state of the page, and that it's probably extraordinarily challenging.


Probably only extraordinarily challenging in that it involves lots of work 
-- that's why nobody has done it :-)




Nonetheless, with the increasing popularity of AJAX, it seems like it
eventually needs to be done.  I'm watching more and more of the sites I've
written automation for slowly migrate to AJAX and it's getting increasingly
difficult to work around these designs.

[...]

Whether or not it "needs" to be done, it won't be, unless somebody steps 
up to do it.



John



Re: using LWP getting a PDF file which comes up blank

2006-08-09 Thread John J Lee

On Tue, 8 Aug 2006, Churton Budd wrote:
[...]
using LWP for display within this portal.  When I get the return of the 
PDF, the adobe acrobat plugin pops up but it comes up blank.  For multi 
page ECG's it comes up with multiple blank pages.  I have saved this 
blank file and looked at it, it seems like it has the same size as a PDF 
which displays in the inherent web application.  Loading this saved file 
into Adobe from the desktop, its still blank. Some characters hex code 
are different though (so I'm not sure this is an encoding issue).


Anyone have any thoughts why the PDF file displays

[...]

Didn't read your code, but: Are you processing the PDF file at any stage 
as a text file?  If so, stop doing that :)
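
To illustrate the failure mode I'm guessing at (I haven't seen the
poster's code, and the name save_binary is made up for this example), here
is a minimal Python sketch: keep the response body as bytes from end to
end and write it in binary mode, so nothing translates newlines or
re-encodes characters along the way.

```python
import os
import tempfile


def save_binary(data: bytes, path: str) -> None:
    """Write the body exactly as received; 'wb' avoids any text translation."""
    with open(path, "wb") as fh:
        fh.write(data)


# A stand-in for a PDF body: a header line plus bytes that are not valid text.
fake_pdf = b"%PDF-1.4\r\n%\xe2\xe3\xcf\xd3\r\n"

path = os.path.join(tempfile.mkdtemp(), "ecg.pdf")
save_binary(fake_pdf, path)

# Read it back in binary mode to check that nothing was altered.
with open(path, "rb") as fh:
    round_tripped = fh.read()
```

The same principle applies in Perl: treat the content as binary at every
step between the HTTP response and the file or browser.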



John


Re: Query Results on Multiple Pages

2006-07-11 Thread John J Lee

On Mon, 10 Jul 2006, flynfast wrote:


I'm trying to write a script to send a query to the patent and
trademark office webpage and capture the URL's pointing to the patents
identified.  The problem is that the results appear on more than one
page (like Google lists its results on multiple pages).  How do I write
a script that will access the other pages?


I hereby wager 50 British pence that you'll find that on CPAN.  Have you 
looked for a patent search module there?


(the winner must collect his/her winnings in person ;-)


John


Re: Java script FAQ revised

2006-04-06 Thread John J Lee

On Thu, 6 Apr 2006, Peter Stevens wrote:
[...]

One typical use of Javascript is to perform argument checking before
posting to the server. The URL you want is probably just buried in
the Javascript function. Do a regular expression match on
$mech->content() to find the link that you want and $mech->get
it directly (this assumes that you know what you are looking for in
advance).

In more difficult cases, the Javascript is used for URL mangling to
satisfy the needs of some middleware. In this case you need to
figure out what the Javascript is doing (why are these URLs always
really long?). There is probably some function with one or more
arguments which calculates the new URL.

[...]

Another very common thing that's important for would-be scrapers is 
manipulation of forms (adding form controls and list items, submitting 
forms).  In a sense that's just URL manipulation, of course, but in the 
FAQ it might be useful to draw people's attention to this specific case.


Script can also set cookies.
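
For the simple case in the quoted FAQ text (a URL buried in a script
function), the regex approach can be sketched in Python roughly as
follows.  The page URL, the HTML and the pattern are all invented for the
example; a real page will need its own pattern, and this does nothing for
the form-manipulation and cookie cases.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://www.example.com/app/index.html"  # hypothetical page

HTML = """
<script>
function go() { if (check()) { window.location = '/app/report.cgi?id=42'; } }
</script>
<a href="javascript:go()">Report</a>
"""

# Assumes the script assigns a string literal to window.location.
LOCATION_RE = re.compile(r"""window\.location\s*=\s*['"]([^'"]+)['"]""")


def find_script_url(html, base_url):
    """Return the absolute URL assigned in the script, or None if not found."""
    match = LOCATION_RE.search(html)
    if match is None:
        return None
    return urljoin(base_url, match.group(1))


target = find_script_url(HTML, PAGE_URL)
```

The resulting URL can then be fetched directly, without running the
script at all.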


John



Re: Java script FAQ [was Re: :Mechanize]

2006-04-06 Thread John J Lee

On Wed, 5 Apr 2006, Mike Schilli wrote:
[...]

As soon as someone gets going and comes up with a reference implementation
(every browser naturally has its own DOM implementation, that's why IE
and Firefox behave differently at times), WWW::Mech is in business.

How cool would that be!

[...]

Sadly, that's not something that's going to come out of the Mozilla camp:

http://article.gmane.org/gmane.comp.mozilla.devel.dom/4227


John



Re: Javascript Execution

2005-12-18 Thread John J Lee
On Sat, 17 Dec 2005, Andy Lester wrote:

> On Sat, Dec 17, 2005 at 12:16:29PM -0500, Christopher Hart
> ([EMAIL PROTECTED]) wrote:
> > There are also JavaScript engines available in C and Java
> > (SpiderMonkey and Rhino, respectively, available on mozilla.org).  You
> > may be able to leverage those.
> 
> I didn't know about SpiderMonkey.  I'm going to have a look at it to see
> if it will fit into WWW::Mechanize.

Hi Andy

As I've posted about here before a few times (search Gmane), I actually
did this with my Python port of WWW::Mechanize a few years back, using
spidermonkey.  My implementation was a first-cut half-baked thing, but I
did get it working for a few pages.  I decided that was enough excitement
for me ;-)  I know a few people used it for projects of their own and
improved on it a bit, though (eg. one guy used it in a college project to
make JS-using pages accessible on non-JS devices, by having a proxy server
and executing the JS there -- nice idea).  The code is still available at
wwwsearch.sf.net

I made use of the Perl wrapper of SpiderMonkey to write something very
similar for Python.  IIRC, I had to extend it a little over what was in
the Perl thing.

I used an existing HTML DOM, but had to modify both the DOM, and of course
the DOM builder (and add event stuff and browser object model).  This is
where the work lies :-)  If you intend to try this, and you're not
intimately familiar with the bizarre ways in which people can and do use

Re: URI::javascript and LWP::Protocol::javascipt...who'done it?

2005-11-19 Thread John J Lee
On Fri, 18 Nov 2005, Christian Montanari wrote:
[...]
> My Quest has been in the dream of many others already.
> It is all about tackling this javascripting through WWW::Mechanize but, tell
> me if my ideas about this topic is wrong, it seems that no good souls has
> ever yet done it!
[...]

http://search.gmane.org/search.php?group=gmane.comp.lang.perl.modules.lwp&query=javascript

http://thread.gmane.org/gmane.comp.lang.perl.modules.lwp/1285

http://article.gmane.org/gmane.comp.lang.perl.modules.lwp/1107/match=javascript

Win32::IE::Mechanize


John


Mailing list archives 2001-2005?

2005-08-12 Thread John J Lee
I've lost the archives for this list again.

I'm sure somebody has one on the web.  Can anybody point me to it?  There
are lots of links to old sites that stop in 2001, and GMANE seems to start
in 2005, but I can't find anything between 2001 and 2005.

Cheers


John


Re: Bug in cookies in libwww-perl-5.803

2005-07-31 Thread John J Lee
On Thu, 28 Jul 2005, Mysql user wrote:

> I'm trying to write a perl program to access the configuration of a VOIP
> telephone through its web interface. The web interface assigns you a
> session id cookie once you've logged in. It works with browsers but not
> with libwww-perl5.803 as shipped with Fedora Core 4.
>
> Here is the set-cookie header:
> Set-Cookie: SessionId="ab6931f2c09b05c9"; Version=1; Path=/
>
>
> Here is the correct cookie being sent (by Firefox):
> Cookie: SessionId="bf754500cc94652f"
>
> Here is the cookie being sent by LWP:
> Cookie: $Version=1; SessionId="\"ab6931f2c09b05c9\""; $Path="/"
[...]

Have you tried turning off RFC 2965 handling?

Even then, perhaps Version 1 cookies won't be downgraded to V0 cookies,
I don't recall.  But try it and see.
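
I can't show the LWP switch from memory, but here is what the same
experiment looks like with Python's cookie jar (http.cookiejar in current
Python).  This is a sketch of the Python behaviour only, not a claim about
what HTTP::Cookies does; FakeResponse and the phone URL are made up so
that it runs without a network.

```python
import email
import http.cookiejar
import urllib.request


class FakeResponse:
    """Just enough of a response object for CookieJar.extract_cookies()."""

    def __init__(self, raw_headers):
        self._headers = email.message_from_string(raw_headers)

    def info(self):
        return self._headers


URL = "http://phone.example/config"  # hypothetical device URL

# RFC 2965 handling switched off explicitly (it is also the default here).
policy = http.cookiejar.DefaultCookiePolicy(rfc2965=False)
jar = http.cookiejar.CookieJar(policy)

response = FakeResponse(
    'Set-Cookie: SessionId="ab6931f2c09b05c9"; Version=1; Path=/\n\n'
)
jar.extract_cookies(response, urllib.request.Request(URL))

# Ask the jar which Cookie header it would send on the next request.
next_request = urllib.request.Request(URL)
jar.add_cookie_header(next_request)
cookie_header = next_request.get_header("Cookie")
stored = list(jar)
```

Comparing cookie_header with what the browser sends is the quickest way to
see whether the jar is still adding $Version and $Path attributes or extra
quoting.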


John


Re: Javascript and WWW::Mechanize or LWP::UserAgent?

2005-07-15 Thread John J Lee
On Thu, 14 Jul 2005, Peter Stevens wrote:

> Under the heading of small serious amounts of work...
>
> I mentioned previously Win32::IE::Mechanize - does anybody have any
> ideas on how to do the same thing with Firefox under Linux?
[...]

Warning: I'm not up-to-date on this, take what I say with a pinch of salt.

I see you want linux, but first a comment about doing this on Windows:

I believe there's a (MS-)COM IWebBrowser2 wrapper of Firefox's XP-COM
interfaces, so *in theory* you should be able to point
Win32::IE::Mechanize at that under Windows.  I wouldn't be at all
surprised if it were much harder than it should be, though (due to the
complexities of COM, XP-COM and Firefox rather than the Perl side
particularly)...  Also, note the comment below about XP-COM only
supporting in-process clients -- not sure exactly how one does things,
given this.

Under linux, I guess you'd have to do one of:

1. Extend the Perl module to interface with XP-COM direct (note XP-COM,
unlike COM, is in-process only IIUC, so I guess you have to rebuild
Firefox with your new code, which may be a "wonderful learning
experience", even if you *are* a battle-hardened C++ veteran ;-)

2. Build Perl support into Firefox.  I don't know if such functionality
still exists in Firefox (there used to be at least some support for Perl,
but I have a feeling that was a long time ago, not sure if it's still
there...).  Also, I don't know if it allowed external processes to talk to
the browser.

3. Forget Perl and just write what you want in JavaScript.  Not ideal, I
know, but practical: obviously JS support is excellent in Firefox.  See
Selenium for inspiration.


John


Re: Javascript and WWW::Mechanize or LWP::UserAgent?

2005-07-15 Thread John J Lee
[John Lee]
> That's not a small amount of work you've just set Warren to do.  :-)
>
> (speaking as somebody who made a semi-serious attempt at it, in Python)

[deborah sciales]
> Well, I guess it depends on his set of needs, and he does have
> tokeparser and treebuilder, etc to use.
>
> If his javascript is inside of script tags, he can use treebuilder to
> get those nodes, and then work with them. I see a host of Javascript
> modules on CPAN.
>
> Here's why I would not try to write this in Python just yet, unless i
> had the time:
>
> Perl -MCPAN -e shell;
>
> cpan>  i/JavaScript/
>
>
> use a module from CPAN, use a few modules from CPAN, or patch and
> improve a module!

I don't think people here are interested in Python/Perl comparisons.

Nevertheless, the problems with the existing libraries for either language
are roughly the same, AFAICT (I too started with existing libraries.
That was kind of the easy part).

Are there a *specific* set of HTML parsing, HTML (not XML) DOM-building
(with all the 

Re: Javascript and WWW::Mechanize or LWP::UserAgent?

2005-07-14 Thread John J Lee
[Warren Pollans]
> The problem I'm running into is "trying to deal with scripts that use
> javascript" - so far, I've had to ignore them or, at least, those

[deborah sciales]
> You might also try writing your own javascript parsing routines?

That's not a small amount of work you've just set Warren to do.  :-)

(speaking as somebody who made a semi-serious attempt at it, in Python)


John


Re: Javascript and WWW::Mechanize or LWP::UserAgent?

2005-07-12 Thread John J Lee
On Mon, 11 Jul 2005, Warren Pollans wrote:

> I've been using WWW::Mechanize to automate testing of cgi scripts -
> works great!
>
> The problem I'm running into is "trying to deal with scripts that use
> javascript" - so far, I've had to ignore them or, at least, those
[...]
> I really like being able to test from my unix box instead of having to
> find a windows box to run quicktest pro or winrunner on.

Try Selenium instead of mechanize.  Not ideal for scraping (you have to
drag in the entire browser, including its GUI, to use it, and the "driven"
mode is currently buggy), but good for functional testing.  Written in
cross-browser JavaScript (in good OO style, too).

http://selenium.thoughtworks.com/


It's quite new, but it's the only free tool in its class (cross-browser
playback of functional tests) that I know of.  With any luck, somebody may
write a decent test recorder for it too (ie. record a test simply by doing
what you'd do if you were manually testing -- ATM there are HTTP-proxying
test recorders that try and do this, but one written in JS is what's
really needed).


John


Re: Authentication problem?

2005-03-14 Thread John J Lee
On Sat, 12 Mar 2005, Andrew Johnson wrote:
> I've been wrestling with a script to scrape some information off of
[...]
> What else could I try?
[...]

Hi Andrew

Read some past messages on this list from me.  I think I've made the same
guesses about fifty times now ;-/ and most of the debugging hints are
always taken from the same fairly small set.  Feel free to come back if
you've tried those and are still stuck, of course!

IIRC this list is on gmane now, so it should be easy to search.

Does this list have a FAQ, anybody?

My own FAQs (Python, not Perl, but that's not really all *that* relevant):

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html
http://wwwsearch.sourceforge.net/ClientCookie/doc.html#debugging

Other standard responses:

1. use WWW::Mechanize

2. perldoc lwpcook


John


Re: Architectural Question rgd LWP::UserAgent, WWW::Mechanize

2005-03-04 Thread John J Lee
On Sat, 5 Mar 2005, Robert Barta wrote:

> I am using WWW::Mechanize/LWP and some of their subclasses now for
> several things and I see an architectural problem I will be facing in
> some future:
> 
> For downstream developers (and for me) I need to offer a facility
> to choose a user agent which supports a number of features:
> 
>   - local caching
> 
>   - specialized cookie handling for specific web sites
> 
>   - scripting (controlling the user agent via a dedicate language
> and not via Perl method calls to WWW::Mechanize).
> 
>   - triggering of application specific code at particular events
> (page loaded, link selection, page unload)
> 
>   - maybe optional JavaScript/DOM coverage later
> 
> Now much of this functionality is already there (I have implemented
> scripting recently), but somehow spread over several packages in
> incompatible ways. But for a downstream developer it is not possible
> to say something like this:
>
>   my $ua = new LWP::UserAgent::Pluggable;
>
>   $ua->add_plugin (new LWP::UserAgent::Plugins::Cache (size => '4M'));
>   $ua->add_plugin (new LWP::UserAgent::Plugins::Scriptable (plan => ...));
>   $ua->add_plugin (new LWP::UserAgent::Plugins::Hooks (
>   ('http://specialsite/page' => sub { do something; }));
>
> Does this make sense?

Yes!  Python's urllib2 works like this, so I'm sure looking at that is
well worth the time if you want something similar in Perl.  I extended it
in a fairly simple way in Python 2.4, and it now works quite nicely to
support all kinds of things (cookies, auth (various flavours), http, ftp,
gopher etc., refresh handling, referer handling, http-equiv, redirection,
seek()-able responses, robots.txt observance...) using a single,
relatively simple, plugin handler interface.  Caching (of both content and
connections) would naturally and easily fit into that.  Recently noticed
the yum package manager / urlgrabber developers have added more features
(what I assume are decent implementations of throttling, persistent
connections, mirror selection, etc ...), I assume mostly using the same
plugin handler system (though they're pretty application-focused).
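
To give a flavour of that handler interface, here is a small sketch using
urllib.request (the current name for urllib2).  The mem:// scheme and both
class names are invented so that the example runs without a network; the
naming convention (<scheme>_open, <scheme>_request, <scheme>_response) is
how the opener discovers what each plugin wants to handle.

```python
import email.message
import io
import urllib.request
import urllib.response


class InMemoryHandler(urllib.request.BaseHandler):
    """Answers mem:// URLs from a dict, so the sketch needs no network."""

    def __init__(self, pages):
        self.pages = pages

    def mem_open(self, req):
        headers = email.message.Message()
        headers["Content-Type"] = "text/html"
        body = io.BytesIO(self.pages[req.full_url])
        return urllib.response.addinfourl(body, headers, req.full_url, 200)


class TraceProcessor(urllib.request.BaseHandler):
    """A plugin that only observes: hooks run before and after the fetch."""

    def __init__(self):
        self.events = []

    def mem_request(self, req):
        self.events.append(("request", req.full_url))
        return req

    def mem_response(self, req, response):
        self.events.append(("response", response.code))
        return response


trace = TraceProcessor()
opener = urllib.request.OpenerDirector()
opener.add_handler(InMemoryHandler({"mem://demo/page": b"<html>hello</html>"}))
opener.add_handler(trace)

body = opener.open("mem://demo/page").read()
```

A cache, a hook dispatcher or a cookie processor would be further objects
of the same shape, added with add_handler() in the same way.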

There's no requirement to shoehorn everything into some elegant scheme in
order to enable customisation and re-use, though, is there?  Module
designs need effort expended to keep them open and reusable, true, but
that doesn't mean (mythical) perfect genericity (although really generic
interfaces can sometimes be just the ticket and very useful, as with
urllib2's handlers).  A few examples of where, despite urllib2's rather
nice handlers, I don't feel a need to fit into any grand generic
interface:

For cookie policy, I have (in ClientCookie, and now cookielib in Python
stdlib), CookiePolicy objects -- *not* a handler -- rather, each cookie
handler *has* a CookieJar, which *has* a CookiePolicy.
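
In today's Python stdlib that has-a chain looks roughly like this (a
sketch; the blocked domain is made up):

```python
import http.cookiejar
import urllib.request

# The policy decides which cookies are acceptable...
policy = http.cookiejar.DefaultCookiePolicy(blocked_domains=[".ads.example"])

# ...the jar holds the cookies and consults its policy...
jar = http.cookiejar.CookieJar(policy=policy)

# ...and the handler is the only part that plugs into the opener.
cookie_handler = urllib.request.HTTPCookieProcessor(jar)
opener = urllib.request.build_opener(cookie_handler)
```

Site-specific cookie behaviour then means swapping or subclassing the
policy object, with no change to the handler interface.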

Hooks as you describe might well be done best with explicit support from
standard handlers, I would guess (though I wouldn't know for sure 'till I
try).  Mind you, I have a couple of useful debug handlers, eg. for
printing redirected response bodies.

Never tried scripting, but I don't see any obvious reason for wanting that
as a plugin handler in the urllib2 sense (FWIW, never looked at it, but I
know there's a scripting system based on urllib2 + my libraries (in turn
based in large part on ports from LWP), called PBP).  I've not considered
more elaborate generic plugin systems that might offer the opportunity for
having eg. this kind of scripting as a plugin to some browser object (too
much else more valuable I could do first!), but maybe that'd be an
interesting idea to think about a bit.

In my port of WWW::Mechanize, I added simple methods back on top of the
urllib2 handler system, mostly for convenience of *removing* handlers
without rebuilding an opener object each time (eg.  
Browser.handle_refresh(handle)  -- where handle is a boolean arg).  Works
fairly nicely, I think.

I also started on Javascript support.  You need a browser model for that
(same goes for proper Referer handling, though eg. my
mechanize.HTTPRefererProcessor is written as an object that works just
like any other handler -- it just happens to use a Browser class in its
implementation), so the sort of handlers I refer to above aren't the main
issue.  See DOMForm and python-spidermonkey here:

http://wwwsearch.sourceforge.net/


Enough rambling.  Hope this helps stir you to write something interesting
and share it...


John


Re: R: Help!! I'm stuck! - using LWP for single sign on purposes

2005-03-01 Thread John J Lee
On Tue, 1 Mar 2005, Andrea Setti wrote:

> Thank you for the answer.
> 
> I had a look th WWW::Mechanize and it does almost everything that i need.
> 
> The only thing i cannot understand is: how can i forward the cookie to the
> real browser?
> I need to fetch it from the real login page and then forward it to the
> referrer...
[...]

You don't have to forward the cookies explicitly -- it's all done under
the covers automatically.  You just have to make sure it's switched on.

I don't recall if cookie handling is on by default in mechanize, though.  
I would imagine so, but don't trust me: read the docs.


John


Re: R: Help!! I'm stuck! - using LWP for single sign on purposes

2005-03-01 Thread John J Lee
On Tue, 1 Mar 2005, Peter Stevens wrote:

> HTTP::Cookies has two submodules. one for Mozilla browsers and one for 
> Microsoft browsers. Unfortunately the MS version does not support saving 
> the cookies. (BTW - everybody knows, Firefox is the better browser ;-) ).
[...]

Those are only needed if you want to interoperate with those browsers.  
Use HTTP::Cookies itself otherwise.


John


Re: Mechanize - redirect problem

2005-03-01 Thread John J Lee
On Fri, 25 Feb 2005, Martin Kos wrote:

> hi john
> 
> > It wants this header (or similar, but this is a minimal one):
> > Accept: text/html
> i have added this header and it just works!!! thanks a LOT!
> 
> > Maybe mechanize should send an Accept header by default?
> i think that would be a good idea for the text/html type.
> 
> > BTW, Martin: I debugged this by just looking at what Firefox sends.  Get
> > livehttpheaders.
> very handy firefox-plugin! i haven't knew it before.
> how have you "see" that mechanize is missing the accept-header and that 
> the servers "needs" it ? was it only a guessing because firefox sends it?

1. Blindly copied firefox headers that I noticed mechanize (in fact,
Python httplib/urllib2/mechanize) didn't send, or had obviously different
values (the latter, in the case of Accept).

2. Saw that it now worked.

3. Deleted hdrs until it stopped working again :-)
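
The three steps above are mechanical enough to sketch in Python.  This is
only an illustration: minimal_headers and fake_server_ok are made-up
names, and the fake server stands in for whatever "did the request work?"
check you actually use against the real site.

```python
def minimal_headers(headers, works):
    """Drop headers one at a time, keeping only those the server insists on.

    `works` is a caller-supplied callable: it takes a dict of headers, makes
    the request however you like, and returns True if the response was good.
    """
    needed = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if works(trial):
            needed = trial  # still fine without it, so leave it out
    return needed


# Stand-in for a real server that only cares about the Accept header.
def fake_server_ok(headers):
    return "text/html" in headers.get("Accept", "")


# Headers copied from the browser (values abbreviated for the example).
firefox_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-gb,en",
    "Accept-Encoding": "gzip,deflate",
}

result = minimal_headers(firefox_headers, fake_server_ok)
```

Removing one header at a time is slow but simple; if two headers are only
needed in combination, this still finds a working set, just not
necessarily the smallest one.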


John


Re: Mechanize - redirect problem

2005-02-24 Thread John J Lee
On Tue, 22 Feb 2005, Martin Kos wrote:
[...]
> i try to login to the page http://mymobile.sunrise.ch/ but it seems like 
> mechanize is not doing the redirect that is on the start site... if i 
> try with my browser or wget i get redirect to a page like 
> http://mymobile.sunrise.ch/portal/res/guest;jsessionid=HCCISJ1USYYSVQFIGZAXRAQ?paf_dm=full&paf_gear_id=11&?successURL=/portal/res/member%3Bjsessionid%3DHCCISJ1USYYSVQFIGZAXRAQ
> 
> i tried it with a simple "get" but it doesn't work and i don't see what 
> the problem could be... any idea what i'm doing wrong?

It wants this header (or similar, but this is a minimal one):

Accept: text/html
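
For comparison, this is how the same header is added on the Python side
with urllib.request (a sketch; nothing is actually fetched here):

```python
import urllib.request

URL = "http://mymobile.sunrise.ch/"

# Attach the one header the server turned out to insist on.
request = urllib.request.Request(URL, headers={"Accept": "text/html"})

# urllib.request.urlopen(request) would send it; left out so that the
# example does not depend on the network or on the site still existing.
accept_sent = request.get_header("Accept")
```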


Maybe mechanize should send an Accept header by default?

BTW, Martin: I debugged this by just looking at what Firefox sends.  Get
livehttpheaders.


John


Re: automating javascript data forms

2005-01-22 Thread John J Lee
On Tue, 18 Jan 2005, Edward Peschko wrote:

> hey all,
>
> I've got a data retrieval problem - I need to get data from a secure
> website (ie: https) which has forms using javascript.
>
> What base technology can I use to do this? Will LWP suffice?
>
> I can't believe this isn't a FAQ - I searched up and down for this,
> without luck. Is there an easy workaround around javascript?
[...]

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

the bit on "Embedded script is messing up my web-scraping. What do I do?"
applies just as much to Perl as to Python (with minor translation...).


John


Re: automating javascript data forms

2005-01-22 Thread John J Lee
On Wed, 19 Jan 2005, Peter Stevens wrote:
> Edward Peschko wrote:
[...]
> >If there was an integration between LWP and seamonkey, what form
> >of integration would people feel would be most useful?
[...]
> I think seamonkey integration would be a good thing and see it as an
> alternative to mech. Essentially the same methods as mech, but there
> would be two advantages:
>
>1. Since the browser is supported by (an increasing number of)
>   websites, there will be fewer issues of "it works under
>   Firefox/IE6/etc, but not with my script".
>2. support for javascript. A lot of sites use javascript to do
>   argument checking before dispatching to the actual link. I'd like
>   to invoke a method 'click on the button', have it do the
>   javascript and get/post/whatever the link. As it is, I have to use
>   hard coded URLs or do regex matching on the javascript to find
>   where the button actually posts to. Inelegant at best and fragile
>   at worst.
[...]

Do you mean spidermonkey (Mozilla's JavaScript interpreter)?

Or do you mean Mozilla itself, through XP-COM? (Wasn't seamonkey the
original project to get a working browser out of Netscape's source code?
Or is there some project now to make the Mozilla source code usable as a
library?)

The latter would be essentially a replacement for LWP, rather than
something that you would integrate with it.

If you mean the former, that doesn't remove the need for LWP and
mechanize.  I got a first attempt at automatic JavaScript interpretation
working for the Python port of mechanize and parts of LWP:

http://wwwsearch.sourceforge.net/DOMForm/
http://wwwsearch.sourceforge.net/python-spidermonkey/
http://wwwsearch.sourceforge.net/mechanize/


If there's a good HTML DOM parser for Perl, it will be fairly easy to get
something like this working with a few changes to Claes Jakobsson's
JavaScript module (Perl wrapper of spidermonkey, which I borrowed from
when doing the stuff above).

Never did anything with it, though: I think it would be a LOT of work to
make it work really well, certainly if nobody has already written a
browser-style (rather than standards-compliant!) HTML DOM for you.  I had
to hack a DOM together from somebody's unmaintained pre-standards
implementation of the HTML DOM.  The tree builder literally gave me a
headache (the version on my web site is certainly very incorrect, though
if anybody is interested in doing a Perl version, I can probably dig out
some patches that people sent me to make it work something approaching
correctly).

I don't want to put people off, though: a module that gives a useful level
of compatibility with real browsers that is much better than my effort is
quite doable in somebody's spare time, I think. It would be nice to see it
done well -- browsers are such heavy things to drag into your code when
all you want to do is fetch one lousy URL without poring over somebody
else's JavaScript!


John


Re: asp sessions

2005-01-05 Thread John J Lee
On Thu, 30 Dec 2004, [EMAIL PROTECTED] wrote:
[...]
> Why does WWW::Mechanize get directed to cookieerror.htm?

Beats me too.  I get the same problem with this site using Python's
urllib2 &c.  I tried near-identical headers to Firefox, and got the error
page.

Another guess: Sounds odd, I know, but I wonder if IIS/ASP.NET is
insisting on a persistent connection?  I haven't tested that theory, but I
do notice that neither Lynx, nor Python/urllib2, nor libwww-perl (IIRC on
the latter) use persistent connections, and all fail; Firefox does, and
succeeds.

[...]
> I tried to establish a session with the server using IO::Socket::SSL
> using the example from the POD. You can see the code at the following
> URI:
[...]

Dropping back to a low level is obviously the right plan of attack, but I 
can't help with the Perl code...


John


Re: asp sessions

2004-12-31 Thread John J Lee
On Thu, 30 Dec 2004, [EMAIL PROTECTED] wrote:

>
> I am trying to determine why the following commands to WWW::Mechanize::Shell 
> result as they do:
>
> [EMAIL PROTECTED] trwww]$ perl -MWWW::Mechanize::Shell -e 'shell'
> >get https://www.setsivr.odjfs.state.oh.us/Login.asp
> Retrieving https://www.setsivr.odjfs.state.oh.us/Login.asp(200)
> https://www.setsivr.odjfs.state.oh.us/cookieerror.htm>
[...]

A guess: JavaScript setting a cookie?


John


Re: Warnings in HTTP::Cookies

2004-09-27 Thread John J Lee
On Mon, 27 Sep 2004, Ed Avis wrote:
[...]
> It's difficult to produce a self-contained test case since this
> program spends its time hitting someone else's website.  I hope that
[...]

Easy but helpful would be to turn on HTTP::Cookies' debugging and post the
output (censored if necessary).


John


Re: Help, Please: Can't Get a Hold of

2004-09-13 Thread John J Lee
On Mon, 13 Sep 2004, Daniel E. Doherty wrote:
[...]
> Here is the javascript function that gets invoked:
>   function FormSubmit(objForm)
>   {
>   var strVersion = new String(navigator.appVersion);
>   var arrVersion = strVersion.split(" ");
>   var intVersion = new Number(arrVersion[0]);
> 
>   objForm.BrowserName.value = navigator.appName;
>   if (navigator.appName == "Netscape")
>   {
>   //alert("here");
>   objForm.action = "NSPrint.asp";
>   }
>   objForm.submit();
>   }
> 
> One of the solutions you recommend is to do in python what the script
> does.  It looks to me like this script just submits the form or prints
> for Navigator (though I don't know much about javascript).  It would
[...]

The last line submits the form.  The rest of it doesn't.

For example, objForm.action = "blah" sets the form element's action
attribute to "blah", thus causing objForm.submit() to submit to a
different URL.  Google for "HTML 4.01 spec", and read the documentation
for the Form element's action attribute.  Then try looking at the HTML
that contains the JavaScript function above, and figure out what
objForm.BrowserName.value = "whatever" does (search for "BrowserName" in
the HTML).
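
To make that concrete, here is a rough Python 3 sketch of doing by hand
what FormSubmit() does.  It is only an illustration: the dict layout, the
function name emulate_form_submit, the original action "Print.asp" and the
example.com URL are assumptions; only the BrowserName field and
"NSPrint.asp" come from the quoted script.

```python
from urllib.parse import urljoin


def emulate_form_submit(form, app_name, page_url):
    """Apply the same changes FormSubmit() makes, then say where to submit.

    `form` is a hypothetical stand-in: {"action": str, "fields": dict}.
    """
    fields = dict(form["fields"])
    action = form["action"]

    # objForm.BrowserName.value = navigator.appName;
    fields["BrowserName"] = app_name

    # if (navigator.appName == "Netscape") objForm.action = "NSPrint.asp";
    if app_name == "Netscape":
        action = "NSPrint.asp"

    # objForm.submit(): the action is resolved relative to the page's URL.
    return urljoin(page_url, action), fields


form = {"action": "Print.asp", "fields": {"BrowserName": ""}}
url, fields = emulate_form_submit(
    form, "Netscape", "http://www.example.com/reports/form.asp")
```

Whatever library then performs the submission needs both the changed URL
and the updated BrowserName value.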


John


Re: Help, Please: Can't Get a Hold of

2004-09-12 Thread John J Lee
On Wed, 8 Sep 2004, Daniel E. Doherty wrote:

> I hit a page on the FDIC website that allows me to download Bank
> Performance Reports, so-called "Call Reports."  I can fill in the fields
> on the page, but the button that kicks off the file transfer is
> generated by an HTML tag like this:
> 
> 
[...]
> Is this right?  I'm no HTML expert, but my 1998 O'Reilly book on HTML,
> covering HTML 4.0 states that  is a synonym for
> .

No, they're not synonymous.  Quoting from my FAQ on my Python module based
on HTML::Forms (see second bullet point -- your callback JavaScript
function is FormSubmit()):

http://wwwsearch.sourceforge.net/ClientForm/#faq

Why does .click()ing on a button not work for me? 

- Clicking on a RESET button doesn't do anything, by design - this is a 
library for web automation, not an interactive browser. Even in an 
interactive browser, clicking on RESET sends nothing to the server, so 
there is little point in having .click() do anything special here. 

- Clicking on a BUTTON TYPE=BUTTON doesn't do anything either, also by 
design. This time, the reason is that that BUTTON is only in the HTML 
standard so that one can attach callbacks to its events. The callbacks are 
functions in SCRIPT elements (such as Javascript) embedded in the HTML, 
and their execution may result in information getting sent back to the 
server. ClientForm, however, knows nothing about these callbacks, so it 
can't do anything useful with a click on a BUTTON whose type is BUTTON. 

- Generally, embedded script may be messing things up in all kinds of ways. 
See the answer to the next question. 


John


Re: How to simulate https secured login using lwp

2004-09-12 Thread John J Lee
On Wed, 8 Sep 2004, Joseph Alotta wrote:
[...]
> > thing, so I don't read any impoliteness into your request, but only
> > because of conscious effort not to.
[...]
> He did say "please".  I think his request was polite and to the point.
> 
> You are asking too much from non-native speakers.   Let's see how well 
> you do in his language.

You're right: I probably should have said that in private email (which I
guess would be OK, extrapolating from the fact that *I'd* rather be told
if my use of a language came across as impolite).

Sorry, Suryya: no impoliteness was meant on *my* end either!


John


Re: How to simulate https secured login using lwp

2004-09-08 Thread John J Lee
On Wed, 8 Sep 2004, Suryya Ghosh wrote:
> How to simulate https secured login using lwp?
[...]
> I have open ssl installed in my system and Net ssleay 2.25 installed in
> my system, but while logging in I am getting a response of 302 Moved
> Temporarily

Doesn't sound like an SSL problem.  Does it redirect you somewhere you
don't expect to end up?  Or are you asking how to follow the redirection?

[...]
> please provide me a code sample.

A friendly tip: no matter how politely phrased, directly commanding
people to give you code comes across as rude.  "Does anyone have any code
samples?" or "I'd be grateful if anyone can show me..." work fine.  I know
there are big differences between languages and cultures in this kind of
thing, so I don't read any impoliteness into your request, but only
because of conscious effort not to.


John


Re: Cookie2: $Version="1" by default? (fwd)

2004-08-10 Thread John J Lee
On Mon, 9 Aug 2004, Andy Lester wrote:
[...]
> LWP is RFC-compliant.  Gisle has done a marvelous job of making sure it
> does just what the RFC says.
>
> WWW::Mechanize is a subclass and superset of LWP that does more
> "browser-like" stuff.  Mechanize is meant as a browser in an object,
> whereas LWP does the strictly correct thing.
>
> That division of labor has served us well for a few years now.
[...]

I guess that makes a fair amount of sense.

RFC 2965 really is dodo-like in its dead-ness, though.  I think it's
unusual in that respect: Most RFCs are much healthier, even if they're
frequently ignored.

The only practical uses of 2965 code I can think of are on intranets
(*maybe*), and 2109 cookies (viz, those that have a Version attribute of 1
and arrive in a Set-Cookie: header).  AFAICT, it makes sense to treat 2109
cookies as if they were 2965 cookies.  Not sure if LWP does so, though,
and I don't imagine it's a burning issue in many users' minds.
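
For comparison only (this says nothing about what LWP does): Python 3's
standard http.cookiejar exposes the choice as a policy switch.  The sketch
below feeds one Version=1 Set-Cookie header through both settings;
FakeResponse, the cookie itself and www.example.com are made up for the
illustration.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response: only .info() is needed."""

    def __init__(self, set_cookie_value):
        self._headers = email.message.Message()
        self._headers["Set-Cookie"] = set_cookie_value

    def info(self):
        return self._headers


def cookie_versions_seen(rfc2965_enabled):
    """Return the versions the jar records for an RFC 2109 style cookie."""
    policy = http.cookiejar.DefaultCookiePolicy(rfc2965=rfc2965_enabled)
    jar = http.cookiejar.CookieJar(policy)
    request = urllib.request.Request("http://www.example.com/")
    response = FakeResponse("sid=abc123; Version=1; Path=/")
    jar.extract_cookies(response, request)
    return [cookie.version for cookie in jar]
```

With RFC 2965 handling off, the 2109 cookie is downgraded to a
Netscape-style (version 0) cookie; with it on, the cookie keeps version 1
and is handled as a 2965 cookie.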


John


Re: Cookie2: $Version="1" by default? (fwd)

2004-08-09 Thread John J Lee
[Juan]
> Why Cookie2: $Version="1" is still sent by default by LWP?
> No browser sends that header by default, neither MSIE, Mozilla nor Konqueror.
>
> I suggest to remove it (at least by default).
> I find much more useful to make LWP masquerade as MSIE instead
> following an RFC nobody follows.
[...]

I agree that RFC 2965 handling should probably be switched off by default.

Did you actually run into a problem though, or are you just paranoid?-)


John


Re: WWW:Mechanize help clicking button

2004-08-09 Thread John J Lee
On Thu, 5 Aug 2004, Joseph Alotta wrote:
[...]
> I think
> there is something going on in the java code in the first part.
[...]

That sounds like a fair bet.  Your next step is to figure out what
that something is.

(Actually, it's JavaScript code.  Java != JavaScript -- the two are quite
different (JavaScript is much better designed <0.5 wink>).)

Luckily, it really doesn't require any JavaScript knowledge to figure out
what the code does -- if you know Perl, you won't have any trouble
guessing what's going on.  You just have to knuckle down and read it.


John


Re: javascript, and cookie

2004-08-09 Thread John J Lee
On Mon, 9 Aug 2004, Richard Lawson wrote:
[...]
> I saw your post at libwww mail list but no answer.
> It's an old post but it's the closest thing in the archives to my issue.
> I have a similar problem where the page sets a client cookie and I need to
> set it in LWP, but I can't seem to confirm that the cookies are being sent.
[...]

1. Get a copy of ethereal

2. Turn on LWP's debug output
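
If the cookie turns out to be set by the page's script, the client has to
set it by hand.  A Python 3 sketch with the standard http.cookiejar module
(the cookie name, value and host are made up) that also checks, without
touching the network, which Cookie: header would be sent:

```python
import http.cookiejar
import urllib.request


def make_session_cookie(name, value, domain, path="/"):
    """Build the kind of cookie a page's script might set via document.cookie."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_header_for(jar, url):
    """Return the Cookie: header the jar would attach to a request for url."""
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)
    return request.get_header("Cookie")


jar = http.cookiejar.CookieJar()
jar.set_cookie(make_session_cookie("jsflag", "1", "www.example.com"))
```

Calling cookie_header_for(jar, url) for a URL on that host and for one on
another host shows whether the jar would actually send the cookie.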


HTH


John


Cookie2: $Version="1" by default? (fwd)

2004-08-09 Thread John J Lee
Juan asked me to forward this to this list.  (just this once, Juan; get
yourself a free email account to post from -- eg. fastmail.fm)


John

-- Forwarded message --
Date: Mon, 09 Aug 2004 21:27:18 GMT
From: JUANMARCOSMOREN <[EMAIL PROTECTED]>
To: John J Lee <[EMAIL PROTECTED]>
Subject: Cookie2: $Version="1" by default?

Could you post this message to <[EMAIL PROTECTED]> ?
I did not post there in the first place because perl.org has changed its
policy to not allow mails from terra.es:
> Recipient: <[EMAIL PROTECTED]>
> Reason:Mail from terra.es rejected because it does not accept bounces. This 
> violates RFC 821/2505/2821 http://www.rfc-ignorant.org/

--8<--

Why Cookie2: $Version="1" is still sent by default by LWP?
No browser sends that header by default, neither MSIE, Mozilla nor Konqueror.

I suggest to remove it (at least by default).
I find much more useful to make LWP masquerade as MSIE instead
following an RFC nobody follows.

Could people on this list at least express their feelings about this
subject? Do you want something that works or just something that
follows an RFC nobody follows.

Making LWP more MSIE compliant will make it more useful for everyone.

Juan





RE: url/query question...

2004-06-29 Thread John J Lee
bruce, please don't cross-post unless you have some valid reason for it.

On Mon, 28 Jun 2004, bruce wrote:
[...]
> however, if you examine the headers between the server/browser app, you can
> more or less.. see what's being transfered back/forth... in this case, the
> content/post data is available, and looks to be some ~6k of data...

Sure.


> it was my understanding that combining this information with the URl,
> """should""" be able to get to the targeted page.. assuming all things are
> equal..  however.. this does not appear to be the case..

There's no "should" here.  No standard (prescriptive or de-facto) says
that URL query string parameters and POST data are interchangeable.  In
many cases, server code merely *happens* to work that way.


> i've been able to successfully simulate what a post does with a number of
> sites, by simply combining the URL with the requisite data and dropping the
[...]

Yup, no surprise there.  Equally, no surprise that it *doesn't* work in
other cases.


John


Re: url/query question...

2004-06-28 Thread John J Lee
This is nothing to do with win32, so I've cut that list from the To: line.

On Sun, 27 Jun 2004, bruce wrote:
[...]
> i was under the impression that if i concatenated the url and the
> content/query from the headers, that i'd be able to "simulate" the submit

What do you mean by "the content/query from the headers"?  I guess you
mean the POST data?  POST data != header data.  An HTTP request contains
1. GET / POST line (containing the URL path), 2. headers, and 3. data.

If you're taking a POST request you sniffed by some means, and issuing the
corresponding GET request (GET /foo.cgi?post=data&goes=here HTTP/1.1),
then, yes, whether or not that works is indeed entirely dependent on the
way the code on the server was written.
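
The difference between the two requests can be shown without sending
anything.  A Python 3 sketch with the standard urllib (the host
www.example.com is a placeholder; foo.cgi and the field names come from
the example above):

```python
from urllib.parse import urlencode
from urllib.request import Request

FORM_DATA = {"post": "data", "goes": "here"}


def as_post(url, fields):
    """POST: the encoded fields travel in the request body, not in the URL."""
    return Request(url, data=urlencode(fields).encode("ascii"))


def as_get(url, fields):
    """GET: the same fields are appended to the URL as a query string."""
    return Request(url + "?" + urlencode(fields))


post = as_post("http://www.example.com/foo.cgi", FORM_DATA)
get = as_get("http://www.example.com/foo.cgi", FORM_DATA)
```

Here post carries the fields in its body and leaves the URL untouched,
while get has no body and a longer URL.  Nothing obliges a server to treat
the two alike.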

[...]
> with the stjohn's site, the header information indicates that ~6-8k of
> information is in the content portion of the URL. could this be correct??

Yes.


> when i try to stuff this much (cut/paste) into the browser url/address it
> cuts it off..

Don't do that, then.  Do a POST instead, using LWP.


> i was under the impression that you were limited with regards
> to the size of the content/query portion of the URL...
[...]

Apparently so.  POST data is not part of the URL, though.


John


Re: www::mechanize issues

2004-05-30 Thread John J Lee
On Sat, 29 May 2004, bruce wrote:
> hi...

Hi

[...]
> basically, i'm looking to be able to get class schedule information from the
> http://lca.lehman.cuny.edu/dept/registrar/schedule/coursefinder.asp site.
[...]

#!/usr/bin/perl -w

use WWW::Mechanize;

my $b = WWW::Mechanize->new();
$b->get("http://lca.lehman.cuny.edu/dept/registrar/schedule/coursefinder.asp");
$b->form_number(0);
print $b->current_form()->dump();
$b->field("u_input", "CHE");
$b->field("sortby", "Instructor");
# Save the result page so it can be inspected in a browser.
open(F, ">out.html") or die "can't write out.html: $!";
print F $b->submit()->content();
close(F);


or you could defect:

http://wwwsearch.sf.net/mechanize/

#!/usr/bin/env python

import mechanize

b = mechanize.Browser()
b.open("http://lca.lehman.cuny.edu/dept/registrar/schedule/coursefinder.asp")
b.select_form(nr=0)
print b  # I confess this is actually a bit of an accident (see below)
b["u_input"] = ["CHE"]
b["sortby"] = ["Instructor"]
f = open("out.html", "w")
f.write(b.submit().read())
f.close()

Why does 'print b' print the current form?  Because Browser delegates all
unknown attribute access to ClientForm.HTMLForm.  .__str__() is one such
method.  For almost everything, this is useful, but it's not really what
one wants in this particular case.  Since mechanize (the Python module) is
still alpha (though fairly dilute of bugs, I believe), I shall go away now
and add Browser.form_as_string() and Browser.__str__() methods :-)
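
The delegation described above can be sketched as follows.  This is not
mechanize's actual code: Form is a hypothetical stand-in for
ClientForm.HTMLForm, and the sketch targets present-day Python 3, where
print(b) would not show the form, because implicit lookup of special
methods such as __str__ bypasses __getattr__.  The accident described here
presumably depends on the old-style classes of Python 2.

```python
class Form:
    """Hypothetical stand-in for ClientForm.HTMLForm."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "<Form %s>" % self.name


class Browser:
    """Forwards unknown attribute access to the currently selected form."""

    def __init__(self, form):
        self.form = form

    def __getattr__(self, attr):
        # Called only when normal lookup on the Browser itself fails.
        return getattr(self.form, attr)

    def form_as_string(self):
        # Explicit replacement for relying on a delegated __str__.
        return str(self.form)


b = Browser(Form("search"))
```

Here b.name is answered by the form through __getattr__, while
b.form_as_string() gives the form's text without depending on how
__str__ happens to be looked up.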


John


Re: URI support for OpenURL

2004-05-13 Thread John J Lee
On Thu, 13 May 2004, Tim Brody wrote:
[...]
> To the best of my knowledge there aren't any other standards for the
> transport of bibliographic data through URIs.

Sounds fair enough to me.


> Besides that, OpenURL is
> likely to become the standard method of linking within the multi-billion
> dollar scholarly publishing industry:
> http://www.crossref.org/02publishers/16openurl.html
[...]

Oh joy.

racket-is-a-more-accurate-word-than-industry-ly y'rs


John


Re: cookie handling patch

2004-04-02 Thread John J Lee
On Thu, 1 Apr 2004, JUANMARCOSMOREN wrote:
[...]
> > >Aleksandr Guidrevitch wrote:
> > >>We've found that LWP incorrectly handles cookies
> > >>containing ';' in the cookie value.
> > >>The patch (test case and fix) is attached
[...]
> So, why do you want ';' in cookies if they are not handled
> correctly by the most used HTTP implementations (MSIE and Mozilla)?

Right.


> > According RFC in **quoted** string you can put almost anything.
> > See http://www.cse.ohio-state.edu/cgi-bin/rfc/rfc2109.html for
> > definition of cookie:
>
> People don't care much about the HTTP RFC what people really want is to
> be compatible with MSIE and Mozilla.
[...]

The algorithm in browsers has apparently always been pretty much 'split on
';'', so right again.
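
A sketch of that "split on ';'" behaviour in Python 3, assuming nothing
about any particular browser's real code; naive_cookie_split is a made-up
helper that shows why a ';' inside a quoted value does not survive:

```python
def naive_cookie_split(set_cookie_value):
    """Split a Set-Cookie value on ';' with no regard for quoting."""
    parts = [part.strip() for part in set_cookie_value.split(";")]
    name, _, value = parts[0].partition("=")
    attributes = [part for part in parts[1:] if part]
    return name, value, attributes


name, value, attributes = naive_cookie_split('pref="a;b"; path=/')
```

The quoted value "a;b" is cut at the first ';', so the cookie value ends
up as just the opening quote plus "a", and the remainder is mistaken for
an attribute.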


John


ANN: mailing list for Python web client / URL programming

2004-03-09 Thread John J Lee
[yeah, I know this is a Perl list, but I thought people here might be
interested, since I like to follow the Perl list]

A new list for discussion of anything related to either web-client
software or URL-processing / -fetching software written in Python.

This includes, but is not limited to, the software at the wwwsearch.sf.net
site.

To subscribe or post messages to the list
([EMAIL PROTECTED]), visit the Mailman Info Page:

http://lists.sourceforge.net/lists/listinfo/wwwsearch-general


John


Re: HTTP traffic? (use LWP::Debug qw(conns); not working)

2004-01-25 Thread John J Lee
On Sun, 25 Jan 2004, Philippe 'BooK' Bruhat wrote:
> Le dimanche 25 janvier 2004 à 21:22, John J Lee écrivait:
> >
> > > In fact, you can already use HTTP::Proxy to see inside a HTTPS connection:
[...]
> > Any recommendations for a specific one?
>
> Well, I was talking about my pet module, HTTP::Proxy. Version 0.12 on a
> CPAN mirror near you. ;-)
[...]

Gah, sorry, not reading carefully again, am I?

Thanks.


John


Re: HTTP traffic? (use LWP::Debug qw(conns); not working)

2004-01-25 Thread John J Lee
On Sun, 25 Jan 2004, Philippe 'BooK' Bruhat wrote:

> Le dimanche 25 janvier 2004 à 15:46, John J Lee écrivait:
> >
> > BTW, anybody have any tips on software / usage thereof for HTTPS proxying,
> > for debugging purposes, and how to set up with LWP?  I've always used
> > browser plugins or debugging output from (Python) code until now.
> >
>
> Well, in June 2004, I suppose HTTP::Proxy will support working in a man
> in middle manner, so this kind of thing should be quite easy to do.

Whoops -- I scanned that thread, then forgot about it ;-)  Thanks.


> (I have yet to understand how to use Net::SSLeay, though.)

Worked out of the box for me (I'm running somebody else's code, though).


> In fact, you can already use HTTP::Proxy to see inside a HTTPS connection:
> set HTTPS_PROXY to point to your HTTP::Proxy proxy, use env_proxy
> with your LWP::UA object. LWP::UA does a GET https://www.example.com/
> to the proxy, which will fetch the data with SSL, and return it in a
> plain (cleartext) HTTP session.
[...]

Any recommendations for a specific one?

Hmm, do these proxies check certificates / revocation lists when they do
that (not that I care, particularly -- just curious)?  What happens when
they fail, if so?


John


HTTP traffic? (use LWP::Debug qw(conns); not working)

2004-01-25 Thread John J Lee
Attempting to look at the network traffic generated by a Perl program that
uses LWP for doing HTTPS POSTs, I put this in the driver script:

use LWP::Debug qw(conns);


But, though I see some debugging messages, I don't actually see the HTTP
headers or body data.  Same happens with plain HTTP (no SSL involved). The
docs say:

 conns   : show all data transfered over the connections


I don't get HTTP headers or data from this, either:

use LWP::Debug qw(+);


What's wrong?

I'm using LWP 5.69, Perl 5.6.1, Debian 2.2.

BTW, anybody have any tips on software / usage thereof for HTTPS proxying,
for debugging purposes, and how to set up with LWP?  I've always used
browser plugins or debugging output from (Python) code until now.
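
For reference, the Python-side debugging output mentioned above can be
sketched with the standard http.client module in Python 3.  This is not an
LWP fix.  To stay self-contained the sketch talks to a throwaway server on
127.0.0.1; QuietHandler, fetch_with_wire_trace and the User-Agent string
are invented for the example.

```python
import contextlib
import http.client
import http.server
import io
import threading


class QuietHandler(http.server.BaseHTTPRequestHandler):
    """Tiny local stand-in for a real site, so no network is needed."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the server's own logging out of the way


def fetch_with_wire_trace(path="/"):
    """Fetch path from a local server; return (status, debug trace text)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), QuietHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    trace = io.StringIO()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
        conn.set_debuglevel(1)  # http.client prints what it sends and receives
        with contextlib.redirect_stdout(trace):
            conn.request("GET", path, headers={"User-Agent": "sketch/0.1"})
            response = conn.getresponse()
            response.read()
        conn.close()
        return response.status, trace.getvalue()
    finally:
        server.shutdown()
        server.server_close()
```

The trace contains a "send:" line with the request line and headers,
followed by "reply:" and "header:" lines for the response.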


John


Re: Problem logging on to site with MECHANIZE

2004-01-23 Thread John J Lee
On Fri, 23 Jan 2004, Gedanken wrote:

> On Fri, 23 Jan 2004, John J Lee wrote:
>
> > Yuck.  Does it also work if you wave a dead chicken at it? ;-)
> >
> > Why not check the HTTP headers to find out what's going wrong?
>
> the headers are identical as far as i can tell.  after all, the code
> snippet i sent doesnt actually change anything.  whether its mechanize
> having problems or the javascript on the servers, i have not a clue.
>
> i agree with your chicken waving comment, i just dont have a better
> explanation.

Forgot to add: the browser must be doing it "right", and if the browser
reloads, it's not doing that because it happens to feel like it: There
must (presumably!) be some reason -- even if bogus -- why it does so.


John


Re: Problem logging on to site with MECHANIZE

2004-01-23 Thread John J Lee
On Fri, 23 Jan 2004, Gedanken wrote:

> On Fri, 23 Jan 2004, John J Lee wrote:
>
> > Yuck.  Does it also work if you wave a dead chicken at it? ;-)
> >
> > Why not check the HTTP headers to find out what's going wrong?
>
> the headers are identical as far as i can tell.  after all, the code
[...]

Did you actually check to make sure (eg. with ethereal)?


John


Re: Problem logging on to site with MECHANIZE

2004-01-23 Thread John J Lee
On Fri, 23 Jan 2004, bzzt wrote:

> I'm trying to log on to this site (www.thecityvibe.com/forum/) with the
> following script but doesn't seem to succeed. Anyone knows what the problem
> might be?

(without reading your script): no cookie jar?

I don't recall if WWW::Mechanize makes one by default if none is supplied
to the constructor.  If not, that could be your problem.  Look at the HTTP
headers.

What do you get back from the server?


John


Re: Different outcomes with same request

2004-01-23 Thread John J Lee
On Fri, 23 Jan 2004, Justin Cook wrote:
[...HTML saved from browser and fetched with LWP appear different...]
> going on here? Is it the difference between a dynamic page and a static
> page being posted to?

No.


> Am I not receiving all the chunks of response in
> time to get the transaction id? Is my regex just plain lame?

Could be (I haven't checked).


> I'm a
> semi-newbie and have played with this for several days but to no avail.
> Any help would greatly be appreciated.

Change your script to save the response data (ie. the HTML) you fetched
with LWP to a file.  Compare it with the HTML you saved from your browser.

If you find the data LWP is fetching seems wrong, you have to figure out
what LWP is doing different from your browser -- if you get stuck there
after some effort, ask here again.  If you find that the data LWP is
fetching seems OK, debug your parsing code.
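
The comparison step can be done mechanically.  A small Python 3 sketch
using the standard difflib module; the file labels and the sample markup
are placeholders:

```python
import difflib


def html_diff(browser_html, fetched_html):
    """Return unified-diff lines between two saved copies of a page."""
    return list(difflib.unified_diff(
        browser_html.splitlines(),
        fetched_html.splitlines(),
        fromfile="browser.html",
        tofile="lwp.html",
        lineterm="",
    ))


differences = html_diff("<p>id=42</p>\n<p>ok</p>", "<p>id=</p>\n<p>ok</p>")
```

An empty result means the two copies are identical, which points the
finger at the parsing code rather than at the fetch.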


John


Re: Problem logging on to site with MECHANIZE

2004-01-23 Thread John J Lee
On Fri, 23 Jan 2004, Gedanken wrote:
[...]
> basically i manually set the form action... to the same thing it was
> already set to.  and voila, stuff starts working.   Ill edit your version
> below to show you what i mean.

Yuck.  Does it also work if you wave a dead chicken at it? ;-)

Why not check the HTTP headers to find out what's going wrong?


John


Re: Different outcomes with same request

2004-01-23 Thread John J Lee
On Fri, 23 Jan 2004, Philippe 'BooK' Bruhat wrote:
[...]
> Maybe the transaction is put in the page by some javascript
> (document.print?). Your browser saves the resulting page, while
> WWW::Mechanize works on what the server sends.

No, browsers always save the original document.  At least, that's what
they've always done when I've asked them to save...


John


Re: found my mech problem

2003-12-23 Thread John J Lee
On Tue, 23 Dec 2003, Gedanken wrote:
[...]
> because, unbeknownst to me, the action for that form happened to have the
> phrase '&lang=FR' in it.  well apparently &lang has a special meaning, as
> i can see from my request object that it has been encoded into an escape
> sequence against my will =)
[...]

What byte string do you get?


John


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread John J Lee
On Wed, 3 Dec 2003, John J Lee wrote:
[...]
> Not in KDE 3.2: it decompresses automatically, so when you save or open
> with KWrite, it's just 200_gzip.xml.

...and I'd take a guess that's because Safari (Apple's browser based on
Konqueror) does the same, because 3.2 apparently includes a lot of changes
merged back from Safari.


John


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread John J Lee
On Wed, 3 Dec 2003, Gisle Aas wrote:
> [EMAIL PROTECTED] writes:
[...]
> > http://diveintomark.org/tests/client/http/200_gzip.xml
> >
> > IE "just does it".
>
[...]
> Konqueror suggest saving or opening the file in an
> external app, but the file saved or given to an external app is still
> gzipped.

Not in KDE 3.2: it decompresses automatically, so when you save or open
with KWrite, it's just 200_gzip.xml.


John


HTTP::Cookies and URI character encodings

2003-11-24 Thread John J Lee
I think there might be a problem with _normalize_path, from HTTP::Cookies.
I'll explain what happens with my Python port, because I have no idea how
Perl and unicode interact: a unicode URI got passed to my equivalent of
_normalize_path() (a unicode string is a separate type from an ordinary
byte-string in Python).  That function complained because there were
non-ASCII characters in the unicode string, and it refused to guess which
encoding to use.

The stated purpose of _normalize_path is to allow plain string-comparison
of HTTP URI paths, but I don't understand a) how that's possible given
that the URI character set isn't always known, and b) why it's necessary
-- why not just compare without any normalization?

The trouble is, RFC 2396 doesn't specify any URI character encoding, but
allows %-escapes, which are defined in terms of octets.  So, when you see
a URI containing %-escaped chars, you have to know the original URI
character encoding in order to work out what characters they represent.
Unfortunately, I don't think that's always possible (is it?), so
normalizing to "fully-escaped" form (as _normalize_path does) may involve
assuming a different encoding than was used to partially escape the URI
before HTTP::Cookies had anything to do with it.  Escaping with
inconsistent character encodings certainly seems bad.
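
The dependence on encoding is easy to demonstrate.  A Python 3 sketch
(escape_path is a made-up helper, not a port of _normalize_path) that
escapes the same path under two encodings:

```python
from urllib.parse import quote, unquote


def escape_path(path, encoding):
    """%-escape a path after encoding its characters with the given charset."""
    return quote(path, safe="/", encoding=encoding)


as_utf8 = escape_path("/caf\u00e9", "utf-8")
as_latin1 = escape_path("/caf\u00e9", "latin-1")
```

The same character comes out as two octets under UTF-8 and one under
Latin-1, so the two escaped strings differ, and decoding one with the
other's encoding does not give back the original path.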

Am I correct?  Why not just leave URIs un-normalized?  If they must be
normalized, how should unicode URIs (or non-ASCII ones, generally) get
normalized?

This is all very confusing, especially to an English speaker who never
reads or writes anything but ASCII!


John


Re: Mechanize, Yahoo, and cookies

2003-11-19 Thread John J Lee
On Wed, 19 Nov 2003, John J Lee wrote:
[...]
> The Yahoo email login page is full of Javascript code doing complicated
[...]

BTW, as I must have said here before, the first thing everybody seems to
do is to try to automate their Yahoo email account, so I'm sure there's
lots of free pre-existing code around that already does this.


John


Archive? [was: Re: Submiting a javascript...]

2003-11-19 Thread John J Lee
I was about to say "search the archives", but I can't find them.  Surely
they exist??  There are several places that have archives years out of
date, and one with a couple of messages from 2003 and nothing else.

Can the real libwww-perl archive stand up, please?


On Sun, 16 Nov 2003, tv fw wrote:
[...]
> javacript
[...]

LWP doesn't handle JavaScript.  Figure out what it does, then copy it
using LWP, or get a browser to interpret it for you (eg. COM automation of
MSIE -- eg. samie project seems to be a bunch of convenience functions
layered on top of that, or use Java's httpunit, or Mozilla / XPCOM, or
Konqueror / KParts or DCOP).


John


Re: Mechanize, Yahoo, and cookies

2003-11-19 Thread John J Lee
On Tue, 18 Nov 2003, Brian Spiegel wrote:
[...]
> The launched browser, if the login was successful, should take me to my
> inbox.  However, I get a page stating that my browser doesn't allow cookies.
> Has anyone attempted logins with Yahoo or any of these other services?  Is
> there something in their auth/cookie mechanism that needs special handling?

The Yahoo email login page is full of Javascript code doing complicated
stuff.  Either read and understand it and copy what it does using LWP, or
try something else: eg. automate MSIE.


John


Re: cookies

2003-11-16 Thread John J Lee
On Sun, 16 Nov 2003, John J Lee wrote:

> On Sat, 15 Nov 2003, allan juul wrote:
[...]
> > no - sorry,i didn't mean kill in that unix sense - i close the program
> > with an exit or die or nothing more to do, then restart the program a
> > bit later and at that point i have gotten a completely new cookie.
>
> I don't know what the problem is.  Try sticking a print statement in the
[...]

For the record, the OP reports by email that the problem was
ignore_discard (you need to pass that argument to the CookieJar
constructor to tell it to save even session cookies).
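
The same effect can be seen in Python 3's standard http.cookiejar, where
ignore_discard is an argument to save() and load() rather than to the
constructor.  A sketch, with a made-up cookie and a temporary file:

```python
import http.cookiejar
import os
import tempfile


def session_cookie(name, value, domain):
    """A cookie with no expiry time, i.e. one marked to be discarded."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def saved_cookie_count(ignore_discard):
    """Save one session cookie, reload the file, and count what survived."""
    handle, filename = tempfile.mkstemp(suffix=".lwp")
    os.close(handle)
    try:
        jar = http.cookiejar.LWPCookieJar(filename)
        jar.set_cookie(session_cookie("sid", "abc123", "www.example.com"))
        jar.save(ignore_discard=ignore_discard)

        reloaded = http.cookiejar.LWPCookieJar(filename)
        reloaded.load(ignore_discard=ignore_discard)
        return len(reloaded)
    finally:
        os.remove(filename)
```

Without ignore_discard the session cookie never reaches the file, so a
restarted program starts with an empty jar and is handed a new cookie.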


John


Re: cookies

2003-11-15 Thread John J Lee
On Sat, 15 Nov 2003, allan juul wrote:
> On Saturday, Nov 15, 2003, at 16:36 Europe/Copenhagen, John J Lee wrote:
[...]
> > How did you kill the process?  If you kill -kill it in Unix, then Perl
> > won't have a chance to run the code to save your cookies.
> >
> > If you shut down your program normally, do the cookies get persisted
> > OK?
>
> no - sorry,i didn't mean kill in that unix sense - i close the program
> with an exit or die or nothing more to do, then restart the program a
> bit later and at that point i have gotten a completely new cookie.

I don't know what the problem is.  Try sticking a print statement in the
DESTROY method of HTTP::Cookies to check it's actually getting called, and
trace things through save(), as_string() to figure out what's going wrong.


John


Whoever is subscrib'd from cathaybk.com.tw, please fix your subscription address

2003-11-15 Thread John J Lee
Somebody is subscribed with an old address, apparently.  Every time I post
here, I get this:

-- Forwarded message --
Date: Sat, 15 Nov 2003 23:37:50 +0800
From: Postmaster <[EMAIL PROTECTED]>
To: John J Lee <[EMAIL PROTECTED]>
Subject: AutoReply Reminding Message: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
domain changed as @Cathaybk.com.tw

Original Subject: Re: cookies
 Thanks for send us email. We've delivered your email to @cathaybk.com.tw! Reminding!! 
Our Email domain is changed as @cathaybk.com.tw~ Thanks


Re: cookies

2003-11-15 Thread John J Lee
On Sat, 15 Nov 2003, allan juul wrote:
[...]
> i have tried something like
>
>  $robot->cookie_jar( {
>   autosave => 1,
>   file => 'cookie.lwp'
>   }
> );
>
>
> then i have tried to print out the cookies i get before i kill the
> robot process and after i re-start the robot and they are different. ?
> eh, what am i doing wrong ?

How did you kill the process?  If you kill -kill it in Unix, then Perl
won't have a chance to run the code to save your cookies.

If you shut down your program normally, do the cookies get persisted OK?


John


Re: SSL interface for HTTPS within LWP

2003-11-12 Thread John J Lee
On Wed, 12 Nov 2003, Haroon Rafique wrote:
[...]
> To have SSL capability you need either one of the following 2 modules on
> your Windows 2003 Server machine:
>
> Crypt::SSLeay
> IO::Socket::SSL

or the stuff from Johnny Lee (no, not me, we just happen to have similar
names), which only depends on MSIE, I think.


John


Re: Help with a login in a site

2003-11-08 Thread John J Lee
On Sat, 8 Nov 2003, Alexandre Loureiro wrote:
[...]
> I´ve talked to support of  the site and they gave some hints..
[...]
> - Then I´m redirected to a Login.php Script that gets my information in a
> LDAP Server

If that really is the information you need, why not start and end right
there?  Use LDAP directly, and forget LWP.  Far saner than messing about
with JavaScript.

http://perl-ldap.sourceforge.net/

[...]
> Question nº 1 - How can I submit this page above to authenticate the session
> to access the page i want ? ( It will redirect to the main page)

Use HTML::Form.


> Question nº 2 - After submit I need to maintain this session alive to enter
> the page i want ? ( Like using  LWP::ConnCache , if needed  send  me some
> examples)

A session is not the same as a connection.  ConnCache caches HTTP
connections -- this is an optimisation you don't care about (until you get
it working, anyway!).  Sessions are usually maintained by cookies and/or
data passed in POST or GET requests.  The former are set by HTTP headers
(Set-Cookie) and/or JavaScript code.  The latter comes from JavaScript
code, and/or from form submission (including INPUT TYPE=HIDDEN controls).
POST data is passed in the HTTP body.  GET data is passed in the URL as a
query string (like ?foo=bar&spam=eggs)
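
As an illustration of the form-submission case, a Python 3 sketch that
pulls INPUT TYPE=HIDDEN controls out of a page with the standard
html.parser module and encodes them the way they would travel in a query
string or POST body.  The class and function names and the sample field
are invented; in Perl, HTML::Form does this job.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs from INPUT TYPE=HIDDEN controls."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden controls found in html."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


PAGE = ('<form action="Login.php" method="POST">'
        '<INPUT TYPE=HIDDEN NAME="session" VALUE="xyz">'
        '<input type="text" name="user"></form>')
encoded = urlencode(hidden_fields(PAGE))
```

The encoded string is what keeps the session going on the next request,
whether it is sent as POST data or appended to the URL.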


John


Re: Mysterious Redirect

2003-11-05 Thread John J Lee
On Wed, 5 Nov 2003, David Busby wrote:
[...]
> works on another similar site.  So the question is not how to handle this in
> LWP but what other methods would these folks use for redirecting me that
> will work in IE and Netscape but not LWP.

Embedded script (JavaScript, usually) or Refresh redirect (either a normal
Refresh: HTTP header or, commonly, one that sits in a META HTML tag).
See recent posts here on both.
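
The META variant can be detected by looking at the page itself.  A Python
3 sketch with the standard html.parser module; the class and function
names, the sample page and the URLs are invented, and the parsing of the
CONTENT value is a simplification of what browsers accept:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Record the content of the first <META HTTP-EQUIV="Refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and self.content is None
                and (attributes.get("http-equiv") or "").lower() == "refresh"):
            self.content = attributes.get("content") or ""


def refresh_target(html, page_url):
    """Return the absolute URL a Refresh would load, or None if none found."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    if finder.content is None:
        return None
    # Typical form: "5; URL=next.html"; a bare delay reloads the same page.
    match = re.match(r"\s*[\d.]+\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$",
                     finder.content, re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    target = (match.group(1) or "").strip().strip("'\"")
    return urljoin(page_url, target) if target else page_url


PAGE = '<html><head><META HTTP-EQUIV="Refresh" CONTENT="0; URL=main.asp"></head></html>'
target = refresh_target(PAGE, "https://www.example.com/app/login.asp")
```

A client that does not follow Refresh on its own can fetch the returned
URL itself.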


> The traffic is over SSL
> connection, how could I watch that data?  Is there some clever way with
> stunnel that I can "watch the wire"?  How can I debug this s---?

See third question here

http://wwwsearch.sourceforge.net/bits/clientx.html


John


Re: I can´t make perl log into a server

2003-11-02 Thread John J Lee
On Sun, 2 Nov 2003, Alexandre Loureiro wrote:

> I´m new to perl and using some books I´ve done some simple scripts to
> make things easier. But now I´m having some problems login into a web
> site ( secure using php login). I can log into the site but to navigate
> further i need a cookie to send some information to the site.
>
> It´s like an authentification, then it goes to empty page as follow:

[...big ball of muddy JavaScript...]

Bleck.

What are you actually trying to do?

If it's just a watchdog, well, OK.  Good luck ;-)

If it's functional testing, I think you're better off using a browser
(automated from Perl).

I imagine you'll be lucky if somebody here feels like hacking through that
mess. :-)


John


Re: How to handle an onLoad body attribute

2003-10-31 Thread John J Lee
...just to add: I wasn't implying that Rod *is* doing something he
shouldn't.  I have no knowledge about the site in question or the
motivations of the people involved.


John


Re: How to handle an onLoad body attribute

2003-10-31 Thread John J Lee
On Thu, 30 Oct 2003, Poly wrote:

> This reminds me of a script that was supposed to post news articles
> somewhere until the site decided to keep hackers and automatic scripts
> out by inserting a session ID in an intermediary form...

Most of the time this stuff isn't an attempt to keep people out.  If it
is, you probably shouldn't be trying to get around it.


[...]
> onLoad().  Now if they catch you hacking their site, that's where the fun
> is, right?
[...]

Rod, please *don't* "hack their site".


John


Re: How to handle an onLoad body attribute

2003-10-30 Thread John J Lee
On Thu, 30 Oct 2003, Roderick A. Anderson wrote:
[...]
> There is an intermediate document being returned with an onLoad attribute
> in the body tag to automagically submit the new form.
> Needless to say this causes my script to fail and the as_string method
> doesn't include the original form and data or the intermediate form and
> data.  (This making any sense?)

Options:

0. Get the XML stuff working (recommended).
1. Figure out exactly what the script does and emulate it in your code.
2. Give up on libwww-perl and use browser automation.  Browsers know
   about stuff like embedded script.  You should be able to do this from
   any mainstream language, including Perl.  Keywords: COM, MSIE, XPCOM,
   Mozilla, DCOP, Konqueror.
3. Use another language which has libraries that provide JavaScript
   interpretation.  I imagine there's probably a Perl-Java bridge out
   there, for example, and the HttpUnit library can do this, though not
   fantastically well (mostly because the browser object model bindings
   are not fully implemented, and what is implemented is not a faithful
   copy of browsers' behaviour).


John


Re: Help on LWP: college project on sms.ac

2003-10-23 Thread John J Lee
On Thu, 23 Oct 2003, Abhishek jain wrote:
[...]
> I am a B.Tech student and as a part of college project I have to make a
> program that is able to send sms using www.sms.ac website. I tried to
> make the project myself but I am having some cookies problem. The site
> is not accepting the cookies generated by my LWP program. I am sending
[...]

1. Forget about CGI until you've got the thing working on the command line
(or in your IDE, or whatever)

2. Have you actually checked what cookies are getting set and returned?
Turn on HTTP::Cookies' debugging (IIRC, you have to ask LWP's centralised
debug logging facility to do that), and it should tell you exactly what
the server is sending, why it doesn't like any rejected cookies, and why
it's not returning any cookies whose domain matches the site you're
talking to.  A sniffer like ethereal will also likely be useful, to see
exactly what's going on (try sniffing both a browser and your program, and
compare the two).
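
As an illustration of "check what is getting set and returned", here is a
small sketch using Python's http.cookiejar rather than HTTP::Cookies.  The
FakeResponse class, the host and the cookie are all invented for the
example, so it runs without a network; with a real site you would pass
the real response object instead.

```python
import email
import http.cookiejar
import urllib.request


class FakeResponse:
    """Stand-in for an HTTP response: just enough for CookieJar to read."""

    def __init__(self, raw_headers):
        self._headers = email.message_from_string(raw_headers)

    def info(self):
        return self._headers


jar = http.cookiejar.CookieJar()

# Pretend the server answered our first request with one session cookie.
first = urllib.request.Request("http://www.example.com/login")
jar.extract_cookies(FakeResponse("Set-Cookie: sid=abc123; Path=/\n\n"), first)

# What did the jar accept?
stored = [(c.name, c.value, c.domain, c.path) for c in jar]

# What would be sent back on the next request to the same host?
second = urllib.request.Request("http://www.example.com/send")
jar.add_cookie_header(second)
returned = second.get_header("Cookie")
```

If `stored` is empty the jar rejected the cookie; if `returned` is None the
jar decided the cookie does not apply to the second URL.  Either way you
know which half of the exchange to look at.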

HTH


John


Re: The saga continues

2003-10-18 Thread John J Lee
On Fri, 17 Oct 2003, Roderick A. Anderson wrote:
[...]
> my $res = LWP::UserAgent->new->request($form->click);
>
> Are there any methods to search $res (which contains another form) to pull
> out specific inputs that have been returned?
[...]

Just do another HTML::Form->parse() on the response data (or the response
itself if you've got the latest LWP, IIRC), and call the find_input method
on one of the forms that it returns.  I think there's a possible_values
method, too, which you might find useful.
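
The idea, sketched in Python (standard library only) rather than with
HTML::Form itself: parse the response body again and look the input up by
name.  The find_input function below is my own stand-in named after the
Perl method, not that method's implementation, and it only looks at INPUT
elements.

```python
from html.parser import HTMLParser


class FormInputs(HTMLParser):
    """Collect {name: value} for the INPUT controls of each FORM."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form>, in document order
        self._current = None  # dict for the form we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            if name:  # unnamed controls are never submitted
                self._current[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_input(html, name, form_index=0):
    """Return the value of the named input in the chosen form, or None."""
    parser = FormInputs()
    parser.feed(html)
    return parser.forms[form_index].get(name)
```

This raises IndexError if the page has no form at that index, which is
probably what you want to hear about anyway.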

Have you found the HTML::Form docs?


John


Re: Building HTML document in memory

2003-10-16 Thread John J Lee
On Wed, 15 Oct 2003, Roderick A. Anderson wrote:

> Back again.  Getting more things solved but I can't find _anything_ on how
> to build a HTML document in memory.  All I really need is a method to POST
> to a page with data I already have collected.  I know the form inputs
[...]

HTTP POST does not require building an HTML document.

I don't really understand what you wrote, so it's hard to guess what your
problem is.


John


Re: libwww-perl-5.71

2003-10-15 Thread John J Lee
On Wed, 15 Oct 2003, Gisle Aas wrote:
[...]
>2
[...]
> Of the browsers I have here Mozilla displays "2" for the second value
> while konqueror shows "x".  I guess my question is what MSIE shows?

2!  Damn.

(For IE 5 -- has there been any standards-compliance effort with IE 6?  I
certainly doubt it on this particular point.)

[...]
> > > Female
> > >
> > > in the expected way.
> >
> > Officially, those s aren't allowed, are they?
>
> Yes.  After the input you are in plain text context.  The input tag is
> implicitly empty.

Oh, right.


> > seen them 'in the wild', though :-(  Probably I should strip tags even for
> > OPTION element contents...
>
> But I think  is different as it is a container.  You can't
> have  here, but I have not tested what browsers do.

Officially you can't, right, but we all know how this game works ;-).
But as I said, my parser won't pick any up anyway, being event-driven, and
IIRC your code is similar in that respect (albeit 'pull' rather than
'push', which is a nicer way of doing it).

[...]
> > I see.  It can still be used for the list items of INPUT type=radio, or
> > whatever, though (though, as I said, probably rarely).
>
> I'll deal with it if somebody complains :)

Good plan.


John


Re: libwww-perl-5.71

2003-10-15 Thread John J Lee
On Wed, 15 Oct 2003, John J Lee wrote:
[...]
> seen them 'in the wild', though :-(  Probably I should strip tags even for
> OPTION element contents...
[...]

Hmm, I guess both our parsers do that naturally, anyway.  :-)


John


Re: libwww-perl-5.71

2003-10-15 Thread John J Lee
On Wed, 14 Oct 2003, Gisle Aas wrote:
> John J Lee <[EMAIL PROTECTED]> writes:
[...]
> I could potentially let there be multiple 'value_names' for a single
> value, but I could also just let an explicit label override the option
> as per spec.  Is this something that real browsers implement?

I'm not sure I understand what you're asking.

Real browsers do implement the defaulting of OPTION values to OPTION
element contents (well, I tested Konqueror just now, and it does), and the
labels certainly do, of course.  So, if you have


<SELECT NAME=foo>
 <OPTION>1
 <OPTION>2
</SELECT>


You see 1 and 2 as the labels in the browser's GUI, and the server gets
sent foo=1, or whatever.
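
A sketch of that defaulting rule, in Python (standard library only; the
class and function names are mine, and this is neither HTML::Form nor my
own module's code).  Each OPTION yields a (value, label) pair, and either
one falls back to the element contents when its attribute is missing.

```python
from html.parser import HTMLParser


class SelectOptions(HTMLParser):
    """Collect (value, label) pairs for every OPTION in a document."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._attrs = None  # attributes of the OPTION being read
        self._text = []

    def _finish_option(self):
        if self._attrs is None:
            return
        contents = "".join(self._text).strip()
        # HTML 4: value and label both default to the element contents.
        value = self._attrs.get("value")
        label = self._attrs.get("label")
        self.options.append((contents if value is None else value,
                             contents if label is None else label))
        self._attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._finish_option()  # a new OPTION implicitly ends the last
            self._attrs, self._text = dict(attrs), []

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._finish_option()

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)


def parse_options(html):
    parser = SelectOptions()
    parser.feed(html)
    parser.close()
    return parser.options
```

Whether a caller should then match on the first element, the second, or
both is exactly the design question being discussed here.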

Personally, I chose to have users explicitly say that items are specified
by label if that's what they want -- otherwise, items are assumed to be
specified by value.  I can see that searching both sets of names for a
match by default might be better in some ways, though.


> The label just seems like a mechanism to get shorter labels when
>  is used and as such the labels are likely not to be unique.
> That makes them bad names for selecting values.

I don't know why the label attribute was introduced, but I implemented the
feature because I ran into a page where the labels seemed less likely to
change than the values.

[...]
> > I see.  That's not a part of the HTML spec IIRC (unlike the case of
> > OPTION), but I guess it could be useful.
>
> I think so.  The problem is to know where to stop collecting text.  I
> implemented a new get_phrase method to HTML::TokeParser to get it to
> do what I think makes sense.  It will deal with stuff like:
>
> Female
>
> in the expected way.

Officially, those s aren't allowed, are they?  No surprise if you've
seen them 'in the wild', though :-(  Probably I should strip tags even for
OPTION element contents...


> >  Of course, there can also be explicit LABEL elements, but I suspect
> > people rarely use them, so probably not very useful.
>
> All the examples in the HTML4 spec use the label more like a prompt
> text and as such it is more an alternative name for the input.  The
> following example is given:
[...]

I see.  It can still be used for the list items of INPUT type=radio, or
whatever, though (though, as I said, probably rarely).


John


Re: libwww-perl-5.71

2003-10-14 Thread John J Lee
On Tue, 14 Oct 2003, John J Lee wrote:
[...]
> OK.  Did you notice that both the value and the label of OPTION default to
> the contents (eg. Female here), according to the HTML 4 spec?  In my
[...]

Just to be clear, OPTION actually has a label attribute, unlike INPUT
(which needs a special LABEL element if you want it to have a label).  As
a result, I guess OPTION labels are relatively common.


John


Re: [spam score 5/10 -pobox] Re: libwww-perl-5.71

2003-10-14 Thread John J Lee
On Tue, 14 Oct 2003, Gisle Aas wrote:
> John J Lee <[EMAIL PROTECTED]> writes:
[...]
> Yes.  If a form contains:
>
> 
>Female
>Male
>Unknown
> 
>
> Then the values that this field might take becomes "F", "M" and "?",
> while the value names are "Female", "Male" and "Unknown".  With newer
> version of HTML::Form you can use both to modify the values.  The
> statement
[...]

OK.  Did you notice that both the value and the label of OPTION default to
the contents (eg. Female here), according to the HTML 4 spec?  In my
Python module, I decided to allow people to set options 'by label'.  I
guess your scheme doesn't let you use the label (foo) here:

<OPTION LABEL="foo">bar</OPTION>

while my scheme doesn't let you use the element contents (bar) in that
same case.  Hm...

[...]
> No.  It's the same concept as for the select/option shown above.  If
> you have a form containing:
>
> Female
> Male
> Unknown
>
> then the values and value names for the sex field will be the same as
> in the previous example and:
>
>$form->param(sex => "male")
>
> will just work.  That is if you are not surprised by $form->param("sex")
> returning "M" even if you just set it to "male".

I see.  That's not a part of the HTML spec IIRC (unlike the case of
OPTION), but I guess it could be useful.  Of course, there can also be
explicit LABEL elements, but I suspect people rarely use them, so probably
not very useful.


John


Re: libwww-perl-5.71

2003-10-14 Thread John J Lee
On Tue, 14 Oct 2003, Gisle Aas wrote:
[...]
> HTML::Form's dump now also prints alternative value names.

What does 'alternative value name' mean?  Is this something to do with
OPTION element contents (this bit) and labels?


> HTML::Form will now pick up the phrase after a 
> or  and use that as the name of the checked
> value.
[...]

What does this mean?  Something like this, maybe (no name attributes)?

foo
bar


The browser I happen to be running right now (Konqueror) doesn't make
those controls successful (but that says nothing about what other browsers
do, of course :-( ).

Or do you mean something else?


John


Re: Getting returned from values???

2003-10-14 Thread John J Lee
On Mon, 13 Oct 2003, Roderick A. Anderson wrote:
[...]
> Well I figured WWW::Mechanize was the ticket but I am now up against the
> wall.  I can't figure out how to take the returned page and get just that
> field's value.  All the examples I've found of using Mech are looking for
> non-form information and though I could use those techniques I thought
> there has to be an easier way.  Do I need to use CGI.pm (which I'm already
> using in the script) and if so how do I get the Mech results into a CGI
> instance?

HTML::Form (part of LWP)?  Not certain it works nicely with
WWW::Mechanize, because I've never used that.

I doubt any CGI code is going to be very useful to you.


John


RE: Cant "download" a webpage whit the same content my browser do es.

2003-10-11 Thread John J Lee
On Fri, 10 Oct 2003, Thurn, Martin wrote:

>   What's probably happening is that you have cookies enabled, but you're not
> sending any.  You have to GET the cookie from the search FORM page, in order
> to SEND the cookie back with your POST of the query.

Probably right.

Jonathan: remember that Mozilla doesn't save session cookies to disk (by
definition of 'session cookies', really -- they only last as long as the
browser is open).


John


Re: [patch] Uninitialized value in HTTP/Cookies.pm

2003-09-24 Thread John J Lee
On Wed, 24 Sep 2003, Christophe Chisogne wrote:

> Just the same patch as in my previous post, but with a more
> correct mime type ;-)
>
> I feel like excite.com sends bad cookies.
> "Set-Cookie: uu=i=213.193.180.194-1064396543121MJ;; ..."
> the double ';' is preceded by 2 '=' in the same
> 'name=value' string. I guess it's not the right syntax.

The equals inside the cookie value is allowed -- taking the standard as
the de-facto one set by Netscape and IE, since the written spec. is very
poorly defined and, indeed, wrong.  I'm afraid the truth is that, if IE
and Mozilla like it, it's 'correct'.  I haven't seen a double semicolon
before, though.
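
A lenient parse along those lines can be sketched as follows (Python,
standard library only; this is my own illustration, not the code in
HTTP/Cookies.pm or in the patch).  It splits name from value at the first
'=' only and ignores the empty segment that a ';;' produces.  The
"path=/" used in the test is an assumption, since the rest of the quoted
header was elided.

```python
def parse_set_cookie(header_value):
    """Split a Set-Cookie value leniently, roughly as browsers do.

    Returns (name, value, attributes).  Only the FIRST '=' separates a
    name from its value, and empty segments (from ';;') are skipped.
    """
    segments = [s.strip() for s in header_value.split(";")]
    segments = [s for s in segments if s]  # drop empties left by ';;'
    name, _, value = segments[0].partition("=")
    attributes = {}
    for segment in segments[1:]:
        key, _, val = segment.partition("=")
        attributes[key.strip().lower()] = val.strip()
    return name.strip(), value.strip(), attributes
```

An entirely empty header would raise IndexError here; a real parser would
want to reject that case explicitly.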


John



Re: [PATCH] URI test failure on OS/2

2003-09-20 Thread John J Lee
On 19 Sep 2003, Gisle Aas wrote:
[...]
> The current behaviour is based on what made sense to me, not on how
> stuff actually works in other apps on Windows.  Anybody know a place
> that describes the de-facto rules for file: URLs on Windows?
[...]

Probably a useless snippet: apparently both ':' and '|' are accepted by
both Netscape 4 and IE (version 5 or 6, I guess).
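
If one wanted to accept both spellings, the mapping could be sketched like
this.  It is Python (standard library only), the function name is made up,
and it is not what the URI module does; it only handles the
file:///X:/... and file:///X|/... shapes.

```python
from pathlib import PureWindowsPath
from urllib.parse import unquote, urlsplit


def file_url_to_windows_path(url):
    """Convert file:///C:/dir/x or file:///C|/dir/x to a Windows path."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file: URL")
    path = unquote(parts.path).lstrip("/")
    # Accept the legacy 'C|' spelling of a drive letter as well as 'C:'.
    if len(path) >= 2 and path[0].isalpha() and path[1] == "|":
        path = path[0] + ":" + path[2:]
    return PureWindowsPath(path)
```

UNC-style URLs with a host part are deliberately left out of this sketch.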

http://www.google.com/groups?threadm=87n0e1zqzh.fsf%40pobox.com


John



Re: where is the submit button?

2003-09-04 Thread John J Lee
On Wed, 3 Sep 2003, wendy soros wrote:
[...]
> What I am trying to do is very similar to the
> ABEBooks.com example in Burke (p.74): use the POST
> method to submit some parameters to a form and save
> the response to a local file. As done in the example,
> I first got the name-value pairs of the form.

Can you see the name-value pairs that will actually carry the data of
interest back to the server (don't worry about the submit part)?  I mean,
can you see them from the HTML::Form interface?


> The
> problem I have is that I don't know how to submit the
> form. In the webpage, there are two buttons "Search"
> and "Clear Form", but I can't find the pair of name
> and value for submission.

That doesn't necessarily matter.


> In the source code, the two
> buttons are represented by two images:
> "search_gray.gif" and "clearform_gray.gif", below is
> the part of the source code I guess is relevant.
[...]

You have JavaScript code embedded in the HTML.  Luckily, it seems from
what you say that the JavaScript doesn't actually generate the name/value
pairs, but only does the submission.  You could look at the submitForm
function to see exactly what it does (it might be in a separate file,
referenced in a src attribute -- it will likely have a .js extension).

However, you'll probably find that all that submitForm function does is to
validate the form values, so you could just try submitting the form as-is
after filling it in with HTML::Form.

IIRC, $form->submit(); gets you an HTTP::Request, which you then use in
the normal way (I forget what that is :-).

As somebody here says periodically, it'd be nice if LWP could interpret JS
automatically.  What happened to the last guy who tried that?  He reported
some progress, then disappeared.  I have some somewhat-working Python code
to do this, if any enthusiastic Perl hacker wants to use it as a starting
point (it uses a spidermonkey JS/Python bridge -- very similar to Perl's
Javascript module -- to bind JS to a pure-Python level 2 HTML DOM and
assorted paraphernalia).


John



Re: help with accessing lists with HTML::Form (with code sample)

2003-08-26 Thread John J Lee
On 25 Aug 2003, Gisle Aas wrote:

> Mark Stosberg <[EMAIL PROTECTED]> writes:
>
> > I would like to see an extension to this part of the interface which
> > allows one to treat single and multiple SELECT lists the same way. In
> > the current situation calling the same command can result in dealing
> > with either the SELECT tag, or the OPTION tag, which I find less useful
> > and confusing.
>
> But it is a model that gives a uniform behaviour and interface for all
> the inputs.  I think this is a good thing and that it should be enough
> to write better documentation that explains the mapping between the
> HTML tags/elements and the HTML::Form input objects.  As you
> discovered it is not one-to-one when it comes to .
[...]
> Since there is no representation of the  tag itself there is
> no object to put this method on.
[...]

FWIW, I found this part of the way HTML::Form works confusing too, and
changed it in my port (well, quite a few other things have changed, so
it's not a simple port any more).  All 'controls' (in the HTML 4
terminology) are represented by single Control objects.  Maybe the
examples of the modified API here are of interest (in Python):

http://wwwsearch.sourceforge.net/ClientForm/src/README-0_1_7b.html


John



RE: Question about wildcarding when getting files with LWP

2003-08-21 Thread John J Lee
On Thu, 21 Aug 2003, Patrick Collins wrote:
[...]
> If you control the webserver you could try upgrading to Apache2 and
> using mod_dav. The Webdav protocol allows you to get parsable directory
> listings from which you can then download whatever files you choose.
[...]

And no doubt there's a module out there somewhere that will try to parse
any old directory listing.


John



Re: TreeBuilder cgi memory problems

2003-08-14 Thread John J Lee
On Thu, 7 Aug 2003, [EMAIL PROTECTED] wrote:

> Having a potential TreeBuilder memory problem when using it to parse
> through a large HTML table (> 2K rows) where the memory allocation grows to
> about 20M on my server and never goes down even after finishing with the
> HTML and TreeBuilder structures. The Perl script runs as a CGI and Apache
> gives up after awhile with the following line in the error logs - "Out of
> Memory !!"
[...]

20 MB does seem a lot, but why would one expect the process memory usage
to fall after parsing is complete?  On most systems, memory used by a
process and freed isn't returned to the system until the process exits.

Sorry, no actual help...


John



Re: how to handle the

2003-08-06 Thread John J Lee
On Tue, 5 Aug 2003, Andrea Tasso wrote:
[...]
> and lynx is short and with a 

Re: Crypt-SSLeay on Win32: support for 128 bit X509 CA certificates

2003-07-22 Thread John J Lee
On Mon, 21 Jul 2003 [EMAIL PROTECTED] wrote:
[...]
>   I'm setting the HTTPS_CA_DIR and HTTPS_CA_FILE environment
>   variables as described in the documentation.
> - Something else strange - when I don't(!) set the two
>   environment variables, then I can access both sites(!!). The
>   warning "Client-SSL-Warning: Peer certificate not verified"
>   is however still being issued.

Presumably because it interprets the lack of those environment vars as an
implicit request not to attempt verification.  Since you didn't *ask* for
verification, it figures it can go ahead and access the site without
verifying.
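
The policy I'm guessing at can be written down as a sketch.  This is
Python's ssl module, not Crypt::SSLeay, and make_context is a made-up
name; the two arguments stand in for HTTPS_CA_FILE and HTTPS_CA_DIR.  It
shows "no CA information means no verification", which is the behaviour
presumed above, not a recommendation.

```python
import ssl


def make_context(ca_file=None, ca_dir=None):
    """Verify the peer only if the caller supplied CA material."""
    if ca_file is None and ca_dir is None:
        # No CA information given: proceed without verifying anything.
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        context.check_hostname = False  # must be cleared before CERT_NONE
        context.verify_mode = ssl.CERT_NONE
        return context
    # CA information given: require a certificate that chains to it.
    return ssl.create_default_context(cafile=ca_file, capath=ca_dir)
```

With this policy a wrong CA file makes every site fail, while no CA file
at all lets every site through, which would match the symptoms reported.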

Python's httplib *never* verifies, IIRC :-(

...OTOH, neither do humans <0.5 wink> -- few people take any notice of
failed verifications.


John



Re: Help needed

2003-07-15 Thread John J Lee
On Mon, 14 Jul 2003, Octavian Rasnita wrote:
[...]
> After downloading an HTML page, what modules can I use to read the cell 4
> from the fifth row of a table, if that table is placed in another table in
> the cell 2 of the row 3?
[...]

http://theoryx5.uwinnipeg.ca/CPAN/data/HTML-TableExtract/HTML/TableExtract.html

It *looks* easy to use (I've never used it, so don't believe me).  It has
some terribly clever declarative ways of specifying which bits of which
table you want, but hopefully you can ignore those. :-)

A quick CPAN search turned up these, which I've never looked at:

HTML::TableParser
HTML::TableContentParser


John



Re: HTML parsing

2003-07-10 Thread John J Lee
On Thu, 10 Jul 2003, John J Lee wrote:
[...]
> Another way is to use libtidy (a new shared library-ized HTMLTidy, with
[...]

Actually, it's called tidylib, not libtidy:

http://tidy.sourceforge.net/libintro.html


John



Re: HTML parsing

2003-07-10 Thread John J Lee
On Wed, 9 Jul 2003, Reinier Post wrote:

> On Tue, Jul 08, 2003 at 03:34:12PM +0100, Richard Lamb wrote:
> > working out a means of stripping HTML tags (via the DOM interface, which
> > [...]
> I have only tried HTML::TreeBuilder (not DOM, but the same principle;
> uses heuristic HTML parsing and patching that does some unwanted things)
[...]

Another way is to use libtidy (a new shared library-ized HTMLTidy, with
Perl binding) and a (standard, rather than TreeBuilder) DOM Parser.  I
think libtidy can output both XHTML and HTML (there is such a thing as
HTML DOM, remember).


John



Re: a better description of the problem.

2003-07-10 Thread John J Lee
On Wed, 9 Jul 2003, Jonathan Daigle wrote:
[...]
>   $inref->{dbh}->prepare(qq| SELECT * from affiliate_account|);  # for
[...]

This looks like web server code.  This list is for discussion of web
client code.


John



Re: user-agent supporting tables,vbscript, frames, etc....how?

2003-07-10 Thread John J Lee
On Wed, 9 Jul 2003, Terry wrote:
[...]
> If a page has frames, the main page gets returned, not
> the frames.  Another request has to be made or
> something to get the contents of the frame.  How do I
> go about that?
[...]

The same way you grabbed the main page?  Just parse out the URLs, and
fetch them.  Maybe the thing you're using or WWW::Mechanize has code to do
that, I don't know.
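
"Parse out the URLs" amounts to something like the following sketch
(Python, standard library only; the names and URLs are invented, and this
is not HTTP::WebTest or WWW::Mechanize code).  It collects the SRC of each
FRAME or IFRAME and resolves it against the URL of the frameset page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSources(HTMLParser):
    """Collect the SRC of every FRAME and IFRAME element."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return absolute URLs for the frames of a frameset page."""
    parser = FrameSources()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]
```

Each returned URL is then fetched with the same user agent (and the same
cookies) as the main page.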


John



Re: user-agent supporting tables,vbscript, frames, etc....how?

2003-07-08 Thread John J Lee
On Tue, 8 Jul 2003, Terry wrote:

> I am using HTTP::WebTest to do site logon simulation.
> However, there are browser checks on some websites.
> For example, some require that the user-agent (client)
> supports frames, tables, and vbscript, is there a way
> to 'trick' the server into thinking my client can
> support these things?

Depends what you mean by 'some require that the user-agent... supports'.

If you mean they won't even send you the frames, JavaScript etc. (does
anybody still *require* VBScript?), then you probably just need to set the
User-Agent header to something like "Mozilla/5.0".
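
Setting the header is a one-liner in most HTTP libraries.  As a sketch in
Python (standard library only, hypothetical URL; HTTP::WebTest will have
its own way of doing this):

```python
from urllib.request import Request

# Hypothetical URL; "Mozilla/5.0" is the value suggested above.
request = Request(
    "http://www.example.com/login",
    headers={"User-Agent": "Mozilla/5.0"},
)
```

Some sites sniff for a fuller browser string, so copying the exact value
your real browser sends may work where the short form does not.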

If you mean they do send you those things, but HTTP::WebTest doesn't know
how to handle them, obviously that's harder.  Ask about particular
problems you're having.


John



RE: Passing the same cookie and headers to a new site

2003-06-23 Thread John J Lee
On Mon, 23 Jun 2003, Alan Olegario wrote:

> I tried checking what headers are being sent with ethereal, but it looks
> like I can't get the info since it's going over https and being
> encrypted.
[...]

There are several solutions to that.  Look at the message I posted here a
week or two ago for details.


John



Re: Passing the same cookie and headers to a new site

2003-06-21 Thread John J Lee
On Sat, 21 Jun 2003, Matthew Darwin wrote:

> The LWP behaviour looks like a security problem to me.
>
> For example, davin.ottawa.on.ca is not related to flora.ottawa.on.ca
> So if one sets a cookie the other site can get it?
> Very bad.
>
> Canadian domains are in the form ...ca
> or ..ca or .ca

Cookies are a security problem, not LWP's implementation of them.  The
behaviour you describe is a long-established part of the Netscape cookie
protocol.

If you want to have your cookies only sent back to your own domain, don't
give any explicit Domain attribute in the Set-Cookie header.  Even there,
some browsers (MSIE 5) don't require an exact domain string-match (for
example, a cookie set by www.foo.com can be returned to
rhubarb.www.foo.com).  In fact, IIRC, some browsers allow foo.co.uk to set
a cookie for the entire .co.uk domain!  Don't trust Netscape's 'standard'
(cookie_spec.html) further than you can spit it: nobody has ever followed
it, and nobody ever will.
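
The difference between a cookie with and without an explicit Domain can
be sketched like this.  It is a simplified Netscape-style tail match
written in Python; it is not LWP's algorithm, and the function and
argument names are invented.

```python
def cookie_applies(request_host, cookie_domain=None, set_by=None):
    """Decide whether a cookie would be returned to request_host.

    cookie_domain is the explicit Domain attribute, if any; set_by is
    the host that set the cookie.  Simplified Netscape-style tail match.
    """
    request_host = request_host.lower()
    if cookie_domain is None:
        # No Domain attribute: only the exact host that set it qualifies.
        return request_host == (set_by or "").lower()
    suffix = cookie_domain.lower().lstrip(".")
    # Tail match: the host is the domain itself or ends with ".domain".
    return request_host == suffix or request_host.endswith("." + suffix)
```

Note that nothing here stops a Domain of ".co.uk" or ".on.ca"; deciding
which suffixes are too broad is exactly the part the Netscape document
never pinned down.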

RFC 2965 (which LWP knows about) is much more clearly defined and better
thought through, but hardly anybody uses it (neither IE nor Mozilla
implements it -- nor RFC 2109 for that matter).  The incentives just
aren't there for it to come into widespread use.  I've heard rumour that
the European Union may pass legislation containing requirements for which
P3P (which deals with third party cookies, amongst other things) is
insufficient / inappropriate, and hence open a gap that RFC 2965 might
fill, but I haven't tried to verify that there is/was any truth in that.
There's also the fact that RFC 2965 has unresolved Netscape-protocol
(old-style cookies) interoperability issues -- errata were being
discussed, but that effort seems to have stalled in the last couple of
months.


John



RE: Passing the same cookie and headers to a new site

2003-06-21 Thread John J Lee
On Fri, 20 Jun 2003, Alan Olegario wrote:
[...]
> HTTP::Cookies::extract_cookies: Set cookie SMSESSION => [cookie info]
> HTTP::Cookies::extract_cookies: Set cookie FORMCRED =>
> HTTP::Cookies::extract_cookies: Set cookie EntFXSessionR => [cookie info]
> HTTP::Cookies::extract_cookies: Set cookie LOGIN => 0
[...]
> HTTP::Cookies::add_cookie_header: Checking testsite.somesite.com for cookies
> HTTP::Cookies::add_cookie_header: Checking .somesite.com for cookies
> HTTP::Cookies::add_cookie_header: - checking cookie path=/
> HTTP::Cookies::add_cookie_header:  - checking cookie LOGIN=0
> HTTP::Cookies::add_cookie_header:it's a match
> HTTP::Cookies::add_cookie_header:  - checking cookie FORMCRED=
> HTTP::Cookies::add_cookie_header:it's a match
> HTTP::Cookies::add_cookie_header:  - checking cookie EntFXSessionR=[same cookie info 
> as above]
> HTTP::Cookies::add_cookie_header:it's a match
> HTTP::Cookies::add_cookie_header:  - checking cookie SMSESSION=[same cookie info as 
> above]
> HTTP::Cookies::add_cookie_header:it's a match
[...]


Looks OK to me.  LWP wants to send all your www.somesite.com cookies back
to testsite.somesite.com.

Have you checked the headers that are actually being sent (eg. ethereal)?
Checking what your browser is sending and comparing with what LWP sends
will probably quickly let you find the problem.

If the Cookie header is there, standard answer: what other state are you
forgetting about (Referer, for example)?
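
Once you have both sets of headers written down, the comparison itself
can be mechanical.  A sketch in Python (the function name is mine; the
header values in any example are whatever you captured):

```python
def header_differences(browser_headers, script_headers):
    """Return {name: (browser_value, script_value)} where the two differ.

    Header names are compared case-insensitively; a missing header is
    reported as None.
    """
    browser = {k.lower(): v for k, v in browser_headers.items()}
    script = {k.lower(): v for k, v in script_headers.items()}
    return {
        name: (browser.get(name), script.get(name))
        for name in sorted(set(browser) | set(script))
        if browser.get(name) != script.get(name)
    }
```

Anything the browser sends that your script does not (Referer being the
usual suspect) shows up with None in the second position.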


John


