[issue41989] htmlparser unclosed script tag causes data loss

2020-10-11 Thread Waylan Limberg


Change by Waylan Limberg :


--
keywords: +patch
pull_requests: +21635
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/22658

___
Python tracker 
<https://bugs.python.org/issue41989>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41989] htmlparser unclosed script tag causes data loss

2020-10-09 Thread Waylan Limberg


New submission from Waylan Limberg :

When the `close` method of `HTMLParser` is called, any cached text data is 
generally flushed and passed to a `data` event, except when in CDATA mode. 
Specifically, if an unclosed `script` or `style` tag has been encountered, a 
call to `close` does not flush the data.

A simple test which demonstrates the issue is attached.

I see that in Lib/html/parser.py#L244-L249 there are two nested if statements 
which both check for `not self.cdata_elem`. Once the first check passes, the 
nested one is always true as well, so it is redundant. This block of code 
needs a branch for when `self.cdata_elem` is set.

I should note that the input is invalid HTML. However, the existing behavior 
results in data loss. Within any other unclosed tag (other than `script` or 
`style`) any data is still flushed and passed to a `data` event. I would expect 
the same behavior here. Although, the data escaping behavior should perhaps be 
applied as it is with data within properly closed tags.
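
The behavior described above can be reproduced with a short script along
these lines. This is only a sketch, not the attached test_html.py: the
`DataCollector` class and `collect_data` function are names made up for
illustration. On the versions listed in this report the second list is
expected to come back empty; other releases may behave differently.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_data(html):
    """Feed `html` to a fresh parser, close it, and return the data chunks."""
    parser = DataCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Unclosed ordinary tag: the trailing text reaches handle_data.
in_paragraph = collect_data("<p>some text")

# Unclosed script tag: this is the case the report says loses the text.
in_script = collect_data("<script>some text")

print(in_paragraph)
print(in_script)
```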

--
components: Library (Lib)
files: test_html.py
messages: 378359
nosy: waylan
priority: normal
severity: normal
status: open
title: htmlparser unclosed script tag causes data loss
type: behavior
versions: Python 3.10, Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 
3.9
Added file: https://bugs.python.org/file49505/test_html.py




Re: How to upload to Pythonhosted.org

2017-12-08 Thread waylan
After asking here, I found a mailing list post here:
https://mail.python.org/pipermail/distutils-sig/2015-May/026381.html

That post outlines a roadmap for shutting down pythonhosted. Unfortunately, it
seems that they skipped from step 1 to step 5 without bothering with steps 2,
3, & 4.

In any event, that list discussion seems to be the official word that things
are being shut down, which was what I was looking for.  It's unfortunate that
things weren't done more smoothly.

Also it seems that if you want to avoid search results showing up for the
pythonhosted content after you find a new host, they at least provide a way to
"delete" the content from pythonhosted. That way, Google will stop indexing it
and stop including it in search results. Unfortunately, all the existing links
across the internet are now dead with no way to redirect people.

Waylan

On Thursday, November 30, 2017 at 1:47:32 PM UTC-5, Irmen de Jong wrote:
> On 11/30/2017 03:31 AM, Ben Finney wrote:
> > Irmen de Jong <irmen.nos...@xs4all.nl> writes:
> >
> >> On 11/30/2017 02:06 AM, waylan wrote:
> >>> So, how do I upload an update to my documentation?
> >>
> >> I ran into the same issue. From what I gathered, Pythonhosted.org is
> >> in the process of being dismantled and it hasn't allowed new doc
> >> uploads for quite some time now. I switched to using readthedocs.io
> >> instead.
> >
> > The issue that many are facing is how to update the pages *at the
> > existing URL* to tell visitors where to go next. Cool URIs don't change
> > <URL:https://www.w3.org/Provider/Style/URI.html> but, when they do, we
> > are obliged to update the existing pages to point to the new ones.
>
> Sorry, yes, that is the problem I experience as well. My library's old version
> documentation is somehow frozen on Pythonhosted.org (and obviously still pops
> up as the first few google hits).
>
>
> > So, if pythonhosted.org is indeed being dismantled, there should be a
> > way to update the pages there for informing visitor where they should go
> > next.
> >
> > If that's not possible and instead the service is just locked down,
> > that's IMO a mistake.
>
> I agree with that. I think it's an unsolved issue until now, that gets some
> discussion in this github issue https://github.com/pypa/warehouse/issues/582
>
>
> Irmen

-- 
https://mail.python.org/mailman/listinfo/python-list



How to upload to Pythonhosted.org

2017-11-29 Thread waylan
I've been hosting documentation for many years on pythonhosted.org. However, I 
can't seem to upload any updates recently. The homepage at 
http://pythonhosted.org states:

> To upload documentation, go to your package edit page 
> (http://pypi.python.org/pypi?%3Aaction=pkg_edit=yourpackage), and fill 
> out the form at the bottom of the page.

However, there is no longer a form at the bottom of the edit page for uploading 
documentation. Instead I only see:

> If you would like to DESTROY any existing documentation hosted at 
> http://pythonhosted.org/ProjectName Use this button, There is no undo.
>
> [Destroy Documentation]

I also went to pypi.org and logged in there. But I don't see any options for 
editing my projects or uploading documentation on that site.

So, how do I upload an update to my documentation?

Waylan
-- 
https://mail.python.org/mailman/listinfo/python-list


Strange Behavior on Python 3 Windows Command Line

2012-02-13 Thread waylan
When I try running any Python Script on the command line with Python
3.2 I get this weird behavior. The cursor dances around the command
line window and nothing ever happens. Pressing Ctrl+C does nothing.
When I close the window (mouse click on X in top right corner), an
error dialog appears asking me to force it to close.

See a short (26 sec) video of it here: https://vimeo.com/36491748

Also, the printer suddenly starts printing multiple copies of the
contents of the command line window - which has wasted much paper.

Strangely it was working fine the other day. Then while debugging a
script it suddenly started doing this and now does this for every script
I've run in Python 3.2. Multiple system reboots had no effect.
I also have Python 2.5 & 2.7 installed and they work fine.

Even the most basic script results in this behavior:

    if __name__ == "__main__":
        print("Hello, World!")

In an attempt to check the exact version of Python, even this causes
the strange behavior:

c:\Python32\python.exe -V

I'm on Windows XP if that matters. IDLE (which works fine) tells me
I'm on Python 3.2.2

Any suggestions?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Strange Behavior on Python 3 Windows Command Line

2012-02-13 Thread Waylan Limberg
On Mon, Feb 13, 2012 at 3:16 PM, Arnaud Delobelle <arno...@gmail.com> wrote:
>> Strangely it was working fine the other day. Then while debugging a
>> script it suddenly started do this and now does this for every script
>
> How were you debugging?

I think I may have been attempting to use pipes to redirect stdin
and/or stdout when the problem first presented itself.  Unfortunately,
once I closed the window, I lost whatever pipe combination I had
tried.

It just occurred to me that I was unsure if I had been doing that pipe
correctly, and that maybe I overwrote python.exe. Sure enough, the
modify date on that file indicated I overwrote it. A re-install has
resolved the problem.

It's just a little embarrassing that I didn't think of that until now,
but the fact that everything else seems to work was throwing me off.
Of course, everything else was running `pythonw.exe` not `python.exe`.

Anyway, thanks for the pointer Arnaud.

-- 

\X/ /-\ `/ |_ /-\ |\|
Waylan Limberg
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to remove multiple occurrences of a string within a list?

2007-04-05 Thread waylan
On Apr 3, 6:05 pm, Steven Bethard [EMAIL PROTECTED] wrote:
> bahoo wrote:
> > The larger problem is, I have a list of strings that I want to remove
> > from another list of strings.
>
> If you don't care about the resulting order::
>
>     >>> items = ['foo', 'bar', 'baz', 'bar', 'foo', 'frobble']
>     >>> to_remove = ['foo', 'bar']
>     >>> set(items) - set(to_remove)
>     set(['frobble', 'baz'])

I'm surprised no one has mentioned any of the methods of set. For
instance:

>>> set.difference.__doc__
'Return the difference of two sets as a new set.\n\n(i.e. all
elements that are in this set but not in the other.)'
>>> set(items).difference(to_remove)
set(['frobble', 'baz'])

There are a few other cool methods of sets that come in handy for this
sort of thing. If only order could be preserved.

> If you do care about the resulting order::
>
>     >>> to_remove = set(to_remove)
>     >>> [item for item in items if item not in to_remove]
>     ['baz', 'frobble']
>
> STeVe
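
For reference, the two approaches discussed in this thread can be put side by
side as below. This is a small sketch in current Python 3 syntax; the
`remove_all` function name is made up for illustration and is not from any
of the posts.

```python
def remove_all(items, to_remove):
    """Return `items` without any element of `to_remove`, keeping order."""
    banned = set(to_remove)  # set membership tests are fast
    return [item for item in items if item not in banned]


items = ['foo', 'bar', 'baz', 'bar', 'foo', 'frobble']
to_remove = ['foo', 'bar']

# Order-preserving version (list comprehension with a set lookup).
ordered = remove_all(items, to_remove)

# Order-agnostic version (set.difference accepts any iterable).
unordered = set(items).difference(to_remove)

print(ordered)
print(unordered)
```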


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there an alternative to os.walk?

2006-10-04 Thread waylan
Bruce wrote:
> Hi all,
> I have a question about traversing file systems, and could use some
> help. Because of directories with many files in them, os.walk appears
> to be rather slow. I`m thinking there is a potential for speed-up since
> I don`t need os.walk to report filenames of all the files in every
> directory it visits. Is there some clever way to use os.walk or another
> tool that would provide functionality like os.walk except for the
> listing of the filenames?

You might want to check out the path module [1] (not os.path). The
following is from the docs:

> The method path.walk() returns an iterator which steps recursively
> through a whole directory tree. path.walkdirs() and path.walkfiles()
> are the same, but they yield only the directories and only the files,
> respectively.

Oh, and you can thank Paul Bissex for pointing me to path [2].

[1]: http://www.jorendorff.com/articles/python/path/
[2]: http://e-scribe.com/news/289
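
For anyone reading this later without the path module installed, a
directories-only walk can also be sketched with the standard library. Note
this uses `os.scandir` (available since Python 3.5), not the path module
mentioned above, and the `walk_dirs` function name is made up for
illustration. Whether it is actually faster for a given tree is something to
measure, not assume.

```python
import os


def walk_dirs(top):
    """Yield `top` and every directory below it, never reporting file names."""
    stack = [top]
    while stack:
        current = stack.pop()
        yield current
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # Symlinked directories are skipped to avoid loops.
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            # Unreadable directory: skip it and carry on.
            continue
```

Usage would be something like `for d in walk_dirs("/some/top"): ...`, with
the path replaced by a real directory.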

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: page contents are not refreshed

2006-09-13 Thread waylan
Gleb Rybkin wrote:
> when running apache, mod_python in windows.
>
> This looks pretty strange. Creating a simple python file that shows
> current time will correctly display the time in apache the first time,
> but freezes afterwards and shows the same time on all subsequent clicks
> as long as the file is not modified.
>
> Any ideas what's wrong? Thanks.

The first time the page was requested mod_python compiled and loaded
your code. Every request after that mod_python refers to the already
loaded code in memory in which your expression had already been
evaluated the first time.

Therefore, you need to make curtime a 'callable object' so that it will
be re-evaluated on each request. Unfortunately, I don't recall if simply
wrapping your strftime() expression in a function will be enough or if
it's more complex than that. That said, I **think** this should work:

    from mod_python import apache
    from time import strftime, gmtime

    def curtime():
        # Called on every request, so the time is recomputed each time.
        return strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime())

    def handler(req):
        req.content_type = "text/plain"
        req.send_http_header()
        req.write(str(curtime()))
        return apache.OK
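
The difference between "evaluated once when the module is loaded" and
"evaluated on every call" can be seen without mod_python at all. The sketch
below is an illustration only: a counter stands in for the clock so the
result is repeatable, and all of the names (`stamp_at_import`,
`stamp_per_call`, `handler_frozen`, `handler_fresh`) are made up.

```python
import itertools

_counter = itertools.count(1)  # stands in for the system clock

# Module-level expression: runs once, when the module is first loaded.
stamp_at_import = next(_counter)


def stamp_per_call():
    # Function body: runs again on every call.
    return next(_counter)


def handler_frozen():
    # Returns the value computed at load time, so it never changes.
    return stamp_at_import


def handler_fresh():
    # Recomputes the value on each request.
    return stamp_per_call()


print(handler_frozen(), handler_frozen())  # same value twice
print(handler_fresh(), handler_fresh())    # two different values
```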

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: page contents are not refreshed

2006-09-13 Thread waylan
Steve Holden wrote:
> waylan wrote:
[snip]
> >
> >     from mod_python import apache
> >     from time import strftime, gmtime
> >
> >     def curtime():
> >         return strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime())
> >
> >     def handler(req):
> >         req.content_type = "text/plain"
> >         req.send_http_header()
> >         req.write(str(curtime()))
> >         return apache.OK
> >
> This is a very long way round for a shortcut (though it does have the
> merit of working). Why not just
>
>     def handler(req):
>         req.content_type = "text/plain"
>         req.send_http_header()
>         curtime = strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime())
>         req.write(str(curtime))
>         return apache.OK
>
> Or even
>
>     def handler(req):
>         req.content_type = "text/plain"
>         req.send_http_header()
>         req.write(strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime()))
>         return apache.OK


While Steve's examples certainly do the trick in this limited case, I
assumed that the original poster was just starting with mod_python and
I was simply trying to explain the bigger picture for future reference.
As one develops more sophisticated code, simply adding it to the
`handler` function becomes less desirable. Recognizing that anything
that must be reevaluated on each request must be callable will be a
bigger help IMHO.

Steve's examples work because the current time is evaluated within
`handler` and:

>>> callable(handler)
True

While in the original example:

>>> callable(curtime)
False

Yet in my example:

>>> callable(curtime)
True

Finally, by way of explanation:

>>> callable.__doc__
'callable(object) -> bool\n\nReturn whether the object is callable
(i.e., some kind of function).\nNote that classes are callable, as are
instances with a __call__() method.'
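
The same checks can be run as a standalone script, with no mod_python
involved. This is only a sketch: `curtime_value` and the `Clock` class are
names made up for illustration, and the +0000 offset assumes the time comes
from `gmtime()` as in the examples above.

```python
from time import gmtime, strftime

FORMAT = "%a, %d %b %Y %H:%M:%S +0000"

# A plain string, computed once: not callable.
curtime_value = strftime(FORMAT, gmtime())


def curtime():
    # A function: callable, recomputed on each call.
    return strftime(FORMAT, gmtime())


class Clock:
    # An instance with __call__ is callable too.
    def __call__(self):
        return strftime(FORMAT, gmtime())


print(callable(curtime_value))
print(callable(curtime))
print(callable(Clock()))
```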

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Using Beautiful Soup to entangle bookmarks.html

2006-09-07 Thread waylan

Diez B. Roggisch wrote:
> suppose it is well-formed, most probably even xml.

Maybe not. Otherwise, why would there be a script like this one[1]?
Anyway, I found that and other scripts that work with firefox
bookmarks.html files with a quick search [2]. Perhaps you will find
something there that is helpful.

[1]:
http://www.physic.ut.ee/~kkannike/english/prog/python/util/bookmarks/code/bookmarks.py
[2]: http://www.google.com/search?q=firefox+bookmarks.html+python

Waylan

-- 
http://mail.python.org/mailman/listinfo/python-list