[issue41989] htmlparser unclosed script tag causes data loss
Change by Waylan Limberg:

--
keywords: +patch
pull_requests: +21635
stage: -> patch review
pull_request: https://github.com/python/cpython/pull/22658

___
Python tracker <https://bugs.python.org/issue41989>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue41989] htmlparser unclosed script tag causes data loss
New submission from Waylan Limberg:

When the `close` method of `HTMLParser` is called, any cached text data is generally flushed and passed to a `data` event, except when the parser is in CDATA mode (`self.cdata_elem` is set). Specifically, if an unclosed `script` or `style` tag has been encountered, a call to `close` does not flush the data. A simple test which demonstrates the issue is attached.

I see that in Lib/html/parser.py#L244-L249 there are two nested if statements which both check for `not self.cdata_elem`. If the first check passes, the nested one can never fail, so it is redundant. This block of code needs a branch for when `self.cdata_elem` is set.

I should note that the input is invalid HTML. However, the existing behavior results in data loss. Within any other unclosed tag (other than `script` or `style`), data is still flushed and passed to a `data` event. I would expect the same behavior here, although the data escaping behavior should perhaps be applied as it is with data within properly closed tags.

--
components: Library (Lib)
files: test_html.py
messages: 378359
nosy: waylan
priority: normal
severity: normal
status: open
title: htmlparser unclosed script tag causes data loss
type: behavior
versions: Python 3.10, Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 3.9

Added file: https://bugs.python.org/file49505/test_html.py
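The attached test_html.py is not reproduced in this archive. As a rough stand-in, the sketch below shows one way the reported behavior could be observed. The `DataCollector` class and the `collect` helper are hypothetical names chosen for illustration and are not taken from the attachment.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(markup):
    """Feed markup to a fresh parser, close it, and return the data chunks."""
    parser = DataCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks


# Text inside an ordinary unclosed tag reaches handle_data.
print(collect("<p>unclosed paragraph"))

# Text inside an unclosed script tag is the case described in the report.
print(collect("<script>unclosed script"))
```

On the Python versions listed in the report, the second call would be expected to print an empty list, which is the data loss described above. What it prints on other versions depends on whether a fix is present, so no output is asserted for it here.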
Re: How to upload to Pythonhosted.org
After asking here, I found a mailing list post here:

https://mail.python.org/pipermail/distutils-sig/2015-May/026381.html

That post outlines a roadmap for shutting down pythonhosted. Unfortunately, it seems that they skipped from step 1 to step 5 without bothering with steps 2, 3, & 4. In any event, that list discussion seems to be the official word that things are being shut down, which was what I was looking for. It's unfortunate that things weren't done more smoothly.

Also, it seems that if you want to avoid search results showing up for the pythonhosted content after you find a new host, they at least provide a way to "delete" the content from pythonhosted. That way, Google will stop indexing it and stop including it in search results. Unfortunately, all the existing links across the internet are now dead with no way to redirect people.

Waylan

On Thursday, November 30, 2017 at 1:47:32 PM UTC-5, Irmen de Jong wrote:
> On 11/30/2017 03:31 AM, Ben Finney wrote:
> > Irmen de Jong <irmen.nos...@xs4all.nl> writes:
> >
> >> On 11/30/2017 02:06 AM, waylan wrote:
> >>> So, how do I upload an update to my documentation?
> >>
> >> I ran into the same issue. From what I gathered, Pythonhosted.org is
> >> in the process of being dismantled and it hasn't allowed new doc
> >> uploads for quite some time now. I switched to using readthedocs.io
> >> instead.
> >
> > The issue that many are facing is how to update the pages *at the
> > existing URL* to tell visitors where to go next. Cool URIs don't change
> > <URL:https://www.w3.org/Provider/Style/URI.html> but, when they do, we
> > are obliged to update the existing pages to point to the new ones.
>
> Sorry, yes, that is the problem I experience as well. My library's old version
> documentation is somehow frozen on Pythonhosted.org (and obviously still pops up as the
> first few google hits).
>
> > So, if pythonhosted.org is indeed being dismantled, there should be a
> > way to update the pages there for informing visitor where they should go
> > next.
> >
> > If that's not possible and instead the service is just locked down,
> > that's IMO a mistake.
>
> I agree with that. I think it's an unsolved issue until now, that gets some discussion
> in this github issue https://github.com/pypa/warehouse/issues/582
>
> Irmen

--
https://mail.python.org/mailman/listinfo/python-list
How to upload to Pythonhosted.org
I've been hosting documentation for many years on pythonhosted.org. However, I can't seem to upload any updates recently. The homepage at http://pythonhosted.org states:

> To upload documentation, go to your package edit page
> (http://pypi.python.org/pypi?%3Aaction=pkg_edit=yourpackage), and fill
> out the form at the bottom of the page.

However, there is no longer a form at the bottom of the edit page for uploading documentation. Instead I only see:

> If you would like to DESTROY any existing documentation hosted at
> http://pythonhosted.org/ProjectName Use this button, There is no undo.
>
> [Destroy Documentation]

I also went to pypi.org and logged in there. But I don't see any options for editing my projects or uploading documentation on that site.

So, how do I upload an update to my documentation?

Waylan
Strange Behavior on Python 3 Windows Command Line
When I try running any Python script on the command line with Python 3.2, I get this weird behavior: the cursor dances around the command line window and nothing ever happens. Pressing Ctrl+C does nothing. When I close the window (mouse click on X in top right corner), an error dialog appears asking me to force it to close.

See a short (26 sec) video of it here: https://vimeo.com/36491748

Also, the printer suddenly starts printing multiple copies of the contents of the command line window, which has wasted much paper.

Strangely, it was working fine the other day. Then while debugging a script it suddenly started doing this, and it now does this for every script I've run in Python 3.2. Multiple system reboots had no effect. I also have Python 2.5 and 2.7 installed and they work fine.

Even the most basic script results in this behavior:

    if __name__ == "__main__":
        print("Hello, World!")

In an attempt to check the exact version of Python, even this causes the strange behavior:

    c:\Python32\python.exe -V

I'm on Windows XP if that matters. IDLE (which works fine) tells me I'm on Python 3.2.2.

Any suggestions?
Re: Strange Behavior on Python 3 Windows Command Line
On Mon, Feb 13, 2012 at 3:16 PM, Arnaud Delobelle arno...@gmail.com wrote:
>> Strangely it was working fine the other day. Then while debugging a
>> script it suddenly started do this and now does this for every script
>
> How were you debugging?

I think I may have been attempting to use pipes to redirect stdin and/or stdout when the problem first presented itself. Unfortunately, once I closed the window, I lost whatever pipe combination I had tried.

It just occurred to me that I was unsure whether I had been doing that pipe correctly, and that maybe I had overwritten python.exe. Sure enough, the modify date on that file indicated I overwrote it. A re-install has resolved the problem.

It's just a little embarrassing that I didn't think of that until now, but the fact that everything else seemed to work was throwing me off. Of course, everything else was running `pythonw.exe`, not `python.exe`.

Anyway, thanks for the pointer, Arnaud.

--
\X/ /-\ `/ |_ /-\ |\|
Waylan Limberg
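The check described above (looking at the modify date of the interpreter binary) can also be scripted. The following is only a sketch: `describe_file` is a hypothetical helper name, and it assumes the script is run from an interpreter that still works, since a damaged python.exe cannot inspect itself.

```python
import os
import sys
from datetime import datetime


def describe_file(path):
    """Return the size and last-modified time of the file at path."""
    info = os.stat(path)
    return {
        "path": path,
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime),
    }


if __name__ == "__main__":
    # sys.executable is the interpreter running this script. To inspect a
    # different binary, pass its path instead, for example the python.exe
    # of another installation.
    print(describe_file(sys.executable))
```

An unexpectedly recent modification time or an unusually small size would be a hint that the file was overwritten, for instance by an output redirect pointed at the wrong name.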
Re: how to remove multiple occurrences of a string within a list?
On Apr 3, 6:05 pm, Steven Bethard [EMAIL PROTECTED] wrote:
> bahoo wrote:
> > The larger problem is, I have a list of strings that I want to remove
> > from another list of strings.
>
> If you don't care about the resulting order::
>
>     >>> items = ['foo', 'bar', 'baz', 'bar', 'foo', 'frobble']
>     >>> to_remove = ['foo', 'bar']
>     >>> set(items) - set(to_remove)
>     set(['frobble', 'baz'])

I'm surprised no one has mentioned any of the methods of set. For instance:

    >>> set.difference.__doc__
    'Return the difference of two sets as a new set.\n\n(i.e. all elements that are in this set but not in the other.)'
    >>> set(items).difference(to_remove)
    set(['frobble', 'baz'])

There are a few other cool methods of sets that come in handy for this sort of thing. If only order could be preserved.

> If you do care about the resulting order::
>
>     >>> to_remove = set(to_remove)
>     >>> [item for item in items if item not in to_remove]
>     ['baz', 'frobble']
>
> STeVe
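For reference, the two approaches from this message can be combined into one small Python 3 sketch. The function name `remove_all` is hypothetical; the technique is the same set lookup plus list comprehension quoted above.

```python
def remove_all(items, to_remove):
    """Return items without any element of to_remove, keeping the order."""
    blocked = set(to_remove)  # set membership tests are fast
    return [item for item in items if item not in blocked]


items = ['foo', 'bar', 'baz', 'bar', 'foo', 'frobble']

# Order-preserving removal.
print(remove_all(items, ['foo', 'bar']))  # ['baz', 'frobble']

# set.difference accepts any iterable, but the result is unordered.
print(set(items).difference(['foo', 'bar']) == {'baz', 'frobble'})  # True
```

Converting `to_remove` to a set once, outside the comprehension, avoids rescanning a list for every element of `items`.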
Re: Is there an alternative to os.walk?
Bruce wrote:
> Hi all,
> I have a question about traversing file systems, and could use some
> help. Because of directories with many files in them, os.walk appears
> to be rather slow. I'm thinking there is a potential for speed-up since
> I don't need os.walk to report filenames of all the files in every
> directory it visits. Is there some clever way to use os.walk or another
> tool that would provide functionality like os.walk except for the
> listing of the filenames?

You might want to check out the path module [1] (not os.path). The following is from the docs:

> The method path.walk() returns an iterator which steps recursively
> through a whole directory tree. path.walkdirs() and path.walkfiles()
> are the same, but they yield only the directories and only the files,
> respectively.

Oh, and you can thank Paul Bissex for pointing me to path [2].

[1]: http://www.jorendorff.com/articles/python/path/
[2]: http://e-scribe.com/news/289
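A directories-only traversal can also be sketched with the standard library alone. Note that this is a substitute for, not an example of, the path module mentioned above: it uses `os.scandir`, which exists only in much newer Python versions than this thread, and `walk_dirs` is a hypothetical name.

```python
import os


def walk_dirs(top):
    """Yield the path of every directory below top (top itself excluded)."""
    stack = [top]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                # Keep only subdirectories; file entries are skipped
                # rather than collected into a list.
                subdirs = [entry.path for entry in entries
                           if entry.is_dir(follow_symlinks=False)]
        except OSError:
            continue  # unreadable directory: skip it
        for subdir in subdirs:
            yield subdir
            stack.append(subdir)


if __name__ == "__main__":
    for directory in walk_dirs("."):
        print(directory)
```

The operating system still has to enumerate every entry in each directory, so this does not make huge directories free to visit. It only avoids building and returning the per-directory file lists.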
Re: page contents are not refreshed
Gleb Rybkin wrote:
> when running apache, mod_python in windows.
>
> This looks pretty strange. Creating a simple python file that shows
> current time will correctly display the time in apache the first time,
> but freezes afterwards and shows the same time on all subsequent clicks
> as long as the file is not modified.
>
> Any ideas what's wrong? Thanks.

The first time the page was requested, mod_python compiled and loaded your code. On every request after that, mod_python refers to the already loaded code in memory, in which your expression had already been evaluated the first time. Therefore, you need to make curtime a 'callable object' so that it will be re-evaluated on each request. Unfortunately, I don't recall if simply wrapping your strftime() expression in a function will be enough or if it's more complex than that. That said, I **think** this should work:

    from mod_python import apache
    from time import strftime, gmtime

    def curtime():
        return strftime("%a, %d %b %Y %H:%M:%S +", gmtime())

    def handler(req):
        req.content_type = "text/plain"
        req.send_http_header()
        req.write(str(curtime()))
        return apache.OK
Re: page contents are not refreshed
Steve Holden wrote:
> waylan wrote:
> [snip]
> > from mod_python import apache
> > from time import strftime, gmtime
> >
> > def curtime():
> >     return strftime("%a, %d %b %Y %H:%M:%S +", gmtime())
> >
> > def handler(req):
> >     req.content_type = "text/plain"
> >     req.send_http_header()
> >     req.write(str(curtime()))
> >     return apache.OK
>
> This is a very long way round for a shortcut (though it does have the
> merit of working). Why not just
>
>     def handler(req):
>         req.content_type = "text/plain"
>         req.send_http_header()
>         curtime = strftime("%a, %d %b %Y %H:%M:%S +", gmtime())
>         req.write(str(curtime))
>         return apache.OK
>
> Or even
>
>     def handler(req):
>         req.content_type = "text/plain"
>         req.send_http_header()
>         req.write(strftime("%a, %d %b %Y %H:%M:%S +", gmtime()))
>         return apache.OK

While Steve's examples certainly do the trick in this limited case, I assumed that the original poster was just starting with mod_python, and I was simply trying to explain the bigger picture for future reference. As one develops more sophisticated code, simply adding it to the `handler` function becomes less desirable. Recognizing that anything that must be reevaluated on each request must be callable will be a bigger help, IMHO.

Steve's examples work because the current time is evaluated within `handler` and:

    >>> callable(handler)
    True

While in the original example:

    >>> callable(curtime)
    False

Yet in my example:

    >>> callable(curtime)
    True

Finally, by way of explanation:

    >>> callable.__doc__
    'callable(object) -> bool\n\nReturn whether the object is callable (i.e., some kind of function).\nNote that classes are callable, as are instances with a __call__() method.'
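The distinction can be reproduced without mod_python. In the sketch below, `frozen_time` and `current_time` are hypothetical names, and the module-level assignment is an assumption about what the original poster's file looked like, since that file is not shown in the thread.

```python
from time import gmtime, strftime

TIME_FORMAT = "%a, %d %b %Y %H:%M:%S"

# Evaluated once, when the module is first loaded. Every later use sees
# the same string, which matches the "frozen" page described above.
frozen_time = strftime(TIME_FORMAT, gmtime())


def current_time():
    """Evaluated again on every call, so the value stays current."""
    return strftime(TIME_FORMAT, gmtime())


print(callable(frozen_time))   # False: it is just a string
print(callable(current_time))  # True: it is a function
```

Anything a long-running server process should recompute per request needs to live inside a function (or another callable) that the request handler invokes, rather than at module level.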
Re: Using Beautiful Soup to entangle bookmarks.html
Diez B. Roggisch wrote:
> suppose it is well-formed, most probably even xml.

Maybe not. Otherwise, why would there be a script like this one [1]? Anyway, I found that and other scripts that work with firefox bookmarks.html files with a quick search [2]. Perhaps you will find something there that is helpful.

[1]: http://www.physic.ut.ee/~kkannike/english/prog/python/util/bookmarks/code/bookmarks.py
[2]: http://www.google.com/search?q=firefox+bookmarks.html+python

Waylan