[Tutor] Python Websocket Server
Hi,

What is the best practice for writing WebSocket servers in Python in 2019? I found an old example on the web that used the tornado library, but it talked about Chrome 22 as the client, which is ancient now, so I'm not sure whether things have changed.

Any suggestions on the best library to use would be gratefully accepted. Ideally I'd like to be able to do it in an asynchronous manner.
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Extract main text from HTML document
That looks like a useful combination. Thanks.

On 6 May 2018 at 17:32, Mark Lawrence wrote:
> On 05/05/18 18:59, Simon Connah wrote:
>>
>> Hi,
>>
>> I'm writing a very simple web scraper. It'll download a page from a
>> website and then store the result in a database of some sort. The
>> problem is that this will obviously include a whole heap of HTML,
>> JavaScript and maybe even some CSS. None of which is useful to me.
>>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>>
>> The title is obviously easy but the main body of text could contain
>> all sorts of HTML and I'm interested to know how I might go about
>> removing the bits that are not needed but still keep the meaning of
>> the document intact.
>>
>> Does anyone have any suggestions on this front at all?
>>
>> Thanks for any help.
>>
>> Simon.
>
>
> A combination of requests http://docs.python-requests.org/en/master/ and
> beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ should
> fit the bill. Both are installable with pip and are regarded as best of
> breed.
>
> --
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
>
> Mark Lawrence
Re: [Tutor] Extract main text from HTML document
Thanks for the replies, everyone. Beautiful Soup looks like a good option.

My primary goal is to extract the main body text, the title and the meta description from a web page, then run them through one of the cloud natural language processing services to find out some information I'd like to know. I'd like to do this for quite a few websites. It is all for a little project I have in mind. I'm not even sure it'll work, but it'll be fun to try.

I might have to do some custom work on top of what Beautiful Soup offers, though, as I need to get very specific data in a certain format.

On 5 May 2018 at 22:43, boB Stepp wrote:
> On Sat, May 5, 2018 at 12:59 PM, Simon Connah wrote:
>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>
> I do not have any experience with this, but I like to collect books.
> One of them [1] says on page 245:
>
> "Beautiful Soup is a module for extracting information from an HTML
> page (and is much better for this purpose than regular expressions)."
>
> I believe this topic has come up before on this list as well as the
> main Python list. You may want to check it out. It can be installed
> with pip.
>
> [1] "Automate the Boring Stuff with Python -- Practical Programming
> for Total Beginners" by Al Sweigart.
>
> HTH!
> --
> boB
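For anyone reading the archive, here is a rough sketch of the three extractions mentioned above (title, meta description, visible body text). It deliberately uses the standard library's html.parser instead of Beautiful Soup so that it runs without installing anything; the class name MainTextExtractor and the SAMPLE_HTML string are made up for illustration, and real pages will need more care (navigation menus, comments, malformed markup) than this handles.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the title, meta description and visible text of a page.

    Hypothetical helper for illustration; not part of any library.
    """

    # Tags whose contents are never shown to the reader.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script>, <style> or <noscript>
        chunk = data.strip()
        if not chunk:
            return  # whitespace between tags
        if self._in_title:
            self.title += chunk
        else:
            self.text_parts.append(chunk)

    @property
    def text(self):
        return " ".join(self.text_parts)


# Made-up sample page; in practice this would be the downloaded HTML.
SAMPLE_HTML = """<html><head><title>Example page</title>
<meta name="description" content="A short demo page.">
<style>body { color: red; }</style></head>
<body><h1>Heading</h1><p>First <b>bold</b> paragraph.</p>
<script>console.log("ignored");</script></body></html>"""

parser = MainTextExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.title)
print(parser.description)
print(parser.text)
```

With Beautiful Soup the same idea is usually much shorter, so treat this only as a picture of what "strip the markup, keep the text" involves.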
[Tutor] Extract main text from HTML document
Hi,

I'm writing a very simple web scraper. It'll download a page from a website and then store the result in a database of some sort. The problem is that this will obviously include a whole heap of HTML, JavaScript and maybe even some CSS. None of which is useful to me.

I was wondering if there was a way in which I could download a web page and then just extract the main body of text without all of the HTML.

The title is obviously easy but the main body of text could contain all sorts of HTML and I'm interested to know how I might go about removing the bits that are not needed but still keep the meaning of the document intact.

Does anyone have any suggestions on this front at all?

Thanks for any help.

Simon.
[Tutor] Async TCP Server
Hi,

I've come up with an idea for a new protocol I want to implement in Python using 3.6 (or maybe 3.7 when that comes out), but I'm somewhat confused about how to do it in an async way.

The way I understand it is that you have a loop that waits for an incoming request and then calls a function/method asynchronously which handles the incoming request. While that is happening the main event loop is still listening for incoming connections.

Is that roughly correct?

The idea is to have a chat application that can at least handle a few hundred clients if not more in the future. I'm planning on using Python because I am pretty up-to-date with it, but I've never written a network server before.

Also another quick question. Does Python support async database operations? I'm thinking of the psycopg2-binary database driver. That way I can offload the storage in the database while still handling incoming connections.

If I have misunderstood anything, any clarification would be much appreciated.

Simon.
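The model described above (one loop accepting connections, one handler per connection) maps fairly directly onto asyncio's stream API. The following is only a minimal sketch, assuming Python 3.7+ for asyncio.run and a made-up line-based echo protocol; handle_client and demo are illustrative names, and a real chat server would also need to track connected clients and broadcast between them.

```python
import asyncio


async def handle_client(reader, writer):
    """Serve one connection: echo each line back until the client disconnects."""
    while True:
        line = await reader.readline()
        if not line:  # empty bytes means the client closed the connection
            break
        writer.write(b"echo: " + line)
        await writer.drain()
    writer.close()


async def demo():
    # start_server runs handle_client as a separate task for every
    # connection, so the event loop keeps accepting new clients meanwhile.
    # Port 0 asks the operating system for any free port.
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # A throwaway client, just to show the round trip.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"hello\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()
    return reply


reply = asyncio.run(demo())
print(reply)
```

A long-running server would typically call server.serve_forever() instead of closing straight away; the client half here exists only so the sketch can be run on its own.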
Re: [Tutor] Proper way to unit test the raising of exceptions?
Awesome. Thank you all. Your solutions are great and should make the whole process a lot simpler. The only problem is that some_func() on my end is a Django model with about 8 named arguments, so it might be a bit of a pain passing all of those arguments. The context manager example seems like a perfect fit for that particular problem.

Thanks again. All of your help is much appreciated.

On Sunday, 1 April 2018, 16:32:11 BST, Mats Wichmann wrote:

On 04/01/2018 09:10 AM, Peter Otten wrote:
> Simon Connah via Tutor wrote:
>
>> Hi,
>> I'm just wondering what the accepted way to handle unit testing exceptions
>> is? I know you are meant to use assertRaises, but my code seems a little
>> off.
>
>> try:
>>     some_func()
>> except SomeException:
>>     self.assertRaises(SomeException)
>
> The logic is wrong here as you surmise below. If you catch the exception
> explicitly you have to write
>
> try:
>     some_func()
> except SomeException:
>     pass  # OK
> else:
>     self.fail("no exception raised")

If you use PyTest, the procedure is pretty well documented:

https://docs.pytest.org/en/latest/assert.html
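For reference, the context manager form of assertRaises looks roughly like this. SomeException and some_func are stand-ins (the real one is a Django model with more arguments), so treat the names and the keyword argument as hypothetical; the point is that keyword arguments are passed in the ordinary way inside the with block.

```python
import unittest


class SomeException(Exception):
    """Stand-in for the exception under test (hypothetical)."""


def some_func(name, *, strict=False):
    """Hypothetical function that raises when given an empty name."""
    if strict and not name:
        raise SomeException("name must not be empty")
    return name


class SomeFuncTests(unittest.TestCase):
    def test_raises_on_empty_name(self):
        # The test fails automatically if the block exits without raising.
        with self.assertRaises(SomeException) as caught:
            some_func("", strict=True)
        # The caught exception is available for further checks.
        self.assertIn("must not be empty", str(caught.exception))

    def test_callable_form(self):
        # Equivalent form: pass the callable and its arguments separately.
        self.assertRaises(SomeException, some_func, "", strict=True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SomeFuncTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

The last three lines only run the tests inline for demonstration; in a normal project the test runner discovers the TestCase class by itself.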
[Tutor] Proper way to unit test the raising of exceptions?
Hi,

I'm just wondering what the accepted way to handle unit testing exceptions is? I know you are meant to use assertRaises, but my code seems a little off.

try:
    some_func()
except SomeException:
    self.assertRaises(SomeException)

Is there a better way to do this at all? The problem with the above code is that if no exception is raised, the except block is skipped completely and the unit test passes. So I considered adding:

self.fail('No exception raised')

at the end, outside of the except block, but I'm not sure if that is what I need to do.

Any help is appreciated.
[Tutor] Does the secrets module in Python 3.6 use a hardware RNG like that provided in Intel CPUs?
Hi,

I was reading through the secrets documentation in Python 3.6 and noticed that it uses /dev/urandom, but I'm unsure whether that means it'll use a hardware RNG or just one provided by the operating system (Linux / Windows / etc.) in software.

The question is: is it possible to determine the source of the randomness from os.urandom, in case a flaw is ever found in a particular hardware RNG? Plus, systems could have a third-party hardware RNG on an external add-on card or similar, which might be better than the one found in Intel CPUs.

I'm just a bit curious about the whole "will always use the strongest source for pseudo-random numbers" claim when research could change that assumption overnight based on discovered flaws.

This is probably a really stupid question, and if it is I apologise, but I'm somewhat confused.

Thanks for any help.
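To make the question concrete: the secrets module is a thin layer over the operating system's generator (os.urandom, via random.SystemRandom), and the calls below are all it gives you. As far as I can tell, none of them report which underlying source, hardware or software, the operating system drew from, so any answer would have to come from the OS rather than from Python. This is only a sketch of the interface, not a statement about what any particular kernel does.

```python
import os
import secrets

# 16 random bytes from the OS, rendered as 32 hexadecimal characters.
token = secrets.token_hex(16)

# The same OS-level source, returned as raw bytes.
raw = os.urandom(16)

# random.Random-style API backed by os.urandom instead of a seeded PRNG.
rng = secrets.SystemRandom()
pick = rng.randrange(100)

print(token)
print(len(raw))
print(pick)
```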