Re: BeautifulSoup doesn't work with a threaded input queue?
Ah, shoot me. I had a .join() statement on the output queue but not on the
input queue, so the threads feeding the input queue got terminated before
BeautifulSoup could get started. I went down that same rabbit hole with
CSVWriter the other day. *sigh*

Thanks for everyone's help.

Chris R.
--
https://mail.python.org/mailman/listinfo/python-list
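For anyone hitting the same wall, here is a minimal sketch of that pattern,
assuming the missing call was Queue.join() paired with task_done() (the
names and the stand-in "parsing" are hypothetical, not the OP's actual code):

import threading
from queue import Queue

input_q = Queue()    # pages waiting to be parsed
output_q = Queue()   # parsed results

def parse(in_q, out_q):
    while True:
        page = in_q.get()
        out_q.put(page.upper())  # stand-in for the BeautifulSoup work
        in_q.task_done()         # lets in_q.join() see progress

# Daemon workers die with the main thread; the join on the input queue
# is what keeps the pipeline alive until the last page is parsed.
for _ in range(4):
    threading.Thread(target=parse, args=(input_q, output_q),
                     daemon=True).start()

for i in range(10):
    input_q.put("page-{}".format(i))

input_q.join()   # the missing call: block until every page is processed
while not output_q.empty():
    print(output_q.get())

Without that final join, the main thread can run past the pipeline (or
exit) while the workers are still mid-page, which matches the symptom of
BeautifulSoup apparently "doing nothing".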
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer via Python-list wrote:

> On 8/27/2017 1:31 PM, Peter Otten wrote:
>
>> Here's a simple example that extracts titles from generated html. It
>> seems to work. Does it resemble what you do?
>
> Your example is similar to my code when I'm using a list for the input
> to the parser. You have soup_threads and write_threads, but no
> read_threads.
>
> The particular website I'm scraping requires checking each page for the
> sentinel value (i.e., "Sorry, no more comments") in order to determine
> when to stop requesting pages.

Where's that check happening? If it's in the soup thread, you need some
kind of back channel to the read threads to inform them that you need no
more pages.

> For my comment history that's ~750 pages to parse ~11,000 comments.
>
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser. My soup_threads can get
> items from the queue, but BeautifulSoup doesn't do anything after that.
>
> Chris R.

--
https://mail.python.org/mailman/listinfo/python-list
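Here is one hedged sketch of such a back channel, using a threading.Event
that the soup thread sets when it sees the sentinel page (the fetch() stub,
page counts, and thread layout are invented for the example):

import threading
from queue import Queue, Empty

TOTAL_PAGES = 30  # pretend the site has 30 pages of comments
NUM_READERS = 5

stop = threading.Event()   # back channel: parser -> readers
page_q = Queue()

def fetch(page_number):
    # Stand-in for requests.get(); past the end the site says so.
    if page_number < TOTAL_PAGES:
        return "comments for page {}".format(page_number)
    return "Sorry, no more comments"

def read_pages(reader_id):
    page_number = reader_id
    while not stop.is_set():           # check the back channel each round
        page_q.put(fetch(page_number))
        page_number += NUM_READERS     # readers interleave page numbers

def parse_pages():
    while True:
        try:
            page = page_q.get(timeout=1)
        except Empty:
            return                     # readers stopped and queue drained
        if "Sorry, no more comments" in page:
            stop.set()                 # tell every reader to stop requesting

readers = [threading.Thread(target=read_pages, args=(i,))
           for i in range(NUM_READERS)]
parser = threading.Thread(target=parse_pages)
for t in readers + [parser]:
    t.start()
for t in readers + [parser]:
    t.join()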
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer writes:

> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser.

Given how slow parsing is, you probably want to scrape the pages into
disk files, and then run the parser in parallel processes that read from
the disk. You could also use something like Redis (redis.io) as a queue.

--
https://mail.python.org/mailman/listinfo/python-list
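A rough sketch of that split, assuming the requestor threads have already
saved each page under a hypothetical pages/*.html layout and that comments
live in a made-up div.comment element:

import glob
import multiprocessing

import bs4

def parse_file(path):
    # Each worker process parses one page previously saved to disk.
    with open(path, encoding="utf-8") as f:
        soup = bs4.BeautifulSoup(f.read(), "lxml")
    return [tag.get_text() for tag in soup.find_all("div", class_="comment")]

if __name__ == "__main__":
    # Stage 1 (not shown): the threaded requestors write each page to
    # pages/NNNN.html instead of putting it on a queue.
    with multiprocessing.Pool() as pool:
        for comments in pool.imap_unordered(parse_file,
                                            glob.glob("pages/*.html")):
            print(len(comments), "comments")

Separate processes sidestep the GIL for the CPU-bound parsing, and the
saved pages double as a cache if a parse run has to be repeated.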
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 1:50 PM, MRAB wrote:

> What if you don't sort the list?
>
> I ask because it sounds like you're changing 2 variables (i.e.
> list->queue, sorted->unsorted) at the same time, so you can't be sure
> that it's the queue that's the problem.

If I'm using a list, I'm using a for loop to feed items into the parser.
If I'm using a queue, I'm using worker threads to put or get items. The
items themselves are the same whether they come from a list or a queue.

Chris R.
--
https://mail.python.org/mailman/listinfo/python-list
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 1:31 PM, Peter Otten wrote:

> Here's a simple example that extracts titles from generated html. It
> seems to work. Does it resemble what you do?

Your example is similar to my code when I'm using a list for the input
to the parser. You have soup_threads and write_threads, but no
read_threads.

The particular website I'm scraping requires checking each page for the
sentinel value (i.e., "Sorry, no more comments") in order to determine
when to stop requesting pages. For my comment history that's ~750 pages
to parse ~11,000 comments.

I have 20 read_threads requesting and putting pages into the output
queue that is the input_queue for the parser. My soup_threads can get
items from the queue, but BeautifulSoup doesn't do anything after that.

Chris R.
--
https://mail.python.org/mailman/listinfo/python-list
Re: BeautifulSoup doesn't work with a threaded input queue?
On 2017-08-27 21:35, Christopher Reimer via Python-list wrote:

> On 8/27/2017 1:12 PM, MRAB wrote:
>
>> What do you mean by "queue (random order)"? A queue is sequential
>> order, first-in-first-out.
>
> With 20 threads requesting 20 different pages, they're not going into
> the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19), and
> they arrive at different times for the parser worker threads to pick
> up for processing.
>
> Similar situation with a list, but I sort the list before giving it to
> the parser, so all the items are in sequential order and fed to the
> parser one at a time.

What if you don't sort the list?

I ask because it sounds like you're changing 2 variables (i.e.
list->queue, sorted->unsorted) at the same time, so you can't be sure
that it's the queue that's the problem.

--
https://mail.python.org/mailman/listinfo/python-list
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 1:12 PM, MRAB wrote:

> What do you mean by "queue (random order)"? A queue is sequential
> order, first-in-first-out.

With 20 threads requesting 20 different pages, they're not going into
the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19), and they
arrive at different times for the parser worker threads to pick up for
processing.

Similar situation with a list, but I sort the list before giving it to
the parser, so all the items are in sequential order and fed to the
parser one at a time.

Chris R.
--
https://mail.python.org/mailman/listinfo/python-list
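For what it's worth, a queue hands items out strictly in the order they
were put; the "randomness" here is only which of the 20 threads happens to
put first. If sequential page order matters downstream, one option (not
something from this thread) is to tag each page with its number and sort at
the last stage:

from queue import Queue

page_q = Queue()

# Readers put (page_number, content) pairs, so arrival order no longer
# matters: the number travels with the page.
page_q.put((2, "page two html"))
page_q.put((0, "page zero html"))
page_q.put((1, "page one html"))

items = [page_q.get() for _ in range(3)]
for number, content in sorted(items):  # restore page order at the end
    print(number, content)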
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer via Python-list wrote:

> On 8/27/2017 11:54 AM, Peter Otten wrote:
>
>> The documentation
>>
>> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>>
>> says you can make the BeautifulSoup object from a string or file.
>> Can you give a few more details where the queue comes into play? A
>> small code sample would be ideal.
>
> A worker thread uses a request object to get the page and puts it into
> the queue as page.content (HTML). Another worker thread gets the
> page.content from the queue to apply BeautifulSoup and nothing happens.
>
> soup = BeautifulSoup(page_content, 'lxml')
> print(soup)
>
> No output whatsoever. If I remove 'lxml', I get the UserWarning that no
> parser was explicitly set and a reference to threading.py at line 80.
>
> I verified that the page.content that goes into and out of the queue is
> the same page.content that goes into and out of a list.
>
> I read somewhere that BeautifulSoup may not be thread-safe. I've never
> had a problem with threads storing the output into a queue. Using a
> queue (random order) instead of a list (sequential order) to feed pages
> for the input is making it wonky.

Here's a simple example that extracts titles from generated html. It
seems to work. Does it resemble what you do?

import csv
import threading
import time
from queue import Queue

import bs4

def process_html(source, dest, index):
    while True:
        html = source.get()
        if html is DONE:
            dest.put(DONE)
            break
        soup = bs4.BeautifulSoup(html, "lxml")
        dest.put(soup.find("title").text)

def write_csv(source, filename, to_go):
    with open(filename, "w") as f:
        writer = csv.writer(f)
        while True:
            title = source.get()
            if title is DONE:
                to_go -= 1
                if not to_go:
                    return
            else:
                writer.writerow([title])

NUM_SOUP_THREADS = 10
DONE = object()

web_to_soup = Queue()
soup_to_file = Queue()

soup_threads = [
    threading.Thread(
        target=process_html,
        args=(web_to_soup, soup_to_file, i),
    )
    for i in range(NUM_SOUP_THREADS)
]
write_thread = threading.Thread(
    target=write_csv,
    args=(soup_to_file, "tmp.csv", NUM_SOUP_THREADS),
)

write_thread.start()
for thread in soup_threads:
    thread.start()

for i in range(100):
    web_to_soup.put("<html><title>#{}</title></html>".format(i))
for i in range(NUM_SOUP_THREADS):
    web_to_soup.put(DONE)

for t in soup_threads:
    t.join()
write_thread.join()

--
https://mail.python.org/mailman/listinfo/python-list
Re: BeautifulSoup doesn't work with a threaded input queue?
On 2017-08-27 20:35, Christopher Reimer via Python-list wrote:

> On 8/27/2017 11:54 AM, Peter Otten wrote:
>
>> The documentation
>>
>> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>>
>> says you can make the BeautifulSoup object from a string or file.
>> Can you give a few more details where the queue comes into play? A
>> small code sample would be ideal.
>
> A worker thread uses a request object to get the page and puts it into
> the queue as page.content (HTML). Another worker thread gets the
> page.content from the queue to apply BeautifulSoup and nothing happens.
>
> soup = BeautifulSoup(page_content, 'lxml')
> print(soup)
>
> No output whatsoever. If I remove 'lxml', I get the UserWarning that no
> parser was explicitly set and a reference to threading.py at line 80.
>
> I verified that the page.content that goes into and out of the queue is
> the same page.content that goes into and out of a list.
>
> I read somewhere that BeautifulSoup may not be thread-safe. I've never
> had a problem with threads storing the output into a queue. Using a
> queue (random order) instead of a list (sequential order) to feed pages
> for the input is making it wonky.

What do you mean by "queue (random order)"? A queue is sequential order,
first-in-first-out.

--
https://mail.python.org/mailman/listinfo/python-list
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 11:54 AM, Peter Otten wrote:

> The documentation
>
> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>
> says you can make the BeautifulSoup object from a string or file.
> Can you give a few more details where the queue comes into play? A
> small code sample would be ideal.

A worker thread uses a request object to get the page and puts it into
the queue as page.content (HTML). Another worker thread gets the
page.content from the queue to apply BeautifulSoup and nothing happens.

soup = BeautifulSoup(page_content, 'lxml')
print(soup)

No output whatsoever. If I remove 'lxml', I get the UserWarning that no
parser was explicitly set and a reference to threading.py at line 80.

I verified that the page.content that goes into and out of the queue is
the same page.content that goes into and out of a list.

I read somewhere that BeautifulSoup may not be thread-safe. I've never
had a problem with threads storing the output into a queue. Using a
queue (random order) instead of a list (sequential order) to feed pages
for the input is making it wonky.

Chris R.
--
https://mail.python.org/mailman/listinfo/python-list
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer via Python-list wrote:

> Greetings,
>
> I have a Python 3.6 script on Windows to scrape comment history from a
> website. It's currently set up this way:
>
> Requestor (threads) -> list -> Parser (threads) -> queue -> CSVWriter
> (single thread)
>
> It takes 15 minutes to process ~11,000 comments.
>
> When I replaced the list with a queue between the Requestor and Parser
> to speed things up, BeautifulSoup stopped working.
>
> When I changed BeautifulSoup(contents, "lxml") to
> BeautifulSoup(contents), I get the UserWarning that no parser was
> explicitly set and a reference to line 80 in threading.py (which puts
> it in the RLock factory function).
>
> When I switched back to using a list between the Requestor and Parser,
> the Parser worked again.
>
> BeautifulSoup doesn't work with a threaded input queue?

The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.

Can you give a few more details where the queue comes into play? A small
code sample would be ideal...

--
https://mail.python.org/mailman/listinfo/python-list