Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
Ah, shoot me. I had a .join() statement on the output queue but not on 
the input queue. So the threads feeding the input queue got terminated 
before BeautifulSoup could get started. I went down that same rabbit 
hole with CSVWriter the other day. *sigh*
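
For the archives, a minimal sketch of the pattern that bit me, with a 
stand-in for the BeautifulSoup step (queue names are made up):

import threading
from queue import Queue

page_queue = Queue()   # requestor -> parser
row_queue = Queue()    # parser -> writer

def parse_worker():
    while True:
        page = page_queue.get()
        row_queue.put(page.upper())   # stand-in for the BeautifulSoup step
        page_queue.task_done()

for _ in range(4):
    threading.Thread(target=parse_worker, daemon=True).start()

for n in range(10):
    page_queue.put("page {}".format(n))

page_queue.join()   # the missing piece: wait on the *input* queue too
print(row_queue.qsize(), "rows parsed")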


Thanks for everyone's help.

Chris R.


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
Christopher Reimer via Python-list wrote:

> On 8/27/2017 1:31 PM, Peter Otten wrote:
> 
>> Here's a simple example that extracts titles from generated html. It
>> seems to work. Does it resemble what you do?
> Your example is similar to my code when I'm using a list for the input
> to the parser. You have soup_threads and write_threads, but no
> read_threads.
> 
> The particular website I'm scraping requires checking each page for the
> sentinel value (i.e., "Sorry, no more comments") in order to determine
> when to stop requesting pages. 

Where's that check happening? If it's in the soup thread you need some kind 
of back channel to the read threads to inform them that you need no more 
pages.
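
Sketching one possible back channel with a threading.Event (fetch_page 
is a stand-in for the real requests call, and a single reader stands in 
for your twenty):

import threading
from queue import Queue

stop_fetching = threading.Event()
pages = Queue()

def fetch_page(n):
    # stand-in for requests.get(...); page 7 plays the sentinel page
    if n >= 7:
        return "Sorry, no more comments"
    return "<html>page {}</html>".format(n)

def read_worker(numbers):
    for n in numbers:
        if stop_fetching.is_set():   # back channel: soup thread saw the end
            break
        pages.put(fetch_page(n))

def soup_worker():
    while True:
        html = pages.get()
        if "Sorry, no more comments" in html:
            stop_fetching.set()      # tell the readers to stop requesting
            break
        # BeautifulSoup(html, "lxml") would go here

reader = threading.Thread(target=read_worker, args=(range(20),))
souper = threading.Thread(target=soup_worker)
reader.start()
souper.start()
reader.join()
souper.join()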
 
> For my comment history that's ~750 pages
> to parse ~11,000 comments.
> 
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser. My soup_threads can get
> items from the queue, but BeautifulSoup doesn't do anything after that.
> 
> Chris R.




Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Paul Rubin
Christopher Reimer writes:
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser. 

Given how slow parsing is, you probably want to scrape the pages into
disk files, and then run the parser in parallel processes that read from
the disk.  You could also use something like Redis (redis.io) as a queue.
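
A sketch of the parallel-process half, assuming the pages were already 
saved as pages/page-*.html and bs4/lxml are installed:

import multiprocessing
from pathlib import Path

import bs4

def parse_file(path):
    # each worker process parses one saved page
    html = Path(path).read_text(encoding="utf-8")
    soup = bs4.BeautifulSoup(html, "lxml")
    title = soup.find("title")
    return title.text if title else ""

if __name__ == "__main__":          # required for multiprocessing on Windows
    files = sorted(str(p) for p in Path("pages").glob("page-*.html"))
    with multiprocessing.Pool() as pool:
        for title in pool.map(parse_file, files):
            print(title)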


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 1:50 PM, MRAB wrote:
What if you don't sort the list? I ask because it sounds like you're 
changing 2 variables (i.e. list->queue, sorted->unsorted) at the same 
time, so you can't be sure that it's the queue that's the problem.


If I'm using a list, I'm using a for loop to input items into the parser.

If I'm using a queue, I'm using worker threads to put or get items.

The item is still the same whether in a list or a queue.

Chris R.


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 1:31 PM, Peter Otten wrote:


Here's a simple example that extracts titles from generated html. It seems
to work. Does it resemble what you do?
Your example is similar to my code when I'm using a list for the input 
to the parser. You have soup_threads and write_threads, but no read_threads.


The particular website I'm scraping requires checking each page for the 
sentinel value (i.e., "Sorry, no more comments") in order to determine 
when to stop requesting pages. For my comment history that's ~750 pages 
to parse ~11,000 comments.
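
In outline, the stop condition looks like this (fetch is a stand-in for 
the real request):

def fetch(n):
    # stand-in: the real code does requests.get(...) for page n
    return "Sorry, no more comments" if n >= 750 else "<html>comments</html>"

page = 0
pages = []
while True:
    html = fetch(page)
    if "Sorry, no more comments" in html:
        break
    pages.append(html)
    page += 1

print(len(pages), "pages fetched")   # ~750 in practice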


I have 20 read_threads requesting and putting pages into the output 
queue that is the input_queue for the parser. My soup_threads can get 
items from the queue, but BeautifulSoup doesn't do anything after that.


Chris R.


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread MRAB

On 2017-08-27 21:35, Christopher Reimer via Python-list wrote:

On 8/27/2017 1:12 PM, MRAB wrote:

What do you mean by "queue (random order)"? A queue is sequential 
order, first-in-first-out. 


With 20 threads requesting 20 different pages, they're not going into
the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19), and
they arrive at different times for the parser worker threads to get for
processing.

Similar situation with a list, but I sort the list before giving it to
the parser, so all the items are in sequential order and fed to the
parser one at a time.

What if you don't sort the list? I ask because it sounds like you're 
changing 2 variables (i.e. list->queue, sorted->unsorted) at the same 
time, so you can't be sure that it's the queue that's the problem.



Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 1:12 PM, MRAB wrote:

What do you mean by "queue (random order)"? A queue is sequential 
order, first-in-first-out. 


With 20 threads requesting 20 different pages, they're not going into 
the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19), and 
they arrive at different times for the parser worker threads to get for 
processing.


Similar situation with a list, but I sort the list before giving it to 
the parser, so all the items are in sequential order and fed to the 
parser one at a time.


Chris R.



Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
Christopher Reimer via Python-list wrote:

> On 8/27/2017 11:54 AM, Peter Otten wrote:
> 
>> The documentation
>>
>> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>>
>> says you can make the BeautifulSoup object from a string or file.
>> Can you give a few more details where the queue comes into play? A small
>> code sample would be ideal.
> 
> A worker thread uses a request object to get the page and puts it into
> the queue as page.content (HTML). Another worker thread gets the
> page.content from the queue to apply BeautifulSoup, and nothing happens.
> 
> soup = BeautifulSoup(page_content, 'lxml')
> print(soup)
> 
> No output whatsoever. If I remove 'lxml', I get the UserWarning that no
> parser was explicitly set, and a reference to threading.py at line 80.
> 
> I verified that page.content that goes into and out of the queue is the
> same page.content that goes into and out of a list.
> 
> I read somewhere that BeautifulSoup may not be thread-safe. I've never
> had a problem with threads storing the output into a queue. Using a
> queue (random order) instead of a list (sequential order) to feed pages
> for the input is making it wonky.

Here's a simple example that extracts titles from generated html. It seems 
to work. Does it resemble what you do?

import csv
import threading
from queue import Queue

import bs4


def process_html(source, dest, index):
    # parse pages from the source queue until the DONE sentinel arrives
    while True:
        html = source.get()
        if html is DONE:
            dest.put(DONE)
            break
        soup = bs4.BeautifulSoup(html, "lxml")
        dest.put(soup.find("title").text)


def write_csv(source, filename, to_go):
    # single writer; exits once every soup thread has handed over DONE
    with open(filename, "w", newline="") as f:  # newline="" avoids blank rows on Windows
        writer = csv.writer(f)
        while True:
            title = source.get()
            if title is DONE:
                to_go -= 1
                if not to_go:
                    return
            else:
                writer.writerow([title])


NUM_SOUP_THREADS = 10
DONE = object()

web_to_soup = Queue()
soup_to_file = Queue()

soup_threads = [
    threading.Thread(target=process_html, args=(web_to_soup, soup_to_file, i))
    for i in range(NUM_SOUP_THREADS)
]

write_thread = threading.Thread(
    target=write_csv, args=(soup_to_file, "tmp.csv", NUM_SOUP_THREADS),
)

write_thread.start()

for thread in soup_threads:
    thread.start()

for i in range(100):
    web_to_soup.put("<title>#{}</title>".format(i))
for i in range(NUM_SOUP_THREADS):
    web_to_soup.put(DONE)

for t in soup_threads:
    t.join()
write_thread.join()
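
Each soup thread forwards the DONE sentinel when it shuts down, and the 
writer counts sentinels, so it only exits after all NUM_SOUP_THREADS 
producers have finished.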




Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread MRAB

On 2017-08-27 20:35, Christopher Reimer via Python-list wrote:

On 8/27/2017 11:54 AM, Peter Otten wrote:


The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal.


A worker thread uses a request object to get the page and puts it into
the queue as page.content (HTML). Another worker thread gets the
page.content from the queue to apply BeautifulSoup, and nothing happens.

soup = BeautifulSoup(page_content, 'lxml')
print(soup)

No output whatsoever. If I remove 'lxml', I get the UserWarning that no
parser was explicitly set, and a reference to threading.py at line 80.

I verified that page.content that goes into and out of the queue is the
same page.content that goes into and out of a list.

I read somewhere that BeautifulSoup may not be thread-safe. I've never
had a problem with threads storing the output into a queue. Using a
queue (random order) instead of a list (sequential order) to feed pages
for the input is making it wonky.

What do you mean by "queue (random order)"? A queue is sequential order, 
first-in-first-out.



Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 11:54 AM, Peter Otten wrote:


The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal.


A worker thread uses a request object to get the page and puts it into 
the queue as page.content (HTML). Another worker thread gets the 
page.content from the queue to apply BeautifulSoup, and nothing happens.


soup = BeautifulSoup(page_content, 'lxml')
print(soup)

No output whatsoever. If I remove 'lxml', I get the UserWarning that no 
parser was explicitly set, and a reference to threading.py at line 80.
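
A quick check of what the soup worker actually receives, sketched with 
the stdlib parser (which would rule out lxml as the culprit):

from queue import Queue
from bs4 import BeautifulSoup

def soup_worker(in_queue):
    item = in_queue.get()
    print(type(item), repr(item)[:80])          # confirm the page survived the queue
    soup = BeautifulSoup(item, "html.parser")   # stdlib parser, no lxml involved
    print(soup.title)

q = Queue()
q.put("<html><head><title>check</title></head></html>")
soup_worker(q)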


I verified that page.content that goes into and out of the queue is the 
same page.content that goes into and out of a list.


I read somewhere that BeautifulSoup may not be thread-safe. I've never 
had a problem with threads storing the output into a queue. Using a 
queue (random order) instead of a list (sequential order) to feed pages 
for the input is making it wonky.


Chris R.


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
Christopher Reimer via Python-list wrote:

> Greetings,
> 
> I have Python 3.6 script on Windows to scrape comment history from a
> website. It's currently set up this way:
> 
> Requestor (threads) -> list -> Parser (threads) -> queue -> CSVWriter
> (single thread)
> 
> It takes 15 minutes to process ~11,000 comments.
> 
> When I replaced the list with a queue between the Requestor and Parser
> to speed up things, BeautifulSoup stopped working.
> 
> When I changed BeautifulSoup(contents, "lxml") to
> BeautifulSoup(contents), I get the UserWarning that no parser was
> explicitly set and a reference to line 80 in threading.py (which puts it
> in the RLock factory function).
> 
> When I switched back to using list between the Requestor and Parser, the
> Parser worked again.
> 
> BeautifulSoup doesn't work with a threaded input queue?

The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small 
code sample would be ideal...
