Re: [SQL] Pick random rows from SELECT?

2009-09-23 Thread Gilles Ganault
On Mon, 21 Sep 2009 21:40:02 -0700, Dennis Lee Bieber
wlfr...@ix.netcom.com wrote:
   I'd suggest either a pool of threads -- 5-10, each reading company
names from a shared QUEUE, which is populated by the main thread
(remember to commit() so that you don't block on database updates by the
threads). OR... determine how many companies there are, and start
threads feeding them start and length (length being #names /
#threads, round up -- start then being 0*length+1, 1*length+1, etc...)
and use those in thread specific selects using ... limit length
offset start... This way each thread retrieves its own limited set of
companies (make sure to use the same sorting criteria).
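
The second option quoted above (one LIMIT/OFFSET slice per thread) could be sketched roughly as below. This is only an illustration under assumptions: it uses the standard-library sqlite3 module rather than APSW, an in-memory table filled with made-up rows, and a hypothetical helper named slices. SQLite's OFFSET is the number of rows to skip, so the sketch uses offsets 0*length, 1*length, ... without the +1.

```python
import math
import sqlite3


def slices(total, num_threads):
    """Return (offset, length) pairs that together cover `total` rows."""
    length = math.ceil(total / num_threads)  # #names / #threads, rounded up
    return [(i * length, length) for i in range(num_threads) if i * length < total]


# In-memory stand-in for the local companies table (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, label TEXT, activity INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (label, activity) VALUES (?, 1)",
    [("c%d" % i,) for i in range(23)],
)

total = conn.execute("SELECT COUNT(*) FROM companies WHERE activity=1").fetchone()[0]

# Every slice uses the same ORDER BY, so the slices cannot overlap.
sql = "SELECT id, label FROM companies WHERE activity=1 ORDER BY id LIMIT ? OFFSET ?"
chunks = [
    conn.execute(sql, (length, offset)).fetchall()
    for offset, length in slices(total, 5)
]
```

Each entry of chunks is the work list one thread would receive; here they are fetched in a single thread only to show that the slices cover every row once.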

Thanks for the help :)
-- 
http://mail.python.org/mailman/listinfo/python-list


[SQL] Pick random rows from SELECT?

2009-09-22 Thread Threader Slash
Here is a simple and quick solution --

Generate a random number
random.shuffle(x[, random]): Shuffle the sequence x in place. The optional
argument random is a 0-argument function returning a random float in [0.0,
1.0); by default, this is the function random().
http://docs.python.org/library/random.html

Multiply the returned random value by the maximum value of your table's
primary key.
http://www.tizag.com/mysqlTutorial/mysqlmax.php

Then use the result in your query as

randID = MAX(id) * random()

SELECT objectname FROM products WHERE objectID = randID
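
A rough sketch of that idea, with some assumptions spelled out: it uses the standard-library sqlite3 module, an in-memory products table with made-up rows, and a hypothetical helper named random_product. It draws an integer with random.randint instead of multiplying by a float, and it selects the first row whose objectID is at or above the drawn value, so that gaps in the key do not produce an empty result. With gaps the pick is not uniform: rows that follow a gap are chosen more often.

```python
import random
import sqlite3

# In-memory stand-in for the products table (made-up data, ids with gaps).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (objectID INTEGER PRIMARY KEY, objectname TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "a"), (2, "b"), (5, "c"), (9, "d")],
)


def random_product(conn):
    """Pick one row by drawing a random id between 1 and MAX(objectID)."""
    max_id = conn.execute("SELECT MAX(objectID) FROM products").fetchone()[0]
    rand_id = random.randint(1, max_id)  # assumes ids are positive integers
    # ">= ... LIMIT 1" tolerates missing ids: take the next existing row.
    return conn.execute(
        "SELECT objectID, objectname FROM products "
        "WHERE objectID >= ? ORDER BY objectID LIMIT 1",
        (rand_id,),
    ).fetchone()


row = random_product(conn)
```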

Hope this helps.
Cheers... Threader


-- Forwarded message --
From: Dennis Lee Bieber wlfr...@ix.netcom.com
To: python-list@python.org
Date: Mon, 21 Sep 2009 21:40:02 -0700
Subject: Re: [SQL] Pick random rows from SELECT?
On Mon, 21 Sep 2009 10:59:38 +0200, Gilles Ganault nos...@nospam.com
declaimed the following in gmane.comp.python.general:

 Since this list is quite big and the site is the bottleneck, I'd like
 to run multiple instances of this script, and figured a solution would
 be to pick rows at random from the dataset, check in my local database
 if this item has already been taken care of, and if not, download
 details from the remote web site.

   You really think making MULTIPLE, overlapping requests to a web site
is going to be more efficient than just suffering the single transfer
time of one large query?

 If someone's done this before, should I perform the randomization in
 the SQL query (SQLite using the APSW wrapper
 http://code.google.com/p/apsw/), or in Python?

   Pardon, I thought you implied the bottleneck is the web-site
database -- I'd worry about any web-site that exposes a file-server
based database to direct user access.

 Here's some simplified code:

 sql = 'SELECT id,label FROM companies WHERE activity=1'
 rows = list(cursor.execute(sql))
 for row in rows:
     id = row[0]
     label = row[1]

     print strftime("%H:%M")
     url = "http://www.acme.com/details.php?id=%s" % id
     req = urllib2.Request(url, None, headers)
     response = urllib2.urlopen(req).read()

     name = re_name.search(response)
     if name:
         name = name.group(1)
         sql = 'UPDATE companies SET name=? WHERE id=?'
         cursor.execute(sql, (name, id))

   Ah... You mean you are retrieving the names from a local database,
and then requesting web-site details based upon that name.

   No matter how you look at it, you appear to want to process the
entire local list of companies... Multiple randomized local queries will
just add to the final run-time as you start to get duplicates -- and
have to reject that one to query for another random name.

   I'd suggest either a pool of threads -- 5-10, each reading company
names from a shared QUEUE, which is populated by the main thread
(remember to commit() so that you don't block on database updates by the
threads). OR... determine how many companies there are, and start
threads feeding them start and length (length being #names /
#threads, round up -- start then being 0*length+1, 1*length+1, etc...)
and use those in thread specific selects using ... limit length
offset start... This way each thread retrieves its own limited set of
companies (make sure to use the same sorting criteria).
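
The first option (a small pool of threads reading from a shared queue) might look roughly like this in current Python 3. It is a sketch under assumptions, not the original script: fetch_name is a hypothetical stand-in for the HTTP request and regex, the database is an in-memory sqlite3 table rather than APSW, and the workers hand their results back through a second queue so that only the main thread touches the connection and commits.

```python
import queue
import sqlite3
import threading

NUM_WORKERS = 5  # the post suggests 5-10 threads


def fetch_name(company_id):
    """Hypothetical stand-in for the web request and regex in the script."""
    return "Company %d" % company_id


def worker(tasks, results):
    # Pull ids from the shared queue until the None sentinel arrives.
    while True:
        company_id = tasks.get()
        if company_id is None:
            break
        results.put((fetch_name(company_id), company_id))


# In-memory stand-in for the local database (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, activity INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (id, activity) VALUES (?, 1)",
    [(i,) for i in range(1, 21)],
)
conn.commit()

tasks = queue.Queue()
results = queue.Queue()
threads = [
    threading.Thread(target=worker, args=(tasks, results))
    for _ in range(NUM_WORKERS)
]
for t in threads:
    t.start()

# The main thread populates the queue, then adds one sentinel per worker.
ids = [row[0] for row in conn.execute("SELECT id FROM companies WHERE activity=1")]
for company_id in ids:
    tasks.put(company_id)
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()

# All writes happen in the thread that owns the connection, then one commit.
while not results.empty():
    conn.execute("UPDATE companies SET name=? WHERE id=?", results.get())
conn.commit()
```

Keeping the writes in one thread is a design choice of this sketch; it avoids sharing a SQLite connection between threads.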
--
   Wulfraed Dennis Lee Bieber   KD6MOG
   wlfr...@ix.netcom.com
http://wlfraed.home.netcom.com/




-- Forwarded message --
From: greg g...@cosc.canterbury.ac.nz
To: python-list@python.org
Date: Tue, 22 Sep 2009 17:07:33 +1200
Subject: Re: Comparison of parsers in python?
Nobody wrote:

 What I want: a tokeniser generator which can take a lex-style grammar (not
 necessarily lex syntax, but a set of token specifications defined by
 REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
 bytes. It must allow the syntax to be defined at run-time.


You might find my Plex package useful:

http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

It was written some time ago, so it doesn't know about
the new bytes type yet, but it shouldn't be hard to
adapt it for that if you need to.

 What I don't want: anything written by someone who doesn't understand the
 field (i.e. anything which doesn't use a DFA).


Plex uses a DFA.

-- 
Greg


[SQL] Pick random rows from SELECT?

2009-09-21 Thread Gilles Ganault
Hello

I have a working Python script that SELECTs rows from a database to
fetch a company's name from a web-based database.

Since this list is quite big and the site is the bottleneck, I'd like
to run multiple instances of this script, and figured a solution would
be to pick rows at random from the dataset, check in my local database
if this item has already been taken care of, and if not, download
details from the remote web site.

If someone's done this before, should I perform the randomization in
the SQL query (SQLite using the APSW wrapper
http://code.google.com/p/apsw/), or in Python?

Thank you.

Here's some simplified code:

sql = 'SELECT id,label FROM companies WHERE activity=1'
rows = list(cursor.execute(sql))
for row in rows:
    id = row[0]
    label = row[1]

    print strftime("%H:%M")
    url = "http://www.acme.com/details.php?id=%s" % id
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req).read()

    name = re_name.search(response)
    if name:
        name = name.group(1)
        sql = 'UPDATE companies SET name=? WHERE id=?'
        cursor.execute(sql, (name, id))
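
For reference, the two places the randomization could happen (in the SQL query or in Python) can be sketched as below. This is an illustration under assumptions: it is written for current Python 3, uses the standard-library sqlite3 module instead of APSW, and runs against an in-memory table with made-up rows.

```python
import random
import sqlite3

# In-memory stand-in for the companies table (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, label TEXT, activity INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (label, activity) VALUES (?, 1)",
    [("c%d" % i,) for i in range(10)],
)

# Option 1: let SQLite return the rows in random order.
rows_sql = conn.execute(
    "SELECT id, label FROM companies WHERE activity=1 ORDER BY RANDOM()"
).fetchall()

# Option 2: fetch the rows in any order and shuffle them in Python.
rows_py = conn.execute(
    "SELECT id, label FROM companies WHERE activity=1"
).fetchall()
random.shuffle(rows_py)
```

Both options yield the same set of rows in a random order; neither one prevents two instances of the script from picking the same row.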


Re: [SQL] Pick random rows from SELECT?

2009-09-21 Thread Peter Otten
Gilles Ganault wrote:

 I have a working Python script that SELECTs rows from a database to
 fetch a company's name from a web-based database.
 
 Since this list is quite big and the site is the bottleneck, I'd like
 to run multiple instances of this script, and figured a solution would
 be to pick rows at random from the dataset, check in my local database
 if this item has already been taken care of, and if not, download
 details from the remote web site.
 
 If someone's done this before, should I perform the randomization in
 the SQL query (SQLite using the APSW wrapper
 http://code.google.com/p/apsw/), or in Python?
 
 Thank you.
 
 Here's some simplified code:
 
 sql = 'SELECT id,label FROM companies WHERE activity=1'
 rows = list(cursor.execute(sql))
 for row in rows:
     id = row[0]
     label = row[1]
 
     print strftime("%H:%M")
     url = "http://www.acme.com/details.php?id=%s" % id
     req = urllib2.Request(url, None, headers)
     response = urllib2.urlopen(req).read()
 
     name = re_name.search(response)
     if name:
         name = name.group(1)
         sql = 'UPDATE companies SET name=? WHERE id=?'
         cursor.execute(sql, (name, id))
 
I don't think you need to randomize the requests. Instead you could control 
a pool of worker processes using

http://docs.python.org/library/multiprocessing.html
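
A minimal sketch of such a pool, written for current Python 3 and resting on assumptions: fetch_page is a hypothetical stand-in that returns fake HTML instead of calling the web site, RE_NAME is a made-up pattern, and the ids are hard-coded rather than read from the database.

```python
import multiprocessing
import re

RE_NAME = re.compile(r"<h1>(.*?)</h1>")  # hypothetical pattern


def fetch_page(company_id):
    """Hypothetical stand-in for urlopen(...).read(); returns fake HTML."""
    return "<h1>Company %d</h1>" % company_id


def fetch_name(company_id):
    """Runs in a worker process: download one page and extract the name."""
    match = RE_NAME.search(fetch_page(company_id))
    return company_id, (match.group(1) if match else None)


if __name__ == "__main__":
    ids = [1, 2, 3, 4, 5]  # would come from the SELECT in the real script
    with multiprocessing.Pool(processes=4) as pool:
        # Results arrive as workers finish; the parent process would run
        # the UPDATE statements here, so only one process writes.
        for company_id, name in pool.imap_unordered(fetch_name, ids):
            print(company_id, name)
```

The pool hands each id to exactly one worker, so no randomization or duplicate check is needed.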

Peter
