Re: Is there a faster way to do this?

2008-08-06 Thread Boris Borcic

Is your product ID always the 3rd and last item on the line?
If not, your output won't separate the IDs.

And how does

output = open(output_file,'w')
for x in set(line.split(',')[2] for line in open(input_file)) :
    output.write(x)
output.close()

behave ?
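
If the ID is not always the last field, a variant of the same idea strips
each ID and writes an explicit separator, so the output stays one ID per
line -- a sketch:

output = open(output_file,'w')
for x in set(line.split(',')[2].strip() for line in open(input_file)) :
    output.write(x + '\n')    # explicit newline, so field position no longer matters
output.close()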


[EMAIL PROTECTED] wrote:

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of multiples. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before. So each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1

output.close()
input.close()
print input_count
print output_count
--
http://mail.python.org/mailman/listinfo/python-list



--
http://mail.python.org/mailman/listinfo/python-list


Is there a faster way to do this?

2008-08-05 Thread [EMAIL PROTECTED]
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of multiples. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before. So each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1

output.close()
input.close()
print input_count
print output_count
--
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a faster way to do this?

2008-08-05 Thread Gary Herron

[EMAIL PROTECTED] wrote:

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of multiples. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before. So each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
  


Store your IDs in a dictionary or a set.   Then test for existence 
of a new ID in that set.  That test will be *much* more efficient than 
searching a list.  (It uses a hashing scheme.)



IDs = set()
for row in ...
  ID = extractIdFromRow(row)
  if ID not in IDs:
     IDs.add(ID)
     ... whatever ...


In fact, if *all* you are doing is identifying the product IDs that 
occur in the file (no matter how many times they occur):


IDs = set()
for row in ...
  ID = extractIdFromRow(row)
  IDs.add(ID)

and your set will contain *one* copy of each ID added, no matter how 
many were added.



Better yet, if you can write your ID extraction as a generator or list 
comprehension...


IDs = set(extractIdFromRow(row) for row in rowsOfTable)

or some such would be most efficient.
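
Putting that together for the original task -- a minimal sketch,
assuming (as the OP's code does) that the product ID is the third
comma-separated field and that the first line seen for each ID is the
one to keep:

seen = set()
output = open("c:\\output.txt", "w")
for line in open("c:\\input.txt"):
    product_id = line.split(",")[2]
    if product_id not in seen:       # hash-based membership test, O(1) on average
        seen.add(product_id)
        output.write(line)           # keep only the first line for each new ID
output.close()

The set does the same job as the linear scan, but in constant time per
line instead of time proportional to the number of IDs seen so far.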


Gary Herron




Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1

output.close()
input.close()
print input_count
print output_count
--
http://mail.python.org/mailman/listinfo/python-list
  


--
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a faster way to do this?

2008-08-05 Thread Avinash Vora

On Aug 5, 2008, at 10:00 PM, [EMAIL PROTECTED] wrote:


I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of multiples. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before. So each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.


Why not split the file into more manageable chunks, especially as it 
seems to be just plaintext?



Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1


This seems redundant.


    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1


File writing is extremely expensive.  In fact, so is reading.  Think  
about reading the file in whole chunks.  Put those chunks into Python  
data structures, and make your output information in Python data  
structures.  If you use a dictionary and search the IDs there, you'll  
notice some speed improvements as Python does a dictionary lookup far  
quicker than searching a list.  Then, output your data all at once at  
the end.
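
A sketch of that shape -- a set for the lookups, a list to buffer the
surviving lines, and one bulk write at the end (assuming the unique
lines fit comfortably in memory):

seen = set()
kept = []                            # buffer unique lines instead of writing one at a time
for line in open("c:\\input.txt"):
    pid = line.split(",")[2]
    if pid not in seen:
        seen.add(pid)
        kept.append(line)
out = open("c:\\output.txt", "w")
out.writelines(kept)                 # single bulk write at the end
out.close()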


--
Avi

--
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a faster way to do this?

2008-08-05 Thread Gary Herron

Avinash Vora wrote:

On Aug 5, 2008, at 10:00 PM, [EMAIL PROTECTED] wrote:


I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of multiples. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before. So each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.


Why not split the file into more manageable chunks, especially as it 
seems to be just plaintext?



Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1


This seems redundant.


    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1


File writing is extremely expensive.  In fact, so is reading.  Think 
about reading the file in whole chunks.  Put those chunks into Python 
data structures, and make your output information in Python data 
structures.


Don't bother yourself with this suggestion about reading in chunks -- 
Python already does this for you, and does so more efficiently than you 
could.  The code

  for line in open(input_file, "r"):

reads in large chunks (efficiently) and then serves up the contents 
line-by-line.


Gary Herron


If you use a dictionary and search the IDs there, you'll notice some 
speed improvements as Python does a dictionary lookup far quicker than 
searching a list.  Then, output your data all at once at the end.


--
Avi

--
http://mail.python.org/mailman/listinfo/python-list


--
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a faster way to do this?

2008-08-05 Thread RPM1

[EMAIL PROTECTED] wrote:

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of multiples. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before. So each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.



I think you need to learn about Python's dictionary data type.
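
For instance, a dictionary keyed on the product ID (the value is
irrelevant) turns each membership test into a hash lookup -- a minimal
sketch, assuming the ID is the third comma-separated field:

seen = {}
for line in open("c:\\input.txt"):
    pid = line.split(",")[2]
    if pid not in seen:              # hash lookup, not a linear scan
        seen[pid] = True             # any value will do; only the key matters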
--
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a faster way to do this?

2008-08-05 Thread Roy H. Han
Why not just use sets?

a = set()
a.add(1)
a.add(2)

On Tue, Aug 5, 2008 at 10:14 PM, RPM1 [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:

 I have a csv file containing product information that is 700+ MB in
 size. I'm trying to go through and pull out unique product IDs only,
 as there are a lot of multiples. My problem is that I am appending the
 ProductID to an array and then searching through that array each time
 to see if I've seen the product ID before. So each search takes longer
 and longer. I let the script run for 2 hours before killing it and had
 only run through less than 1/10 of the file.


 I think you need to learn about Python's dictionary data type.
 --
 http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a faster way to do this?

2008-08-05 Thread Tomasz Rola
On Tue, 5 Aug 2008, [EMAIL PROTECTED] wrote:

 I have a csv file containing product information that is 700+ MB in
 size. I'm trying to go through and pull out unique product IDs only,
 as there are a lot of multiples. My problem is that I am appending the
 ProductID to an array and then searching through that array each time
 to see if I've seen the product ID before. So each search takes longer
 and longer. I let the script run for 2 hours before killing it and had
 only run through less than 1/10 of the file.

My take:

I assume you have 80 bytes per line, which makes 10 million lines for a 700M 
file. To be quite sure, let's round it up to 20 million. Next, I don't want to 
trash my disk with 700M+ files, so I assume reading a line, breaking it 
up and getting the product id takes roughly the same time as generating a 
random id with my code. So, I:

1. read all records line by line (or just make random ids), append the 
product id to the list (actually, I preallocate the list with empty space 
and fill it up)
2. sort() the list
3. iterate the list, count the unique ids, optionally write to file

The code (most of it is just making random names, which was real fun):

import  string

RECORD_NUM = 20*1024*1024 # 700M/80-bytes-per-line = ca. 10M+ records
ALPHA = string.digits + string.ascii_letters
RAND = None

def  random_iter ( ) :
  x = 12345678910
  y = 234567891011
  M = 2**61 - 1
  M0 = 2**31 - 1
  pot10 = [ 1, 10, 100, 1000, 10000, 100000 ]
  while 1 :
    p = x * y
    l = pot10[5 - (p % 10)]
    n = (p / l) % M
    d = l * (n % 10)
    p = p % M0
    x1 = y - d + p
    y1 = x + d + p
    x, y = x1, y1
    yield n
    pass
  pass

def  next_random ( ) :
  global RAND
  if RAND == None :
    RAND = random_iter()
  return RAND.next()

def  num_to_name ( n ) :
  s = []
  len_a = len(ALPHA)
  while n > 0 :
    n1, r = divmod(n, len_a)
    s.append(ALPHA[r])
    n = n1
    pass
  return "".join(s)

def  get_product_id ( ) :
  return  num_to_name(next_random())

def  dry_run ( n ) :
  r = [ 0 ] * n
  while n > 0 :
    n -= 1
    r[n] = get_product_id()
  return r

###

if __name__ == "__main__":
  print "hello"
  for i in range(10) : print get_product_id()
  print
  print "RECORD_NUM: %d" % RECORD_NUM
  products = dry_run(RECORD_NUM)
  print "RECORDS PRODUCED: %d" % len(products)
  products.sort()
  i = 0
  lastp = ""
  count = 0
  while i < len(products) :
    if lastp != products[i] :
      lastp = products[i]
      count += 1
    i += 1
    pass
  print "UNIQUE RECORDS: %d" % count

I may have made some bugs, but it works on my computer. Or seems to ;-/.

For 20 mln products, on my Athlon XP @1800 / 1.5G ram, Debian Linux box, 
it takes about 13 minutes to go through list generation, about 3 minutes 
to sort the list, and a few more seconds to skim it (writing should not be 
much longer). All summed up, about 18 minutes of real time, with some 
other programs computing a little etc in the background - so, much less 
than 2 hours.
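
The skim step could equally be written with itertools.groupby, which
collapses runs of equal items in a sorted sequence -- a sketch, given
the sorted products list from above:

import itertools

count = sum(1 for key, group in itertools.groupby(products))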

Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature.  **
** As the answer, master did rm -rif on the programmer's home**
** directory. And then the C programmer became enlightened...  **
** **
** Tomasz Rola  mailto:[EMAIL PROTECTED] **
--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-03 Thread Timo Virkkala
Harlin Seritt wrote:
Roy, I like what you showed: nums = [a for a in range(100)] . My
mistake for not expressing my question as well as I should have. Not
only am I looking for a way to fill in 100 spots (more or less) in an
array er... list, but I'd like to be able to do it in intervals of
2, 4, 8 etc. as well as other things.
range(2, 100, 4)
How about you fire up the interactive python and try
help(range)
;)
--
Timo Virkkala
--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-02 Thread Will McGugan
Warren Postma wrote:
Will McGugan wrote:
Isn't that equivalent to simply..
nums= range(100)

I remember the day I first realized that 900 lines of some C++ program I 
was working on could be expressed in three lines of python.  Ahh.
Lately I've found myself commenting C++ code with the equivalent Python 
code. I find it clearer and more terse than simply commenting in English!

Will
--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-02 Thread Robert Kern
Will McGugan wrote:
Warren Postma wrote:
Will McGugan wrote:
Isn't that equivalent to simply..
nums= range(100)

I remember the day I first realized that 900 lines of some C++ program 
I was working on could be expressed in three lines of python.  Ahh.

Lately I've found myself commenting C++ code with the equivalent Python 
code. I find it clearer and more terse than simply commenting in English!
If you used literate programming tools, you might be able to get a 
Python version and a C++ version of your code in one go!

--
Robert Kern
[EMAIL PROTECTED]
In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die.
  -- Richard Harter
--
http://mail.python.org/mailman/listinfo/python-list


Faster way to do this...

2005-03-01 Thread Harlin Seritt
I've got the following code:

nums = range(0)
for a in range(100):
   nums.append(a)

Is there a better way to have num initialized to a list of 100
consecutive int values?

Thanks,

Harlin

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-01 Thread Will McGugan
Harlin Seritt wrote:
I've got the following code:
nums = range(0)
for a in range(100):
   nums.append(a)
Is there a better way to have num initialized to a list of 100
consecutive int values?
Isn't that equivalent to simply..
nums= range(100)
Will McGugan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-01 Thread Steve Holden
Harlin Seritt wrote:
I've got the following code:
nums = range(0)
for a in range(100):
   nums.append(a)
Is there a better way to have num initialized to a list of 100
consecutive int values?
Why not the simplest solution?
  a = range(100)
regards
 Steve
--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-01 Thread Aaron Bingham
Harlin Seritt wrote:
I've got the following code:
nums = range(0)
for a in range(100):
   nums.append(a)
Is there a better way to have num initialized to a list of 100
consecutive int values?
You mean like this?
nums = range(100)
;-)
--

Aaron Bingham
Software Engineer
Cenix BioScience GmbH

--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-01 Thread Roy Smith
Harlin Seritt [EMAIL PROTECTED] wrote:
I've got the following code:

nums = range(0)
for a in range(100):
   nums.append(a)

Is there a better way to have num initialized to a list of 100
consecutive int values?

Step one would be to change the first line to

nums = []

which is simpler and results in the same thing.  Or, you could write
the whole thing as a one-liner using a list comprehension

nums = [a for a in range(100)]

and then you can take it one step further and just write

nums = range(100)

which I think is about as simple as you can get.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-01 Thread Warren Postma
Will McGugan wrote:
Isn't that equivalent to simply..
nums= range(100)
I remember the day I first realized that 900 lines of some C++ program I 
was working on could be expressed in three lines of python.  Ahh.
Rebirth.  Then there was the phase of the python-newbie so enamored of 
map and lambda.  ... Wait, actually, I'm not out of that yet. :-)

Warren
--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-01 Thread Harlin Seritt
Excellent point Warren. I have been working with Python for about 3
years in all, but only really seriously for about a year. I am still
utterly amazed that nearly everything that takes me about 5 to 20 lines
of code can be done in 1, 2 or 3 lines of Python code (when done
correctly). It is very frustrating that I am still writing Python as
though it were Java or C++ code. One day I'll get the
hang of this.

Roy, I like what you showed: nums = [a for a in range(100)] . My
mistake for not expressing my question as well as I should have. Not
only am I looking for a way to fill in 100 spots (more or less) in an
array er... list, but I'd like to be able to do it in intervals of
2, 4, 8 etc. as well as other things.

Thanks,

Harlin

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Faster way to do this...

2005-03-01 Thread Robert Kern
Harlin Seritt wrote:
Roy, I like what you showed: nums = [a for a in range(100)] . My
mistake for not expressing my question as well as I should have. Not
only am I looking for a way to fill in 100 spots (more or less) in an
array er... list, but I'd like to be able to do it in intervals of
2, 4, 8 etc. as well as other things.
Like
  nums = range(0, 100, 4)
?
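
And for the doubling intervals (2, 4, 8, ...), which a fixed step can't 
produce, a list comprehension does it -- a sketch:

nums = [2 ** i for i in range(1, 8)]   # 2, 4, 8, ..., 128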
--
Robert Kern
[EMAIL PROTECTED]
In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die.
  -- Richard Harter
--
http://mail.python.org/mailman/listinfo/python-list