Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Hello all,

I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
to sort based on the first two characters.

I'd greatly appreciate it if someone could post sample code that can
help me do this.

Also, any ideas on approximately how long the sort is going to take
(XP, Dual Core 2.0GHz w/2GB RAM)?
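
One possible approach when the file is about as large as the available
RAM is an external merge sort: sort fixed-size chunks in memory, write
each to a temporary file, then merge the chunks. The sketch below is
only an illustration under assumptions that are not in the post: the
function name sort_by_prefix and the lines_per_chunk parameter are
made up, the file is read as bytes, and the first two bytes of each
line are taken as the sort key (which matches the first two characters
only if they are single-byte, e.g. ASCII in UTF-8). It makes no claim
about how long the sort would take on the machine described.

```python
import heapq
import os
import tempfile
from itertools import islice


def sort_key(line):
    # Assumption: the first two characters are single-byte, so the
    # first two bytes of the line are a valid sort key.
    return line[:2]


def sort_by_prefix(src_path, dst_path, lines_per_chunk=500_000):
    """Sort the lines of src_path by their first two bytes into dst_path."""
    chunk_paths = []
    try:
        with open(src_path, "rb") as src:
            while True:
                # Read at most lines_per_chunk lines into memory.
                chunk = list(islice(src, lines_per_chunk))
                if not chunk:
                    break
                # Make sure the final line of the file ends with a newline,
                # otherwise it would be glued to the next line on merge.
                if not chunk[-1].endswith(b"\n"):
                    chunk[-1] += b"\n"
                chunk.sort(key=sort_key)
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "wb") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Merge the sorted chunk files lazily, one line at a time.
        files = [open(p, "rb") for p in chunk_paths]
        try:
            with open(dst_path, "wb") as dst:
                dst.writelines(heapq.merge(*files, key=sort_key))
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary chunk files.
        for p in chunk_paths:
            os.remove(p)
```

A call such as sort_by_prefix("big.txt", "big_sorted.txt") (both file
names hypothetical) keeps only one chunk in memory at a time, so
lines_per_chunk can be tuned to the RAM that is actually free.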

Cheers,

Ira

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Thanks to all who replied. It's much appreciated.

Yes, I double-checked the line count, and the number of lines is ~16
million (instead of the stated 1.6B).

Also:

What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, 
utf16be, ??? If you don't know, do this:
The file is UTF-8

 Do the first two characters always belong to the ASCII subset?
Yes, the first two always belong to the ASCII subset.

 What are you going to do with it after it's sorted?
I need to isolate all lines that start with two particular characters
(zz, to be specific).

 Here's a start: http://docs.python.org/lib/typesseq-mutable.html
 Google GnuWin32 and see if their sort does what you want.
Will do, thanks for the tip.

 If you really have a 2GB file and only 2GB of RAM, I suggest that you don't 
 hold your breath.
Unfortunately, I am limited in resources.
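
Given the goal stated above (isolating the lines that start with zz),
a single streaming pass may be enough, with no sort at all. The
following is a minimal sketch, not a tested solution for the actual
file: the function name extract_prefix_lines and its parameters are
hypothetical, and it relies on the statement above that the file is
UTF-8 and the first two characters are ASCII, so a byte-level
comparison is valid.

```python
def extract_prefix_lines(src_path, dst_path, prefix=b"zz"):
    """Copy every line that starts with prefix from src_path to dst_path.

    Returns the number of lines copied. The file is read in binary mode
    and one line at a time, so memory use stays small regardless of
    file size.
    """
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Assumption: the prefix is pure ASCII, so comparing raw
            # UTF-8 bytes is equivalent to comparing characters.
            # Note: if the file begins with a UTF-8 byte order mark,
            # the very first line will not match.
            if line.startswith(prefix):
                dst.write(line)
                count += 1
    return count
```

For example, extract_prefix_lines("big.txt", "zz.txt") (file names
hypothetical) would write only the zz lines, in their original order.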

Cheers,

Ira


Filtering content of a text file

2007-07-27 Thread Ira . Kovac
Hello All,

I'd greatly appreciate it if you could take a look at the task I need
help with.

It'd be outstanding if someone could provide some sample Python code.

Thanks a lot,

Ira

---
Problem
---

I am working with 30K+ record datasets in flat file format (.txt) that
look like this:

//-+alibaba sinage
//-+amra damian//_9
//-+anix anire//_
//-+borom
//-+bokima sun drane
//-+ciren
//-+cop calestieon eded
//-+ciciban
//-+drago kimano sole


The records start with the same string (in the example //-+), which is
followed by another string of characters that changes from record to
record.

I am working on one file at a time, and for each file I need to be
able to do the following:

a) By looping thru the file, the program should isolate all records
that have the letter a following the //-+
b) The isolated dataset will contain only records that start with //-+a
c) Save the isolated dataset as a flat text file named a.txt
d) Repeat a), b) and c) for all letters of the English alphabet (a thru z)
and numerical values (0 thru 9)
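
Steps a) thru d) could be sketched roughly as follows. This is an
illustration, not a definitive implementation: it makes one pass over
the file and groups records by the character after the prefix, rather
than looping 36 times as the steps literally describe. The function
name split_by_first_char, the out_dir parameter, and the UTF-8
encoding are assumptions; only lowercase a thru z and 0 thru 9 are
treated as valid, as in the sample data.

```python
import os
import string

PREFIX = "//-+"
# Characters that get their own output file: a-z and 0-9.
WANTED = set(string.ascii_lowercase + string.digits)


def split_by_first_char(src_path, out_dir, prefix=PREFIX):
    """Write records to out_dir/<char>.txt by the char after prefix.

    Returns a dict mapping each character to its record count.
    """
    buckets = {}
    # Assumption: the flat file is UTF-8 (the post does not say).
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            # Skip lines without the prefix or with nothing after it.
            if not line.startswith(prefix) or len(line) <= len(prefix):
                continue
            ch = line[len(prefix)]
            if ch in WANTED:
                buckets.setdefault(ch, []).append(line)

    os.makedirs(out_dir, exist_ok=True)
    for ch, lines in buckets.items():
        out_path = os.path.join(out_dir, ch + ".txt")
        with open(out_path, "w", encoding="utf-8") as out:
            out.writelines(lines)
    return {ch: len(lines) for ch, lines in buckets.items()}
```

With 30K+ records, holding the groups in memory should be fine. Files
are only created for characters that actually occur, so a letter with
no records produces no file.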








Re: Filtering content of a text file

2007-07-27 Thread Ira . Kovac
Thanks all for the input. This is going to be a great basis for
starting. And, yeah - I wish it were homework.

Best,

Ira
