Dear Francesc,
Thank you for your reply. I'll try to explain my problem better, using
real examples of data and code.
As I wrote, I start with an input file. It contains a string of
variable length (10e7-10e8 characters). This string consists of four
different characters (A, C, G, T), the bases of a DNA molecule.
The format of the input file is:
>scaffold_0
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
... and so on, up to 10e8 characters
Each character (base) is associated with a specific position: the
first character, A, has position 1, the second, G, has position 2, and so on.
Using PyTables, I can store all characters base by base in a structure
like the following:
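To make the numbering concrete, here is a minimal sketch in plain Python of how positions can be assigned while reading the file. The name iter_bases is only illustrative, and the sketch assumes a single scaffold per file, with header lines starting with ">".

```python
def iter_bases(lines):
    """Yield (position, base) pairs, with positions starting at 1.

    Header lines (starting with '>') are skipped; this sketch assumes
    the input holds a single scaffold.
    """
    position = 1
    for line in lines:
        if line.startswith(">"):
            continue
        for base in line.strip():
            yield position, base
            position += 1


# A tiny stand-in for the real input file.
example = [">scaffold_0\n", "AGCA\n", "GTGA\n"]
pairs = list(iter_bases(example))
print(pairs[:3])  # [(1, 'A'), (2, 'G'), (3, 'C')]
```

Positions keep increasing across line breaks, so the line length of the input file does not matter.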
(1, A)
(2, G)
... and so on
Then I have a second file containing other strings and their
positions. While reading this file, I have to update the table according
to the position.
For example, I read that at position 2 I have another G, at position
3 a C, and at position 1 a G. According to the position I can associate:
(1, A) --> G
(2, G) --> G
(3, C) --> C
I can read the same position more than once, a variable number of times:
(1, A) --> GGGGAAAAAAAAAAA
(2, G) --> GGGGGGCGGG
(3, C) --> CCCCC
I cannot predict a priori the number of characters to associate with each
position.
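In memory, the association I have in mind could be sketched as below. This is plain Python rather than PyTables; the names collect_observations and observations are only illustrative, and the second file is assumed to have been parsed already into (position, base) pairs.

```python
from collections import defaultdict


def collect_observations(observations):
    """Group observed bases by position, keeping every occurrence.

    `observations` is any iterable of (position, base) pairs; the same
    position may appear any number of times and in any order.
    """
    by_position = defaultdict(list)
    for position, base in observations:
        by_position[position].append(base)
    # Join each list into a string such as "GGA" for compact display.
    return {pos: "".join(bases) for pos, bases in by_position.items()}


observations = [(2, "G"), (3, "C"), (1, "G"), (1, "G"), (1, "A")]
result = collect_observations(observations)
print(result[1], result[2], result[3])  # GGA G C
```

The length of each string is only known once the whole second file has been read, which is why I need a variable-length structure on disk.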
As you suggested, I tried to use a VLArray. In practice, while
generating the table I also build the VLArray in order to
initialize the structure.
The code I tried is the following:
from tables import *
from numpy import *

class NucSeq(IsDescription):
    id = Int32Col(pos=1) # integer
    gnuc = StringCol(1, pos=2) # 1-character String

# Open a file in "w"rite mode
fileh = openFile("table1.h5", mode = "w")
root = fileh.root
# Create a new group
group = fileh.createGroup(root, "newgroup")
# Create a new table in newgroup group
tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
                             Filters(1))
nucseq = tableNuc.row
vlarray = fileh.createVLArray(root, 'vlarray', StringAtom(itemsize=1),
                              "vlarray test")
f=open("seq")
x=1
for i in f:
    if i[0]!=">":
        l=i.strip()
        for j in l:
            nucseq['id']=x
            nucseq['gnuc']=j
            nucseq.append()
            vlarray.append([])
            x+=1
f.close()
tableNuc.flush()
fileh.close()
If I remove the vlarray, PyTables builds the table in several
seconds. With the vlarray, the time increases and the same job takes
more than 20 hours to complete.
In the code above I preferred to initialize the structure because then
I can quickly add each character by addressing its specific position.
If needed, I can provide the "seq" file (it is 4MB after
compression).
Thank you very much in advance for any help and suggestion.
Ernesto
PS: Sorry for the late answer, but I don't receive the replies directly.
I don't know why.
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users