On Sat, Dec 12, 2009 at 9:45 AM, Ernesto <[email protected]> wrote:
> Dear Francesc,
>
> thank you for your reply. I'll try to better explain my problem using
> real examples of data and code.
>
> As I wrote I start with an input file. It contains a string of
> variable length (10e7-10e8). This string consists of four different
> characters (A,C,G,T), the bases of a DNA molecule.
> The format of the input file is:
>
> >scaffold_0
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> ... until 10e8 characters
>
> Each character (base) is associated with a specific position: the
> first character, A, has position 1, the second, G, has position 2,
> and so on.
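
The position numbering described above can be sketched as follows. This is only an illustration: the function name `iter_bases` and the tiny in-memory sample are made up, and it assumes a single `>` header line followed by sequence lines.

```python
import io

def iter_bases(handle):
    """Yield (position, base) pairs, numbering bases from 1.

    Lines starting with '>' are treated as headers and skipped.
    """
    position = 1
    for line in handle:
        if line.startswith(">"):
            continue
        for base in line.strip():
            yield position, base
            position += 1

# Hypothetical miniature input standing in for the real "seq" file.
sample = io.StringIO(">scaffold_0\nAGCA\nGTGA\n")
pairs = list(iter_bases(sample))
print(pairs[:3])
```

Positions keep increasing across line breaks, so the line length of the input file does not matter.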
>
> Using pytables I can store all characters base by base in a structure
> like the following:
>
> (1, A)
> (2, G)
> ... and so on
I apologize if I misunderstand the problem, but if you only need to keep
track of counts, and not order, you can use a "column" for each base,
initialized to zero. In numpy:
>>> import numpy as np
>>> a = np.zeros(1000000, dtype=[('A', int), ('C', int), ('G', int), ('T', int)])
>>> a[0]['A'] += 1
>>> a[1]['G'] += 1
>>> a[1]['A'] += 1
etc.
That is easily stored in pytables.
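
A slightly fuller sketch of the same idea, with a tiny array so the result can be inspected. The names `n_positions` and `base_counts` and the choice of `np.int32` are assumptions for illustration only; row 0 corresponds to position 1.

```python
import numpy as np

n_positions = 10  # tiny for illustration; the real data has ~10e8 positions

# One record per position, one integer counter per base.
count_dtype = [("A", np.int32), ("C", np.int32), ("G", np.int32), ("T", np.int32)]
base_counts = np.zeros(n_positions, dtype=count_dtype)

# Select the field (column) first, then the row, and increment in place.
base_counts["A"][0] += 1
base_counts["G"][1] += 1
base_counts["A"][1] += 1

print(base_counts[:2])
```

Note that the array is one-dimensional: the four counters live in the fields of each record, not in a second axis.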
-brentp
>
> Then I have a second file in which there are other strings and related
> positions. Reading this file, I have to update the table according to
> the position.
> For example, I read that at position 2 I have another G, at position
> 3 a C, and at position 1 a G. According to the position I can associate:
>
> (1, A) --> G
> (2, G) --> G
> (3, C) --> C
>
> I can read the same position more than once, a variable number of times.
>
> (1, A) --> GGGGAAAAAAAAAAA
> (2, G) --> GGGGGGCGGG
> (3, C) --> CCCCC
>
> I cannot predict a priori the number of characters to associate with
> each position.
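
If only the number of times each base was seen at a position matters, and not the order in which the characters were read (an assumption about the use case), the unpredictable length stops being a problem: every position needs exactly four counters. A minimal sketch, where `add_observations`, `BASES` and the sample observations are hypothetical names and data:

```python
import numpy as np

BASES = "ACGT"
BASE_INDEX = {base: i for i, base in enumerate(BASES)}

def add_observations(counts, observations):
    """Add (position, base) observations to an (n_positions, 4) matrix in place.

    Positions are 1-based, as in the description above.
    """
    rows = np.array([pos - 1 for pos, _ in observations], dtype=np.intp)
    cols = np.array([BASE_INDEX[base] for _, base in observations], dtype=np.intp)
    # np.add.at accumulates correctly even when the same (row, col) repeats,
    # which plain fancy-index assignment (counts[rows, cols] += 1) does not.
    np.add.at(counts, (rows, cols), 1)

counts = np.zeros((3, 4), dtype=np.int64)
add_observations(counts, [(2, "G"), (3, "C"), (1, "G"), (1, "A"), (1, "A")])
print(counts)
```

Observations can be fed in batches as the second file is read, so the whole file never has to be held in memory.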
>
> As you suggested, I tried to use a vlarray. In practice, while
> generating the table I also build the vlarray in order to
> initialize the structure.
> The code I tried is the following:
>
> from tables import *
> from numpy import *
>
> class NucSeq(IsDescription):
>     id = Int32Col(pos=1)        # integer
>     gnuc = StringCol(1, pos=2)  # 1-character String
>
> # Open a file in "w"rite mode
> fileh = openFile("table1.h5", mode = "w")
> root = fileh.root
> # Create a new group
> group = fileh.createGroup(root, "newgroup")
> # Create a new table in newgroup group
> tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
>                              Filters(1))
> nucseq = tableNuc.row
> vlarray = fileh.createVLArray(root, 'vlarray', StringAtom(itemsize=1),
>                               "vlarray test")
> f=open("seq")
> x=1
> for i in f:
>     if i[0]!=">":
>         l=i.strip()
>         for j in l:
>             nucseq['id']=x
>             nucseq['gnuc']=j
>             nucseq.append()
>             vlarray.append([])
>             x+=1
> f.close()
> tableNuc.flush()
> fileh.close()
>
> If I remove the vlarray, pytables builds the table in a few seconds.
> With the vlarray, the same job takes more than 20 hours to complete.
> In the code above I preferred to initialize the structure up front
> because then I can quickly add each character by addressing its
> specific position.
> If you need I could provide the "seq" file (it is 4MB after
> compression).
>
> Thank you very much in advance for any help and suggestion.
>
> Ernesto
>
> PS: sorry for the late answer, but I don't receive the replies directly.
> I don't know why.
>
> ------------------------------------------------------------------------------
> Return on Information:
> Google Enterprise Search pays you back
> Get the facts.
> http://p.sf.net/sfu/google-dev2dev
> _______________________________________________
> Pytables-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>