On 03/12/14 09:36, [email protected] wrote:
Dear all
I aattach the file I want to parse and I wanto to create a sqlite database.
The problem is I'm not able to create in the right way because there is
some logic problems on my script. I have  this code:

Do you get an error message? If so please send a cut n paste of the full error text.

import sqlite
import sqlite3

You only need the second of these imports

import os

# Creates or opens a file called mydb with a SQLite3 DB
db = sqlite3.connect('Fastqcd_completo.db')
cursor = db.cursor()
create_table_sql = """
create table fastqc_summary (
fileid varchar,
module varchar,
status varchar,
total int,
duplicate varchar
);"""
> cursor.execute(create_table_sql)

This looks fine except its normal to have a unique key somewhere.
But if you are confident that you will have uniqueness in your
data you don't strictly need it.

create_table_sql2="""
create table fastqc_details (
id serial primary key,

But this is odd. I can't find anything about a serial keyword for SQLite. The normal format of this would be

id INTEGER PRIMARY KEY

which makes id an auto-incrementing unique value.

fileid varchar,
module varchar,
col1 varchar,
ocols varchar
);
"""
cursor.execute(create_table_sql2)
db.commit()

The other potential issue is that you are creating these tables each time you run the script. But if you run the script a second time the tables willalready exist and the CREATES will fail. You should either drop the tables at the top of the script or use the

CREATE TABLE IF NOT EXISTS <table> ...

format of create. Then if the tab;le already exists your data will
be appended. (If you want it overwritten use the drop table technique)

for root, dirs, files in
os.walk("/home/mauro/Desktop/LAVORO_CRO/2014/Statitica_RNAseqalign/FASTQC_completo/fastqcdecembre/"):
# walk a r
     for name in files:
         if (name == "fastqc_data.txt"):

You are searching for a specific name. Could there be multiple such files or are you just using wa;lk to find one? If the latter it would be better to exit the os.walk loop once you find it and then process the file. You need to store the root and filename first of course.

             fileid = name # use string slicing here if you only want
part of the

    with open(os.path.join(root,name),"r") as p: # automatically close
the file when done
                 for i in p:
                     line =i.strip()

                     if "Filename" in line:
                         fileid = line.split()[1]
                     if "Total Sequence" in line:
                         total = line.split()[2]

                     if "Total Duplicate" in line:
                         dup = line.split()[3]

                     if (line[:2] == ">>" and line[:12] != ">>END_MODULE"):
                         module = line[2:-5] # grab module name
                         status = line[-4:] # and overall status
pass/warn/fail
                         sql = "insert into
fastqc_summary(fileid,module,status,total,duplicate) values(?,?,?,?,?);"
                         data = (fileid,module,status,total,dup)
                         cursor.execute(sql,data)
                     elif (line[:2] != ">>" and line[:2] != "##"): #
grab details under each module
                         cols = line.split("\t")
                         col1 = cols[0]
                         ocols = "|".join(cols[1:])
                         sql = "insert into
fastqc_details(fileid,module,col1,ocols) values(?,?,?,?);"
                         data = (fileid,module,col1,ocols)
                         cursor.execute(sql,data)
db.commit()


So the problem is how to excatct only some parts of the files. In red
are the point of the problems. The say the Filename are not defined

What says that? It's true that you don't have a variable Filename defined. But I don't see Filename in your code anywhere either. So I don't understand the error message. Can you post it in full please?

the file atached I want to take this part and use for create the database:
##FastQC    0.10.1
 >>Basic Statistics    pass
#Measure    Value
Filename    R05_CTTGTA_L004_R1_001.fastq.gz
File type    Conventional base calls
Encoding    Sanger / Illumina 1.9
Total Sequences    27868496
Filtered Sequences    0
Sequence length    50
%GC    50
 >>END_MODULE


--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Reply via email to