On Wednesday, March 9, 2016 at 9:49:17 AM UTC+5:30, subhaba...@gmail.com wrote:
> Dear Group,
>
> I am trying to write code that pulls data from MySQL at the back end,
> annotates words, and puts the results out as separate sentences, one per
> line. The code generally runs fine, but I feel the final step that gives
> out the sentences could be better, and while it is okay for small data
> sets, with 50,000 news articles it is dead slow. I am using Python 2.7.11
> on Windows 7 with 8 GB RAM.
>
> I am copying the code here for your kind review.
>
> import MySQLdb
> import nltk
>
> def sql_connect_NewTest1():
>     db = MySQLdb.connect(host="localhost",
>                          user="*****",
>                          passwd="*****",
>                          db="abcd_efgh")
>     cur = db.cursor()
>     #cur.execute("SELECT * FROM newsinput limit 0,50000;") #REPORTING RUNTIME ERROR
>     cur.execute("SELECT * FROM newsinput limit 0,50;")
>     dict_open=open("/python27/NewTotalTag.txt","r") #OPENING THE DICTIONARY FILE
>     dict_read=dict_open.read()
>     dict_word=dict_read.split()
>     a4=dict_word #Assignment for code.
>     list1=[]
>     flist1=[]
>     nlist=[]
>     for row in cur.fetchall():
>         #print row[2]
>         var1=row[3]
>         #print var1 #Printing lines
>         #var2=len(var1) # Length of file
>         var3=var1.split(".") #SPLITTING INTO LINES
>         #print var3 #Printing The Lines
>         #list1.append(var1)
>         var4=len(var3) #Number of all lines
>         #print "No",var4
>         for line in var3:
>             #print line
>             #flist1.append(line)
>             linew=line.split()
>             for word in linew:
>                 if word in a4:
>                     windex=a4.index(word)
>                     windex1=windex+1
>                     word1=a4[windex1]
>                     word2=word+"/"+word1
>                     nlist.append(word2)
>                     #print list1
>                     #print nlist
>                 elif word not in a4:
>                     word3=word+"/"+"NA"
>                     nlist.append(word3)
>                     #print list1
>                     #print nlist
>                 else:
>                     print "None"
>
>     #print "###",flist1
>     #print len(flist1)
>     #db.close()
>     #print nlist
>     lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)] #TRYING TO SPLIT THE RESULTS AS SENTENCES
>     nlist1=lol(nlist,7)
>     #print nlist1
>     for i in nlist1:
>         string1=" ".join(i)
>         print i
>         #print string1
>
> Thanks in advance.
****************************************************************************
Dear Group,

Thank you all for your kind time and all the suggestions in helping me.

Thank you Steve for writing out the whole code. It is working fine, but
speed is still an issue; we need to speed it up (a sketch of one way to do
this is in the postscript below). Inada, I tried changing to
cur = db.cursor(MySQLdb.cursors.SSCursor), but my System Admin said that
may not be an issue.

Freidrich, my problem is that I have a big repository of .txt files in
MySQL at the back end. I also have a list of words with their possible
tags. The tags are not conventional Part-of-Speech (PoS) tags, but ones
defined by others. The code is expected to read each file, line by line.
On reading each line, it scans the list for the appropriate tag for each
word; if one is found it assigns that tag, else it assigns NA. The
assignment should be in the /tag format, so that a string of n words looks
like

w1/tag w2/tag w3/tag w4/tag ... wn/tag

where tag is either a tag from the list or NA, as per the situation. This
format is chosen because the files are expected to be tagged in Brown
Corpus style. There is a Python library named NLTK; if I want to save my
data for use with its models, I need to follow some specifications. I want
to use the Tagged Corpus format, where the tagged data coming out should
be one tagged sentence on each new line, or a lattice. NLTK expects the
data to be saved in .pos format; I am not doing that in this code yet, but
I may do it later (see the second postscript below). Please let me know if
I need to give any more information.

Matt, thank you for the if...else suggestion. The data in NewTotalTag.txt
is a simple list of words with unconventional tags, like:

w1 tag1
w2 tag2
w3 tag3
...

Regards,
Subhabrata
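
P.S. To make the speed question concrete, here is a minimal sketch of the
lookup-table idea, assuming the same table, column positions, and
NewTotalTag.txt layout as in my original post (the names build_tag_dict
and tag_articles are just placeholders). It replaces the list scan
("word in a4" plus .index(), which walks the whole word list for every
single word) with a dict lookup, and streams rows with SSCursor instead
of fetchall():

import MySQLdb
import MySQLdb.cursors

def build_tag_dict(path):
    # NewTotalTag.txt alternates "w1 tag1 w2 tag2 ...", so pair up
    # consecutive tokens into a dict for O(1) lookups.
    tokens = open(path, "r").read().split()
    return dict(zip(tokens[0::2], tokens[1::2]))

def tag_articles():
    tags = build_tag_dict("/python27/NewTotalTag.txt")
    db = MySQLdb.connect(host="localhost", user="*****",
                         passwd="*****", db="abcd_efgh",
                         cursorclass=MySQLdb.cursors.SSCursor)
    cur = db.cursor()
    # SSCursor streams rows one at a time instead of loading all
    # 50,000 articles into memory at once.
    cur.execute("SELECT * FROM newsinput LIMIT 0,50000")
    for row in cur:
        text = row[3]
        for sentence in text.split("."):  # crude sentence split, as before
            words = sentence.split()
            if not words:
                continue
            tagged = [w + "/" + tags.get(w, "NA") for w in words]
            print " ".join(tagged)        # one tagged sentence per line
    db.close()

tag_articles()

This also drops the seven-word chunking (the lol lambda) and instead
prints each split sentence on its own line, which is closer to the
one-tagged-sentence-per-line output described above.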
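
P.P.S. For the NLTK side, once the tagged sentences are written out one
per line to a .pos file, NLTK's TaggedCorpusReader can read them back (its
default word/tag separator is already "/"). A minimal sketch, assuming the
output has been saved under /python27/corpus/ with a .pos extension (the
directory and file layout here are made up):

from nltk.corpus.reader import TaggedCorpusReader

reader = TaggedCorpusReader("/python27/corpus", r".*\.pos", sep="/")
print reader.tagged_words()[:10]   # [('w1', 'tag1'), ('w2', 'NA'), ...]
print reader.tagged_sents()[0]     # first sentence as (word, tag) pairs

By default TaggedCorpusReader treats each line as one sentence, which
matches the one-tagged-sentence-per-line format above.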