On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> Dear Group,
>
> I am Sri Subhabrata Banerjee trying to write from Gurgaon, India to discuss
> some coding issues. If any one of this learned room can shower some light I
> would be helpful enough.
>
> I got to code a bunch of documents which are combined together.
> Like,
>
> 1)A Mumbai-bound aircraft with 99 passengers on board was struck by lightning
> on Tuesday evening that led to complete communication failure in mid-air and
> forced the pilot to make an emergency landing.
> 2) The discovery of a new sub-atomic particle that is key to understanding
> how the universe is built has an intrinsic Indian connection.
> 3) A bomb explosion outside a shopping mall here on Tuesday left no one
> injured, but Nigerian authorities put security agencies on high alert fearing
> more such attacks in the city.
>
> The task is to separate the documents on the fly and to parse each of the
> documents with a definite set of rules.
>
> Now, the way I am processing is:
> I am clubbing all the documents together, as,
>
> A Mumbai-bound aircraft with 99 passengers on board was struck by lightning
> on Tuesday evening that led to complete communication failure in mid-air and
> forced the pilot to make an emergency landing.The discovery of a new
> sub-atomic particle that is key to understanding how the universe is built
> has an intrinsic Indian connection. A bomb explosion outside a shopping mall
> here on Tuesday left no one injured, but Nigerian authorities put security
> agencies on high alert fearing more such attacks in the city.
>
> But they are separated by a tag set, like,
> A Mumbai-bound aircraft with 99 passengers on board was struck by lightning
> on Tuesday evening that led to complete communication failure in mid-air and
> forced the pilot to make an emergency landing.$
> The discovery of a new sub-atomic particle that is key to understanding how
> the universe is built has an intrinsic Indian connection.$
> A bomb explosion outside a shopping mall here on Tuesday left no one injured,
> but Nigerian authorities put security agencies on high alert fearing more
> such attacks in the city.
>
> To detect the document boundaries, I am splitting them into a bag of words
> and using a simple for loop as,
> for i in range(len(bag_words)):
>     if bag_words[i]=="$":
>         print (bag_words[i],i)
>
> There is no issue. I am segmenting it nicely. I am using annotated corpus so
> applying parse rules.
>
> The confusion comes next,
>
> As per my problem statement the size of the file (of documents combined
> together) won’t increase on the fly. So, just to support all kinds of
> combinations I am appending in a list the “I” values, taking its length, and
> using slice. Works perfect. Question is, is there a smarter way to achieve
> this, and a curious question if the documents are on the fly with no
> preprocessed tag set like “$” how may I do it? From a bunch without EOF isn’t
> it a classification problem?
>
> There is no question on parsing it seems I am achieving it independent of
> length of the document.
>
> If any one in the group can suggest how I am dealing with the problem and
> which portions should be improved and how?
>
> Thanking You in Advance,
>
> Best Regards,
> Subhabrata Banerjee.
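
For the "$"-tagged text quoted above, here is a minimal sketch of one possible alternative to collecting indices and slicing: splitting the combined string on the tag yields the documents directly. The names `split_documents`, `combined`, `documents` and `positions` are hypothetical, the sample sentences are shortened versions of the ones quoted, and the sketch assumes "$" never occurs inside a document.

```python
def split_documents(text, tag="$"):
    """Split the combined text on the tag and drop empty pieces."""
    return [doc.strip() for doc in text.split(tag) if doc.strip()]

# Shortened stand-ins for the three quoted documents (illustrative only).
combined = (
    "A Mumbai-bound aircraft was struck by lightning.$ "
    "The discovery of a new sub-atomic particle has an Indian connection.$ "
    "A bomb explosion outside a shopping mall left no one injured."
)

documents = split_documents(combined)
for number, doc in enumerate(documents, start=1):
    print(number, doc)

# Character positions of the tag, in case the indices are still needed.
positions = [i for i, ch in enumerate(combined) if ch == "$"]
```

Each entry of `documents` can then be handed to the parse rules on its own, without tracking slice boundaries by hand.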
Thanks, Peter, but I feel your earlier one was better. I also came across an
interesting one:

    [i - 1 for i in range(len(f1)) if f1.startswith('$', i - 1)]

But I am a bit intrigued by another question. Suppose I say:

    file_open = open("/python32/doc1.txt", "r")
    # read the whole file as one lower-cased string
    file = file_open.read().lower()
    # walk over it line by line and split each line into words
    for line in file.splitlines():
        line_word = line.split()

This works fine. But if I print inside the loop, everything is printed in one
continuous run. I would like to store the lines in some variable, so that I
can print a line of my choice and manipulate the lines as I wish. Is there
any way to do this?

Regards,
Subhabrata Banerjee
--
http://mail.python.org/mailman/listinfo/python-list
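
One possible way to keep the lines for later use, offered as a sketch and not a definitive answer: collect the split lines in a list, then pick any of them by index. The function name `load_lines` and the sample text are hypothetical; `io.StringIO` stands in for the real file so that the example runs without `/python32/doc1.txt`.

```python
import io

def load_lines(handle):
    """Return a list with one entry per non-empty line.

    Each entry is the list of lower-cased words on that line.
    """
    return [line.lower().split() for line in handle if line.strip()]

# Stand-in for open("/python32/doc1.txt", "r"); the text is illustrative only.
sample = io.StringIO(
    "A Mumbai-bound aircraft\n"
    "The discovery of a new particle\n"
    "A bomb explosion\n"
)

lines = load_lines(sample)

print(lines[1])    # the words of the second line
print(len(lines))  # how many lines were stored
```

With a real file, `with open(path) as handle: lines = load_lines(handle)` would do the same and also close the file afterwards.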