Dear Group,

I am Sri Subhabrata Banerjee, writing from Gurgaon, India, to discuss 
a coding issue. If anyone in this learned room can shed some light on it, I 
would be grateful. 

I have to process a bunch of documents which are combined together, 
such as: 

1) A Mumbai-bound aircraft with 99 passengers on board was struck by lightning 
on Tuesday evening that led to complete communication failure in mid-air and 
forced the pilot to make an emergency landing.
2) The discovery of a new sub-atomic particle that is key to understanding how 
the universe is built has an intrinsic Indian connection.
3) A bomb explosion outside a shopping mall here on Tuesday left no one 
injured, but Nigerian authorities put security agencies on high alert fearing 
more such attacks in the city.

The task is to separate the documents on the fly and to parse each of the 
documents with a definite set of rules. 

The way I am processing them is as follows. 
I am clubbing all the documents together, like this:

A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on 
Tuesday evening that led to complete communication failure in mid-air and 
forced the pilot to make an emergency landing. The discovery of a new sub-atomic 
particle that is key to understanding how the universe is built has an 
intrinsic Indian connection. A bomb explosion outside a shopping mall here on 
Tuesday left no one injured, but Nigerian authorities put security agencies on 
high alert fearing more such attacks in the city.

But they are separated by a "$" tag, like this: 
A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on 
Tuesday evening that led to complete communication failure in mid-air and 
forced the pilot to make an emergency landing.$
The discovery of a new sub-atomic particle that is key to understanding how the 
universe is built has an intrinsic Indian connection.$
A bomb explosion outside a shopping mall here on Tuesday left no one injured, 
but Nigerian authorities put security agencies on high alert fearing more such 
attacks in the city.

To detect the document boundaries, I am splitting the text into a list of words 
(bag_words) and using a simple for loop: 

# Print each "$" marker together with its index in the word list.
for i, word in enumerate(bag_words):
    if word == "$":
        print(word, i)

This part works without any issue, and the text is segmented nicely. I am using 
an annotated corpus, so I then apply the parse rules to each segment. 

The confusion comes next, 

As per my problem statement, the size of the file (of documents combined 
together) won’t increase on the fly. So, to support all kinds of 
combinations, I am appending the “i” values (the indices of the “$” markers) 
to a list, taking its length, and slicing the word list at those positions. 
This works perfectly. My question is whether there is a smarter way to achieve 
this. I also have a curious question: if the documents arrive on the fly with 
no preprocessed tag such as “$”, how may I separate them? With a bunch of text 
and no EOF markers, isn’t it a classification problem? 
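
To make the slicing step concrete, here is a minimal sketch of roughly what I 
am describing. It assumes that "$" appears as its own token in bag_words; the 
function name split_documents and the sample words are made up only for 
illustration and are not my actual code. 

```python
def split_documents(bag_words, marker="$"):
    """Split a flat list of tokens into documents at each marker token."""
    # Indices of every marker token, as collected by the boundary loop.
    boundaries = [i for i, word in enumerate(bag_words) if word == marker]

    documents = []
    start = 0
    # Treat the end of the list as a final boundary so that the last
    # document, which has no trailing marker, is kept as well.
    for end in boundaries + [len(bag_words)]:
        documents.append(bag_words[start:end])
        start = end + 1  # skip over the marker itself
    return documents


# Illustrative input only: three short "documents" separated by "$".
sample = "An aircraft landed . $ A particle was found . $ A bomb exploded .".split()
for doc in split_documents(sample):
    print(" ".join(doc))
```

Each returned slice can then be passed to the parse rules independently of 
its length. 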

I have no question about the parsing itself; it seems I am achieving it 
independently of the length of the document. 

Could anyone in the group comment on how I am dealing with the problem, and 
suggest which portions should be improved and how? 

Thanking You in Advance,

Best Regards,
Subhabrata Banerjee. 
-- 
http://mail.python.org/mailman/listinfo/python-list