Hi list,

I'm working on an app in which longish text chunks (could be up to a few MB in 
size, and stored either in flat text files or in fields of database records - 
TBD) need to be searched for the presence of a combination of string constants, 
where the string constants can be combined with boolean operators (in the user 
input given in a web form, for the search). The DB is MongoDB, accessed from 
Python using the PyMongo driver / client.

The text chunks are associated with structured fields of database records, of 
users (say, suppliers), so the end result wanted is that the app user should be 
able to specify a search using boolean operators and the app should search 
across all the text chunks belonging to all users (users to text chunks is a 
one-to-many relationship, whether the chunks are stored as long varchar in the 
DB or as flat files, with a field of a database record containing the filename 
of the text file containing the chunk). Then the app should return both the 
user names and the corresponding text chunks, for chunks that matched the 
boolean expression.

Example return value:

user2
   filename1.txt
   filename4.txt
user5
   filename3.txt
user7
   filename6.txt
(where the filenames.txt could instead be the chunks of text, if the chunks are 
stored in the DB instead of the file system). 

The boolean expressions can be like these:

apple and orange
apple or pear
apple and (pear or mango)
(apple and pear) or mango
not orange
apple or not (pear and orange)
etc.

What are some good approaches to doing this using Python?
  
I'm first looking at pure Python code approaches, though have considered using 
search tools like Solr (via PySolr), ElasticSearch, etc. Right now the app is 
in the prototype stage so I first want to establish the feasibility and a 
simple approach to doing the search, though later, if the app needs to scale, 
may investigate the tools like PySolr or ES.

I think I may need to use one of the parsing libraries for Python such as PLY 
and others mentioned here:

https://wiki.python.org/moin/LanguageParsing

Let's say a parse tree is built up by whichever approach is used. A related 
question would be how to use this parse tree to search in the text chunks.

I realize that it is a somewhat complex questiom, so am not looking for 
complete solutions, but for suggestions on how to go about it.

Thanks for any suggestions.
--
Vasudev Ram
www.dancingbison.com
jugad2.blogspot.com

-- 
https://mail.python.org/mailman/listinfo/python-list

Reply via email to