Hello. I am working on a project where one system (System A) contains seven
free-text comment fields (unstructured data).
I have concatenated all seven fields into a single field.

A second system (System B) contains two unstructured fields that capture
text comments. I have concatenated these into a single field, just as I did
for System A. System B holds highly sensitive, restricted data.

The problem I'm trying to solve is that no text data from System B
(sensitive narratives, investigation IDs, etc.) should appear in System A.
In essence, I am trying to find the following three items in the System A
comments:
1) Direct references to investigations (e.g. "Investigation number ABC123")
2) Language that refers to an investigation without citing an ID
(e.g. "Jane Doe is under investigation")
3) Verbatim cut-and-paste segments that were copied from System B into the
System A commentary fields.
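
To make items 1 and 3 concrete, here is a rough sketch using only the
standard library. It is an illustration under assumptions, not a tested
approach: the ID pattern (two or more letters followed by two or more digits,
as in "ABC123") is a guess at the real ID format, and the function names
find_investigation_ids and longest_shared_passage are made up for this
example.

```python
import re
from difflib import SequenceMatcher

# ASSUMPTION: IDs look like "ABC123" and follow the word "investigation",
# optionally with "number", "no." or "#". Adjust to the real ID scheme.
INVESTIGATION_ID = re.compile(
    r"\binvestigation\s*(?:number|no\.?|#)?\s*[:\-]?\s*([A-Z]{2,}\d{2,})\b",
    re.IGNORECASE,
)


def find_investigation_ids(text):
    """Item 1: return every investigation ID cited directly in the text."""
    return [match.upper() for match in INVESTIGATION_ID.findall(text)]


def longest_shared_passage(text_a, text_b, min_words=5):
    """Item 3: return the longest run of words common to both texts.

    Returns None when the longest run is shorter than min_words, so that
    short coincidental overlaps are not reported as copied text.
    """
    # Compare word tokens rather than characters so that punctuation and
    # letter case do not hide a copied passage.
    words_a = re.findall(r"\w+", text_a.lower())
    words_b = re.findall(r"\w+", text_b.lower())
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    match = matcher.find_longest_match(0, len(words_a), 0, len(words_b))
    if match.size < min_words:
        return None
    return " ".join(words_a[match.a:match.a + match.size])


if __name__ == "__main__":
    system_a = ("Spoke with client. Subject admitted moving funds through "
                "three shell accounts in March. See Investigation number "
                "ABC123.")
    system_b = ("Case notes: subject admitted moving funds through three "
                "shell accounts in March per interview.")
    print(find_investigation_ids(system_a))
    print(longest_shared_passage(system_a, system_b))
```

The min_words threshold is arbitrary here; it would need tuning against real
comments so that common stock phrases are not flagged as copies.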

It seems I may need different text similarity techniques (comparing System A
text against System B text) or search techniques for one or more of the
three items.
I was thinking that cosine similarity might be useful, but I thought I would
solicit some advice first, as I'm new to text analytics in Python.
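
For reference, this is roughly what I mean by cosine similarity: a minimal
standard-library sketch that weights word counts with a smoothed TF-IDF and
compares the resulting vectors. The helper names (tokenize, tfidf_vectors,
cosine_similarity) and the sample comments are invented for illustration; a
real run would compare each System A comment against the System B comments.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF, so a term found in every document keeps weight.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors (0.0 if one is empty)."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical comments: one from System B, two from System A.
    docs = [
        "Jane Doe is under investigation for wire fraud",
        "Customer says Jane Doe is under investigation",
        "Please update the mailing address on file",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine_similarity(vecs[0], vecs[1]))  # shares several terms
    print(cosine_similarity(vecs[0], vecs[2]))  # shares no terms
```

My understanding is that this scores overall vocabulary overlap between two
comments, so I am unsure whether it can catch a short copied fragment inside
a long comment, or an exact ID reference.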

Thank you in advance.


Kenneth R Adams
Compliance Technology and Analytics
TAS -Text Analytics as a Service
Wells Fargo & Co. |  401 South Tryon Street, Twenty-sixth Floor | Charlotte, NC 
28202
MAC: D1050-262
Cell: 704-408.5157

kenneth.r.ad...@wellsfargo.com

-- 
https://mail.python.org/mailman/listinfo/python-list