Hi everyone, this is my first post here :) About two weeks ago, due to low demand in my project, I was assigned a rather unusual request: to automatically extract answers from documents using machine learning. I had never read anything about ML, AI or NLP before, so that is basically what I've been doing for the past two weeks.
When it comes to ML, most book recommendations and tutorials I've found so far use Python and its tools, so I took the first week to learn Python, NumPy, scikit-learn, pandas, Matplotlib and so on. This week, after spending a few days on generic ML algorithms, I started reading about NLP itself. So far, I've read about Bag of Words, about using TF-IDF (or simple term counts) to convert words into numeric representations, and about a few methods such as Gaussian and multinomial naive Bayes to train a model and predict values. The material also stresses the importance of the usual pre-processing steps, such as lemmatization and the like.
However, basically all the examples assume that a given text can be classified into one of a fixed set of topics, as in the sentiment analysis use case. I'm afraid this doesn't represent my use case, so I'd like to describe it here so that you can help me identify which methods I should be looking at.
We have a system with thousands of transactions/deals entered manually by a specialized team. Each deal has a set of documents (typically a dozen per deal), and some documents can have hundreds of pages. The data entry team has to extract about a thousand fields from those documents for any particular deal. So, our database holds all their data, and we typically also know the specific document snippets associated with each field value.
My task is, given a new deal and its documents, and based on the previous answers, to fill in as many fields as I can by automatically finding the corresponding snippets in the new documents. I'm not sure how I should approach this problem. For example, I could treat each sentence of a document as a separate document to be analyzed and compared to the snippets I already have for the matching field. However, I can't be sure which of those sentences would actually answer the question.
For example, there may be 6 passages in the documents that answer a particular question/field, but the data entry team may have identified only 2 or 3 of them. Also, any given sentence could indicate that the answer for a given field is A or B, or there could be no association at all between the sentence and the field/question, as would be the case for most sentences. I know that scikit-learn provides the predict_proba method, so I could consider a sentence relevant only if the predicted probability of it answering the question is above 80%, for example. But based on a few quick tests I've made with a few sentences and words, I suspect this won't work very well. It could also be quite slow to treat each sentence of a document with hundreds of pages as a separate document to be analyzed, so I'm not sure whether there are better methods to handle this use case.
Some of the fields are free-text ones, such as company and firm names, and I suspect those would be the hardest to answer, so I'm trying to start with the multiple-choice ones, which have a finite set of possible values. How would you advise me to approach this problem? Are there any algorithms you'd recommend I study for solving it?
Here is some sample data to give you a better understanding of the problem. One of the fields is called "Deal Structure" and it can have the following values: "Asset Purchase", "Stock or Equity Purchase" or "Public Target Merger" (there are a few others, but this gives you an idea).
So, here are some sentences highlighted for Public Target Merger deals (these documents come from the public EDGAR filings database, which is freely available for US deals):
deal 1 / doc 1:
"AGREEMENT AND PLAN OF MERGER, dated as of March 14, 2018 (this “Agreement”), by and among HarborOne Bancorp, Inc., a Massachusetts corporation (“Buyer”), Massachusetts Acquisitions, LLC, a Maryland limited liability company of which Buyer is the sole member (“Merger LLC”), and Coastway Bancorp, Inc., a Maryland corporation (the “Company”)."
"WHEREAS, Buyer, Merger LLC, and the Company intend to effect a merger (the “Merger”) of Merger LLC with and into the Company in accordance with this Agreement and the Maryland General Corporation Law (the “MGCL”) and the Maryland Limited Liability Company Act, as amended (the “MLLCA”), with the Company to be the surviving entity in the Merger. The Merger will be followed immediately by a merger of the Company with and into Buyer (the “Upstream Merger”), with the Buyer to be the surviving entity in the Upstream Merger. It is intended that the Merger be mutually interdependent with and a condition precedent to the Upstream Merger and that the Upstream Merger shall, through the binding commitment evidenced by this Agreement, be effected immediately following the Effective Time (as defined below) without further approval, authorization or direction from or by any of the parties hereto; and"
deal 2 / doc 1:
"WHEREAS, it is also proposed that, as soon as practicable following the consummation of the Offer, the Parties wish to effect the acquisition of the Company by Parent through the merger of Purchaser with and into the Company, with the Company being the surviving entity (the “Merger”);"
Now, for Asset Purchase deals:
deal 3 / doc 1:
"Subject to the terms and conditions of this Agreement, Sellers are willing to sell to Buyer, and Buyer is willing to purchase from Sellers, all of their assets relating to the Businesses as set forth herein."
deal 4 / doc 1:
"WHEREAS, Seller wishes to sell and assign to Buyer, and Buyer wishes to purchase and assume from Seller, the rights and obligations of Seller to the Purchased Assets (as defined herein), subject to the terms and conditions set forth herein."
Please forgive any imprecise or incorrect terms or misunderstandings on my part, as this is all very new to me. Any help is greatly appreciated. I've also asked this question on Stack Overflow, so if you'd prefer to answer there instead, here is the link: https://stackoverflow.com/questions/55499866/how-to-answer-questions-from-big-documents
Would this field be called data mining? Feature extraction? Question answering? I'm not sure how to properly search for this subject, so any hints are very welcome :)
Thanks in advance,
Rodrigo.
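P.S. To make the sentence-level idea above more concrete, here is a minimal sketch of it, not a tested solution. It assumes TF-IDF features with multinomial naive Bayes and the 80% predict_proba cutoff mentioned earlier. The training snippets are shortened from the samples above. The "Unrelated" class, its two made-up sentences, the new sentences at the bottom and the helper name label_sentences are all hypothetical and only there for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Snippets shortened from the sample data in this post.
train_sentences = [
    "Buyer, Merger LLC, and the Company intend to effect a merger of "
    "Merger LLC with and into the Company",
    "the acquisition of the Company by Parent through the merger of "
    "Purchaser with and into the Company",
    "Sellers are willing to sell to Buyer, and Buyer is willing to "
    "purchase from Sellers, all of their assets",
    "Seller wishes to sell and assign to Buyer the rights and "
    "obligations of Seller to the Purchased Assets",
    # Hypothetical sentences standing in for text that says nothing
    # about the "Deal Structure" field (an assumption, not real data).
    "This Agreement shall be governed by the laws of the State of Delaware",
    "All notices shall be in writing and delivered by courier",
]
train_labels = [
    "Public Target Merger",
    "Public Target Merger",
    "Asset Purchase",
    "Asset Purchase",
    "Unrelated",
    "Unrelated",
]

# Turn sentences into TF-IDF vectors and fit the classifier on them.
vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_sentences), train_labels)


def label_sentences(vectorizer, classifier, sentences, threshold=0.8):
    """Return (sentence, label or None, best probability) per sentence.

    The label is None when the most likely class does not reach the
    threshold, i.e. the sentence is not considered a confident answer.
    """
    # All sentences are vectorized and scored in one batch, which avoids
    # calling the model once per sentence.
    probabilities = classifier.predict_proba(vectorizer.transform(sentences))
    results = []
    for sentence, row in zip(sentences, probabilities):
        best = row.argmax()
        best_probability = float(row[best])
        label = str(classifier.classes_[best])
        if best_probability < threshold:
            label = None
        results.append((sentence, label, best_probability))
    return results


# Hypothetical sentences from a "new" document.
new_sentences = [
    "Buyer wishes to purchase substantially all of the assets of Seller.",
    "Merger Sub shall be merged with and into the Company.",
]
for sentence, label, probability in label_sentences(
    vectorizer, classifier, new_sentences
):
    print(f"{probability:.2f}  {label}  {sentence}")
```

With this little training data, the probabilities will likely stay below the 0.8 cutoff for most sentences, which matches my suspicion that a fixed threshold alone may not work well. Treating "Unrelated" as just another class is one possible way to model sentences with no association to the field, but I don't know whether it is the right one.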
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn