+1 to what Erik said. To give an example, at ingestion time you could add to each document an element (or JSON property) with the VIN number, like:

<vin>Pn5123456</vin>

Then at query time you could look for any document with the vin element.
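As a rough illustration of the enrichment step described above, here is a minimal Python sketch that could run before documents are loaded. It is only a sketch under stated assumptions: the pattern is a simplified VIN check (17 characters, with the letters I, O and Q excluded) rather than the regex from the original question, the function names find_vins and enrich are made up for this example, and it assumes each document is a small XML string with a single root element.

```python
import re
import xml.etree.ElementTree as ET

# Simplified VIN pattern (an assumption for this sketch): exactly 17
# characters drawn from digits and letters, excluding I, O and Q.
VIN_PATTERN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")


def find_vins(text):
    """Return the distinct VIN-like tokens in text, in order of first appearance."""
    seen = []
    # Upper-case the text first so lower-case VINs are normalized.
    for match in VIN_PATTERN.findall(text.upper()):
        if match not in seen:
            seen.append(match)
    return seen


def enrich(xml_string):
    """Append a <vin> child to the root element for each VIN-like token found.

    VINs that already have a <vin> element are skipped, so running the
    function twice on the same document does not add duplicates.
    """
    root = ET.fromstring(xml_string)
    existing = {element.text for element in root.iter("vin")}
    text = " ".join(root.itertext())
    for vin in find_vins(text):
        if vin not in existing:
            ET.SubElement(root, "vin").text = vin
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(enrich("<doc><p>this is my string 2T3JK4DV1AW023473</p></doc>"))
```

The pattern is deliberately loose, so any 17-character alphanumeric token without I, O or Q will be tagged; a stricter pattern or a check-digit test could be swapped in if false positives matter.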
To extract the VIN from each document and add the element at ingest time, you could use any ETL tool, a scripting language, or XQuery via CPF, trigger, or MLCP.

Several natural language processors try to solve this kind of enrichment problem. Maybe someone on the list can recommend specific NLP tools for VIN recognition based on their experience. In my experience, don't use an NLP tool if you can solve the problem with a simple regex.

Sam Mefford
Senior Engineer
MarkLogic Corporation
sam.meff...@marklogic.com
www.marklogic.com

On 7/14/2015 10:01 PM, Erik Hennum wrote:

Hi, Javier:

If it's a smallish set of documents, you can write a loop that reads each document and applies a regex to all of the text in the document. If it is a substantial corpus, however, you should look at enriching the documents to support searching for VIN numbers.

Searching over a set of values with good performance at scale requires an index over the values. To recognize the values within JSON or XML documents, the indexer looks for a specified JSON property or XML element or attribute. That requires modifying the documents on or after ingestion to identify the VIN numbers. (It's easiest if you can specify a unique JSON property or XML element or attribute, but if that's not possible, fields can support unions and path range indexes can support containment.)

Several natural language processors try to solve this kind of enrichment problem. Maybe someone on the list can recommend specific NLP tools for VIN recognition based on their experience.

Hoping that helps,

Erik Hennum

________________________________
From: general-boun...@developer.marklogic.com on behalf of Javier Lizarraga [jlizarr...@gennet.com]
Sent: Tuesday, July 14, 2015 5:21 PM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] search MarkLogic Database using Regular Expressions

Is there a way to issue a search using a regular expression in MarkLogic? For example, the following regular expression identifies a VIN number:

(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))

I would like to issue a query that searches the entire database and returns the documents that contain valid VIN numbers, similar to fn:matches, which takes a string and a pattern and returns a Boolean value:

fn:matches("this is my string 2T3JK4DV1AW023473", "(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))")

I'd like to do something like this:

cts:search("(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))")

Any help would be greatly appreciated!!

Javier

_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general