+1 to what Erik said.  To give an example, at ingestion time you could add to 
each document an element (or JSON property) with the VIN number, like:
<vin>Pn5123456</vin>
Then at query time you could look for any document with the vin element.

To extract the VIN from each document and add the element at ingest time, 
you could use any ETL tool or scripting language, or XQuery via CPF, a trigger, 
or MLCP.
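
As a rough sketch of the scripting-language route, here is what the enrichment step might look like in Python before the document is loaded. The function name add_vin_elements and the sample <claim> document are made up for illustration, and the pattern is the regex from the original question with the stray commas dropped from the character classes, so treat it as an assumption rather than a complete VIN validator.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: the regex from the question, minus the commas inside the
# character classes. The lookarounds stop it matching inside a longer
# alphanumeric run. It checks shape only, not the VIN check digit.
VIN_RE = re.compile(
    r"(?<![A-Za-z0-9])"
    r"[A-HJ-NP-Z0-9]{9}[A-HJ-NPR-TV-Z0-9][A-HJ-NP-Z0-9][0-9]{6}"
    r"(?![A-Za-z0-9])",
    re.IGNORECASE,
)

def add_vin_elements(xml_text):
    """Return the document with one <vin> child appended per distinct VIN."""
    root = ET.fromstring(xml_text)
    # Join text nodes with a space so two neighbouring nodes cannot fuse
    # into something that merely looks like a VIN.
    text = " ".join(root.itertext())
    seen = []
    for vin in VIN_RE.findall(text):
        vin = vin.upper()
        if vin not in seen:
            seen.append(vin)
            ET.SubElement(root, "vin").text = vin
    return ET.tostring(root, encoding="unicode")

doc = "<claim><note>Vehicle 2T3JK4DV1AW023473 was towed.</note></claim>"
print(add_vin_elements(doc))
```

The enriched output is what you would then hand to MLCP or whatever loader you use. Note that this sketch does not check whether a <vin> element is already present, so it is meant to run once per document.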

> Several natural language processors try to solve this kind of enrichment 
> problem.  Maybe someone on the list can recommend specific NLP tools for VIN 
> recognition based on their experience.

In my experience, it's best not to reach for an NLP tool when you can solve 
the problem with a simple regex.




Sam Mefford
Senior Engineer
MarkLogic Corporation
sam.meff...@marklogic.com
Cell: +1 801 706 9731
www.marklogic.com



On 7/14/2015 10:01 PM, Erik Hennum wrote:
Hi, Javier:

If it's a smallish set of documents, you can write a loop that reads each 
document and applies a regex to all of the text in the document, but if it is a 
substantial corpus, you should look at enriching the documents to support 
searching for VIN numbers.
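
For the small-corpus case, the loop could be as simple as the Python sketch below. It assumes the documents have already been read into a dict of URI to text (the name documents_with_vins and that dict shape are hypothetical), and it reuses the regex from the question with the commas removed from the character classes.

```python
import re

# Assumed pattern: the question's regex without the commas in the classes.
VIN_RE = re.compile(
    r"(?<![A-Za-z0-9])"
    r"[A-HJ-NP-Z0-9]{9}[A-HJ-NPR-TV-Z0-9][A-HJ-NP-Z0-9][0-9]{6}"
    r"(?![A-Za-z0-9])",
    re.IGNORECASE,
)

def documents_with_vins(documents):
    """Yield (uri, [vins]) for each document whose text contains a VIN."""
    for uri, text in documents.items():
        vins = [m.upper() for m in VIN_RE.findall(text)]
        if vins:
            yield uri, vins

docs = {
    "/a.txt": "this is my string 2T3JK4DV1AW023473",
    "/b.txt": "nothing here",
}
for uri, vins in documents_with_vins(docs):
    print(uri, vins)
```

Every document is read and scanned on every search, which is why this only makes sense for a smallish set.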

Searching over a set of values with good performance at scale requires an 
index over the values.

To recognize the values within JSON or XML documents, the indexer looks for a 
specified JSON property or XML element or attribute.

That requires modifying the documents on or after ingestion to identify the VIN 
numbers.  (It's easiest if you can specify a unique JSON property or XML 
element or attribute, but if that's not possible, fields can support unions and 
path range indexes can support containment.)
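
To make the point about indexes concrete, here is a toy in-memory illustration in Python. It is not how MarkLogic implements its indexes; it only shows why marking up the VINs pays off: once each document carries its VINs as explicit values, a lookup goes straight from value to documents instead of rescanning text. The function name and the URI-to-VINs dict are assumptions for the example.

```python
from collections import defaultdict

def build_vin_index(enriched):
    """Map each VIN to the set of document URIs that carry it."""
    index = defaultdict(set)
    for uri, vins in enriched.items():
        for vin in vins:
            index[vin].add(uri)
    return index

# Hypothetical enriched corpus: URI -> VINs already extracted at ingest.
enriched = {
    "/claims/1.xml": ["2T3JK4DV1AW023473"],
    "/claims/2.xml": [],
    "/claims/3.xml": ["2T3JK4DV1AW023473"],
}
index = build_vin_index(enriched)
print(sorted(index["2T3JK4DV1AW023473"]))
```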

Several natural language processors try to solve this kind of enrichment 
problem.  Maybe someone on the list can recommend specific NLP tools for VIN 
recognition based on their experience.


Hoping that helps,


Erik Hennum

________________________________
From: general-boun...@developer.marklogic.com on behalf of Javier Lizarraga 
[jlizarr...@gennet.com]
Sent: Tuesday, July 14, 2015 5:21 PM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] search MarkLogic Database using Regular 
Expressions

Is there a way to issue a search using a regular expression in MarkLogic?

For example, the following regular expression identifies a VIN:
(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))

I would like to issue a query that searches the entire database and returns 
the documents that contain valid VINs.

This would be similar to fn:matches, which takes a string and a pattern and 
returns a Boolean value:
fn:matches("this is my string 2T3JK4DV1AW023473" , 
"(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))")

I’d like to do something like this:
cts:search("(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))")

Any help would be greatly appreciated!!

Javier



_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
