Hi Ron, hi Fabrice,

Your observation about element boundaries is right: the document is converted to a textual representation. By default, this returns the string representation of all its nodes:

    let $doc :=
      <doc>
        XQuery
        <_>and XPath</_>
        <_>are awesome</_>
      </doc>
    return $doc/data()

will turn into:

    XQuery
    and XPathare awesome

So:

    $doc contains text { 'XPath' }

will return false, because 'XPath' and 'are' end up fused into the single token 'XPathare'.

You have 3.5 options:

1) As Fabrice showed, query the individual text nodes.

2) Use the ft:search() function to query the index directly,
   http://docs.basex.org/wiki/Full-Text_Module#ft:search

    ft:search(
      'CTGovDebug',
      'neoplasms'
    )/.. (: get the parent element of the matching text() node :)

3) Disable chopping when creating the database,
   http://docs.basex.org/wiki/Options#XML_Parsing

    db:create(
      'CTGovDebug',
      "Path/to/NCT00473512.xml",
      "NCT00473512.xml",
      map {
        'ftindex': true(),
        'chop': false()
      })

3.5) Use the xml:space="preserve" attribute to tell the parser not to chop the child nodes of <clinical_study/> when creating a database:

    <clinical_study xml:space="preserve">
      <!-- This xml conforms to an XML Schema at:
        https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
      <required_header>
        <download_date>ClinicalTrials.gov processed this data on August 31, 2017</download_date>
        <link_text>Link to the current ClinicalTrials.gov record.</link_text>

Hope this helped shed some light :-)

Best from Konstanz
Michael

--
Michael Seiferle, BaseX GmbH, http://www.basexgmbh.de
|-- Registered office: Obere Laube 73, 78462 Konstanz
|-- Register court: Freiburg, HRB: 708285, Managing directors:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Tel: +49 7531 916 82 77

> On 11.09.2017 at 09:35, Fabrice ETANCHAUD <fetanch...@pch.cerfrance.fr> wrote:
>
> Hello Ron,
>
> I don't know how ft operators behave on document nodes.
> Supposing documents are converted to their data() representation, your query
> would yield the same negative answer.
> You should consider applying ft operators to text nodes, like this:
>
> for $trial in db:open('NCT00473512')//text()
>   (: [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial[. contains text { 'neoplasms' }]
>
> Best regards,
> Fabrice Etanchaud
>
>
> From: basex-talk-boun...@mailman.uni-konstanz.de
> [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Ron Katriel
> Sent: Monday, 11 September 2017 00:42
> To: BaseX
> Subject: [basex-talk] Issue with Full Text Retrieval
>
> Hi,
>
> I am seeing strange behavior with Full Text retrieval. The following query
> fails for a number of words that are in the XML document (see attached):
>
> for $trial in db:open('CTGovDebug')
>   (: [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial contains text { 'neoplasms' }
>
> It fails on a good number of words, including neoplasms, cougar, industry,
> yes, completed, november, 2005, interventional, single, male, female,
> assignment, none, research, principal, primary, secondary, age, years,
> gender, etc. But it matches most of the words in the file.
>
> Observation: The words that fail are located at the beginning and/or end of
> a text and do not occur anywhere in the middle of any text.
>
> The document is the only one in the database. It does not make a difference
> whether full-text indexing is on or off. My BaseX version is 8.6.4.
>
> Thanks,
> Ron
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> <http://www.mdsol.com/>
> 350 Hudson Street, 7th Floor, New York, NY 10014
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800