Re: [basex-talk] Issue with Full Text Retrieval

Michael Seiferle Mon, 11 Sep 2017 01:14:59 -0700

Hi Ron,
Hi Fabrice,

Your observation about element boundaries is right: the document is converted
to a textual representation, which by default is the concatenation of the
string values of all its nodes:


> let $doc :=
> <doc>
>   XQuery 
>   <_>and XPath</_>
>   <_>are   awesome</_>
> </doc>/data()

This turns into:
> 
>   XQuery 
>   and XPathare   awesome
>  
So:
> $doc contains text { 'XPath' }

will return false, because "XPath" and "are" are joined into the single token
"XPathare".

You have 3.5 options:

1) => as Fabrice showed, query the individual text nodes

2) use the ft:search() function to query the index directly:
http://docs.basex.org/wiki/Full-Text_Module#ft:search

> ft:search(
>   'CTGovDebug',
>   'neoplasms'
> )/.. (: get the parent element of the matching text() node :)

3) disable chopping when creating the database:
http://docs.basex.org/wiki/Options#XML_Parsing
> db:create(
>   'CTGovDebug',
>   "Path/to/NCT00473512.xml",
>   "NCT00473512.xml",
>   map {
>    'ftindex': true(),
>    'chop': false()
>   })


3.5) use the xml:space="preserve" attribute to tell the parser not to chop 
child nodes of <clinical_study/> when creating a database:
> <clinical_study xml:space="preserve">
>   <!-- This xml conforms to an XML Schema at:
>     https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
>   <required_header>
>     <download_date>ClinicalTrials.gov processed this data on August 31, 
> 2017</download_date>
>     <link_text>Link to the current ClinicalTrials.gov record.</link_text>
> 
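
For comparison, option 1 applied to the toy document from the top of this
message might look roughly like the sketch below. The variable names $doc and
$hits are only illustrative, and the results noted in the comments assume the
default full-text settings described above.

```xquery
(: Toy document from the example above; $doc and $hits are illustrative names. :)
let $doc :=
  <doc>
    XQuery
    <_>and XPath</_>
    <_>are   awesome</_>
  </doc>
(: Test each text node on its own instead of the concatenated string value. :)
let $hits := $doc//text()[. contains text { 'XPath' }]
return (
  $doc/data() contains text { 'XPath' },  (: false: the token is "XPathare" :)
  exists($hits),                          (: true: the text node "and XPath" matches :)
  $hits/..                                (: parent element of the matching text node :)
)
```

Filtering the text nodes keeps the element boundaries intact, at the cost of
not finding phrases that span several elements.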



Hope this helped shed some light :-)

Best from Konstanz
Michael
--
Michael Seiferle, BaseX GmbH, http://www.basexgmbh.de
|-- Registered office: Obere Laube 73, 78462 Konstanz
|-- Register court: Freiburg, HRB: 708285, Managing directors:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Tel: +49 7531 916 82 77

> On 11.09.2017 at 09:35, Fabrice ETANCHAUD 
> <fetanch...@pch.cerfrance.fr> wrote:
> 
> Hello Ron,
>  
> I don’t know how ft operators behave on document nodes.
> Supposing documents are converted to their data() representation, your query 
> would yield the same negative answer.
> You should consider applying ft operators on text nodes like this:
>  
> for $trial in db:open('NCT00473512')//text() (: 
> [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial[. contains text { 'neoplasms' }]
>  
> Best regards,
> Fabrice Etanchaud
>  
>  
> From: basex-talk-boun...@mailman.uni-konstanz.de 
> [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Ron Katriel
> Sent: Monday, 11 September 2017 00:42
> To: BaseX
> Subject: [basex-talk] Issue with Full Text Retrieval
>  
> Hi,
>  
> I am seeing strange behavior with Full Text retrieval. The following query 
> fails for a number of words that are in the XML document (see attached):
>  
> for $trial in db:open('CTGovDebug') (: 
> [clinical_study/id_info/nct_id='NCT00473512'] :)
> return $trial contains text { 'neoplasms' }
>  
> It fails on a good number of words including neoplasms, cougar, industry, 
> yes, completed, november, 2005, interventional, single, male, female, 
> assignment, none, research, principal, primary, secondary, age, years, 
> gender, etc. But it matches most of the words in the file.
>  
> Observation: The words that fail are located at the beginning and/or end of 
> the text and do not occur anywhere else in the middle of any text.
>  
> The document is the only one in the database. It does not make a difference 
> whether full text indexing is on or off. My BaseX version is 8.6.4.
>  
> Thanks,
> Ron
>  
>  
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions 
> <http://www.mdsol.com/>
> 350 Hudson Street, 7th Floor, New York, NY 10014
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
