And as you probably already checked, inserting the proper
*tokenizerFactory* also expands the right synonym line:
q = (body:"Cytosolic 5'-nucleotidase II" OR body:"EC 3.1.3.5")
parsedQuery = SpanOrQuery(spanOr([body:p49902, spanNear([body:cytosol,
body:purin, body:5, body:nucleotidas], 0, true), spanNear([body:ec,
body:3.1.3.5], 0, true), spanNear([body:cytosol, body:5,
body:nucleotidas, body:ii], 0, true)])) SpanOrQuery(spanOr([body:p49902,
spanNear([body:cytosol, body:purin, body:5, body:nucleotidas], 0, true),
spanNear([body:cytosol, body:5, body:nucleotidas, body:ii], 0, true),
spanNear([body:ec, body:3.1.3.5], 0, true)]))
Best,
Andrea
On 05/09/18 16:10, Andrea Gazzarini wrote:
You're right, my answer forgot to mention the *tokenizerFactory*
parameter that you can add to the filter declaration. However, contrary
to what you assumed, the default tokenizer used for parsing the
synonyms _is not_ the tokenizer of the current analyzer
(StandardTokenizer in your example) but WhitespaceTokenizer. See [1]
for a complete description of the filter's capabilities.
So instead of switching the analyzer tokenizer you could also add a
tokenizerFactory="solr.StandardTokenizerFactory" in the synonym filter
declaration.
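A minimal sketch of that declaration, assuming the rest of the query
analyzer from the original message stays unchanged and only the
tokenizerFactory attribute is added:

```xml
<!-- Sketch: same synonym filter as in the original query analyzer,
     with the synonyms file parsed by StandardTokenizer instead of the
     default WhitespaceTokenizer -->
<filter class="solr.SynonymGraphFilterFactory"
        synonyms="synonyms.txt"
        ignoreCase="true" expand="true"
        tokenizerFactory="solr.StandardTokenizerFactory"/>
```

After a change like this the core needs to be reloaded so that the
synonyms file is parsed again.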
Best,
Andrea
[1]
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-SynonymGraphFilter
On 05/09/2018 15:58, Danilo Tomasoni wrote:
Hi Andrea,
thank you for your answer.
Regarding the second question: the StandardTokenizer should also be
applied to the phrase query, so the ' and - symbols should be removed
there too, and this should allow a match in the synonym file, shouldn't it?
With an example:
in phrase query:
"Cytosolic 5'-nucleotidase II" -> StandardTokenizer -> Cytosolic, 5,
nucleotidase, II
in synonym parsing:
...,Cytosolic 5'-nucleotidase II,... -> StandardTokenizer ->
Cytosolic, 5, nucleotidase, II
So the two graphs should match, or am I wrong?
Thank you
Danilo
On 05/09/2018 13:23, Andrea Gazzarini wrote:
Hi Danilo,
let's see if this can help you (I'm sorry for the poor debugging,
I'm reading & writing from my mobile): the first issue should have
something to do with synonym overlapping, and since I'm very curious
about what is happening, I will be more precise when I am in
front of a laptop.
The second: I guess the main problem is the StandardTokenizer, which
removes the ' and - symbols. That should be the reason why you don't
have any synonym detection. You should replace it with a
WhitespaceTokenizer but be aware that, if you do that, the
apostrophe in the document ( ′ ) is not the same symbol ( ' ) you've
used in the query and in the synonyms file, so you need to replace
it somewhere (in the document and/or in the query), otherwise you
won't have any match.
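As a rough sketch of that replacement (an assumption, not something
tested in this thread): a solr.PatternReplaceCharFilterFactory placed
before the tokenizer could map the prime ( ′ ) and the right single
quotation mark ( ’ ) found in the documents to the plain apostrophe.
The character class is only illustrative and would need to cover
whatever variants occur in your data.

```xml
<analyzer type="index">
  <!-- Assumption: normalise ′ (U+2032) and ’ (U+2019) to ' before tokenizing -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[′’]" replacement="'"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- remaining index-time filters unchanged -->
</analyzer>
```

The same charFilter could be added to the query analyzer if the
queries may contain those characters too.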
HTH
Gazza
On 05/09/2018 12:19, Danilo Tomasoni wrote:
Hello to all,
I have an issue with SynonymGraphFilter expanding the wrong
synonyms for a phrase term at query time.
I have a dictionary with the following lines
P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic
5'-nucleotidase II
A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\,
acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to
Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\,
mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid
3,cytosolic,GBA3\, mRNA
and two documents
{"body":"8. The method of claim 6 wherein said method inhibits at
least one 5′-nucleotidase chosen from cytosolic 5′-nucleotidase II
(cN-II), cytosolic 5′-nucleotidase IA (cN-IA), cytosolic
5′-nucleotidase IB (cN-IB), cytosolic 5′-nucleotidase IMA
(cN-IIIA), cytosolic 5′-nucleotidase NIB (cN-IIIB),
ecto-5′-nucleotidase (eN, CD73), cytosolic 5′(3′)-deoxynucleotidase
(cdN) and mitochondrial 5′(3′)-deoxynucleotidase (mdN)."}
{"body":"Trichomonosis caused by the flagellate protozoan
Trichomonas vaginalis represents the most prevalent nonviral
sexually transmitted disease worldwide (WHO-DRHR 2012). In women,
the symptoms are cyclic and often worsen around the menstruation
period. In men, trichomonosis is largely asymptomatic and these men
are considered to be carriers of T. vaginalis (Petrin et al. 1998).
This infection has been associated with birth outcomes (Klebanoff
et al. 2001), infertility (Grodstein et al. 1993), cervical and
prostate cancer (Viikki et al. 2000, Sutcliffe et al. 2012) and
pelvic inflammatory disease (Cherpes et al. 2006). Importantly, T.
vaginalis is a co-factor in human immunodeficiency virus
transmission and acquisition (Sorvillo et al. 2001, Van Der Pol et
al. 2008). Therefore, it is important to study the host-parasite
relationship to understand T. vaginalis infection and pathogenesis.
Colonisation of the mucosa by T. vaginalis is a complex multi-step
process that involves distinct mechanisms (Alderete et al. 2004).
The parasite interacts with mucin (Lehker & Sweeney 1999), adheres
to vaginal epithelial cells (VECs) in a process mediated by
adhesion proteins (AP120, AP65, AP51, AP33 and AP23) and undergoes
dramatic morphological changes from a pyriform to an amoeboid form
(Engbring & Alderete 1998, Kucknoor et al. 2005, Moreno-Brito et
al. 2005). After adhesion to VECs, the synthesis and gene
expression of adhesins are increased (Kucknoor et al. 2005). These
mechanisms must be tightly regulated and iron plays a pivotal role
in this regulation. Iron is an essential element for all living
organisms, from the most primitive to the most complex, as a
component of haeme, iron-sulphur clusters and a variety of
proteins. Iron is known to contribute to biological functions such
as DNA and RNA synthesis, oxygen transport and metabolic reactions.
T. vaginalis has developed multiple iron uptake systems such as
receptors for hololactoferrin, haemoglobin (HB), haemin (HM) and
haeme binding as well as adhesins to erythrocytes and epithelial
cells (Moreno-Brito et al. 2005, Ardalan et al. 2009). Iron plays a
crucial role in the pathogenesis of trichomonosis by increasing
cytoadherence and modulating resistance to complement lyses,
ligation to the extracellular matrix and the expression of
proteases (Figueroa-Angulo et al. 2012). In agreement with this
role, the symptoms of trichomonosis worsen after menstruation. In
addition, iron also influences nucleotide hydrolysis in T.
vaginalis (Tasca et al. 2005, de Jesus et al. 2006). The
extracellular concentrations of ATP and adenosine can markedly
increase under several conditions such as inflammation and hypoxia
as well as in the presence of pathogens (Robson et al. 2006, Sansom
2012). In the extracellular medium, these nucleotides can act as
immunomodulators by triggering immunological effects. Extracellular
ATP acts as a proinflammatory immune-mediator by triggering
multiple immunological effects on cell types such as neutrophils,
macrophages, dendritic cells and lymphocytes (Bours et al. 2006).
In this sense, ATP and adenosine concentrations in the
extracellular compartment are controlled by ectoenzymes, including
those of the nucleoside triphosphate diphosphohydrolase (NTPDase)
(EC: 3.1.4.1) family, which hydrolyze tri and diphosphates and
ecto-5’-nucleotidase (EC: 3.1.3.5), which hydrolyses monophosphates
(Zimmermann 2001). Considering that de novo nucleotide synthesis is
absent in T. vaginalis (Heyworth et al. 1982, 1984), this enzyme
cascade is important as a source of the precursor adenosine for
purine synthesis in the parasite (Munagala & Wang 2003).
Extracellular nucleotide metabolism has been characterised in
several parasite species such as Toxoplasma gondii, Schistosoma
mansoni, Leishmania spp, Trypanosoma cruzi, Acanthamoeba, Entamoeba
histolytica, Giardia lamblia and fungi, Saccharomyces cerevisiae,
Cryptococcus neoformans, Candida parapsilosis and Candida albicans
(Sansom 2012). In T. vaginalis , NTPDase and ecto-5’-nucleotidase
activities have been characterised and they are involved in
host-parasite interactions by controlling ATP and adenosine levels
(Matos et al. 2001, d, de Jesus et al. 2002, Tasca et al. 2003).
Considering that (i) iron plays a crucial role in the pathogenesis
of trichomonosis, (ii) ATP exerts a proinflammatory effect in
inflammation, (iii) adenosine is important to T. vaginalis growth
and acts as an antiinflammatory factor (Frasson et al. 2012) and
(iv) ectonucleotidases modulate the nucleotide levels at infection
sites (such as those observed in trichomonosis), the aim of this
study was to investigate the effect of iron on the extracellular
nucleotide hydrolysis and gene expression of T . vaginalis."}
The body field has the type "text_en", configured in this way
<fieldType name="text_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
The two dictionary lines are in the file "synonyms.txt".
If, in a Solr instance configured this way with those documents,
I run the following query
(body:"Cytosolic 5'-nucleotidase II" OR body:"EC 3.1.3.5")
both documents are returned.
Surprisingly, if I run the query
(body:"Cytosolic 5'-nucleotidase II")
the second one is not returned.
If I set debugQuery=true I see that the second line is expanded
A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\,
acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to
Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\,
mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid
3,cytosolic,GBA3\, mRNA
instead of the first
P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic
5'-nucleotidase II
The parsed query (given by debugquery) is
"parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1,
spanNear([body:glucosidase,, body:beta,, body:acid, body:3],
0,true), spanNear([body:cytosolic,, body:isoform, body:cra_b],
0,true), spanNear([body:cdna, body:flj78196,, body:highli,
body:similar, body:to, body:homo, body:sapien, body:glucosidase,,
body:beta,, body:acid, body:3], 0,true), body:cytosol,
spanNear([body:gba3,, body:mrna], 0,true), spanNear([body:cdna,,
body:flj93688,, body:homo, body:sapien, body:glucosidase,,
body:beta,, body:acid, body:3], 0,true), body:cytosol]), body:5,
body:nucleotidas, body:ii], 0,true))
If I remove the second line, no synonym is expanded
"parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas
ii\")",
I think this is related to the word "cytosolic" that appears as a
synonym in the second line. If I remove cytosolic as a synonym
from the second line, then again no synonym is expanded.
Can you tell me why this happens? I thought that the first line
should be expanded since it has a multi-word synonym in it that
exactly matches the phrase query.
Thank you