Hi,

I'm hosting a local snapshot of the OpenAlex dataset in Solr (~460M records, ~1TB):
https://docs.openalex.org/download-all-data/openalex-snapshot

I'd be grateful if someone with experience in query construction could provide pointers for the questions below.

First, some context: I have been maintaining and updating a local copy of the snapshot for over two years now. It runs great and enables us to do systematic literature reviews that wouldn't otherwise be possible. For what we do, it is important to be able to run very specific boolean queries with AND, OR, NOT, NEAR, phrases, and wildcards (and nesting thereof). Unfortunately, the documentation is a little too concise for me to fully understand how to combine these features correctly. In an attempt to learn through experimentation, I ran into unexpected numbers and behaviour.

Here are the query parameters:
-------
df = title_abstract
q.op = AND
defType = lucene

Option 1) q = {!complexphrase v='<QUERY>'}
Option 2) q = <QUERY>
-------
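
For anyone who wants to reproduce the two options, the parameters above can be assembled as in the following minimal Python sketch. The endpoint URL (host, port, and core name), the `rows=0` parameter, and the helper names are assumptions added for illustration and are not part of the setup described here. The backslash-escaping of quotes inside v='...' is likewise an assumption about local-params quoting and should be checked against the Solr version in use.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/openalex/select"

# Parameters shared by both options, as listed above.
BASE_PARAMS = {"df": "title_abstract", "q.op": "AND", "defType": "lucene"}


def build_params(query: str, complexphrase: bool = False) -> dict:
    """Return request parameters for Option 1 (complexphrase) or Option 2."""
    if complexphrase:
        # Assumption: backslash-escaping protects quotes inside v='...'.
        escaped = query.replace("\\", "\\\\").replace("'", "\\'")
        q = "{!complexphrase v='" + escaped + "'}"
    else:
        q = query
    # rows=0 is an addition: only the hit count is of interest here.
    return {**BASE_PARAMS, "q": q, "rows": 0}


def build_url(query: str, complexphrase: bool = False) -> str:
    """Return the full, URL-encoded select request."""
    return SOLR_SELECT_URL + "?" + urlencode(build_params(query, complexphrase))


if __name__ == "__main__":
    print(build_url('"fossil fuel*" OR coal', complexphrase=True))
    print(build_url('"fossil fuel*" OR coal'))
```

Building the URL in one place makes it easier to run the same <QUERY> through both parsers and compare the reported counts.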

The managed schema setup for this field:
-------
...
<fieldType name="oa_text" class="solr.TextField" positionIncrementGap="100" docValues="false" multiValued="false" indexed="true" stored="true">
    <analyzer type="index">
        <tokenizer name="letter" maxTokenLen="127"/>
        <filter name="lowercase"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer name="letter" maxTokenLen="127"/>
        <filter name="lowercase"/>
    </analyzer>
</fieldType>


<field name="title_abstract" type="oa_text" large="true"/>
...
-------
Not stemming is a deliberate choice because the dataset contains multiple languages, and stemming may break technical terminology or abbreviations. For our use case, it seems better to leave all decisions to the queries (e.g. through wildcards).
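
To make the effect of this analyzer chain concrete, here is a rough Python approximation of a letter tokenizer followed by a lowercase filter. It is only a sketch: Python's `str.isalpha()` stands in for the tokenizer's notion of a letter and may differ for some Unicode characters, and the function name `analyze` is made up for this example.

```python
def analyze(text: str, max_token_len: int = 127) -> list:
    """Approximate the letter tokenizer + lowercase filter on plain text.

    Tokens are maximal runs of letters; every other character is a
    separator. A run longer than max_token_len is cut into several tokens.
    """
    tokens = []
    current = []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
            if len(current) == max_token_len:
                # Token reached the length limit: emit it and start a new one.
                tokens.append("".join(current))
                current = []
        elif current:
            # A non-letter ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [token.lower() for token in tokens]


if __name__ == "__main__":
    print(analyze("Fossil-fuel CO2 emissions"))
    print(analyze("natural gas (LNG)"))
```

Under this approximation, digits and punctuation never reach the index, so a string such as "CO2" is indexed as the single token "co".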


Here are variations of two building blocks of an overall much larger query:

1) Fossil fuels
  -> phrases as W: ( fossil W fuel* OR coal OR oil OR petroleum OR natural W gas OR LNG )
  -> phrases as W w/o wildcards: (fossil W fuel OR coal OR oil OR petroleum OR natural W gas OR LNG)
  -> phrases as quotes: ( "fossil fuel*" OR coal OR oil OR petroleum OR "natural gas" OR LNG )
  -> phrases as quotes w/o wildcards: ( "fossil fuel" OR coal OR oil OR petroleum OR "natural gas" OR LNG )
  ---
  -> standard | phrases as W: 29,059
  -> standard | phrases as W w/o wildcards: 29,059
  -> standard | phrases as quotes: 2,153,181
  -> standard | phrases as quotes w/o wildcards: 2,153,181
  -> complexphrase | phrases as W: 29,059
  -> complexphrase | phrases as W w/o wildcards: 29,059
  -> complexphrase | phrases as quotes: 2,199,207
  -> complexphrase | phrases as quotes w/o wildcards: 2,153,181


2) General climate change
  -> phrases as W: ( climat* OR global W warming OR greenhouse W effect* )
  -> phrases as W w/o wildcards: ( climat OR global W warming OR greenhouse W effect )
  -> phrases as quotes: ( climat* OR "global warming" OR "greenhouse effect*" )
  -> phrases as quotes w/o wildcards: ( climat OR "global warming" OR "greenhouse effect" )
  ---
  -> standard | phrases as W: 378,690
  -> standard | phrases as W w/o wildcards: 197,049
  -> standard | phrases as quotes: 1,937,868
  -> standard | phrases as quotes w/o wildcards: 164,361
  -> complexphrase | phrases as W: 378,690
  -> complexphrase | phrases as W w/o wildcards: 197,049
  -> complexphrase | phrases as quotes: {'msg': 'maxClauseCount is set to 10000',...
  -> complexphrase | phrases as quotes w/o wildcards: 164,361

Ideally, we'd like to reliably run queries of the form:
q = ( (A OR B) 3W ((C* 1W D?) OR E*) )

We have a few questions we could not answer after a lot of experimentation and reading. Specifically, the combination of wildcards and NEAR operators (also tested by nesting {!surround v=''} parsers) is challenging.

1) From my understanding, the maxClauseCount error is caused by the wildcard being expanded into an explicit list of ORs over all terms with that prefix in the vocabulary. Is there a way to parametrise the query builder to only use the first 20 terms in the expansion? I tried to find the relevant areas in the Solr/Lucene codebase but couldn't figure it out. I know I can increase maxClauseCount, but that requires unreasonable limits of ~1M, which makes queries needlessly huge and slow.

2) The ? wildcard stands for "any single character" and * for "none or many". Is there a wildcard for "one or none"? Also, the counts for "term?" and "term??" are the same, even though "termed" is also in the vocabulary. So "??" doesn't stand for "two of any character"?

3) In the examples above, seemingly equivalent queries produce different numbers. For example, I assume '"phrase term"' and 'phrase W term' should be the same, and they sometimes are, but not always.

4) In the example above, some seemingly different queries produce the same numbers, for example when dropping the wildcards. Sometimes dropping wildcards lowers the numbers (as expected) though.
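
The mental model behind questions 1 and 2 can be written down as a small Python sketch over a toy vocabulary. This only illustrates the understanding stated above (? as exactly one character, * as zero or more, and a wildcard expanding to an OR over matching vocabulary terms). It is not how Solr or Lucene implement expansion internally, and the `limit` argument is a hypothetical stand-in for the "first 20 terms" idea, not an existing parser option.

```python
import re
from itertools import islice


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate ? (exactly one character) and * (zero or more) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def expand(pattern: str, vocabulary, limit=None) -> list:
    """Return vocabulary terms matching the pattern, in sorted order.

    limit is a hypothetical cap on the number of expanded terms.
    """
    regex = wildcard_to_regex(pattern)
    matches = (term for term in sorted(vocabulary) if regex.fullmatch(term))
    return list(islice(matches, limit))


def as_or_clause(terms) -> str:
    """Render expanded terms as one explicit OR group."""
    return "(" + " OR ".join(terms) + ")"


# Toy vocabulary, made up for this example.
VOCAB = ["term", "termed", "terminal", "terms",
         "climate", "climates", "climatic", "climatology"]

if __name__ == "__main__":
    print(expand("term?", VOCAB))
    print(expand("term??", VOCAB))
    print(as_or_clause(expand("climat*", VOCAB, limit=2)))
```

Under this reading, "term?" and "term??" select different terms, which is why identical counts for the two are surprising. The sketch also shows why the number of OR clauses grows with the number of vocabulary terms sharing a prefix.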

This is a rather long email. I tried to provide enough context and would be very thankful if anyone could reach a hand down the rabbit hole to pull me out of it. Clearly I have misunderstood how to properly write queries and which query parser to use for what and how.

Best
Tim
