Re: REINDEXCOLLECTION not working on an alias

2020-05-20 Thread Bjarke Buur Mortensen
OK, that makes sense.
Looking forward to that fix, thanks for the reply.

On Tue, May 19, 2020 at 17:21, Joel Bernstein wrote:

> I believe the issue is that under the covers this feature is using the
> "topic" streaming expressions which it was just reported doesn't work with
> aliases. This is something that will get fixed, but for the current release
> there isn't a workaround for this issue.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, May 19, 2020 at 8:25 AM Bjarke Buur Mortensen <
> morten...@eluence.com>
> wrote:
>
> > Hi list,
> >
> > I seem to be unable to get REINDEXCOLLECTION to work on a collection
> alias
> > (running Solr 8.2.0). The documentation seems to state that that should
> be
> > possible:
> >
> >
> https://lucene.apache.org/solr/guide/8_2/collection-management.html#reindexcollection
> > "name
> > Source collection name, may be an alias. This parameter is required."
> >
> > If I run on my alias (qa_supplier_products):
> > curl "http://localhost:8983/solr/admin/collections?action=REINDEXCOLLECTION&name=qa_supplier_products&numShards=1&cmd=start"
> > I get an error:
> > "org.apache.solr.common.SolrException: Unable to copy documents from
> > qa_supplier_products to .rx_qa_supplier_products_6:
> > {\"result-set\":{\"docs\":[\n
> >  {\"DaemonOp\":\"Deamon:.rx_qa_supplier_products_6 started on
> > .rx_qa_supplier_products_0_shard1_replica_n1\"
> >
> > If I instead point to the underlying collection, everything works fine.
> Now
> > I have an alias pointing to an alias, which works, but ideally I would
> like
> > to just have my main alias point to the newly reindexed collection.
> >
> > Can anybody help me out here?
> >
> > Thanks,
> > /Bjarke
> >
>


REINDEXCOLLECTION not working on an alias

2020-05-19 Thread Bjarke Buur Mortensen
Hi list,

I seem to be unable to get REINDEXCOLLECTION to work on a collection alias
(running Solr 8.2.0). The documentation seems to state that that should be
possible:
https://lucene.apache.org/solr/guide/8_2/collection-management.html#reindexcollection
"name
Source collection name, may be an alias. This parameter is required."

If I run on my alias (qa_supplier_products):
curl "http://localhost:8983/solr/admin/collections?action=REINDEXCOLLECTION&name=qa_supplier_products&numShards=1&cmd=start"
I get an error:
"org.apache.solr.common.SolrException: Unable to copy documents from
qa_supplier_products to .rx_qa_supplier_products_6:
{\"result-set\":{\"docs\":[\n
 {\"DaemonOp\":\"Deamon:.rx_qa_supplier_products_6 started on
.rx_qa_supplier_products_0_shard1_replica_n1\"

If I instead point to the underlying collection, everything works fine. Now
I have an alias pointing to an alias, which works, but ideally I would like
to just have my main alias point to the newly reindexed collection.
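For reference, a sketch of that workaround (hedged; it assumes the main
alias currently points at a concrete collection named
qa_supplier_products_v1, and all names here are illustrative):

curl "http://localhost:8983/solr/admin/collections?action=REINDEXCOLLECTION&name=qa_supplier_products_v1&numShards=1&cmd=start"
# REINDEXCOLLECTION reindexes into a new .rx_* target collection; once it
# reports finished, re-point the main alias at that target (CREATEALIAS
# overwrites an existing alias), avoiding the alias-on-alias chain:
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=qa_supplier_products&collections=.rx_qa_supplier_products_v1_1"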

Can anybody help me out here?

Thanks,
/Bjarke


Re: Reindexing using dataimporthandler

2020-04-27 Thread Bjarke Buur Mortensen
Wow, thanks, Erick. That's actually much better :-)
You live and you learn.

Cheers,
Bjarke

On Mon, Apr 27, 2020 at 15:00, Erick Erickson <erickerick...@gmail.com> wrote:

> What about the Collections API REINDEXCOLLECTION? That has the
> advantage of being something officially supported, puts the source
> collection into read-only mode, uses a much more efficient query
> process (streaming actually) etc.
>
> It has the disadvantage of producing a new collection under the
> covers and aliasing to it. But you can always rename the collection
> later.
>
> Best,
> Erick
>
> > On Apr 27, 2020, at 8:23 AM, Bjarke Buur Mortensen <
> morten...@eluence.com> wrote:
> >
> > Thanks for the reply,
> > I'm on solr 8.2 so cursorMark is there.
> >
> > Doing this from one collection to another collection, and then using a
> > collection alias, is probably the way to go, but actually my suggestion
> > was a little more bold:
> >
> > I'm indexing on top of the same core, i.e from
> > http://localhost:8983/solr/mycollection to
> > http://localhost:8983/solr/mycollection
> >
> > (This is why I suggested adding a version:[* TO ...] to ensure it
> > terminates for large imports.)
> >
> > With this in mind, are you still thinking this is a safe approach?
> >
> > Thanks,
> > Bjarke
> >
> >
> > On Mon, Apr 27, 2020 at 13:46, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Bjarke,
> >> I don’t see a problem with that approach if you have enough resources to
> >> handle both cores at the same time, especially if you are doing that while
> >> serving production queries. The only issue is that if you plan to do that
> >> then you have to have all fields stored. Also note that cursorMark support
> >> was added a bit later to entity processor, so if you are running a bit
> >> older version of Solr, you might not have cursors - I’ve found it the hard
> >> way.
> >>
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 27 Apr 2020, at 13:11, Bjarke Buur Mortensen <morten...@eluence.com> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> Let's say I add a copyField to my solr schema, or change the analysis
> >> chain
> >>> of a field or some other change.
> >>> It seems to me to be an alluring choice to use a very simple
> >>> dataimporthandler to reindex all documents, by using a
> >> SolrEntityProcessor
> >>> that points to itself. I have just done this for a very small
> collection,
> >>> but I was wondering what the caveats are, since this is not the
> >> recommended
> >>> practice. What can go wrong using this approach?
> >>>
> >>> <dataConfig><document><entity processor="SolrEntityProcessor"
> >>>   url="http://localhost:8983/solr/mycollection" qt="lucene" query="*:*"
> >>>   wt="javabin" rows="1000" cursorMark="true" sort="id asc"
> >>>   fl="*,orig_version_l:_version_"/></document></dataConfig>
> >>>
> >>> PS: (It is probably necessary to add a version:[* TO ...] to ensure it
> >>> terminates for large imports)
> >>> PPS: (Obviously you shouldn't add the clean parameter)
> >>>
> >>> /Bjarke
> >>
> >>
>
>


Re: Reindexing using dataimporthandler

2020-04-27 Thread Bjarke Buur Mortensen
Thanks for the reply,
I'm on solr 8.2 so cursorMark is there.

Doing this from one collection to another collection, and then using a
collection alias, is probably the way to go, but actually my suggestion
was a little more bold:

I'm indexing on top of the same core, i.e from
http://localhost:8983/solr/mycollection to
http://localhost:8983/solr/mycollection

(This is why I suggested adding a version:[* TO ...]
to ensure it terminates for large imports.)

With this in mind, are you still thinking this is a safe approach?

Thanks,
Bjarke
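(A quick, hedged way to verify Emir's all-fields-stored precondition quoted
below: the Schema API lists every field definition, so any stored="false"
field stands out; the collection name is illustrative.)

curl "http://localhost:8983/solr/mycollection/schema/fields"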


On Mon, Apr 27, 2020 at 13:46, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> Hi Bjarke,
> I don’t see a problem with that approach if you have enough resources to
> handle both cores at the same time, especially if you are doing that while
> serving production queries. The only issue is that if you plan to do that
> then you have to have all fields stored. Also note that cursorMark support
> was added a bit later to entity processor, so if you are running a bit
> older version of Solr, you might not have cursors - I’ve found it the hard
> way.
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 27 Apr 2020, at 13:11, Bjarke Buur Mortensen 
> wrote:
> >
> > Hi list,
> >
> > Let's say I add a copyField to my solr schema, or change the analysis
> chain
> > of a field or some other change.
> > It seems to me to be an alluring choice to use a very simple
> > dataimporthandler to reindex all documents, by using a
> SolrEntityProcessor
> > that points to itself. I have just done this for a very small collection,
> > but I was wondering what the caveats are, since this is not the
> recommended
> > practice. What can go wrong using this approach?
> >
> > <dataConfig><document><entity processor="SolrEntityProcessor"
> >   url="http://localhost:8983/solr/mycollection" qt="lucene" query="*:*"
> >   wt="javabin" rows="1000" cursorMark="true" sort="id asc"
> >   fl="*,orig_version_l:_version_"/></document></dataConfig>
> >
> > PS: (It is probably necessary to add a version:[* TO ...] to ensure it
> > terminates for large imports)
> > PPS: (Obviously you shouldn't add the clean parameter)
> >
> > /Bjarke
>
>


Reindexing using dataimporthandler

2020-04-27 Thread Bjarke Buur Mortensen
Hi list,

Let's say I add a copyField to my solr schema, or change the analysis chain
of a field or some other change.
It seems to me to be an alluring choice to use a very simple
dataimporthandler to reindex all documents, by using a SolrEntityProcessor
that points to itself. I have just done this for a very small collection,
but I was wondering what the caveats are, since this is not the recommended
practice. What can go wrong using this approach?

<dataConfig><document><entity processor="SolrEntityProcessor"
  url="http://localhost:8983/solr/mycollection" qt="lucene" query="*:*"
  wt="javabin" rows="1000" cursorMark="true" sort="id asc"
  fl="*,orig_version_l:_version_"/></document></dataConfig>

PS: (It is probably necessary to add a version:[* TO ...] to ensure it
terminates for large imports)
PPS: (Obviously you shouldn't add the clean parameter)
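A minimal sketch of kicking off such an import, assuming the handler above
is registered as /dataimport; clean=false matters, since a cleaning import
into the same collection would delete the very documents it is reading:

curl "http://localhost:8983/solr/mycollection/dataimport?command=full-import&clean=false&commit=true"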

/Bjarke


Re: Query-time synonyms without indexing

2019-08-29 Thread Bjarke Buur Mortensen
The <analyzer> section without type is the one getting picked up for the
index-time chain, so that wasn't my problem.

It turns out that because of
https://issues.apache.org/jira/browse/LUCENE-8134, I needed to add
an omitTermFreqAndPositions="true" to the <fieldType> declaration.
This has to do with the defaults for a string field being different from a
text field, and in Solr 8+ indexing fails because of the above ticket.
Adding omitTermFreqAndPositions="true" ensures that the index field type and
the schema field type agree on the settings, as I understand it.
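A sketch of the same fix through the Schema API (hedged; the type name and
analyzer details are illustrative, not taken from this thread):

curl -X POST -H 'Content-type:application/json' \
  "http://localhost:8983/solr/mycollection/schema" -d '{
  "replace-field-type": {
    "name": "string_with_query_synonyms",
    "class": "solr.TextField",
    "omitTermFreqAndPositions": true,
    "indexAnalyzer": { "tokenizer": { "class": "solr.KeywordTokenizerFactory" } },
    "queryAnalyzer": {
      "tokenizer": { "class": "solr.KeywordTokenizerFactory" },
      "filters": [ { "class": "solr.SynonymGraphFilterFactory",
                     "synonyms": "country-synonyms.txt",
                     "ignoreCase": "false", "expand": "true" } ]
    }
  }
}'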

Regards,
Bjarke



On Wed, Aug 28, 2019 at 13:26, Erick Erickson <erickerick...@gmail.com> wrote:

> Not sure. You have an
> <analyzer>
> section and an
> <analyzer type="query">
> section. Frankly I’m not sure which one will be used for the index-time
> chain.
>
> Why don’t you just try it?
> change
> <analyzer>
> to
> <analyzer type="index">
>
> reload and go. It’d take you 5 minutes and you’d have your answer.
>
> Best,
> Erick
>
>
> > On Aug 28, 2019, at 1:57 AM, Bjarke Buur Mortensen <
> morten...@eluence.com> wrote:
> >
> > Yes, but isn't that what I am already doing in this case (look at the
> > fieldType in the original mail)?
> > Is there some other way to specify that field type and achieve what I
> want?
> >
> > Thanks,
> > Bjarke
> >
> > On Tue, Aug 27, 2019, 17:32 Erick Erickson 
> wrote:
> >
> >> You can have separate index and query time analysis chains, there are
> many
> >> examples in the stock Solr schemas.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Aug 27, 2019, at 8:48 AM, Bjarke Buur Mortensen <
> >> morten...@eluence.com> wrote:
> >>>
> >>> We have a solr file of type "string".
> >>> It turns out that we need to do synonym expansion on query time in
> order
> >> to
> >>> account for some changes over time in the values stored in that field.
> >>>
> >>> So we have tried introducing a custom fieldType that applies the
> synonym
> >>> filter at query time only (see bottom of mail), but that requires us to
> >>> change the field. But now, when we index new documents, Solr complains:
> >>> 400 Bad Request
> >>> Error: 'Exception writing document id someid to the index; possible
> >>> analysis error: cannot change field "auth_country_code" from index
> >>> options=DOCS to inconsistent index
> options=DOCS_AND_FREQS_AND_POSITIONS',
> >>>
> >>> Since we are only making query time changes, I would really like to not
> >>> have to reindex our entire collection. Is that possible somehow?
> >>>
> >>> Thanks,
> >>> Bjarke
> >>>
> >>>
> >>> <fieldType name="..." class="solr.TextField"
> >>>     sortMissingLast="true" positionIncrementGap="100">
> >>>   <analyzer>
> >>>     <tokenizer class="..."/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="..."/>
> >>>     <filter class="solr.SynonymGraphFilterFactory"
> >>>         synonyms="country-synonyms.txt" ignoreCase="false" expand="true"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>
> >>
>
>


Re: Query-time synonyms without indexing

2019-08-27 Thread Bjarke Buur Mortensen
Yes, but isn't that what I am already doing in this case (look at the
fieldType in the original mail)?
Is there some other way to specify that field type and achieve what I want?

Thanks,
Bjarke

On Tue, Aug 27, 2019, 17:32 Erick Erickson  wrote:

> You can have separate index and query time analysis chains, there are many
> examples in the stock Solr schemas.
>
> Best,
> Erick
>
> > On Aug 27, 2019, at 8:48 AM, Bjarke Buur Mortensen <
> morten...@eluence.com> wrote:
> >
> > We have a solr file of type "string".
> > It turns out that we need to do synonym expansion on query time in order
> to
> > account for some changes over time in the values stored in that field.
> >
> > So we have tried introducing a custom fieldType that applies the synonym
> > filter at query time only (see bottom of mail), but that requires us to
> > change the field. But now, when we index new documents, Solr complains:
> > 400 Bad Request
> > Error: 'Exception writing document id someid to the index; possible
> > analysis error: cannot change field "auth_country_code" from index
> > options=DOCS to inconsistent index options=DOCS_AND_FREQS_AND_POSITIONS',
> >
> > Since we are only making query time changes, I would really like to not
> > have to reindex our entire collection. Is that possible somehow?
> >
> > Thanks,
> > Bjarke
> >
> >
> > <fieldType name="..." class="solr.TextField"
> >     sortMissingLast="true" positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="..."/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="..."/>
> >     <filter class="solr.SynonymGraphFilterFactory"
> >         synonyms="country-synonyms.txt" ignoreCase="false" expand="true"/>
> >   </analyzer>
> > </fieldType>
>
>


Query-time synonyms without indexing

2019-08-27 Thread Bjarke Buur Mortensen
We have a solr file of type "string".
It turns out that we need to do synonym expansion on query time in order to
account for some changes over time in the values stored in that field.

So we have tried introducing a custom fieldType that applies the synonym
filter at query time only (see bottom of mail), but that requires us to
change the field. But now, when we index new documents, Solr complains:
400 Bad Request
Error: 'Exception writing document id someid to the index; possible
analysis error: cannot change field "auth_country_code" from index
options=DOCS to inconsistent index options=DOCS_AND_FREQS_AND_POSITIONS',

Since we are only making query time changes, I would really like to not
have to reindex our entire collection. Is that possible somehow?

Thanks,
Bjarke


<fieldType name="..." class="solr.TextField"
    sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="..."/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="..."/>
    <filter class="solr.SynonymGraphFilterFactory"
        synonyms="country-synonyms.txt" ignoreCase="false" expand="true"/>
  </analyzer>
</fieldType>


Solr cloud collection restore silently fails for two shards

2019-07-01 Thread Bjarke Buur Mortensen
Hi list,

we have a Solr Cloud setup with a collection with 4 shards.
We backup this collection once a day.

Each night, we try to restore the latest backup on a test server.
So we restore all shards to the same machine. Upon restore, the solr logs
prints the following:
solr.log.3:25163:java.nio.file.NoSuchFileException:
/tmp/solr_restore_2019-06-30T21_47_53_380/solr.procurement_full.snapshot_tmp/snapshot.shard3/_hx_Lucene50_0.tim
solr.log.3:25190:java.nio.file.NoSuchFileException:
/tmp/solr_restore_2019-06-30T21_47_53_380/solr.procurement_full.snapshot_tmp/snapshot.shard4/_wc_Lucene50_0.tim

When Solr has loaded, these two shards are empty.
Looking at the cores, shard3 and shard4 has index directory set to the
generic data/index. e.g
/var/solr/data/procurement_shard3_replica_n3/data/index
whereas shard1 and shard2 correctly points to data/restore.xxx, e.g.
/var/solr/data/procurement_shard1_replica_n7/data/restore.20190701015508919

We monitor the restore process by polling
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=procurement_restore
The restore action never enters a failed state, and the last state is "completed":

{
  "responseHeader":{
"status":0,
"QTime":540},
  "status":{
"state":"completed",
"msg":"found [procurement_restore] in completed tasks"}}

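Since REQUESTSTATUS reports "completed" even though two shards came up
empty, one crude cross-check (hedged; collection name as above) is to
compare per-shard document counts against the source collection:

curl "http://localhost:8983/solr/procurement/select?q=*:*&rows=0&shards.info=true"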

It might well be that there is some error in the backup that we are trying
to restore (although the backup finishes without error), which is causing
this problem, but I would expect the restore process to end up in a failed
state, instead of a state with two functional shards and two empty ones.

Can anybody help me out here?

Thanks,
/Bjarke


Re: Solr 8.0.0 error: cannot change field from index options=DOCS to inconsistent index options=DOCS_AND_FREQS_AND_POSITIONS

2019-05-13 Thread Bjarke Buur Mortensen
OK, so the problem seems to come from
https://issues.apache.org/jira/browse/LUCENE-8134
Our field used to be type="string", but we have since changed it to a text
type to be able to use synonyms (see below).

So we'll still have some documents that were indexed as "string". Am I
right in assuming that we need to reindex in order to upgrade to 8.0.0?

Thanks,
Bjarke

<fieldType name="..." class="solr.TextField">
  <analyzer>
    <tokenizer class="..."/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="..."/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="..."/>
  </analyzer>
</fieldType>


On Fri, May 10, 2019 at 22:38, Erick Erickson <erickerick...@gmail.com> wrote:

> I suspect that perhaps some defaults have changed? So I’d try changing the
> definition in the schema for that field. These changes should be pointed
> out in the upgrade notes in Lucene or Solr CHANGES.txt.
>
> Best,
> Erick
>
> > On May 10, 2019, at 1:17 AM, Bjarke Buur Mortensen <
> morten...@eluence.com> wrote:
> >
> > Hi list,
> >
> > I'm trying to open a 7.x core in Solr 8.
> > I'm getting the error:
> >
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> > Error opening new searcher
> >
> > Digging further in the logs, I see the error:
> > "
> > ...
> > Caused by: java.lang.IllegalArgumentException: cannot change field
> > "delivery_place_code" from index options=DOCS to inconsistent index
> > options=DOCS_AND_FREQS_AND_POSITIONS
> > ...
> > "
> >
> > Is this a known issue when upgrading to 8.0.0?
> > Can I do anything to avoid it?
> >
> > Thanks,
> > Bjarke
>
>


Solr 8.0.0 error: cannot change field from index options=DOCS to inconsistent index options=DOCS_AND_FREQS_AND_POSITIONS

2019-05-10 Thread Bjarke Buur Mortensen
Hi list,

I'm trying to open a 7.x core in Solr 8.
I'm getting the error:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Error opening new searcher

Digging further in the logs, I see the error:
"
...
Caused by: java.lang.IllegalArgumentException: cannot change field
"delivery_place_code" from index options=DOCS to inconsistent index
options=DOCS_AND_FREQS_AND_POSITIONS
...
"

Is this a known issue when upgrading to 8.0.0?
Can I do anything to avoid it?
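(A hedged way to see what you are dealing with before upgrading: the Luke
handler reports the Lucene version and segment details of the current core;
the core name is illustrative.)

curl "http://localhost:8983/solr/mycore/admin/luke?show=index&numTerms=0"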

Thanks,
Bjarke


Re: Recipe for moving to solr cloud without reindexing

2018-08-08 Thread Bjarke Buur Mortensen
OK, thanks.

As long as it's my dev box, reindexing is fine.
I just hope that my assumption holds, that our prod solr is 7x segments
only.

Thanks again,
Bjarke

2018-08-08 20:03 GMT+02:00 Erick Erickson :

> Bjarke:
>
> Using SPLITSHARD on an index with 6x segments just seems to not work,
> even outside the standalone-> cloud issue. I'll raise a JIRA.
> Meanwhile I think you'll have to re-index I'm afraid.
>
> Thanks for raising the issue.
>
> Erick
>
> On Wed, Aug 8, 2018 at 6:34 AM, Bjarke Buur Mortensen
>  wrote:
> > Erick,
> >
> > thanks, that is of course something I left out of the original question.
> > Our Solr is 7.1, so that should not present a problem (crossing fingers).
> >
> > However, on my dev box I'm trying out the steps, and here I have some
> > segments created with version 6 of Solr.
> >
> > After having copied data from my non-cloud solr into my
> > single-shard-single-replica collection and verified that Solr Cloud works
> > with this collection, I then submit the splitshard command
> >
> > http://172.17.0.4:8984/solr/admin/collections?action=SPLITSHARD&collection=procurement&shard=shard1
> >
> > However, this gives me the error:
> > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> > from server at http://172.17.0.4:8984/solr:
> > java.lang.IllegalArgumentException: Cannot merge a segment that has been
> > created with major version 6 into this index which has been created by
> > major version 7"}
> >
> > I have tried running both optimize and IndexUpgrader on the index before
> > shard splitting, but the same error still occurs.
> >
> > Any ideas as to why this happens?
> >
> > Below is an output from running IndexUpgrader, which I cannot decipher.
> > It both states that "All segments upgraded to version 7.1.0" and "all
> > running merges have aborted" ¯\_(ツ)_/¯
> >
> > Thanks a lot,
> > Bjarke
> >
> >
> > ==
> > java -cp /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-backward-codecs-7.1.0.jar:/opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-7.1.0.jar \
> >     org.apache.lucene.index.IndexUpgrader -delete-prior-commits -verbose \
> >     /var/solr/cloud/procurement_shard1_replica_n1/data/index
> > IFD 0 [2018-08-08T13:00:18.244Z; main]: init: current segments file is "segments_4vs";
> > deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@721e0f4f
> > IFD 0 [2018-08-08T13:00:18.266Z; main]: init: load commit "segments_4vs"
> > IFD 0 [2018-08-08T13:00:18.270Z; main]: now checkpoint
> > "_bhg(7.1.0):C108396" [1 segments ; isCommit = false]
> > IFD 0 [2018-08-08T13:00:18.270Z; main]: 0 msec to checkpoint
> > IW 0 [2018-08-08T13:00:18.270Z; main]: init: create=false
> > IW 0 [2018-08-08T13:00:18.273Z; main]:
> > dir=MMapDirectory@/var/solr/cloud/procurement_shard1_replica_n1/data/index
> > lockFactory=org.apache.lucene.store.NativeFSLockFactory@6debcae2
> > index=_bhg(7.1.0):C108396
> > version=7.1.0
> > analyzer=null
> > ramBufferSizeMB=16.0
> > maxBufferedDocs=-1
> > mergedSegmentWarmer=null
> > delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
> > commit=null
> > openMode=CREATE_OR_APPEND
> > similarity=org.apache.lucene.search.similarities.BM25Similarity
> > mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=-1,
> > maxMergeCount=-1, ioThrottle=true
> > codec=Lucene70
> > infoStream=org.apache.lucene.util.PrintStreamInfoStream
> > mergePolicy=UpgradeIndexMergePolicy([TieredMergePolicy: maxMergeAtOnce=10,
> > maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0,
> > forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0,
> > maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.1)
> > indexerThreadPool=org.apache.lucene.index.DocumentsWriterPerThreadPool@5ba23b66
> > readerPooling=true
> > perThreadHardLimitMB=1945
> > useCompoundFile=true
> > commitOnClose=true
> > indexSort=null
> > writer=org.apache.lucene.index.IndexWriter@2ff4f00f
> >
> > IW 0 [2018-08-08T13:00:18.273Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
> > IndexUpgrader 0 [2018-08-08T13:00:18.274Z; main]: Upgrading all pre-7.1.0
> > segments of index directory
> > 'MMapDirectory@/var/solr/cloud/procurement_shard1_replica_n1/data/index
> > lockFactory=org.apache.lucene.store.NativeFSLockFactory@6debcae2' to

Re: Recipe for moving to solr cloud without reindexing

2018-08-08 Thread Bjarke Buur Mortensen
IW 0 [2018-08-08T13:00:18.283Z; main]:   index before flush _bhg(7.1.0):C108396
DW 0 [2018-08-08T13:00:18.283Z; main]: startFullFlush
IW 0 [2018-08-08T13:00:18.283Z; main]: now apply all deletes for all
segments buffered updates bytesUsed=0 reader pool bytesUsed=0
BD 0 [2018-08-08T13:00:18.283Z; main]: waitApply: no deletes to apply
DW 0 [2018-08-08T13:00:18.284Z; main]: main finishFullFlush success=true
IW 0 [2018-08-08T13:00:18.284Z; main]: startCommit(): start
IW 0 [2018-08-08T13:00:18.284Z; main]: startCommit
index=_bhg(7.1.0):C108396 changeCount=2
IW 0 [2018-08-08T13:00:18.293Z; main]: startCommit: wrote pending segments
file "pending_segments_4vt"
IW 0 [2018-08-08T13:00:18.295Z; main]: done all syncs:
[_bhg_Lucene50_0.tip, _bhg.fdx, _bhg.fnm, _bhg.nvm, _bhg.fdt, _bhg.si,
_bhg_Lucene50_0.pos, _bhg.nvd, _bhg_Lucene50_0.doc, _bhg_Lucene50_0.tim]
IW 0 [2018-08-08T13:00:18.295Z; main]: commit: pendingCommit != null
IW 0 [2018-08-08T13:00:18.298Z; main]: commit: done writing segments file
"segments_4vt"
IFD 0 [2018-08-08T13:00:18.298Z; main]: now checkpoint
"_bhg(7.1.0):C108396" [1 segments ; isCommit = true]
IFD 0 [2018-08-08T13:00:18.298Z; main]: deleteCommits: now decRef commit
"segments_4vs"
IFD 0 [2018-08-08T13:00:18.298Z; main]: delete [segments_4vs]
IFD 0 [2018-08-08T13:00:18.299Z; main]: 0 msec to checkpoint
IW 0 [2018-08-08T13:00:18.319Z; main]: commit: took 16.0 msec
IW 0 [2018-08-08T13:00:18.319Z; main]: commit: done
IndexUpgrader 0 [2018-08-08T13:00:18.319Z; main]: Committed upgraded
metadata to index.
IW 0 [2018-08-08T13:00:18.319Z; main]: now flush at close
IW 0 [2018-08-08T13:00:18.319Z; main]:   start flush: applyAllDeletes=true
IW 0 [2018-08-08T13:00:18.319Z; main]:   index before flush
_bhg(7.1.0):C108396
DW 0 [2018-08-08T13:00:18.319Z; main]: startFullFlush
DW 0 [2018-08-08T13:00:18.320Z; main]: main finishFullFlush success=true
IW 0 [2018-08-08T13:00:18.320Z; main]: now apply all deletes for all
segments buffered updates bytesUsed=0 reader pool bytesUsed=0
BD 0 [2018-08-08T13:00:18.320Z; main]: waitApply: no deletes to apply
MS 0 [2018-08-08T13:00:18.320Z; main]: updateMergeThreads ioThrottle=true
targetMBPerSec=10240.0 MB/sec
MS 0 [2018-08-08T13:00:18.320Z; main]: now merge
MS 0 [2018-08-08T13:00:18.321Z; main]:   index: _bhg(7.1.0):C108396
MS 0 [2018-08-08T13:00:18.321Z; main]:   no more merges pending; now return
IW 0 [2018-08-08T13:00:18.321Z; main]: waitForMerges
IW 0 [2018-08-08T13:00:18.321Z; main]: waitForMerges done
IW 0 [2018-08-08T13:00:18.321Z; main]: commit: start
IW 0 [2018-08-08T13:00:18.321Z; main]: commit: enter lock
IW 0 [2018-08-08T13:00:18.321Z; main]: commit: now prepare
IW 0 [2018-08-08T13:00:18.321Z; main]: prepareCommit: flush
IW 0 [2018-08-08T13:00:18.321Z; main]:   index before flush
_bhg(7.1.0):C108396
DW 0 [2018-08-08T13:00:18.321Z; main]: startFullFlush
IW 0 [2018-08-08T13:00:18.321Z; main]: now apply all deletes for all
segments buffered updates bytesUsed=0 reader pool bytesUsed=0
BD 0 [2018-08-08T13:00:18.322Z; main]: waitApply: no deletes to apply
DW 0 [2018-08-08T13:00:18.322Z; main]: main finishFullFlush success=true
IW 0 [2018-08-08T13:00:18.322Z; main]: startCommit(): start
IW 0 [2018-08-08T13:00:18.322Z; main]:   skip startCommit(): no changes
pending
IW 0 [2018-08-08T13:00:18.322Z; main]: commit: pendingCommit == null; skip
IW 0 [2018-08-08T13:00:18.322Z; main]: commit: took 0.9 msec
IW 0 [2018-08-08T13:00:18.322Z; main]: commit: done
IW 0 [2018-08-08T13:00:18.322Z; main]: rollback
IW 0 [2018-08-08T13:00:18.322Z; main]: all running merges have aborted
IW 0 [2018-08-08T13:00:18.323Z; main]: rollback: done finish merges
DW 0 [2018-08-08T13:00:18.323Z; main]: abort
DW 0 [2018-08-08T13:00:18.323Z; main]: done abort success=true
IW 0 [2018-08-08T13:00:18.323Z; main]: rollback: infos=_bhg(7.1.0):C108396
IFD 0 [2018-08-08T13:00:18.323Z; main]: now checkpoint
"_bhg(7.1.0):C108396" [1 segments ; isCommit = false]
IFD 0 [2018-08-08T13:00:18.323Z; main]: 0 msec to checkpoint

2018-08-07 16:38 GMT+02:00 Erick Erickson :

> Bjarke:
>
> One thing, what version of Solr are you moving _from_ and _to_?
> Solr/Lucene only guarantee one major backward revision so you can copy
> an index created with Solr 6 to another Solr 6 or Solr 7, but you
> couldn't copy an index created with Solr 5 to Solr 7...
>
> Also note that shard splitting is a very expensive operation, so be
> patient
>
> Best,
> Erick
>
> On Tue, Aug 7, 2018 at 6:17 AM, Rahul Singh
>  wrote:
> > Bjarke,
> >
> > I am imagining that at some point you may need to shard that data if it
> grows. Or do you imagine this data to remain stagnant?
> >
> > Generally you want to add solrcloud to do two things : 1. Increase
> availability with replicas 2. Increase available data via shards 3.
> Increase fault tolerance with leader and replicas being spread around the
> cluster.
>

Re: Recipe for moving to solr cloud without reindexing

2018-08-08 Thread Bjarke Buur Mortensen
Rahul, thanks, I do indeed want to be able to shard.
For now I'll go with Markus' suggestion and try to use the SPLITSHARD
command.
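Since splitting a ~150 GB shard can run for a long time, a hedged sketch is
to submit the split asynchronously and poll for its status (collection name
illustrative):

curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=procurement&shard=shard1&async=split-1"
curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=split-1"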

2018-08-07 15:17 GMT+02:00 Rahul Singh :

> Bjarke,
>
> I am imagining that at some point you may need to shard that data if it
> grows. Or do you imagine this data to remain stagnant?
>
> Generally you want to add solrcloud to do two things : 1. Increase
> availability with replicas 2. Increase available data via shards 3.
> Increase fault tolerance with leader and replicas being spread around the
> cluster.
>
> You would be bypassing general High availability / distributed computing
> processes by trying to not reindex.
>
> Rahul
> On Aug 7, 2018, 7:06 AM -0400, Bjarke Buur Mortensen <
> morten...@eluence.com>, wrote:
> > Hi List,
> >
> > is there a cookbook recipe for moving an existing solr core to a solr
> cloud
> > collection.
> >
> > We currently have a single machine with a large core (~150gb), and we
> would
> > like to move to solr cloud.
> >
> > I haven't been able to find anything that reuses an existing index, so
> any
> > pointers much appreciated.
> >
> > Thanks,
> > Bjarke
>


Re: Recipe for moving to solr cloud without reindexing

2018-08-07 Thread Bjarke Buur Mortensen
Right, that seems like a way to go, will give it a try.

Thanks!
/Bjarke

2018-08-07 14:08 GMT+02:00 Markus Jelsma :

> Hello Bjarke,
>
> You can use shard splitting:
> https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-splitshard
>
> Regards,
> Markus
>
>
>
> -Original message-
> > From:Bjarke Buur Mortensen 
> > Sent: Tuesday 7th August 2018 13:47
> > To: solr-user@lucene.apache.org
> > Subject: Re: Recipe for moving to solr cloud without reindexing
> >
> > Thank you, that is of course a way to go, but I would actually like to be
> > able to shard ...
> > Could I use your approach and add shards dynamically?
> >
> >
> > 2018-08-07 13:28 GMT+02:00 Markus Jelsma :
> >
> > > Hello Bjarke,
> > >
> > > If you are not going to shard you can just create a 1 shard/1 replica
> > > collection, shut down Solr, copy the data directory into the replica's
> > > directory and start up again.
> > >
> > > Regards,
> > > Markus
> > >
> > > -Original message-
> > > > From:Bjarke Buur Mortensen 
> > > > Sent: Tuesday 7th August 2018 13:06
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Recipe for moving to solr cloud without reindexing
> > > >
> > > > Hi List,
> > > >
> > > > is there a cookbook recipe for moving an existing solr core to a solr
> > > cloud
> > > > collection.
> > > >
> > > > We currently have a single machine with a large core (~150gb), and we
> > > would
> > > > like to move to solr cloud.
> > > >
> > > > I haven't been able to find anything that reuses an existing index,
> so
> > > any
> > > > pointers much appreciated.
> > > >
> > > > Thanks,
> > > > Bjarke
> > > >
> > >
> >
>


Re: Recipe for moving to solr cloud without reindexing

2018-08-07 Thread Bjarke Buur Mortensen
Thank you, that is of course a way to go, but I would actually like to be
able to shard ...
Could I use your approach and add shards dynamically?


2018-08-07 13:28 GMT+02:00 Markus Jelsma :

> Hello Bjarke,
>
> If you are not going to shard you can just create a 1 shard/1 replica
> collection, shut down Solr, copy the data directory into the replica's
> directory and start up again.
>
> Regards,
> Markus
>
> -Original message-
> > From:Bjarke Buur Mortensen 
> > Sent: Tuesday 7th August 2018 13:06
> > To: solr-user@lucene.apache.org
> > Subject: Recipe for moving to solr cloud without reindexing
> >
> > Hi List,
> >
> > is there a cookbook recipe for moving an existing solr core to a solr
> cloud
> > collection.
> >
> > We currently have a single machine with a large core (~150gb), and we
> would
> > like to move to solr cloud.
> >
> > I haven't been able to find anything that reuses an existing index, so
> any
> > pointers much appreciated.
> >
> > Thanks,
> > Bjarke
> >
>
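For reference, a minimal sketch of Markus' copy approach (hedged; paths and
the replica's core directory name are illustrative and must be checked under
the Solr data dir first):

bin/solr create -c procurement -shards 1 -replicationFactor 1
bin/solr stop -all
cp -r /old/standalone/solr/procurement/data/index \
      /var/solr/data/procurement_shard1_replica_n1/data/
bin/solr start -c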


Recipe for moving to solr cloud without reindexing

2018-08-07 Thread Bjarke Buur Mortensen
Hi List,

is there a cookbook recipe for moving an existing solr core to a solr cloud
collection.

We currently have a single machine with a large core (~150gb), and we would
like to move to solr cloud.

I haven't been able to find anything that reuses an existing index, so any
pointers much appreciated.

Thanks,
Bjarke


How to only highlight terms that caused the document to match

2018-07-04 Thread Bjarke Buur Mortensen
Hi list,

I'm having difficulties getting the solr highlighter to highlight only the
terms that actually caused the match. Let me explain:

Given a query "john OR (peter AND mary)"
and two documents:
"john is awesome and so is peter"
"peter is awesome and so is mary",

solr will highlight "peter" and "mary" in the second document, which is
expected.
However it will also highlight both 'john' and 'peter' in the first
document, even though peter requires that mary is present also.

Is there any way to improve this?

If I add debugQuery, the explain-block can easily tell me that the first
document matched because of john, giving it a score of 1, whereas the
second matched because of the presence of both peter and mary, giving it a
score of 2.

So somehow, the information is available, but not used by the highlighter.

Below, I have included a real world solr output to explain what I mean.

Thanks,
Bjarke


---

{
  "responseHeader":{
"status":0,
"QTime":12,
"params":{
  "hl.snippets":"2",
  "q":"plejehjem*  OR (plejecentre* AND boliger*)",
  "defType":"lucene",
  "hl":"on",
  "fl":"doc_id,score",
  "fq":"doc_id:(0273-000545 OR 259531-2018)",
  "hl.method":"unified",
  "debugQuery":"on"}},
  "response":{"numFound":2,"start":0,"maxScore":3.0,"docs":[
  {
"doc_id":"0273-000545",
"score":3.0},
  {
"doc_id":"259531-2018",
"score":1.0}]
  },
  "highlighting":{
"udbuddk-0273-000545":{
  
"content_and_cpv_descriptions_da":["Beskrivelse\n---\n\nKonkurrenceudsættelsen
omfatter drift af følgende 2 plejecentre: \n·
Sandgårdsparken, Kjellerup, 40 boliger  \n·
Solgården, Sjørslev, 22 boliger  \nBeslutningen om at udsætte
driften af plejecentre for konkurrence er aftalt i den
politiske budgetaftale for 2015, der blev indgået i august 2014 mellem
alle byrådets partier undtagen Dansk Folkeparti og Enhedslisten.
\n”Ældre- og Handicapudvalget igangsætter en proces for
konkurrenceudsættelse af drift af ca. 72 plejehjemspladser.
",
"85144100 Sygepleje på plejehjem"]},
"TED-259531-2018":{
  "content_and_cpv_descriptions_da":["Morsø Kommune 41333014
Jernbanevej 7 Nykøbing M 7900 Birgitte Lund +45 99707017
birgitte.l...@morsoe.dk https://permalink.mercell.com/8747.aspx
http://www.morsoe.dk/ https://permalink.mercell.com/8747.aspx
Mercell Danmark A/S Østre Stationsvej 33, Vestfløjen Odense C 5000
support...@mercell.com https://permalink.mercell.com/8747.aspx
https://permalink.mercell.com/8747.aspx Vikarydelser på
ældreområdet 773-2018-5278 Udbuddet omfatter hjemmeplejen og
plejecentre i Morsø Kommune. ",
"85144100 Sygepleje på plejehjem"]}},
  "debug":{
"rawquerystring":"plejehjem*  OR (plejecentre* AND boliger*)",
"querystring":"plejehjem*  OR (plejecentre* AND boliger*)",
"parsedquery":"content_and_cpv_descriptions_da:plejehjem*
(+content_and_cpv_descriptions_da:plejecentre*
+content_and_cpv_descriptions_da:boliger*)",
"parsedquery_toString":"content_and_cpv_descriptions_da:plejehjem*
(+content_and_cpv_descriptions_da:plejecentre*
+content_and_cpv_descriptions_da:boliger*)",
"explain":{
  "udbuddk-0273-000545":"\n3.0 = sum of:\n  1.0 =
content_and_cpv_descriptions_da:plejehjem*\n  2.0 = sum of:\n1.0 =
content_and_cpv_descriptions_da:plejecentre*\n1.0 =
content_and_cpv_descriptions_da:boliger*\n",
  "TED-259531-2018":"\n1.0 = sum of:\n  1.0 =
content_and_cpv_descriptions_da:plejehjem*\n"},
"QParser":"LuceneQParser",
"filter_queries":["doc_id:(0273-000545 OR 259531-2018)"],
"parsed_filter_queries":["doc_id:0273-000545 doc_id:259531-2018"],
"timing":{
  "time":12.0,
  "prepare":{
"time":0.0,
"query":{
  "time":0.0},
"facet":{
  "time":0.0},
"facet_module":{
  "time":0.0},
"mlt":{
  "time":0.0},
"highlight":{
  "time":0.0},
"stats":{
  "time":0.0},
"expand":{
  "time":0.0},
"terms":{
  "time":0.0},
"debug":{
  "time":0.0}},
  "process":{
"time":11.0,
"query":{
  "time":1.0},
"facet":{
  "time":0.0},
"facet_module":{
  "time":0.0},
"mlt":{
  "time":0.0},
"highlight":{
  "time":9.0},
"stats":{
  "time":0.0},
"expand":{
  "time":0.0},
"terms":{
  "time":0.0},
"debug":{
  "time":0.0}},
  "loadFieldValues":{
"time":0.0


Re: Multiple consecutive wildcards (**) causes Out-of-memory

2018-02-07 Thread Bjarke Buur Mortensen
Just to clarify:
I can only cause this to happen when using the complexphrase query parser.
Lucene/dismax/edismax parsers are not affected.

2018-02-07 13:09 GMT+01:00 Bjarke Buur Mortensen :

> Hello list,
>
> Whenever I make a query for ** (two consecutive wildcards) it causes my
> Solr to run out of memory.
>
> http://localhost:8983/solr/select?q=**
>
> Why is that?
>
> I realize that this is not a reasonable query to make, but the system
> supports input from users, and they might by accident input this query,
> causing Solr to crash.
>
> I should add that we use the complexphrase query parser as the default
> parser on a Solr 7.1.
>
> Can anyone repro this or explain what causes the problem?
>
> Thanks in advance,
> Bjarke Buur Mortensen
> Senior Software Engineer
> Eluence A/S
>
>
>
>
>
>
>


Multiple consecutive wildcards (**) causes Out-of-memory

2018-02-07 Thread Bjarke Buur Mortensen
Hello list,

Whenever I make a query for ** (two consecutive wildcards) it causes my
Solr to run out of memory.

http://localhost:8983/solr/select?q=**

Why is that?

I realize that this is not a reasonable query to make, but the system
supports input from users, and they might by accident input this query,
causing Solr to crash.

I should add that we use the complexphrase query parser as the default
parser on a Solr 7.1.
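Until the root cause is found, a crude client-side guard (a hedged shell
sketch) is to collapse runs of consecutive wildcards before the query string
ever reaches Solr:

q=$(printf '%s' "$raw_user_query" | sed -E 's/\*{2,}/*/g')
curl -G "http://localhost:8983/solr/select" --data-urlencode "q=$q"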

Can anyone repro this or explain what causes the problem?

Thanks in advance,
Bjarke Buur Mortensen
Senior Software Engineer
Eluence A/S


Re: Complexphrase treats wildcards differently than other query parsers

2017-10-09 Thread Bjarke Buur Mortensen
Thanks again, Tim,
following your recipe, I was able to write a failing test:

assertQ(req("q", "{!complexphrase} iso-latin1:cr\u00E6zy*")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

Notice how cr\u00E6zy* is used as a query term which mimics the behaviour I
originally reported, namely that CPQP does not analyse it because of the
wildcard and thus does not hit the charfilter from the query side.


2017-10-06 20:54 GMT+02:00 Allison, Timothy B. :

> That could be it.  I'm not able to reproduce this with trunk.  More next
> week.
>
> In trunk, if I add this to schema15.xml:
>   <fieldtype name="iso-latin1" class="solr.TextField">
>     <analyzer>
>       <charFilter class="solr.MappingCharFilterFactory"
>                   mapping="mapping-ISOLatin1Accent.txt"/>
>       <tokenizer class="..."/>
>     </analyzer>
>   </fieldtype>
>   <field name="iso-latin1" type="iso-latin1" indexed="true" stored="true"/>
>
> This test passes.
>
>   @Test
>   public void testCharFilter() {
> assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));
> assertU(commit());
> assertU(optimize());
>
> assertQ(req("q", "{!complexphrase} iso-latin1:craezy")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:traen")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:caezy~1")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:crae*")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:*aezy")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:crae*y")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"craezy traen\"")
> , "//result[@numFound='1']"
>     , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"caezy~1 traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"craez* traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"*aezy traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"crae*y traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>   }
>
>
>
> -Original Message-
> From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
> Sent: Friday, October 6, 2017 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Complexphrase treats wildcards differently than other query
> parsers
>
> Thanks a lot for your effort, Tim.
>
> Looking at it from the Solr side, I see some use of local classes. The
> snippet below in particular caught my eye (in
> solr/core/src/java/org/apache/solr/search/ComplexPhraseQParserPlugin.java).
> The instance of ComplexPhraseQueryParser is not the clean one from Lucene,
> but a modified one. If any of the modifications messes with the analysis
> logic, well then that might answer it.
>
> What do you make of it?
>
> lparser = new ComplexPhraseQueryParser(defaultField,
>     getReq().getSchema().getQueryAnalyzer()) {
>   protected Query newWildcardQuery(org.apache.lucene.index.Term t) {
>     try {
>       org.apache.lucene.search.Query wildcardQuery =
>           reverseAwareParser.getWildcardQuery(t.field(), t.text());
>       setRewriteMethod(wildcardQuery);
>       return wildcardQuery;
>     } catch (SyntaxError e) {
>       throw new RuntimeException(e);
>     }
>   }
>
>   private Query setRewriteMethod(org.apache.lucene.search.Query query) {
>     if (query instanceof MultiTermQuery) {
>       ((MultiTermQuery) query).setRewriteMethod(
>           org.apache.lucene.search.MultiTermQuery.SCORING_BOOLEAN_REWRITE);
>     }
>     return query;
>   }
>
>   protected Query newRangeQuery(String field, String part1, String part2,
>       boolean startInclusive, boolean endInclusive) {
>     boolean reverse =
>         reverseAwareParser.isRangeShouldBeProtectedFromReverse(field, part1);
>     return super.newRangeQuery(field,
>         reverse ? reverseAwareParser.getLowerBoundForReverse() : part1,
>         part2, startInclusive || reverse, endInclusive);
>   }
> };
>
> Thanks,
> Bjarke
>
>
>


Re: Complexphrase treats wildcards differently than other query parsers

2017-10-06 Thread Bjarke Buur Mortensen
Thanks a lot for your effort, Tim.

Looking at it from the Solr side, I see some use of local classes. The
snippet below in particular caught my eye (in
solr/core/src/java/org/apache/solr/search/ComplexPhraseQParserPlugin.java).
The instance of ComplexPhraseQueryParser is not the clean one from Lucene,
but a modified one. If any of the modifications messes with the analysis
logic, well then that might answer it.

What do you make of it?

lparser = new ComplexPhraseQueryParser(defaultField,
    getReq().getSchema().getQueryAnalyzer()) {
  protected Query newWildcardQuery(org.apache.lucene.index.Term t) {
    try {
      org.apache.lucene.search.Query wildcardQuery =
          reverseAwareParser.getWildcardQuery(t.field(), t.text());
      setRewriteMethod(wildcardQuery);
      return wildcardQuery;
    } catch (SyntaxError e) {
      throw new RuntimeException(e);
    }
  }

  private Query setRewriteMethod(org.apache.lucene.search.Query query) {
    if (query instanceof MultiTermQuery) {
      ((MultiTermQuery) query).setRewriteMethod(
          org.apache.lucene.search.MultiTermQuery.SCORING_BOOLEAN_REWRITE);
    }
    return query;
  }

  protected Query newRangeQuery(String field, String part1, String part2,
      boolean startInclusive, boolean endInclusive) {
    boolean reverse =
        reverseAwareParser.isRangeShouldBeProtectedFromReverse(field, part1);
    return super.newRangeQuery(field,
        reverse ? reverseAwareParser.getLowerBoundForReverse() : part1,
        part2, startInclusive || reverse, endInclusive);
  }
};

Thanks,
Bjarke

2017-10-05 21:15 GMT+02:00 Allison, Timothy B. :

> After some more digging, I'm wrong even at the Lucene level.
>
> When I use the CustomAnalyzer and make my UC vowel mock filter
> MultitermAware, I get this with Lucene in trunk:
>
> "the* quick~" name:thE* name:qUIck~2 name:thE name:qUIck
>
> So, there's room for improvement with phrases, but the regular multiterms
> should be ok.
>
> Still no answer for you...
>
> 2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :
>
> > There's every chance that I'm missing something at the Solr level, but
> > it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still
> > not applying analysis to multiterms.
> >
> > When I call this on 7.0.0:
> >QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> > analyzer);
> > return qp.parse(qString);
> >
> >  where the analyzer is a mock "uppercase vowel" analyzer[1] and the
> > qString is;
> >
> > "the* quick~" the* quick~ the quick
> >
> > I get this:
> > "the* quick~" name:the* name:quick~2 name:thE name:qUIck
>
>


Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Bjarke Buur Mortensen
Thanks Tim,
that might be what I'm experiencing. I'm actually quite certain of it :-)

Do you remember any reason that multi term analysis is not happening in
ComplexPhraseQueryParser?

I'm on 6.6.1, so latest on the 6.x branch.

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :

> There's every chance that I'm missing something at the Solr level, but it
> _looks_ at the Lucene level, like ComplexPhraseQueryParser is still not
> applying analysis to multiterms.
>
> When I call this on 7.0.0:
>QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> analyzer);
> return qp.parse(qString);
>
>  where the analyzer is a mock "uppercase vowel" analyzer[1] and the
> qString is;
>
> "the* quick~" the* quick~ the quick
>
> I get this:
> "the* quick~" name:the* name:quick~2 name:thE name:qUIck
>
>
> [1] https://github.com/tballison/lucene-addons/blob/master/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Thursday, October 5, 2017 8:02 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Complexphrase treats wildcards differently than other query
> parsers
>
> What version of Solr are you using?
>
> I thought this had been fixed fairly recently, but I can't quickly find
> the JIRA.  Let me take a look.
>
> Best,
>
>  Tim
>
> This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1]
> and [2], which handles analysis of multiterms even in phrases.
>
> [1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
> [2] https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205/6.6-0.1
>
> -Original Message-
> From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
> Sent: Thursday, October 5, 2017 6:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Complexphrase treats wildcards differently than other query
> parsers
>
> 2017-10-05 11:29 GMT+02:00 Emir Arnautović :
>
> > Hi Bjarke,
> > You are right - I jumped into wrong/old conclusion as the simplest
> > answer to your question.
>
>
>  No problem :-)
>
> I guess looking at the code could give you an answer.
> >
>
> This is what I would like to avoid out of fear that my head would explode
> ;-)
>
>
> >
> > Thanks,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> > Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> > > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen
> > > 
> > wrote:
> > >
> > > Well, according to
> > > https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
> > > multiterm means
> > >
> > > wildcard
> > > range
> > > prefix
> > >
> > > so it is that way i'm using the word. That same article explains how
> > > analysis will be performed with wildcards if the analyzers are
> > > multi-term aware.
> > > Furthermore, both lucene and dismax do the correct analysis, so I
> > > don't think you are right in your statement about the majority of
> > > QPs skipping analysis for wildcards.
> > >
> > > So I'm still confused as to why complexphrase does things differently.
> > >
> > > Thanks,
> > > /Bjarke
> > >
> > > 2017-10-05 10:16 GMT+02:00 Emir Arnautović
> > > > >:
> > >
> > >> Hi Bjarke,
> > >> It is not multiterm that is causing query parser to skip analysis
> > >> chain but wildcard. The majority of query parsers do not analyse
> > >> query string
> > if
> > >> there are wildcards.
> > >>
> > >> HTH
> > >> Emir
> > >> --
> > >> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> > >> Elasticsearch Consulting Support Training - http://sematext.com/
> > >>
> > >>
> > >>
> > >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen
> > >>> 
> > >> wrote:
> > >>>
> > >>> Hi list,
> > >>>
> > >>> I'm trying to search for the term funktionsnedsättning* In my
> > >>> analyzer chain I use a MappingCharFilterFactory to change ä to a.
> > >>> So I would expect that funktionsnedsättning* would translate to
> > &

Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Bjarke Buur Mortensen
2017-10-05 11:29 GMT+02:00 Emir Arnautović :

> Hi Bjarke,
> You are right - I jumped into wrong/old conclusion as the simplest answer
> to your question.


 No problem :-)

I guess looking at the code could give you an answer.
>

This is what I would like to avoid out of fear that my head would explode
;-)


>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> wrote:
> >
> > Well, according to
> > https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
> > multiterm means
> >
> > wildcard
> > range
> > prefix
> >
> > so it is that way i'm using the word. That same article explains how
> > analysis will be performed with wildcards if the analyzers are multi-term
> > aware.
> > Furthermore, both lucene and dismax do the correct analysis, so I don't
> > think you are right in your statement about the majority of QPs skipping
> > analysis for wildcards.
> >
> > So I'm still confused as to why complexphrase does things differently.
> >
> > Thanks,
> > /Bjarke
> >
> > 2017-10-05 10:16 GMT+02:00 Emir Arnautović <emir.arnauto...@sematext.com>:
> >
> >> Hi Bjarke,
> >> It is not multiterm that is causing query parser to skip analysis chain
> >> but wildcard. The majority of query parsers do not analyse query string
> if
> >> there are wildcards.
> >>
> >> HTH
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> >> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> I'm trying to search for the term funktionsnedsättning*
> >>> In my analyzer chain I use a MappingCharFilterFactory to change ä to a.
> >>> So I would expect that funktionsnedsättning* would translate to
> >>> funktionsnedsattning*.
> >>>
> >>> If I use e.g. the lucene query parser, this is indeed what happens:
> >>> ...debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsattning*"
> >>> and 15 documents returned.
> >>>
> >>> Trying the same with complexphrase gives me:
> >>> ...debugQuery=on&defType=complexphrase&q=funktionsneds%C3%A4ttning*
> >> gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsättning*"
> >>> and 0 documents. Notice how ä has not been changed to a.
> >>>
> >>> How can this be? Is complexphrase somehow skipping the analysis chain
> for
> >>> multiterms, even though components and in particular
> >>> MappingCharFilterFactory are Multi-term aware
> >>>
> >>> Are there any configuration gotchas that I'm not aware of?
> >>>
> >>> Thanks for the help,
> >>> Bjarke Buur Mortensen
> >>> Senior Software Engineer, Eluence A/S
> >>
> >>
>
>


Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Bjarke Buur Mortensen
Well, according to
https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
multiterm means

wildcard
range
prefix

so it is that way i'm using the word. That same article explains how
analysis will be performed with wildcards if the analyzers are multi-term
aware.
Furthermore, both lucene and dismax do the correct analysis, so I don't
think you are right in your statement about the majority of QPs skipping
analysis for wildcards.

So I'm still confused as to why complexphrase does things differently.

Thanks,
/Bjarke

2017-10-05 10:16 GMT+02:00 Emir Arnautović :

> Hi Bjarke,
> It is not multiterm that is causing query parser to skip analysis chain
> but wildcard. The majority of query parsers do not analyse query string if
> there are wildcards.
>
> HTH
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> wrote:
> >
> > Hi list,
> >
> > I'm trying to search for the term funktionsnedsättning*
> > In my analyzer chain I use a MappingCharFilterFactory to change ä to a.
> > So I would expect that funktionsnedsättning* would translate to
> > funktionsnedsattning*.
> >
> > If I use e.g. the lucene query parser, this is indeed what happens:
> > ...debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* gives me
> > "rawquerystring":"funktionsnedsättning*", "querystring":
> > "funktionsnedsättning*", "parsedquery":"content_ol:
> funktionsnedsattning*"
> > and 15 documents returned.
> >
> > Trying the same with complexphrase gives me:
> > ...debugQuery=on&defType=complexphrase&q=funktionsneds%C3%A4ttning*
> gives me
> > "rawquerystring":"funktionsnedsättning*", "querystring":
> > "funktionsnedsättning*", "parsedquery":"content_ol:
> funktionsnedsättning*"
> > and 0 documents. Notice how ä has not been changed to a.
> >
> > How can this be? Is complexphrase somehow skipping the analysis chain for
> > multiterms, even though components and in particular
> > MappingCharFilterFactory are Multi-term aware
> >
> > Are there any configuration gotchas that I'm not aware of?
> >
> > Thanks for the help,
> > Bjarke Buur Mortensen
> > Senior Software Engineer, Eluence A/S
>
>


Complexphrase treats wildcards differently than other query parsers

2017-10-04 Thread Bjarke Buur Mortensen
Hi list,

I'm trying to search for the term funktionsnedsättning*
In my analyzer chain I use a MappingCharFilterFactory to change ä to a.
So I would expect that funktionsnedsättning* would translate to
funktionsnedsattning*.

If I use e.g. the lucene query parser, this is indeed what happens:
...debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* gives me
"rawquerystring":"funktionsnedsättning*", "querystring":
"funktionsnedsättning*", "parsedquery":"content_ol:funktionsnedsattning*"
and 15 documents returned.

Trying the same with complexphrase gives me:
...debugQuery=on&defType=complexphrase&q=funktionsneds%C3%A4ttning* gives me
"rawquerystring":"funktionsnedsättning*", "querystring":
"funktionsnedsättning*", "parsedquery":"content_ol:funktionsnedsättning*"
and 0 documents. Notice how ä has not been changed to a.
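A hedged way to see what the query-time chain itself produces, outside of
any query parser, is the field analysis endpoint (core name illustrative);
it runs the charfilters, tokenizer and filters and shows each stage:

curl "http://localhost:8983/solr/mycore/analysis/field?analysis.fieldname=content_ol&analysis.query=funktionsneds%C3%A4ttning"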

How can this be? Is complexphrase somehow skipping the analysis chain for
multiterms, even though components and in particular
MappingCharFilterFactory are Multi-term aware?

Are there any configuration gotchas that I'm not aware of?

Thanks for the help,
Bjarke Buur Mortensen
Senior Software Engineer, Eluence A/S


Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-30 Thread Bjarke Buur Mortensen
OK, that complicates things a bit.

I would still try to go for a solution where you store the rich text in
Solr, but make sure you tokenize it correctly.

If the format is relatively simple, you could use either a regexp pattern
tokenizer
https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-SimplifiedRegularExpressionPatternTokenizer

or perhaps, before tokenization, use a pattern replace char filter to strip
out the parts of the rich text that should not be indexed
https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory

I assume that you have some process for converting the rich text to plain
text before indexing, so if you can replicate that process using Solr's
charfilters, tokenizers and filters then that would allow you to use the
highlighter to get the rich text back.
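A minimal sketch of such a type via the Schema API (hedged; it assumes,
purely for illustration, that the markup to hide from the index is
brace-delimited):

curl -X POST -H 'Content-type:application/json' \
  "http://localhost:8983/solr/mycore/schema" -d '{
  "add-field-type": {
    "name": "rich_text_stripped",
    "class": "solr.TextField",
    "analyzer": {
      "charFilters": [ { "class": "solr.PatternReplaceCharFilterFactory",
                         "pattern": "\\{[^}]*\\}", "replacement": " " } ],
      "tokenizer": { "class": "solr.StandardTokenizerFactory" }
    }
  }
}'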

HTH,
Bjarke


2017-03-30 10:39 GMT+02:00 forest_soup :

> Unfortunately the rich text is not an html/xml/doc/pdf or any other popular
> rich text format. And we would like to show the highlighted text in the
> doc's own specific viewer. That's why I'm eagerly want the offset.
>
> The /tvrh(term vector component) and tv.offsets/tv.positions can give us
> such info, but they returns all terms' data instead of the being searched
> ones. So we are still seeking ways to filter the results.
>
> Any ideas?
>
> Thanks!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Is-there-a-way-to-retrieve-the-a-term-s-
> position-offset-in-Solr-tp4326931p4327623.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-29 Thread Bjarke Buur Mortensen
OK, so the next thing to do would be to index and store the rich text ...
is it HTML? Because then you can use HTMLStripCharFilterFactory in your
analyzer, and still get the correct highlight back with hl.fragsize=0.

I would think that you will have a hard time using the term positions, if
what you are indexing is somehow transformed before indexing and you want
to map the positions back to the untransformed text.

2017-03-29 4:44 GMT+02:00 forest_soup :

> Thanks All!
>
> Actually we are going to show the highlighted words in a rich text format
> instead of the plain text which was indexed. So the hl.fragsize=0 seems not
> work for me..
>
> And for the patch(SOLR-4722), haven't tried it. Hope it can return the
> position/offset info.
>
> Thanks!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Is-there-a-way-to-retrieve-the-a-term-s-
> position-offset-in-Solr-tp4326931p4327339.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-28 Thread Bjarke Buur Mortensen
Well, you can get Solr to highlight the entire field if that's what you are
after by setting:
hl.fragsize=0

From https://cwiki.apache.org/confluence/display/solr/Highlighting#Highlighting-Usage:
Specifies the approximate size, in characters, of fragments to consider for
highlighting. *0* indicates that no fragmenting should be considered and
the whole field value should be used.
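A hedged example request (core and field names illustrative):

curl "http://localhost:8983/solr/mycore/select?q=content:peter&hl=on&hl.fl=content&hl.fragsize=0"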


2017-03-28 10:59 GMT+02:00 forest_soup :

> Thanks Eric.
>
> Actually solr highlighting function does not meet my requirement. My
> requirement is not showing the highlighted words in snippets, but show them
> in the whole opening document. So I would like to get the term's
> position/offset info from solr. I went through the highlight feature, but
> found that exact info(position/offset) is not returned.
> If you know that info within highlighting feature, could you please point
> it
> out to me?
>
> The most promising way seems to be /tvrh and tv.offsets/tv.positions
> parameters. But I haven't tried it. Any comments on that one?
>
> Thanks!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Is-there-a-way-to-retrieve-the-a-term-s-
> position-offset-in-Solr-tp4326931p4327149.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Unified highlighter and complexphrase

2017-03-17 Thread Bjarke Buur Mortensen
Hi list,
Given the text:
"Kontraktsproget vil være dansk og arbejdssproget kan være dansk, svensk,
norsk og engelsk"
and the query:
{!complexphrase df=content_da}("sve* no*")
the unified highlighter (hl.method=unified) does not return any highlights.
For reference, the original highlighter returns a snippet with the expected
highlights:
Kontraktsproget vil være dansk og arbejdssproget kan være dansk,
svensk, norsk og
Is this expected behaviour with the unified highlighter?

I have also filed this a bug report here:
https://issues.apache.org/jira/browse/SOLR-10309
but maybe some of you can help out.

Thanks in advance,
Bjarke
Senior Software Engineer, Eluence A/S