Re: Is there a way to force content extraction with a given encoding

2019-11-07 Thread Jörn Franke
I would convert them to UTF-8 before posting and use UTF-8 in your application. 
Most of the web and most applications use UTF-8; with any other encoding you 
will keep running into problems.
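
If the files cannot be produced as UTF-8 at the source, here is a minimal
re-encoding sketch in Java (file names are hypothetical, and it assumes the
files really are windows-1255):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class Recode {
        public static void main(String[] args) throws Exception {
            // read with the source encoding, write back out as UTF-8
            String text = Files.readString(Path.of("doc.txt"),
                    Charset.forName("windows-1255"));
            Files.writeString(Path.of("doc-utf8.txt"), text,
                    StandardCharsets.UTF_8);
        }
    }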

> Am 08.11.2019 um 07:47 schrieb lala :
> 
> I am using the /update/extract request handler to push documents into Solr,
> but some text documents that are encoded as windows-1255 (Arabic texts) are
> not extracted properly; the extracted text is not readable.
> 
> I searched the web and the Solr documentation and found nothing. I need to
> send the file encoding as a parameter, if possible, so the Tika parser
> knows it.
> 
> Is there a way to achieve that?
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Using solr API to return csv results

2019-11-07 Thread Paras Lehana
Hi Rhys,

There's already a JIRA for this:
https://issues.apache.org/jira/browse/SOLR-2731.

You can comment on the ticket. I also recommend reading about the /export
handler.
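
As for the count itself: the CSV response writer only returns the rows, with
no numFound, so a common workaround is a second rows=0 request just to read
the count. A SolrJ sketch (the client and core name are hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // 'client' is an HttpSolrClient (or CloudSolrClient) you already have
    SolrQuery q = new SolrQuery("*:*");   // or your actual query
    q.setRows(0);                         // return no documents, just the count
    QueryResponse rsp = client.query("yourcore", q);
    long numFound = rsp.getResults().getNumFound();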

On Fri, 8 Nov 2019 at 01:39, rhys J  wrote:

> If I am using the Solr API to query the core, is there a way to tell how
> many documents are found if I use wt=CSV?
>
> Thanks,
>
> Rhys
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Is there a way to force content extraction with a given encoding

2019-11-07 Thread lala
I am using the /update/extract request handler to push documents into Solr,
but some text documents that are encoded as windows-1255 (Arabic texts) are
not extracted properly; the extracted text is not readable.

I searched the web and the Solr documentation and found nothing. I need to
send the file encoding as a parameter, if possible, so the Tika parser
knows it.

Is there a way to achieve that?



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Paras Lehana
Hi Guilherme

> By accident, I ended up querying using the default handler (/select)
> and it worked.


You've just found the culprit. Thanks for providing the material I requested.
Your analysis chain is working as expected, and I don't see any issue in either
the StopFilter or your boosts. I also use a boost of 50 when boosting
contextual suggestions (e.g., boosting "gold iphone" on an iphone page), but I
take Walter's suggestion and would try to optimize my weights. I agree that we
never researched that value of 50 much either (we never faced performance or
relevance issues).

The major difference between the two handlers is edismax. I'm pretty sure
your problem lies in the parsing of queries (you can confirm that from the
parsedquery key in the debug output of both JSON responses). I hope you have
provided the response with fl=*. Replace q with q.alt in your /search handler
query and I think you should start getting responses, because q.alt uses the
standard parser. If you want to keep using edismax, I suggest you test the
responses while removing combinations of the lst entries (qf, bf) to find
what's restricting the documents from coming up. I'm out of office today,
otherwise I would have tried analyzing the field values of the document from
the /select request and comparing them with the qf/bq of /search in
solrconfig.xml. Do this and you'll certainly find something.
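
For example (the core name and handler paths here are illustrative), compare

/solr/yourcore/select?q=lymphoid+and+a+non-lymphoid+cell&debug=query
/solr/yourcore/search?q=lymphoid+and+a+non-lymphoid+cell&debug=query

and diff the parsedquery values in the two responses.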

On Thu, 7 Nov 2019 at 21:00, Walter Underwood  wrote:

> I normally use a weight of 8 for the most important field, like title.
> Other fields might get a 4 or 2.
>
> I add a “pf” field with the weights doubled, so that phrase matches have a
> higher weight.
>
> The weight of 8 comes from experience at Infoseek and Inktomi, two early
> web search engines. With different relevance algorithms and totally
> different evaluation and tuning systems, they settled on weights of 8 and
> 7.5 for HTML titles. With the two radically different systems getting the
> same number, I decided that was a property of the documents, not of the
> search engines.
> search engines.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri  wrote:
>
> Hi Wunder,
>
> My indexer takes quite a few hours to run. I am shortening it to
> run faster, but I also need to make sure it gives what we are expecting.
> This implementation has been in place for more than four years and is massively used.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> I've inherited that implementation and I am really keen to improve it;
> what would you recommend?
>
> Cheers
> Guilherme
>
> On 7 Nov 2019, at 14:43, Walter Underwood  wrote:
>
> Thanks for posting the files. Looking at schema.xml, I see that you still
> are using StopFilterFactory. The first advice we gave you was to remove
> that.
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you do that.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
>
> Hi Paras, everyone
>
> Thank you again for your inputs and suggestions. I'm sorry to hear you had
> trouble with the attachments; I will host them somewhere and share the links.
> I don't tweak my index: I get the data from the graph database, create
> documents as they are, and save them to Solr.
> 
> So, I am sending the new analysis screen, querying the way you suggested,
> along with the results, params, and the Solr query URL.
> 
> While running the queries you asked for, I found something really weird (at
> least to me). By accident, I ended up querying using the default handler
> (/select) and it worked. If I use the handler I must use, it sadly doesn't
> work. I am posting both results, and I will also post the handlers.
>
> Here is the link with all the files mentioned before
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>  >
> If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>
> Thanks
>
> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
>
> Hi Guilherme.
>
> I am sending the analysis result and the json result as requested.
>
>
> Thanks for the effort. Luckily, I can see your attachments (low quality
> though).
>
> From the analysis screen, the analysis is working as expected. One of the
> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> document containing "Lymphoid and a non-Lymphoid cell" I can initially
> think of is: the stopword "a" is probably present in post-analysis either
> of query or index. Did you tweak your index time analysis after indexing?

Re: ConcurrentModificationException in SolrInputDocument writeMap

2019-11-07 Thread Shawn Heisey

On 11/6/2019 8:17 AM, Tim Swetland wrote:

I'm currently running into a ConcurrentModificationException ingesting data
as we attempt to upgrade from Solr 8.1 to 8.2. It's not every document, but
it definitely appears regularly in our logs. We didn't run into this
problem in 8.1, so I'm not sure what might have changed. I feel like this
is probably a bug, but if there's a workaround or if there's an idea of
something I might be doing wrong, please let me know.

Stack trace:
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req: cmd=add{_version=,id=}; node=StdNode:
https:///solr/coll_shard1_replica_n2/ to
https:///solr/coll_shard1_replica_n2/
=> java.util.ConcurrentModificationException
 at java.util.LinkedHashMap.forEach(LinkedHashMap.java:686)
java.util.ConcurrentModificationException: null
   at java.util.LinkedHashMap.forEach(LinkedHashMap.java:686)
   at
org.apache.solr.common.SolrInputDocument.writeMap(SolrInputDocument.java:51)


This error, as mentioned in the SOLR-8028 issue linked by Edward 
Ribeiro, sounds like you are re-using objects when you are building 
SolrInputDocument instances for your indexing.


Looking at the actual code involved in the stacktrace, I think what's 
happening is that at the same time SolrJ is converting a 
SolrInputDocument object to the javabin format so it can be sent to 
Solr, something else has modified that SolrInputDocument object.


You should never re-use SolrJ objects that you construct for indexing. 
A brand new SolrInputDocument instance should be created every time you 
need one, and any objects that go into its creation should also be new 
objects.  This is especially important when the code is multi-threaded, 
but there are certain code constructs that can cause this to happen even 
in code that does not create multiple threads.  Also, code that uses 
CloudSolrClient can very easily become multi-threaded even if the user's 
code isn't.
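
A sketch of the safe pattern (the collection and field names are made up):

    import org.apache.solr.common.SolrInputDocument;

    // 'client' is your SolrClient
    SolrInputDocument doc = new SolrInputDocument();   // brand new per add
    doc.addField("id", nextId);
    doc.addField("title", nextTitle);
    client.add("yourCollection", doc);
    // do not touch 'doc' (or the values passed into it) after this point

rather than clearing and refilling one shared document inside a loop.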


In Solr 8.2, clients and the server were updated to use HTTP/2.  I did 
not closely follow the work for this, but it would have required some 
pretty major changes to SolrJ, changes that very well could have altered 
the precise timing of operations.  Your additional note says you are 
also seeing this problem with 8.1, which does not surprise me.  The 
precise timing of the 8.1 code might have been such that the problem was 
far less noticeable.


If you can share your code that uses SolrJ, we *might* be able to help 
you narrow down what's happening and get it fixed.


Thanks,
Shawn


Re: ConcurrentModificationException in SolrInputDocument writeMap

2019-11-07 Thread Edward Ribeiro
You probably hit
https://issues.apache.org/jira/projects/SOLR/issues/SOLR-8028


Regards,
Edward


On Wed, 6 Nov 2019 at 13:23, Mikhail Khludnev  wrote:

> Hello, Tim.
> Please confirm my understanding: does the exception happen in a standalone
> Java ingesting app?
> If so, does it reuse either SolrInputDocument instances or fields/values
> collections between update calls?
>
> On Wed, Nov 6, 2019 at 8:00 AM Tim Swetland  wrote:
>
> > Nevermind my comment on not having this problem in 8.1. We do have it
> there
> > as well, I just didn't look far enough back in our logs on my initial
> > search. Would still appreciate whatever thoughts anyone might have on the
> > exception.
> >
> > On Wed, Nov 6, 2019 at 10:17 AM Tim Swetland 
> wrote:
> >
> > > I'm currently running into a ConcurrentModificationException ingesting
> > > data as we attempt to upgrade from Solr 8.1 to 8.2. It's not every
> > > document, but it definitely appears regularly in our logs. We didn't
> run
> > > into this problem in 8.1, so I'm not sure what might have changed. I
> feel
> > > like this is probably a bug, but if there's a workaround or if there's
> an
> > > idea of something I might be doing wrong, please let me know.
> > >
> > > Stack trace:
> > > o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
> > > SolrCmdDistributor$Req: cmd=add{_version=,id=};
> > node=StdNode:
> > > https:///solr/coll_shard1_replica_n2/ to https://
> > /solr/coll_shard1_replica_n2/
> > > => java.util.ConcurrentModificationException
> > > at java.util.LinkedHashMap.forEach(LinkedHashMap.java:686)
> > > java.util.ConcurrentModificationException: null
> > >   at java.util.LinkedHashMap.forEach(LinkedHashMap.java:686)
> > >   at
> > >
> >
> org.apache.solr.common.SolrInputDocument.writeMap(SolrInputDocument.java:51)
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:658)
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:383)
> > >   at
> > >
> org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253)
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeMapEntry(JavaBinCodec.java:813)
> > >
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:411)
> > >
> > >   at
> > >
> org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253)
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeIterator(JavaBinCodec.java:750)
> > >
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:395)
> > >
> > >   at
> > >
> org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253)
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:248)
> > >
> > >   at
> > >
> >
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:355)
> > >
> > >   at
> > >
> org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:253)
> > >   at
> > > org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:167)
> > >   at
> > >
> >
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.marshal(JavaBinUpdateRequestCodec.java:102)
> > >   at
> > >
> >
> org.apache.solr.client.solrj.impl.BinaryRequestWriter.write(BinaryRequestWriter.java:83)
> > >   at
> > >
> >
> org.apache.solr.client.solrj.impl.Http2SolrClient.send(Http2SolrClient.java:338)
> > >
> > >   at
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:231)
> > >
> > >   at
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> > >
> > >   at
> > >
> >
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> > >   at
> > >
> >
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil
> > > .java:209)
> > >   at
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >   at
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >
> > >   at java.lang.Thread.run(Thread.java:748)
> > >
> > >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Cursor mark page duplicates

2019-11-07 Thread Chris Hostetter


: I'm using Solr's cursor mark feature and noticing duplicates when paging 
: through results.  The duplicate records happen intermittently and appear 
: at the end of one page, and the beginning of the next (but not on all 
: pages through the results). So if rows=20 the duplicate records would be 
: document 20 on page1, and document 21 on page 2.  The document's id come 

Can you try to reproduce and show us the specifics of this including:

1) The sort param you're using
2) An 'fl' list that includes every field in the sort param
3) The returned values of every 'fl' field for the "duplicate" document 
you are seeing as it appears in *BOTH* pages of results -- along with the 
cursorMark value in use on both of those pages.


: (YYYY-MM-DD HH:MM.SS)), score. In this Solr community post 
: 
(https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
 
: Shawn Heisey suggests:

...that post was *NOT* about using cursorMark -- it was plain old regular 
pagination, where even on a single core/replica you can see a document 
X get "pushed" from page#1 to page#2 by updates/additions of some other
document Z that causes Z to sort "before" X.

With cursors this kind of "pushing other docs back" or "pushing other docs 
forward" doesn't exist because of the cursorMark.  The only way a doc 
*should* move is if its OWN sort values are updated, causing it to 
reposition itself.

But, if you have a static index, then it's *possible* that the last time 
your document X was updated, there was a "glitch" somewhere in the 
distributed update process, and the update didn't succeed in some 
replicas -- so the same document may have different sort values 
on different replicas.

: In the Solr query below for one of the example duplicates in question I 
: can see a search by the id returns only a single document. The 
: replication factor for the collection is 2 so the id will also appear in 
: this shards replica.  Taking into consideration Shawn's advice above, my 

If you've already identified a particular document where this has 
happened, then you can also verify/disprove my hypothesis by hitting each 
of the replicas that hosts this document with a request that looks like...

/solr/MyCollection_shard4_replica_n12/select?q=id:FOO&distrib=false
/solr/MyCollection_shard4_replica_n35/select?q=id:FOO&distrib=false

...and compare the results to see if all field values match


-Hoss
http://www.lucidworks.com/


Re: Good Open Source Front End for Solr

2019-11-07 Thread A Adel
It depends on the use case. There are several front-ends that work with
Solr; each has its own use cases, and they vary in how integrated they are.
Banana (https://github.com/lucidworks/banana) is a visualization front-end
that works only with Solr. It allows creating interactive, real-time
dashboards for data stored in Solr. It supports many textual and graphical
panels, as well as dynamic filtering and slice-and-dice exploration.

Apache Zeppelin is another option that allows building notebooks for many
backends, including Solr. To get it working, you have to install the
zeppelin-solr interpreter.

On Thu, Nov 7, 2019 at 10:59 AM Erik Hatcher  wrote:

> Blacklight: http://projectblacklight.org/ 
>
> ;)
>
>
>
> > On Nov 6, 2019, at 11:16 PM, Java Developer 
> wrote:
> >
> > Hi,
> >
> > What is the best open source front-end for Solr
> >
> > Thanks
>
> --
Sent from my iPhone


Using solr API to return csv results

2019-11-07 Thread rhys J
If I am using the Solr API to query the core, is there a way to tell how
many documents are found if I use wt=CSV?

Thanks,

Rhys


Re: Good Open Source Front End for Solr

2019-11-07 Thread David Hastings
Well, that's pretty slick.

On Thu, Nov 7, 2019 at 1:59 PM Erik Hatcher  wrote:

> Blacklight: http://projectblacklight.org/ 
>
> ;)
>
>
>
> > On Nov 6, 2019, at 11:16 PM, Java Developer 
> wrote:
> >
> > Hi,
> >
> > What is the best open source front-end for Solr
> >
> > Thanks
>
>


Re: Good Open Source Front End for Solr

2019-11-07 Thread Erik Hatcher
Blacklight: http://projectblacklight.org/ 

;)



> On Nov 6, 2019, at 11:16 PM, Java Developer  wrote:
> 
> Hi,
> 
> What is the best open source front-end for Solr
> 
> Thanks



Re: Solr healthcheck fails all the time

2019-11-07 Thread Houston Putman
Hello,

Could you provide some more information about your cloud, for example:

   - The number of requests that it handles per minute
   - How much data you are indexing
   - If there is any memory pressure

The ping handler merely sends a query to the collection and makes sure that
it responds healthily. Can you check your solrconfig.xml to see what query
the ping handler is sending? The ping query may be more expensive than you
think. Also, is the ping using distrib=true or false in the query?
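
For reference, a ping handler in solrconfig.xml typically looks something like
this (the query string here is only an example; yours may be heavier):

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <str name="q">solrpingquery</str>
      </lst>
    </requestHandler>

Note that the health checker in the stacktrace below uses a 1 second read
timeout (timeout=1), so any ping query slower than that will mark the node as
DOWN.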

On Wed, Nov 6, 2019 at 6:45 PM amruth  wrote:

> I am running Solr Cloud 6.6 and all the nodes fail the healthcheck too
> frequently with a *Read timed out* error. Here is the stacktrace:
>
> http://solr-host1:8983/solr/collection1/admin/ping is DOWN, error:
> HTTPConnectionPool(host='solr-host1', port=8983): Read timed out. (read
> timeout=1). Connection failed after 1001 ms
>
> Can someone please explain why it fails all the time (at least once every
> 10 minutes)?
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Walter Underwood
I normally use a weight of 8 for the most important field, like title. Other 
fields might get a 4 or 2.

I add a “pf” field with the weights doubled, so that phrase matches have a 
higher weight.
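
For instance (the field names and exact weights are only illustrative), an 
edismax handler following that scheme might contain:

    <str name="qf">title^8 keywords^4 body^2</str>
    <str name="pf">title^16 keywords^8 body^4</str>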

The weight of 8 comes from experience at Infoseek and Inktomi, two early web 
search engines. With different relevance algorithms and totally different 
evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML 
titles. With the two radically different systems getting the same number, I 
decided that was a property of the documents, not of the search engines.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri  wrote:
> 
> Hi Wunder,
> 
> My indexer takes quite a few hours to run. I am shortening it to run 
> faster, but I also need to make sure it gives what we are expecting. This 
> implementation has been in place for more than four years and is massively used.
> 
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>> configuring Solr.
> I've inherited that implementation and I am really keen to improve it; what 
> would you recommend?
> 
> Cheers
> Guilherme
> 
>> On 7 Nov 2019, at 14:43, Walter Underwood  wrote:
>> 
>> Thanks for posting the files. Looking at schema.xml, I see that you still 
>> are using StopFilterFactory. The first advice we gave you was to remove that.
>> 
>> Remove StopFilterFactory everywhere and reindex.
>> 
>> You will continue to have problems matching stopwords until you do that.
>> 
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>> configuring Solr.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
>>> 
>>> Hi Paras, everyone
>>> 
>>> Thank you again for your inputs and suggestions. I'm sorry to hear you had 
>>> trouble with the attachments; I will host them somewhere and share the links. 
>>> I don't tweak my index: I get the data from the graph database, create 
>>> documents as they are, and save them to Solr.
>>> 
>>> So, I am sending the new analysis screen, querying the way you suggested, 
>>> along with the results, params, and the Solr query URL.
>>> 
>>> While running the queries you asked for, I found something really weird 
>>> (at least to me). By accident, I ended up querying using the default 
>>> handler (/select) and it worked. If I use the handler I must use, it sadly 
>>> doesn't work. I am posting both results, and I will also post the handlers.
>>> 
>>> Here is the link with all the files mentioned before
>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
>>> 
>>> If the link doesn't work www dot dropbox dot com slash sh slash 
>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>> 
>>> Thanks
>>> 
 On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
 
 Hi Guilherme.
 
 I am sending the analysis result and the json result as requested.
 
 
 Thanks for the effort. Luckily, I can see your attachments (low quality
 though).
 
 From the analysis screen, the analysis is working as expected. One of the
 reasons for query="lymphoid and *a* non-lymphoid cell" not matching
 document containing "Lymphoid and a non-Lymphoid cell" I can initially
 think of is: the stopword "a" is probably present in post-analysis either
 of query or index. Did you tweak your index time analysis after indexing?
 
 Do two things:
 
 1. Post the analysis screen for and index=*"Immunoregulatory
 interactions between a Lymphoid and a non-Lymphoid cell"* and
 "query=*"lymphoid
 and a non-lymphoid cell"*. Try hosting the image and providing the link
 here.
 2. Give the same JSON output as you have sent but this time with
 *"echoParams=all"*. Also, post the exact Solr query url.
 
 
 
 On Wed, 6 Nov 2019 at 21:07, Erick Erickson  
 wrote:
 
> I don’t see the attachments, maybe I deleted old e-mails or some such. The
> Apache server is fairly aggressive about stripping attachments though, so
> it’s also possible they didn’t make it through.
> 
>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri  wrote:
>> 
>> Thanks Erick.
>> 
>>> First, your index and analysis chains are considerably different, this
> can easily be a source of problems. In particular, using two different
> tokenizers is a huge red flag. I _strongly_ recommend against this unless
> you’re totally sure you understand the consequences. Additionally, your 
> use
> of the length filter is suspicious, especially since your problem 
> sta

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Guilherme Viteri
Hi Wunder,

My indexer takes quite a few hours to run. I am shortening it to run 
faster, but I also need to make sure it gives what we are expecting. This 
implementation has been in place for more than four years and is massively used.

> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
> configuring Solr.
I've inherited that implementation and I am really keen to improve it; what 
would you recommend?

Cheers
Guilherme

> On 7 Nov 2019, at 14:43, Walter Underwood  wrote:
> 
> Thanks for posting the files. Looking at schema.xml, I see that you still are 
> using StopFilterFactory. The first advice we gave you was to remove that.
> 
> Remove StopFilterFactory everywhere and reindex.
> 
> You will continue to have problems matching stopwords until you do that.
> 
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
> configuring Solr.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
>> 
>> Hi Paras, everyone
>> 
>> Thank you again for your inputs and suggestions. I'm sorry to hear you had 
>> trouble with the attachments; I will host them somewhere and share the links. 
>> I don't tweak my index: I get the data from the graph database, create 
>> documents as they are, and save them to Solr.
>> 
>> So, I am sending the new analysis screen, querying the way you suggested, 
>> along with the results, params, and the Solr query URL.
>> 
>> While running the queries you asked for, I found something really weird 
>> (at least to me). By accident, I ended up querying using the default 
>> handler (/select) and it worked. If I use the handler I must use, it sadly 
>> doesn't work. I am posting both results, and I will also post the handlers as 
>> well.
>> 
>> Here is the link with all the files mentioned before
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
>> 
>> If the link doesn't work www dot dropbox dot com slash sh slash 
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>> 
>> Thanks
>> 
>>> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
>>> 
>>> Hi Guilherme.
>>> 
>>> I am sending the analysis result and the json result as requested.
>>> 
>>> 
>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>> though).
>>> 
>>> From the analysis screen, the analysis is working as expected. One of the
>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>> think of is: the stopword "a" is probably present in post-analysis either
>>> of query or index. Did you tweak your index time analysis after indexing?
>>> 
>>> Do two things:
>>> 
>>> 1. Post the analysis screen for and index=*"Immunoregulatory
>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>> "query=*"lymphoid
>>> and a non-lymphoid cell"*. Try hosting the image and providing the link
>>> here.
>>> 2. Give the same JSON output as you have sent but this time with
>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>> 
>>> 
>>> 
>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson  wrote:
>>> 
 I don’t see the attachments, maybe I deleted old e-mails or some such. The
 Apache server is fairly aggressive about stripping attachments though, so
 it’s also possible they didn’t make it through.
 
> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri  wrote:
> 
> Thanks Erick.
> 
>> First, your index and analysis chains are considerably different, this
 can easily be a source of problems. In particular, using two different
 tokenizers is a huge red flag. I _strongly_ recommend against this unless
 you’re totally sure you understand the consequences. Additionally, your use
 of the length filter is suspicious, especially since your problem statement
 is about the addition of a single letter term and the min length allowed on
 that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
 filtered out in both cases, but maybe you’ve found something odd about the
 interactions.
> I will investigate the min length and post the results later.
> 
>> Second, I have no idea what this will do. Are the equal signs typos?
 Used by custom code?
> This is the URL in my application, not Solr params. That's the query string.
> 
>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
 all the params with an equal-sign are totally ignored unless it’s just a
 typo.
> This is part of the application. Species will be used later on in Solr
 to filter out the results. That's not Solr; those are my app params.
> 
>> Third, the easiest way to see what’s happening under the covers is to
 add “&debug=true” to the query and look at the parsed query.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread David Hastings
Ha, funny enough I still use qf/pf boosts starting at 100 and going down;
it gives me room to add boosting to more fields without making them equal. Maybe
excessive, but I haven't noticed a performance issue.

On Thu, Nov 7, 2019 at 9:44 AM Walter Underwood 
wrote:

> Thanks for posting the files. Looking at schema.xml, I see that you still
> are using StopFilterFactory. The first advice we gave you was to remove
> that.
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you do that.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
> >
> > Hi Paras, everyone
> >
> > Thank you again for your inputs and suggestions. I'm sorry to hear you had
> trouble with the attachments; I will host them somewhere and share the links.
> > I don't tweak my index: I get the data from the graph database, create
> documents as they are, and save them to Solr.
> >
> > So, I am sending the new analysis screen, querying the way you suggested,
> along with the results, params, and the Solr query URL.
> >
> > While running the queries you asked for, I found something really
> weird (at least to me). By accident, I ended up querying using the
> default handler (/select) and it worked. If I use the handler I must use,
> it sadly doesn't work. I am posting both results, and I will also post the
> handlers as well.
> >
> > Here is the link with all the files mentioned before
> >
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>  >
> > If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> >
> > Thanks
> >
> >> On 7 Nov 2019, at 05:23, Paras Lehana 
> wrote:
> >>
> >> Hi Guilherme.
> >>
> >> I am sending the analysis result and the json result as requested.
> >>
> >>
> >> Thanks for the effort. Luckily, I can see your attachments (low quality
> >> though).
> >>
> >> From the analysis screen, the analysis is working as expected. One of
> the
> >> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> >> document containing "Lymphoid and a non-Lymphoid cell" I can initially
> >> think of is: the stopword "a" is probably present in post-analysis
> either
> >> of query or index. Did you tweak your index time analysis after
> indexing?
> >>
> >> Do two things:
> >>
> >>  1. Post the analysis screen for and index=*"Immunoregulatory
> >>  interactions between a Lymphoid and a non-Lymphoid cell"* and
> >> "query=*"lymphoid
> >>  and a non-lymphoid cell"*. Try hosting the image and providing the link
> >>  here.
> >>  2. Give the same JSON output as you have sent but this time with
> >>  *"echoParams=all"*. Also, post the exact Solr query url.
> >>
> >>
> >>
> >> On Wed, 6 Nov 2019 at 21:07, Erick Erickson 
> wrote:
> >>
> >>> I don’t see the attachments, maybe I deleted old e-mails or some such.
> The
> >>> Apache server is fairly aggressive about stripping attachments though,
> so
> >>> it’s also possible they didn’t make it through.
> >>>
>  On Nov 6, 2019, at 9:28 AM, Guilherme Viteri 
> wrote:
> 
>  Thanks Erick.
> 
> > First, your index and analysis chains are considerably different,
> this
> >>> can easily be a source of problems. In particular, using two different
> >>> tokenizers is a huge red flag. I _strongly_ recommend against this
> unless
> >>> you’re totally sure you understand the consequences. Additionally,
> your use
> >>> of the length filter is suspicious, especially since your problem
> statement
> >>> is about the addition of a single letter term and the min length
> allowed on
> >>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> >>> filtered out in both cases, but maybe you’ve found something odd about
> the
> >>> interactions.
>  I will investigate the min length and post the results later.
> 
> > Second, I have no idea what this will do. Are the equal signs typos?
> >>> Used by custom code?
>  This is the URL in my application, not Solr params. That's the query
> string.
> 
> > What does “species=“ do? That’s not Solr syntax, so it’s likely that
> >>> all the params with an equal-sign are totally ignored unless it’s just
> a
> >>> typo.
>  This is part of the application. Species will be used later on in Solr
> >>> to filter out the results. That's not Solr; those are my app params.
> 
> > Third, the easiest way to see what’s happening under the covers is to
> >>> add “&debug=true” to the query and look at the parsed query. Ignore
> all the
> >>> relevance calculations for the nonce, or specify “&debug=query” to skip
> >>> that part.
>  The two json files I've sent have debugQuery=on and the explain tag is present.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Walter Underwood
Thanks for posting the files. Looking at schema.xml, I see that you still are 
using StopFilterFactory. The first advice we gave you was to remove that.

Remove StopFilterFactory everywhere and reindex.

You will continue to have problems matching stopwords until you do that.
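
For illustration, a field type with the stop filter removed might look like 
this (the type and tokenizer names are just an example; keep whatever else 
your chain already does):

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- solr.StopFilterFactory removed so terms like "and" and "a" are indexed -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>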

In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
don’t think I’ve ever used a weight higher than 16 in a dozen years of 
configuring Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
> 
> Hi Paras, everyone
> 
> Thank you again for your inputs and suggestions. I'm sorry to hear you had 
> trouble with the attachments; I will host them somewhere and share the links. 
> I don't tweak my index: I get the data from the graph database, create 
> documents as they are, and save them to Solr.
> 
> So, I am sending the new analysis screen, querying the way you suggested, along 
> with the results, params, and the Solr query URL.
> 
> While running the queries you asked for, I found something really weird 
> (at least to me). By accident, I ended up querying using the default 
> handler (/select) and it worked. If I use the handler I must use, it sadly 
> doesn't work. I am posting both results, and I will also post the handlers as 
> well.
> 
> Here is the link with all the files mentioned before
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
> 
> If the link doesn't work www dot dropbox dot com slash sh slash 
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> 
> Thanks
> 
>> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
>> 
>> Hi Guilherme.
>> 
>> I am sending the analysis result and the json result as requested.
>> 
>> 
>> Thanks for the effort. Luckily, I can see your attachments (low quality
>> though).
>> 
>> From the analysis screen, the analysis is working as expected. One of the
>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>> think of is: the stopword "a" is probably present in post-analysis either
>> of query or index. Did you tweak your index time analysis after indexing?
>> 
>> Do two things:
>> 
>>  1. Post the analysis screen for and index=*"Immunoregulatory
>>  interactions between a Lymphoid and a non-Lymphoid cell"* and
>> "query=*"lymphoid
>>  and a non-lymphoid cell"*. Try hosting the image and providing the link
>>  here.
>>  2. Give the same JSON output as you have sent but this time with
>>  *"echoParams=all"*. Also, post the exact Solr query url.
>> 
>> 
>> 
>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson  wrote:
>> 
>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>> Apache server is fairly aggressive about stripping attachments though, so
>>> it’s also possible they didn’t make it through.
>>> 
 On Nov 6, 2019, at 9:28 AM, Guilherme Viteri  wrote:
 
 Thanks Erick.
 
> First, your index and analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
 I will investigate the min length and post the results later.
 
> Second, I have no idea what this will do. Are the equal signs typos?
>>> Used by custom code?
 This is the URL in my application, not Solr params. That's the query string.
 
> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>> all the params with an equal-sign are totally ignored unless it’s just a
>>> typo.
 This is part of the application. Species will be used later on in Solr
>>> to filter out the results. That's not Solr; those are my app params.
 
> Third, the easiest way to see what’s happening under the covers is to
>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>> that part.
 The two json files I've sent have debugQuery=on and the explain tag
>>> is present.
 I will try searching the way you mentioned.
 
 Thanks for your inputs
 
 Guilherme
 
> On 6 Nov 2019, at 14:14, Erick Erickson 
>>> wrote:
> 
> Fwd to another server
> 
> First, your index and analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Guilherme Viteri
Hi Paras, everyone

Thank you again for your inputs and suggestions. I'm sorry to hear you had 
trouble with the attachments; I will host them somewhere and share the links. 
I don't tweak my index: I get the data from the graph database, create 
documents as they are, and save them to Solr.

So, I am sending the new analysis screen, querying the way you suggested, along 
with the results, params, and the Solr query URL.

While running the queries you asked for, I found something really weird 
(at least to me). By accident, I ended up querying using the default 
handler (/select) and it worked. If I use the handler I must use, it sadly 
doesn't work. I am posting both results, and I will also post the handlers as 
well.

Here is the link with all the files mentioned before
https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 

If the link doesn't work www dot dropbox dot com slash sh slash 
fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0

Thanks

> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
> 
> Hi Guilherme.
> 
> I am sending the analysis result and the json result as requested.
> 
> 
> Thanks for the effort. Luckily, I can see your attachments (low quality
> though).
> 
> From the analysis screen, the analysis is working as expected. One of the
> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> document containing "Lymphoid and a non-Lymphoid cell" I can initially
> think of is: the stopword "a" is probably present in post-analysis either
> of query or index. Did you tweak your index time analysis after indexing?
> 
> Do two things:
> 
>   1. Post the analysis screen for and index=*"Immunoregulatory
>   interactions between a Lymphoid and a non-Lymphoid cell"* and
> "query=*"lymphoid
>   and a non-lymphoid cell"*. Try hosting the image and providing the link
>   here.
>   2. Give the same JSON output as you have sent but this time with
>   *"echoParams=all"*. Also, post the exact Solr query url.
> 
> 
> 
> On Wed, 6 Nov 2019 at 21:07, Erick Erickson  wrote:
> 
>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>> Apache server is fairly aggressive about stripping attachments though, so
>> it’s also possible they didn’t make it through.
>> 
>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri  wrote:
>>> 
>>> Thanks Erick.
>>> 
 First, your index and analysis chains are considerably different, this
>> can easily be a source of problems. In particular, using two different
>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>> you’re totally sure you understand the consequences. Additionally, your use
>> of the length filter is suspicious, especially since your problem statement
>> is about the addition of a single letter term and the min length allowed on
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>> filtered out in both cases, but maybe you’ve found something odd about the
>> interactions.
>>> I will investigate the min length and post the results later.
>>> 
 Second, I have no idea what this will do. Are the equal signs typos?
>> Used by custom code?
>>> This is the URL in my application, not Solr params. That's the query string.
>>> 
 What does “species=“ do? That’s not Solr syntax, so it’s likely that
>> all the params with an equal-sign are totally ignored unless it’s just a
>> typo.
>>> This is part of the application. Species will be used later on in Solr
>> to filter out the results. That's not Solr; those are my app params.
>>> 
 Third, the easiest way to see what’s happening under the covers is to
>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>> relevance calculations for the nonce, or specify “&debug=query” to skip
>> that part.
>>> The two json files I've sent have debugQuery=on and the explain tag
>> is present.
>>> I will try searching the way you mentioned.
>>> 
>>> Thanks for your inputs
>>> 
>>> Guilherme
>>> 
 On 6 Nov 2019, at 14:14, Erick Erickson 
>> wrote:
 
 Fwd to another server
 
 First, your index and analysis chains are considerably different, this
>> can easily be a source of problems. In particular, using two different
>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>> you’re totally sure you understand the consequences. Additionally, your use
>> of the length filter is suspicious, especially since your problem statement
>> is about the addition of a single letter term and the min length allowed on
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>> filtered out in both cases, but maybe you’ve found something odd about the
>> interactions.
 
 Second, I have no idea what this will do. Are the equal signs typos?
>> Used by custom code?
 
>> 
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&specie

Re: Cursor mark page duplicates

2019-11-07 Thread Erick Erickson
Dwane:

Nice writeup. This is puzzling. First, theoretically the two replicas shouldn’t 
have any effect. Shawn’s comment was more that somehow two _different_ shards 
had a duplicate ID.

Do both replicas have exactly the same document count? You can find this out by 
“..solr/collection1_shard1_replica_n1/select?q=*:*&distrib=false”. The “distrib=false” 
will query _only_ the replica it’s pointed to. I’m wondering if somehow the 
replicas are out of sync and this is a crude test.

If you can record the IDs when this happens and use the above trick to see 
whether there is anything unexpected about the returns when you look at, say, 
the 5 docs before the repeated one and the 5 docs after. They should, of 
course, be the exact same.

You could also use the "&distrib=false” trick to pull all the IDs from the two 
replicas and see if they all match with a streaming expression.
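
One concrete way to pull the IDs (the paths are illustrative; note /export 
requires docValues on the fl and sort fields):

/solr/MyCollection_shard4_replica_n12/export?q=*:*&fl=id&sort=id+asc&distrib=false
/solr/MyCollection_shard4_replica_n35/export?q=*:*&fl=id&sort=id+asc&distrib=false

Save both outputs and diff them.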

If all the IDs are the same on both replicas, I haven’t a clue…

Best,
Erick

> On Nov 7, 2019, at 5:34 AM, Dwane Hall  wrote:
> 
> Hey Solr community,
> 
> I'm using Solr's cursor mark feature and noticing duplicates when paging 
> through results.   The duplicate records happen intermittently and appear at 
> the end of one page, and the beginning of the next (but not on all pages 
> through the results). So if rows=20 the duplicate records would be document 
> 20 on page 1, and document 21 on page 2.  The document ids come from a 
> database and that field is a unique primary key, so I'm confident that there 
> are no duplicate document ids in my corpus.  Additionally, no index updates 
> are occurring in the index (it's completely static).  My result sort order is 
> id (a string representation of a timestamp (YYYY-MM-DD HH:MM.SS)), score. 
> In this Solr community post 
> (https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
>  Shawn Heisey suggests:
> 
> 
> "There are two ways this can happen.  One is that the index has changed
> between different queries, pushing or pulling results between the end of
> one page and the beginning of the next page.  The other is having the
> same uniqueKey value in more than one shard."
> 
> In the Solr query below for one of the example duplicates in question I can 
> see a search by the id returns only a single document. The replication factor 
> for the collection is 2 so the id will also appear in this shards replica.  
> Taking into consideration Shawn's advice above, my question is will having a 
> shard replica still count as the document having a duplicate id in another 
> shard and potentially introduce duplicates into my paged results?  If not, 
> could anyone suggest another possible scenario where duplicates could 
> potentially be introduced?
> 
> As always any advice would be greatly appreciated,
> 
> Thanks,
> 
> Dwane
> 
> Environment
> Solr cloud (7.7.2)
> 8 shard collection, replication factor 2
> 
> {
> 
>  "responseHeader":{
> 
>"zkConnected":true,
> 
>"status":0,
> 
>"QTime":2072,
> 
>"params":{
> 
>  "q":"id:myUUID(-MM-DD HH:MM.SS)",
> 
>  "fl":"id,[shard]"}},
> 
>  "response":{"numFound":1,"start":0,"maxScore":17.601822,"docs":[
> 
>  {
> 
>"id":"myUUID(-MM-DD HH:MM.SS)",
> 
>
> "[shard]":"https://solr1:9014/solr/MyCollection_shard4_replica_n12/|https://solr2:9011/solr/MyCollection_shard4_replica_n35/"}]
> 
>  }}
> 
> 



Re: Query regarding truncated Date Sort

2019-11-07 Thread Erick Erickson
The easiest and most efficient would be to store the date (or a copy) at day 
resolution and sort on that field instead.
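
A sketch of that (the field name is hypothetical; populate it at index time 
with the time part zeroed, e.g. 2019-10-25T00:00:00Z):

    <field name="Published_Day" type="pdate" indexed="true" stored="false" docValues="true"/>

Then sort with sort=Published_Day asc (add the uniqueKey as a tiebreaker if 
you need a stable order).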

> On Nov 7, 2019, at 3:00 AM, Paras Lehana  wrote:
> 
> Hi Inderjeet,
> 
> Wouldn't sorting on the default format yield documents sorted date-wise?
> The time won't impact the date order, or do you have
> different timezones as well?
> 
> On Thu, 7 Nov 2019 at 12:52, Inderjeet Singh  wrote:
> 
>> Hi
>> 
>> I am currently using solr 7.1.0.  I have indexed a few documents which have
>> a date associated with them.
>> The Managed schema configuration for that field is :
>> > docValues="true"/>
>> > indexed="true" stored="true"/>
>> 
>> Example of few values are :
>>  "Published_Date":"2019-10-25T00:00:00Z"
>> "Published_Date":"2019-10-21T10:00:00Z"
>> 
>> I want to sort the documents based on these Published_Date values, but
>> only on the day (not the time/timezone),
>> i.e. sorting on the basis of '2019-10-25'.
>> 
>> Please help me in finding how I could achieve this.
>> 
>> Regards
>> Inderjeet Singh
>> 
> 
> 
> -- 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 



Cursor mark page duplicates

2019-11-07 Thread Dwane Hall
Hey Solr community,

I'm using Solr's cursor mark feature and noticing duplicates when paging 
through results.   The duplicate records happen intermittently and appear at 
the end of one page, and the beginning of the next (but not on all pages 
through the results). So if rows=20 the duplicate records would be document 20 
on page 1, and document 21 on page 2.  The document ids come from a database 
and that field is a unique primary key, so I'm confident that there are no 
duplicate document ids in my corpus.  Additionally, no index updates are 
occurring in the index (it's completely static).  My result sort order is id (a 
string representation of a timestamp (YYYY-MM-DD HH:MM.SS)), score. In this 
Solr community post 
(https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
 Shawn Heisey suggests:


"There are two ways this can happen.  One is that the index has changed
between different queries, pushing or pulling results between the end of
one page and the beginning of the next page.  The other is having the
same uniqueKey value in more than one shard."

In the Solr query below for one of the example duplicates in question I can see 
a search by the id returns only a single document. The replication factor for 
the collection is 2 so the id will also appear in this shards replica.  Taking 
into consideration Shawn's advice above, my question is will having a shard 
replica still count as the document having a duplicate id in another shard and 
potentially introduce duplicates into my paged results?  If not, could anyone 
suggest another possible scenario where duplicates could potentially be 
introduced?

As always any advice would be greatly appreciated,

Thanks,

Dwane

Environment
Solr cloud (7.7.2)
8 shard collection, replication factor 2

{

  "responseHeader":{

"zkConnected":true,

"status":0,

"QTime":2072,

"params":{

  "q":"id:myUUID(-MM-DD HH:MM.SS)",

  "fl":"id,[shard]"}},

  "response":{"numFound":1,"start":0,"maxScore":17.601822,"docs":[

  {

"id":"myUUID(-MM-DD HH:MM.SS)",


"[shard]":"https://solr1:9014/solr/MyCollection_shard4_replica_n12/|https://solr2:9011/solr/MyCollection_shard4_replica_n35/"}]

  }}
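
For reference, a cursorMark loop in SolrJ looks roughly like this (the 
collection name is hypothetical; note the sort must include the uniqueKey, 
which also acts as the tiebreaker between documents with equal sort values):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(20);
    q.setSort(SolrQuery.SortClause.asc("id"));   // uniqueKey in the sort
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query("MyCollection", q);
      String next = rsp.getNextCursorMark();
      // process rsp.getResults() here ...
      if (cursorMark.equals(next)) {
        done = true;   // unchanged cursor means the result set is exhausted
      }
      cursorMark = next;
    }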




Re: Query regarding truncated Date Sort

2019-11-07 Thread Paras Lehana
Hi Inderjeet,

Wouldn't sorting on the default format yield documents sorted date-wise?
The time won't impact the date order, or do you have
different timezones as well?

On Thu, 7 Nov 2019 at 12:52, Inderjeet Singh  wrote:

> Hi
>
> I am currently using solr 7.1.0.  I have indexed a few documents which have
> a date associated with them.
> The Managed schema configuration for that field is :
>   docValues="true"/>
>   indexed="true" stored="true"/>
>
> Example of few values are :
>   "Published_Date":"2019-10-25T00:00:00Z"
> "Published_Date":"2019-10-21T10:00:00Z"
>
> I want to sort the documents based on these Published_Date values, but
> only on the day (not the time/timezone),
> i.e. sorting on the basis of '2019-10-25'.
>
> Please help me in finding how I could achieve this.
>
> Regards
> Inderjeet Singh
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*
