SOLR Cloud: 1500+ threads are in TIMED_WAITING status

2018-04-03 Thread Doss
We have a Solr (7.0.1) cloud of 3 Linux VM instances with 4 CPUs and 90 GB RAM
each, with a ZooKeeper (3.4.11) ensemble running on the same machines. We have
130 cores with an overall size of 45 GB. No sharding; almost all VMs have the
same copy of the data. These nodes are behind a load balancer.

Index Config:
=

300

   30
   100
   30.0



   18
   6


Commit Configs:
===

<autoCommit>
   <maxTime>${solr.autoCommit.maxTime:60}</maxTime>
   <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
   <maxTime>${solr.autoSoftCommit.maxTime:6}</maxTime>
</autoSoftCommit>
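For reference: the ${...:default} entries above are system-property placeholders,
and the maxTime values are in milliseconds. A sketch of how they can be overridden
at startup without editing solrconfig.xml; the values here are illustrative only:

bin/solr start -c -Dsolr.autoCommit.maxTime=60000 -Dsolr.autoSoftCommit.maxTime=6000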



We do 3500 inserts/updates per second spread across all 130 cores; we have yet
to start using selects effectively.

The problem we are facing is that at times the thread count suddenly increases
sharply, which leaves Solr unresponsive or returning 503 responses to client
(PHP HTTP cURL) requests.

Today (04-04-2018) the thread dump shows that the peak went up to 13000+ threads.

Please help me fix this issue. Thanks!
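
For monitoring, thread dumps like the samples below can also be captured
periodically from Solr's admin API; a sketch, using one of the hosts mentioned above:

curl "http://172.10.2.19:8983/solr/admin/info/threads?wt=json" > threads-$(date +%s).json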


Sample Threads:
===

1.updateExecutor-2-thread-25746-processing-http:
172.10.2.19:8983//solr//profileviews x:profileviews r:core_node2
n:172.10.2.18:8983_solr s:shard1 c:profileviews", "state":"TIMED_WAITING",
"lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@297be1d5",
"cpuTime":"162.4371ms", "userTime":"120.ms",
"stackTrace":["sun.misc.Unsafe.park(Native Method)",
"java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)",
"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)",

2. ERROR true HttpSolrCall
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
Async exception during distributed update: Error from server at
172.10.2.18:8983/solr/profileviews: Server Error request:
http://172.10.2.18:8983/solr/profileviews/update?update.distrib=TOLEADER=http%3A%2F%2F172.10.2.19%3A8983%2Fsolr%2Fprofileviews%2F=javabin=2
Remote error message: empty String

3. So Many Threads like:
"name":"qtp959447386-21",
"state":"TIMED_WAITING",

"lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6a1a2bf4
",
"cpuTime":"4522.0837ms",
"userTime":"3770.ms",
"stackTrace":["sun.misc.Unsafe.park(Native Method)",

"java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)",

"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)",

"org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)",

"org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)",

"org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)",

"org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)",
  "java.lang.Thread.run(Thread.java:748)"


Re: Solr cloud schema and schemaless

2018-04-03 Thread Erick Erickson
The schema mode is _per collection_, not per node. So there's no trouble mixing
replicas from collection A running schema model 1 with replicas from
collection B
running a different schema model.

That said, schemaless is _not_ recommended for production unless you have
total control over the ETL chain and can guarantee that documents conform to
some standard. Schemaless does its best, but it guesses based on the first
time it sees a field. So if the first doc has field X with a value of 1, it
infers that this field is an int type. If doc2 has a value of 1.0, the doc
fails with a parsing error.
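
A minimal illustration of that failure mode; the collection name and field are
made up here:

curl 'http://localhost:8983/solr/mycoll/update/json/docs' -H 'Content-Type: application/json' -d '{"id":"1","x":1}'
# "x" is now guessed as a numeric field
curl 'http://localhost:8983/solr/mycoll/update/json/docs' -H 'Content-Type: application/json' -d '{"id":"2","x":1.0}'
# this second update fails with a parsing error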

FYI,
Erick

On Tue, Apr 3, 2018 at 2:39 PM, Kojo  wrote:
> Hi Solrs,
> We have a Solr cloud running in three nodes.
> Five collections are running in schema mode and we would like to create
> another collection running schemalles.
>
> Does it fit all together schema and schemales on the same nodes?
>
> I am not sure, because on this page it starts solr in schemalles mode but I
> start Solr cloud whithout this option.
>
> https://lucene.apache.org/solr/guide/6_6/schemaless-mode.html
>
> bin/solr start -e schemaless
>
>
>
> Thank you all!


Re: Largest number of indexed documents used by Solr

2018-04-03 Thread Yago Riveiro
Hi,

In my company we are running a 12-node cluster with 10 (American) billion
documents across 12 shards / 2 replicas.

We do mainly faceting queries with very reasonable performance.

36 million documents is not an issue; you can handle that volume of documents
with 2 nodes with SSDs and 32 GB of RAM.

Regards.

--

Yago Riveiro

On 4 Apr 2018 02:15 +0100, Abhi Basu <9000r...@gmail.com>, wrote:
> We have tested Solr 4.10 with 200 million docs with avg doc size of 250 KB.
> No issues with performance when using 3 shards / 2 replicas.
>
>
>
> On Tue, Apr 3, 2018 at 8:12 PM, Steven White  wrote:
>
> > Hi everyone,
> >
> > I'm about to start a project that requires indexing 36 million records
> > using Solr 7.2.1. Each record range from 500 KB to 0.25 MB where the
> > average is 0.1 MB.
> >
> > Has anyone indexed this number of records? What are the things I should
> > worry about? And out of curiosity, what is the largest number of records
> > that Solr has indexed which is published out there?
> >
> > Thanks
> >
> > Steven
> >
>
>
>
> --
> Abhi Basu


Re: Largest number of indexed documents used by Solr

2018-04-03 Thread Walter Underwood
We have a 24 million document index. Our documents are a bit smaller than 
yours, homework problems.

The Hathi Trust probably has the record. They haven’t updated their blog for a 
while, but they were at 11 million books and billions of pages in 2014.

https://www.hathitrust.org/blogs/large-scale-search

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 3, 2018, at 6:12 PM, Steven White  wrote:
> 
> Hi everyone,
> 
> I'm about to start a project that requires indexing 36 million records
> using Solr 7.2.1.  Each record range from 500 KB to 0.25 MB where the
> average is 0.1 MB.
> 
> Has anyone indexed this number of records?  What are the things I should
> worry about?  And out of curiosity, what is the largest number of records
> that Solr has indexed which is published out there?
> 
> Thanks
> 
> Steven



Re: Largest number of indexed documents used by Solr

2018-04-03 Thread Abhi Basu
We have tested Solr 4.10 with 200 million docs with avg doc size of 250 KB.
No issues with performance when using 3 shards / 2 replicas.



On Tue, Apr 3, 2018 at 8:12 PM, Steven White  wrote:

> Hi everyone,
>
> I'm about to start a project that requires indexing 36 million records
> using Solr 7.2.1.  Each record range from 500 KB to 0.25 MB where the
> average is 0.1 MB.
>
> Has anyone indexed this number of records?  What are the things I should
> worry about?  And out of curiosity, what is the largest number of records
> that Solr has indexed which is published out there?
>
> Thanks
>
> Steven
>



-- 
Abhi Basu


Largest number of indexed documents used by Solr

2018-04-03 Thread Steven White
Hi everyone,

I'm about to start a project that requires indexing 36 million records
using Solr 7.2.1.  Each record ranges from 500 KB to 0.25 MB, where the
average is 0.1 MB.

Has anyone indexed this number of records?  What are the things I should
worry about?  And out of curiosity, what is the largest number of records
that Solr has indexed which is published out there?

Thanks

Steven


Re: How do I create a schema file for FIX data in Solr

2018-04-03 Thread Raymond Xie
I'm talking to the author to find out, thanks.

~~~sent from my cell phone, sorry if there is any typo

Adhyan Arizki wrote on Tue, Apr 3, 2018 at 1:38 PM:

> Raymond,
>
> Seems you are having issue with the node environment. Likely the path isn't
> registered correctly judging from the error message. Note though, this is
> no longer related to Solr issue.
>
> On Tue, 3 Apr 2018, 23:00 Raymond Xie,  wrote:
>
> > Hi Rick,
> >
> > Following your suggestion I found
> https://github.com/SunGard-Labs/fix2json
> > which seems to be a fit;
> >
> > I followed the installation instruction and successfully installed the
> > fix2json on my Ubuntu host.
> >
> > sudo npm install -g fix2json
> >
> > I ran the same command as indicated in the git:
> >
> > fix2json -p dict/FIX50SP2.CME.xml XCME_MD_GE_FUT_20160315.gz
> >
> >
> > and I received error of:
> >
> > /usr/bin/env: ‘node’: No such file or directory
> >
> > It would be appreciated if you can point out what is missing here?
> >
> > Thank you again for your kind help.
> >
> >
> >
> > **
> > *Sincerely yours,*
> >
> >
> > *Raymond*
> >
> > On Mon, Apr 2, 2018 at 9:30 AM, Raymond Xie 
> wrote:
> >
> > > Thank you Rick for the enlightening.
> > >
> > > I will get the FIX message parsed first and come back here later.
> > >
> > >
> > > **
> > > *Sincerely yours,*
> > >
> > >
> > > *Raymond*
> > >
> > > On Mon, Apr 2, 2018 at 9:15 AM, Rick Leir  wrote:
> > >
> > >> Google
> > >>fix to json,
> > >> there are a few interesting leads.
> > >>
> > >> On April 2, 2018 12:34:44 AM EDT, Raymond Xie 
> > >> wrote:
> > >> >Thank you, Shawn, Rick and other readers,
> > >> >
> > >> >To Shawn:
> > >> >
> > >> >For  *8=FIX.4.4 9=653 35=RIO* as an example, in the FIX standard: 8
> > >> >means BeginString, in this example, its value is  FIX.4.4.9, and 9
> > >> >means
> > >> >body length, it is 653 for this message, 35 is RIO, meaning the
> message
> > >> >type is RIO, 122 stands for OrigSendingTime and has a format of
> > >> >UTCTimestamp
> > >> >
> > >> >You can refer to this page for details: https://www.onixs.biz
> > >> >/fix-dictionary/4.2/fields_by_tag.html
> > >> >
> > >> >All the values are explained as string type.
> > >> >
> > >> >All the tag numbers are from FIX standard so it doesn't change (in my
> > >> >case)
> > >> >
> > >> >I expect a python program might be needed to parse the message and
> > >> >extract
> > >> >each tag's value, index is to be made on those extracted value as
> long
> > >> >as
> > >> >their field (tag) name.
> > >> >
> > >> >With index in place, ideally and naturally user will search for any
> > >> >keyword, however, in this case, most queries would be based on tag 37
> > >> >(Order ID) and 75 (Trade Date), there is another customized tag (not
> in
> > >> >the
> > >> >standard) Order Version to be queried on.
> > >> >
> > >> >I understand the parser creation would be a manual process, as long
> as
> > >> >I
> > >> >know or have a small sample program, I will do it myself and maybe
> > >> >adjust
> > >> >it as per need.
> > >> >
> > >> >To Rick:
> > >> >
> > >> >You mentioned creating JSON document, my understanding is a parser
> > >> >would be
> > >> >needed to generate that JSON document, do you have any existing
> example
> > >> >code?
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >Thank you guys very much.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >**
> > >> >*Sincerely yours,*
> > >> >
> > >> >
> > >> >*Raymond*
> > >> >
> > >> >On Sun, Apr 1, 2018 at 2:16 PM, Shawn Heisey 
> > >> >wrote:
> > >> >
> > >> >> On 4/1/2018 10:12 AM, Raymond Xie wrote:
> > >> >>
> > >> >>> FIX is a format standard of financial data. It contains lots of
> tags
> > >> >in
> > >> >>> number with value for the tag, like 8=asdf, where 8 is the tag and
> > >> >asdf is
> > >> >>> the tag's value. Each tag has its definition.
> > >> >>>
> > >> >>> The sample msg in FIX format was in the original question.
> > >> >>>
> > >> >>> All I need to do is to know how to paste the msg and get all tag's
> > >> >value.
> > >> >>>
> > >> >>> I found so far a parser is what I need to start with., But I am
> more
> > >> >>> concerning about how to create index in Solr on the extracted
> tag's
> > >> >value,
> > >> >>> that is the first step, the next would be to customize the
> dashboard
> > >> >for
> > >> >>> users to search with a value to find out which msg contains that
> > >> >value in
> > >> >>> which tag and present users the whole msg as proof.
> > >> >>>
> > >> >>
> > >> >> Most of Solr's functionality is provided by Lucene.  Lucene is a
> java
> > >> >API
> > >> >> that implements search functionality.  Solr bolts on some
> > >> >functionality on
> > >> >> top of Lucene, but doesn't 

Solr cloud schema and schemaless

2018-04-03 Thread Kojo
Hi Solrs,
We have a Solr cloud running in three nodes.
Five collections are running in schema mode and we would like to create
another collection running schemaless.

Can schema and schemaless collections fit together on the same nodes?

I am not sure, because on this page Solr is started in schemaless mode, but I
start Solr cloud without this option.

https://lucene.apache.org/solr/guide/6_6/schemaless-mode.html

bin/solr start -e schemaless



Thank you all!


Re: solr 5.2->7.2, suggester failure

2018-04-03 Thread David Hastings
Ah, thank you.
Turns out it was an experiment, so I removed them anyway and it's all good
now.

Since I'm here in the configuration for the new 7.x instances, I was going to
ask a side question.  A lot of my Java properties are old or have been
tweaked over time across a series of different machines, so at this point it's
a hodgepodge collection of settings and I'm not sure if there are any
glaring holes.  If someone could let me know if there is something I
definitely need to address, that would be awesome.  Some of these settings
date from Solr 1 all the way to now. This is running on machines with 142
GB RAM, collection indexes around 300 GB to 500 GB, on 2 TB SSDs:

-XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4 -XX:MaxTenuringThreshold=8 -XX:NewRatio=3
-XX:ParallelGCThreads=8 -XX:PretenureSizeThreshold=64m -XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/SSD2TB-01/solr-5.2.1/server/logs/solr_gc.log -Xms5m -Xmx5m
-Xss256k -verbose:gc



On Tue, Apr 3, 2018 at 2:50 PM, Kevin Risden  wrote:

> It looks like there were changes in Lucene 7.0 that limited the size of the
> automaton to prevent overflowing the stack.
>
> https://issues.apache.org/jira/browse/LUCENE-7914
>
> The commit being:
> https://github.com/apache/lucene-solr/commit/
> 7dde798473d1a8640edafb41f28ad25d17f25a2d
>
> Kevin Risden
>
> On Tue, Apr 3, 2018 at 1:45 PM, David Hastings <
> hastings.recurs...@gmail.com
> > wrote:
>
> > For data, its primarily a lot of garbage, around 200k titles, varying
> > length.  im actually looking through my application now to see if I even
> > still use it or if it was an early experiment.  I am just finding it odd
> > thats its failing in 7 but does fine on 5
> >
> > On Tue, Apr 3, 2018 at 2:41 PM, Erick Erickson 
> > wrote:
> >
> > > What kinds of things go into your title field? On first blush that's a
> > > bit odd for a multi-word title field since it treats the entire input
> > > as a single string. The code is trying to build a large FST to hold
> > > all of this data. Would AnalyzingInfixLookupFactory or similar make
> > > more sense?
> > >
> > > buildOnStartup and buildOnOptimize are other red flags. This means
> > > that every time you start up, the data for the title field is read
> > > from disk and the FST is built (or index if you use a different impl).
> > > On a large corpus this may take many minutes.
> > >
> > > Best,
> > > Erick
> > >
> > > On Tue, Apr 3, 2018 at 11:28 AM, David Hastings
> > >  wrote:
> > > > Hey all, I recently got a 7.2 instance up and running, and it seems
> to
> > be
> > > > going well however, I have ran into this when creating one of my
> > indexes,
> > > > and was wondering if anyone had a quick idea right off the top of
> their
> > > > head.
> > > >
> > > > solrconfig:
> > > >
> > > > 
> > > >   
> > > > fixspell
> > > > FuzzyLookupFactory
> > > >
> > > > string
> > > >
> > > > DocumentDictionaryFactory
> > > > title
> > > > true
> > > > true
> > > >   
> > > >
> > > >
> > > > received error:
> > > >
> > > >
> > > > ERROR true
> > > > SuggestComponent
> > > > Exception in building suggester index for: fixspell
> > > > java.lang.IllegalArgumentException: input automaton is too large:
> 1001
> > > > at
> > > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > > Operations.java:1298)
> > > > at
> > > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > > Operations.java:1306)
> > > > at
> > > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > > Operations.java:1306)
> > > >
> > > > .
> > > >
> > > > at
> > > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > > Operations.java:1306)
> > > > at
> > > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > > Operations.java:1306)
> > > > at
> > > > org.apache.lucene.util.automaton.Operations.
> topoSortStates(Operations.
> > > java:1275)
> > > > at
> > > > org.apache.lucene.search.suggest.analyzing.
> > > AnalyzingSuggester.replaceSep(AnalyzingSuggester.java:292)
> > > > at
> > > > org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.
> > > toAutomaton(AnalyzingSuggester.java:854)
> > > > at
> > > > org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(
> > > AnalyzingSuggester.java:430)
> > > > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190)
> > > > at
> > > > org.apache.solr.spelling.suggest.SolrSuggester.build(
> > > SolrSuggester.java:181)
> > > > at
> > > > org.apache.solr.handler.component.SuggestComponent$
> 

Re: solr 5.2->7.2, suggester failure

2018-04-03 Thread Kevin Risden
It looks like there were changes in Lucene 7.0 that limited the size of the
automaton to prevent overflowing the stack.

https://issues.apache.org/jira/browse/LUCENE-7914

The commit being:
https://github.com/apache/lucene-solr/commit/7dde798473d1a8640edafb41f28ad25d17f25a2d

Kevin Risden

On Tue, Apr 3, 2018 at 1:45 PM, David Hastings  wrote:

> For data, its primarily a lot of garbage, around 200k titles, varying
> length.  im actually looking through my application now to see if I even
> still use it or if it was an early experiment.  I am just finding it odd
> thats its failing in 7 but does fine on 5
>
> On Tue, Apr 3, 2018 at 2:41 PM, Erick Erickson 
> wrote:
>
> > What kinds of things go into your title field? On first blush that's a
> > bit odd for a multi-word title field since it treats the entire input
> > as a single string. The code is trying to build a large FST to hold
> > all of this data. Would AnalyzingInfixLookupFactory or similar make
> > more sense?
> >
> > buildOnStartup and buildOnOptimize are other red flags. This means
> > that every time you start up, the data for the title field is read
> > from disk and the FST is built (or index if you use a different impl).
> > On a large corpus this may take many minutes.
> >
> > Best,
> > Erick
> >
> > On Tue, Apr 3, 2018 at 11:28 AM, David Hastings
> >  wrote:
> > > Hey all, I recently got a 7.2 instance up and running, and it seems to
> be
> > > going well however, I have ran into this when creating one of my
> indexes,
> > > and was wondering if anyone had a quick idea right off the top of their
> > > head.
> > >
> > > solrconfig:
> > >
> > > 
> > >   
> > > fixspell
> > > FuzzyLookupFactory
> > >
> > > string
> > >
> > > DocumentDictionaryFactory
> > > title
> > > true
> > > true
> > >   
> > >
> > >
> > > received error:
> > >
> > >
> > > ERROR true
> > > SuggestComponent
> > > Exception in building suggester index for: fixspell
> > > java.lang.IllegalArgumentException: input automaton is too large: 1001
> > > at
> > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > Operations.java:1298)
> > > at
> > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > Operations.java:1306)
> > > at
> > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > Operations.java:1306)
> > >
> > > .
> > >
> > > at
> > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > Operations.java:1306)
> > > at
> > > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> > Operations.java:1306)
> > > at
> > > org.apache.lucene.util.automaton.Operations.topoSortStates(Operations.
> > java:1275)
> > > at
> > > org.apache.lucene.search.suggest.analyzing.
> > AnalyzingSuggester.replaceSep(AnalyzingSuggester.java:292)
> > > at
> > > org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.
> > toAutomaton(AnalyzingSuggester.java:854)
> > > at
> > > org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(
> > AnalyzingSuggester.java:430)
> > > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190)
> > > at
> > > org.apache.solr.spelling.suggest.SolrSuggester.build(
> > SolrSuggester.java:181)
> > > at
> > > org.apache.solr.handler.component.SuggestComponent$SuggesterListener.
> > buildSuggesterIndex(SuggestComponent.java:529)
> > > at
> > > org.apache.solr.handler.component.SuggestComponent$
> > SuggesterListener.newSearcher(SuggestComponent.java:511)
> > > at org.apache.solr.core.SolrCore.lambda$getSearcher$17(
> > SolrCore.java:2275)
> >
>


Re: solr 5.2->7.2, suggester failure

2018-04-03 Thread David Hastings
For data, it's primarily a lot of garbage, around 200k titles of varying
length.  I'm actually looking through my application now to see if I even
still use it or if it was an early experiment.  I am just finding it odd
that it's failing in 7 but does fine on 5.

On Tue, Apr 3, 2018 at 2:41 PM, Erick Erickson 
wrote:

> What kinds of things go into your title field? On first blush that's a
> bit odd for a multi-word title field since it treats the entire input
> as a single string. The code is trying to build a large FST to hold
> all of this data. Would AnalyzingInfixLookupFactory or similar make
> more sense?
>
> buildOnStartup and buildOnOptimize are other red flags. This means
> that every time you start up, the data for the title field is read
> from disk and the FST is built (or index if you use a different impl).
> On a large corpus this may take many minutes.
>
> Best,
> Erick
>
> On Tue, Apr 3, 2018 at 11:28 AM, David Hastings
>  wrote:
> > Hey all, I recently got a 7.2 instance up and running, and it seems to be
> > going well however, I have ran into this when creating one of my indexes,
> > and was wondering if anyone had a quick idea right off the top of their
> > head.
> >
> > solrconfig:
> >
> > 
> >   
> > fixspell
> > FuzzyLookupFactory
> >
> > string
> >
> > DocumentDictionaryFactory
> > title
> > true
> > true
> >   
> >
> >
> > received error:
> >
> >
> > ERROR true
> > SuggestComponent
> > Exception in building suggester index for: fixspell
> > java.lang.IllegalArgumentException: input automaton is too large: 1001
> > at
> > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> Operations.java:1298)
> > at
> > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> Operations.java:1306)
> > at
> > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> Operations.java:1306)
> >
> > .
> >
> > at
> > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> Operations.java:1306)
> > at
> > org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(
> Operations.java:1306)
> > at
> > org.apache.lucene.util.automaton.Operations.topoSortStates(Operations.
> java:1275)
> > at
> > org.apache.lucene.search.suggest.analyzing.
> AnalyzingSuggester.replaceSep(AnalyzingSuggester.java:292)
> > at
> > org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.
> toAutomaton(AnalyzingSuggester.java:854)
> > at
> > org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(
> AnalyzingSuggester.java:430)
> > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190)
> > at
> > org.apache.solr.spelling.suggest.SolrSuggester.build(
> SolrSuggester.java:181)
> > at
> > org.apache.solr.handler.component.SuggestComponent$SuggesterListener.
> buildSuggesterIndex(SuggestComponent.java:529)
> > at
> > org.apache.solr.handler.component.SuggestComponent$
> SuggesterListener.newSearcher(SuggestComponent.java:511)
> > at org.apache.solr.core.SolrCore.lambda$getSearcher$17(
> SolrCore.java:2275)
>


Re: solr 5.2->7.2, suggester failure

2018-04-03 Thread Erick Erickson
What kinds of things go into your title field? On first blush that's a
bit odd for a multi-word title field since it treats the entire input
as a single string. The code is trying to build a large FST to hold
all of this data. Would AnalyzingInfixLookupFactory or similar make
more sense?

buildOnStartup and buildOnOptimize are other red flags. This means
that every time you start up, the data for the title field is read
from disk and the FST is built (or index if you use a different impl).
On a large corpus this may take many minutes.
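
A sketch of the kind of change suggested here, keeping the same dictionary and
field; the component name and analyzer field type are assumptions:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">fixspell</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <!-- build on demand via suggest.build=true rather than on startup/optimize -->
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>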

Best,
Erick

On Tue, Apr 3, 2018 at 11:28 AM, David Hastings
 wrote:
> Hey all, I recently got a 7.2 instance up and running, and it seems to be
> going well however, I have ran into this when creating one of my indexes,
> and was wondering if anyone had a quick idea right off the top of their
> head.
>
> solrconfig:
>
> 
>   
> fixspell
> FuzzyLookupFactory
>
> string
>
> DocumentDictionaryFactory
> title
> true
> true
>   
>
>
> received error:
>
>
> ERROR true
> SuggestComponent
> Exception in building suggester index for: fixspell
> java.lang.IllegalArgumentException: input automaton is too large: 1001
> at
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1298)
> at
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
> at
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
>
> .
>
> at
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
> at
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
> at
> org.apache.lucene.util.automaton.Operations.topoSortStates(Operations.java:1275)
> at
> org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.replaceSep(AnalyzingSuggester.java:292)
> at
> org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.toAutomaton(AnalyzingSuggester.java:854)
> at
> org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(AnalyzingSuggester.java:430)
> at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190)
> at
> org.apache.solr.spelling.suggest.SolrSuggester.build(SolrSuggester.java:181)
> at
> org.apache.solr.handler.component.SuggestComponent$SuggesterListener.buildSuggesterIndex(SuggestComponent.java:529)
> at
> org.apache.solr.handler.component.SuggestComponent$SuggesterListener.newSearcher(SuggestComponent.java:511)
> at org.apache.solr.core.SolrCore.lambda$getSearcher$17(SolrCore.java:2275)


solr 5.2->7.2, suggester failure

2018-04-03 Thread David Hastings
Hey all, I recently got a 7.2 instance up and running, and it seems to be
going well; however, I have run into this when creating one of my indexes,
and was wondering if anyone had a quick idea right off the top of their
head.

solrconfig:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">fixspell</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="buildOnStartup">true</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>


received error:


ERROR true
SuggestComponent
Exception in building suggester index for: fixspell
java.lang.IllegalArgumentException: input automaton is too large: 1001
at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1298)
at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)

.

at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
at
org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1306)
at
org.apache.lucene.util.automaton.Operations.topoSortStates(Operations.java:1275)
at
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.replaceSep(AnalyzingSuggester.java:292)
at
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.toAutomaton(AnalyzingSuggester.java:854)
at
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(AnalyzingSuggester.java:430)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190)
at
org.apache.solr.spelling.suggest.SolrSuggester.build(SolrSuggester.java:181)
at
org.apache.solr.handler.component.SuggestComponent$SuggesterListener.buildSuggesterIndex(SuggestComponent.java:529)
at
org.apache.solr.handler.component.SuggestComponent$SuggesterListener.newSearcher(SuggestComponent.java:511)
at org.apache.solr.core.SolrCore.lambda$getSearcher$17(SolrCore.java:2275)


Re: How do I create a schema file for FIX data in Solr

2018-04-03 Thread Adhyan Arizki
Raymond,

It seems you are having an issue with the Node environment. Likely the path
isn't registered correctly, judging from the error message. Note, though, that
this is no longer a Solr issue.
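
One possible direction, assuming the common Ubuntu case where the package
installs the binary as "nodejs" while the script expects "node"; commands are a
sketch only:

which node || which nodejs
# if only "nodejs" exists, a symlink usually resolves the /usr/bin/env error
sudo ln -s "$(which nodejs)" /usr/local/bin/node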

On Tue, 3 Apr 2018, 23:00 Raymond Xie,  wrote:

> Hi Rick,
>
> Following your suggestion I found https://github.com/SunGard-Labs/fix2json
> which seems to be a fit;
>
> I followed the installation instruction and successfully installed the
> fix2json on my Ubuntu host.
>
> sudo npm install -g fix2json
>
> I ran the same command as indicated in the git:
>
> fix2json -p dict/FIX50SP2.CME.xml XCME_MD_GE_FUT_20160315.gz
>
>
> and I received error of:
>
> /usr/bin/env: ‘node’: No such file or directory
>
> It would be appreciated if you can point out what is missing here?
>
> Thank you again for your kind help.
>
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Mon, Apr 2, 2018 at 9:30 AM, Raymond Xie  wrote:
>
> > Thank you Rick for the enlightening.
> >
> > I will get the FIX message parsed first and come back here later.
> >
> >
> > **
> > *Sincerely yours,*
> >
> >
> > *Raymond*
> >
> > On Mon, Apr 2, 2018 at 9:15 AM, Rick Leir  wrote:
> >
> >> Google
> >>fix to json,
> >> there are a few interesting leads.
> >>
> >> On April 2, 2018 12:34:44 AM EDT, Raymond Xie 
> >> wrote:
> >> >Thank you, Shawn, Rick and other readers,
> >> >
> >> >To Shawn:
> >> >
> >> >For  *8=FIX.4.4 9=653 35=RIO* as an example, in the FIX standard: 8
> >> >means BeginString, in this example, its value is  FIX.4.4.9, and 9
> >> >means
> >> >body length, it is 653 for this message, 35 is RIO, meaning the message
> >> >type is RIO, 122 stands for OrigSendingTime and has a format of
> >> >UTCTimestamp
> >> >
> >> >You can refer to this page for details: https://www.onixs.biz
> >> >/fix-dictionary/4.2/fields_by_tag.html
> >> >
> >> >All the values are explained as string type.
> >> >
> >> >All the tag numbers are from FIX standard so it doesn't change (in my
> >> >case)
> >> >
> >> >I expect a python program might be needed to parse the message and
> >> >extract
> >> >each tag's value, index is to be made on those extracted value as long
> >> >as
> >> >their field (tag) name.
> >> >
> >> >With index in place, ideally and naturally user will search for any
> >> >keyword, however, in this case, most queries would be based on tag 37
> >> >(Order ID) and 75 (Trade Date), there is another customized tag (not in
> >> >the
> >> >standard) Order Version to be queried on.
> >> >
> >> >I understand the parser creation would be a manual process, as long as
> >> >I
> >> >know or have a small sample program, I will do it myself and maybe
> >> >adjust
> >> >it as per need.
> >> >
> >> >To Rick:
> >> >
> >> >You mentioned creating JSON document, my understanding is a parser
> >> >would be
> >> >needed to generate that JSON document, do you have any existing example
> >> >code?
> >> >
> >> >
> >> >
> >> >
> >> >Thank you guys very much.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >**
> >> >*Sincerely yours,*
> >> >
> >> >
> >> >*Raymond*
> >> >
> >> >On Sun, Apr 1, 2018 at 2:16 PM, Shawn Heisey 
> >> >wrote:
> >> >
> >> >> On 4/1/2018 10:12 AM, Raymond Xie wrote:
> >> >>
> >> >>> FIX is a format standard of financial data. It contains lots of tags
> >> >in
> >> >>> number with value for the tag, like 8=asdf, where 8 is the tag and
> >> >asdf is
> >> >>> the tag's value. Each tag has its definition.
> >> >>>
> >> >>> The sample msg in FIX format was in the original question.
> >> >>>
> >> >>> All I need to do is to know how to paste the msg and get all tag's
> >> >value.
> >> >>>
> >> >>> I found so far a parser is what I need to start with., But I am more
> >> >>> concerning about how to create index in Solr on the extracted tag's
> >> >value,
> >> >>> that is the first step, the next would be to customize the dashboard
> >> >for
> >> >>> users to search with a value to find out which msg contains that
> >> >value in
> >> >>> which tag and present users the whole msg as proof.
> >> >>>
> >> >>
> >> >> Most of Solr's functionality is provided by Lucene.  Lucene is a java
> >> >API
> >> >> that implements search functionality.  Solr bolts on some
> >> >functionality on
> >> >> top of Lucene, but doesn't really do anything to fundamentally change
> >> >the
> >> >> fact that you're dealing with a Lucene index.  So I'm going to mostly
> >> >talk
> >> >> about Lucene below.
> >> >>
> >> >> Lucene organizes data in a unit that we call a "document." An easy
> >> >analogy
> >> >> for this is that it is a lot like a row in a single database table.
> >> >It has
> >> >> fields, each field has a type. Unless custom software is used, there
> >> >is
> >> >> really no support for data other than basic 

Re: Learning to Rank (LTR) with grouping

2018-04-03 Thread ilayaraja
Thanks Roopa.

I was expecting that the issue had been fixed in Solr 7.0, as per
https://issues.apache.org/jira/browse/SOLR-8776.

Let me see why it is still not working in solr-ltr 7.2.1.



-
--Ilay
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 6. 3 Can not talk to ZK Updates are disabled

2018-04-03 Thread Erick Erickson
With beefy machines, one strategy is to create multiple JVMs. For
example, if you have one JVM and it hosts 32 replicas, splitting that
up to 4 JVMs hosting 8 replicas each. That can allow you to drop down
the heap allocated to each.
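
A sketch of what several instances per machine can look like; ports, paths,
heap sizes and the ZooKeeper connection string are illustrative only:

bin/solr start -c -p 8983 -s /var/solr/node1 -m 8g -z zk1:2181,zk2:2181,zk3:2181
bin/solr start -c -p 8984 -s /var/solr/node2 -m 8g -z zk1:2181,zk2:2181,zk3:2181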

Managing memory is always "exciting" at scale. If you're sorting,
faceting, or grouping on a field that does _not_ have docValues
enabled, that can be a major memory hog. If you enable docValues you
need to re-index completely BTW...
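
For illustration, docValues is a per-field attribute in the schema; the field
name here is hypothetical, and as noted above changing it requires a full reindex:

<field name="category" type="string" indexed="true" stored="true" docValues="true"/>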

From there, it's a matter of trying to figure out where the memory is
being used and see what can be done about that.

Best,
Erick

On Mon, Apr 2, 2018 at 2:57 PM, Shawn Heisey  wrote:
> On 4/2/2018 2:43 PM, murugesh karmegam wrote:
>> So given all of that wondering is there any options
>> like G1 GC tuning ?
>
> Targeted reply.
>
> I've put some G1 information out there for Solr.
>
> https://wiki.apache.org/solr/ShawnHeisey
>
> Thanks,
> Shawn
>


Re: some parent documents

2018-04-03 Thread Arturas Mazeika
Hi Mikhail,

Thanks a lot for the reply.

You mentioned that

q=+{!parent which.. v='+text:hello +person:A'} +{!parent
which..v='+text:ciao +person:B'}

is the way to go. How would it look like precisely for the following
collection?

{
"id":1,
"_childDocuments_":
[
{"id":"1_1", "person":"Vai" , "time":"3:14", "msg":"Hello"},
{"id":"1_2", "person":"Arturas" , "time":"3:14", "msg":"Hello"},
{"id":"1_3", "person":"Vai" , "time":"3:15", "msg":"Coz
Mathias is working on another system- different screen."},
{"id":"1_4", "person":"Vai" , "time":"3:15", "msg":"It can
get annoying"},
{"id":"1_5", "person":"Arturas" , "time":"3:15", "msg":"Thank
you. this is very nice of you"},
{"id":"1_6", "person":"Vai" , "time":"3:16", "msg":"ciao"},
{"id":"1_7", "person":"Arturas" , "time":"3:16", "msg":"ciao"}
]
},
{
"id":2,
"_childDocuments_":
[
{"id":"2_1", "person":"Vai" , "time":"4:14", "msg":"Hello"},
{"id":"2_2", "person":"Arturas" , "time":"4:14", "msg":"IBM
Watson"},
{"id":"2_3", "person":"Vai" , "time":"4:15", "msg":"need to
retain content"},
{"id":"2_4", "person":"Vai" , "time":"4:15", "msg":"It can
get annoying"},
{"id":"2_5", "person":"Arturas" , "time":"4:15", "msg":"You can
make all your meetings more access"},
{"id":"2_6", "person":"Vai" , "time":"4:16", "msg":"Make
every meeting a Skype meeting"},
{"id":"2_7", "person":"Arturas" , "time":"4:16", "msg":"ciao"}
]
}

Cheers,
Arturas


On Tue, Apr 3, 2018 at 4:33 PM, Mikhail Khludnev  wrote:

> Hello, Arturas.
>
> TLDR; Please find inline below.
>
> On Tue, Apr 3, 2018 at 5:14 PM, Arturas Mazeika  wrote:
>
> > Hi Solr Fans,
> >
> > I am trying to make sense of information retrieval using expressions like
> > "some parent", "*only parent*", " *all parent*". I am also trying to
> > understand the syntax "!parent which" and "!child of". On the technical
> > level, I am reading the following documents:
> >
> > [1]
> > https://lucene.apache.org/solr/guide/7_2/other-parsers.
> > html#block-join-query-parsers
> > [2]
> > https://lucene.apache.org/solr/guide/7_2/uploading-data-
> > with-index-handlers.html#nested-child-documents
> > [3] http://yonik.com/solr-nested-objects/
> >
> > and I am confused to read:
> >
> > This parser takes a query that matches some parent documents and returns
> > their children. The syntax for this parser is: q={!child
> > of=}. The parameter allParents is a filter that
> > matches *only parent documents*; here you would define the field and
> value
> > that you used to identify *all parent documents*. The parameter
> someParents
> > identifies a query that will match some of the parent documents. The
> output
> > is the children.
> >
> > The first sentence talks about "matching" but does not define what that
> > means (and why it is only some parents matching?). The second sentence
> > introduces a syntax of the parser, but blurs the understanding as "some"
> > and "all" of parents are combined into one sentence. My understanding is
> > that all documents are retrieve that satisfy a query. The query must
> > express some constraints on the parent node and some on the child node. I
> > have a feeling that "only parent documents" reads "criteria is formulated
> > over the parent part of {parent document}->{child document} of entity.
> > My simplified conceptual world of solr looks in the following way:
> >
> > 1. Every document has an ID.
> > 2. Every document may have additional attributes
> > 3. Text attributes is what's at stake in solr. Sure we can search for
> > products that costs at most X, but this is the added functionality. For
> > simplicity I am neglecting those here.
> > 4. The user has an information need. She expresses it with (key)words and
> > hopes to find matching documents. For simplicity, I am skipping all
> issues
> > related to the information presentation of the documents
> > 5. Analysis chain (and inverse index) are the key technologies solr is
> > based upon. Once the chain-processing is applied, mathematical logic
> kicks
> > in, retrieving the documents (that are a set of processed, normalized,
> > enriched tokens) matching the query (processed, normalized and enriched
> > tokens). Clearly, the logic function can be a fancy one (at least one of
> > query token is in the document set of tokens, etc.), ranking is used to
> > sort the results.
> > 6. A nested document concept is introduced in solr. It needs to be
> uploaded
> > into the index structure using a specific handlers [2]. A nested
> documents
> > is a tree. A root may contain children documents, which may be parents of
> > grandchildren documents.
> > 7. Querying nested documents is supported in the following manner:
> > 7.1 Child documents are return that satisfies {parent
> > 

Re: SolrCloud 5.2.1 - collection creation error

2018-04-03 Thread bondinthepond
Hi Aaron Gibbons,

I need your help. What changes did you make to the scripts on the ZooKeeper
machine? I am stuck with a similar problem.

Thanks in Advance.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How do I create a schema file for FIX data in Solr

2018-04-03 Thread Raymond Xie
Hi Rick,

Following your suggestion I found https://github.com/SunGard-Labs/fix2json
which seems to be a fit;

I followed the installation instruction and successfully installed the
fix2json on my Ubuntu host.

sudo npm install -g fix2json

I ran the same command as indicated in the git:

fix2json -p dict/FIX50SP2.CME.xml XCME_MD_GE_FUT_20160315.gz


and I received error of:

/usr/bin/env: ‘node’: No such file or directory

It would be appreciated if you can point out what is missing here?

Thank you again for your kind help.



**
*Sincerely yours,*


*Raymond*

On Mon, Apr 2, 2018 at 9:30 AM, Raymond Xie  wrote:

> Thank you Rick for the enlightening.
>
> I will get the FIX message parsed first and come back here later.
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Mon, Apr 2, 2018 at 9:15 AM, Rick Leir  wrote:
>
>> Google
>>fix to json,
>> there are a few interesting leads.
>>
>> On April 2, 2018 12:34:44 AM EDT, Raymond Xie 
>> wrote:
>> >Thank you, Shawn, Rick and other readers,
>> >
>> >To Shawn:
>> >
>> >For  *8=FIX.4.4 9=653 35=RIO* as an example, in the FIX standard: 8
>> >means BeginString, in this example, its value is  FIX.4.4.9, and 9
>> >means
>> >body length, it is 653 for this message, 35 is RIO, meaning the message
>> >type is RIO, 122 stands for OrigSendingTime and has a format of
>> >UTCTimestamp
>> >
>> >You can refer to this page for details: https://www.onixs.biz
>> >/fix-dictionary/4.2/fields_by_tag.html
>> >
>> >All the values are explained as string type.
>> >
>> >All the tag numbers are from FIX standard so it doesn't change (in my
>> >case)
>> >
>> >I expect a python program might be needed to parse the message and
>> >extract
>> >each tag's value, index is to be made on those extracted value as long
>> >as
>> >their field (tag) name.
>> >
>> >With index in place, ideally and naturally user will search for any
>> >keyword, however, in this case, most queries would be based on tag 37
>> >(Order ID) and 75 (Trade Date), there is another customized tag (not in
>> >the
>> >standard) Order Version to be queried on.
>> >
>> >I understand the parser creation would be a manual process, as long as
>> >I
>> >know or have a small sample program, I will do it myself and maybe
>> >adjust
>> >it as per need.
>> >
>> >To Rick:
>> >
>> >You mentioned creating JSON document, my understanding is a parser
>> >would be
>> >needed to generate that JSON document, do you have any existing example
>> >code?
>> >
>> >
>> >
>> >
>> >Thank you guys very much.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >**
>> >*Sincerely yours,*
>> >
>> >
>> >*Raymond*
>> >
>> >On Sun, Apr 1, 2018 at 2:16 PM, Shawn Heisey 
>> >wrote:
>> >
>> >> On 4/1/2018 10:12 AM, Raymond Xie wrote:
>> >>
>> >>> FIX is a format standard of financial data. It contains lots of tags
>> >in
>> >>> number with value for the tag, like 8=asdf, where 8 is the tag and
>> >asdf is
>> >>> the tag's value. Each tag has its definition.
>> >>>
>> >>> The sample msg in FIX format was in the original question.
>> >>>
>> >>> All I need to do is to know how to paste the msg and get all tag's
>> >value.
>> >>>
>> >>> I found so far a parser is what I need to start with., But I am more
>> >>> concerning about how to create index in Solr on the extracted tag's
>> >value,
>> >>> that is the first step, the next would be to customize the dashboard
>> >for
>> >>> users to search with a value to find out which msg contains that
>> >value in
>> >>> which tag and present users the whole msg as proof.
>> >>>
>> >>
>> >> Most of Solr's functionality is provided by Lucene.  Lucene is a java
>> >API
>> >> that implements search functionality.  Solr bolts on some
>> >functionality on
>> >> top of Lucene, but doesn't really do anything to fundamentally change
>> >the
>> >> fact that you're dealing with a Lucene index.  So I'm going to mostly
>> >talk
>> >> about Lucene below.
>> >>
>> >> Lucene organizes data in a unit that we call a "document." An easy
>> >analogy
>> >> for this is that it is a lot like a row in a single database table.
>> >It has
>> >> fields, each field has a type. Unless custom software is used, there
>> >is
>> >> really no support for data other than basic primitive types --
>> >numbers and
>> >> strings.  The only complex type that I can think of that Solr
>> >supports out
>> >> of the box is geospatial coordinates, and it might even support
>> >> multi-dimensional coordinates, but I'm not sure.  It's not all that
>> >complex
>> >> -- the field just stores and manipulates multiple numbers instead of
>> >one.
>> >> The Lucene API does support a FEW things that Solr doesn't implement.
>> > I
>> >> don't think those are applicable to what you're trying to do.
>> >>
>> >> Let's look at the first part of the 

Re: querying vs. highlighting: complete freedom?

2018-04-03 Thread Arturas Mazeika
Hi David,

Thanks a lot for the reply and the infos.

I suspected that the minimum on the indexing/storage side was that hl.fl
fields need to be "stored". I understand that my expression "minimal requirements"
is totally loose/unclear; I wasn't sure how to formulate that, as (i) I am
not yet sure how to express myself clearly using the language of the forum
and (ii) I was not sure what impact it has if other component is selected
(like FastVector Highlighter). Deep inside I had a feeling that some solr
configurations would allow highlighting even without the "stored" property
set.

It came to my mind that the documentation nicely describes how to set up the
hl.method parameter (unified, original, fastVector). Similarly, there's the
hl.qparser parameter, but its documentation is not as rich (it says that the
default value is lucene). I am wondering whether other alternatives are
available. In case you are referring to other components, can you add a
reference to those?
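
For concreteness, a sketch of a request combining these parameters; collection
and field names are made up:

curl 'http://localhost:8983/solr/mycoll/select?q=text:hello&hl=true&hl.fl=text&hl.method=unified'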

With respect to your question why I'd like to use the analysis chain for
highlighting: that is a very good question. Our end users cannot yet
distinguish between the highlighting capability of Solr/information retrieval
and a search for the occurrences of the query terms in the documents. It is a
rather difficult situation I am in. It is cool that there's a JIRA or two
on the load-balancing side.

Thanks!
Arturas

On Tue, Apr 3, 2018 at 4:29 PM, David Smiley 
wrote:

> Thanks for your review!
>
> On Tue, Apr 3, 2018 at 6:56 AM Arturas Mazeika  wrote:
> ...
>
> > What I missed at the beginning of the documentation is the minimal set of
> > requirements that is reacquired to have highlighting sensible: somehow I
> > have a feeling that one needs some of the information stored in schema in
> > some form. This of course is mentioned later on in the corresponding
> > section, but I'd write this explicitly.
> >
>
> Explicitly say what up front?  "Requirements" are somewhat loose/minimal.
> We ought to say clearly say that hl.fl fields need to be "stored".
>
> ...
>
> > Is there a way to "load-balance" analyze-query-chain for the purpose of
> > highlighting matches? In the url below, I need to specify a specific
> core.
>
> ...
>
> I doubt it.  You'll have to do this yourself.  Why do you want to use this
> for highlighting?  Is it to get the offsets returned to you?  There's a
> JIRA or two for that already; someone ought to make that happen.
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>


Re: some parent documents

2018-04-03 Thread Mikhail Khludnev
Hello, Arturas.

TLDR; Please find inline below.

On Tue, Apr 3, 2018 at 5:14 PM, Arturas Mazeika  wrote:

> Hi Solr Fans,
>
> I am trying to make sense of information retrieval using expressions like
> "some parent", "*only parent*", " *all parent*". I am also trying to
> understand the syntax "!parent which" and "!child of". On the technical
> level, I am reading the following documents:
>
> [1]
> https://lucene.apache.org/solr/guide/7_2/other-parsers.
> html#block-join-query-parsers
> [2]
> https://lucene.apache.org/solr/guide/7_2/uploading-data-
> with-index-handlers.html#nested-child-documents
> [3] http://yonik.com/solr-nested-objects/
>
> and I am confused to read:
>
> This parser takes a query that matches some parent documents and returns
> their children. The syntax for this parser is: q={!child
> of=}. The parameter allParents is a filter that
> matches *only parent documents*; here you would define the field and value
> that you used to identify *all parent documents*. The parameter someParents
> identifies a query that will match some of the parent documents. The output
> is the children.
>
> The first sentence talks about "matching" but does not define what that
> means (and why it is only some parents matching?). The second sentence
> introduces a syntax of the parser, but blurs the understanding as "some"
> and "all" of parents are combined into one sentence. My understanding is
> that all documents are retrieve that satisfy a query. The query must
> express some constraints on the parent node and some on the child node. I
> have a feeling that "only parent documents" reads "criteria is formulated
> over the parent part of {parent document}->{child document} of entity.
> My simplified conceptual world of solr looks in the following way:
>
> 1. Every document has an ID.
> 2. Every document may have additional attributes
> 3. Text attributes is what's at stake in solr. Sure we can search for
> products that costs at most X, but this is the added functionality. For
> simplicity I am neglecting those here.
> 4. The user has an information need. She expresses it with (key)words and
> hopes to find matching documents. For simplicity, I am skipping all issues
> related to the information presentation of the documents
> 5. Analysis chain (and inverse index) are the key technologies solr is
> based upon. Once the chain-processing is applied, mathematical logic kicks
> in, retrieving the documents (that are a set of processed, normalized,
> enriched tokens) matching the query (processed, normalized and enriched
> tokens). Clearly, the logic function can be a fancy one (at least one of
> query token is in the document set of tokens, etc.), ranking is used to
> sort the results.
> 6. A nested document concept is introduced in solr. It needs to be uploaded
> into the index structure using a specific handlers [2]. A nested documents
> is a tree. A root may contain children documents, which may be parents of
> grandchildren documents.
> 7. Querying nested documents is supported in the following manner:
> 7.1 Child documents are return that satisfies {parent
> document}->{document}
> 7.2 Parent documents are return that satisfy {document}->{child
> document}
>
> Would I be very wrong to have this conceptual picture?
>
> From this point, the situation is a bit bury in my head. At the core, I do
> not really understand what "a document" is anymore (since the complete json
> or xml, so is a sub-json and sub-xml are documents, every document must
> have an ID, does that meant the the subdocuments must have and ID too, or
> sub-ids are also fine?), how to formulate mathematical expressions over
> documents and what it means that the document satisfies my (key)word query?
> Can we define a document to be the largest entity of information that does
> not contain any other nested documents [4]? If this is defined and
> communicated like this already where can I find it? There is a use of the
> clarification, as the concept of the document means different things in
> different contexts (e.g., you can update only the "complete document" in
> the index vs. parent document, etc.).
>
> Is it possible to formulate what's going on using mathematical logic? Can
> one express something like
>
> { give documents d : d is a document, d is parent of document c, d
> satisfies logical criteria C1,,CN, c satisfies logical criteria
> C1',...,CM'}
> { give documents c : c is a document, d is parent of document c, d
> satisfies logical criteria C1,,CN, c satisfies logical criteria
> C1',...,CM'}
>
> here the meaning of document is as in definition [4] above.
>
> 1. Is it possible to retrieve all parent documents that have two children
> c1 and c2? Consider a document that is a skype chat, and children are
> individual lines of communication in the chat. I would be looking for the
> (parent) documents that have "hello" said by person A and "ciao" said by
> person B (as two different 

Re: querying vs. highlighting: complete freedom?

2018-04-03 Thread David Smiley
Thanks for your review!

On Tue, Apr 3, 2018 at 6:56 AM Arturas Mazeika  wrote:
...

> What I missed at the beginning of the documentation is the minimal set of
> requirements that is reacquired to have highlighting sensible: somehow I
> have a feeling that one needs some of the information stored in schema in
> some form. This of course is mentioned later on in the corresponding
> section, but I'd write this explicitly.
>

Explicitly say what up front?  "Requirements" are somewhat loose/minimal.
We ought to clearly say that hl.fl fields need to be "stored".
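
As a concrete illustration, that requirement is just the stored attribute on
the highlighted field; the field definition below is hypothetical:

<field name="title" type="text_general" indexed="true" stored="true"/>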

...

> Is there a way to "load-balance" analyze-query-chain for the purpose of
> highlighting matches? In the url below, I need to specify a specific core.

...

I doubt it.  You'll have to do this yourself.  Why do you want to use this
for highlighting?  Is it to get the offsets returned to you?  There's a
JIRA or two for that already; someone ought to make that happen.
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


some parent documents

2018-04-03 Thread Arturas Mazeika
Hi Solr Fans,

I am trying to make sense of information retrieval using expressions like
"some parent", "*only parent*", " *all parent*". I am also trying to
understand the syntax "!parent which" and "!child of". On the technical
level, I am reading the following documents:

[1]
https://lucene.apache.org/solr/guide/7_2/other-parsers.html#block-join-query-parsers
[2]
https://lucene.apache.org/solr/guide/7_2/uploading-data-with-index-handlers.html#nested-child-documents
[3] http://yonik.com/solr-nested-objects/

and I am confused to read:

This parser takes a query that matches some parent documents and returns
their children. The syntax for this parser is: q={!child
of=}. The parameter allParents is a filter that
matches *only parent documents*; here you would define the field and value
that you used to identify *all parent documents*. The parameter someParents
identifies a query that will match some of the parent documents. The output
is the children.

The first sentence talks about "matching" but does not define what that
means (and why it is only some parents matching?). The second sentence
introduces a syntax of the parser, but blurs the understanding as "some"
and "all" of parents are combined into one sentence. My understanding is
that all documents that satisfy a query are retrieved. The query must
express some constraints on the parent node and some on the child node. I
have a feeling that "only parent documents" reads "criteria is formulated
over the parent part of {parent document}->{child document} of entity.
My simplified conceptual world of solr looks in the following way:

1. Every document has an ID.
2. Every document may have additional attributes
3. Text attributes is what's at stake in solr. Sure we can search for
products that costs at most X, but this is the added functionality. For
simplicity I am neglecting those here.
4. The user has an information need. She expresses it with (key)words and
hopes to find matching documents. For simplicity, I am skipping all issues
related to the information presentation of the documents
5. Analysis chain (and inverse index) are the key technologies solr is
based upon. Once the chain-processing is applied, mathematical logic kicks
in, retrieving the documents (that are a set of processed, normalized,
enriched tokens) matching the query (processed, normalized and enriched
tokens). Clearly, the logic function can be a fancy one (at least one of
query token is in the document set of tokens, etc.), ranking is used to
sort the results.
6. A nested document concept is introduced in solr. It needs to be uploaded
into the index structure using specific handlers [2]. A nested document
is a tree. A root may contain child documents, which may be parents of
grandchildren documents.
7. Querying nested documents is supported in the following manner:
7.1 Child documents are returned that satisfy {parent
document}->{document}
7.2 Parent documents are returned that satisfy {document}->{child
document}

Would I be very wrong to have this conceptual picture?

From this point, the situation is a bit blurry in my head. At the core, I do
not really understand what "a document" is anymore (since the complete JSON
or XML is a document, but so are a sub-JSON and a sub-XML; every document
must have an ID, so does that mean the subdocuments must have an ID too, or
are sub-IDs also fine?), how to formulate mathematical expressions over
documents, and what it means that a document satisfies my (key)word query.
Can we define a document to be the largest entity of information that does
not contain any other nested documents [4]? If this is already defined and
communicated like this, where can I find it? There is a use for the
clarification, as the concept of the document means different things in
different contexts (e.g., you can update only the "complete document" in
the index vs. the parent document, etc.).

Is it possible to formulate what's going on using mathematical logic? Can
one express something like

{ give documents d : d is a document, d is parent of document c, d
satisfies logical criteria C1,...,CN, c satisfies logical criteria
C1',...,CM' }
{ give documents c : c is a document, d is parent of document c, d
satisfies logical criteria C1,...,CN, c satisfies logical criteria
C1',...,CM' }

here the meaning of document is as in definition [4] above.

1. Is it possible to retrieve all parent documents that have two children,
c1 and c2? Consider a document that is a skype chat, and the children are
individual lines of communication in the chat. I would be looking for the
(parent) documents that have "hello" said by person A and "ciao" said by
person B (as two different sub-documents); see my attempted sketch after
question 3.

2. Is it possible to search for documents such that they have a grandchild
and the grandchild has the word "hello"?

3. Is it possible to search for documents that do not have children?
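
My naive attempt at expressing question 1 (a sketch only; is_parent,
speaker and line are made-up field names, and I am not sure this is the
idiomatic way):

    q=+_query_:"{!parent which='is_parent:true'}speaker:A AND line:hello"
      +_query_:"{!parent which='is_parent:true'}speaker:B AND line:ciao"

where each _query_ clause returns the chat (parent) documents having at
least one child line matching the inner query, and the two +-clauses
require both.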

Is this the right venue to discuss documentation of solr?

Thanks!
Arturas


SolrCloud 7.2 problem with leader election

2018-04-03 Thread Gael Jourdan-Weil
Hello,

We are trying to upgrade from Solr 6.6 to Solr 7.2.1 and we are using Solr 
Cloud.

Doing some tests with 2 replicas, ZooKeeper doesn't know which one to elect as 
a leader:

ERROR org.apache.solr.cloud.ZkController:getLeader:1206  - Error getting leader 
from zk
org.apache.solr.common.SolrException: There is conflicting information about 
the leader of shard: shard1 our state 
says:http://host1:8080/searchsolrnodefr/fr_blue/ but zookeeper 
says:http://host2:8080/searchsolrnodefr/fr_blue/

solr.xml file:
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>

In the core.properties files, each replica has the same coreNodeName value: 
"core_node1".
When changing this property on host2 with value "core_node2", ZooKeeper can 
elect a leader and everything is fine.
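
For illustration, the working setup then looks roughly like this (core name
taken from the error above, other properties omitted):

    host1 core.properties:
        name=fr_blue
        coreNodeName=core_node1

    host2 core.properties:
        name=fr_blue
        coreNodeName=core_node2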

Do we need to set genericCoreNodeNames to false in solr.xml?

Gaël

Re: Trying to Restore older indexes in Solr7.2.1

2018-04-03 Thread Shawn Heisey

On 4/3/2018 3:22 AM, Mugdha Varadkar wrote:


is the collection using the compositeId router?

Yes collection used of both the versions are using compositeId router, 
PFA screenshot of the same.


If you attached a screenshot, it was lost.  The mailing list does not 
allow most attachments through.



You would need to look at the information in zookeeper for both
versions

Any specific information I can check / look into: shall I check for 
uploaded configs for the collection or any specific set of properties 
in solrconfig.xml?


It's not in solrconfig.xml.  I think you'll find the hash ranges for 
your shards in the '/collections/NAME/state.json' file for the 
collection, in zookeeper.  I'm not 100% sure that this is the correct 
location, but I *think* it is.
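
If you want to look at that file, one way (a sketch; adjust the collection 
name and ZooKeeper connect string for your setup) is the zk CLI that ships 
with Solr:

    bin/solr zk cp zk:/collections/NAME/state.json ./state.json -z localhost:2181

The hash ranges show up in there as "range" values on each shard entry, for 
example "80000000-7fffffff" on a single-shard compositeId collection.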


The difference between the schema of solr-5.5.5 and the schema for 
solr-7.2.1 is that we are adding docValues on required fields to 
improve the indexing of those fields.
Hence, before upgrading from solr-5.5.5, I took a backup of documents 
using the restore API, then upgraded Solr from 5.5.5 to 7.2.1 and tried 
to restore the backed-up data by converting the documents to the newer 
format using the IndexUpgrader tool.


IndexUpgrader is a Lucene tool.  The schema is a Solr invention.  
IndexUpgrader does not know anything about the schema.  It only knows 
about what's actually in the index.


If you try to use a different schema to access your index than the 
schema it was created with, you run the risk that you won't be able to 
access your data.  Most schema changes require rebuilding the index from 
scratch.  Adding docValues is a change that requires this.  Running 
indexUpgrader is *NOT* a reindex.


And, issue I got after restoring the documents of older version was : 
all the documents that were there in collection that was created for 
solr-7.2.1 were not available at all.


Are you getting any errors when you query the index?  Look in solr.log.

Thanks,
Shawn



Re: Need help to get started on Solr, searching get nothing. Thank you very much in advance

2018-04-03 Thread Shawn Heisey

On 4/2/2018 9:00 PM, Raymond Xie wrote:

I see there is "/browse" in solrconfig.xml:

  <requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
  </requestHandler>

and an initParams block with a <lst name="defaults"> containing one item, "df", as shown below:

  <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
    <lst name="defaults">
      <str name="df">_text_</str>
    </lst>
  </initParams>

My understanding is that I can put whatever fields I want to enable indexing and
searching here, in parallel with _text_; am I correct?


Be careful with the /browse handler.  It is not intended for production 
use.  The primary reason is that in order for /browse to work, the end 
user must have direct access to the Solr server -- the /browse handler 
instructs the user's browser to make direct calls to Solr's API.  Those 
calls do not happen server side.  The /browse handler serves as an 
example of Solr's capability.


Giving end users direct access to Solr, unless you put an intelligent 
proxy in between them that can block undesirable requests, is a security 
risk.  Configuring such a proxy is not a trivial exercise.


The standard query parser uses the df parameter (default field) to 
indicate which field to search when no field is given. The df parameter 
can only use ONE field.  The dismax and edismax parsers have qf, pf, and 
friends, to specify multiple fields to query.
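
For example (the field names here are placeholders, not from your schema):

    /select?q=solr&df=_text_
    /select?defType=edismax&q=solr&qf=title^2+description+_text_

The first searches only _text_; the second spreads the query across title 
(boosted), description, and _text_.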


Thanks,
Shawn



Re: querying vs. highlighting: complete freedom?

2018-04-03 Thread Arturas Mazeika
Hi David,

Thanks a lot for the reply, the effort to update the documentation, and
for having the documentation reflect the question I posted here.

I've read the doc you provided. I've read the updated parts and the
document as carefully as I could. I've browsed and skimmed parts of the
document where it got rather detailed, especially the parts on the
original, unified and vector highlighters. I'll have to revisit those parts
as I deepen my understanding of information retrieval and Solr in
particular.

The updates in the document are helpful and improved the document quite a
bit. I also agree that it is hard to document the problem and give a
solution to my problem. I see at least two reasons why this becomes very
challenging in this case: (i) the document aims to cover all options and
possibilities of highlighting in Solr, and (ii) the document aims to teach the
reader how to use highlighting in Solr. These aims are conflicting: if one
wants to cover the options and possibilities, one structures the content
hierarchically, starting with most basic building blocks (jumping into
details first). If one aims at usage, one starts with the simplest possible
case that illustrates highlighting, followed up by more complex use cases
illustrating more sophisticated and advanced cases (abstracts from details,
focuses on the big picture). The 1st type of documentation tends to be long and
boring (check out the manuals provided by Microsoft; they perfected this style
of documenting, in my opinion); the 2nd type of documentation repeats itself
constantly, or contains multiple references to the outside (as every new use
case is somewhat based on the previous one). You have sections that focus
on both aspects in the documentation: some examples give very simple
targeted examples how to use solr, and some sections dig into the details.
What I missed at the beginning of the documentation is the minimal set of
requirements that is required to make highlighting sensible: somehow I
have a feeling that one needs some of the information stored in the schema
in some form. This of course is mentioned later on in the corresponding
section, but I'd write it explicitly.

I still have a question that would really be cool to get an answer (which
is more about analyses and less about highlighting). My key question is:

Is there a way to "load-balance" analyze-query-chain for the purpose of
highlighting matches? In the url below, I need to specify a specific core.

http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en


The context for this question is:

> Steven's hint pushed me in this direction further: he suggested to use the
> query part of Solr to filter and sort out the relevant answers in the 1st
> step, and in the 2nd step he'd highlight all the keywords using CTRL+F (in
> the browser or some alternative viewer). This brought me to the next
> question:
>
> How can one match query terms with the analyze-chained documents in an
> efficient and distributed manner? My current understanding how to achieve
> this is the following:
>
> 1. Get the list of ids (contents) of the documents that match the query
> 2. Use the http://localhost:8983/solr/#/trans/analysis to re-analyze the
> document and the query
> 3. Use the matching of the substrings from the original text to last
> filter/tokenizer/analyzer in the analyze-chain to map the terms of the
> query
> 4. Emulate CTRL+F highlighting
>
> Web Interface of Solr offers quite a bit to advance towards this goal. If
> one fires this request:
>
> * analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955) was a
> German-born theoretical physicist[5] who developed the theory of
> relativity, one of the two pillars of modern physics (alongside quantum
> mechanics).&
> * analysis.query=reletivity theory
>
> to one of the cores of solr, one gets the steps 1-3 done:
>
>
> http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
>
> Questions:
>
> 1. Is there a way to "load-balance" this? In the above url, I need to
> specify a specific core. Is it possible to generalize it, so the core that
> receives the request is not necessarily the one that processes it? Or this
> already is distributed in a sense that receiving core and processing cores
> are never the same?
>
> 2. The document was already analyze-chained. Is is 

Re: MatchMode in Dismax parser

2018-04-03 Thread lsharma3
Hi Shawn, My code is ready; I just need to raise the PR for the same. Can you
please guide me on raising my first PR for Solr.

Regards,
Lucky Sharma



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: MatchMode in Dismax parser

2018-04-03 Thread lsharma3
Hi Shawn, 
I have already made the changes for this; 
can you guide me on raising my first PR :) .

It would be a great help


Regards,
Lucky Sharma



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: PreAnalyzed FieldType, and simultaneously importing JSON

2018-04-03 Thread Markus Jelsma
Hi David!

Many thanks, this looks much better!

Regards,
Markus
 
-Original message-
> From:David Smiley 
> Sent: Monday 2nd April 2018 21:27
> To: solr-user@lucene.apache.org
> Subject: Re: PreAnalyzed FieldType, and simultaneously importing JSON
> 
> Hello Markus,
> 
> It appears you are not familiar with PreAnalyzedUpdateProcessor?  Using
> that is much more flexible -- you could have different URP chains for your
> use-cases. IMO PreAnalyzedField ought to go away.  I argued for the URP
> version and thus its superiority to the FieldType here:
> https://issues.apache.org/jira/browse/SOLR-4619?focusedCommentId=13611191=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13611191
> Sadly, the FieldType is the one that is documented in the ref guide, but
> not the URP :-(
> 
> ~ David
> 
> On Thu, Mar 29, 2018 at 5:06 PM Markus Jelsma 
> wrote:
> 
> > Hello,
> >
> > We want to move to PreAnalyzed FieldType to offload our very heavy
> > analysis chain away from the search cluster, so we have to configure our
> > fields to accept pre-analyzed tokens in production.
> >
> > But we use the same schema in development environments too, and that is
> > where we use JSON files, or stream (export/import) data directly from
> > production servers into a development environment, again via JSON. And in
> > case of disaster recovery, we can import the daily exported JSON bzipped
> > files back into our production servers.
> >
> > But this JSON loading does not work with PreAnalyzed FieldType. So to load
> > JSON we must reset all fields back to their respective language specific
> > FieldTypes on-the-fly, we could automate, but it is a hassle we like to
> > avoid.
> >
> > Have i overlooked any configuration parameters that can help? Must we
> > automate the on-the-fly schema reconfiguration and reset to PreAnalyzed
> > after JSON loading is finished?
> >
> > Many thanks!
> > Markus
> >
> -- 
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
> 
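
(For anyone following along: a URP chain of the kind David describes might
look roughly like the sketch below in solrconfig.xml. The chain name and the
fieldName selector are illustrative assumptions, not something from this
thread.)

    <updateRequestProcessorChain name="pre-analyzed">
      <processor class="solr.PreAnalyzedUpdateProcessorFactory">
        <str name="fieldName">content</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Such a chain would be selected per request with update.chain=pre-analyzed,
leaving other chains free to run the normal analysis.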


Re: Trying to Restore older indexes in Solr7.2.1

2018-04-03 Thread Mugdha Varadkar
Hi Shawn Heisey,

Thank you for the reply given here.

Please find below answers to your questions,

is the collection using the compositeId router?

Yes, the collections in both versions are using the compositeId router; PFA
screenshot of the same.
>
> You would need to look at the information in zookeeper for both versions

Any specific information I can check / look into: shall I check for
uploaded configs for the collection or any specific set of properties in
solrconfig.xml?
From my earlier mail, trying to re-phrase my point #6: the new documents
added after the new collection was created were lost after restoring the older
indexes (in step #5).
The difference between the schema of solr-5.5.5 and the schema for
solr-7.2.1 is that we are adding docValues on required fields to improve
the indexing of those fields.
Hence, before upgrading from solr-5.5.5, I took a backup of documents using
the restore API, then upgraded Solr from 5.5.5 to 7.2.1 and tried to restore
the backed-up data by converting the documents to the newer format using
the IndexUpgrader tool.
And the issue I got after restoring the documents of the older version was:
all the documents that were in the collection created for solr-7.2.1
were not available at all.
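
(For reference, the backup/restore calls I mean are roughly of this shape;
core name, backup name and location are placeholders:)

    http://host:8983/solr/<core>/replication?command=backup&name=mybackup&location=/backups
    http://host:8983/solr/<core>/replication?command=restore&name=mybackup&location=/backups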

Thanks,
Mugdha Varadkar

On Wed, Mar 28, 2018 at 12:46 PM, Mugdha Varadkar <
mugdha.varadkar...@gmail.com> wrote:

> Hi,
>
> I am trying to restore Solr 5.5.5 indexes into Solr 7.2.1
>
> Performed below steps:
>
>    1. Upgraded the indexes to the Solr 6.6.2 format using the IndexUpgrader tool
>
>    Command used: java -cp server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.2.jar:server/solr-webapp/webapp/WEB-INF/lib/lucene-backward-codecs-6.6.2.jar org.apache.lucene.index.IndexUpgrader -verbose /usr/local/solr_5_5_data/index/
>2. Deleted the old collection used previously, as there is a change in
>schema file.
>3. Created new collection for new schema. Keeping the replication and
>shards same as old collection.
>4. Now there is data going into new collection.
>    5. Then I restored the upgraded indexes using the replication API (as the
>    backup was taken using that).
>6. The indexes got restored but the new data created after new
>collection creation was not available.
>
> Can someone suggest what I missed during upgrading the indexes ?
>
> Thanks,
> Mugdha Varadkar
>


Copy field on dynamic fields?

2018-04-03 Thread jatin roy
Hi,

Can we create a copy field on dynamic fields? If yes, then how does it decide which 
field should be copied to which one?

For example: if I have the dynamic field category_* and while indexing 4 fields 
are formed, such as:
category_1
category_2
category_3
category_4
and now I have to copy the contents of already existing dynamic field 
"category_*" to "new_category_*".

So my question is: how does the algorithm decide that category_1 data has to be 
indexed in new_category_1?
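
For reference, the declarations I am asking about would look roughly like
this in the schema (field type and attributes are placeholders):

    <dynamicField name="category_*" type="string" indexed="true" stored="true"/>
    <dynamicField name="new_category_*" type="string" indexed="true" stored="true"/>
    <copyField source="category_*" dest="new_category_*"/>

If I read the ref guide correctly, a wildcard is allowed in dest only when
source has one too, and the portion of the name matched by the source glob
is reused for dest, so category_1 would be copied into new_category_1.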

Regards
Jatin Roy
Software developer