Re: Good practices on indexing larger amount of documents at once using SolrJ

2018-07-20 Thread Arunan Sugunakumar
Dear Erick,

Thank you for your reply. I initialize the ArrayList variable with a new
ArrayList after I add and commit the solrDocumentList into the solrClient,
so I don't think I have the problem of an ever-increasing ArrayList. (I hope
the add method in solrClient flushes the previously added documents.) But as
you said, I do a hard commit during the loop. I can change it by adding
commitWithin. What value would you recommend for this type of
scenario?
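
For reference, a minimal sketch of the batching pattern discussed in this
thread, using commitWithin instead of explicit commits. The 10-second
commitWithin value, the core URL, and the field names are illustrative
assumptions, not recommendations from this thread:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mine-search").build();
        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 12_000_000; i++) {      // rows fetched from the DB
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);                  // hypothetical fields
            doc.addField("name_s", "object-" + i);
            batch.add(doc);
            if (batch.size() == 1000) {             // send in batches of 1,000
                client.add(batch, 10_000);          // commitWithin 10s; no explicit commit
                batch = new ArrayList<>();          // re-initialize after every add
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch, 10_000);              // flush the final partial batch
        }
        client.close();
    }
}

Note that commitWithin only controls when documents become searchable; the
autoCommit settings in solrconfig.xml still govern durable flushes.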

Thank you,
Arunan

*Sugunakumar Arunan*
Undergraduate - CSE | UOM

Email : aruna ns...@cse.mrt.ac.lk
Mobile : 0094 766016272
LinkedIn : https://www.linkedin.com/in/arunans23/

On 20 July 2018 at 23:21, Erick Erickson  wrote:

> I do this all the time with batches of 1,000 and don't see this problem.
>
> one thing that sometimes bites people is to fail to clear the doclist
> after every call to add. So you send ever-increasing batches to Solr.
> Assuming that when you talk about batch size you mean the size of the
> solrDocumentList, increasing it would make the broken pipe problem
> worse if anything...
>
> Also, it's generally bad practice to commit after every batch. That's not
> your problem here, just something to note. Let your autocommit
> settings in solrconfig handle it or specify commitWithin in your
> add call.
>
> I'd also look in your Solr logs and see if there's a problem there.
>
> Net-net is this is a perfectly reasonable pattern, I suspect some
> innocent-seeming problem with your indexing code.
>
> Best,
> Erick
>
>
>
> On Fri, Jul 20, 2018 at 9:32 AM, Arunan Sugunakumar
>  wrote:
> > Hi,
> >
> > I have around 12 million objects in my PostgreSQL database to be indexed.
> > I'm running a thread to fetch the rows from the database. The thread will
> > also create the documents and put them in an indexing queue. While this is
> > happening, my main process will retrieve the documents from the queue and
> > will index them in batches of 1000. For some time the process runs as
> > expected, but after some time, I get an exception.
> >
> > *[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
> > IOException occured when talking to server at:
> > http://localhost:8983/solr/mine-search
> > ……….……….[corePostProcess]
> > Caused by: java.net.SocketException: Broken pipe (Write
> > failed)[corePostProcess]at
> > java.net.SocketOutputStream.socketWrite0(Native Method)*
> >
> >
> > I tried increasing the batch size up to 3. Then I got a different
> > exception.
> >
> > *[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
> > IOException occured when talking to server at:
> > http://localhost:8983/solr/mine-search
> > …….….[corePostProcess]
> > Caused by: org.apache.http.NoHttpResponseException: localhost:8983
> failed
> > to respond*
> >
> >
> > I would like to know whether there are any good practices on handling
> > such situations, such as the max number of documents to index in one
> > attempt, etc.
> >
> > My environment:
> >
> > Version : solr 7.2, solrj 7.2
> > Ubuntu 16.04
> > RAM 20GB
> > I started Solr in standalone mode.
> > Number of replicas and shards : 1
> >
> > The method I used :
> > UpdateResponse response = solrClient.add(solrDocumentList);
> > solrClient.commit();
> >
> >
> > Thanks in advance.
> >
> > Arunan
>


Re: Exact Phrase search not returning results.

2018-07-20 Thread Tim Casey
Deepti,

I am going to guess the analyzer part of the .net application is cutting
off the last token.
If you try the queries on the console of the running solr cluster, what do
you get?  If you dump that specific field for all the docs, can you find it
with grep?

tim
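
A quick SolrJ sketch of that check, dumping the stored field for every doc
so the output can be grepped (the collection URL and field name here are
hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class DumpField {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();
        SolrQuery q = new SolrQuery("*:*");
        q.setFields("title_t");   // the text_general field being searched
        q.setRows(10000);         // enough for a small index; use cursorMark for large ones
        for (SolrDocument doc : client.query(q).getResults()) {
            System.out.println(doc.getFieldValue("title_t"));
        }
        client.close();
    }
}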


On Fri, Jul 20, 2018 at 10:56 AM Krishnan, Deepti (NIH/OD) [C] <
deepti.krish...@nih.gov> wrote:

> Hi,
>
>
>
> We are working on a .net application using Solr. When we initially
> launched the site we were using the 5.5.3 version, and last sprint we
> updated it to the 7.3.1 version. Everything is working fine as expected
> except for one feature.
>
>
>
> The exact phrase search does not return any value for some search
> criteria, and this used to work fine with the older version. Based on our
> research, search terms containing a stop word followed by more than one
> word are not working.
>
>
>
> The field has been defined as a text_general type in the schema and below
> are the tokenizers and filters used during indexing and querying.
>
>
>
>
>
> Eg.
>
>
>
>    - “PROMOTING SCHOOL READINESS AMONG LOW-INCOME FAMILIES” – This works.
>      No stop words.
>    - “national institutes of health” – This works as well. Notice that
>      there is a stop word (of) but only one word following it.
>    - “Structure of choroid plexus” – Does not work. Notice there are more
>      than two words following the stop word (of).
>    - "Health and Human Services" – This doesn’t work but “Health and
>      Human” works.
>
>
>
> Please let me know if there is something I am missing and if something is
> unclear or you need to reach out to me to discuss further.
>
>
>
> Thanks,
>
> Deepti
>
>


Re: Exact Phrase search not returning results.

2018-07-20 Thread Shawn Heisey
On 7/20/2018 8:33 AM, Krishnan, Deepti (NIH/OD) [C] wrote:
>
> We are working on a .net application using Solr. When we initially
> launched the site we were using the 5.5.3 version, and last sprint we
> updated it to the 7.3.1 version. Everything is working fine as
> expected except for one feature.
>
>  
>
> The exact phrase search does not return any value for some search
> criteria, and this used to work fine with the older version. Based on
> our research, search terms containing a stop word followed by more than
> one word are not working.
>
>  
>
> The field has been defined as a text_general type in the schema and
> below are the tokenizers and filters used during indexing and querying.
>

Your image did not make it to the list.  Can't see it.

To test a theory, try adding a sow parameter set to true to the
request.  I have no idea how to do this in a .net application, but if it
were a URL in a browser, you would add "&sow=true" (without the quotes)
to the URL.

If that works, I am not sure whether it's a bug or not, but you'll have
a workaround.  The sow parameter was added in one of the 6.x releases
with the default value set to true.  In 7.0, the default value was
changed to false.  The parameter is shorthand for "split on whitespace".

When sow=false, phrase queries do not appear to work correctly.

https://lucene.apache.org/solr/guide/7_4/the-standard-query-parser.html#standard-query-parser-parameters
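
In SolrJ, the same workaround looks roughly like the sketch below (sow is
just a request parameter; the field name and collection URL are
hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SowWorkaround {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();
        SolrQuery q = new SolrQuery("title_t:\"Structure of choroid plexus\"");
        q.set("sow", "true");   // restore the pre-7.0 split-on-whitespace behavior
        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}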

Thanks,
Shawn



Re: Exact Phrase search not returning results.

2018-07-20 Thread Steve Rowe
Hi Deepti,

Your schema snippet didn’t make it to the list.  Please repost as inline text 
rather than an image.

--
Steve
www.lucidworks.com

> On Jul 20, 2018, at 10:33 AM, Krishnan, Deepti (NIH/OD) [C] 
>  wrote:
> 
> Hi,
>  
> We are working on a .net application using Solr. When we initially launched 
> the site we were using the 5.5.3 version, and last sprint we updated it to the 
> 7.3.1 version. Everything is working fine as expected except for one feature.
>  
> The exact phrase search does not return any value for some search criteria, 
> and this used to work fine with the older version. Based on our research, 
> search terms containing a stop word followed by more than one word are not 
> working.
>  
> The field has been defined as a text_general type in the schema and below are 
> the tokenizers and filters used during indexing and querying.
>  
> 
>  
> Eg. 
>  
>   • “PROMOTING SCHOOL READINESS AMONG LOW-INCOME FAMILIES” – This works. 
> No stop words.
>   • “national institutes of health” – This works as well. Notice that 
> there is a stop word (of) but only one word following it.
>   • “Structure of choroid plexus” – Does not work. Notice there are more 
> than two words following the stop word (of).
>   • "Health and Human Services" – This doesn’t work but “Health and 
> Human” works.
>  
> Please let me know if there is something I am missing and if something is 
> unclear or you need to reach out to me to discuss further.
>  
> Thanks,
> Deepti



Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Yes, while traditional-to-simplified transformation would be out of the
scope of Unicode normalization,
you would still want to add ICUNormalizer2CharFilterFactory anyway :)

Let me refine my example settings:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Regards,
Tomoko


Sat, Jul 21, 2018, 2:54 Alexandre Rafalovitch :

> Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> template of what needs to be done.
>
> Regards,
>Alex.
>
> On 20 July 2018 at 12:40, Walter Underwood  wrote:
> > Looks like we need a charfilter version of the ICU transforms. That
> could run before the tokenizer.
> >
> > I’ve never built a charfilter, but it seems like this would be a good
> first project for someone who wants to contribute.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com> wrote:
> >>
> >> Exactly. More concretely, the starting point is: replacing your analyzer
> >>
> >> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> >>
> >> to
> >>
> >> <analyzer>
> >>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >> </analyzer>
> >>
> >> and see if the results are as expected. Then research other filters if
> >> your requirements are not met.
> >>
> >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> >> characters, as I noted in a previous post, so ICUTransformFilterFactory
> >> is an incomplete workaround.
> >>
> >> Sat, Jul 21, 2018, 0:05 Walter Underwood :
> >>
> >>> I expect that this is the line that does the transformation:
> >>>
> >>> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >>>
> >>> This mapping is a standard feature of ICU. More info on ICU transforms
> is
> >>> in this doc, though not much detail on this particular transform.
> >>>
> >>> http://userguide.icu-project.org/transforms/general
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
>  On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> >>> wrote:
> 
>  I think so.  I used the exact as in github
> 
>    positionIncrementGap="1" autoGeneratePhraseQueries="false">
>  
>    
>    
> class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >>> id="Traditional-Simplified"/>
> >>> id="Katakana-Hiragana"/>
>    
>  hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>  
>  
> 
> 
> 
>  On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> amanda.shu...@gmail.com
> 
>  wrote:
> 
> > Thanks! That does indeed look promising... This can be added on top
> of
> > Smart Chinese, right? Or is it an alternative?
> >
> >
> > --
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy
> Project
> > 
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
> >
> >
> > On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> susheel2...@gmail.com>
> > wrote:
> >
> >> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
> >> then each of A, B, C or D in the query, and they seem to be matching;
> >> CJKFF is transforming the 舊 to 旧
> >>
> >> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <
> susheel2...@gmail.com>
> >> wrote:
> >>
> >>> I lack Chinese language knowledge, but if you want, I can do a quick
> >>> test for you in the Analysis tab if you give me what to put in the
> >>> index and query windows...
> >>>
> >>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <
> susheel2...@gmail.com
> 
> >>> wrote:
> >>>
>  Have you tried to use CJKFoldingFilter
>  https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this
>  would cover your use case, but I am using this filter and so far no issues.
> 
>  Thnx
> 
>  On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> > amanda.shu...@gmail.com
> >>>
>  wrote:
> 
> > Thanks, Alex - I have seen a few of those links but never
> considered
> > transliteration! We use lucene's Smart Chinese analyzer. The
> issue
> >>> is
> > basically what is laid out in the old blogspot post, namely this
> > point:
> >
> >
> > "Why approach CJK resource discovery differently?
> >
> > 2.  Search results must be as script agnostic as possible.
> >
> > There is more than one way to write each word. "Simplified"
> > characters
> > were
> > emphasized for printed materials in mainland China starting in
> the
> >> 1950s;
> > "Traditional" characters were used in printed materials prior t

Exact Phrase search not returning results.

2018-07-20 Thread Krishnan, Deepti (NIH/OD) [C]
Hi,

We are working on a .net application using Solr. When we initially launched the 
site we were using the 5.5.3 version, and last sprint we updated it to the 7.3.1 
version. Everything is working fine as expected except for one feature.

The exact phrase search does not return any value for some search criteria, and 
this used to work fine with the older version. Based on our research, search 
terms containing a stop word followed by more than one word are not working.

The field has been defined as a text_general type in the schema and below are 
the tokenizers and filters used during indexing and querying.

[cid:image001.jpg@01D42015.107F3080]

Eg.


  *   "PROMOTING SCHOOL READINESS AMONG LOW-INCOME FAMILIES" - This works. No 
stop wods
  *   "national institutes of health" - This works as well. Notice that there 
is a stop word (of) but only one word following it
  *   "Structure of choroid plexus" - Does not work. Notice there are more than 
2 words following the stop word(of)
  *   "Health and Human Services" - This doesn't work but "Health and Human" 
works.

Please let me know if there is something I am missing and if something is 
unclear or you need to reach out to me to discuss further.

Thanks,
Deepti



What is the cause of the below error?

2018-07-20 Thread rgummadi
What is the cause of the below error? Is it a disconnect from the overseer
node to the ZooKeeper node? We are running a cluster with Solr 4.6.

org.apache.solr.handler.admin.CoreAdminHandler -
:org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer/queue/qn-
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
at
org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:240)
at
org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:237)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
at
org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:237)
at
org.apache.solr.cloud.DistributedQueue.createData(DistributedQueue.java:284)
at
org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:271)
at
org.apache.solr.cloud.ZkController.publish(ZkController.java:1071)
at
org.apache.solr.cloud.ZkController.publish(ZkController.java:1019)
at
org.apache.solr.cloud.ZkController.publish(ZkController.java:1015)
at
org.apache.solr.handler.admin.CoreAdminHandler$2.run(CoreAdminHandler.java:762)




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
template of what needs to be done.

Regards,
   Alex.

On 20 July 2018 at 12:40, Walter Underwood  wrote:
> Looks like we need a charfilter version of the ICU transforms. That could run 
> before the tokenizer.
>
> I’ve never built a charfilter, but it seems like this would be a good first 
> project for someone who wants to contribute.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida  
>> wrote:
>>
>> Exactly. More concretely, the starting point is: replacing your analyzer
>>
>> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>>
>> to
>>
>> <analyzer>
>>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>> </analyzer>
>>
>> and see if the results are as expected. Then research other filters if
>> your requirements are not met.
>>
>> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
>> characters, as I noted in a previous post, so ICUTransformFilterFactory
>> is an incomplete workaround.
>>
>> Sat, Jul 21, 2018, 0:05 Walter Underwood :
>>
>>> I expect that this is the line that does the transformation:
>>>
>>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>
>>> This mapping is a standard feature of ICU. More info on ICU transforms is
>>> in this doc, though not much detail on this particular transform.
>>>
>>> http://userguide.icu-project.org/transforms/general
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
 On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
>>> wrote:

 I think so.  I used the exact as in github

 <fieldtype name="text_cjk" class="solr.TextField"
   positionIncrementGap="1" autoGeneratePhraseQueries="false">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"/>
     <filter class="solr.CJKWidthFilterFactory"/>
     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
     <filter class="solr.ICUFoldingFilterFactory"/>
     <filter class="solr.CJKBigramFilterFactory" han="true"
       hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
   </analyzer>
 </fieldtype>



 On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman >>>
 wrote:

> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> wrote:
>
>> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
>> then each of A, B, C or D in the query, and they seem to be matching;
>> CJKFF is transforming the 舊 to 旧
>>
>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
>> wrote:
>>
>>> I lack Chinese language knowledge, but if you want, I can do a quick
>>> test for you in the Analysis tab if you give me what to put in the
>>> index and query windows...
>>>
>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar >>>
>>> wrote:
>>>
 Have you tried to use CJKFoldingFilter
 https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this
 would cover your use case, but I am using this filter and so far no issues.

 Thnx

 On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> amanda.shu...@gmail.com
>>>
 wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use lucene's Smart Chinese analyzer. The issue
>>> is
> basically what is laid out in the old blogspot post, namely this
> point:
>
>
> "Why approach CJK resource discovery differently?
>
> 2.  Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified"
> characters
> were
> emphasized for printed materials in mainland China starting in the
>> 1950s;
> "Traditional" characters were used in printed materials prior to the
> 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are
> written
> in two scripts.
> Another way to think about it:  every written Chinese word has at
> least
> two
> completely different spellings.  And it can be mix-n-match:  a word
> can
> be
> written with one traditional  and one simplified character.
> Example:   Given a user query 舊小說  (traditional for old fiction),
>>> the
> results should include matches for 舊小說 (traditional) and 旧小说
>> (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder things

Re: Good practices on indexing larger amount of documents at once using SolrJ

2018-07-20 Thread Erick Erickson
I do this all the time with batches of 1,000 and don't see this problem.

one thing that sometimes bites people is to fail to clear the doclist
after every call to add. So you send ever-increasing batches to Solr.
Assuming that when you talk about batch size you mean the size of the
solrDocumentList, increasing it would make the broken pipe problem
worse if anything...

Also, it's generally bad practice to commit after every batch. That's not
your problem here, just something to note. Let your autocommit
settings in solrconfig handle it or specify commitWithin in your
add call.

I'd also look in your Solr logs and see if there's a problem there.

Net-net is this is a perfectly reasonable pattern, I suspect some
innocent-seeming problem with your indexing code.

Best,
Erick



On Fri, Jul 20, 2018 at 9:32 AM, Arunan Sugunakumar
 wrote:
> Hi,
>
> I have around 12 million objects in my PostgreSQL database to be indexed.
> I'm running a thread to fetch the rows from the database. The thread will
> also create the documents and put them in an indexing queue. While this is
> happening, my main process will retrieve the documents from the queue and
> will index them in batches of 1000. For some time the process runs as
> expected, but after some time, I get an exception.
>
> *[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
> IOException occured when talking to server at:
> http://localhost:8983/solr/mine-search
> ……….……….[corePostProcess]
> Caused by: java.net.SocketException: Broken pipe (Write
> failed)[corePostProcess]at
> java.net.SocketOutputStream.socketWrite0(Native Method)*
>
>
> I tried increasing the batch size up to 3. Then I got a different
> exception.
>
> *[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
> IOException occured when talking to server at:
> http://localhost:8983/solr/mine-search
> …….….[corePostProcess]
> Caused by: org.apache.http.NoHttpResponseException: localhost:8983 failed
> to respond*
>
>
> I would like to know whether there are any good practices on handling such
> situations, such as the max number of documents to index in one attempt, etc.
>
> My environment:
>
> Version : solr 7.2, solrj 7.2
> Ubuntu 16.04
> RAM 20GB
> I started Solr in standalone mode.
> Number of replicas and shards : 1
>
> The method I used :
> UpdateResponse response = solrClient.add(solrDocumentList);
> solrClient.commit();
>
>
> Thanks in advance.
>
> Arunan


Re: Sorting issue while using collection parameter

2018-07-20 Thread Erick Erickson
Just tried this on master and can't reproduce.

Didn't try 5.4.

Any chance this is a multiValued field? That can sometimes confuse things.

Best,
Erick

On Fri, Jul 20, 2018 at 2:50 AM, Vijay Tiwary  wrote:
> Hello Erick
>
> We are using a string field and the data is stored in lower case while
> indexing. We have an alias set up to query multiple collections
> simultaneously: alias=collection1,collection2.
> If we query through the alias then sorting is broken. E.g., results
> for a descending sort are as follows. (Empty lines are documents with no
> value for the field on which sorting is applied.)
> It seems there is an issue in Solr while aggregating the results of the
> individual shard responses:
> d
> d
> d
>
>
> b
> b
> b
> c
> c
>
> b
> b
>
>
>
> On Fri, 29 Jun 2018, 9:16 pm Erick Erickson, 
> wrote:
>
>> What _is_ your expectation? You haven't provided any examples of what
>> your input and expectations _are_.
>>
>> You might review: https://wiki.apache.org/solr/UsingMailingLists
>>
>> string types are case-sensitive for instance, so that's one thing that
>> could be happening. You
>> can also specify sortMissingFirst/Last to determine where docs with
>> missing fields appear in the results.
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 29, 2018 at 3:13 AM, Vijay Tiwary 
>> wrote:
>> > Hello Eric,
>> >
>> > title is a string field
>> >
>> > On Wed, 27 Jun 2018, 9:21 pm Erick Erickson, 
>> > wrote:
>> >
>> >> what kind of field is title? text_general or something? Sorting on a
>> >> tokenized field is usually something you don't want to do. If a field
>> >> has aardvark and zebra, how would it sort?
>> >>
>> >> There's usually something like alphaOnlySort. People often copyField
>> >> from "title" to "title_sort" and search on "title" and sort on
>> >> title_sort.
>> >>
>> >> alphaOnlySort uses KeywordTokenizer and LowercaseFilterFactory.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Wed, Jun 27, 2018 at 12:45 AM, Vijay Tiwary <
>> vijaykr.tiw...@gmail.com>
>> >> wrote:
>> >> > Hello Team,
>> >> >
>> >> > I have multiple collection on solr (5.4.1) cloud based on year
>> >> > content2107
>> >> > content2018
>> >> >
>> >> > Also I have a collection "content" which does not have any data.
>> >> >
>> >> > Now if I query them as follows
>> >> > http://host:port/solr/content/select?q=*:*&collection=content2107,content2108&sort=title asc
>> >> >
>> >> > Where title is a string field, the results are not getting sorted as
>> >> > expected. Also note the value for title is not present for some
>> >> > documents.
>> >> >
>> >> > Please help.
>> >>
>>


Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
Looks like we need a charfilter version of the ICU transforms. That could run 
before the tokenizer.

I’ve never built a charfilter, but it seems like this would be a good first 
project for someone who wants to contribute.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida  
> wrote:
> 
> Exactly. More concretely, the starting point is: replacing your analyzer
> 
> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> 
> to
> 
> <analyzer>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
> 
> and see if the results are as expected. Then research other filters if
> your requirements are not met.
> 
> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> characters, as I noted in a previous post, so ICUTransformFilterFactory
> is an incomplete workaround.
> 
> Sat, Jul 21, 2018, 0:05 Walter Underwood :
> 
>> I expect that this is the line that does the transformation:
>> 
>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>> 
>> This mapping is a standard feature of ICU. More info on ICU transforms is
>> in this doc, though not much detail on this particular transform.
>> 
>> http://userguide.icu-project.org/transforms/general
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
>> wrote:
>>> 
>>> I think so.  I used the exact as in github
>>> 
>>> <fieldtype name="text_cjk" class="solr.TextField"
>>>   positionIncrementGap="1" autoGeneratePhraseQueries="false">
>>>   <analyzer>
>>>     <tokenizer class="solr.ICUTokenizerFactory"/>
>>>     <filter class="solr.CJKWidthFilterFactory"/>
>>>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>     <filter class="solr.ICUFoldingFilterFactory"/>
>>>     <filter class="solr.CJKBigramFilterFactory" han="true"
>>>       hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>>   </analyzer>
>>> </fieldtype>
>>> 
>>> 
>>> 
>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman >> 
>>> wrote:
>>> 
 Thanks! That does indeed look promising... This can be added on top of
 Smart Chinese, right? Or is it an alternative?
 
 
 --
 Dr. Amanda Shuman
 Post-doc researcher, University of Freiburg, The Maoist Legacy Project
 
 PhD, University of California, Santa Cruz
 http://www.amandashuman.net/
 http://www.prchistoryresources.org/
 Office: +49 (0) 761 203 4925
 
 
 On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
 wrote:
 
> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
> then each of A, B, C or D in the query, and they seem to be matching;
> CJKFF is transforming the 舊 to 旧
> 
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
> 
>> I lack Chinese language knowledge, but if you want, I can do a quick
>> test for you in the Analysis tab if you give me what to put in the
>> index and query windows...
>> 
>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar >> 
>> wrote:
>> 
>>> Have you tried to use CJKFoldingFilter
>>> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this
>>> would cover your use case, but I am using this filter and so far no issues.
>>> 
>>> Thnx
>>> 
>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
 amanda.shu...@gmail.com
>> 
>>> wrote:
>>> 
 Thanks, Alex - I have seen a few of those links but never considered
 transliteration! We use lucene's Smart Chinese analyzer. The issue
>> is
 basically what is laid out in the old blogspot post, namely this
 point:
 
 
 "Why approach CJK resource discovery differently?
 
 2.  Search results must be as script agnostic as possible.
 
 There is more than one way to write each word. "Simplified"
 characters
 were
 emphasized for printed materials in mainland China starting in the
> 1950s;
 "Traditional" characters were used in printed materials prior to the
 1950s,
 and are still used in Taiwan, Hong Kong and Macau today.
 Since the characters are distinct, it's as if Chinese materials are
 written
 in two scripts.
 Another way to think about it:  every written Chinese word has at
 least
 two
 completely different spellings.  And it can be mix-n-match:  a word
 can
 be
 written with one traditional  and one simplified character.
 Example:   Given a user query 舊小說  (traditional for old fiction),
>> the
 results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
 characters for old fiction)"
 
 So, using the example provided above, we are dealing with materials
 produced in the 1950s-1970s that do even weirder things like:
 
 A. 舊小說
 
 can also be
 
 B. 旧小说 (all simplified)
 or
 C. 旧小說 (first character simplified, last character traditional)
 or
 D. 舊小 说 (first character traditional, last character simplified)
 
 Thankf

Good practices on indexing larger amount of documents at once using SolrJ

2018-07-20 Thread Arunan Sugunakumar
Hi,

I have around 12 million objects in my PostgreSQL database to be indexed.
I'm running a thread to fetch the rows from the database. The thread will
also create the documents and put them in an indexing queue. While this is
happening, my main process will retrieve the documents from the queue and
will index them in batches of 1000. For some time the process runs as
expected, but after some time, I get an exception.

*[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
IOException occured when talking to server at:
http://localhost:8983/solr/mine-search
……….……….[corePostProcess]
Caused by: java.net.SocketException: Broken pipe (Write
failed)[corePostProcess]at
java.net.SocketOutputStream.socketWrite0(Native Method)*


I tried increasing the batch size up to 3. Then I got a different
exception.

*[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
IOException occured when talking to server at:
http://localhost:8983/solr/mine-search
…….….[corePostProcess]
Caused by: org.apache.http.NoHttpResponseException: localhost:8983 failed
to respond*


I would like to know whether there are any good practices on handling such
situations, such as the max number of documents to index in one attempt, etc.

My environment:

Version : solr 7.2, solrj 7.2
Ubuntu 16.04
RAM 20GB
I started Solr in standalone mode.
Number of replicas and shards : 1

The method I used :
UpdateResponse response = solrClient.add(solrDocumentList);
solrClient.commit();


Thanks in advance.

Arunan


Re: Creating a collection in Solr standalone mode using solrj

2018-07-20 Thread Arunan Sugunakumar
Hi Jason and Shawn,

As you mentioned, I've mixed up the concepts of a collection and a core. Thank
you for clearing that up.

Thank you,
Arunan


On 20 July 2018 at 20:31, Shawn Heisey  wrote:

> On 7/20/2018 12:09 AM, Arunan Sugunakumar wrote:
> > I would like to know whether it is possible to create a collection in
> Solr
> > through SolrJ. I tried to create one and it throws me an error saying
> > "Solr instance is not running in SolrCloud mode."
>
> A "collection" is a SolrCloud concept.  Collections are comprised of one
> or more shards.  Shards are comprised of one or more replicas.  Each
> shard replica is a Solr index core.  The Collections API, which is most
> likely what you are calling in SolrJ when you get that error, only works
> in SolrCloud mode.
>
> Standalone mode only has cores.  They are not called collections.
>
> You can use the CoreAdmin API in standalone mode, but be aware of the
> large warning box here titled "CREATE must be able to find a
> configuration":
>
> https://lucene.apache.org/solr/guide/7_4/coreadmin-api.html
>
> Typically, filesystem access is required to create the configuration
> when running in standalone mode, before the core can be created.  If you
> want to be able to create indexes completely remotely, that is a whole
> lot easier if Solr is running in SolrCloud mode.
>
> Something that the documentation for the CoreAdmin API doesn't say, but
> probably should, is that in most usage, instanceDir and dataDir should
> not be specified.  The instanceDir will default to the same name as the
> core, and it will live in the solr home.  The dataDir defaults to
> "./data" relative to the instanceDir, and this is usually the best
> option.  Changing these is an expert option.
>
> Thanks,
> Shawn
>
>


Re: SOLR 7.1 ClassicSimilarityFactory Problem

2018-07-20 Thread Erick Erickson
Why do you think you need to "fix" anything here?

FieldNorm here is significantly different. On a quick scan (and you're
right, trying to understand it all at a glance is daunting) your
fieldNorm is lowering the score of the second doc. Basically the
"two hits" are in a longer field so their weight is less. Which is
part of the basic function of scoring.

Plus it looks like you've n-grammed the field, which is further
confusing the issue.

I don't see what rows is changing, please point it out. You're getting
the exact same score for the reported documents, it's just that
as you add more rows you get information for more docs as far as
I can tell.

You can try omitting norms and/or creating a non-ngrammed field.

As for why it's different from 4x, no clue. Perhaps the Lucene
folks can weigh in.

Best,
Erick
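
To make the fieldNorm effect concrete, here is the arithmetic from the two
debug explains quoted below. ClassicSimilarity scores each clause as
$fieldWeight = tf \times idf \times fieldNorm$, and its length norm is
$1/\sqrt{numTerms}$:

  Cityview, "city" clause:
    $2.0 \times 1.9583277 \times 1.0 = 3.9166553$

  "Clutch City Sports & Entertainment", "city" clause:
    $2.4494898 \times 1.9583277 \times 0.35355338 = 1.6959615$

The 0.35355338 is consistent with $1/\sqrt{8}$, i.e. a field that analyzed to
eight terms (the n-gramming inflates that count). So the higher term frequency
in the longer name is more than cancelled by its length normalization, which
is what "their weight is less" means.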

On Fri, Jul 20, 2018 at 8:41 AM, Hodder, Rick  wrote:

> I am using SOLR 7.1
>
> ClassicSimilarityFactory
>
> I have data in my core with field called CompanyName in an indexed field
> IDX_CompanyName
>
>
>
> <field name="IDX_CompanyName" ... indexed="true" stored="false"
> multiValued="true" />
>
> [fieldType definition stripped by the mailing list]
>
>
>
> Here are a few of the 900,000 rows in the core
>
>
>
> Cityview
>
> Citadel
>
> CivicVentures
>
> Clutch City Sports
>
> Clutch City Sports & Entertainment
>
> Clutch City Sports & Entertainment
>
> Clutch City Sports & Entertainment
>
>
>
>
>
> If I *search* for IDX_Company:(clutch AND city) with fl=*,score and
> maxrows of 750 (and again at 1500), I get the following results:
>
>
>
> CompanyName      Score
> Cityview         5.874983
> Citadel          5.3502507
> CivicVentures    4.7278214
>
> 
>
>
>
> If I *search* for IDX_Company:(clutch AND city) and a maxrows of 5000 I
> get the following results
>
>
>
> CompanyName                           Score
> Cityview                              5.874983
> Citadel                               5.3502507
> CivicVentures                         4.7278214
> Clutch City Sports & Entertainment    3.6542892
> Clutch City Sports & Entertainment    3.6542892
> Clutch City Sports & Entertainment    3.6542892
>
>
>
> I've tried looking at the debug query to figure out what it's doing, and I'm
> confused by what it is saying.
>
>
>
> The debug info for Cityview is
>
>
>
> 
>
> 5.874983 = sum of:
>
>   1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl
> IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc
> IDX_CompanyName:clutch) in 16639) [ClassicSimilarity], result of:
>
> 1.9583277 = fieldWeight in 16639, product of:
>
>   1.0 = tf(freq=1.0), with freq of:
>
> 1.0 = termFreq=1.0
>
>   1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
> 166407.0 = docFreq
>
> 433880.0 = docCount
>
>   1.0 = fieldNorm(doc=16639)
>
>   3.9166553 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci
> IDX_CompanyName:cit IDX_CompanyName:city) in 16639) [ClassicSimilarity],
> result of:
>
> 3.9166553 = fieldWeight in 16639, product of:
>
>   2.0 = tf(freq=4.0), with freq of:
>
> 4.0 = termFreq=4.0
>
>   1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
> 166407.0 = docFreq
>
> 433880.0 = docCount
>
>   1.0 = fieldNorm(doc=16639)
>
> 
>
>
>
> The debug info for Clutch City Sports & Entertainment is
>
>
>
> 
>
> 3.6542892 = sum of:
>
>   1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl
> IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc
> IDX_CompanyName:clutch) in 9549) [ClassicSimilarity], result of:
>
> 1.9583277 = fieldWeight in 9549, product of:
>
>   2.828427 = tf(freq=8.0), with freq of:
>
> 8.0 = termFreq=8.0
>
>   1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
> 166407.0 = docFreq
>
> 433880.0 = docCount
>
>   0.35355338 = fieldNorm(doc=9549)
>
>   1.6959615 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci
> IDX_CompanyName:cit IDX_CompanyName:city) in 9549) [ClassicSimilarity],
> result of:
>
> 1.6959615 = fieldWeight in 9549, product of:
>
>   2.4494898 = tf(freq=6.0), with freq of:
>
> 6.0 = termFreq=6.0
>
>   1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
> 166407.0 = docFreq
>
> 433880.0 = docCount
>
>   0.35355338 = fieldNorm(doc=9549)
>
> 
>
>
>
> Why would something with 2 hits score lower? Why does the max rows
> influence this?
>
>
>
> How might I fix this?
>
>
>
> This didn’t use to happen in SOLR 4.10 (I know it’s an older version, but…)
>
>
>
>
>
> Thanks,
>
>
>
> Rick Hodder
>
> Information Technology
>
> Navigators Management Company, Inc.
>
> 83 Wooster Heights Road, 2nd Floor
>
> Danbury, CT  06810
>
> (475) 329-6251
>
>
>
>
>
>


SOLR 7.1 ClassicSimilarityFactory Problem

2018-07-20 Thread Hodder, Rick
I am using SOLR 7.1
ClassicSimilarityFactory
I have data in my core with a field called CompanyName in an indexed field 
IDX_CompanyName:

<field name="IDX_CompanyName" ... indexed="true" stored="false" multiValued="true" />

[fieldType definition stripped by the mailing list]
Here are a few of the 900,000 rows in the core

Cityview
Citadel
CivicVentures
Clutch City Sports
Clutch City Sports & Entertainment
Clutch City Sports & Entertainment
Clutch City Sports & Entertainment


If I search for IDX_Company:(clutch AND city) with fl=*,score and maxrows of 
750 (and again at 1500), I get the following results:

CompanyName      Score
Cityview         5.874983
Citadel          5.3502507
CivicVentures    4.7278214


If I search for IDX_Company:(clutch AND city) and a maxrows of 5000 I get the 
following results

CompanyName                           Score
Cityview                              5.874983
Citadel                               5.3502507
CivicVentures                         4.7278214
Clutch City Sports & Entertainment    3.6542892
Clutch City Sports & Entertainment    3.6542892
Clutch City Sports & Entertainment    3.6542892

I've tried looking at the debug query to figure out what it's doing, and I'm 
confused by what it is saying.

The debug info for Cityview is


5.874983 = sum of:
  1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl 
IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc 
IDX_CompanyName:clutch) in 16639) [ClassicSimilarity], result of:
1.9583277 = fieldWeight in 16639, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
166407.0 = docFreq
433880.0 = docCount
  1.0 = fieldNorm(doc=16639)
  3.9166553 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci 
IDX_CompanyName:cit IDX_CompanyName:city) in 16639) [ClassicSimilarity], result of:
3.9166553 = fieldWeight in 16639, product of:
  2.0 = tf(freq=4.0), with freq of:
4.0 = termFreq=4.0
  1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
166407.0 = docFreq
433880.0 = docCount
  1.0 = fieldNorm(doc=16639)


The debug info for Clutch City Sports & Entertainment is


3.6542892 = sum of:
  1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl 
IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc 
IDX_CompanyName:clutch) in 9549) [ClassicSimilarity], result of:
1.9583277 = fieldWeight in 9549, product of:
  2.828427 = tf(freq=8.0), with freq of:
8.0 = termFreq=8.0
  1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
166407.0 = docFreq
433880.0 = docCount
  0.35355338 = fieldNorm(doc=9549)
  1.6959615 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci 
IDX_CompanyName:cit IDX_CompanyName:city) in 9549) [ClassicSimilarity], result of:
1.6959615 = fieldWeight in 9549, product of:
  2.4494898 = tf(freq=6.0), with freq of:
6.0 = termFreq=6.0
  1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
166407.0 = docFreq
433880.0 = docCount
  0.35355338 = fieldNorm(doc=9549)


Why would something with 2 hits score lower? Why does the max rows influence 
this?

How might I fix this?

This didn't use to happen in SOLR 4.10 (I know it's an older version, but...)


Thanks,

Rick Hodder
Information Technology
Navigators Management Company, Inc.
83 Wooster Heights Road, 2nd Floor
Danbury, CT  06810
(475) 329-6251




Time Routed Aliases & CDCR

2018-07-20 Thread Pavel Micka
Hello,

We are planning to implement Time Routed Aliases in our solution. But one of 
our requirements is to be able to provide disaster recovery in case one of two 
data centers dies. We have a network between the DCs which is potentially 
unstable and has latencies in the hundreds of milliseconds.

We were recommended to use CDCR, and it really seems to fit our needs. But after 
reading the docs, I have some questions.


1)  With TRA, we define a single solrconfig.xml; this solrconfig is then 
assigned to each new collection when it is automatically created by the TRA logic.

a.   BUT CDCR requires us to specify sourceCollectionName and 
targetCollectionName 
(https://lucene.apache.org/solr/guide/7_4/cdcr-config.html#cdcr-config), and I 
can't specify them, because the same solrconfig is applied to all collections 
behind the alias. I also do not have the creation of collections in my hands; 
it's done automatically. (And I do not get why I need to specify the names at 
all, when the solrconfig.xml file is per collection...)

2)  The CDCR docs state that "Configuration files (solrconfig.xml, 
managed-schema, etc.) are not automatically synchronized between the Source and 
Target clusters." Does this also apply to files stored in ZooKeeper, or only 
to those on disk? If it also applies to those in ZK, we may have a problem: the 
collections are created automatically, so we can't easily detect that we should 
do the ZK sync to the backup site.

If there is some smarter way to do disaster recovery (a 2-node Solr setup) to a 
backup site (over a possibly bad network), please let me know either on this 
mailing list or on Stack Overflow 
(https://stackoverflow.com/questions/51425009/solrcloud-2-nodes-cluster).

Thanks,

Pavel




Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Exactly. More concretely, the starting point is: replacing your analyzer

<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

to

<analyzer>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

and see if the results are as expected. Then research other filters if
your requirements are not met.

Just a reminder: HMMChineseTokenizerFactory does not handle traditional
characters, as I noted in a previous post, so ICUTransformFilterFactory is an
incomplete workaround.

Sat, Jul 21, 2018, 0:05 Walter Underwood :

> I expect that this is the line that does the transformation:
>
> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>
> This mapping is a standard feature of ICU. More info on ICU transforms is
> in this doc, though not much detail on this particular transform.
>
> http://userguide.icu-project.org/transforms/general
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> wrote:
> >
> > I think so.  I used the exact as in github
> >
> > <fieldtype name="text_cjk" class="solr.TextField"
> >   positionIncrementGap="1" autoGeneratePhraseQueries="false">
> >   <analyzer>
> >     <tokenizer class="solr.ICUTokenizerFactory"/>
> >     <filter class="solr.CJKWidthFilterFactory"/>
> >     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> >     <filter class="solr.ICUFoldingFilterFactory"/>
> >     <filter class="solr.CJKBigramFilterFactory" han="true"
> >       hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> >   </analyzer>
> > </fieldtype>
> >
> >
> >
> > On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman  >
> > wrote:
> >
> >> Thanks! That does indeed look promising... This can be added on top of
> >> Smart Chinese, right? Or is it an alternative?
> >>
> >>
> >> --
> >> Dr. Amanda Shuman
> >> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >> 
> >> PhD, University of California, Santa Cruz
> >> http://www.amandashuman.net/
> >> http://www.prchistoryresources.org/
> >> Office: +49 (0) 761 203 4925
> >>
> >>
> >> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> >> wrote:
> >>
> >>> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
> >>> then each of A, B, C or D in the query, and they seem to be matching;
> >>> CJKFF is transforming the 舊 to 旧
> >>>
> >>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> >>> wrote:
> >>>
>  I lack Chinese language knowledge, but if you want, I can do a quick
>  test for you in the Analysis tab if you give me what to put in the
>  index and query windows...
> 
>  On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar  >
>  wrote:
> 
> > Have you tried to use CJKFoldingFilter
> > https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this
> > would cover your use case, but I am using this filter and so far no issues.
> >
> > Thnx
> >
> > On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> >> amanda.shu...@gmail.com
> 
> > wrote:
> >
> >> Thanks, Alex - I have seen a few of those links but never considered
> >> transliteration! We use lucene's Smart Chinese analyzer. The issue
> is
> >> basically what is laid out in the old blogspot post, namely this
> >> point:
> >>
> >>
> >> "Why approach CJK resource discovery differently?
> >>
> >> 2.  Search results must be as script agnostic as possible.
> >>
> >> There is more than one way to write each word. "Simplified"
> >> characters
> >> were
> >> emphasized for printed materials in mainland China starting in the
> >>> 1950s;
> >> "Traditional" characters were used in printed materials prior to the
> >> 1950s,
> >> and are still used in Taiwan, Hong Kong and Macau today.
> >> Since the characters are distinct, it's as if Chinese materials are
> >> written
> >> in two scripts.
> >> Another way to think about it:  every written Chinese word has at
> >> least
> >> two
> >> completely different spellings.  And it can be mix-n-match:  a word
> >> can
> >> be
> >> written with one traditional  and one simplified character.
> >> Example:   Given a user query 舊小說  (traditional for old fiction),
> the
> >> results should include matches for 舊小說 (traditional) and 旧小说
> >>> (simplified
> >> characters for old fiction)"
> >>
> >> So, using the example provided above, we are dealing with materials
> >> produced in the 1950s-1970s that do even weirder things like:
> >>
> >> A. 舊小說
> >>
> >> can also be
> >>
> >> B. 旧小说 (all simplified)
> >> or
> >> C. 旧小說 (first character simplified, last character traditional)
> >> or
> >> D. 舊小 说 (first character traditional, last character simplified)
> >>
> >> Thankfully the middle character was never simplified in recent
> times.
> >>
> >> From a historical standpoint, the mixed nature of the characters in
> >> the
> >> same word/phrase is because not all simplified characters were
> >> adopted
> >>> at
> >> the same time by everyone uniformly (good times...).
> >>
> >> The problem seems to be that Solr can easily handle A or B above,
> but
> >> NOT C
> >> or D using the Smart Chinese analyzer. I'm not really sur

Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
I expect that this is the line that does the transformation:

   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

This mapping is a standard feature of ICU. More info on ICU transforms is in 
this doc, though not much detail on this particular transform. 

http://userguide.icu-project.org/transforms/general
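
For anyone who wants to try this mapping outside Solr, here is a small sketch
using the ICU4J Transliterator directly (the ICU4J dependency and the sample
string are assumptions; "Traditional-Simplified" is the system transform ID
referenced above):

import com.ibm.icu.text.Transliterator;

public class TradSimpDemo {
    public static void main(String[] args) {
        // The same system transform that ICUTransformFilterFactory wraps.
        Transliterator t = Transliterator.getInstance("Traditional-Simplified");
        // 舊小說 (traditional, "old fiction") should come out as 旧小说 (simplified).
        System.out.println(t.transliterate("舊小說"));
    }
}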

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 7:43 AM, Susheel Kumar  wrote:
> 
> I think so.  I used the exact as in github
> 
> <fieldtype name="text_cjk" class="solr.TextField"
>   positionIncrementGap="1" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true"
>       hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>   </analyzer>
> </fieldtype>
> 
> 
> 
> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman 
> wrote:
> 
>> Thanks! That does indeed look promising... This can be added on top of
>> Smart Chinese, right? Or is it an alternative?
>> 
>> 
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>> 
>> 
>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
>> wrote:
>> 
>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
>>> then each of A, B, C or D in the query, and they seem to be matching;
>>> CJKFF is transforming the 舊 to 旧
>>> 
>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
>>> wrote:
>>> 
 I lack Chinese language knowledge, but if you want, I can do a quick
 test for you in the Analysis tab if you give me what to put in the
 index and query windows...
 
 On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
 wrote:
 
> Have you tried to use CJKFoldingFilter
> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this
> would cover your use case, but I am using this filter and so far no issues.
> 
> Thnx
> 
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
>> amanda.shu...@gmail.com
 
> wrote:
> 
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this
>> point:
>> 
>> 
>> "Why approach CJK resource discovery differently?
>> 
>> 2.  Search results must be as script agnostic as possible.
>> 
>> There is more than one way to write each word. "Simplified"
>> characters
>> were
>> emphasized for printed materials in mainland China starting in the
>>> 1950s;
>> "Traditional" characters were used in printed materials prior to the
>> 1950s,
>> and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written
>> in two scripts.
>> Another way to think about it:  every written Chinese word has at
>> least
>> two
>> completely different spellings.  And it can be mix-n-match:  a word
>> can
>> be
>> written with one traditional  and one simplified character.
>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说
>>> (simplified
>> characters for old fiction)"
>> 
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>> 
>> A. 舊小說
>> 
>> can also be
>> 
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小 说 (first character traditional, last character simplified)
>> 
>> Thankfully the middle character was never simplified in recent times.
>> 
>> From a historical standpoint, the mixed nature of the characters in
>> the
>> same word/phrase is because not all simplified characters were
>> adopted
>>> at
>> the same time by everyone uniformly (good times...).
>> 
>> The problem seems to be that Solr can easily handle A or B above, but
>> NOT C
>> or D using the Smart Chinese analyzer. I'm not really sure how to
>>> change
>> that at this point... maybe I should figure out how to contact the
>> creators
>> of the analyzer and ask them?
>> 
>> Amanda
>> 
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>> Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>> 
>> 
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>> 
>>> This is probably your start, if not read already:
>>> https://luc

Re: Creating a collection in Solr standalone mode using solrj

2018-07-20 Thread Shawn Heisey
On 7/20/2018 12:09 AM, Arunan Sugunakumar wrote:
> I would like to know whether it is possible to create a collection in Solr
> through SolrJ. I tried to create one and it throws me an error saying
> "Solr instance is not running in SolrCloud mode."

A "collection" is a SolrCloud concept.  Collections are comprised of one
or more shards.  Shards are comprised of one or more replicas.  Each
shard replica is a Solr index core.  The Collections API, which is most
likely what you are calling in SolrJ when you get that error, only works
in SolrCloud mode.

Standalone mode only has cores.  They are not called collections.

You can use the CoreAdmin API in standalone mode, but be aware of the
large warning box here titled "CREATE must be able to find a configuration":

https://lucene.apache.org/solr/guide/7_4/coreadmin-api.html

Typically, filesystem access is required to create the configuration
when running in standalone mode, before the core can be created.  If you
want to be able to create indexes completely remotely, that is a whole
lot easier if Solr is running in SolrCloud mode.

Something that the documentation for the CoreAdmin API doesn't say, but
probably should, is that in most usage, instanceDir and dataDir should
not be specified.  The instanceDir will default to the same name as the
core, and it will live in the solr home.  The dataDir defaults to
"./data" relative to the instanceDir, and this is usually the best
option.  Changing these is an expert option.

Thanks,
Shawn



Re: Creating a collection in Solr standalone mode using solrj

2018-07-20 Thread Jason Gerlowski
Hi Arunan,

Solr runs in one of two main modes: "Cloud" mode or "Standalone" mode.
Collections can only be created in Cloud mode.  Standalone mode
doesn't allow creation of collections; it uses cores instead.  From
your error message above, it looks like the problem is that you're
trying to create a collection in "standalone" mode, which doesn't
support that.

SolrJ has methods to create both cores and collections, you just have
to have Solr running in the right mode:
- Collection creation:
https://lucene.apache.org/solr/7_4_0/solr-solrj/org/apache/solr/client/solrj/request/CollectionAdminRequest.Create.html
- Core creation:
https://lucene.apache.org/solr/7_4_0/solr-solrj/org/apache/solr/client/solrj/request/CoreAdminRequest.Create.html

You'll have to decide whether you want to run Solr in cloud or
standalone mode, and adjust your core/collection creation accordingly.
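
A rough sketch of both paths in SolrJ (the collection/core names, the
configset name, and the ZooKeeper address are hypothetical):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateExamples {
    public static void main(String[] args) throws Exception {
        // Cloud mode: create a collection (1 shard, 1 replica) from a
        // configset already uploaded to ZooKeeper.
        try (CloudSolrClient cloud = new CloudSolrClient.Builder()
                .withZkHost("localhost:9983").build()) {
            CollectionAdminRequest
                    .createCollection("mycollection", "myconfigset", 1, 1)
                    .process(cloud);
        }

        // Standalone mode: create a core; the configuration must already
        // exist on the Solr server's filesystem (see the CREATE warning
        // Shawn linked above).
        try (HttpSolrClient http = new HttpSolrClient.Builder(
                "http://localhost:8983/solr").build()) {
            CoreAdminRequest.Create create = new CoreAdminRequest.Create();
            create.setCoreName("mycore");
            create.setConfigSet("myconfigset");
            create.process(http);
        }
    }
}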

Best,

Jason
On Fri, Jul 20, 2018 at 2:09 AM Arunan Sugunakumar
 wrote:
>
> Hi,
>
> I would like to know whether it is possible to create a collection in Solr
> through SolrJ. I tried to create one and it throws me an error saying
> "Solr instance is not running in SolrCloud mode."
> I am trying to upgrade a system to use Solr which used the Lucene library in
> the past. In Lucene, everything is controlled via code and the user does not
> have to worry about creating collections. I am trying to replicate this
> experience in Solr.
>
> Thanks in Advance,
> Arunan


Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think so.  I used the exact as in github

<fieldtype name="text_cjk" class="solr.TextField"
  positionIncrementGap="1" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true"
      hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
  </analyzer>
</fieldtype>
On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman 
wrote:

> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> wrote:
>
> > I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
> > then each of A, B, C or D in the query, and they seem to be matching;
> > CJKFF is transforming the 舊 to 旧
> >
> > On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> > wrote:
> >
> > > I lack Chinese language knowledge, but if you want, I can do a quick
> > > test for you in the Analysis tab if you give me what to put in the
> > > index and query windows...
> > >
> > > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > > wrote:
> > >
> > >> Have you tried to use CJKFoldingFilter
> > >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this
> > >> would cover your use case, but I am using this filter and so far no issues.
> > >>
> > >> Thnx
> > >>
> > >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> amanda.shu...@gmail.com
> > >
> > >> wrote:
> > >>
> > >>> Thanks, Alex - I have seen a few of those links but never considered
> > >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> > >>> basically what is laid out in the old blogspot post, namely this
> point:
> > >>>
> > >>>
> > >>> "Why approach CJK resource discovery differently?
> > >>>
> > >>> 2.  Search results must be as script agnostic as possible.
> > >>>
> > >>> There is more than one way to write each word. "Simplified"
> characters
> > >>> were
> > >>> emphasized for printed materials in mainland China starting in the
> > 1950s;
> > >>> "Traditional" characters were used in printed materials prior to the
> > >>> 1950s,
> > >>> and are still used in Taiwan, Hong Kong and Macau today.
> > >>> Since the characters are distinct, it's as if Chinese materials are
> > >>> written
> > >>> in two scripts.
> > >>> Another way to think about it:  every written Chinese word has at
> least
> > >>> two
> > >>> completely different spellings.  And it can be mix-n-match:  a word
> can
> > >>> be
> > >>> written with one traditional  and one simplified character.
> > >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> > >>> results should include matches for 舊小說 (traditional) and 旧小说
> > (simplified
> > >>> characters for old fiction)"
> > >>>
> > >>> So, using the example provided above, we are dealing with materials
> > >>> produced in the 1950s-1970s that do even weirder things like:
> > >>>
> > >>> A. 舊小說
> > >>>
> > >>> can also be
> > >>>
> > >>> B. 旧小说 (all simplified)
> > >>> or
> > >>> C. 旧小說 (first character simplified, last character traditional)
> > >>> or
> > >>> D. 舊小 说 (first character traditional, last character simplified)
> > >>>
> > >>> Thankfully the middle character was never simplified in recent times.
> > >>>
> > >>> From a historical standpoint, the mixed nature of the characters in
> the
> > >>> same word/phrase is because not all simplified characters were
> adopted
> > at
> > >>> the same time by everyone uniformly (good times...).
> > >>>
> > >>> The problem seems to be that Solr can easily handle A or B above, but
> > >>> NOT C
> > >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> > change
> > >>> that at this point... maybe I should figure out how to contact the
> > >>> creators
> > >>> of the analyzer and ask them?
> > >>>
> > >>> Amanda
> > >>>
> > >>> --
> > >>> Dr. Amanda Shuman
> > >>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> Project
> > >>> 
> > >>> PhD, University of California, Santa Cruz
> > >>> http://www.amandashuman.net/
> > >>> http://www.prchistoryresources.org/
> > >>> Office: +49 (0) 761 203 4925
> > >>>
> > >>>
> > >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> > >>> arafa...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > This is probably your start, if not read already:
> > >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > >>> >
> > >>> > Otherwise, I think your answer would be somewhere around using
> ICU4J,
> > >>> > IBM's library for dealing with Unicode:
> http://site.icu-project.org/
> > >>> > (mentioned on the same page above)
> > >>> > Specifically, transformations:
> > >>> > http://userguide.icu-project.org/transforms/general
> > >>> >
> > >>> > With that, maybe you map both alphabets into latin. I did that once
> > >>> > for Thai for a demo:
> > >>> > https://github.com/arafalov/solr-thai-test/blob/master/
> > >>> > c

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Hi,

There is ICUTransformFilter (included in the Solr distribution) which should
also work for you.
See the example settings:
https://lucene.apache.org/solr/guide/7_4/filter-descriptions.html#icu-transform-filter

Combine it with HMMChineseTokenizer.
https://lucene.apache.org/solr/guide/7_4/language-analysis.html#hmm-chinese-tokenizer

In other words, replace your SmartChineseAnalyzer settings with an
HMMChineseTokenizer & ICUTransformFilter pipeline.
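
A field type along those lines might look like this (a sketch, not tested;
both factories ship in the analysis-extras contrib):

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- map Traditional Chinese to Simplified after tokenization -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>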


Here is a slightly more complicated explanation; you can skip it if you do
not want to go into analyzer details.

I do not understand Chinese, but it seems there are no easy or one-stop
solutions in my view. (As Japanese speakers, we have similar problems.)

HMMChineseTokenizer expects Simplified Chinese text.
See:
https://lucene.apache.org/core/7_4_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizer.html

So you should transform all traditional Chinese characters **before**
applying HMMChineseTokenizer, using CharFilters; otherwise the Tokenizer does
not work correctly.

Unfortunately, there is no such CharFilter as far as I know.
ICUNormalizer2CharFilter does not handle this transformation, so it is no
help. CJKFoldingFilter and ICUTransformFilter do the
traditional-simplified transformation; however, they are TokenFilters that
work after a Tokenizer has been applied.

I think you need two steps if you want to use HMMChineseTokenizer correctly.

1. Transform all traditional characters to simplified ones and save the
result to temporary files. I do not have a clear idea of the best tooling
for this, but you can create a small Java program that calls Lucene's
ICUTransformFilter (or ICU4J directly).
2. Then index into Solr using SmartChineseAnalyzer.
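
For step 1, a minimal sketch using the ICU4J Transliterator directly (the
same transform that ICUTransformFilter wraps) might be:

import com.ibm.icu.text.Transliterator;

public class TradToSimp {
  public static void main(String[] args) {
    // ICU's built-in Traditional->Simplified transform
    Transliterator t = Transliterator.getInstance("Traditional-Simplified");
    System.out.println(t.transliterate("舊小說"));  // prints 旧小说
  }
}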

Regards,
Tomoko

On Fri, Jul 20, 2018 at 22:12, Susheel Kumar :

> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and then
> each of A, B or C or D in query and they seems to be matching and CJKFF is
> transforming the 舊 to 旧
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
>
> > Lack of my chinese language knowledge but if you want, I can do quick
> test
> > for you in Analysis tab if you can give me what to put in index and query
> > window...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter
> >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
> >> your use case but I am using this filter and so far no issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman  >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> --
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> 
> >>> PhD, University of California, Santa Cruz
> >>> http://www.amanda

Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks! That does indeed look promising... This can be added on top of
Smart Chinese, right? Or is it an alternative?


--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
wrote:

> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and then
> each of A, B or C or D in query and they seems to be matching and CJKFF is
> transforming the 舊 to 旧
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
>
> > Lack of my chinese language knowledge but if you want, I can do quick
> test
> > for you in Analysis tab if you can give me what to put in index and query
> > window...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter
> >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
> >> your use case but I am using this filter and so far no issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman  >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> --
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> 
> >>> PhD, University of California, Santa Cruz
> >>> http://www.amandashuman.net/
> >>> http://www.prchistoryresources.org/
> >>> Office: +49 (0) 761 203 4925
> >>>
> >>>
> >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> >>> arafa...@gmail.com>
> >>> wrote:
> >>>
> >>> > This is probably your start, if not read already:
> >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>> >
> >>> > Otherwise, I think your answer would be somewhere around using ICU4J,
> >>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> >>> > (mentioned on the same page above)
> >>> > Specifically, transformations:
> >>> > http://userguide.icu-project.org/transforms/general
> >>> >
> >>> > With that, maybe you map both alphabets into latin. I did that once
> >>> > for Thai for a demo:
> >>> > https://github.com/arafalov/solr-thai-test/blob/master/
> >>> > collection1/conf/schema.xml#L34
> >>> >
> >>> > The challenge is to figure out all the magic rules for that. You'd
> >>> > have to dig through the ICU documentation and other web pages. I
> found
> >>> > this one for example:
> >>> > http://avajava.com/tutorials/lessons/what-are-the-system-
> >>> > transliterators-available-with-icu4j.html;jsessionid=
> >>> > BEAB0AF05A588B97B8A2393054D908C0
> >>> >
> >>> > There is also 12 part series on Solr and 

Re: Memory requirements for TLOGs (7.3.1)

2018-07-20 Thread Shawn Heisey
On 7/18/2018 6:33 PM, Ash Ramesh wrote:
> Thanks for the quick responses Shawn & Erick! Just to clarify another few
> points:
>  1. Does having a larger heap size impact ingesting additional documents to
> the index (all CRUD operations) onto a TLOG?

It's extremely difficult, maybe even impossible, for anyone on this list
to predict whether performance will be improved by increasing the heap,
at least not without some really concrete information from the system. 
If you shared your GC log and whatever activity you want to improve was
happening during that log creation, I could probably answer that
question for your specific server.

>  2. Does having a larger ram configured machine (in this case 32gb) affect
> ingestion on TLOGS also?

Having more memory for the OS disk cache does not usually improve
indexing performance.  The only kind of memory that is likely to matter
for that is heap memory.  Once you reach a sufficient heap size,
increasing it further won't help and might actually hurt performance.

>  3. We are currently routing queries via Amazon ASG / Load Balancer. Is
> this one of the recommended ways to set up SOLR infrastructure?

If your client software is not cloud-aware, you'll want an external load
balancer.  The only cloud-aware client that I know for sure exists is
the Java client, which is part of Solr itself as well as a standalone
client.  I did hear once about a cloud-aware client under development
for Python, but I do not know the status of that client -- it would be
third-party software.

Because you're using an external load balancer, you could list only the
PULL replicas in the load balancer back end configuration, and include
the preferLocalShards parameter on the request, so that SolrCloud will
not load balance the requests further.
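
For example (hypothetical host and collection names):

http://pull-lb.example.com/solr/mycollection/select?q=*:*&preferLocalShards=true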

Thanks,
Shawn



Re: SOLR 7.2.1 on SLES 11?

2018-07-20 Thread Shawn Heisey
On 7/19/2018 2:52 PM, Lichte, Lucas R - DHS (Tek Systems) wrote:
> Welp, that didn't go spectacularly.  All the OpenSuSE SLES 11 downloads are 
> RPM, both source and compiled.  Non-relocatable.  I did attempt to rebuild, 
> but it choked on the following dependencies:
>
> audit-devel is needed by bash-4.3-286.1.x86_64
> fdupes is needed by bash-4.3-286.1.x86_64
> patchutils is needed by bash-4.3-286.1.x86_64
>
> If I can find a repository for them I can throw that into Zypper, but thus 
> far I've failed.  Anyone out there have any suggestions?

If it were me in that situation, I would download the source code of the
latest stable version (4.4.18 as I write this) from gnu directly:

http://ftp.gnu.org/gnu/bash/

Then I would compile it and install it into /usr/local, which is where
it should install by default.  You will naturally need development stuff
including a C compiler.  I do not know whether there are any development
dependencies that bash requires.
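
Roughly, assuming the usual GNU build toolchain is already installed:

wget http://ftp.gnu.org/gnu/bash/bash-4.4.18.tar.gz
tar xzf bash-4.4.18.tar.gz
cd bash-4.4.18
./configure          # defaults to --prefix=/usr/local
make && sudo make install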

If SLES includes /usr/local/bin in the path by default, that might be
all you need.  But you might need to adjust the first line of each Solr
script to explicitly point at the new shell location.
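
For example, changing the first line of bin/solr and the other scripts from
"#!/usr/bin/env bash" (or "#!/bin/bash") to:

#!/usr/local/bin/bash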

Thanks,
Shawn



Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and then
each of A, B, C, or D in the query, and they seem to match; CJKFF is
transforming the 舊 to 旧.

On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
wrote:

> Lack of my chinese language knowledge but if you want, I can do quick test
> for you in Analysis tab if you can give me what to put in index and query
> window...
>
> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> wrote:
>
> >> Have you tried to use CJKFoldingFilter
> >> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
>> your use case but I am using this filter and so far no issues.
>>
>> Thnx
>>
>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
>> wrote:
>>
>>> Thanks, Alex - I have seen a few of those links but never considered
>>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>>> basically what is laid out in the old blogspot post, namely this point:
>>>
>>>
>>> "Why approach CJK resource discovery differently?
>>>
>>> 2.  Search results must be as script agnostic as possible.
>>>
>>> There is more than one way to write each word. "Simplified" characters
>>> were
>>> emphasized for printed materials in mainland China starting in the 1950s;
>>> "Traditional" characters were used in printed materials prior to the
>>> 1950s,
>>> and are still used in Taiwan, Hong Kong and Macau today.
>>> Since the characters are distinct, it's as if Chinese materials are
>>> written
>>> in two scripts.
>>> Another way to think about it:  every written Chinese word has at least
>>> two
>>> completely different spellings.  And it can be mix-n-match:  a word can
>>> be
>>> written with one traditional  and one simplified character.
>>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>>> characters for old fiction)"
>>>
>>> So, using the example provided above, we are dealing with materials
>>> produced in the 1950s-1970s that do even weirder things like:
>>>
>>> A. 舊小說
>>>
>>> can also be
>>>
>>> B. 旧小说 (all simplified)
>>> or
>>> C. 旧小說 (first character simplified, last character traditional)
>>> or
>>> D. 舊小 说 (first character traditional, last character simplified)
>>>
>>> Thankfully the middle character was never simplified in recent times.
>>>
>>> From a historical standpoint, the mixed nature of the characters in the
>>> same word/phrase is because not all simplified characters were adopted at
>>> the same time by everyone uniformly (good times...).
>>>
>>> The problem seems to be that Solr can easily handle A or B above, but
>>> NOT C
>>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>>> that at this point... maybe I should figure out how to contact the
>>> creators
>>> of the analyzer and ask them?
>>>
>>> Amanda
>>>
>>> --
>>> Dr. Amanda Shuman
>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>> 
>>> PhD, University of California, Santa Cruz
>>> http://www.amandashuman.net/
>>> http://www.prchistoryresources.org/
>>> Office: +49 (0) 761 203 4925
>>>
>>>
>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>>> arafa...@gmail.com>
>>> wrote:
>>>
>>> > This is probably your start, if not read already:
>>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>> >
>>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>>> > (mentioned on the same page above)
>>> > Specifically, transformations:
>>> > http://userguide.icu-project.org/transforms/general
>>> >
>>> > With that, maybe you map both alphabets into latin. I did that once
>>> > for Thai for a demo:
>>> > https://github.com/arafalov/solr-thai-test/blob/master/
>>> > collection1/conf/schema.xml#L34
>>> >
>>> > The challenge is to figure out all the magic rules for that. You'd
>>> > have to dig through the ICU documentation and other web pages. I found
>>> > this one for example:
>>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>>> > transliterators-available-with-icu4j.html;jsessionid=
>>> > BEAB0AF05A588B97B8A2393054D908C0
>>> >
>>> > There is also 12 part series on Solr and Asian text processing, though
>>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>>> >
>>> > Hope one of these things help.
>>> >
>>> > Regards,
>>> >Alex.
>>> >
>>> >
>>> > On 20 July 2018 at 03:54, Amanda Shuman 
>>> wrote:
>>> > > Hi all,
>>> > >
>>> > > We have a problem. Some of our historical documents have mixed
>>> together
>>> > > simplified and Chinese characters. There seems to be no problem when
>>> > > searching either traditional or simplified separately - that is, if a
>>> > > particular string/phrase is all in traditional or simplified, it
>>> finds
>>> > it -
>>> > > but it does not find the string/phrase if the two different
>>> characters
>>> >

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I lack Chinese language knowledge, but if you want, I can do a quick test
for you in the Analysis tab if you give me what to put in the index and query
windows...

On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
wrote:

> Have you tried to use CJKFoldingFilter
> https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> cover your use case but I am using this filter and so far no issues.
>
> Thnx
>
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
> wrote:
>
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this point:
>>
>>
>> "Why approach CJK resource discovery differently?
>>
>> 2.  Search results must be as script agnostic as possible.
>>
>> There is more than one way to write each word. "Simplified" characters
>> were
>> emphasized for printed materials in mainland China starting in the 1950s;
>> "Traditional" characters were used in printed materials prior to the
>> 1950s,
>> and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written
>> in two scripts.
>> Another way to think about it:  every written Chinese word has at least
>> two
>> completely different spellings.  And it can be mix-n-match:  a word can be
>> written with one traditional  and one simplified character.
>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>> characters for old fiction)"
>>
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>>
>> A. 舊小說
>>
>> can also be
>>
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小 说 (first character traditional, last character simplified)
>>
>> Thankfully the middle character was never simplified in recent times.
>>
>> From a historical standpoint, the mixed nature of the characters in the
>> same word/phrase is because not all simplified characters were adopted at
>> the same time by everyone uniformly (good times...).
>>
>> The problem seems to be that Solr can easily handle A or B above, but NOT
>> C
>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>> that at this point... maybe I should figure out how to contact the
>> creators
>> of the analyzer and ask them?
>>
>> Amanda
>>
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>>
>>
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>>
>> > This is probably your start, if not read already:
>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>> >
>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>> > (mentioned on the same page above)
>> > Specifically, transformations:
>> > http://userguide.icu-project.org/transforms/general
>> >
>> > With that, maybe you map both alphabets into latin. I did that once
>> > for Thai for a demo:
>> > https://github.com/arafalov/solr-thai-test/blob/master/
>> > collection1/conf/schema.xml#L34
>> >
>> > The challenge is to figure out all the magic rules for that. You'd
>> > have to dig through the ICU documentation and other web pages. I found
>> > this one for example:
>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>> > transliterators-available-with-icu4j.html;jsessionid=
>> > BEAB0AF05A588B97B8A2393054D908C0
>> >
>> > There is also 12 part series on Solr and Asian text processing, though
>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>> >
>> > Hope one of these things help.
>> >
>> > Regards,
>> >Alex.
>> >
>> >
>> > On 20 July 2018 at 03:54, Amanda Shuman 
>> wrote:
>> > > Hi all,
>> > >
>> > > We have a problem. Some of our historical documents have mixed
>> together
>> > > simplified and Chinese characters. There seems to be no problem when
>> > > searching either traditional or simplified separately - that is, if a
>> > > particular string/phrase is all in traditional or simplified, it finds
>> > it -
>> > > but it does not find the string/phrase if the two different characters
>> > (one
>> > > traditional, one simplified) are mixed together in the SAME
>> > string/phrase.
>> > >
>> > > Has anyone ever handled this problem before? I know some libraries
>> seem
>> > to
>> > > have implemented something that seems to be able to handle this, but
>> I'm
>> > > not sure how they did so!
>> > >
>> > > Amanda
>> > > --
>> > > Dr. Amanda Shuman
>> > >

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
Have you tried to use CJKFoldingFilter
https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
cover your use case but I am using this filter and so far no issues.

Thnx

On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> basically what is laid out in the old blogspot post, namely this point:
>
>
> "Why approach CJK resource discovery differently?
>
> 2.  Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified" characters were
> emphasized for printed materials in mainland China starting in the 1950s;
> "Traditional" characters were used in printed materials prior to the 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are written
> in two scripts.
> Another way to think about it:  every written Chinese word has at least two
> completely different spellings.  And it can be mix-n-match:  a word can be
> written with one traditional  and one simplified character.
> Example:   Given a user query 舊小說  (traditional for old fiction), the
> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder things like:
>
> A. 舊小說
>
> can also be
>
> B. 旧小说 (all simplified)
> or
> C. 旧小說 (first character simplified, last character traditional)
> or
> D. 舊小 说 (first character traditional, last character simplified)
>
> Thankfully the middle character was never simplified in recent times.
>
> From a historical standpoint, the mixed nature of the characters in the
> same word/phrase is because not all simplified characters were adopted at
> the same time by everyone uniformly (good times...).
>
> The problem seems to be that Solr can easily handle A or B above, but NOT C
> or D using the Smart Chinese analyzer. I'm not really sure how to change
> that at this point... maybe I should figure out how to contact the creators
> of the analyzer and ask them?
>
> Amanda
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch  >
> wrote:
>
> > This is probably your start, if not read already:
> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >
> > Otherwise, I think your answer would be somewhere around using ICU4J,
> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> > (mentioned on the same page above)
> > Specifically, transformations:
> > http://userguide.icu-project.org/transforms/general
> >
> > With that, maybe you map both alphabets into latin. I did that once
> > for Thai for a demo:
> > https://github.com/arafalov/solr-thai-test/blob/master/
> > collection1/conf/schema.xml#L34
> >
> > The challenge is to figure out all the magic rules for that. You'd
> > have to dig through the ICU documentation and other web pages. I found
> > this one for example:
> > http://avajava.com/tutorials/lessons/what-are-the-system-
> > transliterators-available-with-icu4j.html;jsessionid=
> > BEAB0AF05A588B97B8A2393054D908C0
> >
> > There is also 12 part series on Solr and Asian text processing, though
> > it is a bit old now: http://discovery-grindstone.blogspot.com/
> >
> > Hope one of these things help.
> >
> > Regards,
> >Alex.
> >
> >
> > On 20 July 2018 at 03:54, Amanda Shuman  wrote:
> > > Hi all,
> > >
> > > We have a problem. Some of our historical documents have mixed together
> > > simplified and Chinese characters. There seems to be no problem when
> > > searching either traditional or simplified separately - that is, if a
> > > particular string/phrase is all in traditional or simplified, it finds
> > it -
> > > but it does not find the string/phrase if the two different characters
> > (one
> > > traditional, one simplified) are mixed together in the SAME
> > string/phrase.
> > >
> > > Has anyone ever handled this problem before? I know some libraries seem
> > to
> > > have implemented something that seems to be able to handle this, but
> I'm
> > > not sure how they did so!
> > >
> > > Amanda
> > > --
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > 
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925
> >
>


Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks, Alex - I have seen a few of those links but never considered
transliteration! We use lucene's Smart Chinese analyzer. The issue is
basically what is laid out in the old blogspot post, namely this point:


"Why approach CJK resource discovery differently?

2.  Search results must be as script agnostic as possible.

There is more than one way to write each word. "Simplified" characters were
emphasized for printed materials in mainland China starting in the 1950s;
"Traditional" characters were used in printed materials prior to the 1950s,
and are still used in Taiwan, Hong Kong and Macau today.
Since the characters are distinct, it's as if Chinese materials are written
in two scripts.
Another way to think about it:  every written Chinese word has at least two
completely different spellings.  And it can be mix-n-match:  a word can be
written with one traditional  and one simplified character.
Example:   Given a user query 舊小說  (traditional for old fiction), the
results should include matches for 舊小說 (traditional) and 旧小说 (simplified
characters for old fiction)"

So, using the example provided above, we are dealing with materials
produced in the 1950s-1970s that do even weirder things like:

A. 舊小說

can also be

B. 旧小说 (all simplified)
or
C. 旧小說 (first character simplified, last character traditional)
or
D. 舊小 说 (first character traditional, last character simplified)

Thankfully the middle character was never simplified in recent times.

From a historical standpoint, the mixed nature of the characters in the
same word/phrase is because not all simplified characters were adopted at
the same time by everyone uniformly (good times...).

The problem seems to be that Solr can easily handle A or B above, but NOT C
or D using the Smart Chinese analyzer. I'm not really sure how to change
that at this point... maybe I should figure out how to contact the creators
of the analyzer and ask them?

Amanda

--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch 
wrote:

> This is probably your start, if not read already:
> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>
> Otherwise, I think your answer would be somewhere around using ICU4J,
> IBM's library for dealing with Unicode: http://site.icu-project.org/
> (mentioned on the same page above)
> Specifically, transformations:
> http://userguide.icu-project.org/transforms/general
>
> With that, maybe you map both alphabets into latin. I did that once
> for Thai for a demo:
> https://github.com/arafalov/solr-thai-test/blob/master/
> collection1/conf/schema.xml#L34
>
> The challenge is to figure out all the magic rules for that. You'd
> have to dig through the ICU documentation and other web pages. I found
> this one for example:
> http://avajava.com/tutorials/lessons/what-are-the-system-
> transliterators-available-with-icu4j.html;jsessionid=
> BEAB0AF05A588B97B8A2393054D908C0
>
> There is also 12 part series on Solr and Asian text processing, though
> it is a bit old now: http://discovery-grindstone.blogspot.com/
>
> Hope one of these things help.
>
> Regards,
>Alex.
>
>
> On 20 July 2018 at 03:54, Amanda Shuman  wrote:
> > Hi all,
> >
> > We have a problem. Some of our historical documents have mixed together
> > simplified and Chinese characters. There seems to be no problem when
> > searching either traditional or simplified separately - that is, if a
> > particular string/phrase is all in traditional or simplified, it finds
> it -
> > but it does not find the string/phrase if the two different characters
> (one
> > traditional, one simplified) are mixed together in the SAME
> string/phrase.
> >
> > Has anyone ever handled this problem before? I know some libraries seem
> to
> > have implemented something that seems to be able to handle this, but I'm
> > not sure how they did so!
> >
> > Amanda
> > --
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > 
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
>


Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
This is probably your start, if not read already:
https://lucene.apache.org/solr/guide/7_4/language-analysis.html

Otherwise, I think your answer would be somewhere around using ICU4J,
IBM's library for dealing with Unicode: http://site.icu-project.org/
(mentioned on the same page above)
Specifically, transformations:
http://userguide.icu-project.org/transforms/general

With that, maybe you map both alphabets into latin. I did that once
for Thai for a demo:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34

The challenge is to figure out all the magic rules for that. You'd
have to dig through the ICU documentation and other web pages. I found
this one for example:
http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0

There is also 12 part series on Solr and Asian text processing, though
it is a bit old now: http://discovery-grindstone.blogspot.com/

Hope one of these things help.

Regards,
   Alex.


On 20 July 2018 at 03:54, Amanda Shuman  wrote:
> Hi all,
>
> We have a problem. Some of our historical documents have mixed together
> simplified and Chinese characters. There seems to be no problem when
> searching either traditional or simplified separately - that is, if a
> particular string/phrase is all in traditional or simplified, it finds it -
> but it does not find the string/phrase if the two different characters (one
> traditional, one simplified) are mixed together in the SAME string/phrase.
>
> Has anyone ever handled this problem before? I know some libraries seem to
> have implemented something that seems to be able to handle this, but I'm
> not sure how they did so!
>
> Amanda
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925


Re: Sorting issue while using collection parameter

2018-07-20 Thread Vijay Tiwary
Hello Erick

We are using a string field, and the data is stored in lower case at indexing
time. We have an alias set up to query multiple collections simultaneously:
alias=collection1,collection2
If we query through the alias, sorting is broken. For example, the results of
a descending sort are as follows (empty lines are documents with no value for
the field on which sorting is applied). It seems there is an issue in Solr
when aggregating the results of the individual shard responses:
d
d
d


b
b
b
c
c

b
b



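(For reference, a sketch of the copyField + alphaOnlySort approach Erick
describes below; sortMissingLast makes documents without a title sort last.
The type definition follows the stock example schema:)

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>
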
On Fri, 29 Jun 2018, 9:16 pm Erick Erickson, 
wrote:

> What _is_ your expectation? You haven't provided any examples of what
> your input and expectations _are_.
>
> You might review: https://wiki.apache.org/solr/UsingMailingLists
>
> string types are case-sensitive for instance, so that's one thing that
> could be happening. You
> can also specify sortMissingFirst/Last to determine where docs with
> missing fields appear in the results.
>
> Best,
> Erick
>
> On Fri, Jun 29, 2018 at 3:13 AM, Vijay Tiwary 
> wrote:
> > Hello Eric,
> >
> > title is a string field
> >
> > On Wed, 27 Jun 2018, 9:21 pm Erick Erickson, 
> > wrote:
> >
> >> what kind of field is title? text_general or something? Sorting on a
> >> tokenized field is usually something you don't want to do. If a field
> >> has aardvard and zebra, how would it sort?
> >>
> >> There's usually something like alphaOnlySort. People often copyField
> >> from "title" to "title_sort" and search on "title" and sort on
> >> title_sort.
> >>
> >> alphaOnlySort uses KeywordTokenizer and LowercaseFilterFactory.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Jun 27, 2018 at 12:45 AM, Vijay Tiwary <
> vijaykr.tiw...@gmail.com>
> >> wrote:
> >> > Hello Team,
> >> >
> >> > I have multiple collection on solr (5.4.1) cloud based on year
> >> > content2107
> >> > content2018
> >> >
> >> > Also I have a collection "content" which does not have any data.
> >> >
> >> > Now if I query them as follows
> >> > http://host:port/solr/content/select?q=*:*&collection=content2107,
> >> > content2108&sort=title
> >> > asc
> >> >
> >> > Where title is string field then results are not getting sorted as per
> >> the
> >> > expectation. Also note value for title is not present for some
> documents.
> >> >
> >> > Please help.
> >>
>


Re: Need an advice for architecture.

2018-07-20 Thread servus01
Well, thanks a lot. 


Chris Hostetter-3 wrote
> The first question i have is why you are using a version of Solr that's 
> almost 5 years old.

*Well, Solr is part of another piece of software and is integrated at this
version. With the next update they will also update Solr to ver. 7...*


Chris Hostetter-3 wrote
> The second question you should consider is what your indexing process 
> looks like, and whether it's multithreaded or not, and if the bottleneck 
> is your network/DB. 

*Digging deeper into the system shows that SQL is the bottleneck. Besides
Solr, around 25 applications access the DB (110GB), driving DB memory [32GB]
and disk access [SAS RAID] to 100% load.
The main problem is getting data out of the DB as fast as possible. We run
into other problems due to these circumstances: the API agent tries to send a
batch of 25 elements at once to Solr, but already hits a timeout fetching all
the associated fields for that batch from SQL. After a failure, the batch of
25 is retried as 12 > 6 > 3 > 2 > 1. This ends up at roughly 1 document every
7 minutes. :(*

So at this time the DB admin has to do his work first.

Really appreciate your thoughts on this.

kindest regards

Francois






--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Hi all,

We have a problem. Some of our historical documents have mixed together
simplified and traditional Chinese characters. There seems to be no problem when
searching either traditional or simplified separately - that is, if a
particular string/phrase is all in traditional or simplified, it finds it -
but it does not find the string/phrase if the two different characters (one
traditional, one simplified) are mixed together in the SAME string/phrase.

Has anyone ever handled this problem before? I know some libraries seem to
have implemented something that seems to be able to handle this, but I'm
not sure how they did so!

Amanda
--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925