Re: Document Update performances Improvement

2019-10-23 Thread Jörn Franke
Well, coalesce does require shuffle and network; however, in most cases it is 
less than repartition, as it moves the data (through the network) to already 
existing executors.
However, as you see and others confirm: for high performance you don't need high 
parallelism on the ingestion side; you can load the data in batches with a 
low parallelism. Tuning some parameters (commit interval, merge segment 
size) can, but only if needed, deliver even more performance. If you then still 
need more performance you can increase the number of Solr nodes and shards.

> On 23 Oct 2019, at 22:01, Nicolas Paris wrote:
> 
> 
>> 
>> With Spark-Solr additional complexity comes. You could have too many
>> executors for your Solr instance(s), ie a too high parallelism.
> 
> I have been reducing the parallelism of spark-solr part by 5. I had 40
> executors loading 4 shards. Right now only 8 executors loading 4 shards.
> As a result, I can see a 10 times update improvement, and I suspect the
> update process had been overwhelmed by spark.
> 
> I have been able to keep 40 executors for document preprocessing and
> reduce to 8 executors within the same spark job by using the
> "dataframe.coalesce" feature, which does not shuffle the data at all and
> keeps both the spark cluster and solr quiet in terms of network.
> 
> Thanks
> 
>> On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
>> Maybe you need to give more details. I recommend always to try and test 
>> yourself as you know your own solution best. Depending on your spark process 
>> atomic updates  could be faster.
>> 
>> With Spark-Solr additional complexity comes. You could have too many 
>> executors for your Solr instance(s), ie a too high parallelism.
>> 
>> Probably the most important question is:
>> What performance does your use case need and what is your current performance?
>> 
>> Once this is clear further architecture aspects can be derived, such as 
>> number of spark executors, number of Solr instances, sharding, replication, 
>> commit timing etc.
>> 
 On 19 Oct 2019, at 21:52, Nicolas Paris wrote:
>>> 
>>> Hi community,
>>> 
>>> Any advice to speed up updates?
>>> Is there any advice on commit, memory, docvalues, stored, or any tips to
>>> speed things up?
>>> 
>>> Thanks
>>> 
>>> 
 On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
 Hi
 
 I am looking for a way to speed up the update of documents.
 
 In my context, the update replaces one of the many existing indexed
 fields, and keeps the others as is.
 
 Right now, I am building the whole document, and replacing the existing
 one by id.
 
 I am wondering if the **atomic update feature** would speed up the process.
 
 On one hand, using this feature would save network because only a
 small subset of the document would be sent from the client to the
 server. 
 On the other hand, the server will have to collect the values from the
 disk and reindex them. In addition, this implies storing the values for
 every field (I am not storing every field) and using more space.
 
 Also I have read that the ConcurrentUpdateSolrServer class might be an
 optimized way of updating documents.
 
 I am using the spark-solr library to deal with SolrCloud. If something
 exists to speed up the process, I would be glad to implement it in that
 library.
 Also, I have split the collection over multiple shards, and I admit this
 speeds up the update process, but who knows?
 
 Thoughts ?
 
 -- 
 nicolas
 
>>> 
>>> -- 
>>> nicolas
>> 
> 
> -- 
> nicolas


Re: tlogs are not deleted

2019-10-23 Thread Erick Erickson
Why it’s enabled by default: Really it shouldn’t be. Raise a JIRA?

Why it’s there in the first place: It’s a leftover from before there was the 
“full sync” capability so you could intentionally queue up the updates while 
performing maintenance on the target cluster.

Not great reasons, but….

> On Oct 23, 2019, at 1:41 PM, Webster Homer  
> wrote:
> 
> Tlogs will accumulate if you have buffers "enabled". Make sure that you 
> explicitly disable buffering from the cdcr endpoint
> https://lucene.apache.org/solr/guide/7_7/cdcr-api.html#disablebuffer
> Make sure that they're disabled on both the source and targets
> 
> I believe that sometimes buffers get enabled on their own. We added 
> monitoring of CDCR to check for the buffer setting
> This endpoint shows you the status
> https://lucene.apache.org/solr/guide/7_7/cdcr-api.html#cdcr-status-example
> 
> I don't understand the use case for enabling  buffers, or why it is enabled 
> by default.
> 
> -Original Message-
> From: Erick Erickson 
> Sent: Wednesday, October 23, 2019 7:23 AM
> To: solr-user@lucene.apache.org
> Subject: Re: tlogs are not deleted
> 
> My first guess is that your CDCR setup isn’t running. CDCR uses tlogs as a 
> queueing mechanism. If CDCR can’t send docs to the target collection, they’ll 
> accumulate forever.
> 
> Best,
> Erick
> 
>> On Oct 22, 2019, at 7:48 PM, Woo Choi  wrote:
>> 
>> Hi,
>> 
>> We are using solr 7.7 cloud with CDCR(every collection has 3 replicas,
>> 1 shard).
>> 
>> In solrconfig.xml,
>> 
>> tlog configuration is super simple like : 
>> 
>> There is also daily data import and commit is called after data import
>> every time.
>> 
>> Indexing works fine, but the problem is that the number of tlogs keeps
>> growing.
>> 
>> According to the documentation here
>> (https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html),
>> I expected at most 10 tlogs to remain (the default value of
>> maxNumLogsToKeep=10).
>> 
>> However I still have a bunch of tlogs - the oldest one is Sep 6..!
>> 
>> I did an experiment by running a data import with the commit option from
>> the solr admin ui, but none of the tlogs were deleted.
>> 
>> tlog.002.1643995079881261056
>> tlog.018.1645444642733293568
>> tlog.034.1646803619099443200
>> tlog.003.1644085718240198656
>> tlog.019.1645535304072822784
>> tlog.035.1646894195509559296
>> tlog.004.1644176284537847808
>> tlog.020.1645625651261079552
>> tlog.036.1646984623121498112
>> tlog.005.1644357373324689408
>> tlog.021.1645625651316654083
>> tlog.037.1647076244416626688
>> tlog.006.167899616018432
>> tlog.022.1645716477747134464
>> tlog.038.1647165801017376768
>> tlog.007.1644538486210953216
>> tlog.023.1645806853961023488
>> tlog.039.1647165801042542594
>> tlog.008.1644629084296183808
>> tlog.024.1645897663703416832
>> tlog.040.1647256590865137664
>> tlog.009.1644719895268556800
>> tlog.025.1645988248838733824
>> tlog.041.1647347172490870784
>> tlog.010.1644810493331767296
>> tlog.026.1646078905702940672
>> tlog.042.1647437758859313152
>> tlog.011.1644901113324896256
>> tlog.027.1646169478772293632
>> tlog.043.1647528345005457408
>> tlog.012.1645031030684385280
>> tlog.028.1646259838395613184
>> tlog.044.1647618793025830912
>> tlog.013.164503103008545
>> tlog.029.1646350429145006080
>> tlog.045.1647709579019026432
>> tlog.014.1645082080252526592
>> tlog.030.1646441456502571008
>> tlog.046.1647890587519549440
>> tlog.015.1645172929206419456
>> tlog.031.1646531802044563456
>> tlog.047.1647981403286011904
>> tlog.016.1645263488829882368
>> tlog.032.16466061568
>> tlog.048.1648071989042085888
>> tlog.017.1645353861842468864
>> tlog.033.1646712822719053824
>> tlog.049.1648135546466205696
>> 
>> Did I miss something in the solrconfig file?
>> 
>> 
>> 
>> --
>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 

Re: Document Update performances Improvement

2019-10-23 Thread Erick Erickson
My first question is always “what’s the bottleneck”? Unless you’re driving your 
CPUs and/or I/O hard on Solr, the bottleneck is in the acquisition of the docs 
not on the Solr side.

Also, be sure and batch in groups of at least 10x the number of shards, see: 
https://lucidworks.com/post/really-batch-updates-solr-2/
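
For what it's worth, batching just means sending many documents per update
request, whatever client you use. In Solr's XML update format (the field names
below are made up) a batch is simply several <doc> elements inside one <add>:

  <add>
    <doc>
      <field name="id">1</field>
      <field name="title_txt">first doc</field>
    </doc>
    <doc>
      <field name="id">2</field>
      <field name="title_txt">second doc</field>
    </doc>
    <!-- ...and so on, ideally at least 10x the shard count per request -->
  </add>

SolrJ and spark-solr effectively do the same thing under the hood when you hand
them a batch of documents rather than one at a time.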

Although it sounds like you’ve figured this out already…. And yeah, I’ve seen 
Solr indexing degrade when it’s being overwhelmed, so that might be the total 
issue.

Best,
Erick

> On Oct 23, 2019, at 9:49 AM, Shawn Heisey  wrote:
> 
> On 10/22/2019 1:12 PM, Nicolas Paris wrote:
>>> We, at Auto-Suggest, also do atomic updates daily and specifically
>>> changing merge factor gave us a boost of ~4x
>> Interesting. What kind of change exactly on the merge factor side ?
> 
> The mergeFactor setting is deprecated.  Instead, use maxMergeAtOnce, 
> segmentsPerTier, and a setting that is not mentioned in the ref guide -- 
> maxMergeAtOnceExplicit.
> 
> Set the first two to the same number, and the third to a minimum of three 
> times what you set the other two.
> 
> The default setting for maxMergeAtOnce and segmentsPerTier is 10, with 30 for 
> maxMergeAtOnceExplicit.  When you're trying to increase indexing speed and 
> you think segment merging is interfering, you want to increase these values 
> to something larger.  Note that increasing these values will increase the 
> number of files that your Solr install keeps open.
> 
> https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory
> 
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to 
> 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less 
> frequently.
> 
>> Would you say atomical update is faster than regular replacement of
>> documents ? (considering my first thought on this below)
> 
> On the Solr side, atomic updates will be slightly slower than indexing the 
> whole document provided to Solr.  When an atomic update is done, Solr will 
> find the existing document, then combine what's in that document with the 
> changes you specify using the atomic update, and then index the whole 
> combined document as a new document that replaces the original.
> 
> Whether or not atomic updates are faster or slower in practice than indexing 
> the whole document will depend on how your source systems work, and that is 
> not something we can know.  If Solr can access the previous document faster 
> than you can get the document from your source system, then atomic updates 
> might be faster.
> 
> Thanks,
> Shawn



Re: regarding Extracting text from Images

2019-10-23 Thread Erick Erickson
Here’s a blog about why and how to use Tika outside Solr (and an RDBMS too, but 
you can pull that part out pretty easily):
https://lucidworks.com/post/indexing-with-solrj/



> On Oct 23, 2019, at 7:16 PM, Alexandre Rafalovitch  wrote:
> 
> Again, I think you are best to do it out of Solr.
> 
> But even if you want to get it to work in Solr, I think you start by
> getting it to work directly in Tika. Then, get the missing libraries and
> configuration into Solr.
> 
> Regards,
>Alex
> 
> On Wed, Oct 23, 2019, 7:08 PM suresh pendap,  wrote:
> 
>> Hi Alex,
>> Thanks for your reply. How do we integrate tesseract with Solr?  Do we have
>> to implement Custom update processor or extend the
>> ExtractingRequestProcessor?
>> 
>> Regards
>> Suresh
>> 
>> On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch >> 
>> wrote:
>> 
>>> I believe Tika that powers this can do so with extra libraries
>> (tesseract?)
>>> But Solr does not bundle those extras.
>>> 
>>> In any case, you may want to run Tika externally to avoid the
>>> conversion/extraction process be a burden to Solr itself.
>>> 
>>> Regards,
>>> Alex
>>> 
>>> On Wed, Oct 23, 2019, 1:58 PM suresh pendap, 
>>> wrote:
>>> 
 Hello,
 I am reading the Solr documentation about integration with Tika and
>> Solr
 Cell framework over here
 
 
>>> 
>> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
 
 I would like to know if the can Solr Cell framework also be used to
>>> extract
 text from the image files?
 
 Regards
 Suresh
 
>>> 
>> 



Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
Again, I think you are best to do it out of Solr.

But even if you want to get it to work in Solr, I think you start by
getting it to work directly in Tika. Then, get the missing libraries and
configuration into Solr.

Regards,
Alex

On Wed, Oct 23, 2019, 7:08 PM suresh pendap,  wrote:

> Hi Alex,
> Thanks for your reply. How do we integrate tesseract with Solr?  Do we have
> to implement Custom update processor or extend the
> ExtractingRequestProcessor?
>
> Regards
> Suresh
>
> On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch  >
> wrote:
>
> > I believe Tika that powers this can do so with extra libraries
> (tesseract?)
> > But Solr does not bundle those extras.
> >
> > In any case, you may want to run Tika externally to avoid the
> > conversion/extraction process be a burden to Solr itself.
> >
> > Regards,
> >  Alex
> >
> > On Wed, Oct 23, 2019, 1:58 PM suresh pendap, 
> > wrote:
> >
> > > Hello,
> > > I am reading the Solr documentation about integration with Tika and
> Solr
> > > Cell framework over here
> > >
> > >
> >
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
> > >
> > > I would like to know if the can Solr Cell framework also be used to
> > extract
> > > text from the image files?
> > >
> > > Regards
> > > Suresh
> > >
> >
>


Re: regarding Extracting text from Images

2019-10-23 Thread suresh pendap
Hi Alex,
Thanks for your reply. How do we integrate tesseract with Solr?  Do we have
to implement Custom update processor or extend the
ExtractingRequestProcessor?

Regards
Suresh

On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch 
wrote:

> I believe Tika that powers this can do so with extra libraries (tesseract?)
> But Solr does not bundle those extras.
>
> In any case, you may want to run Tika externally to avoid the
> conversion/extraction process be a burden to Solr itself.
>
> Regards,
>  Alex
>
> On Wed, Oct 23, 2019, 1:58 PM suresh pendap, 
> wrote:
>
> > Hello,
> > I am reading the Solr documentation about integration with Tika and Solr
> > Cell framework over here
> >
> >
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
> >
> > I would like to know if the can Solr Cell framework also be used to
> extract
> > text from the image files?
> >
> > Regards
> > Suresh
> >
>


How to update range of dynamic fields in Solr

2019-10-23 Thread Arnold Bronley
Here is the detailed question in stack-overflow. Please help.

https://stackoverflow.com/questions/14280506/how-to-update-range-of-dynamic-fields-in-solr-4


Re: copyField - why source should contain * when dest contains *?

2019-10-23 Thread Chris Hostetter


: Documentation says that we can copy multiple fields using wildcard to one
: or more than one fields.

correct ... the limitation is in the syntax and the ambiguity that would 
be unresolvable if you had a wildcard in the dest but not in the source.  

the wildcard is essentially a variable.  if you have...

   source="foo" desc="*_bar"

...then solr has no idea what full field name to use as the destination 
when it sees values in a field "foo" ... should it be "1_bar" ? 
"aaa_bar" ? ... "z_bar" ? all three?

: Yes, that's what hit me initially. But, "*_x" while indexing (in XMLs)
: doesn't mean anything, right? It's only used in dynamicFields while
: defining schema to let Solr know that we would have some undeclared fields

use of wildcards in copyField is not constrained to only 
using dynamicFields, this would be a perfectly valid copyField using 
wildcards, even if these are the only fields in the schema, and it had 
no dynamicFields at all...
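
For illustration only (the field names and the text_general/string types here
are made up, a sketch rather than the exact schema), something like this is
valid even when every field is explicitly declared:

    <field name="title_txt" type="text_general" indexed="true" stored="true"/>
    <field name="body_txt"  type="text_general" indexed="true" stored="true"/>
    <field name="title_str" type="string" indexed="true" stored="true"/>
    <field name="body_str"  type="string" indexed="true" stored="true"/>

    <copyField source="*_txt" dest="*_str"/>

...where whatever is indexed into "title_txt" also gets copied to "title_str",
and "body_txt" to "body_str".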

: having names like this. Also, according to the documentation, we can have
: dest="*_x" when source="*_x" if I'm right. In this case, there's support
: for multiple destinations when there are multiple source.

correct.  there is support for copying from one field to another 
via a *MAPPING* -- so a single copyField declaration can go from multiple 
sources to multiple destinations, but using a wildcard in the dest
only works with a one-to-one mapping when the wildcard also exists in the 
source.

on the flip side however, you can have a many-to-one mapping by using a 
wildcard *only* in the source
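
Again as a sketch with made-up names, a wildcard only in the source copies
many fields into a single destination:

    <field name="title_txt" type="text_general" indexed="true" stored="true"/>
    <field name="body_txt"  type="text_general" indexed="true" stored="true"/>
    <field name="catchall"  type="text_general" indexed="true" stored="true" multiValued="true"/>

    <copyField source="*_txt" dest="catchall"/>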



-Hoss
http://www.lucidworks.com/


Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
> With Spark-Solr additional complexity comes. You could have too many
> executors for your Solr instance(s), ie a too high parallelism.

I have been reducing the parallelism of spark-solr part by 5. I had 40
executors loading 4 shards. Right now only 8 executors loading 4 shards.
As a result, I can see a 10 times update improvement, and I suspect the
update process had been overwhelmed by spark.

I have been able to keep 40 executors for document preprocessing and
reduce to 8 executors within the same spark job by using the
"dataframe.coalesce" feature, which does not shuffle the data at all and
keeps both the spark cluster and solr quiet in terms of network.

Thanks

On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> Maybe you need to give more details. I recommend always to try and test 
> yourself as you know your own solution best. Depending on your spark process 
> atomic updates  could be faster.
> 
> With Spark-Solr additional complexity comes. You could have too many 
> executors for your Solr instance(s), ie a too high parallelism.
> 
> Probably the most important question is:
> What performance does your use case need and what is your current performance?
> 
> Once this is clear further architecture aspects can be derived, such as 
> number of spark executors, number of Solr instances, sharding, replication, 
> commit timing etc.
> 
> > On 19 Oct 2019, at 21:52, Nicolas Paris wrote:
> > 
> > Hi community,
> > 
> > Any advice to speed up updates?
> > Is there any advice on commit, memory, docvalues, stored, or any tips to
> > speed things up?
> > 
> > Thanks
> > 
> > 
> >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> >> Hi
> >> 
> >> I am looking for a way to speed up the update of documents.
> >> 
> >> In my context, the update replaces one of the many existing indexed
> >> fields, and keeps the others as is.
> >> 
> >> Right now, I am building the whole document, and replacing the existing
> >> one by id.
> >> 
> >> I am wondering if the **atomic update feature** would speed up the process.
> >> 
> >> On one hand, using this feature would save network because only a
> >> small subset of the document would be sent from the client to the
> >> server. 
> >> On the other hand, the server will have to collect the values from the
> >> disk and reindex them. In addition, this implies storing the values for
> >> every field (I am not storing every field) and using more space.
> >> 
> >> Also I have read that the ConcurrentUpdateSolrServer class might be an
> >> optimized way of updating documents.
> >> 
> >> I am using the spark-solr library to deal with SolrCloud. If something
> >> exists to speed up the process, I would be glad to implement it in that
> >> library.
> >> Also, I have split the collection over multiple shards, and I admit this
> >> speeds up the update process, but who knows?
> >> 
> >> Thoughts ?
> >> 
> >> -- 
> >> nicolas
> >> 
> > 
> > -- 
> > nicolas
> 

-- 
nicolas


Re: WordDelimiter in extended way.

2019-10-23 Thread servus01
got it, thank you



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
> Set the first two to the same number, and the third to a minimum of three
> times what you set the other two.
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less
> frequently.

Good to know the chief recipes.

> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr. 

This makes sense.

> If Solr can access the previous document faster than you can get the
> document from your source system, then atomic updates might be faster.

The documents are stored within parquet files without any processing
needed. In this case, the atomic update is not likely to speed things up.


Thanks

On Wed, Oct 23, 2019 at 07:49:44AM -0600, Shawn Heisey wrote:
> On 10/22/2019 1:12 PM, Nicolas Paris wrote:
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> > 
> > Interesting. What kind of change exactly on the merge factor side ?
> 
> The mergeFactor setting is deprecated.  Instead, use maxMergeAtOnce,
> segmentsPerTier, and a setting that is not mentioned in the ref guide --
> maxMergeAtOnceExplicit.
> 
> Set the first two to the same number, and the third to a minimum of three
> times what you set the other two.
> 
> The default setting for maxMergeAtOnce and segmentsPerTier is 10, with 30
> for maxMergeAtOnceExplicit.  When you're trying to increase indexing speed
> and you think segment merging is interfering, you want to increase these
> values to something larger.  Note that increasing these values will increase
> the number of files that your Solr install keeps open.
> 
> https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory
> 
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less
> frequently.
> 
> > Would you say atomical update is faster than regular replacement of
> > documents ? (considering my first thought on this below)
> 
> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr.  When an atomic update is done, Solr will
> find the existing document, then combine what's in that document with the
> changes you specify using the atomic update, and then index the whole
> combined document as a new document that replaces the original.
> 
> Whether or not atomic updates are faster or slower in practice than indexing
> the whole document will depend on how your source systems work, and that is
> not something we can know.  If Solr can access the previous document faster
> than you can get the document from your source system, then atomic updates
> might be faster.
> 
> Thanks,
> Shawn
> 

-- 
nicolas


Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris

Thanks for those relevant pointers and the explanation.

> How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and consider increasing it. 

I guess I do not use any XML under the hood: spark-solr uses solrj, which
serializes the documents as java binary objects. However the commit
strategy applies too; I have set up 20,000 documents or 20,000 ms.
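
As a sketch, that commit strategy corresponds to something like the following
autoCommit block in solrconfig.xml (whether it is set server-side or sent as
commitWithin from the client, the effect is similar); the 5 minute soft commit
is an assumption for search visibility, not something already configured here:

  <autoCommit>
    <maxDocs>20000</maxDocs>
    <maxTime>20000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime> <!-- assumed: make updates searchable within 5 minutes -->
  </autoSoftCommit>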

> Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't do soft commits then.

Indeed I'd like the documents to be accessible sooner. That being said, a
5 minute delay is acceptable.

> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> index to get over 60 GBs. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB!

I guess I get the idea: "put the dollars as fast as possible in the bag,
we will clean-up when back home"

Thanks

On Wed, Oct 23, 2019 at 11:34:44AM +0530, Paras Lehana wrote:
> Hi Nicolas,
> 
> What kind of change exactly on the merge factor side ?
> 
> 
> We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This will
> make Solr to merge segments less frequently after many index updates. Yes,
> you need to find the sweet spot here but do try increasing these values
> from the default ones. I strongly recommend you to give a 2 min read to this
> .
> Do note that increasing these values will require you to have larger
> physical storage until segments merge.
> 
> Besides this, do review your autoCommit config
> 
> or the frequency of your hard commits. In our case, we don't want real time
> updates - so we can always commit less frequently. This makes indexing
> faster. How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and consider increasing it. Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't do soft commits then.
> 
> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> index to get over 60 GBs. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB! Since we can trade off
> space temporarily for increased indexing speed, we are still committed to
> find sweeter spots for faster indexing. For statistics purpose, we have
> over 250 million documents for indexing that converges to 60 million unique
> documents after atomic updates (full indexing).
> 
> 
> 
> > Would you say atomical update is faster than regular replacement of
> > documents?
> 
> 
> No, I don't say that. Either of the two configs (autoCommit, Merge Policy)
> will impact regular indexing too. In our case, non-atomic indexing is out
> of question.
> 
> On Wed, 23 Oct 2019 at 00:43, Nicolas Paris 
> wrote:
> 
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> >
> > Interesting. What kind of change exactly on the merge factor side ?
> >
> >
> > > At current configuration, our core atomically updates ~423 documents
> > > per second.
> >
> > Would you say atomical update is faster than regular replacement of
> > documents ? (considering my first thought on this below)
> >
> > > > I am wondering if **atomic update feature** would faster the process.
> > > > From one hand, using this feature would save network because only a
> > > > small subset of the document would be send from the client to the
> > > > server.
> > > > On the other hand, the server will have to collect the values from the
> > > > disk and reindex them. In addition, this implies to store the values
> > > > every fields (I am not storing every fields) and use more space.
> >
> >
> > Thanks Paras
> >
> >
> >
> > On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:
> > > Hi Nicolas,
> > >
> > > Have you tried playing with values of *IndexConfig*
> > >  > >
> > > (merge factor, segment size, maxBufferedDocs, Merge Policies)? We, at
> > > Auto-Suggest, also do atomic updates daily and specifically changing
> > merge
> > > factor gave us a boost of ~4x during indexing. At current configuration,
> > > our core atomically updates 

Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
I believe Tika that powers this can do so with extra libraries (tesseract?)
But Solr does not bundle those extras.

In any case, you may want to run Tika externally to avoid the
conversion/extraction process being a burden to Solr itself.

Regards,
 Alex

On Wed, Oct 23, 2019, 1:58 PM suresh pendap,  wrote:

> Hello,
> I am reading the Solr documentation about integration with Tika and Solr
> Cell framework over here
>
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>
> I would like to know if the Solr Cell framework can also be used to extract
> text from the image files?
>
> Regards
> Suresh
>


regarding Extracting text from Images

2019-10-23 Thread suresh pendap
Hello,
I am reading the Solr documentation about integration with Tika and Solr
Cell framework over here
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html

I would like to know if the Solr Cell framework can also be used to extract
text from the image files?

Regards
Suresh


RE: tlogs are not deleted

2019-10-23 Thread Webster Homer
Tlogs will accumulate if you have buffers "enabled". Make sure that you 
explicitly disable buffering from the cdcr endpoint
https://lucene.apache.org/solr/guide/7_7/cdcr-api.html#disablebuffer
Make sure that they're disabled on both the source and targets

I believe that sometimes buffers get enabled on their own. We added monitoring 
of CDCR to check for the buffer setting
This endpoint shows you the status
https://lucene.apache.org/solr/guide/7_7/cdcr-api.html#cdcr-status-example

I don't understand the use case for enabling  buffers, or why it is enabled by 
default.
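
For reference, the config-side counterpart (as far as I can tell from the 7.x
CDCR documentation) is the buffer element on the CdcrRequestHandler; a sketch
with buffering disabled by default, leaving out the source/target/replica
configuration that would normally sit in the same handler:

  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="buffer">
      <str name="defaultState">disabled</str>
    </lst>
  </requestHandler>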

-Original Message-
From: Erick Erickson 
Sent: Wednesday, October 23, 2019 7:23 AM
To: solr-user@lucene.apache.org
Subject: Re: tlogs are not deleted

My first guess is that your CDCR setup isn’t running. CDCR uses tlogs as a 
queueing mechanism. If CDCR can’t send docs to the target collection, they’ll 
accumulate forever.

Best,
Erick

> On Oct 22, 2019, at 7:48 PM, Woo Choi  wrote:
>
> Hi,
>
> We are using solr 7.7 cloud with CDCR(every collection has 3 replicas,
> 1 shard).
>
> In solrconfig.xml,
>
> tlog configuration is super simple like : 
>
> There is also daily data import and commit is called after data import
> every time.
>
> Indexing works fine, but the problem is that the number of tlogs keeps
> growing.
>
> According to the documentation here
> (https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html),
> I expected at most 10 tlogs to remain (the default value of
> maxNumLogsToKeep=10).
>
> However I still have a bunch of tlogs - the oldest one is Sep 6..!
>
> I did an experiment by running a data import with the commit option from
> the solr admin ui, but none of the tlogs were deleted.
>
> tlog.002.1643995079881261056
> tlog.018.1645444642733293568
> tlog.034.1646803619099443200
> tlog.003.1644085718240198656
> tlog.019.1645535304072822784
> tlog.035.1646894195509559296
> tlog.004.1644176284537847808
> tlog.020.1645625651261079552
> tlog.036.1646984623121498112
> tlog.005.1644357373324689408
> tlog.021.1645625651316654083
> tlog.037.1647076244416626688
> tlog.006.167899616018432
> tlog.022.1645716477747134464
> tlog.038.1647165801017376768
> tlog.007.1644538486210953216
> tlog.023.1645806853961023488
> tlog.039.1647165801042542594
> tlog.008.1644629084296183808
> tlog.024.1645897663703416832
> tlog.040.1647256590865137664
> tlog.009.1644719895268556800
> tlog.025.1645988248838733824
> tlog.041.1647347172490870784
> tlog.010.1644810493331767296
> tlog.026.1646078905702940672
> tlog.042.1647437758859313152
> tlog.011.1644901113324896256
> tlog.027.1646169478772293632
> tlog.043.1647528345005457408
> tlog.012.1645031030684385280
> tlog.028.1646259838395613184
> tlog.044.1647618793025830912
> tlog.013.164503103008545
> tlog.029.1646350429145006080
> tlog.045.1647709579019026432
> tlog.014.1645082080252526592
> tlog.030.1646441456502571008
> tlog.046.1647890587519549440
> tlog.015.1645172929206419456
> tlog.031.1646531802044563456
> tlog.047.1647981403286011904
> tlog.016.1645263488829882368
> tlog.032.16466061568
> tlog.048.1648071989042085888
> tlog.017.1645353861842468864
> tlog.033.1646712822719053824
> tlog.049.1648135546466205696
>
> Did I miss something in the solrconfig file?
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: WordDelimiter in extended way.

2019-10-23 Thread Shawn Heisey

On 10/23/2019 9:41 AM, servus01 wrote:

Hey,

thank you for helping me:

Thanks in advance for any help, really appreciate it.





It is not the WordDelimiter filter that is affecting your punctuation. 
It is the StandardTokenizer, which is the first analysis component that 
runs.  You can see this in the first screenshot, where that tokenizer 
outputs terms of "CCF" "HD" and "2nd".


That filter is capable of affecting punctuation, depending on its 
settings, but in this case, no punctuation is left by the time the 
analysis hits that filter.
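
If keeping the bare hyphen as its own token matters, one direction to
experiment with (a sketch only, since we haven't seen your schema; the
fieldType name here is made up) is a chain that tokenizes on whitespace
instead:

  <fieldType name="text_keep_hyphen" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With whitespace tokenization, "HD - 2nd" becomes the tokens "HD", "-" and
"2nd", so a standalone hyphen survives; whether that interacts well with the
rest of your WordDelimiter setup is something to verify on the Analysis screen.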


Thanks,
Shawn


Re: Solr Prod stopped yesterday - says "insufficient memory for the Java Runtime Environment"

2019-10-23 Thread Shawn Heisey

On 10/23/2019 9:08 AM, Vignan Malyala wrote:

Ok. I have around 500 cores in my solr. So, how much heap should I allocate
in solr and jvm?
(Currently as I see, in solr.in.sh shows heap as  - Xms 20g -Xmx 20g.
And my system jvm heap shows -Xms 528m -Xmx 8g. I've re-checked it.)


We have no way of knowing how big a heap you need.

https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

We can make an educated guess with certain information that has not been 
provided, but it would only be a guess, and might be wrong.


As I said before, there is only one heap for a Java program.  There is 
no distinction between Java heap, Solr heap, system heap, or a heap with 
any other name.  If you have multiple entries in the commandline for 
-Xmx or -Xms, only one of them is going to take effect, and I do not 
know which one.


The error message you're getting in Java's error file means that there 
wasn't enough memory available for Java to allocate what it wanted to 
allocate, so Java crashed.  This is a problem at the Java level, Solr is 
not involved.


Thanks,
Shawn


Re: WordDelimiter in extended way.

2019-10-23 Thread servus01
Hey,

thank you for helping me:

[analysis screenshots omitted]
Thanks in advance for any help, really appreciate it.

 
 



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Prod stopped yesterday - says "insufficient memory for the Java Runtime Environment"

2019-10-23 Thread Vignan Malyala
Ok. I have around 500 cores in my solr. So, how much heap should I allocate
in solr and jvm?
(Currently as I see, in solr.in.sh shows heap as  - Xms 20g -Xmx 20g.
And my system jvm heap shows -Xms 528m -Xmx 8g. I've re-checked it.)


On Wed 23 Oct, 2019, 7:52 PM Shawn Heisey,  wrote:

> On 10/23/2019 4:09 AM, Vignan Malyala wrote:
> > *Solr prod stopped yesterday. How to prevent this.*
> >
> > Solr heap info is :  -Xms20g -Xmx20g
> > JVM Heap info. : -Xms528m -Xmx8g
>
> There is no such thing as a Solr heap separate from the JVM heap.  There
> are multiple environment variables that can specify the heap size ...
> only one of those settings is actually going to take effect.  I have not
> done any investigation to determine which one it will be.
>
> > Physical Ram - 32GB
> > Solr version - 6.6.1
> > Swap memory - 8g
> >
> > *hc_err_pid.log got created with following info in it:*
> > #
> > # There is insufficient memory for the Java Runtime Environment to
> continue.
> > # Native memory allocation (mmap) failed to map 16106127360 bytes for
> > committing reserved memory.
>
> This sounds like there is insufficient memory available when running
> Solr for the system to start Java with the configured settings.  Based
> on this number, which is about 16GB, I'm betting that the heap size
> which took effect is the 20GB one, or maybe it got set to 16GB by
> another setting that you did not mention above.
>
> Your information says that there is 32GB total memory ... maybe there
> are other programs that are using up some of that memory before Solr
> attempts to start, and there is not enough memory left for Solr.
>
> Thanks,
> Shawn
>


Re: WordDelimiter in extended way.

2019-10-23 Thread Shawn Heisey

On 10/23/2019 7:43 AM, servus01 wrote:

Now Solr behaves in such a way that, on the one hand, hyphens that have a
blank before and after are not indexed, and also a search for blank - blank
does not return any results.
With the WordDelimiter I have already covered cases like 2019-2020. But
for blank - blank I'm running out of ideas. Normally it should tokenize the
word before the hyphen, the blanks with the hyphen, and the word after the
hyphen as one token.


To figure out what's happening, we will need to see the entire analysis 
chain, both index and query.  In order to see those, we will need the 
field definition as well as the referenced fieldType definition from 
your schema.  Additional details needed:  Exact Solr version and the 
schema version.  The schema version is at the top of the schema.


Thanks,
Shawn


Re: Solr Payload example

2019-10-23 Thread Vincenzo D'Amore
Hi Erick, yes, absolutely, it's a great pleasure for me to contribute.
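
For anyone following the quoted discussion below: string payloads like
"store1|USD" are presumably indexed with a delimited-payload field type; a
sketch of the stock definition that ships in the default schema (the identity
encoder keeps the payload bytes as-is, which is what a string decoder would
read back):

  <fieldType name="delimited_payloads_string" class="solr.TextField" indexed="true" stored="false">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
    </analyzer>
  </fieldType>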

On Wed, Oct 23, 2019 at 2:25 PM Erick Erickson 
wrote:

> Bookmarked. Do you intend that this should be incorporated into Solr? If
> so, please raise a JIRA and link your PR in….
>
> Thanks!
> Erick
>
> > On Oct 22, 2019, at 6:56 PM, Vincenzo D'Amore 
> wrote:
> >
> > Hi all,
> >
> > this evening I had some spare hour to spend in order to put everything
> > together in a repository.
> >
> > https://github.com/freedev/solr-payload-string-function-query
> >
> >
> >
> > On Tue, Oct 22, 2019 at 5:54 PM Vincenzo D'Amore 
> wrote:
> >
> >> Hi all,
> >>
> >> thanks for supporting. And many thanks whom have implemented
> >> the integration of the github Solr repository with the intellij IDE.
> >> To configure the environment and run the debugger I spent less than one
> >> hour, (and most of the time I had to wait the compilation).
> >> Solr and you guys really rocks together.
> >>
> >> What I've done:
> >>
> >> I was looking at the original payload function is defined into
> >> the ValueSourceParser, this function uses a FloatPayloadValueSource to
> >> return the value found.
> >>
> >> As said I wrote a new version of payload function that handles strings,
> I
> >> named it spayload, and basically is able to extract the string value
> from
> >> the payload.
> >>
> >> Given the former example where I have a multivalue field payloadCurrency
> >>
> >> payloadCurrency: [
> >> "store1|USD",
> >> "store2|EUR",
> >> "store3|GBP"
> >> ]
> >>
> >> executing spayload(payloadCurrency,store2) returns "EUR", and so on for
> >> the remaining key/value in the field.
> >>
> >> To implement the spayload function, I've added a new ValueSourceParser
> >> instance to the list of defined functions and which returns
> >> a StringPayloadValueSource with the value inside (does the same thing of
> >> former FloatPayloadValueSource).
> >>
> >> That's all. As said, always beware of your code when works at first run.
> >> And really there was something wrong, initially I messed up in the
> >> conversion of the payload into String (bytes, offset, etc).
> >> Now it is fixed, or at least it seems to me.
> >> I see this function cannot be used in the sort, very likely the simple
> >> implementation of the StringPayloadValueSource miss something.
> >>
> >> As far as I understand I'm scratching the surface of this solution,
> there
> >> are few things I'm worried about. I have a bunch of questions, please be
> >> patient.
> >> This function returns an empty string "" when does not match any key, or
> >> should return an empty value? not sure about, what's the correct way to
> >> return an empty value?
> >> I wasn't able to find a test unit for the payload function in the tests.
> >> Could you give me few suggestion in order to test properly the
> >> implementation?
> >> In case the spayload is used on a different field type (i.e. the use
> >> spayload on a float payload) the behaviour is not handled. Can this
> >> function check the type of the payload content?
> >> And at last, what do you think, can this simple fix be interesting for
> the
> >> Solr community, may I try to submit a pull request or add a feature to
> JIRA?
> >>
> >> Best regards,
> >> Vincenzo
> >>
> >>
> >> On Mon, Oct 21, 2019 at 9:12 PM Erik Hatcher 
> >> wrote:
> >>
> >>> Yes.   The decoding of a payload based on its schema type is what the
> >>> payload() function does.   Your Payloader won't currently work
> well/legibly
> >>> for fields encoded numerically:
> >>>
> >>>
> >>>
> https://github.com/o19s/payload-component/blob/master/src/main/java/com/o19s/payloads/Payloader.java#L130
> >>> <
> >>>
> https://github.com/o19s/payload-component/blob/master/src/main/java/com/o19s/payloads/Payloader.java#L130
> 
> >>>
> >>> I think that code could probably be slightly enhanced to leverage
> >>> PayloadUtils.getPayloadDecoder(fieldType) and use bytes if the field
> type
> >>> doesn't have a better decoder.
> >>>
> >>>Erik
> >>>
> >>>
>  On Oct 21, 2019, at 2:55 PM, Eric Pugh <
> ep...@opensourceconnections.com>
> >>> wrote:
> 
>  Have you checked out
>  https://github.com/o19s/payload-component
> 
>  On Mon, Oct 21, 2019 at 2:47 PM Erik Hatcher 
> >>> wrote:
> 
> > How about a single field, with terms like:
> >
> >   store1_USD|125.0 store2_EUR|220.0 store3_GBP|225.0
> >
> > Would that do the trick?
> >
> > And yeah, payload decoding is currently limited to float and int with
> >>> the
> > built-in payload() function.   We'd need a new way to pull out
> > textual/bytes payloads - like maybe a DocTransformer?
> >
> >   Erik
> >
> >
> >> On Oct 21, 2019, at 9:59 AM, Vincenzo D'Amore 
> > wrote:
> >>
> >> Hi Erick,
> >>
> >> thanks for getting back to me. We started to use payloads because we
> >>> have
> >> the classical per-store pricing problem.
> >> Thousands of stores across and differen

Re: Solr Prod stopped yesterday - says "insufficient memory for the Java Runtime Environment"

2019-10-23 Thread Shawn Heisey

On 10/23/2019 4:09 AM, Vignan Malyala wrote:

*Solr prod stopped yesterday. How to prevent this.*

Solr heap info is :  -Xms20g -Xmx20g
JVM Heap info. : -Xms528m -Xmx8g


There is no such thing as a Solr heap separate from the JVM heap.  There 
are multiple environment variables that can specify the heap size ... 
only one of those settings is actually going to take effect.  I have not 
done any investigation to determine which one it will be.



Physical Ram - 32GB
Solr version - 6.6.1
Swap memory - 8g

*hc_err_pid.log got created with following info in it:*
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16106127360 bytes for
committing reserved memory.


This sounds like there is insufficient memory available when running 
Solr for the system to start Java with the configured settings.  Based 
on this number, which is about 16GB, I'm betting that the heap size 
which took effect is the 20GB one, or maybe it got set to 16GB by 
another setting that you did not mention above.


Your information says that there is 32GB total memory ... maybe there 
are other programs that are using up some of that memory before Solr 
attempts to start, and there is not enough memory left for Solr.


Thanks,
Shawn


Re: Document Update performances Improvement

2019-10-23 Thread Shawn Heisey

On 10/22/2019 1:12 PM, Nicolas Paris wrote:

We, at Auto-Suggest, also do atomic updates daily and specifically
changing merge factor gave us a boost of ~4x


Interesting. What kind of change exactly on the merge factor side ?


The mergeFactor setting is deprecated.  Instead, use maxMergeAtOnce, 
segmentsPerTier, and a setting that is not mentioned in the ref guide -- 
maxMergeAtOnceExplicit.


Set the first two to the same number, and the third to a minimum of 
three times what you set the other two.


The default setting for maxMergeAtOnce and segmentsPerTier is 10, with 
30 for maxMergeAtOnceExplicit.  When you're trying to increase indexing 
speed and you think segment merging is interfering, you want to increase 
these values to something larger.  Note that increasing these values 
will increase the number of files that your Solr install keeps open.


https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory

When I built a Solr setup, I increased maxMergeAtOnce and 
segmentsPerTier to 35, and maxMergeAtOnceExplicit to 105.  This made 
merging happen a lot less frequently.
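
As a sketch, that setup corresponds to an indexConfig block in solrconfig.xml
along these lines (tune the numbers to your own hardware and indexing load):

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">35</int>
      <int name="segmentsPerTier">35</int>
      <int name="maxMergeAtOnceExplicit">105</int>
    </mergePolicyFactory>
  </indexConfig>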



Would you say atomical update is faster than regular replacement of
documents ? (considering my first thought on this below)


On the Solr side, atomic updates will be slightly slower than indexing 
the whole document provided to Solr.  When an atomic update is done, 
Solr will find the existing document, then combine what's in that 
document with the changes you specify using the atomic update, and then 
index the whole combined document as a new document that replaces the original.
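
For reference, a minimal sketch of what an atomic update looks like in the XML
update format (the field names here are made up); only the field marked with
update="set" is sent by the client, everything else is rebuilt from the
existing document:

  <add>
    <doc>
      <field name="id">doc-123</field>
      <field name="price_f" update="set">42.0</field>
    </doc>
  </add>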


Whether or not atomic updates are faster or slower in practice than 
indexing the whole document will depend on how your source systems work, 
and that is not something we can know.  If Solr can access the previous 
document faster than you can get the document from your source system, 
then atomic updates might be faster.


Thanks,
Shawn


WordDelimiter in extended way.

2019-10-23 Thread servus01
Hello,

maybe somebody can help me out. We have a lot of datasets that are always
built according to the same scheme:

Expression - Expression

as an example:

"CCF *HD - 2nd* BL 2019-2020 1st matchday VfL Osnabrück vs. 1st FC
Heidenheim 1846 | 1st HZ without WZ"

or 

"Scouting Feed *mp4 - 2.* BL 2019-2020 1st matchday SV Wehen Wiesbaden vs.
Karlsruher SC"

Now Solr behaves in such a way that, on the one hand, hyphens that have a
blank before and after are not indexed, and also a search for blank - blank
does not return any results.
With the WordDelimiter I have already covered cases like 2019-2020. But
for blank - blank I'm running out of ideas. Normally it should tokenize the
word before the hyphen, the blanks with the hyphen, and the word after the
hyphen as one token.

Best

Francois



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 7.2.1 - Performance recommendation needed

2019-10-23 Thread Erick Erickson
I suggest you use something like GCViewer to analyze it and report on what you 
see and any specific questions you have.

Erick

> On Oct 22, 2019, at 7:23 PM, saravanamanoj  wrote:
> 
> Thanks Erick,
> 
> Below is the link for our GC report when the incident happened.
> 
> https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAxOS8xMC83Ly0tMDJfc29scl9nYy5sb2cuNi5jdXJyZW50LS0xNC00My01OA==&channel=WEB
> 
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr Prod stopped yesterday - says "insufficient memory for the Java Runtime Environment"

2019-10-23 Thread Paras Lehana
Hi Vignan,

I see this setting quite strange.


Also, if this is the case, you have allocated Solr more memory even than
maximum allowed for total JVM. Solr heap is a subset of JVM heap - do note
that!

On Wed, 23 Oct 2019 at 16:41, Vincenzo D'Amore  wrote:

> Hi,
>
> I see this setting quite strange:
>
> Solr heap info is :  -Xms20g -Xmx20g
> JVM Heap info. : -Xms528m -Xmx8g
>
> “Usually” Solr runs inside the jvm and you can have only one of these
> settings really active. I suggest to double check your memory
> configuration.
>
> Ciao,
> Vincenzo
>
> --
> skype: free.dev
>
> > On 23 Oct 2019, at 12:16, Vignan Malyala  wrote:
> >
> > Solr heap info is :  -Xms20g -Xmx20g
> > JVM Heap info. : -Xms528m -Xmx8g
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

-- 
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.


Re: copyField - why source should contain * when dest contains *?

2019-10-23 Thread Paras Lehana
Hey Erick,

Thanks for addressing.

Copyfields are intended to copy exactly one field in the input into exactly
> one field in the destination, not multiple ones at the same time.


Documentation says that we can copy multiple fields using wildcard to one
or more than one fields.



Remember that Solr is also dealing with dynamic fields. In this case, what
> does “*_x” mean?


Yes, that's what hit me initially. But, "*_x" while indexing (in XMLs)
doesn't mean anything, right? It's only used in dynamicFields while
defining schema to let Solr know that we would have some undeclared fields
having names like this. Also, according to the documentation, we can have
dest="*_x" when source="*_x" if I'm right. In this case, there's support
for multiple destinations when there are multiple sources.



 Or is this mostly curiosity?


I'm just curious what exactly restricts multiple destinations with a single
source.



 And what use-case do you want to solve?


Yes, it does seem not too practical. Maybe the impossibility of chaining
copyFields is the reason here. I'm just curious about the implementation -
there must be a catch.


Anyways, thanks for replying, Erick. :)

On Wed, 23 Oct 2019 at 17:41, Erick Erickson 
wrote:

> So how would that work? Copyfields are intended to copy exactly one field
> in the input into exactly one field in the destination, not multiple ones
> at the same time. If you need to do that, define multiple copyField
> directives.
>
> I don’t even see how that would work.  dest=“*_x”/>. Remember that Solr is also dealing with dynamic fields. In
> this case, what does “*_x” mean? Create N new fields?
>
> And what use-case do you want to solve? Or is this mostly curiosity?
>
> Best,
> Erick
>
> > On Oct 23, 2019, at 7:55 AM, Paras Lehana 
> wrote:
> >
> > Can't we have one source field
> > information that is copied into different fields
>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

-- 
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.


Re: Query on changing FieldType

2019-10-23 Thread Erick Erickson
Really, just don’t do this. Please. As others have pointed out, it may look 
like it works, but it won’t. I’ve spent many hours tracking down why clients 
got weird errors after making changes like this, sometimes weeks later. Or more 
accurately, if you choose to change field types without reindexing, please 
don’t ask for others to troubleshoot it when something blows up.

As far as creating a new core, if that takes a significant amount of time 
relative to re-indexing, then you must be working with a very small index. 
Those operations should take a couple of minutes, tops.

Best,
Erick

> On Oct 23, 2019, at 5:13 AM, Emir Arnautović  
> wrote:
> 
> Hi Shubham,
> My guess is that it might be working for text because it uses o.toString(), so 
> there are no runtime errors, while in the case of others it has to assume some 
> class and do class casting. You can check in the logs what sort of error 
> happens. But in any case, like Jason pointed out, that is a problem that is 
> just waiting to happen somewhere and the only way to make sure it does not 
> happen is to do full reindexing or to create a new field (with a new name) 
> and stop using the one that is wrong. Different field types are indexed in 
> different structures and with different defaults (e.g. for docValues) and I 
> would not rely on some features working after field type changed.
> 
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 23 Oct 2019, at 08:18, Shubham Goswami  wrote:
>> 
>> Hi Jason
>> 
>> Thanks for the response.
>> You are right that re-indexing is required after making any changes to
>> Schema even i am re-indexing the docs in which i have
>> changed the fieldtypes, but here Emir is talking about full re-indexing
>> i.e. deleting the existing/core and creating new one that is
>> time consuming i think. My doubt is that i am not able to change the type
>> which has implementation classes like LongPointField/IntPointField to the
>> type with implementation classes LongPointField/IntPointField.t
>> 
>>But i am able to change into Text related fields like TextFields
>> and from TextFields to any other Int/Long type fields.
>> So i just want to know that what is exact dependency on these classes so
>> that iam able to change types of some fields ?
>> 
>> Thanks
>> Shubham
>> 
>> On Tue, Oct 22, 2019 at 6:29 PM Jason Gerlowski 
>> wrote:
>> 
>>> Hi Shubbham,
>>> 
>>> Emir gave you accurate advice - you cannot (safely) change field types
>>> without reindexing.  You may avoid errors for a time, and searches may
>>> even return the results you expect.  But the type-change is still a
>>> ticking time bomb...Solr might try to merge segments down the road or
>>> do some other operation and blow up in unexpected ways.  For more
>>> information on why this is, see the documentation here:
>>> https://lucene.apache.org/solr/guide/8_2/reindexing.html.
>>> 
>>> Unfortunately there's no way around it.  This, by the way, is why the
>>> community strongly recommends against using schema-guessing mode for
>>> anything other than experimentation.
>>> 
>>> Best of luck,
>>> 
>>> Jason
>>> 
>>> On Tue, Oct 22, 2019 at 7:42 AM Shubham Goswami
>>>  wrote:
 
 Hi Emir
 
 As you have mentioned above we cannot change field type after indexing
>>> once
 and we have to do full re-indexing again, I tried to change field type
>>> from
 plong to pint which has implemented class solr.LongPointField and
 solr.IntPointField respectively and it was showing error as expected.
   But when i changed field types from pint/plong to any type which
 has implemented class solr.TextField, in this case its working fine and i
 am able to index the documents after changing its fieldtype with same and
 different id.
 
 So i want to know if is there any compatibility with implemented classes
>>> ?
 
 Thanks
 Shubham
 
 On Tue, Oct 22, 2019 at 2:46 PM Emir Arnautović <
 emir.arnauto...@sematext.com> wrote:
 
> Hi Shubham,
> No you cannot. What you can do is to use copy field or update request
> processor to store is as some other field and use that in your query
>>> and
> ignore the old one that will eventually disappear as the result of
>>> segment
> merges.
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training -
>>> http://sematext.com/
> 
> 
> 
>> On 22 Oct 2019, at 10:53, Shubham Goswami >>> 
> wrote:
>> 
>> Hi Emir
>> 
>> Thanks for the reply, i got your point.
>> But is there any other way to do like one field could have two or
>>> more
>> different types defined ?
>> or  if i talk about my previous query, can we index some data for the
> same
>> field with different unique id after replac

Re: Solr Payload example

2019-10-23 Thread Erick Erickson
Bookmarked. Do you intend that this should be incorporated into Solr? If so, 
please raise a JIRA and link your PR in….

Thanks!
Erick

> On Oct 22, 2019, at 6:56 PM, Vincenzo D'Amore  wrote:
> 
> Hi all,
> 
> this evening I had some spare hour to spend in order to put everything
> together in a repository.
> 
> https://github.com/freedev/solr-payload-string-function-query
> 
> 
> 
> On Tue, Oct 22, 2019 at 5:54 PM Vincenzo D'Amore  wrote:
> 
>> Hi all,
>> 
>> thanks for supporting. And many thanks whom have implemented
>> the integration of the github Solr repository with the intellij IDE.
>> To configure the environment and run the debugger I spent less than one
>> hour, (and most of the time I had to wait the compilation).
>> Solr and you guys really rocks together.
>> 
>> What I've done:
>> 
>> I was looking at how the original payload function is defined in
>> ValueSourceParser; this function uses a FloatPayloadValueSource to
>> return the value found.
>> 
>> As said, I wrote a new version of the payload function that handles strings. I
>> named it spayload, and it basically extracts the string value from
>> the payload.
>> 
>> Given the former example where I have a multivalue field payloadCurrency
>> 
>> payloadCurrency: [
>> "store1|USD",
>> "store2|EUR",
>> "store3|GBP"
>> ]
>> 
>> executing spayload(payloadCurrency,store2) returns "EUR", and so on for
>> the remaining key/value pairs in the field.
>> 
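A minimal sketch of the field type this kind of string payload assumes, i.e. the
stock delimited-payloads analyzer with the identity encoder (field and type names
here are illustrative, not taken from the thread):

  <fieldType name="delimited_payloads_string" class="solr.TextField"
             indexed="true" stored="false">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory"
              delimiter="|" encoder="identity"/>
    </analyzer>
  </fieldType>
  <field name="payloadCurrency" type="delimited_payloads_string"
         indexed="true" stored="true" multiValued="true"/>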
>> To implement the spayload function, I've added a new ValueSourceParser
>> instance to the list of defined functions, which returns
>> a StringPayloadValueSource with the value inside (it does the same thing as the
>> former FloatPayloadValueSource).
>> 
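If one wanted to wire such a function in as a plugin rather than by patching
ValueSourceParser itself, solrconfig.xml supports registering custom parsers; a
sketch, where the class name is hypothetical and not taken from the linked
repository:

  <!-- hypothetical class name, shown only to illustrate the plugin hook -->
  <valueSourceParser name="spayload"
                     class="com.example.SPayloadValueSourceParser"/>

It could then be used like any other function, e.g.
fl=currency:spayload(payloadCurrency,store2).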
>> That's all. As said, always be wary of code that works on the first run.
>> And there really was something wrong: initially I messed up the
>> conversion of the payload into a String (bytes, offset, etc.).
>> Now it is fixed, or at least so it seems to me.
>> I see this function cannot be used for sorting; very likely the simple
>> implementation of StringPayloadValueSource misses something.
>> 
>> As far as I understand I'm only scratching the surface of this solution; there
>> are a few things I'm worried about. I have a bunch of questions, please be
>> patient.
>> This function returns an empty string "" when it does not match any key; should
>> it return an empty value instead? I'm not sure; what's the correct way to
>> return an empty value?
>> I wasn't able to find a unit test for the payload function in the test suite.
>> Could you give me a few suggestions on how to test the implementation
>> properly?
>> If spayload is used on a different field type (i.e. spayload on a
>> float payload), the behaviour is not handled. Can this
>> function check the type of the payload content?
>> And lastly, what do you think: could this simple addition be interesting for the
>> Solr community? May I submit a pull request or open a JIRA feature request?
>> 
>> Best regards,
>> Vincenzo
>> 
>> 
>> On Mon, Oct 21, 2019 at 9:12 PM Erik Hatcher 
>> wrote:
>> 
>>> Yes.   The decoding of a payload based on its schema type is what the
>>> payload() function does.   Your Payloader won't currently work well/legibly
>>> for fields encoded numerically:
>>> 
>>> 
>>> https://github.com/o19s/payload-component/blob/master/src/main/java/com/o19s/payloads/Payloader.java#L130
>>> 
>>> I think that code could probably be slightly enhanced to leverage
>>> PayloadUtils.getPayloadDecoder(fieldType) and use bytes if the field type
>>> doesn't have a better decoder.
>>> 
>>>Erik
>>> 
>>> 
 On Oct 21, 2019, at 2:55 PM, Eric Pugh 
>>> wrote:
 
 Have you checked out
 https://github.com/o19s/payload-component
 
 On Mon, Oct 21, 2019 at 2:47 PM Erik Hatcher 
>>> wrote:
 
> How about a single field, with terms like:
> 
>   store1_USD|125.0 store2_EUR|220.0 store3_GBP|225.0
> 
> Would that do the trick?
> 
> And yeah, payload decoding is currently limited to float and int with
>>> the
> built-in payload() function.   We'd need a new way to pull out
> textual/bytes payloads - like maybe a DocTransformer?
> 
>   Erik
> 
> 
>> On Oct 21, 2019, at 9:59 AM, Vincenzo D'Amore 
> wrote:
>> 
>> Hi Erick,
>> 
>> thanks for getting back to me. We started to use payloads because we have
>> the classical per-store pricing problem:
>> thousands of stores, with different prices.
>> Then we found payloads so useful that we started to use them for many reasons,
>> like enabling/disabling a product for a given store, saving the stock
>> availability, or saving other info like buy/sell prices, discount rates,
>> and so on.
>> All that information is numeric, but stores can also be in different
>> countries; I mean it would be useful also to have

Re: tlogs are not deleted

2019-10-23 Thread Erick Erickson
My first guess is that your CDCR setup isn’t running. CDCR uses tlogs as a 
queueing mechanism. If CDCR can’t send docs to the target collection, they’ll 
accumulate forever.
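A related cause worth ruling out (an assumption here, not something confirmed for
this setup): if update-log buffering is left enabled on the source, tlogs are
retained indefinitely by design. A sketch of the source-side solrconfig.xml with
buffering off, using placeholder host and collection names:

  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="replica">
      <str name="zkHost">target-zk:2181/solr</str>
      <str name="source">mycollection</str>
      <str name="target">mycollection</str>
    </lst>
    <lst name="buffer">
      <str name="defaultState">disabled</str>
    </lst>
  </requestHandler>

Buffering can also be switched off at runtime with the CDCR API's DISABLEBUFFER
action.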

Best,
Erick

> On Oct 22, 2019, at 7:48 PM, Woo Choi  wrote:
> 
> Hi,
> 
> We are using solr 7.7 cloud with CDCR(every collection has 3 replicas, 1
> shard).
> 
> In solrconfig.xml, 
> 
> tlog configuration is super simple like : 
> 
> There is also a daily data import, and a commit is called after the data import
> every time.
> 
> Indexing works fine, but the problem is that the number of tlogs keeps
> growing.
> 
> According to the documentation
> here (https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html),
> I expected at most 10 tlogs to remain (the default value of
> maxNumLogsToKeep=10).
> 
> However I still have a bunch of tlogs; the oldest one is from Sep 6!
> 
> I did an experiment by running the data import with the commit option from the solr
> admin UI, but none of the tlogs were deleted.
> 
> tlog.002.1643995079881261056 
> tlog.018.1645444642733293568 
> tlog.034.1646803619099443200
> tlog.003.1644085718240198656 
> tlog.019.1645535304072822784 
> tlog.035.1646894195509559296
> tlog.004.1644176284537847808 
> tlog.020.1645625651261079552 
> tlog.036.1646984623121498112
> tlog.005.1644357373324689408 
> tlog.021.1645625651316654083 
> tlog.037.1647076244416626688
> tlog.006.167899616018432 
> tlog.022.1645716477747134464 
> tlog.038.1647165801017376768
> tlog.007.1644538486210953216 
> tlog.023.1645806853961023488 
> tlog.039.1647165801042542594
> tlog.008.1644629084296183808 
> tlog.024.1645897663703416832 
> tlog.040.1647256590865137664
> tlog.009.1644719895268556800 
> tlog.025.1645988248838733824 
> tlog.041.1647347172490870784
> tlog.010.1644810493331767296 
> tlog.026.1646078905702940672 
> tlog.042.1647437758859313152
> tlog.011.1644901113324896256 
> tlog.027.1646169478772293632 
> tlog.043.1647528345005457408
> tlog.012.1645031030684385280 
> tlog.028.1646259838395613184 
> tlog.044.1647618793025830912
> tlog.013.164503103008545 
> tlog.029.1646350429145006080 
> tlog.045.1647709579019026432
> tlog.014.1645082080252526592 
> tlog.030.1646441456502571008 
> tlog.046.1647890587519549440
> tlog.015.1645172929206419456 
> tlog.031.1646531802044563456 
> tlog.047.1647981403286011904
> tlog.016.1645263488829882368 
> tlog.032.16466061568 
> tlog.048.1648071989042085888
> tlog.017.1645353861842468864 
> tlog.033.1646712822719053824 
> tlog.049.1648135546466205696
> 
> Did I miss something in the solrconfig file?
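For reference, a sketch of how those retention limits are normally spelled out in
solrconfig.xml (the values shown are just the usual defaults, not taken from this
setup); note that with CDCR the update log class is solr.CdcrUpdateLog and entries
are kept until the target acknowledges them, so these limits alone will not cap
growth:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <int name="numRecordsToKeep">100</int>
      <int name="maxNumLogsToKeep">10</int>
    </updateLog>
  </updateHandler>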
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: copyField - why source should contain * when dest contains *?

2019-10-23 Thread Erick Erickson
So how would that work? Copyfields are intended to copy exactly one field in 
the input into exactly one field in the destination, not multiple ones at the 
same time. If you need to do that, define multiple copyField directives.

I don’t even see how that would work.
Remember that Solr is also dealing with dynamic fields. In this case, what does 
“*_x” mean? Create N new fields?
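To make that concrete, a schema sketch (field names are only illustrative):

  <copyField source="title" dest="title_en"/>
  <copyField source="title" dest="title_sort"/>
  <copyField source="title" dest="_text_"/>

  <copyField source="*_t" dest="*_str"/>

One plain source can feed as many destinations as you like through separate
directives; in the wildcard form, a field named price_t is copied into price_str,
because the glob matched in the source is substituted into the dest.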

And what use-case do you want to solve? Or is this mostly curiosity?

Best,
Erick

> On Oct 23, 2019, at 7:55 AM, Paras Lehana  wrote:
> 
> Can't we have one source field
> information that is copied into different fields



copyField - why source should contain * when dest contains *?

2019-10-23 Thread Paras Lehana
Hi Community,

I was just going through the *Solr Ref Guide 8.1* from scratch and I was
reading about *copyFields*. We have
been working with copyFields in 6.6 for a year. I just wanted to refresh
what we know, and what we should know, before we upgrade to 8.2.

The last quote on the page mentions:

> *The copyField command can use a wildcard (*) character in the dest
> parameter only if the source parameter contains one as well. copyField uses
> the matching glob from the source field for the dest field name into which
> the source content is copied.*


*Why do we have this restriction?* Can't we have one source field whose
information is copied into different fields? The second statement is
probably the explanation for the first (if not, please help me understand
that as well), but I cannot relate the two.

Thanks in advance.
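For what it's worth, a sketch of what the quoted rule permits and forbids (field
and type names are only illustrative):

  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_search" type="text_general" indexed="true" stored="false"/>
  <copyField source="*_s" dest="*_search"/>

  <!-- not allowed: a concrete source with a wildcard dest,
       because there is no glob to substitute into the dest name -->
  <!-- <copyField source="title" dest="*_search"/> -->

Here a field named item_color_s is copied into item_color_search: the glob matched
in the source ("item_color") supplies the dest name, which is why a wildcard dest
needs a wildcard source.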

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: Solr Prod stopped yesterday - says "insufficient memory for the Java Runtime Environment"

2019-10-23 Thread Vincenzo D'Amore
Hi,

I find these settings quite strange:

Solr heap info is :  -Xms20g -Xmx20g
JVM Heap info. : -Xms528m -Xmx8g

“Usually” Solr runs inside a single JVM, so only one of these heap settings can
really be active. I suggest double-checking your memory configuration.

Ciao,
Vincenzo

--
skype: free.dev

> On 23 Oct 2019, at 12:16, Vignan Malyala  wrote:
> 
> Solr heap info is :  -Xms20g -Xmx20g
> JVM Heap info. : -Xms528m -Xmx8g


Solr Prod stopped yesterday - says "insufficient memory for the Java Runtime Environment"

2019-10-23 Thread Vignan Malyala
*Solr prod stopped yesterday. How can we prevent this?*

Solr heap info is :  -Xms20g -Xmx20g
JVM Heap info. : -Xms528m -Xmx8g
Physical Ram - 32GB
Solr version - 6.6.1
Swap memory - 8g

*hs_err_pid.log was created with the following info in it:*
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16106127360 bytes for
committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   The process is running with CompressedOops enabled, and the Java Heap
may be blocking the growth of the native heap
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2749), pid=86291,
tid=0x7f8822e47700
#
# JRE version:  (8.0_211-b12) (build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode
linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#


*solr_gc.log shows following:*
2019-10-23T01:24:23.688+0530: 644433.457: Total time for which application
threads were stopped: 0.0028257 seconds, Stopping threads took: 0.0016863
seconds
Heap
 par new generation   total 4369088K, used 736363K [0x0002c000,
0x0004, 0x0004)
  eden space 3495296K,  13% used [0x0002c000, 0x0002dc9d5ed8,
0x00039556)
  from space 873792K,  30% used [0x0003caab, 0x0003daff4d80,
0x0004)
  to   space 873792K,   0% used [0x00039556, 0x00039556,
0x0003caab)
 concurrent mark-sweep generation total 15728640K, used 3807001K
[0x0004, 0x0007c000, 0x0007c000)
 Metaspace   used 45325K, capacity 47047K, committed 47324K, reserved
1091584K
  class spaceused 4821K, capacity 5230K, committed 5340K, reserved
1048576K


Re: Query on changing FieldType

2019-10-23 Thread Emir Arnautović
Hi Shubham,
My guess is that it might be working for text because it uses o.toString(), so
there are no runtime errors, while for the other types it has to assume some
class and does class casting. You can check in the logs what sort of error
happens. But in any case, as Jason pointed out, this is a problem that is
just waiting to happen somewhere, and the only way to make sure it does not
happen is to do full reindexing, or to create a new field (with a new name) and
stop using the one that is wrong. Different field types are indexed in
different structures and with different defaults (e.g. for docValues), and I
would not rely on any feature working after a field type change.
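A sketch of that new-field route, assuming the wrongly typed field is called price
and the replacement price_l (the names and the plong type are placeholders): add
the field with the desired type, clone incoming values into it with an update
request processor, and query only the new field while the old one ages out.

  <field name="price_l" type="plong" indexed="true" stored="true"/>

  <updateRequestProcessorChain name="clone-price" default="true">
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">price</str>
      <str name="dest">price_l</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>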

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 23 Oct 2019, at 08:18, Shubham Goswami  wrote:
> 
> Hi Jason
> 
> Thanks for the response.
> You are right that re-indexing is required after making any changes to the
> schema, and I am re-indexing the docs whose
> field types I have changed, but here Emir is talking about full re-indexing,
> i.e. deleting the existing core and creating a new one, which is
> time consuming I think. My doubt is that I am not able to change a type
> whose implementation class is LongPointField/IntPointField to another
> type with implementation class LongPointField/IntPointField.
> 
> But I am able to change into text-related types like TextField,
> and from TextField to any other int/long type field.
> So I just want to know what the exact dependency on these implementation
> classes is that lets me change the types of some fields?
> 
> Thanks
> Shubham
> 
> On Tue, Oct 22, 2019 at 6:29 PM Jason Gerlowski 
> wrote:
> 
>> Hi Shubham,
>> 
>> Emir gave you accurate advice - you cannot (safely) change field types
>> without reindexing.  You may avoid errors for a time, and searches may
>> even return the results you expect.  But the type-change is still a
>> ticking time bomb...Solr might try to merge segments down the road or
>> do some other operation and blow up in unexpected ways.  For more
>> information on why this is, see the documentation here:
>> https://lucene.apache.org/solr/guide/8_2/reindexing.html.
>> 
>> Unfortunately there's no way around it.  This, by the way, is why the
>> community strongly recommends against using schema-guessing mode for
>> anything other than experimentation.
>> 
>> Best of luck,
>> 
>> Jason
>> 
>> On Tue, Oct 22, 2019 at 7:42 AM Shubham Goswami
>>  wrote:
>>> 
>>> Hi Emir
>>> 
>>> As you have mentioned above, we cannot change a field type after indexing once
>>> and we have to do full re-indexing again. I tried to change the field type from
>>> plong to pint, which have the implementation classes solr.LongPointField and
>>> solr.IntPointField respectively, and it showed an error as expected.
>>>    But when I changed field types from pint/plong to any type which
>>> has the implementation class solr.TextField, it works fine and I
>>> am able to index the documents after changing the field type, with the same and
>>> different ids.
>>> 
>>> So I want to know if there is some compatibility constraint tied to the
>>> implementation classes?
>>> 
>>> Thanks
>>> Shubham
>>> 
>>> On Tue, Oct 22, 2019 at 2:46 PM Emir Arnautović <
>>> emir.arnauto...@sematext.com> wrote:
>>> 
 Hi Shubham,
 No you cannot. What you can do is to use a copy field or an update request
 processor to store it as some other field, use that in your query, and
 ignore the old one, which will eventually disappear as a result of segment
 merges.
 
 HTH,
 Emir
 --
 Monitoring - Log Management - Alerting - Anomaly Detection
 Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
 
 
 
> On 22 Oct 2019, at 10:53, Shubham Goswami wrote:
> 
> Hi Emir
> 
> Thanks for the reply, i got your point.
> But is there any other way to do this, like one field having two or more
> different types defined?
> Or, regarding my previous query, can we index some data for the same
> field with a different unique id after replacing the type?
> 
> Thanks again
> Shubham
> 
> On Tue, Oct 22, 2019 at 1:23 PM Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Shubham,
>> Changing a type is not allowed without full reindexing. If you do something
>> like that, Solr will end up with segments with different types for the same
>> field. Remember that segments are immutable and that a reindexed
>> document will land in a new segment, but the old segment will still be there,
>> and at query time Solr will have a mismatch between what is stated in the schema
>> and what is in the segment. In order to change a type you have to do full
>> reindexing: create a new collection and reindex all documents.
>> 
>> HTH,
>> Emir