Re: Result grouping performance

2017-11-22 Thread Mikhail Khludnev
Akos, Can you provide your request params? Do you just group and/or count grouped facets? Can you clarify how field collapsing is different from grouping, just make it unambiguous? On Wed, Nov 22, 2017 at 4:13 PM, Kempelen, Ákos < akos.kempe...@wolterskluwer.com> wrote: > Hello, > > I am

Re: Merging of index in Solr

2017-11-22 Thread Shawn Heisey
On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote: > I'm doing the merging on the SSD drive, the speed should be ok? The speed of virtually all modern disks will have almost no influence on the speed of the merge.  The bottleneck isn't disk transfer speed, it's the operation of the merge code

Re: NullPointerException in PeerSync.handleUpdates

2017-11-22 Thread Michael Braun
I went ahead and resolved the jira - it was never seen again by us in later versions of Solr. There are a number of bug fixes since the 6.2 release, so I personally recommend updating! On Wed, Nov 22, 2017 at 11:48 AM, Pushkar Raste wrote: > As mentioned in the JIRA,

Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
Hi Erick, Yes, we are planning to do sharding when we upgrade to the newer Solr 7.1.0, and probably will re-index everything. But currently we are waiting for certain issues on indexing the EML files to Solr 7.1.0 to be addressed first, like for this JIRA,

Re: DelimitedPayloadTokenFilterFactory missing from ref guide

2017-11-22 Thread John Anonymous
I would love to add to the documentation, but I don't know enough about this filter to do so. I was under the impression that this filter would store my payloads and then strip the payload characters from the indexed text. I am getting good results from the payload_check parser, but my indexed

Re: Merging of index in Solr

2017-11-22 Thread Erick Erickson
Sure, sharding can give you accurate faceting, although do note there are nuances: JSON faceting can occasionally be inexact, although there are JIRAs being worked on to correct this. "Traditional" faceting has a refinement phase that gets accurate counts. But the net-net is that I believe

Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
I'm doing the merging on the SSD drive, the speed should be ok? We need to merge because the data are indexed in two different collections, and we need them to be under the same collection, so that we can do things like faceting more accurately. Will sharding alone achieve this? Or do we have to

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Erick Erickson
Hmm. This is quite possible. Any time things take "too long" it can be a problem. For instance, if the leader sends docs to a replica and the request times out, the leader throws the follower into "Leader Initiated Recovery". The smoking gun here is that there are no errors on the follower, just

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
bq: We also had an HDFS setup already so it looked like a good option to not lose data. Earlier we had a few cases where we lost the machines so HDFS looked safer for that. Right, that's one of the places where using HDFS to back Solr makes a lot of sense. The other approach is to just have

Re: Embedded SOLR - Best practice?

2017-11-22 Thread Erick Erickson
I don't really understand what you're saying here. Solr is pretty fast, why not just put all 400K docs on a Solr instance and just use that? EmbeddedSolrServer works, but it really seems like for a small corpus like this just using a separate stand-alone Solr is way easier. Best, Erick On Wed,

Re: DelimitedPayloadTokenFilterFactory missing from ref guide

2017-11-22 Thread Erick Erickson
Thanks for noticing. Note that anyone can edit the asciidoc pages and submit a patch, it'd be great if you could submit a patch and add it to a JIRA. See the Write/Improve User Documentation section here: https://wiki.apache.org/solr/HowToContribute Best, Erick On Wed, Nov 22, 2017 at 3:37 PM,

DelimitedPayloadTokenFilterFactory missing from ref guide

2017-11-22 Thread John Anonymous
DelimitedPayloadTokenFilterFactory appears to be missing from this page: https://lucene.apache.org/solr/guide/7_1/filter-descriptions.html

Embedded SOLR - Best practice?

2017-11-22 Thread hvengurlekar
Hello Folks, Currently, I am using SOLR in production and around 40 documents are stored. For a particular use-case, I am looking to cache around 10 documents daily. I am thinking of using embedded Solr as the cache to support the range queries. I did not find any good documentation

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We actually use no auto warming. Our collections are pretty small and the query performance is not really a problem so far. We are using lots of collections and most Solr caches seem to be per core and not global so we also have a problem with caching. I have to test the HDFS cache some more

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger
Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported by: hadoop fs -du -s -h /solr6.6.0 29.9 T  89.9 T  /solr6.6.0 The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1 billion documents indexed and we index about 2.5 million documents per day. 
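The figures quoted above can be sanity-checked with a little arithmetic. The following is only an illustrative sketch: it assumes the second `du` column is simply the logical size multiplied by the replication factor, treats the sizes as decimal terabytes, and assumes the indexing rate stays constant. The variable names are made up for the example.

```python
# Figures quoted in the message above.
logical_tb = 29.9        # first column of `hadoop fs -du -s -h`
reported_raw_tb = 89.9   # second column (space consumed across all replicas)
replication = 3          # HDFS replication factor stated in the message
total_docs = 1.1e9       # "about 1.1 billion documents"
docs_per_day = 2.5e6     # "about 2.5 million documents per day"

# Assumption: raw usage = logical size x replication factor.
expected_raw_tb = logical_tb * replication
gap_tb = reported_raw_tb - expected_raw_tb

# Assumption: decimal units (1 TB = 1e12 bytes); a rough average only.
avg_kb_per_doc = logical_tb * 1e12 / total_docs / 1e3

# Assumption: constant daily rate; days needed to index as many docs again.
days_to_double = total_docs / docs_per_day

print(f"expected raw size: {expected_raw_tb:.1f} TB, reported: {reported_raw_tb} TB")
print(f"difference: {gap_tb:.1f} TB")
print(f"average index size per document: {avg_kb_per_doc:.1f} KB")
print(f"days to double the document count: {days_to_double:.0f}")
```

The small difference between the expected and the reported raw size is consistent with the rounding of the human-readable `-h` output, although other causes cannot be ruled out from these numbers alone.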

Re: highlight separator

2017-11-22 Thread David Hastings
Thanks, I kind of figured that was the case. On Wed, Nov 22, 2017 at 12:24 PM, Erick Erickson wrote: > I think that's only for the Unified Highlighter, which was introduced > to Lucene in 6.3 and Solr in 6.4. See: SOLR-9708 > > Best, > Erick > > On Wed, Nov 22, 2017 at

Re: Result grouping performance

2017-11-22 Thread Erick Erickson
Have you enabled docValues (and reindexed from scratch) on the field you're grouping on? Best, Erick On Wed, Nov 22, 2017 at 5:13 AM, Kempelen, Ákos wrote: > Hello, > > I am migrating our codebase from Solr 4.7 to 7.0.1 but the performance of > result grouping

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Erick Erickson
In my experience, for relatively static indexes the performance is roughly similar. Once the data is read from whatever data source, it's in memory; where the data came from is (largely) secondary in importance. In cases where there's a lot of I/O I expect HDFS to be slower, this fits Hendrik's

Re: Merging of index in Solr

2017-11-22 Thread Erick Erickson
Really, let's back up here though. This sure seems like an XY problem. You're merging indexes that will eventually be something on the order of 3.5TB. I claim that an index of that size is very difficult to work with effectively. _Why_ do you want to do this? Do you have any evidence that you'll

Re: highlight separator

2017-11-22 Thread Erick Erickson
I think that's only for the Unified Highlighter, which was introduced to Lucene in 6.3 and Solr in 6.4. See: SOLR-9708 Best, Erick On Wed, Nov 22, 2017 at 9:01 AM, David Hastings wrote: > I'm on Solr 5.x at the moment, and am trying to get the highlighter to >

Re: Do i need to reindex after changing similarity setting

2017-11-22 Thread Nawab Zada Asad Iqbal
Thanks Walter On Mon, Nov 20, 2017 at 4:59 PM Walter Underwood wrote: > Similarity is query time. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > > > > On Nov 20, 2017, at 4:57 PM, Nawab Zada Asad Iqbal

Re: Reusable tokenstream

2017-11-22 Thread Emir Arnautović
Hi Roxana, The idea with the update request processor is to have the following parameters: * inputField - document field with text to analyse * sharedAnalysis - field type with shared analysis definition * targetFields - comma-separated list of fields where results should be stored. *

highlight separator

2017-11-22 Thread David Hastings
I'm on Solr 5.x at the moment, and am trying to get the highlighter to display complete sentences containing the match. Setting: 'hl.method' => 'fastVector', 'hl.bs.type' =>'SENTENCE', hasn't worked. Is there a way for me to do it in the query itself? Thanks -Dave
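For readers on Solr 5.x, where `hl.method` is not available, the request can be sketched with the older FastVectorHighlighter parameters instead. This is a minimal, untested sketch: the host, the core name `mycore`, the field name `text`, and the helper function are hypothetical, and it assumes the highlighted field is indexed with term vectors, positions, and offsets. Whether the fragments come out as whole sentences also depends on settings such as the fragment size.

```python
from urllib.parse import urlencode


def sentence_highlight_params(query, field):
    """Build Solr 5.x-style request parameters for sentence-bounded highlights."""
    return {
        "q": query,
        "hl": "true",
        "hl.fl": field,
        # Pre-6.4 switch for the FastVectorHighlighter.
        "hl.useFastVectorHighlighter": "true",
        # hl.bs.type is only consulted by the breakIterator boundary scanner.
        "hl.boundaryScanner": "breakIterator",
        "hl.bs.type": "SENTENCE",
    }


# Hypothetical query, field, host, and core name, for illustration only.
params = sentence_highlight_params("text:solr", "text")
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
print(url)
```

Everything is passed as ordinary request parameters, so no change to solrconfig.xml is needed to try it.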

Re: Reusable tokenstream

2017-11-22 Thread Roxana Danger
Mikhail, Yes, I've just seen your message... "Hello, Roxana. You probably looking for TeeSinkTokenFilter, but I believe the idea is cumbersome to implement in Solr. Also there is a preanalyzed field which can keep tokenstream in external form." This is the answer I was looking for. Thanks a

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Shawn Heisey
On 11/22/2017 6:44 AM, Joe Obernberger wrote: > Right now, we have a relatively small block cache due to the > requirements that the servers run other software.  We tried to find > the best balance between block cache size, and RAM for programs, while > still giving enough for local FS cache. 

Re: Merging of index in Solr

2017-11-22 Thread Shawn Heisey
On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote: > I am using the IndexMergeTool from Solr, from the command below: > > java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar > org.apache.lucene.misc.IndexMergeTool > > The heap size is 32GB. There are more than 20 million documents in the

Re: Trailing wild card searches very slow in Solr

2017-11-22 Thread Shawn Heisey
On 11/20/2017 12:50 PM, Sundeep T wrote: > I initially asked this question regarding leading wildcards. This was a > typo, and what I meant was trailing wild card queries were slow. So queries > like text:'hello*" are slow. We were expecting since the string field is > already indexed, the

Re: How to get a solr core to persist

2017-11-22 Thread Shawn Heisey
On 11/20/2017 6:26 AM, Amanda Shuman wrote: > I did as you suggested and created the core by hand - I copied the files > from the existing core, including the index files (data directory) and > changed the core.properties file to the new core name (core_new) and > restarted. Now I'm having a

Re: NullPointerException in PeerSync.handleUpdates

2017-11-22 Thread Pushkar Raste
As mentioned in the JIRA, the exception seems to be coming from a log statement. The issue was fixed in 6.3; here is the relevant line from 6.3 https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.3.0/solr/core/src/java/org/apache/solr/update/PeerSync.java#L707 On Wed, Nov 22, 2017 at

Re: Reusable tokenstream

2017-11-22 Thread Roxana Danger
Hi Emir, In this case, I need more control at the Lucene level, so I have to use the Lucene index writer directly. So, I cannot use Solr for importing. Or, is there any way I can add a tokenstream to a SolrInputDocument (is there any other class exposed by Solr during indexing that I can use for this

Re: Solr7: Very High number of threads on aggregator node

2017-11-22 Thread Nawab Zada Asad Iqbal
Rick, your suspicion is correct. I mostly reused my config from solr4 except where it was deprecated or obsoleted and I switched to the newer configs: Having said that, I couldn't find any new query-related settings which can impact us, since most of our queries don't use fancy new features. I

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hendrik, Thanks for your response. Regarding "But this seems to greatly depend on how your setup looks like and what actions you perform." May I know what factors influence this, and what considerations should be taken into account? Thanks On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp

Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
Hi Emir, Yes, I am running the merging on a Windows machine. The hard disk is an SSD with the NTFS file system. Regards, Edwin On 22 November 2017 at 16:50, Emir Arnautović wrote: > Hi Edwin, > Quick googling suggests that this is the issue of NTFS related to large

Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Hendrik Haddorp
We did some testing and the performance was strangely even better with HDFS than with the local file system. But this seems to greatly depend on how your setup looks like and what actions you perform. We now had a pattern with lots of small updates and commits and that seems to be quite a

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Kevin Risden
Thanks for the detailed answers Joe. Definitely sounds like you covered most of the easy HDFS performance items. Kevin Risden On Wed, Nov 22, 2017 at 7:44 AM, Joe Obernberger < joseph.obernber...@gmail.com> wrote: > Hi Kevin - > * HDFS is part of Cloudera 5.12.0. > * Solr is co-located in most

Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hi, Good Afternoon!! While the discussion around issues related to "Solr on HDFS" is live, I would like to understand if anyone has done any performance benchmarking for both Solr indexing and search between HDFS vs local file system. Also, from experience, what would the community folks

Result grouping performance

2017-11-22 Thread Kempelen , Ákos
Hello, I am migrating our codebase from Solr 4.7 to 7.0.1 but the performance of result grouping seems very poor using the newer Solr. For example a simple MatchAllDocsQuery takes 5 sec on Solr4.7, and 21 sec on Solr7. I wonder what causes the x4 difference in time? We hoped that newer Solr

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger
Hi Kevin - * HDFS is part of Cloudera 5.12.0. * Solr is co-located in most cases.  We do have several nodes that run on servers that are not data nodes, but most do. Unfortunately, our nodes are not the same size.  Some nodes have 8TBytes of disk, while our largest nodes are 64TBytes.  This

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Kevin Risden
Joe, I have a few questions about your Solr and HDFS setup that could help improve the recovery performance. * Is HDFS part of a distribution from Hortonworks, Cloudera, etc? * Is Solr colocated with HDFS data nodes? * What is the output of "ps aux | grep solr"? (specifically looking for the

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger
Hi Hendrik - I was halting a replica and then restarting it, waited, then restarted another one.  That didn't work, but when I halted all three, and then restarted those one by one, the shard finally elected a leader and came up.  Phew!  I too noticed the lock files in index. folders. 

Re: Reusable tokenstream

2017-11-22 Thread Mikhail Khludnev
Roxana, Have you seen my response in "tokenstream reusable" thread? reusableTokenStream(java.lang.String , doesn't help you. TokenStream is stateless, it holds the

Re: Reusable tokenstream

2017-11-22 Thread Emir Arnautović
Hi Roxana, I think you can use https://lucene.apache.org/core/5_4_0/analyzers-common/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html like suggested earlier. HTH, Emir --

Re: Reusable tokenstream

2017-11-22 Thread Roxana Danger
Hi Emir, Many thanks for your reply. The UpdateProcessor can do this work, but is analyzer.reusableTokenStream the way to obtain a previously generated

Re: Reusable tokenstream

2017-11-22 Thread Emir Arnautović
Hi Roxana, I don’t think that it is possible. In some cases (seems like yours is a good fit) you could create a custom update request processor that would do the shared analysis (you can have it defined in schema) and after analysis use those tokens to create new values for those two fields and

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Hendrik Haddorp
Hi Joe, sorry, I have not seen that problem. I would normally not delete a replica if the shard is down but only if there is an active shard. Without an active leader the replica should not be able to recover. I also just had a case where all replicas of a shard stayed in down state and

Reusable tokenstream

2017-11-22 Thread Roxana Danger
Hello all, I would like to reuse the tokenstream generated for one field, to create a new tokenstream (adding a few filters to the available tokenstream), for another field without the need of executing again the whole analysis. The particular application is: - I have field *tokens* that uses an

Re: LTR training

2017-11-22 Thread ilayaraja
Thanks, Diego, for the pointers. Will check. --Ilay

Re: Merging of index in Solr

2017-11-22 Thread Emir Arnautović
Hi Edwin, Quick googling suggests that this is the issue of NTFS related to large number of file fragments caused by large number of files in one directory of huge files. Are you running this merging on a Windows machine? Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr &