Re: slow indexing when keys are various
Hi all, We have changed all the Solr configs and commit parameters that were mentioned by Shawn, but still: when inserting the same 300 documents from 20 threads we see no latency, while inserting 300 different docs from 20 threads is very slow, and no CPU/RAM/disk/network metrics are high. I am wondering if the problem might be related to the fact that when inserting 300 different docs from each thread, the key is the only field that varies whilst the other fields are identical. So maybe many identical values in the other fields across different keys cause the latency? As for latency related to doc routing, I don't see where it can affect us. Could ZooKeeper be the bottleneck? Thanks! Gilad -- View this message in context: http://lucene.472066.n3.nabble.com/slow-indexing-when-keys-are-verious-tp4327681p4329451.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Using BasicAuth with SolrJ Code
can u paste the stacktrace here On Tue, Apr 11, 2017 at 1:19 PM, Zheng Lin Edwin Yeo wrote: > I found from StackOverflow that we should declare it this way: > http://stackoverflow.com/questions/43335419/using-basicauth-with-solrj-code > > > SolrRequest req = new QueryRequest(new SolrQuery("*:*"));//create a new > request object > req.setBasicAuthCredentials(userName, password); > solrClient.request(req); > > Is that correct? > > For this, the NullPointerException is not coming out, but the SolrJ is > still not able to get authenticated. I'm still getting Error Code 401 even > after putting in this code. > > Any advice on which part of the SolrJ code should we place this code in? > > Regards, > Edwin > > > On 10 April 2017 at 23:50, Zheng Lin Edwin Yeo wrote: > >> Hi, >> >> I have just set up the Basic Authentication Plugin in Solr 6.4.2 on >> SolrCloud, and I am trying to modify my SolrJ code so that the code can go >> through the authentication and do the indexing. >> >> I tried using the following code from the Solr Documentation >> https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+ >> Plugin. >> >> SolrRequest req ;//create a new request object >> req.setBasicAuthCredentials(userName, password); >> solrClient.request(req); >> >> However, the code complains that the req is not initialized. >> >> If I initialized it, it will be initialize as null. >> >> SolrRequest req = null;//create a new request object >> req.setBasicAuthCredentials(userName, password); >> solrClient.request(req); >> >> This will caused a null pointer exception. >> Exception in thread "main" java.lang.NullPointerException >> >> How should we go about putting these codes, so that the error can be >> prevented? >> >> Regards, >> Edwin >> >> -- - Noble Paul
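Edwin's per-request approach from the StackOverflow answer is the documented pattern: req.setBasicAuthCredentials(...) makes SolrJ attach an HTTP Basic Authorization header to that one request. If a 401 still comes back, it is worth checking what that header should contain. A minimal, self-contained sketch of how the header value is built (the solr/SolrRocks pair is the example credential from the Solr security docs, not Edwin's real credentials):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {
    // Build the value that ends up in the Authorization header for Basic auth:
    // "Basic " + base64(user + ":" + password)
    static String basicAuthHeader(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // Example credentials from the Solr Basic Authentication Plugin docs.
        System.out.println(basicAuthHeader("solr", "SolrRocks"));
    }
}
```

If the header SolrJ would send does not match what security.json expects (wrong username, wrong password, or credentials set on a different client object than the one executing the request), the 401 persists no matter where in the indexing code the snippet is placed.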
Re: Solr 6.4. Can't index MS Visio vsdx files
When will 1.15 be released? Maybe you have some beta version I could test :) SAX sounds interesting, and from the info I found on Google it could solve my issues. On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. wrote: > It depends. We've been trying to make parsers more, erm, flexible, but > there are some problems from which we cannot recover. > > Tl;dr there isn't a short answer. :( > > My sense is that DIH/ExtractingDocumentHandler is intended to get people > up and running with Solr easily but it is not really a great idea for > production. See Erick's gem: https://lucidworks.com/2012/ > 02/14/indexing-with-solrj/ > > As for the Tika portion... at the very least, Tika _shouldn't_ cause the > ingesting process to crash. At most, it should fail at the file level and > not cause greater havoc. In practice, if you're processing millions of > files from the wild, you'll run into bad behavior and need to defend > against permanent hangs, oom, memory leaks. > > Also, at the least, if there's an exception with an embedded file, Tika > should catch it and keep going with the rest of the file. If this doesn't > happen let us know! We are aware that some types of embedded file stream > problems were causing parse failures on the entire file, and we now catch > those in Tika 1.15-SNAPSHOT and don't let them percolate up through the > parent file (they're reported in the metadata though). > > Specifically for your stack traces: > > For your initial problem with the missing class exceptions -- I thought we > used to catch those in docx and log them. I haven't been able to track > this down, though. I can look more if you have a need. > > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name > 'PolylineTo' ", this problem might go away if we implemented a pure SAX > parser for vsdx. We just did this for docx and pptx (coming in 1.15) and > these are more robust to variation because they aren't requiring a match > with the ooxml schema. 
I haven't looked much at vsdx, but that _might_ > help. > > For "TODO Support v5 Pointers", this isn't supported and would require > contributions. However, I agree that POI shouldn't throw a Runtime > exception. Perhaps open an issue in POI, or maybe we should catch this > special example at the Tika level? > > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team > _might_ be able to modify the parser to ignore a stream if there's an > exception, but that's often a sign that something needs to be fixed with > the parser. In short, the solution will come from POI. > > Best, > > Tim > > -Original Message- > From: Gytis Mikuciunas [mailto:gyt...@gmail.com] > Sent: Tuesday, April 11, 2017 1:56 PM > To: solr-user@lucene.apache.org > Subject: RE: Solr 6.4. Can't index MS Visio vsdx files > > Thanks for your responses. > Are there any posibilities to ignore parsing errors and continue indexing? > because now solr/tika stops parsing whole document if it finds any > exception > > On Apr 11, 2017 19:51, "Allison, Timothy B." wrote: > > > You might want to drop a note to the dev or user's list on Apache POI. > > > > I'm not extremely familiar with the vsd(x) portion of our code base. > > > > The first item ("PolylineTo") may be caused by a mismatch btwn your > > doc and the ooxml spec. > > > > The second item appears to be an unsupported feature. > > > > The third item may be an area for improvement within our codebase...I > > can't tell just from the stacktrace. > > > > You'll probably get more helpful answers over on POI. Sorry, I can't > > help with this... > > > > Best, > > > >Tim > > > > P.S. > > > 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar > > > > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set > > of poi-ooxml-schemas-3.15.jar > > > > > > >
NonRepeatableRequestException Error during indexing after setting up Basic Authentication
Hi, I'm getting an error with indexing using SolrJ after setting up the Basic Authentication with the following code. Credentials defaultcreds = new UsernamePasswordCredentials("id", "password"); appendAuthentication(defaultcreds, "BASIC", solr); private static void appendAuthentication(Credentials credentials, String authPolicy, SolrClient solrClient) { // if (isHttpSolrClient(solrClient)) { HttpSolrClient httpSolrClient = (HttpSolrClient) solrClient; // if (credentials != null && StringUtils.isNotBlank(authPolicy) // && assertHttpClientInstance(httpSolrClient.getHttpClient())) { AbstractHttpClient httpClient = (AbstractHttpClient) httpSolrClient.getHttpClient(); httpClient.getCredentialsProvider().setCredentials(new AuthScope(AuthScope.ANY), credentials); httpClient.getParams().setParameter(AuthPNames.TARGET_AUTH_PREF, Arrays.asList(authPolicy)); // } // } } This is the error message that I got. org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8983/edm/chinaSapSo at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:624) at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279) at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268) at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149) at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106) at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71) at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85) at testing.indexing(testing.java:2848) at testing.main(testing.java:265) Caused by: org.apache.http.client.ClientProtocolException at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:839) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515) ... 8 more Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:662) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835) ... 11 more The error occurs at the part where I do the adding of the index to Solr solr.add(batch); This is what is defined for "solr". static SolrClient solr; solr = new HttpSolrClient( SOLR_URL ); What could be the reason for this? Is there anything wrong with my code? I'm using SolrCloud on Solr 6.4.2. Regards, Edwin
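The NonRepeatableRequestException is itself a strong hint: the add request's POST body is a streamed, non-repeatable entity. HttpClient sends it once, receives the 401 challenge, authenticates, and then cannot resend the already-consumed body. A small stdlib sketch of the underlying problem (the XML payload is purely illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class NonRepeatableDemo {
    // Read a stream to the end, as HttpClient does on the first (401'd) attempt.
    static int drain(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    public static void main(String[] args) throws IOException {
        InputStream entity = new ByteArrayInputStream("<add>...</add>".getBytes());
        System.out.println(drain(entity)); // first attempt consumes the whole body
        System.out.println(drain(entity)); // the retry sees nothing left: 0 bytes
    }
}
```

Preemptive authentication (sending the Authorization header on the first attempt, which is what SolrJ's per-request setBasicAuthCredentials does) avoids the challenge round-trip entirely, so the entity never needs to be replayed.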
Re: Long GC pauses while reading Solr docs using Cursor approach
JVM version? We’re running v8 update 121 with the G1 collector and it is working really well. We also have an 8GB heap. Graph your heap usage. You’ll see a sawtooth shape, where it grows, then there is a major GC. The maximum of the base of the sawtooth is the working set of heap that your Solr installation needs. Set the heap to that value, plus a gigabyte or so. We run with a 2GB eden (new space) because so much of Solr’s allocations have a lifetime of one request. So, the base of the sawtooth, plus a gigabyte breathing room, plus two more for eden. That should work. I don’t set all the ratios and stuff. When we were running CMS, I set a size for the heap and a size for the new space. Done. With G1, I don’t even get that fussy. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 11, 2017, at 8:22 PM, Shawn Heisey wrote: > > On 4/11/2017 2:56 PM, Chetas Joshi wrote: >> I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Sold collection >> with number of shards = 80 and replication Factor=2 >> >> Sold JVM heap size = 20 GB >> solr.hdfs.blockcache.enabled = true >> solr.hdfs.blockcache.direct.memory.allocation = true >> MaxDirectMemorySize = 25 GB >> >> I am querying a solr collection with index size = 500 MB per core. > > I see that you and I have traded messages before on the list. > > How much total system memory is there per server? How many of these > 500MB cores are on each server? How many docs are in a 500MB core? The > answers to these questions may affect the other advice that I give you. > >> The off-heap (25 GB) is huge so that it can load the entire index. > > I still know very little about how HDFS handles caching and memory. You > want to be sure that as much data as possible from your indexes is > sitting in local memory on the server. > >> Using cursor approach (number of rows = 100K), I read 2 fields (Total 40 >> bytes per solr doc) from the Solr docs that satisfy the query. 
The docs are >> sorted by "id" and then by those 2 fields. >> >> I am not able to understand why the heap memory is getting full and Full >> GCs are consecutively running with long GC pauses (> 30 seconds). I am >> using CMS GC. > > A 20GB heap is quite large. Do you actually need it to be that large? > If you graph JVM heap usage over a long period of time, what are the low > points in the graph? > > A result containing 100K docs is going to be pretty large, even with a > limited number of fields. It is likely to be several megabytes. It > will need to be entirely built in the heap memory before it is sent to > the client -- both as Lucene data structures (which will probably be > much larger than the actual response due to Java overhead) and as the > actual response format. Then it will be garbage as soon as the response > is done. Repeat this enough times, and you're going to go through even > a 20GB heap pretty fast, and need a full GC. Full GCs on a 20GB heap > are slow. > > You could try switching to G1, as long as you realize that you're going > against advice from Lucene experts but honestly, I do not expect > this to really help, because you would probably still need full GCs due > to the rate that garbage is being created. If you do try it, I would > strongly recommend the latest Java 8, either Oracle or OpenJDK. Here's > my wiki page where I discuss this: > > https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector > > Reducing the heap size (which may not be possible -- need to know the > answer to the question about memory graphing) and reducing the number of > rows per query are the only quick solutions I can think of. > > Thanks, > Shawn >
Re: Long GC pauses while reading Solr docs using Cursor approach
On 4/11/2017 2:56 PM, Chetas Joshi wrote: > I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Sold collection > with number of shards = 80 and replication Factor=2 > > Sold JVM heap size = 20 GB > solr.hdfs.blockcache.enabled = true > solr.hdfs.blockcache.direct.memory.allocation = true > MaxDirectMemorySize = 25 GB > > I am querying a solr collection with index size = 500 MB per core. I see that you and I have traded messages before on the list. How much total system memory is there per server? How many of these 500MB cores are on each server? How many docs are in a 500MB core? The answers to these questions may affect the other advice that I give you. > The off-heap (25 GB) is huge so that it can load the entire index. I still know very little about how HDFS handles caching and memory. You want to be sure that as much data as possible from your indexes is sitting in local memory on the server. > Using cursor approach (number of rows = 100K), I read 2 fields (Total 40 > bytes per solr doc) from the Solr docs that satisfy the query. The docs are > sorted by "id" and then by those 2 fields. > > I am not able to understand why the heap memory is getting full and Full > GCs are consecutively running with long GC pauses (> 30 seconds). I am > using CMS GC. A 20GB heap is quite large. Do you actually need it to be that large? If you graph JVM heap usage over a long period of time, what are the low points in the graph? A result containing 100K docs is going to be pretty large, even with a limited number of fields. It is likely to be several megabytes. It will need to be entirely built in the heap memory before it is sent to the client -- both as Lucene data structures (which will probably be much larger than the actual response due to Java overhead) and as the actual response format. Then it will be garbage as soon as the response is done. Repeat this enough times, and you're going to go through even a 20GB heap pretty fast, and need a full GC. 
Full GCs on a 20GB heap are slow. You could try switching to G1, as long as you realize that you're going against advice from Lucene experts but honestly, I do not expect this to really help, because you would probably still need full GCs due to the rate that garbage is being created. If you do try it, I would strongly recommend the latest Java 8, either Oracle or OpenJDK. Here's my wiki page where I discuss this: https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector Reducing the heap size (which may not be possible -- need to know the answer to the question about memory graphing) and reducing the number of rows per query are the only quick solutions I can think of. Thanks, Shawn
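For anyone who does try G1 despite Shawn's caveat, the change is confined to the GC options Solr is started with. A sketch of a solr.in.sh fragment, assuming Java 8; the sizes are illustrative, not a recommendation for this particular 20 GB/HDFS setup:

```shell
# Illustrative solr.in.sh fragment: fixed heap size plus G1 instead of CMS.
SOLR_HEAP="8g"
GC_TUNE="-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:MaxGCPauseMillis=250"
```

Note that reducing the number of rows per cursor page (e.g. from 100K to 10K) attacks the garbage-creation rate directly and needs no JVM changes at all.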
Re: Deleting a field in schema.xml, reindex needed?
On 4/11/2017 2:19 PM, Scruggs, Matt wrote: > I’m updating our schema.xml file with 1 change: deleting a field. > > Do I need to re-index all of my documents in Solr, or can I simply reload my > collection config by calling: > > http://mysolrhost:8000/solr/admin/collections?action=RELOAD&name=mycollection Deleting a field won't require a reindex, but any data in that field will remain in your index until you do. This probably can affect the performance of the index, but unless you're running with insufficient resources, you may not even notice. Thanks, Shawn
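If the collection uses the managed schema rather than a hand-edited schema.xml, the same deletion can be done through the Schema API, which reloads the core for you. A hedged sketch of the request body for a POST to /solr/mycollection/schema (the collection and field names are placeholders):

```json
{ "delete-field": { "name": "obsolete_field" } }
```

Either way, Shawn's point stands: the stored/indexed data for the field remains in the segments until those documents are reindexed and the old segments are merged away.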
Re: Expiry of Basic Authentication Plugin
Hi Jordi, Thanks for the advice. Regards, Edwin On 11 April 2017 at 18:27, Jordi Domingo Borràs wrote: > Browsers retain basic auth information. You have to close it or clean > browsing history. You can also change the user password at server side. > > Best > > On Tue, Apr 11, 2017 at 7:18 AM, Zheng Lin Edwin Yeo > > wrote: > > > Anyone has any idea if the authentication will expired automatically? > Mine > > has already been authenticated for more than 20 hours, and it has not > auto > > logged out yet. > > > > Regards, > > Edwin > > > > On 11 April 2017 at 00:19, Zheng Lin Edwin Yeo > > wrote: > > > > > Hi, > > > > > > Would like to check, after I have entered the authentication to access > > > Solr with Basic Authentication Plugin, will the authentication be > expired > > > automatically after a period of time? > > > > > > I'm using SolrCloud on Solr 6.4.2 > > > > > > Regards, > > > Edwin > > > > > >
Re: Deleting a field in schema.xml, reindex needed?
When I have done this, it is in multiple steps. 1. Change the indexing so that no data is going to that field. 2. Reindex, so the field is empty. 3. Remove the field from the schema. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 11, 2017, at 3:10 PM, Markus Jelsma wrote: > > Hi - We did this on one occasion and Solr started complaining in the logs > about a field that is present but not defined. We thought the problem would > go away within 30 days - the time within every document is reindexed or > deleted - but it did not, for some reason. Forcing a merge did not solve the > warnings, although i thought it should. > > So we decided to delete everything, reindex in a standby index and get that > one back online. But don't worry, it are just warnings, everything worked > well and nothing failed. Maybe our 30 days reindexing strategy failed at that > point, or i didn't wait the for exactly 30 days. > > Regards, > Markus > > > > -Original message- >> From:Scruggs, Matt >> Sent: Tuesday 11th April 2017 23:59 >> To: solr-user@lucene.apache.org >> Subject: Deleting a field in schema.xml, reindex needed? >> >> I’m updating our schema.xml file with 1 change: deleting a field. >> >> Do I need to re-index all of my documents in Solr, or can I simply reload my >> collection config by calling: >> >> http://mysolrhost:8000/solr/admin/collections?action=RELOAD&name=mycollection >> >> >> Thanks, >> Matt >> >>
RE: Deleting a field in schema.xml, reindex needed?
Hi - We did this on one occasion and Solr started complaining in the logs about a field that is present but not defined. We thought the problem would go away within 30 days - the time within which every document is reindexed or deleted - but it did not, for some reason. Forcing a merge did not solve the warnings, although I thought it should. So we decided to delete everything, reindex in a standby index and get that one back online. But don't worry, they are just warnings, everything worked well and nothing failed. Maybe our 30-day reindexing strategy failed at that point, or I didn't wait for exactly 30 days. Regards, Markus -Original message- > From:Scruggs, Matt > Sent: Tuesday 11th April 2017 23:59 > To: solr-user@lucene.apache.org > Subject: Deleting a field in schema.xml, reindex needed? > > I’m updating our schema.xml file with 1 change: deleting a field. > > Do I need to re-index all of my documents in Solr, or can I simply reload my > collection config by calling: > > http://mysolrhost:8000/solr/admin/collections?action=RELOAD&name=mycollection > > > Thanks, > Matt > >
Deleting a field in schema.xml, reindex needed?
I’m updating our schema.xml file with 1 change: deleting a field. Do I need to re-index all of my documents in Solr, or can I simply reload my collection config by calling: http://mysolrhost:8000/solr/admin/collections?action=RELOAD&name=mycollection Thanks, Matt
RE: CommonGrams
Hi - I cannot think of any real drawback right away. But you probably can expect a slightly differently ordered MLT response. It should not be a problem if you select enough terms for MLT lookup. Regards, Markus -Original message- > From:David Hastings > Sent: Tuesday 11th April 2017 22:18 > To: solr-user@lucene.apache.org > Subject: CommonGrams > > Hi, was wondering if there are any known drawbacks to using the CommonGram > factory, in regards to such features as the "more like this" >
Re: SolrJ appears to have problems with Docker Toolbox
On 4/8/2017 6:42 PM, Mike Thomsen wrote: > I'm running two nodes of SolrCloud in Docker on Windows using Docker > Toolbox. The problem I am having is that Docker Toolbox runs inside of a > VM and so it has an internal network inside the VM that is not accessible > to the Docker Toolbox VM's host OS. If I go to the VM's IP which is > 192.168.99.100, I can load the admin UI and do basic operations that are > written to go against that IP and port (like querying, schema editor, > manually adding documents, etc.) > > However, when I try to run code that uses SolrJ to add documents, it fails > because the ZK configuration has the IPs for the internal Docker network > which is 172.X.Y..Z. If I log into the toolbox VM and run the Java code > from there, it works just fine. From the host OS, doesn't. SolrCloud and the CloudSolrClient class from SolrJ will have issues if each instance registers with zookeeper using addresses that are not reachable from other Solr instances AND from clients. In situations where there are both external and internal addresses, each SolrCloud node must be configured to register with Zookeeper using the external address or name, and the networking must be set up so clients and other Solr instances can communicate using that address. See the "host" parameter here: https://cwiki.apache.org/confluence/display/solr/Parameter+Reference If you are also translating TCP ports, you probably need to define the hostPort parameter as well as the host parameter. By default, SolrCloud does the best it can to detect the address and port it registers in Zookeeper. When translation is involved or the machine has more than one NIC, that often results in incorrect information in Zookeeper. If you change the registration information for existing nodes in an existing cloud, you may find yourself in a situation where you need to manually edit the zookeeper database to remove information about the incorrect addresses that were registered before. 
If you can do so, setting up a new cloud from scratch with a fresh ZK ensemble (or a different chroot within the existing ensemble) may be the best plan. Thanks, Shawn
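Concretely, Shawn's host/hostPort advice maps to two variables in solr.in.sh (or -Dhost=/-Djetty.port= on the command line). A sketch for the Docker Toolbox case, using the VM address from Mike's mail; the port value assumes a 1:1 port mapping and is a placeholder:

```shell
# Make each node register the externally reachable address in ZooKeeper,
# instead of the 172.x.y.z address of the container-internal network.
SOLR_HOST="192.168.99.100"
SOLR_PORT="8983"
```

With this set, both the other Solr nodes and external SolrJ clients resolve the node through an address they can actually reach.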
Re: Dynamic schema memory consumption
Here is a small snippet that I copy-pasted from Shawn Heisey (who is a core contributor, I think): > One thing to note: SolrCloud begins to have performance issues when the > number of collections in the cloud reaches the low hundreds. It's not > going to scale very well with a collection per user or per mailbox > unless there aren't very many users. There are people looking into how > to scale better, but this hasn't really gone anywhere yet. Here's one > issue about it, with a lot of very dense comments: > > https://issues.apache.org/jira/browse/SOLR-7191 On Tue, Apr 11, 2017 at 9:11 PM, Dorian Hoxha wrote: > And this overhead depends on what? I mean, if I create an empty collection >> will it take up much heap size just for "being there" ? > > Yes. You can search on elasticsearch/solr/lucene mailing lists and see > that it's true. But nobody has `empty` collections, so yours will have a > schema and some data/segments and translog. > > On Tue, Apr 11, 2017 at 7:41 PM, jpereira wrote: > >> The way the data is spread across the cluster is not really uniform. Most >> of >> shards have way lower than 50GB; I would say about 15% of the total shards >> have more than 50GB. >> >> >> Dorian Hoxha wrote >> > Each shard is a lucene index which has a lot of overhead. >> >> And this overhead depends on what? I mean, if I create an empty collection >> will it take up much heap size just for "being there" ? >> >> >> Dorian Hoxha wrote >> > I don't know about static/dynamic memory-issue though. >> >> I could not find anything related in the docs or the mailing list either, >> but I'm still not ready to discard this suspicion... >> >> Again, thx for your time >> > >
Long GC pauses while reading Solr docs using Cursor approach
Hello, I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Solr collection with number of shards = 80 and replicationFactor=2 Solr JVM heap size = 20 GB solr.hdfs.blockcache.enabled = true solr.hdfs.blockcache.direct.memory.allocation = true MaxDirectMemorySize = 25 GB I am querying a solr collection with index size = 500 MB per core. The off-heap (25 GB) is huge so that it can load the entire index. Using cursor approach (number of rows = 100K), I read 2 fields (Total 40 bytes per solr doc) from the Solr docs that satisfy the query. The docs are sorted by "id" and then by those 2 fields. I am not able to understand why the heap memory is getting full and Full GCs are consecutively running with long GC pauses (> 30 seconds). I am using CMS GC. -XX:NewRatio=3 \ -XX:SurvivorRatio=4 \ -XX:TargetSurvivorRatio=90 \ -XX:MaxTenuringThreshold=8 \ -XX:+UseConcMarkSweepGC \ -XX:+UseParNewGC \ -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \ -XX:+CMSScavengeBeforeRemark \ -XX:PretenureSizeThreshold=64m \ -XX:+UseCMSInitiatingOccupancyOnly \ -XX:CMSInitiatingOccupancyFraction=50 \ -XX:CMSMaxAbortablePrecleanTime=6000 \ -XX:+CMSParallelRemarkEnabled \ -XX:+ParallelRefProcEnabled" Please guide me in debugging the heap usage issue. Thanks!
Re: simple matches not catching at query time
John, Here I mean a query that matches a doc which is expected to be matched by the problem query. https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-TheexplainOtherParameter On Tue, Apr 11, 2017 at 11:32 PM, John Blythe wrote: > first off, i don't think i have a full handle on the import of what is > outputted by the debugger. > > that said, if "...PhraseQuery(manufacturer_split_syn:\"vendor vendor\")" > is > matching against `vendor_coolmed | coolmed | vendor`, then 'vendor' should > match. the query analyzer is keywordtokenizer, pattern replacement > (replaces all non-alphanumeric with underscores), checks for synonyms (the > underscores are my way around the multi term synonym problem), then > worddelimiter is used to blow out the underscores and generate word parts > ("vendor_vendor" => 'vendor' 'vendor'), stop filter, lower case, stem. > > in your mentioned strategy, what is the "id:" representative of? > > thanks! > > -- > *John Blythe* > Product Manager & Lead Developer > > 251.605.3071 | j...@curvolabs.com > www.curvolabs.com > > 58 Adams Ave > Evansville, IN 47713 > > On Tue, Apr 11, 2017 at 4:12 PM, Mikhail Khludnev wrote: > > > John, > > > > How do you suppose to match any of "parsed_filter_queries":[" > > MultiPhraseQuery(manufacturer_syn_both:\"(vendor_vendor_us vendor) > > vendor\")", "PhraseQuery(manufacturer_split_syn:\"vendor vendor\")" > > against > > vendor_coolmed | coolmed | vendor ? > > > > I just can't see any chance to match them. > > > > One possible strategy is pick the simplest filter query, put it as a main > > query. > > Then pass &explainOther=id: and share the explanation. > > > > > > > > On Tue, Apr 11, 2017 at 8:57 PM, John Blythe wrote: > > > > > hi, erick. > > > > > > appreciate the feedback. > > > > > > 1> i'm sending the terms to solr enquoted > > > 2> i'd thought that at one point and reran the indexing. 
i _had_ had > two > > of > > > the fields not indexed, but this represented one pass (same analyzer) > > from > > > two diff source fields while 2 or 3 of the other 4 fields _were_ > seeming > > as > > > if they should match. maybe just need to do this for said sanity at > this > > > point lol > > > 3> i'm using dismax, no mm param set > > > > > > some further context: > > > > > > i'm querying something like this: ...fq=manufacturer:("VENDOR:VENDOR > > US") > > > OR manufacturer_syn:("VENDOR:VENDOR US")... > > > > > > The indexed value is: "Vendor" > > > > > > the output of field 1 in the Analysis tab would be: > > > *index*: vendor_coolmed | coolmed | vendor > > > *query*: vendor_vendor_coolmed | vendor | vendor > > > > > > the other field (and a couple other, related ones, actually) have > similar > > > situations where I see a clear match (as well as get the confirmation > of > > it > > > when switching to the old UI and seeing the highlighting) yet get no > > > results in my actual query. > > > > > > a further note. when i get the query debugging enabled I can see this > in > > > the output: > > > "filter_queries":["manufacturer_syn_both:\"Vendor:Vendor US\"", > > > "manufacturer_split_syn:(\"Vendor:Vendor US\")"], > > > "parsed_filter_queries":[" > > > MultiPhraseQuery(manufacturer_syn_both:\"(vendor_vendor_us vendor) > > > vendor\")", "PhraseQuery(manufacturer_split_syn:\"vendor > vendor\")"],... > > > > > > It looks as if the parsed query is wrapped in quotes even after having > > been > > > parsed, so while the correct tokens, i.e. "vendor", are present to > match > > > against the indexed value, the fact that the entire parsed derivative > of > > > the initial query is sent to match (if that's indeed what's happening) > > > won't actually get any hits. Yet if I remove the quotes when sending > over > > > to query then the parsing doesn't get to a point of having any > > > worthwhile/matching tokens to begin with. 
> > > > > > one last thing: i've attempted with just "vendor" being sent over to > help > > > remove complexity and, once more, i see Analysis chain functioning just > > > fine but the query itself getting 0 hits. > > > > > > think TermComponents is the best option at this point or something else > > > given the above filler info? > > > > > > -- > > > *John Blythe* > > > Product Manager & Lead Developer > > > > > > 251.605.3071 | j...@curvolabs.com > > > www.curvolabs.com > > > > > > 58 Adams Ave > > > Evansville, IN 47713 > > > > > > On Tue, Apr 11, 2017 at 1:20 PM, Erick Erickson < > erickerick...@gmail.com > > > > > > wrote: > > > > > > > &debug=query is your friend. There are several issues that often trip > > > > people up: > > > > > > > > 1> The analysis tab pre-supposes that what you put in the boxes gets > > > > all the way to the field in question. Trivial example: > > > > I put (without quotes) "erick erickson" in the "name" field in the > > > > analysis page and see that it gets tokenized correctly. But the query > > > > "name:erick erickson" actually gets parsed at a higher level into >
Re: simple matches not catching at query time
first off, i don't think i have a full handle on the import of what is outputted by the debugger. that said, if "...PhraseQuery(manufacturer_split_syn:\"vendor vendor\")" is matching against `vendor_coolmed | coolmed | vendor`, then 'vendor' should match. the query analyzer is keywordtokenizer, pattern replacement (replaces all non-alphanumeric with underscores), checks for synonyms (the underscores are my way around the multi term synonym problem), then worddelimiter is used to blow out the underscores and generate word parts ("vendor_vendor" => 'vendor' 'vendor'), stop filter, lower case, stem. in your mentioned strategy, what is the "id:" representative of? thanks! -- *John Blythe* Product Manager & Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Tue, Apr 11, 2017 at 4:12 PM, Mikhail Khludnev wrote: > John, > > How do you suppose to match any of "parsed_filter_queries":[" > MultiPhraseQuery(manufacturer_syn_both:\"(vendor_vendor_us vendor) > vendor\")", "PhraseQuery(manufacturer_split_syn:\"vendor vendor\")" > against > vendor_coolmed | coolmed | vendor ? > > I just can't see any chance to match them. > > One possible strategy is pick the simplest filter query, put it as a main > query. > Then pass &expainOther=id: and share the explanation. > > > > On Tue, Apr 11, 2017 at 8:57 PM, John Blythe wrote: > > > hi, erick. > > > > appreciate the feedback. > > > > 1> i'm sending the terms to solr enquoted > > 2> i'd thought that at one point and reran the indexing. i _had_ had two > of > > the fields not indexed, but this represented one pass (same analyzer) > from > > two diff source fields while 2 or 3 of the other 4 fields _were_ seeming > as > > if they should match. 
maybe just need to do this for said sanity at this > > point lol > > 3> i'm using dismax, no mm param set > > > > some further context: > > > > i'm querying something like this: ...fq=manufacturer:("VENDOR:VENDOR > US") > > OR manufacturer_syn:("VENDOR:VENDOR US")... > > > > The indexed value is: "Vendor" > > > > the output of field 1 in the Analysis tab would be: > > *index*: vendor_coolmed | coolmed | vendor > > *query*: vendor_vendor_coolmed | vendor | vendor > > > > the other field (and a couple other, related ones, actually) have similar > > situations where I see a clear match (as well as get the confirmation of > it > > when switching to the old UI and seeing the highlighting) yet get no > > results in my actual query. > > > > a further note. when i get the query debugging enabled I can see this in > > the output: > > "filter_queries":["manufacturer_syn_both:\"Vendor:Vendor US\"", > > "manufacturer_split_syn:(\"Vendor:Vendor US\")"], > > "parsed_filter_queries":[" > > MultiPhraseQuery(manufacturer_syn_both:\"(vendor_vendor_us vendor) > > vendor\")", "PhraseQuery(manufacturer_split_syn:\"vendor vendor\")"],... > > > > It looks as if the parsed query is wrapped in quotes even after having > been > > parsed, so while the correct tokens, i.e. "vendor", are present to match > > against the indexed value, the fact that the entire parsed derivative of > > the initial query is sent to match (if that's indeed what's happening) > > won't actually get any hits. Yet if I remove the quotes when sending over > > to query then the parsing doesn't get to a point of having any > > worthwhile/matching tokens to begin with. > > > > one last thing: i've attempted with just "vendor" being sent over to help > > remove complexity and, once more, i see Analysis chain functioning just > > fine but the query itself getting 0 hits. > > > > think TermComponents is the best option at this point or something else > > given the above filler info? 
> > > > -- > > *John Blythe* > > Product Manager & Lead Developer > > > > 251.605.3071 | j...@curvolabs.com > > www.curvolabs.com > > > > 58 Adams Ave > > Evansville, IN 47713 > > > > On Tue, Apr 11, 2017 at 1:20 PM, Erick Erickson > > > wrote: > > > > > &debug=query is your friend. There are several issues that often trip > > > people up: > > > > > > 1> The analysis tab pre-supposes that what you put in the boxes gets > > > all the way to the field in question. Trivial example: > > > I put (without quotes) "erick erickson" in the "name" field in the > > > analysis page and see that it gets tokenized correctly. But the query > > > "name:erick erickson" actually gets parsed at a higher level into > > > name:erick default_search_field:erickson. See the discussion at: > > > SOLR-9185 > > > > > > 2> what you think is in your indexed field isn't really. Can happen if > > > you changed your analysis chain but didn't totally re-index. Can > > > happen because one of the parts of the analysis chain works > > > differently than you expect (WordDelimiterFilterFactory, for instance, > > > has a ton of options that can alter the tokens emitted). The > > > TermsComponent will let you examine the terms actually _in_ the index > > > that you search on. You stated that the
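For reference, the query-side chain John describes would look roughly like this in schema.xml. The factory class names are standard Solr ones, but the pattern, resource file names, and WordDelimiter options here are illustrative guesses, not his actual config:

```xml
<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <!-- collapse non-alphanumerics to underscores before synonym lookup -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="[^a-zA-Z0-9]" replacement="_" replace="all"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  <!-- blow the underscores back out into word parts -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```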
CommonGrams
Hi, I was wondering if there are any known drawbacks to using the CommonGrams filter factory with regard to features such as "more like this".
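For context, CommonGrams is typically wired into an analyzer like this (the field type name and word list file are illustrative; the filter class names are the real Solr ones). Note that the index-time filter emits shingled tokens alongside the originals, which inflates the term dictionary and can skew the term statistics that features like MoreLikeThis rely on:

```xml
<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```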
Re: simple matches not catching at query time
John, How do you suppose to match any of "parsed_filter_queries":[" MultiPhraseQuery(manufacturer_syn_both:\"(vendor_vendor_us vendor) vendor\")", "PhraseQuery(manufacturer_split_syn:\"vendor vendor\")" against vendor_coolmed | coolmed | vendor ? I just can't see any chance to match them. One possible strategy is to pick the simplest filter query, put it as a main query. Then pass &explainOther=id: and share the explanation. On Tue, Apr 11, 2017 at 8:57 PM, John Blythe wrote: > hi, erick. > > appreciate the feedback. > > 1> i'm sending the terms to solr enquoted > 2> i'd thought that at one point and reran the indexing. i _had_ had two of > the fields not indexed, but this represented one pass (same analyzer) from > two diff source fields while 2 or 3 of the other 4 fields _were_ seeming as > if they should match. maybe just need to do this for said sanity at this > point lol > 3> i'm using dismax, no mm param set > > some further context: > > i'm querying something like this: ...fq=manufacturer:("VENDOR:VENDOR US") > OR manufacturer_syn:("VENDOR:VENDOR US")... > > The indexed value is: "Vendor" > > the output of field 1 in the Analysis tab would be: > *index*: vendor_coolmed | coolmed | vendor > *query*: vendor_vendor_coolmed | vendor | vendor > > the other field (and a couple other, related ones, actually) have similar > situations where I see a clear match (as well as get the confirmation of it > when switching to the old UI and seeing the highlighting) yet get no > results in my actual query. > > a further note. when i get the query debugging enabled I can see this in > the output: > "filter_queries":["manufacturer_syn_both:\"Vendor:Vendor US\"", > "manufacturer_split_syn:(\"Vendor:Vendor US\")"], > "parsed_filter_queries":[" > MultiPhraseQuery(manufacturer_syn_both:\"(vendor_vendor_us vendor) > vendor\")", "PhraseQuery(manufacturer_split_syn:\"vendor vendor\")"],... 
> > It looks as if the parsed query is wrapped in quotes even after having been > parsed, so while the correct tokens, i.e. "vendor", are present to match > against the indexed value, the fact that the entire parsed derivative of > the initial query is sent to match (if that's indeed what's happening) > won't actually get any hits. Yet if I remove the quotes when sending over > to query then the parsing doesn't get to a point of having any > worthwhile/matching tokens to begin with. > > one last thing: i've attempted with just "vendor" being sent over to help > remove complexity and, once more, i see Analysis chain functioning just > fine but the query itself getting 0 hits. > > think TermComponents is the best option at this point or something else > given the above filler info? > > -- > *John Blythe* > Product Manager & Lead Developer > > 251.605.3071 | j...@curvolabs.com > www.curvolabs.com > > 58 Adams Ave > Evansville, IN 47713 > > On Tue, Apr 11, 2017 at 1:20 PM, Erick Erickson > wrote: > > > &debug=query is your friend. There are several issues that often trip > > people up: > > > > 1> The analysis tab pre-supposes that what you put in the boxes gets > > all the way to the field in question. Trivial example: > > I put (without quotes) "erick erickson" in the "name" field in the > > analysis page and see that it gets tokenized correctly. But the query > > "name:erick erickson" actually gets parsed at a higher level into > > name:erick default_search_field:erickson. See the discussion at: > > SOLR-9185 > > > > 2> what you think is in your indexed field isn't really. Can happen if > > you changed your analysis chain but didn't totally re-index. Can > > happen because one of the parts of the analysis chain works > > differently than you expect (WordDelimiterFilterFactory, for instance, > > has a ton of options that can alter the tokens emitted). The > > TermsComponent will let you examine the terms actually _in_ the index > > that you search on. 
You stated that the analysis page shows you what > > you expect, so this is a sanity check. > > > > 3> You're using edismax and setting some parameter, mm=100% is a > > favorite and it's having this effect. > > > > So add debug=query and provide a sample document (or just a field) and > > the schema definition for the field in question if you're still > > stumped. > > > > Best, > > Erick > > > > On Tue, Apr 11, 2017 at 8:35 AM, John Blythe wrote: > > > hi everyone. > > > > > > i recently wrote in ('analysis matching, query not') but never heard > back > > > so wanted to follow up. i'm at my wit's end currently. i have several > > > fields that are showing matches in the analysis tab. when i dumb down > the > > > string sent over to query it still gives me issues in some field cases. > > > > > > any thoughts on how to debug to figure out wtf is going on here would > be > > > greatly appreciated. the use case is straightforward and the solution > > > should be as well, so i'm at a loss as to how in the world i'm having > > > issues w this. > > > > > > can provide any amount of contextualizin
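Mikhail's explainOther suggestion amounts to a request along these lines; the field and terms are taken from this thread, and SOME_DOC_ID is a placeholder for the id of a document you expect to match:

```
q=manufacturer_split_syn:"vendor vendor"
&debug=query
&explainOther=id:SOME_DOC_ID
&wt=json
```

The `explainOther` section of the response then shows how the named document scored (or failed to score) against the main query.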
RE: Solr 6.4. Can't index MS Visio vsdx files
It depends. We've been trying to make parsers more, erm, flexible, but there are some problems from which we cannot recover. Tl;dr there isn't a short answer. :( My sense is that DIH/ExtractingRequestHandler is intended to get people up and running with Solr easily but it is not really a great idea for production. See Erick's gem: https://lucidworks.com/2012/02/14/indexing-with-solrj/ As for the Tika portion... at the very least, Tika _shouldn't_ cause the ingesting process to crash. At most, it should fail at the file level and not cause greater havoc. In practice, if you're processing millions of files from the wild, you'll run into bad behavior and need to defend against permanent hangs, OOMs, and memory leaks. Also, at the least, if there's an exception with an embedded file, Tika should catch it and keep going with the rest of the file. If this doesn't happen, let us know! We are aware that some types of embedded file stream problems were causing parse failures on the entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't let them percolate up through the parent file (they're reported in the metadata though). Specifically for your stack traces: For your initial problem with the missing class exceptions -- I thought we used to catch those in docx and log them. I haven't been able to track this down, though. I can look more if you have a need. For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 'PolylineTo' ", this problem might go away if we implemented a pure SAX parser for vsdx. We just did this for docx and pptx (coming in 1.15) and these are more robust to variation because they aren't requiring a match with the ooxml schema. I haven't looked much at vsdx, but that _might_ help. For "TODO Support v5 Pointers", this isn't supported and would require contributions. However, I agree that POI shouldn't throw a RuntimeException. Perhaps open an issue in POI, or maybe we should catch this special example at the Tika level? 
For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team _might_ be able to modify the parser to ignore a stream if there's an exception, but that's often a sign that something needs to be fixed with the parser. In short, the solution will come from POI. Best, Tim -Original Message- From: Gytis Mikuciunas [mailto:gyt...@gmail.com] Sent: Tuesday, April 11, 2017 1:56 PM To: solr-user@lucene.apache.org Subject: RE: Solr 6.4. Can't index MS Visio vsdx files Thanks for your responses. Are there any posibilities to ignore parsing errors and continue indexing? because now solr/tika stops parsing whole document if it finds any exception On Apr 11, 2017 19:51, "Allison, Timothy B." wrote: > You might want to drop a note to the dev or user's list on Apache POI. > > I'm not extremely familiar with the vsd(x) portion of our code base. > > The first item ("PolylineTo") may be caused by a mismatch btwn your > doc and the ooxml spec. > > The second item appears to be an unsupported feature. > > The third item may be an area for improvement within our codebase...I > can't tell just from the stacktrace. > > You'll probably get more helpful answers over on POI. Sorry, I can't > help with this... > > Best, > >Tim > > P.S. > > 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar > > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set > of poi-ooxml-schemas-3.15.jar > > >
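Tim's advice about defending against permanent hangs boils down to running each parse in its own thread with a hard timeout. A minimal, Tika-free sketch of that pattern — the `Callable` stands in for the real per-file Tika parse call, which is an assumption of this example:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SafeParse {
    // Hard-timeout wrapper for a single parse task. A file that hangs or
    // throws is skipped instead of killing the whole ingest run.
    public static String parseWithTimeout(Callable<String> parseTask, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = pool.submit(parseTask);
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // timeout, parse error, or interruption: log it and move on
            return null;
        } finally {
            pool.shutdownNow(); // interrupt a still-running parse
        }
    }
}
```

In a real ingest loop you would call this once per file and keep going on `null`, which is exactly the "fail at the file level and not cause greater havoc" behavior described above.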
Re: Dynamic schema memory consumption
> > And this overhead depends on what? I mean, if I create an empty collection > will it take up much heap size just for "being there" ? Yes. You can search on elastic-search/solr/lucene mailing lists and see that it's true. But nobody has `empty` collections, so yours will have a schema and some data/segments and translog. On Tue, Apr 11, 2017 at 7:41 PM, jpereira wrote: > The way the data is spread across the cluster is not really uniform. Most > of > shards have way lower than 50GB; I would say about 15% of the total shards > have more than 50GB. > > > Dorian Hoxha wrote > > Each shard is a lucene index which has a lot of overhead. > > And this overhead depends on what? I mean, if I create an empty collection > will it take up much heap size just for "being there" ? > > > Dorian Hoxha wrote > > I don't know about static/dynamic memory-issue though. > > I could not find anything related in the docs or the mailing list either, > but I'm still not ready to discard this suspicion... > > Again, thx for your time > > > > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Dynamic-schema-memory-consumption-tp4329184p4329367.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: simple matches not catching at query time
hi, erick. appreciate the feedback. 1> i'm sending the terms to solr enquoted 2> i'd thought that at one point and reran the indexing. i _had_ had two of the fields not indexed, but this represented one pass (same analyzer) from two diff source fields while 2 or 3 of the other 4 fields _were_ seeming as if they should match. maybe just need to do this for said sanity at this point lol 3> i'm using dismax, no mm param set some further context: i'm querying something like this: ...fq=manufacturer:("VENDOR:VENDOR US") OR manufacturer_syn:("VENDOR:VENDOR US")... The indexed value is: "Vendor" the output of field 1 in the Analysis tab would be: *index*: vendor_coolmed | coolmed | vendor *query*: vendor_vendor_coolmed | vendor | vendor the other field (and a couple other, related ones, actually) have similar situations where I see a clear match (as well as get the confirmation of it when switching to the old UI and seeing the highlighting) yet get no results in my actual query. a further note. when i get the query debugging enabled I can see this in the output: "filter_queries":["manufacturer_syn_both:\"Vendor:Vendor US\"", "manufacturer_split_syn:(\"Vendor:Vendor US\")"], "parsed_filter_queries":[" MultiPhraseQuery(manufacturer_syn_both:\"(vendor_vendor_us vendor) vendor\")", "PhraseQuery(manufacturer_split_syn:\"vendor vendor\")"],... It looks as if the parsed query is wrapped in quotes even after having been parsed, so while the correct tokens, i.e. "vendor", are present to match against the indexed value, the fact that the entire parsed derivative of the initial query is sent to match (if that's indeed what's happening) won't actually get any hits. Yet if I remove the quotes when sending over to query then the parsing doesn't get to a point of having any worthwhile/matching tokens to begin with. 
one last thing: i've attempted with just "vendor" being sent over to help remove complexity and, once more, i see Analysis chain functioning just fine but the query itself getting 0 hits. think TermComponents is the best option at this point or something else given the above filler info? -- *John Blythe* Product Manager & Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Tue, Apr 11, 2017 at 1:20 PM, Erick Erickson wrote: > &debug=query is your friend. There are several issues that often trip > people up: > > 1> The analysis tab pre-supposes that what you put in the boxes gets > all the way to the field in question. Trivial example: > I put (without quotes) "erick erickson" in the "name" field in the > analysis page and see that it gets tokenized correctly. But the query > "name:erick erickson" actually gets parsed at a higher level into > name:erick default_search_field:erickson. See the discussion at: > SOLR-9185 > > 2> what you think is in your indexed field isn't really. Can happen if > you changed your analysis chain but didn't totally re-index. Can > happen because one of the parts of the analysis chain works > differently than you expect (WordDelimiterFilterFactory, for instance, > has a ton of options that can alter the tokens emitted). The > TermsComponent will let you examine the terms actually _in_ the index > that you search on. You stated that the analysis page shows you what > you expect, so this is a sanity check. > > 3> You're using edismax and setting some parameter, mm=100% is a > favorite and it's having this effect. > > So add debug=query and provide a sample document (or just a field) and > the schema definition for the field in question if you're still > stumped. > > Best, > Erick > > On Tue, Apr 11, 2017 at 8:35 AM, John Blythe wrote: > > hi everyone. > > > > i recently wrote in ('analysis matching, query not') but never heard back > > so wanted to follow up. i'm at my wit's end currently. 
i have several > > fields that are showing matches in the analysis tab. when i dumb down the > > string sent over to query it still gives me issues in some field cases. > > > > any thoughts on how to debug to figure out wtf is going on here would be > > greatly appreciated. the use case is straightforward and the solution > > should be as well, so i'm at a loss as to how in the world i'm having > > issues w this. > > > > can provide any amount of contextualizing information you need, just let > me > > know what could be beneficial. > > > > best, > > > > john >
RE: Solr 6.4. Can't index MS Visio vsdx files
Thanks for your responses. Are there any possibilities to ignore parsing errors and continue indexing? Because right now solr/tika stops parsing the whole document if it finds any exception. On Apr 11, 2017 19:51, "Allison, Timothy B." wrote: > You might want to drop a note to the dev or user's list on Apache POI. > > I'm not extremely familiar with the vsd(x) portion of our code base. > > The first item ("PolylineTo") may be caused by a mismatch btwn your doc > and the ooxml spec. > > The second item appears to be an unsupported feature. > > The third item may be an area for improvement within our codebase...I > can't tell just from the stacktrace. > > You'll probably get more helpful answers over on POI. Sorry, I can't help > with this... > > Best, > >Tim > > P.S. > > 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar > > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set of > poi-ooxml-schemas-3.15.jar > > >
Re: Dynamic schema memory consumption
The way the data is spread across the cluster is not really uniform. Most of shards have way lower than 50GB; I would say about 15% of the total shards have more than 50GB. Dorian Hoxha wrote > Each shard is a lucene index which has a lot of overhead. And this overhead depends on what? I mean, if I create an empty collection will it take up much heap size just for "being there" ? Dorian Hoxha wrote > I don't know about static/dynamic memory-issue though. I could not find anything related in the docs or the mailing list either, but I'm still not ready to discard this suspicion... Again, thx for your time -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-schema-memory-consumption-tp4329184p4329367.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Grouped Result sort issue
Skimming, I don't think this is inconsistent. First, I assume that you're OK with the second example; it's this one that seems odd to you: sort=score asc group.sort=score desc You're telling Solr to return the highest scoring doc in each group. However, you're asking to order the _groups_ by the ascending score of _any_ doc in that group (i.e. the group containing the lowest-scoring doc comes first), not just the one(s) returned. These are two separate things. "groupValue":"63", "doclist":{"numFound":143, My bet is that the 143rd doc in this group has a lower score than any document returned in any group. To verify, specify: sort=score asc group.sort=score asc My bet: the ordering of the groups will be the same as with sort=score asc group.sort=score desc It's just that the doc returned will be the lowest scoring doc. Best, Erick On Tue, Apr 11, 2017 at 8:16 AM, Eric Cartman wrote: > I modified and cleaned the previous query. As you can see the first query > sorting is a bit odd. > > Using parameters > sort=score asc > group.sort=score desc > > http://localhost:8983/solr/mcontent.ph_post/select?=&fl=*,score&group.field=partnerId&group.limit=1&group.main=false&group.ngroups=true&group.sort=score > desc&group=true&indent=on&q=text:cars&rows=5000&sort=score > asc&start=0&wt=json&omitHeader=true > > { > "grouped":{ > "partnerId":{ > "matches":8681, > "ngroups":10, > "groups":[{ > "groupValue":"63", > "doclist":{"numFound":143,"start":0,"maxScore":0.48749906,"docs":[ > { > "postId":"26317", > "score":0.48749906}] > }}, > { > "groupValue":"64", > "doclist":{"numFound":144,"start":0,"maxScore":0.34190965,"docs":[ > { > "postId":"25549", > "score":0.34190965}] > }}, > { > "groupValue":"28", > "doclist":{"numFound":2023,"start":0,"maxScore":0.6838193,"docs":[ > { > "postId":"31447", > "score":0.6838193}] > }}, > { > "groupValue":"23", > "doclist":{"numFound":3539,"start":0,"maxScore":0.6223264,"docs":[ > { > "postId":"15053", > "score":0.6223264}] > }}, > { > "groupValue":"25", > 
"doclist":{"numFound":2651,"start":0,"maxScore":0.9381923,"docs":[ > { > "postId":"21199", > "score":0.9381923}] > }}, > { > "groupValue":"61", > "doclist":{"numFound":160,"start":0,"maxScore":0.66007686,"docs":[ > { > "postId":"8730", > "score":0.66007686}] > }}, > { > "groupValue":"141", > "doclist":{"numFound":9,"start":0,"maxScore":0.5074051,"docs":[ > { > "postId":"34406", > "score":0.5074051}] > }}, > { > "groupValue":"142", > "doclist":{"numFound":9,"start":0,"maxScore":0.22002561,"docs":[ > { > "postId":"35000", > "score":0.22002561}] > }}, > { > "groupValue":"189", > "doclist":{"numFound":1,"start":0,"maxScore":0.09951033,"docs":[ > { > "postId":"33971", > "score":0.09951033}] > }}, > { > "groupValue":"40", > "doclist":{"numFound":2,"start":0,"maxScore":0.3283673,"docs":[ > { > "postId":"30142", > "score":0.3283673}] > }}]}}} > > Using parameters > sort=score desc > group.sort=score desc > > http://localhost:8983/solr/mcontent.ph_post/select?=&fl=*,score&group.field=partnerId&group.limit=1&group.main=false&group.ngroups=true&group.sort=score > desc&group=true&indent=on&q=text:cars&rows=5000&sort=score > desc&start=0&wt=json&omitHeader=true > { > "grouped":{ > "partnerId":{ > "matches":8681, > "ngroups":10, > "groups":[{ > "groupValue":"25", > "doclist":{"numFound":2651,"start":0,"maxScore":0.9381923,"docs":[ > { > "postId":"21199", > "score":0.9381923}] > }}, > { > "groupValue":"28", > "doclist":{"numFound":2023,"start":0,"maxScore":0.6838193,"docs":[ > { > "postId":"31447", > "score":0.6838193}] > }}, > { > "groupValue":"61", > "doclist":{"numFound":160,"start":0,"maxScore":0.66007686,"docs":[ > { > "postId":"8730", > "score":0.66007686}] > }}, > { > "groupValue":"23", > "doclist":{"numFound":3539,"start":0,"maxScore":0.6223264,"docs":[ > { > "p
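Erick's two-knob model can be sketched concretely: `sort` orders the groups (for score asc, by the lowest score of any doc in the group, returned or not), while `group.sort` only picks which doc each group displays. A toy model of those semantics — not Solr code, and the scores below are made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class GroupSortDemo {
    // sort=score asc orders the *groups* by the minimum score of any doc
    // in each group -- including docs that are never returned.
    public static List<String> groupOrderScoreAsc(Map<String, double[]> groupScores) {
        List<String> ids = new ArrayList<>(groupScores.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> Arrays.stream(groupScores.get(id)).min().getAsDouble()));
        return ids;
    }

    // group.sort=score desc only chooses the doc shown per group: the top scorer.
    public static double displayedScore(double[] scores) {
        return Arrays.stream(scores).max().getAsDouble();
    }
}
```

So a group whose displayed doc scores 0.48 can still sort ahead of a group whose displayed doc scores 0.94, if the first group contains some very low-scoring doc.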
Re: simple matches not catching at query time
&debug=query is your friend. There are several issues that often trip people up: 1> The analysis tab pre-supposes that what you put in the boxes gets all the way to the field in question. Trivial example: I put (without quotes) "erick erickson" in the "name" field in the analysis page and see that it gets tokenized correctly. But the query "name:erick erickson" actually gets parsed at a higher level into name:erick default_search_field:erickson. See the discussion at: SOLR-9185 2> what you think is in your indexed field isn't really. Can happen if you changed your analysis chain but didn't totally re-index. Can happen because one of the parts of the analysis chain works differently than you expect (WordDelimiterFilterFactory, for instance, has a ton of options that can alter the tokens emitted). The TermsComponent will let you examine the terms actually _in_ the index that you search on. You stated that the analysis page shows you what you expect, so this is a sanity check. 3> You're using edismax and setting some parameter, mm=100% is a favorite and it's having this effect. So add debug=query and provide a sample document (or just a field) and the schema definition for the field in question if you're still stumped. Best, Erick On Tue, Apr 11, 2017 at 8:35 AM, John Blythe wrote: > hi everyone. > > i recently wrote in ('analysis matching, query not') but never heard back > so wanted to follow up. i'm at my wit's end currently. i have several > fields that are showing matches in the analysis tab. when i dumb down the > string sent over to query it still gives me issues in some field cases. > > any thoughts on how to debug to figure out wtf is going on here would be > greatly appreciated. the use case is straightforward and the solution > should be as well, so i'm at a loss as to how in the world i'm having > issues w this. > > can provide any amount of contextualizing information you need, just let me > know what could be beneficial. > > best, > > john
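For point 2>, the TermsComponent check Erick mentions can be run against the default /terms handler; the field name below is taken from the thread and the collection name is a placeholder:

```
http://localhost:8983/solr/<collection>/terms?terms.fl=manufacturer_syn_both&terms.prefix=vendor&terms.limit=20
```

This dumps the raw terms actually stored in the index for that field, so you can compare them against what the Analysis tab claims.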
RE: Solr 6.4. Can't index MS Visio vsdx files
You might want to drop a note to the dev or user's list on Apache POI. I'm not extremely familiar with the vsd(x) portion of our code base. The first item ("PolylineTo") may be caused by a mismatch btwn your doc and the ooxml spec. The second item appears to be an unsupported feature. The third item may be an area for improvement within our codebase...I can't tell just from the stacktrace. You'll probably get more helpful answers over on POI. Sorry, I can't help with this... Best, Tim P.S. > 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set of poi-ooxml-schemas-3.15.jar
Re: SolrJ appears to have problems with Docker Toolbox
Ok :) But if you have time have a look at my project https://github.com/freedev/ solrcloud-zookeeper-docker The project builds a couple of docker instances (solr - zookeeper) or a cluster with 6 nodes. Then you have just to put in your hosts file the ip addresses of your VM and you can play with it. On Tue, Apr 11, 2017 at 6:06 PM, Mike Thomsen wrote: > Thanks. I think I'll take a look at that. I decided to just build a big > vagrant-managed desktop VM to let me run Ubuntu on my company machine, so I > expect that this pain point may be largely gone soon. > > On Mon, Apr 10, 2017 at 12:31 PM, Vincenzo D'Amore > wrote: > > > Hi Mike > > > > disclaimer I'm the author of https://github.com/freedev/ > > solrcloud-zookeeper-docker > > > > I had same problem when I tried to create a cluster SolrCloud with > docker, > > just because the docker instances were referred by ip addresses I cannot > > access with SolrJ. > > > > I avoided this problem referring each docker instance via a hostname > > instead of ip address. > > > > Docker-compose is a great help to have a network where your docker > > instances can be resolved using their names. > > > > I'll suggest to take a look at my project, in particular at the > > docker-compose.yml used to start a SolrCloud cluster (3 Solr nodes with a > > zookeeper ensemble of 3): > > > > https://raw.githubusercontent.com/freedev/solrcloud- > > zookeeper-docker/master/ > > solrcloud-3-nodes-zookeeper-ensemble/docker-compose.yml > > > > Ok, I know, it sounds too much create a SolrCloud into a single VM, I did > > it just to understand how Solr works... :) > > > > Once you've build your SolrCloud Docker network, you can map the name of > > your docker instances externally, for example in your private network or > in > > your hosts file. > > > > In other words, given a Docker Solr instance named solr-1, in the docker > > network the instance named solr-1 has a docker ip address that cannot be > > used outside the VM. 
> > > > So when you use SolrJ client on your computer you must have into > /etc/hosts > > an entry solr-1 that points to the ip address your VM (the public network > > interface where the docker instance is mapped). > > > > Hope you understand... :) > > > > Cheers, > > Vincenzo > > > > > > On Sun, Apr 9, 2017 at 2:42 AM, Mike Thomsen > > wrote: > > > > > I'm running two nodes of SolrCloud in Docker on Windows using Docker > > > Toolbox. The problem I am having is that Docker Toolbox runs inside > of a > > > VM and so it has an internal network inside the VM that is not > accessible > > > to the Docker Toolbox VM's host OS. If I go to the VM's IP which is > > > 192.168.99.100, I can load the admin UI and do basic operations that > are > > > written to go against that IP and port (like querying, schema editor, > > > manually adding documents, etc.) > > > > > > However, when I try to run code that uses SolrJ to add documents, it > > fails > > > because the ZK configuration has the IPs for the internal Docker > network > > > which is 172.X.Y..Z. If I log into the toolbox VM and run the Java code > > > from there, it works just fine. From the host OS, doesn't. > > > > > > Anyone have any ideas on how to get around this? If I rewrite the > > indexing > > > code to do a manual JSON POST to the update handler on one of the > nodes, > > it > > > does work just fine, but that leaves me not using SolrJ. > > > > > > Thanks, > > > > > > Mike > > > > > > > > > > > -- > > Vincenzo D'Amore > > email: v.dam...@gmail.com > > skype: free.dev > > mobile: +39 349 8513251 <349%20851%203251> > > > -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251
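Concretely, the mapping Vincenzo describes is a hosts entry on the host OS pointing each container name at the Toolbox VM's address (192.168.99.100 in Mike's setup; the instance names here are illustrative):

```
# /etc/hosts on the Windows host (192.168.99.100 = Docker Toolbox VM)
192.168.99.100  solr-1
192.168.99.100  solr-2
192.168.99.100  zookeeper-1
```

All names resolve to the same VM because the containers are distinguished by their published ports, not their addresses.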
Invoking a SearchHandler inside a Solr Plugin
I am looking for best practices for when a search component in one handler needs to invoke another handler, say /basic. So far, I have this working prototype:

public void process(ResponseBuilder rb) throws IOException {
    SolrQueryResponse response = new SolrQueryResponse();
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.add("defType", "lucene").add("fl", "product_id").add("wt", "json")
          .add("df", "competitor_product_titles").add("echoParams", "explicit")
          .add("q", rb.req.getParams().get("q"));
    SolrQueryRequest request = new LocalSolrQueryRequest(rb.req.getCore(), params);
    SolrRequestHandler hdlr = rb.req.getCore().getRequestHandler("/basic");
    rb.req.getCore().execute(hdlr, request, response);
    DocList docList = ((ResultContext) response.getValues().get("response")).docs;
    // Do some crazy stuff with the result
}

My concerns:
1) What is a clean way to read the /basic handler's default parameters from solrconfig.xml and use them in LocalSolrQueryRequest()?
2) Is there a better way to accomplish this task overall?

Thanks, Max.
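For concern 1), the defaults in question normally live on the handler definition in solrconfig.xml; a sketch mirroring the parameters hard-coded in the prototype (the /basic handler name is from the question, the layout is the standard defaults block):

```xml
<requestHandler name="/basic" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">lucene</str>
    <str name="df">competitor_product_titles</str>
    <str name="fl">product_id</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
```

With the defaults declared here, the component would only need to pass `q` at request time instead of rebuilding every parameter in Java.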
Re: Dynamic schema memory consumption
What I'm suggesting, is that you should aim for max(50GB) per shard of data. How much is it currently ? Each shard is a lucene index which has a lot of overhead. If you can, try to have 20x-50x-100x less shards than you currently do and you'll see lower heap requirement. I don't know about static/dynamic memory-issue though. On Tue, Apr 11, 2017 at 6:09 PM, jpereira wrote: > Dorian Hoxha wrote > > Isn't 18K lucene-indexes (1 for each shard, not counting the replicas) a > > little too much for 3TB of data ? > > Something like 0.167GB for each shard ? > > Isn't that too much overhead (i've mostly worked with es but still lucene > > underneath) ? > > I don't have only 3TB , I have 3TB in two tier2 machines, the whole cluster > is 12 TB :) So what I was trying to explain was this: > > NODES A & B > 3TB per machine , 36 collections * 12 shards (432 indexes) , average heap > footprint of 60GB > > NODES C & D - at first > ~725GB per machine, 4 collections * 12 shards (48 indexes) , average heap > footprint of 12GB > > NODES C & D - after addding 220GB schemaless data > ~1TB per machine, 46 collections * 12 shards (552 indexes), average heap > footprint of 48GB > > So, what you are suggesting is that the culprit for the bump in heap > footprint is the new collections? > > > Dorian Hoxha wrote > > Also you should change the heap 32GB->30GB so you're guaranteed to get > > pointer compression. I think you should have no need to increase it more > > than this, since most things have moved to out-of-heap stuff, like > > docValues etc. > > I was forced to raise the heap size because the memory requirements > dramatically raised, hence this post :) > > Thanks > > > > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Dynamic-schema-memory-consumption-tp4329184p4329345.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Dynamic schema memory consumption
Dorian Hoxha wrote > Isn't 18K lucene-indexes (1 for each shard, not counting the replicas) a > little too much for 3TB of data ? > Something like 0.167GB for each shard ? > Isn't that too much overhead (i've mostly worked with es but still lucene > underneath) ? I don't have only 3TB, I have 3TB in two tier2 machines; the whole cluster is 12 TB :) So what I was trying to explain was this: NODES A & B 3TB per machine, 36 collections * 12 shards (432 indexes), average heap footprint of 60GB NODES C & D - at first ~725GB per machine, 4 collections * 12 shards (48 indexes), average heap footprint of 12GB NODES C & D - after adding 220GB schemaless data ~1TB per machine, 46 collections * 12 shards (552 indexes), average heap footprint of 48GB So, what you are suggesting is that the culprit for the bump in heap footprint is the new collections? Dorian Hoxha wrote > Also you should change the heap 32GB->30GB so you're guaranteed to get > pointer compression. I think you should have no need to increase it more > than this, since most things have moved to out-of-heap stuff, like > docValues etc. I was forced to raise the heap size because the memory requirements dramatically raised, hence this post :) Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-schema-memory-consumption-tp4329184p4329345.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrJ appears to have problems with Docker Toolbox
Thanks. I think I'll take a look at that. I decided to just build a big vagrant-managed desktop VM to let me run Ubuntu on my company machine, so I expect that this pain point may be largely gone soon. On Mon, Apr 10, 2017 at 12:31 PM, Vincenzo D'Amore wrote: > Hi Mike > > Disclaimer: I'm the author of https://github.com/freedev/ > solrcloud-zookeeper-docker > > I had the same problem when I tried to create a SolrCloud cluster with Docker, > because the Docker instances were referenced by IP addresses I could not > access with SolrJ. > > I avoided this problem by referring to each Docker instance via a hostname > instead of an IP address. > > Docker Compose is a great help for building a network where your Docker > instances can be resolved using their names. > > I suggest taking a look at my project, in particular at the > docker-compose.yml used to start a SolrCloud cluster (3 Solr nodes with a > ZooKeeper ensemble of 3): > > https://raw.githubusercontent.com/freedev/solrcloud-zookeeper-docker/master/solrcloud-3-nodes-zookeeper-ensemble/docker-compose.yml > > OK, I know, it sounds like overkill to create a SolrCloud cluster inside a single VM; I did > it just to understand how Solr works... :) > > Once you've built your SolrCloud Docker network, you can map the names of > your Docker instances externally, for example in your private network or in > your hosts file. > > In other words, given a Docker Solr instance named solr-1, inside the Docker > network the instance named solr-1 has a Docker IP address that cannot be > used outside the VM. > > So when you use a SolrJ client on your computer, you must have in /etc/hosts > an entry solr-1 that points to the IP address of your VM (the public network > interface where the Docker instance is mapped). > > Hope you understand... :) > > Cheers, > Vincenzo > > > On Sun, Apr 9, 2017 at 2:42 AM, Mike Thomsen > wrote: > > > I'm running two nodes of SolrCloud in Docker on Windows using Docker > > Toolbox. 
The problem I am having is that Docker Toolbox runs inside of a > > VM and so it has an internal network inside the VM that is not accessible > > to the Docker Toolbox VM's host OS. If I go to the VM's IP which is > > 192.168.99.100, I can load the admin UI and do basic operations that are > > written to go against that IP and port (like querying, schema editor, > > manually adding documents, etc.) > > > > However, when I try to run code that uses SolrJ to add documents, it > fails > > because the ZK configuration has the IPs for the internal Docker network > > which is 172.X.Y..Z. If I log into the toolbox VM and run the Java code > > from there, it works just fine. From the host OS, it doesn't. > > > > Anyone have any ideas on how to get around this? If I rewrite the > indexing > > code to do a manual JSON POST to the update handler on one of the nodes, > it > > does work just fine, but that leaves me not using SolrJ. > > > > Thanks, > > > > Mike > > > > > > -- > Vincenzo D'Amore > email: v.dam...@gmail.com > skype: free.dev > mobile: +39 349 8513251 >
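Vincenzo's hostname-based workaround can be sketched as a minimal compose file. This is illustrative only: service names, images, and ports here are assumptions, and the real, complete example lives in the linked repository.

```yaml
# Sketch of the hostname idea: give each container a stable name on the
# compose network, so ZooKeeper registers "solr-1" instead of a 172.x address.
version: "3"
services:
  solr-1:
    image: solr:6.4
    hostname: solr-1
    ports:
      - "8983:8983"
  zoo-1:
    image: zookeeper:3.4
    hostname: zoo-1
```

On the host OS, an /etc/hosts entry such as `192.168.99.100 solr-1` then lets the SolrJ client resolve the name ZooKeeper hands back to the Toolbox VM's reachable IP and published port.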
simple matches not catching at query time
Hi everyone. I recently wrote in ('analysis matching, query not') but never heard back, so I wanted to follow up. I'm at my wit's end. I have several fields that show matches in the Analysis tab, but even when I dumb down the string sent over at query time, some fields still give me no matches. Any thoughts on how to debug what is going on here would be greatly appreciated. The use case is straightforward and the solution should be as well, so I'm at a loss as to how I'm having issues with this. I can provide any amount of contextualizing information you need; just let me know what would be helpful. Best, John
Re: Grouped Result sort issue
I modified and cleaned the previous query. As you can see the first query sorting is a bit odd. Using parameters sort=score asc group.sort=score desc http://localhost:8983/solr/mcontent.ph_post/select?=&fl=*,score&group.field=partnerId&group.limit=1&group.main=false&group.ngroups=true&group.sort=score desc&group=true&indent=on&q=text:cars&rows=5000&sort=score asc&start=0&wt=json&omitHeader=true { "grouped":{ "partnerId":{ "matches":8681, "ngroups":10, "groups":[{ "groupValue":"63", "doclist":{"numFound":143,"start":0,"maxScore":0.48749906,"docs":[ { "postId":"26317", "score":0.48749906}] }}, { "groupValue":"64", "doclist":{"numFound":144,"start":0,"maxScore":0.34190965,"docs":[ { "postId":"25549", "score":0.34190965}] }}, { "groupValue":"28", "doclist":{"numFound":2023,"start":0,"maxScore":0.6838193,"docs":[ { "postId":"31447", "score":0.6838193}] }}, { "groupValue":"23", "doclist":{"numFound":3539,"start":0,"maxScore":0.6223264,"docs":[ { "postId":"15053", "score":0.6223264}] }}, { "groupValue":"25", "doclist":{"numFound":2651,"start":0,"maxScore":0.9381923,"docs":[ { "postId":"21199", "score":0.9381923}] }}, { "groupValue":"61", "doclist":{"numFound":160,"start":0,"maxScore":0.66007686,"docs":[ { "postId":"8730", "score":0.66007686}] }}, { "groupValue":"141", "doclist":{"numFound":9,"start":0,"maxScore":0.5074051,"docs":[ { "postId":"34406", "score":0.5074051}] }}, { "groupValue":"142", "doclist":{"numFound":9,"start":0,"maxScore":0.22002561,"docs":[ { "postId":"35000", "score":0.22002561}] }}, { "groupValue":"189", "doclist":{"numFound":1,"start":0,"maxScore":0.09951033,"docs":[ { "postId":"33971", "score":0.09951033}] }}, { "groupValue":"40", "doclist":{"numFound":2,"start":0,"maxScore":0.3283673,"docs":[ { "postId":"30142", "score":0.3283673}] }}]}}} Using parameters sort=score desc group.sort=score desc http://localhost:8983/solr/mcontent.ph_post/select?=&fl=*,score&group.field=partnerId&group.limit=1&group.main=false&group.ngroups=true&group.sort=score 
desc&group=true&indent=on&q=text:cars&rows=5000&sort=score desc&start=0&wt=json&omitHeader=true { "grouped":{ "partnerId":{ "matches":8681, "ngroups":10, "groups":[{ "groupValue":"25", "doclist":{"numFound":2651,"start":0,"maxScore":0.9381923,"docs":[ { "postId":"21199", "score":0.9381923}] }}, { "groupValue":"28", "doclist":{"numFound":2023,"start":0,"maxScore":0.6838193,"docs":[ { "postId":"31447", "score":0.6838193}] }}, { "groupValue":"61", "doclist":{"numFound":160,"start":0,"maxScore":0.66007686,"docs":[ { "postId":"8730", "score":0.66007686}] }}, { "groupValue":"23", "doclist":{"numFound":3539,"start":0,"maxScore":0.6223264,"docs":[ { "postId":"15053", "score":0.6223264}] }}, { "groupValue":"141", "doclist":{"numFound":9,"start":0,"maxScore":0.5074051,"docs":[ { "postId":"34406", "score":0.5074051}] }}, { "groupValue":"63", "doclist":{"numFound":143,"start":0,"maxScore":0.48749906,"docs":[ { "postId":"26317", "score":0.48749906}] }}, { "groupValue":"64", "doclist":{"numFound":144,"start":0,"maxScore":0.34190965,"docs":[ { "postId":"25549", "score":0.34190965}] }}, { "groupValue":"40", "doclist":{"numFound":2,"start":0,"maxScore":0.3283673,"docs":[ { "postId":"30142", "score":0.3283673}] }}, { "groupValue":"142", "doclist":{"numFound":9,"start":0,"maxScore":0.22002561,"docs":[ { "postId":"35000", "score":0.22002561}] }},
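One way to see the inconsistency concretely is to pull the per-group maxScore values out of the grouped response and check how they are ordered. A client-side sketch (the scores below are copied from the first five groups of the sort=score asc response above):

```python
# First groups of the sort=score asc response; only the fields we need.
grouped_response = {
    "grouped": {"partnerId": {"groups": [
        {"groupValue": "63", "doclist": {"maxScore": 0.48749906}},
        {"groupValue": "64", "doclist": {"maxScore": 0.34190965}},
        {"groupValue": "28", "doclist": {"maxScore": 0.6838193}},
        {"groupValue": "23", "doclist": {"maxScore": 0.6223264}},
        {"groupValue": "25", "doclist": {"maxScore": 0.9381923}},
    ]}}
}

scores = [g["doclist"]["maxScore"]
          for g in grouped_response["grouped"]["partnerId"]["groups"]]

# The group order is neither ascending nor descending by head score,
# i.e. it does not reflect sort=score asc at all.
print(scores == sorted(scores))                # ascending?
print(scores == sorted(scores, reverse=True))  # descending?
```

Both checks print False for the first response, while the same check on the sort=score desc response would confirm a clean descending order.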
Re: Solr/ Velocity dont show full field value
#field() is defined in _macros.vm as this monstrosity: # TODO: make this parameterized fully, no context sensitivity #macro(field $f) #if($response.response.highlighting.get($docId).get($f).get(0)) #set($pad = "") #foreach($v in $response.response.highlighting.get($docId).get($f)) $pad$v## #TODO: $esc.html() or maybe make that optional? #set($pad = " ... ") #end #else $esc.html($display.list($doc.getFieldValues($f), ", ")) #end #end Basically that’s saying if there is highlighting returned for the specified field, then render it, otherwise render the full field value. $doc.getFieldValue() won’t ever work with highlighting - it’s the raw returned field value (or empty, potentially) - highlighting has to be looked up separately and that’s what the #field() macro tries to do - make it look a bit more seamless and slick, to just do #field(“field_name”). But it does rely on highlighting working - so try the json or xml response until you get the highlighting configured as needed. Erik > On Apr 11, 2017, at 6:14 AM, Hamso wrote: > > Hey guys, > I have a problem: > > In Velocity: > > *Beschreibung:*#field('LONG_TEXT') > > In Solr the field "LONG_TEXT" dont show everything only the first ~90-110 > characters. > But if I set "$doc.getFieldValue('LONG_TEXT')" in the Velocity file, then he > show me everything whats inside in the field "LONG_TEXT". > But there is one problem, if I use "$doc.getFieldValue('LONG_TEXT')" instead > of #field('LONG_TEXT'), the highlight doesnt work. > Can someone please help me, why #field('LONG_TEXT') doesnt show everthing > whats inside the field, or why highlighting with > "$doc.getFieldValue('LONG_TEXT')" doesnt work. 
> > Schema.xml: > > /> > > positionIncrementGap="100"> > > > ignoreCase="true"/> > > maxGramSize="500"/> > > > > ignoreCase="true"/> > ignoreCase="true" synonyms="synonyms.txt"/> > > > > > solrconfig only in /browse: > > on > LONG_TEXT > true > html > > > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-Velocity-dont-show-full-field-value-tp4329290.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Grouped Result sort issue
The group.sort spec is specified twice in the URL: group.sort=score desc& group.sort=score desc Is there a chance that during testing you only changed _one_ of them, so you had group.sort=score desc& group.sort=score asc ? I think the last one should win. Shot in the dark. Best, Erick On Tue, Apr 11, 2017 at 3:23 AM, alessandro.benedetti wrote: > To be fair the second result seems consistent with the Solr grouping logic : > > *First Query results (Suspicious)* > 1) group.sort= score desc -> select the group head as you have 1 doc per > group( the head will be the top scoring doc per group) > 2) sort=score asc -> sort the groups by the score of the head ascending ( so > the final resulting groups should be ascending in score) > > > *Second Query results ( CORRECT)* > 1) group.sort= score desc -> select the group head as you have 1 doc per > group( the head will be the top scoring doc per group) > 2) sort -> sort the groups by the score of the head ( so the final resulting > groups are sorted descending) > > Are we sure the the sort is expected to sort the groups after the grouping > happened ? > I need to check the internals but I agree the current behaviour is not > intuitive. > > Cheers > > > > > > - > --- > Alessandro Benedetti > Search Consultant, R&D Software Engineer, Director > Sease Ltd. - www.sease.io > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Grouped-Result-sort-issue-tp4329255p4329292.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering results by minimum relevancy score
Can't the filter be used when you're paginating in a sharded scenario? If you do limit=10, offset=10, each shard has to return 20 docs, while if you do limit=10, _score<=last_page.min_score, each shard only has to return 10 docs. (They will still score all docs, but merging will be faster.) Makes sense? On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti wrote: > Can i ask what is the final requirement here ? > What are you trying to do ? > - just display less results ? > you can easily do at search client time, cutting after a certain amount > - make search faster returning less results ? > This is not going to work, as you need to score all of them as Erick > explained. > > Function query ( as Mikhail specified) will run on a per document basis ( > if > I am correct), so if your idea was to speed up the things, this is not > going > to work. > > It makes much more sense to refine your system to improve relevancy if your > concern is to have more relevant docs. > If your concern is just to not show that many pages, you can limit that > client side. > > > > > > > - > --- > Alessandro Benedetti > Search Consultant, R&D Software Engineer, Director > Sease Ltd. - www.sease.io > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Filtering-results-by-minimum-relevancy-score- > tp4329180p4329295.html > Sent from the Solr - User mailing list archive at Nabble.com. >
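The pagination argument above can be illustrated with a small merge simulation. This is a pure toy model (not Solr's distributed code path, and it ignores tie-breaking on equal scores): with offset paging each shard must return offset+limit candidates, while a cutoff carried over from the previous page caps each shard's response at limit docs.

```python
def page_with_offset(shards, limit, offset):
    # Offset paging: each shard returns its top (offset + limit) scores.
    per_shard = [sorted(s, reverse=True)[:offset + limit] for s in shards]
    merged = sorted(sum(per_shard, []), reverse=True)
    return merged[offset:offset + limit], sum(len(p) for p in per_shard)

def page_with_cutoff(shards, limit, max_score):
    # Cutoff paging: each shard returns at most `limit` scores below the cutoff.
    per_shard = [sorted(x for x in s if x < max_score)[::-1][:limit]
                 for s in shards]
    merged = sorted(sum(per_shard, []), reverse=True)
    return merged[:limit], sum(len(p) for p in per_shard)

shards = [[0.9, 0.8, 0.5, 0.4], [0.85, 0.7, 0.6, 0.3], [0.95, 0.75, 0.65, 0.2]]

page1, _ = page_with_offset(shards, limit=3, offset=0)
page2_offset, sent_offset = page_with_offset(shards, limit=3, offset=3)
page2_cutoff, sent_cutoff = page_with_cutoff(shards, limit=3,
                                             max_score=min(page1))

print(page2_offset == page2_cutoff)  # same page 2 either way
print(sent_offset, sent_cutoff)      # cutoff moves fewer docs per shard
```

Both strategies produce the same second page here, but the cutoff variant sends fewer documents from each shard to the merger; the saving grows with the page depth.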
Solr/ Velocity dont show full field value
Hey guys, I have a problem: In Velocity: *Beschreibung:*#field('LONG_TEXT') In Solr, the field "LONG_TEXT" doesn't show everything, only the first ~90-110 characters. But if I use "$doc.getFieldValue('LONG_TEXT')" in the Velocity file, then it shows me everything that is inside the field "LONG_TEXT". But there is one problem: if I use "$doc.getFieldValue('LONG_TEXT')" instead of #field('LONG_TEXT'), the highlighting doesn't work. Can someone please explain why #field('LONG_TEXT') doesn't show everything inside the field, or why highlighting with "$doc.getFieldValue('LONG_TEXT')" doesn't work. Schema.xml: solrconfig only in /browse: on LONG_TEXT true html -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Velocity-dont-show-full-field-value-tp4329290.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 6.4. Can't index MS Visio vsdx files
Hi, history: 1. we're using single core Solr 6.4 instance on windows server (windows server 2012 R2 standard), 2. Java v8, (build 1.8.0_121-b13). 3. as a workaround for earlier issues with visio files, we have in solr-6.4.0\contrib\extraction\lib: 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar 3.2. curvesapi-1.03.jar This workaround solved many parsing issues on visio files. However we still have some other parsing issues left with bunch of visio files. Could you propose a solution for us on how to fix them? errors similar to these: { "responseHeader": { "status": 500, "QTime": 155 }, "error": { "msg": "org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3c9f695c", "code": 500, "trace": "org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3c9f695c\r\n\tat org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)\r\n\tat org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)\r\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)\r\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)\r\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)\r\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)\r\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)\r\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)\r\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)\r\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\r\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\r\n\tat 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\r\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\r\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\r\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\r\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\r\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\r\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\r\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\r\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\r\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\r\n\tat org.eclipse.jetty.server.Server.handle(Server.java:534)\r\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\r\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\r\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\r\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\r\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\r\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\r\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\r\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\r\n\tat java.lang.Thread.run(Unknown Source)\r\nCaused by: org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3c9f695c\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)\r\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\r\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)\r\n\tat org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)\r\n\t... 32 more\r\nCaused by: org.apache.poi.POIXMLException: /visio/masters/masters.xml: /visio/masters/master50.xml: : Invalid 'Row_Type' name 'PolylineTo'\r\n\tat org.apache.poi.xdgf.exceptions.XDGFException.wrap(XDGFException.java:43)\r\n\tat org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:107)\r\n\tat org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)\r\n\tat org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)\r\n\tat org.apache.poi.xdgf.usermodel.XmlVisioDocument.(XmlVisioDocume
Re: Filtering results by minimum relevancy score
Can I ask what the final requirement is here? What are you trying to do? - Just display fewer results? You can easily do that at search-client time, cutting after a certain amount. - Make search faster by returning fewer results? This is not going to work, as you need to score all of them, as Erick explained. Function queries (as Mikhail specified) run on a per-document basis (if I am correct), so if your idea was to speed things up, this is not going to work. It makes much more sense to refine your system to improve relevancy if your concern is to have more relevant docs. If your concern is just to not show that many pages, you can limit that client side. - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329295.html Sent from the Solr - User mailing list archive at Nabble.com.
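The "cut at the client" suggestion can be sketched in a few lines. This is illustrative only (field names and the 50% threshold are assumptions); it operates on the `docs` list from a Solr JSON response requested with fl=*,score, and Solr still scores and returns everything up to `rows`.

```python
def trim_by_relative_score(docs, min_fraction=0.5):
    """Keep only docs scoring at least `min_fraction` of the top score.

    Purely client-side: nothing here makes the search itself faster,
    it only hides the long low-relevance tail from the user.
    """
    if not docs:
        return []
    top = max(d["score"] for d in docs)
    return [d for d in docs if d["score"] >= min_fraction * top]

docs = [{"id": "a", "score": 2.0},
        {"id": "b", "score": 1.2},
        {"id": "c", "score": 0.4}]
print([d["id"] for d in trim_by_relative_score(docs)])  # ['a', 'b']
```

A relative threshold tends to be safer than an absolute one, since Lucene scores are not comparable across queries.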
Re: Solr Index size keeps fluctuating, becomes ~4x normal size.
On Mon, 2017-04-10 at 13:27 +0530, Himanshu Sachdeva wrote: > Thanks for your time and quick response. As you said, I changed our > logging level from SEVERE to INFO and indeed found the performance > warning *Overlapping onDeckSearchers=2* in the logs. If you only see it occasionally, it is probably not a problem. If you see it often, that means that you are re-opening at a high rate, relative to the time it takes for a searcher to be ready. Since each searcher holds a lock on the files it searches, and you have multiple concurrent open searchers on a volatile index, that helps explain the index size fluctuations. Each searcher also requires heap, which might explain why you get Out Of Memory errors. This all boils down to avoid having (too many) overlapping warming searchers. * Reduce your auto-warm if it is high * Prolong the time between searcher-opening commits * Check that you have docValues on fields that you facet or group on > I am considering limiting the *maxWarmingSearchers* count in > configuration but want to be sure that nothing breaks in production > in case simultaneous commits do happen afterwards. That is one way of doing it, but it does not help you pinpoint where your problem is. > What would happen if we set *maxWarmingSearchers* count to 1 and make > simultaneous commit from different endpoints? I understand that solr > will prevent opening a new searcher for the second commit but is that > all there is to it? Does it mean solr will serve stale data( i.e. > send stale data to the slaves) ignoring the changes from the second > commit? [...] Sorry, I am not that familiar with the details of master-slave-setups. -- Toke Eskildsen, Royal Danish Library
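Toke's three suggestions map onto a handful of solrconfig.xml knobs. A hedged sketch, with purely illustrative values (not recommendations, and element placement abbreviated):

```xml
<!-- solrconfig.xml, inside <query>: cap concurrent warming searchers;
     commits that would exceed the cap fail instead of piling up. -->
<maxWarmingSearchers>2</maxWarmingSearchers>

<!-- Keep autowarmCount low so a new searcher becomes ready quickly. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512"
             autowarmCount="16"/>

<!-- solrconfig.xml, inside <updateHandler>: prolong the time between
     searcher-opening commits. -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>
```

The third point, docValues, lives in the schema instead: fields that are faceted or grouped on would carry docValues="true" on their <field> definitions.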
Re: Expiry of Basic Authentication Plugin
Browsers retain Basic auth credentials; there is no automatic expiry. You have to close the browser or clear your browsing history. You can also change the user's password on the server side. Best On Tue, Apr 11, 2017 at 7:18 AM, Zheng Lin Edwin Yeo wrote: > Anyone has any idea if the authentication will expired automatically? Mine > has already been authenticated for more than 20 hours, and it has not auto > logged out yet. > > Regards, > Edwin > > On 11 April 2017 at 00:19, Zheng Lin Edwin Yeo > wrote: > > > Hi, > > > > Would like to check, after I have entered the authentication to access > > Solr with Basic Authentication Plugin, will the authentication be expired > > automatically after a period of time? > > > > I'm using SolrCloud on Solr 6.4.2 > > > > Regards, > > Edwin > > >
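The "expiry" confusion usually comes from HTTP Basic auth being stateless: there is no server-side session to time out; the client simply resends the Authorization header with every request. A minimal sketch of what the browser (or SolrJ) attaches each time (the solr/SolrRocks credentials are just the common example pair, not anything from this thread):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header value Basic auth sends on every request."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

# Nothing here expires: the credentials ride along on each request,
# which is why the browser stays "logged in" until it forgets them.
print(basic_auth_header("solr", "SolrRocks"))
```

This is also why changing the password server-side is the only way to force a "logout" for a client that keeps replaying stored credentials.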
Re: Grouped Result sort issue
To be fair, the second result seems consistent with the Solr grouping logic: *First Query results (Suspicious)* 1) group.sort=score desc -> select the group head; as you have 1 doc per group, the head will be the top-scoring doc per group 2) sort=score asc -> sort the groups by the score of the head, ascending (so the final resulting groups should be ascending in score) *Second Query results (CORRECT)* 1) group.sort=score desc -> select the group head; as you have 1 doc per group, the head will be the top-scoring doc per group 2) sort -> sort the groups by the score of the head (so the final resulting groups are sorted descending) Are we sure the sort is expected to sort the groups after the grouping has happened? I need to check the internals, but I agree the current behaviour is not intuitive. Cheers - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- View this message in context: http://lucene.472066.n3.nabble.com/Grouped-Result-sort-issue-tp4329255p4329292.html Sent from the Solr - User mailing list archive at Nabble.com.
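The two-step reading above (group.sort picks each group's head, then sort orders the groups by their heads) can be simulated directly. This is a toy model of the documented semantics, not Solr's actual code path; the partnerId/score values echo the thread's data, the extra 0.100 doc is invented to show the head selection:

```python
from collections import defaultdict

def group_results(docs, group_field, group_sort_desc=True, sort_desc=True):
    """Toy model: group.sort selects each group's head, sort orders the groups."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc)
    # Step 1 (group.sort): pick the head doc within each group by score.
    heads = [max(g, key=lambda d: d["score"]) if group_sort_desc
             else min(g, key=lambda d: d["score"])
             for g in groups.values()]
    # Step 2 (sort): order the groups by their head's score.
    return sorted(heads, key=lambda d: d["score"], reverse=sort_desc)

docs = [
    {"partnerId": "25", "postId": "21199", "score": 0.938},
    {"partnerId": "25", "postId": "99999", "score": 0.100},  # invented low doc
    {"partnerId": "28", "postId": "31447", "score": 0.684},
    {"partnerId": "63", "postId": "26317", "score": 0.487},
]

asc = group_results(docs, "partnerId", sort_desc=False)
print([d["partnerId"] for d in asc])  # under this reading: 63, 28, 25
```

Under this reading, sort=score asc with group.sort=score desc should yield groups in strictly ascending head score, which is exactly what the first response in the thread fails to show.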
Re: Dynamic schema memory consumption
Also you should change the heap 32GB->30GB so you're guaranteed to get pointer compression. I think you should have no need to increase it more than this, since most things have moved to out-of-heap stuff, like docValues etc. On Tue, Apr 11, 2017 at 12:07 PM, Dorian Hoxha wrote: > Isn't 18K lucene-indexes (1 for each shard, not counting the replicas) a > little too much for 3TB of data ? > Something like 0.167GB for each shard ? > Isn't that too much overhead (i've mostly worked with es but still lucene > underneath) ? > > Can't you use 1/100 the current number of collections ? > > > On Mon, Apr 10, 2017 at 5:22 PM, jpereira wrote: > >> Hello guys, >> >> I manage a Solr cluster and I am experiencing some problems with dynamic >> schemas. >> >> The cluster has 16 nodes and 1500 collections with 12 shards per >> collection >> and 2 replicas per shard. The nodes can be divided in 2 major tiers: >> - tier1 is composed of 12 machines with 4 physical cores (8 virtual), >> 32GB >> ram and 4TB ssd; these are used mostly for direct queries and data >> exports; >> - tier2 is composed of 4 machines with 20 physical cores (40 virtual), >> 128GB and 4TB ssd; these are mostly for aggregation queries (facets) >> >> The problem I am experiencing is that when using dynamic schemas, the Solr >> heap size rises dramatically. >> >> I have two tier2 machines (lets call them A and B) running one Solr >> instance >> each with 96GB heap size, with 36 collections totaling 3TB of mainly >> fixed-schema (55GB schemaless) data indexed in each machine, and the heap >> consumption is on average 60GB (it peaks at around 80GB and drops to >> around >> 40GB after a GC run). >> >> On the other tier2 machines (C and D) I was running one Solr instance on >> each machine with 32GB heap size and 4 fixed schema collections with about >> 725GB of data indexed in each machine, which took up about 12GB of heap >> size. Recently I added 46 collections to these machines with about 220Gb >> of >> data. 
In order to do this I was forced to raise the heap size to 64GB and >> after indexing everything now the machines have an averaged consumption of >> 48GB (!!!) (max ~55GB, after GC runs ~37GB) >> >> I also noticed that when indexed fixed schema data the CPU utilization is >> also dramatically lower. I have around 100 workers indexing fixed schema >> data with and CPU utilization rate of about 10%, while I have only one >> worker for schemaless data with a CPU utilization cost of about 20%. >> >> So, I have a two big questions here: >> 1. Is this dramatic rise in resources consumption when using dynamic >> fields >> "normal"? >> 2. Is there a way to lower the memory requirements? If so, how? >> >> Thanks for your time! >> >> >> >> -- >> View this message in context: http://lucene.472066.n3.nabble >> .com/Dynamic-schema-memory-consumption-tp4329184.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> > >
Re: Dynamic schema memory consumption
Isn't 18K lucene-indexes (1 for each shard, not counting the replicas) a little too much for 3TB of data ? Something like 0.167GB for each shard ? Isn't that too much overhead (i've mostly worked with es but still lucene underneath) ? Can't you use 1/100 the current number of collections ? On Mon, Apr 10, 2017 at 5:22 PM, jpereira wrote: > Hello guys, > > I manage a Solr cluster and I am experiencing some problems with dynamic > schemas. > > The cluster has 16 nodes and 1500 collections with 12 shards per collection > and 2 replicas per shard. The nodes can be divided in 2 major tiers: > - tier1 is composed of 12 machines with 4 physical cores (8 virtual), 32GB > ram and 4TB ssd; these are used mostly for direct queries and data exports; > - tier2 is composed of 4 machines with 20 physical cores (40 virtual), > 128GB and 4TB ssd; these are mostly for aggregation queries (facets) > > The problem I am experiencing is that when using dynamic schemas, the Solr > heap size rises dramatically. > > I have two tier2 machines (lets call them A and B) running one Solr > instance > each with 96GB heap size, with 36 collections totaling 3TB of mainly > fixed-schema (55GB schemaless) data indexed in each machine, and the heap > consumption is on average 60GB (it peaks at around 80GB and drops to around > 40GB after a GC run). > > On the other tier2 machines (C and D) I was running one Solr instance on > each machine with 32GB heap size and 4 fixed schema collections with about > 725GB of data indexed in each machine, which took up about 12GB of heap > size. Recently I added 46 collections to these machines with about 220Gb of > data. In order to do this I was forced to raise the heap size to 64GB and > after indexing everything now the machines have an averaged consumption of > 48GB (!!!) (max ~55GB, after GC runs ~37GB) > > I also noticed that when indexed fixed schema data the CPU utilization is > also dramatically lower. 
I have around 100 workers indexing fixed schema > data with a CPU utilization rate of about 10%, while I have only one > worker for schemaless data with a CPU utilization cost of about 20%. > > So, I have two big questions here: > 1. Is this dramatic rise in resource consumption when using dynamic fields > "normal"? > 2. Is there a way to lower the memory requirements? If so, how? > > Thanks for your time! > > > > -- > View this message in context: http://lucene.472066.n3.nabble > .com/Dynamic-schema-memory-consumption-tp4329184.html > Sent from the Solr - User mailing list archive at Nabble.com. >