Re: Hierarchical faceting
I realize you want to avoid putting depth details into the field values, but something has to imply the depth. So with that in mind, here is another approach (with the assumption that you are chasing down a single branch of a tree, and all its subbranch offshoots):

- Use dynamic fields
- Step from one level to the next with a simple increment
- Build the facet for the next level on the call
- The UI needs only know the current level

This would possibly be as so: step_fieldname_n, with a dynamic field configuration of: step_*

The content of the step_fieldname_n field would be either the string of the field value or the delimited path of the current level (as suited to taste). Either way, most likely a fieldType of string (or some variation thereof).

The UI would then call:

  facet.field=step_fieldname_n+1

And the UI would need to be aware to carry the n+1 into the fq link verbiage:

  fq=step_fieldname_n+1:facetvalue

The trick of all of this is that you must build your index with the depth of your hierarchy in mind to place the values into the suitable fields. You could, of course, write an UpdateProcessor to accomplish this if that seems fitting.

Jason

On Nov 17, 2014, at 12:22 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

You might be able to stick in a couple of PatternReplaceFilterFactory in a row with regular expressions to catch different levels. Something like:

  <filter class="solr.PatternReplaceFilterFactory" pattern="^[^0-9][^/]+/[^/]/[^/]+$" replacement="2$0" />
  <filter class="solr.PatternReplaceFilterFactory" pattern="^[^0-9][^/]+/[^/]$" replacement="1$0" />
  ...

I did not test this; you may need to escape some things or put explicit groups in there.

Regards,
Alex.

P.s. http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternReplaceFilterFactory.html

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 17 November 2014 15:01, rashmy1 rashmy.appanerava...@siemens.com wrote:

Hi Alexandre,

Yes, I've read this post and that's the 'Option1' listed in my initial post. I'm looking to see if Solr has any in-built tokenizer that splits the tokens and prepends the depth information. I'd like to avoid building depth information into the field values if Solr already has something that can be used.

Thanks!
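A minimal sketch of the dynamic-field approach Jason describes (the field names, values, and levels below are hypothetical):

In schema.xml:

  <dynamicField name="step_*" type="string" indexed="true" stored="true" multiValued="true"/>

At index time, a document on the branch electronics > cameras > slr would carry one value per level:

  step_category_0: electronics
  step_category_1: electronics/cameras
  step_category_2: electronics/cameras/slr

A UI currently sitting at level 1 would then request the next level's facet while filtering on the current selection:

  facet.field=step_category_2&fq=step_category_1:"electronics/cameras"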
Re: openSearcher, default commit settings
Boon,

I expect you will find many definitions of “proper usage” depending upon context and expected results. Personally, I don’t believe this is Solr’s job to enforce, and there are many ways through the use of directives in the servlet container layer that can allow restrictions if you feel this is required.

I would recommend considering an abstraction layer if you feel your development team may (accidentally) abuse the system they are permitted to use. I’ve seen this employed very well with minimal latency and cost in extremely large corporations that have many multiple development teams using the same search infrastructure.

Jason

On Jun 2, 2014, at 3:53 AM, Boon Low boon@dctfh.com wrote:

Thanks for clearing this up. The wiki, being an authoritative reference, needs to be corrected.

Re. default commit settings. I agree educating developers is very essential. But in reality, you can't rely on this as the sole mechanism for ensuring proper usage of the update API, especially for calls such as commit, optimize, and expungeDeletes, which can be very expensive for large indexes on a shared infrastructure. The issue is, there's no control mechanism in Solr for update calls (cf. rewriting calls via load-balancer). Once you expose the update handler to the developers, they could send 10 commit/optimise ops per minute, opening new searchers for each of those calls (openSearcher is only configurable for autocommit). And there is nothing you can do about it in Solr, even as an immediate stopgap while a fix is being implemented for the next sprint.

It'd be good to have some consistency in terms of configuring handlers, i.e. having default/invariant settings for both the search and update handlers.

Thanks,

Boon

-
Boon Low
Search Engineer, DCT Family History

On 29 May 2014, at 18:03, Shawn Heisey s...@elyograg.org wrote:

On 5/29/2014 9:21 AM, Boon Low wrote:

1. openSearcher (autoCommit)

According to the Apache Solr reference, autoCommit/openSearcher is set to false by default.

https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig

But on Solr v4.8.1, if openSearcher is omitted from the autoCommit config, new searchers are opened and warmed post auto-commits. Is this behaviour intended or is the wiki wrong?

I am reasonably certain that the default for openSearcher if it is not specified will always be true. My understanding and your actual experience say that the documentation is wrong.

Additional note: The docs for autoSoftCommit are basically a footnote on autoCommit, which I think is a mistake -- it should have its own section, and the docs should mention that openSearcher does not apply.

I think the code confirms this. From SolrConfig.java:

  protected UpdateHandlerInfo loadUpdatehandlerInfo() {
    return new UpdateHandlerInfo(get("updateHandler/@class", null),
        getInt("updateHandler/autoCommit/maxDocs", -1),
        getInt("updateHandler/autoCommit/maxTime", -1),
        getBool("updateHandler/autoCommit/openSearcher", true),
        getInt("updateHandler/commitIntervalLowerBound", -1),
        getInt("updateHandler/autoSoftCommit/maxDocs", -1),
        getInt("updateHandler/autoSoftCommit/maxTime", -1),
        getBool("updateHandler/commitWithin/softCommit", true));
  }

2. openSearcher and other default commit settings

From previous posts, I know it's not possible to disable commits completely in Solr config (without coding). But is there a way to configure the default settings of hard/explicit commits for the update handler? If not, it makes sense to have a configuration mechanism.
Currently, a simple commit call seems to be hard-wired with the following options:

  commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

There's no server-side option, e.g. to set openSearcher=false as a default or invariant (cf. searchHandler) to prevent new searchers from opening. I found that at times it is necessary to have better server- or infrastructure-side controls for updates/commits, especially in agile teams. Client/UI developers do not necessarily have complete Solr knowledge. Unintended commits from misbehaving client-side updates may be the norm (e.g. 10 times per minute!).

Since you want to handle commits automatically, you'll want to educate your developers and tell them that they should never send commits -- let Solr handle it. If the code that talks to Solr is Java and uses SolrJ, you might want to consider using forbidden-apis in your project so that a build will fail if the commit method gets used.

https://code.google.com/p/forbidden-apis/

Thanks,
Shawn
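For readers following along, the autocommit route mentioned above is configured in solrconfig.xml. A minimal sketch (the intervals are illustrative) that commits hard without opening searchers, and makes changes visible only via soft commits:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>300000</maxTime>
    </autoSoftCommit>
  </updateHandler>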
Re: Boost documents having a field value
Hakim,

That is what Boost Query (bq=) does.

http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29

Jason

On Jun 2, 2014, at 10:58 AM, Hakim Benoudjit h.benoud...@gmail.com wrote:

Hi guys,

Is it possible in Solr to boost documents having a field value (e.g. field:value)? I know that it's possible to boost a field above other fields at query-time, but I want to boost a field value, not the field name. And if so, is the boosting done at query time or at indexing time?

--
Hakim Benoudjit.
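A minimal example with dismax (the field name and boost value are illustrative); the boost is applied at query time, not at indexing:

  q=ipod&defType=dismax&qf=title description&bq=color:white^10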
Re: SolrCloud: Understanding Replication
Marc,

Fundamentally it’s a good solution design to always be capable of reposting (reindexing) your data to Solr. You are demonstrating a classic use case of this, which is upgrade. Is there a critical reason why you are avoiding this step?

Jason

On May 30, 2014, at 10:38 AM, Marc Campeau cam...@gmail.com wrote:

2014-05-30 12:24 GMT-04:00 Erick Erickson erickerick...@gmail.com:

Let's back up a bit here. Why are you copying your indexes around? SolrCloud does all this for you. I suspect you've somehow made a mis-step.

I started by copying the index around because my 4.5.1 instance is not set up as Cloud and I wanted to avoid reindexing all my data when migrating to my new 4.8.1 SolrCloud setup. I've now put that aside and I'm just trying to get replication happening when I populate an empty collection.

So here's what I'd do by preference; Just set up a new collection and re-index. Make sure all of the nodes are up and then just go ahead and index to any of them. If you're using SolrJ, CloudSolrServer will be a bit more efficient than sending the docs to random nodes, but that's not necessary.

I've been trying that this morning. Stop the instances, deleted the contents of /data on all my 4.8.1 instances, then started them again... they all show up in a 1-shard cluster as 4 replicas and one is the leader... they're still shown as down in clusterstate. Then I sent a document to be added to one of the nodes specifically. Only that node now contains the document. It hasn't been replicated to the other instances. When I issue queries to the collection for that document through my load balancer it works roughly 1/4 times, in accordance with the fact that it's only on the instance where it was added.

Must I use the Collections API to create this new collection or can I just do it old style by creating a subfolder in the /solr directory with my confs?

Here's the log of these operations.

LOG of instance where document was added:

2758138 [qtp1781256139-14] INFO org.apache.solr.update.processor.LogUpdateProcessor – [mycollection] webapp=/solr path=/update/ params={indent=on&version=2.2&wt=json} {add=[Listing_3446279]} 0 271
2769177 [qtp1781256139-12] INFO org.apache.solr.core.SolrCore – [mycollection] webapp=/solr path=/admin/ping params={} hits=0 status=0 QTime=1
[... More Pings ... ]
2773138 [commitScheduler-7-thread-1] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
2773377 [commitScheduler-7-thread-1] INFO org.apache.solr.search.SolrIndexSearcher – Opening Searcher@175816a5[mycollection] main
2773389 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener sending requests to Searcher@175816a5[mycollection] main{StandardDirectoryReader(segments_1:3:nrt _0(4.8):C1)}
2773389 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener done.
2773390 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – [mycollection] Registered new searcher Searcher@175816a5[mycollection] main{StandardDirectoryReader(segments_1:3:nrt _0(4.8):C1)}
2773390 [commitScheduler-7-thread-1] INFO org.apache.solr.update.UpdateHandler – end_commit_flush
[... More Pings ...]
2799792 [qtp1781256139-18] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2799883 [qtp1781256139-18] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onCommit: commits: num=2
commit{dir=NRTCachingDirectory(MMapDirectory@/opt/solr-4.8.0/example/solr/mycollection/data/index lockFactory=NativeFSLockFactory@/opt/solr-4.8.0/example/solr/mycollection/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_1,generation=1}
commit{dir=NRTCachingDirectory(MMapDirectory@/opt/solr-4.8.0/example/solr/mycollection/data/index lockFactory=NativeFSLockFactory@/opt/solr-4.8.0/example/solr/mycollection/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_2,generation=2}
2799884 [qtp1781256139-18] INFO org.apache.solr.core.SolrCore – newest commit generation = 2
2799887 [qtp1781256139-18] INFO org.apache.solr.core.SolrCore – SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
2799887 [qtp1781256139-18] INFO org.apache.solr.update.UpdateHandler – end_commit_flush
2799888 [qtp1781256139-18] INFO org.apache.solr.update.processor.LogUpdateProcessor – [mycollection] webapp=/solr path=/update params={update.distrib=FROMLEADER&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://192.168.150.90:8983/solr/mycollection/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false} {commit=} 0 96
2800051 [qtp1781256139-14] INFO
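On the question of creating the collection: in SolrCloud a new collection is normally created through the Collections API rather than by hand-placing core directories. A minimal sketch (host, names, and counts are illustrative, and this assumes the config set has already been uploaded to ZooKeeper; it can be selected with collection.configName):

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=4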
Re: Error enquiry- exceeded limit of maxWarmingSearchers=2
I’m also not sure I understand the practical purpose of your hard/soft auto commit settings. You are stating the following:

Every 10 seconds I want data written to disk, but not be searchable.
Every 15 seconds I want data to be written into memory and searchable.

I would consider whether your soft commit window is too long, or if you can lengthen your hard commit period. It’s typical to see hard commits occur *less* frequently than soft commits.

On May 30, 2014, at 11:04 AM, Shawn Heisey s...@elyograg.org wrote:

On 5/29/2014 9:55 PM, M, Arjun (NSN - IN/Bangalore) wrote:

Thanks a lot for your nice explanation.. Now I understood the difference between autoCommit and autoSoftCommit.. Now my config looks like below.

  <autoCommit>
    <maxDocs>1</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>15000</maxTime>
  </autoSoftCommit>

With this now I am getting some other error like this.

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for 140142167803912812800030383128128 expected=1469497192978841608 actual=1469497212082847746

This sounds like you are including the _version_ field in your document when you index. You probably shouldn't be doing that. Here's what that field is for, and how it works:

http://heliosearch.org/solr/optimistic-concurrency/

Thanks,
Shawn
Re: Error enquiry- exceeded limit of maxWarmingSearchers=2
I just realized I failed my own reading comprehension :)  You have maxDocs, not maxTime, for hard commit. Please disregard.

On May 30, 2014, at 1:46 PM, Jason Hellman jhell...@innoventsolutions.com wrote:

I’m also not sure I understand the practical purpose of your hard/soft auto commit settings. You are stating the following:

Every 10 seconds I want data written to disk, but not be searchable.
Every 15 seconds I want data to be written into memory and searchable.

I would consider whether your soft commit window is too long, or if you can lengthen your hard commit period. It’s typical to see hard commits occur *less* frequently than soft commits.

On May 30, 2014, at 11:04 AM, Shawn Heisey s...@elyograg.org wrote:

On 5/29/2014 9:55 PM, M, Arjun (NSN - IN/Bangalore) wrote:

Thanks a lot for your nice explanation.. Now I understood the difference between autoCommit and autoSoftCommit.. Now my config looks like below.

  <autoCommit>
    <maxDocs>1</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>15000</maxTime>
  </autoSoftCommit>

With this now I am getting some other error like this.

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for 140142167803912812800030383128128 expected=1469497192978841608 actual=1469497212082847746

This sounds like you are including the _version_ field in your document when you index. You probably shouldn't be doing that. Here's what that field is for, and how it works:

http://heliosearch.org/solr/optimistic-concurrency/

Thanks,
Shawn
Re: Enforcing a hard timeout on shard requests?
Gregg,

I don’t have an answer to your question, but I’m very curious what use case you have that permits such arbitrary partial results. Is it just an edge case or do you want to permit a common occurrence?

Jason

On May 30, 2014, at 3:05 PM, Gregg Donovan gregg...@gmail.com wrote:

I'd like to add a hard timeout on some of my sharded requests. E.g.: for about 30% of the requests, I want to wait no longer than 120ms before a response comes back, but aggregating results from as many shards as possible in that 120ms.

My first attempt was to use timeAllowed=120&shards.tolerant=true. This sort of works, in that I'll see partial results occasionally, but slow shards will still take much longer than my timeout to return, sometimes up to 700ms. I imagine if the CPU is busy or the node is GC-ing that it won't be able to enforce the timeAllowed and return.

Is there a way to enforce this timeout without failing the request entirely? I'd still like to get as many shards to return in 120ms as I can, even if they have partialResults.

Thanks.

--Gregg
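For reference, the combination Gregg describes looks like this on a request (values illustrative); when the time limit trips, Solr flags the response header with partialResults=true:

  q=some query&timeAllowed=120&shards.tolerant=true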
Re: Solr interface
This. And so much this. As much this as you can muster.

On Apr 7, 2014, at 1:49 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

The speed of ingest via HTTP improves greatly once you do two things:

1. Batch multiple documents into a single request.
2. Index with multiple threads at once.

Michael Della Bitta
Applications Developer
o: +1 646 532 3062
appinions inc. | The Science of Influence Marketing
18 East 41st Street, New York, NY 10017
t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions
w: appinions.com http://www.appinions.com/

On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins danwcoll...@gmail.com wrote:

I have to agree with Shawn. We have a SolrCloud setup with 256 shards, ~400M documents in total, with 4-way replication (so it's quite a big setup!). I had thought that HTTP would slow things down, so we recently trialed a JNI approach (clients are C++) so we could call SolrJ and get the benefits of JavaBin encoding for our indexing.

Once we had done benchmarks with both solutions, I think we saved about 1ms per document (on average) with JNI, so it wasn't as big a gain as we were expecting. There are other benefits of SolrJ (zookeeper integration, better routing, etc.) and we were doing local HTTP (so it was literally just a TCP port to localhost, no actual net traffic), but that just goes to prove what other posters have said here. Check whether HTTP really *is* the bottleneck before you try to replace it!

On 7 April 2014 17:05, Shawn Heisey s...@elyograg.org wrote:

On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:

Do you mean to tell me that the people on this list that are indexing 100s of millions of documents are doing this over http? I have been using custom Lucene code to index files, as I thought this would be faster for many documents and I wanted some non-standard OCR and index fields. Is there a better way? To the OP: You can also use Lucene to locally index files for Solr.

My sharded index has 94 million docs in it. All normal indexing and maintenance is done with SolrJ, over http. Currently full rebuilds are done with the dataimport handler loading from MySQL, but that is legacy. This is NOT a SolrCloud installation. It is also not a replicated setup -- my indexing program keeps both copies up to date independently, similar to what happens behind the scenes with SolrCloud.

The single-thread DIH is very well optimized, and is faster than what I have written myself -- also single-threaded. The real reason that we still use DIH for rebuilds is that I can run the DIH simultaneously on all shards. A full rebuild that way takes about 5 hours. A SolrJ process feeding all shards with a single thread would take a lot longer. Once I have time to work on it, I can make the SolrJ rebuild multi-threaded, and I expect it will be similar to DIH in rebuild speed. Hopefully I can make it faster.

There is always overhead with HTTP. On a gigabit LAN, I don't think it's high enough to matter.

Using Lucene to index files for Solr is an option -- but that requires writing a custom Lucene application, and knowledge about how to turn the Solr schema into Lucene code. A lot of users on this list (me included) do not have the skills required. I know SolrJ reasonably well, but Lucene is a nut that I haven't cracked.

Thanks,
Shawn
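A minimal SolrJ sketch of Michael's two points, batching documents into a single request and indexing from several threads at once (the URL, field names, and sizes are illustrative; the SolrJ 4.x API is assumed):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
      public static void main(String[] args) throws Exception {
          // One server object can be shared across threads.
          final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
          ExecutorService pool = Executors.newFixedThreadPool(4); // point 2: several threads
          for (int b = 0; b < 100; b++) {
              final int batch = b;
              pool.submit(new Runnable() {
                  public void run() {
                      try {
                          // point 1: many documents in a single request
                          List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                          for (int i = 0; i < 1000; i++) {
                              SolrInputDocument doc = new SolrInputDocument();
                              doc.addField("id", batch + "-" + i);
                              doc.addField("title", "document " + i);
                              docs.add(doc);
                          }
                          server.add(docs); // one HTTP round trip for the whole batch
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                  }
              });
          }
          pool.shutdown(); // commits are left to the server's autoCommit settings
      }
  }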
Re: Exact fragment length in highlighting
Juan,

Pay close attention to the boundary scanner you’re employing:

http://wiki.apache.org/solr/HighlightingParameters#hl.boundaryScanner

You can be explicit to indicate a type (hl.bs.type) with options such as CHARACTER, WORD, SENTENCE, and LINE. The default is WORD (as the wiki indicates) and I presume this is what you are employing.

Be careful about using explicit characters. I had an interesting case of highlight returns that looked like this:

  This is a highlight
  Here is another highlight
  Yes, another one

etc… It was a bit maddening trying to figure out why an unexpected character was in the highlight…turned out it was XML content and the character boundary clipped a trailing character based on the boundary rules.

In any case, you should be able to achieve a pretty flexible result depending on what you’re really after with the right combination of settings.

Jason

On Feb 19, 2014, at 7:53 AM, Juan Carlos Serrano jcserran...@gmail.com wrote:

Hello everybody,

I'm using Solr 4.6.1 and I'd like to know if there's a way to determine exactly the number of characters of a fragment used in highlights. If I use hl.fragsize=70, the length of the fragments I get varies, and I often get results 90 characters in length.

Regards and thanks in advance,

Juan Carlos
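For concreteness, a request along these lines might look as follows (field name illustrative). Note that the hl.bs.* parameters apply to the FastVectorHighlighter, which requires the field to be indexed with termVectors, termPositions, and termOffsets:

  hl=true&hl.fl=content&hl.fragsize=70&hl.useFastVectorHighlighter=true&hl.bs.type=SENTENCE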
Re: Caching Solr boost functions?
Gregg,

The queryResultCache caches a sorted int array of results matching a query. This should overlap very nicely with your desired behavior, as a hit in this cache will not perform a Lucene query nor need to calculate scores.

Now, ‘for the life of the Searcher’ is the trick here. You can size your cache large enough to ensure it can fit every possible query, but at some point this is untenable. I would argue that high volatility of query parameters would invalidate the need for caching anyway, but that’s clearly debatable. Nevertheless, this should work admirably well to solve your needs.

Jason

On Feb 18, 2014, at 11:32 AM, Gregg Donovan gregg...@gmail.com wrote:

We're testing out a new handler that uses edismax with three different boost functions. One has a random() function in it, so is not very cacheable, but the other two boost functions do not change from query to query. I'd like to tell Solr to cache those boost queries for the life of the Searcher so they don't get recomputed every time. Is there any way to do that out of the box?

In a different custom QParser we have, we wrote a CachingValueSource that wrapped a ValueSource with a custom ValueSource cache. Would it make sense to implement that as a standard Solr function so that one could do:

  boost=cache(expensiveFunctionQuery())

Thanks.

--Gregg
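The cache Jason refers to is sized in solrconfig.xml; a sketch with illustrative values:

  <queryResultCache class="solr.LRUCache"
                    size="4096"
                    initialSize="1024"
                    autowarmCount="256"/>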
Re: Solr Autosuggest - Strange issue with leading numbers in query
Here’s a rather obvious question: have you rebuilt your spell index recently? Is it possible the offending numbers snuck into the spell dictionary? The terms component will show you what’s in your current, searchable field…but not the dictionary.

If my memory serves correctly, with collate=true this would allow for such behavior to occur, especially with onlyMorePopular set to false (which would ensure the resulting collation has a query count greater than the current query). Have you flipped onlyMorePopular to true to confirm?

On Feb 18, 2014, at 10:16 AM, bbi123 bbar...@gmail.com wrote:

Thanks a lot for your response Erik.

I was trying to find if I have any suggestion starting with numbers using the terms component but I couldn't find any.. It's very strange!!!

Anyways, thanks again for your response.
Re: block join and atomic updates
Thinking in terms of normalized data in the context of a Lucene index is dangerous. It is not a relational data model technology, and the join behaviors available to you have limited use. Each approach requires compromises that are likely impermissible for certain use cases.

If it is at all reasonable to consider, you will likely be best served by de-normalizing the data. Of course, your specific details may prove an exception to this rule…but generally this approach works very well.

On Feb 18, 2014, at 4:19 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

absolutely.

On Tue, Feb 18, 2014 at 1:20 PM, m...@preselect-media.com wrote:

But isn't query time join much slower when it comes to a large amount of documents?

Quoting Mikhail Khludnev mkhlud...@griddynamics.com:

Hello,

It sounds like you need to switch to query time join.

On 15.02.2014 at 21:57, m...@preselect-media.com wrote:

Any suggestions?

Quoting m...@preselect-media.com:

Yonik Seeley yo...@heliosearch.com:

On Thu, Feb 13, 2014 at 8:25 AM, m...@preselect-media.com wrote:

Is there any workaround to perform atomic updates on blocks or do I have to re-index the parent document and all its children always again if I want to update a field?

The latter, unfortunately.

Is there any plan to change this behavior in the near future?

So, I'm thinking of alternatives without losing the benefit of block join. I try to explain an idea I just thought about:

Let's say I have a parent document A with a number of fields I want to update regularly and a number of child documents AC_1 ... AC_n which are only indexed once and aren't going to change anymore. So, if I index A and AC_* in a block and I update A, the block is gone. But if I create an additional document AF which only contains something like a foreign key to A and index AF + AC_* as a block (not A + AC_* anymore), could I perform a {!parent ... } query on AF + AC_* and make a join from the results to get A?

Does this make any sense and is it even possible? ;-) And if it's possible, how can I do it?

Thanks,
- Moritz

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
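For readers unfamiliar with the syntax under discussion, a block join parent query looks roughly like this (field names and values are hypothetical; it assumes every parent document carries a marker field identifying all parents):

  q={!parent which="doc_type:parent"}child_color:white

This returns parent documents whose children match child_color:white. The query-time join Mikhail mentions is the separate {!join from=... to=...} query parser, which does not require the documents to be indexed together in blocks.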
Re: Solr server requirements for 100+ million documents
Whether you use the same machines as Solr or separate machines is a matter suited to taste. If you are the CTO, then you should make this decision. If not, inform management that risk conditions are greater when you share function and control on a single piece of hardware. A single failure of a replica + zookeeper node will be more impactful than a single failure of a replica *or* a zookeeper node. Let them earn the big bucks to make the risk decision.

The good news is, zookeeper hardware can be extremely lightweight for SolrCloud. Commodity hardware should work just fine…and thus scaling to 5 nodes for zookeeper is not that hard at all.

Jason

On Feb 11, 2014, at 3:00 PM, svante karlsson s...@csi.se wrote:

ZK needs a quorum to stay functional, so 3 servers handles one failure and 5 handles 2 node failures. If you run Solr with 1 replica per shard then stick to 3 ZK. If you use 2 replicas use 5 ZK.
Re: Memory Usage on Windows Os while indexing
To a very large extent, the capability of a platform is measurable by the skill of the team administering it. If core competencies lie in Windows OS then I would wager heavily the platform will outperform a similar Linux OS installation in the long haul.

All things being equal, it’s really hard to argue with Linux. But nothing is ever equal.

On Jan 21, 2014, at 8:57 PM, Shawn Heisey s...@elyograg.org wrote:

On 1/21/2014 2:17 AM, onetwothree wrote:

Does Solr on Linux have better memory management than on Windows, or can you neglect this comparison?

As Toke said, this is indeed debatable. I personally believe that Linux is better at almost everything, but if you're running a recent 64-bit Windows Server OS, you may not actually see a lot of difference. Microsoft has VERY talented people working for them, and even though I won't use it for most server applications, Windows is a very capable platform.

If you ignore personal bias and proceed with the idea that Linux and Windows are approximately equal in terms of real-world performance, then one factor that might be critical is price. Linux can be installed for zero cost; a standalone bare-metal Windows Server license is several hundred dollars, sometimes more.

Thanks,
Shawn
Re: how to best convert some term in q to a fq
I second this notion. My reasoning focuses mostly on maintainability, where I posit that your client code will be far easier to extend/modify/troubleshoot than any effort spent attempting to do this within Solr.

Jason

On Dec 23, 2013, at 12:07 PM, Joel Bernstein joels...@gmail.com wrote:

I would suggest handling this in the client. You could write custom Solr code also, but it would be more complicated because you'd be working with Solr's APIs.

Joel Bernstein
Search Engineer at Heliosearch

On Mon, Dec 23, 2013 at 2:36 PM, jmlucjav jmluc...@gmail.com wrote:

Hi,

I have this scenario that I think is not unusual: Solr will get a user-entered query string like 'apple pear france'. I need to do this: if any of the terms is a country, then change the query params to move that term to a fq, i.e.:

  q=apple pear france

becomes

  q=apple pear&fq=country:france

What do you guys think would be the best way to implement this?

- custom SearchComponent or query parser
- servlet in same jetty as solr
- client code

To simplify, consider countries are just a single term. Any pointer to an example to base this on would be great.

thanks
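A minimal sketch of the client-side option (the country list is illustrative, and URL-encoding is omitted for readability):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Locale;
  import java.util.Set;

  public class QueryRewriter {
      // Illustrative list; a real one would come from a gazetteer or database.
      private static final Set<String> COUNTRIES =
          new HashSet<String>(Arrays.asList("france", "germany", "spain"));

      // Turns "apple pear france" into "q=apple pear&fq=country:france".
      public static String rewrite(String userQuery) {
          List<String> keep = new ArrayList<String>();
          StringBuilder filters = new StringBuilder();
          for (String term : userQuery.trim().split("\\s+")) {
              if (COUNTRIES.contains(term.toLowerCase(Locale.ROOT))) {
                  filters.append("&fq=country:").append(term); // move country terms to fq
              } else {
                  keep.add(term);
              }
          }
          StringBuilder q = new StringBuilder("q=");
          for (int i = 0; i < keep.size(); i++) {
              if (i > 0) q.append(' ');
              q.append(keep.get(i));
          }
          return q.toString() + filters;
      }

      public static void main(String[] args) {
          System.out.println(rewrite("apple pear france")); // q=apple pear&fq=country:france
      }
  }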
Re: Problem with size of segments
David,

I find Mike McCandless’ blog article to be very informative. Give it a go and let us know if you are still seeking clarification:

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Jason

On Nov 7, 2013, at 5:09 AM, david.dav...@correo.aeat.es wrote:

Hi,

I have a very big index, 337 GB more or less, and I am using Solr 4.2. The problem we have is related to the size of segments. This is the size of the biggest ones: 324 GB, 3.7 GB, 3.6 GB, 1.6 GB, 1.6 GB, 465 MB ...

We have LogByteSizeMergePolicy with 10 as mergeFactor in our solrconfig. Really the issue is not a problem, but at least I would like to know why my segments have this size. According to what I have read in papers, if I have a mergeFactor of 10, each level within the index should be one order of magnitude bigger than the previous. So, I can't understand why I have a segment of 324 GB while the others are only 3 GB; this is 2 orders of magnitude bigger. Is this correct or is it a problem with my index? Where can I read a good explanation about the merge policy?

Thank you very much,

Regards,

David Dávila
AEAT
Re: Function query matching
You can, of course, use a function range query:

  select?q=text:news&fq={!frange l=0 u=100}sum(x,y)

http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html

This will give you a bit more flexibility to meet your goal.

On Nov 7, 2013, at 7:26 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

Function queries score (all) documents, but don't filter them. All documents effectively match a function query.

Erik

On Nov 7, 2013, at 1:48 PM, Peter Keegan peterlkee...@gmail.com wrote:

Why does this function query return docs that don't match the embedded query?

  select?qq=text:news&q={!func}sum(query($qq),0)
Re: Replacing Google Mini Search Appliance with Solr?
Nutch is an excellent option. It should feel very comfortable for people migrating away from the Google appliances. Apache Droids is another possible approach, and I’ve found people using Heritrix or ManifoldCF for various use cases (and usually in combination with other use cases where the extra overhead was worth the trouble).

I think the simplest approach will be Nutch…it’s absolutely worth taking a shot at it. DO NOT write a crawler! That is a rabbit hole you do not want to peer down into :)

On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote:

Hi Eric,

We have also helped some government institutions to replace their expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index to Apache Solr. It is very effective, robust and scales easily with Hadoop if you have to. Nutch may not be the easiest tool for the job but is very stable, feature rich and has an active community here at Apache.

Cheers,

-Original message-
From: Palmer, Eric epal...@richmond.edu
Sent: Wednesday 30th October 2013 18:48
To: solr-user@lucene.apache.org
Subject: Replacing Google Mini Search Appliance with Solr?

Hello all,

Been lurking on the list for awhile. We are at the end of life for replacing two Google Mini search appliances used to index our public web sites. Google is no longer selling the Mini appliances and buying the big appliance is not cost beneficial.

http://search.richmond.edu/

We would run a Solr replacement on Linux (CentOS, RedHat, similar) with OpenJDK or Oracle Java.

Background
==
~130 sites
only ~12,000 pages (at a depth of 3)
probably ~40,000 pages if we go to a depth of 4

We use key matches a lot. In Solr terms these are elevated documents (elevations).

We would code a search query form in PHP and wrap it into our design (http://www.richmond.edu).

I have played with and love LucidWorks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection.

So with Solr what are my open source options, and what are people's experiences crawling and indexing web sites with Solr + a crawler? I understand there is no crawler bundled with Solr, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it.

thanks in advance for any information.

--
Eric Palmer
Web Services
U of Richmond
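If it helps anyone sizing this up: with Nutch 1.x of that era, a crawl that posts into Solr was roughly a one-liner (paths, URL, and depth are illustrative, and this assumes the seed URLs are listed in a urls/ directory):

  bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50000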
Re: When is/should qf different from pf?
It is probable that with no additional boost to pf fields the sum of the scores will be higher. But it is *possible* that they are not, and adding a boost to pf gives greater probability that they will be.

All of this bears testing to confirm what search use cases merit what level of boost. No boost value is universally right…so YMMV, etc...

On Oct 29, 2013, at 9:30 AM, xavier jmlucjav jmluc...@gmail.com wrote:

I am confused: wouldn't a doc that matches both the phrase and the term queries have a better score than a doc matching only the term query, even if qf and pf are the same??

On Mon, Oct 28, 2013 at 7:54 PM, Upayavira u...@odoko.co.uk wrote:

There'd be no point having them the same. You're likely to include boosts in your pf, so that docs that match the phrase query as well as the term query score higher than those that just match the term query. Such as:

  qf=text description&pf=text^2 description^4

Upayavira

On Mon, Oct 28, 2013, at 05:44 PM, Amit Nithian wrote:

Thanks Erick. Numeric fields make sense, as I guess would strictly string fields too, since it's one term? In the normal text searching case, though, does it make sense to have qf and pf differ?

Thanks
Amit

On Oct 28, 2013 3:36 AM, Erick Erickson erickerick...@gmail.com wrote:

The facetious answer is when phrases aren't important in the fields. If you're doing a simple boolean match, adding phrase fields will add expense to no good purpose, etc. Phrases on numeric fields seem wrong.

FWIW,
Erick

On Mon, Oct 28, 2013 at 1:03 AM, Amit Nithian anith...@gmail.com wrote:

Hi all,

I have been using Solr for years but never really stopped to wonder: when using the dismax/edismax handler, when do you have the qf different from the pf?

I have always set them to be the same (maybe different weights) but I was wondering if there is a situation where you would have a field in the qf not in the pf or vice versa.

My understanding from the docs is that qf is a term-wise hard filter while pf is a phrase-wise boost of documents that made it past the qf filter.

Thanks!
Amit
Re: Reclaiming disk space from (large, optimized) segments
If I gauge Otis’ intent here, it is to create shards on the basis of intervals of time. A shard represents a single interval (let’s say a year’s worth of data) and when that data is no longer necessary it is simply shut down and no longer included in queries.

So, for example, you could have three shards spanning the years 2011, 2012, and 2013 respectively. When you no longer need 2011 you simply remove the shard. My example is simple…adjust based upon your needs.

On Oct 29, 2013, at 8:42 AM, Gun Akkor gun.ak...@carbonblack.com wrote:

Otis,

Thank you for your response. Could you elaborate a bit more on what you have in mind when you say time-based indices?

Gun

---
Senior Software Engineer
Carbon Black, Inc.
gun.ak...@carbonblack.com

On Thu, Oct 24, 2013 at 11:56 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Only skimmed your email, but "purge every 4 hours" jumped out at me. Would it make sense to have time-based indices that can be periodically dropped instead of being purged?

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Oct 23, 2013 10:33 AM, Scott Lundgren scott.lundg...@carbonblack.com wrote:

*Background:*

- Our use case is to use SOLR as a massive FIFO queue.
- Document additions and updates happen continuously. Documents are being added at a sustained rate of 50 - 100 documents per second.
- About 50% of these documents are updates to existing docs, indexed using atomic updates: the original doc is thus deleted and re-added.
- There is a separate purge operation running every four hours that deletes the oldest docs, if required based on a number of unrelated configuration parameters.
- At some time in the past, a manual force merge / optimize with maxSegments=2 was run to troubleshoot high disk I/O and remove "too many segments" as a potential variable. Currently, the largest fdts are 74G and 43G. There are 47 total segments; the largest other sizes are all around 2G.
- Merge policies are all at Solr 4 defaults. Index size is currently ~50M maxDocs, ~35M numDocs, 276GB.

*Issue:*

The background purge operation is deleting docs on schedule, but the disk space is not being recovered.

*Presumptions:*

I presume, but have not confirmed (how?), the 15M deleted documents are predominately in the two large segments. Because they are largely in the two large segments, and those large segments still have (some/many) live documents, the segment backing files are not deleted.

*Questions:*

- When will those segments get merged and documents recovered? Does it happen when _all_ the documents in those segments are deleted? Some percentage of the segment is filled with deleted documents?
- Is there a way to do it right now vs. just waiting?
- In some cases, the purge delete conditional is _just_ free disk space: when index free space is low, delete oldest. Those setups are now in scenarios where index free space is low, and getting worse. How does low disk space affect the above two questions?
- Is there a way for me to determine stats on a per-segment basis? For example, how many deleted documents are in a particular segment?
- On the flip side, can I determine in what segment a particular document is located?

Thank you,

Scott

--
Scott Lundgren
Director of Engineering
Carbon Black, Inc.
(210) 204-0483 | scott.lundg...@carbonblack.com
Re: SOLRJ replace document
Keep in mind that DataStax has a custom update handler, and as such isn't exactly a vanilla Solr implementation (even though in many ways it still is). Since updates are co-written to Cassandra and Solr, you should always tread a bit carefully when slightly outside what they perceive to be norms.

On Oct 18, 2013, at 7:21 PM, Brent Ryan brent.r...@gmail.com wrote:

So I think the issue might be related to the tech stack we're using, which is SOLR within DataStax Enterprise, which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using SolrJ... Anyways, I've contacted support so let's see what they say.

On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote:

On 10/18/2013 3:36 PM, Brent Ryan wrote:

My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details.

If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map.

If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it?

Thanks,
Shawn
Re: field title_ngram was indexed without position data; cannot run PhraseQuery
If you consider what n-grams do, this should make sense to you. Consider the following piece of data:

  White iPod

If the field is fed through a bigram filter (n-gram with size of 2) the resulting token stream would appear as such:

  wh hi it te ip po od

The usual use of n-grams is to match those partial tokens, essentially giving you a great deal of power in creating non-wildcard partial matches. How you use this is up to your imagination, but one easy use is in partial matches for autosuggest features.

I can't speak for the intent behind the way it's coded, but it makes a great deal of sense to me that positional data would be seen as unnecessary, since the intent of n-grams typically doesn't collide with phrase searches. If you need both behaviors it's far better to use copyField and have one field dedicated to standard tokenization and token filters, and another field for n-grams.

I hope that's useful to you.

On Oct 15, 2013, at 6:14 AM, MC videm...@gmail.com wrote:

Hello,

Could someone explain (or perhaps provide a documentation link) what the following error means:

  field title_ngram was indexed without position data; cannot run PhraseQuery

I'll do some more searching online; I was just wondering if anyone has encountered this error before, and what the possible solution might be. I've recently upgraded my version of Solr from 3.6.0 to 4.5.0, I'm not sure if this has any bearing or not.

Thanks,

M
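A sketch of the copyField arrangement described above (type and field names are illustrative): phrase queries would go against title, partial-match queries against title_ngram.

  <fieldType name="text_ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    </analyzer>
  </fieldType>

  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="title_ngram" type="text_ngram" indexed="true" stored="false"/>
  <copyField source="title" dest="title_ngram"/>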
Re: Concurent indexing
The limitations on how many threads you can use to load data are primarily driven by factors on your hardware: CPU, heap usage, I/O, and the like. It is common for most index load processes to be able to handle more incoming data on the Solr side of the equation than can typically be loaded from the source repository. You'll have to explore a bit to find the limits, but if your hardware is sufficient you can likely load a great deal.

As for commits, they will indeed commit anything added to Solr regardless of the thread of the update. Keep this in mind if you have a rollback concept in mind, or if you're measuring your incremental load to restart in case of error/failure. Presuming you want more control, and if you are multi-threading index updates, it may be useful to have a delegate handle the commit process…or on a large data load, consider a commit at the end.

On Oct 14, 2013, at 6:44 AM, maephisto my_sky...@yahoo.com wrote:

Hi,

I have a collection (numShards=3, replicationFactor=2) split on 2 machines. Since the amount of data I have to index is huge, I would like to start multiple instances of the same process that would index data to Solr. Is there any limitation or counter-indication in this area?

The indexing client is custom built by me and parses files (each instance parses a different file), and the uniqueId is auto-generated.

Would a commit in a process also commit the uncommitted changes created by another process?
Re: Update existing documents when using ExtractingRequestHandler?
As an endorsement of the option Erick likes, the primary benefit I see to processing through your own code is better error-, exception-, and logging-handling, which is trivial for you to write.

Consider that your code could reside on any server, either receiving the data through a PUSH or PULLing it from your web server (as suits your needs), and thus offloads the effort from your busy web server. In the long run, this will be a more flexible, adaptable solution that meets future needs with minimal effort. Further, it typically doesn't require a Solr expert to write, so you can find plenty of people to help on this as future needs dictate.

On Oct 10, 2013, at 4:21 AM, Erick Erickson erickerick...@gmail.com wrote:

1 - puts the work on the Solr server though.
2 - This is just a SolrJ program, could be run anywhere. See: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/ It would give you the most flexibility to offload the Tika processing to N other machines.
3 - This could work, but you'd then be indexing every document twice as well as loading the server with the Tika work. And you'd have to store all the fields.

Personally I like 2...

FWIW,
Erick

On Wed, Oct 9, 2013 at 11:50 AM, Jeroen Steggink jer...@stegg-inc.com wrote:

Hi,

In a content management system I have a document and an attachment. The document contains the meta data and the attachment the actual data. I would like to combine data of both in one Solr document.

I have thought of several options:

1. Using ExtractingRequestHandler I would extract the data (extractOnly) and combine it with the meta data and send it to Solr. But this might be inefficient and increase the network traffic.
2. Separate Tika installation and use that to extract and send the data to Solr. This would stress an already busy web server.
3. First upload the file using ExtractingRequestHandler, then use atomic updates to add the other fields.

Or is there another way? First add the meta data and later use the ExtractingRequestHandler to add the file contents?

Cheers,

Jeroen

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
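Along the lines of option 2, a rough sketch of a standalone SolrJ-plus-Tika indexer (the URL, file, and field names are hypothetical; SolrJ/Tika APIs of the 4.x era assumed):

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class ExtractAndIndex {
      public static void main(String[] args) throws Exception {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

          // Run Tika locally, away from the busy web server.
          AutoDetectParser parser = new AutoDetectParser();
          BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
          Metadata metadata = new Metadata();
          InputStream in = new FileInputStream(new File("attachment.pdf"));
          try {
              parser.parse(in, text, metadata, new ParseContext());
          } finally {
              in.close();
          }

          // Combine CMS metadata and extracted attachment text in one document.
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-1");
          doc.addField("title", "title from the CMS record");
          doc.addField("content", text.toString());
          server.add(doc);
          server.commit();
      }
  }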
Re: Field with default value and stored=false, will be reset back to the default value in case of updating other fields
The best use case I see for atomic updates typically involves avoiding transmission of large documents for small field updates. If you are updating a readCount field of a PDF document that is 1MB in size, you avoid resending the 1MB PDF document's data in order to increment the readCount field.

If, instead, we're talking about 5K database records then there's plenty of argument to be made that the whole document should just be retransmitted, and thus avoid the (potentially) unnecessary cost of storing all fields. As in everything, we face compromises…the question is which one better suits your needs.

On Oct 10, 2013, at 5:07 AM, Erick Erickson erickerick...@gmail.com wrote:

bq: so what is the point of having atomic updates if i need to update everything?

_nobody_ claims this is ideal; it does solve a certain use-case. We'd all like true partial updates that didn't require stored fields. The use-case here is that you don't have access to the system-of-record so you don't have a choice.

See the JIRA about stacked segments for update-without-storing-fields work.

Best,
Erick

On Thu, Oct 10, 2013 at 12:09 AM, Shawn Heisey elyog...@elyograg.org wrote:

On 10/9/2013 8:39 PM, deniz wrote:

Billnbell wrote
You have to update the whole record including all fields...

so what is the point of having atomic updates if i need to update everything?

If you have any regular fields that are not stored, atomic updates will not work -- unstored field data will be lost. If you have copyField destination fields that *are* stored, atomic updates will not work as expected with those fields. The wiki spells out the requirements:

http://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations

An atomic update is just a shortcut for "read all existing fields from the original document, apply the atomic updates, and re-insert the document, overwriting the original."

Thanks,
Shawn
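For concreteness, the readCount increment described above would look like this as a JSON atomic update (the id and field name are illustrative); Solr carries over all other stored fields:

  curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
  [{"id": "pdf-42", "readCount": {"inc": 1}}]'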
Re: Solr auto suggestion not working
Very specifically, what is the field definition that is being used for the suggestions?

On Oct 10, 2013, at 5:49 AM, Furkan KAMACI furkankam...@gmail.com wrote:

What is your configuration for auto suggestion?

2013/10/10 ar...@skillnetinc.com:

Hi,

We are encountering an issue in the Solr search auto-suggestion feature. Here is the problem statement with an example:

We have a product named 'Apple iphone 5s - 16 GB'. Now when we type 'Apple' or 'iphone' in the search box, this product name comes up in the suggestion list. But when we type 'iphone 5s', no result comes in the suggestion list. Even when we type only '5s', no result comes.

Please help us in resolving this issue; it is occurring on the production environment and impacting the client's business.

Regards,
Arun
Re: How to achieve distributed spelling check in SolrCloud ?
The shards.qt parameter is the easiest one to forget, with the most dramatic of consequences!

On Oct 8, 2013, at 11:10 AM, shamik sham...@gmail.com wrote:

James,

Thanks for your reply. The shards.qt did the trick. I read the documentation earlier but was not clear on the implementation; now it totally makes sense.

Appreciate your help.

Regards,
Shamik
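For anyone landing here later: shards.qt tells the shard sub-requests which handler to hit, so a distributed spellcheck against a /spell handler looks roughly like this (hosts and handler name are illustrative):

  /spell?q=helo world&spellcheck=true&shards=host1:8983/solr,host2:8983/solr&shards.qt=/spell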
Re: Delete a field - Atomic updates (SOLR 4.1.0) without using null=true
I don't know if there's a way to accomplish your goal directly, but as a pure workaround, you can write a routine to fetch all the stored values and resubmit the document without the field in question. This is what atomic updates do, minus the overhead of the transmission.

On Oct 7, 2013, at 11:15 AM, SolrLover bbar...@gmail.com wrote:

I am using SOLR 4.1.0 and perform atomic updates on SOLR documents. Unfortunately there is a bug in 4.1.0 (https://issues.apache.org/jira/browse/SOLR-4297) that blocks me from using null=true for deleting a field through the atomic update functionality.

Is there any other way to delete a field other than using this syntax?

FYI, I won't be able to migrate to the latest version now due to a company code freeze, hence trying to figure out a temporary workaround.
Re: Adding OR operator in querystring and grouping fields?
fq=here:there OR this:that

For the lurker: an AND should be:

  fq=here:there&fq=this:that

While you can, technically, pass:

  fq=here:there AND this:that

Solr will cache the separate fq= parameters and reuse them in any context. The AND(ed) filter will be cached as a single entry and only used when the same AND construct is sent. Perhaps useful, not as generally desirable.

On Oct 7, 2013, at 2:10 PM, Jack Krupansky j...@basetechnology.com wrote:

Combine the two filter queries with an explicit OR operator.

-- Jack Krupansky

-Original Message-
From: PeterKerk
Sent: Monday, October 07, 2013 1:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Adding OR operator in querystring and grouping fields?

Ok thanks.

"you must combine them into one filter query parameter", how would I do that? Can I simply change the URL structure or must I change my schema.xml and/or data-config.xml?
Re: Some text not indexed in solr4.4
Utkarsh,

Check to see if the value is actually indexed into the field by using the terms request handler:

  http://localhost:8983/solr/terms?terms.fl=text&terms.prefix=d

(adjust the prefix to whatever you're looking for)

This should get you going in the right direction.

Jason

On Sep 17, 2013, at 2:20 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

I have a copyField called allText with type text_general: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L68

I have ~100 documents which have the text: dyson and dc44 or dc41 etc. For example:

title: Dyson DC44 Animal Digital Slim Cordless Vacuum
description: The DC44 Animal is the new Dyson Digital Slim vacuum cleaner, the cordless machine that doesn’t lose suction. It has been engineered for floor to ceiling cleaning. DC44 Animal has a detachable long-reach wand which is balanced for floor to ceiling cleaning. The motorized floor tool has twice the power of the DC35 floor tool to drive the bristles deeper into the carpet pile with more force. It attaches to the wand or directly to the machine for cleaning awkward spaces. The brush bar has carbon fiber filaments for removing fine dust from hard floors. DC44 Animal has a run time of 20 minutes or 8 minutes on Boost mode. Powered by the Dyson digital motor, DC44 Animal has a fade-free nickel manganese cobalt battery and Root Cyclone technology for constant powerful suction.
UPC: 0879957006362

The documents are indexed. Analysis says it's indexed: http://i.imgur.com/O52ino1.png

But when I search for allText:dyson dc44, I get no results. Response: http://pastie.org/8334220

Any suggestions about the problem? I am out of ideas about how to debug this.

--
Thanks,
-Utkarsh
Re: JSON update request handler commitWithin
They have modified the mechanisms for committing documents…Solr in DSE is not stock Solr...so you are likely encountering a boundary where stock Solr behavior is not fully supported. I would definitely reach out to them to find out if they support the request.

On Sep 5, 2013, at 8:27 AM, Ryan, Brent br...@cvent.com wrote:

Ya, looks like this is a bug in DataStax Enterprise 3.1.2. I'm using their enterprise cluster search product which is built on SOLR 4. :(

On 9/5/13 11:24 AM, Jack Krupansky j...@basetechnology.com wrote:

I just tried commitWithin with the standard Solr example in Solr 4.4 and it works fine. Can you reproduce your problem using the standard Solr example in Solr 4.4?

-- Jack Krupansky

From: Ryan, Brent
Sent: Thursday, September 05, 2013 10:39 AM
To: solr-user@lucene.apache.org
Subject: JSON update request handler commitWithin

I'm prototyping a search product for us and I was trying to use the commitWithin parameter for posting updated JSON documents like so:

  curl -v 'http://localhost:8983/solr/proposal.solr/update/json?commitWithin=1' --data-binary @rfp.json -H 'Content-type:application/json'

However, the commit never seems to happen, as you can see below there are still 2 docsPending (even 1 hour later). Is there a trick to getting this to work with submitting to the json update request handler?
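Against stock Solr, commitWithin can also be set through SolrJ rather than as a URL parameter; a minimal sketch (the URL, id, and interval are illustrative):

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.UpdateRequest;
  import org.apache.solr.common.SolrInputDocument;

  public class CommitWithinExample {
      public static void main(String[] args) throws Exception {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "rfp-1");

          UpdateRequest req = new UpdateRequest();
          req.add(doc);
          req.setCommitWithin(10000); // ask Solr to commit within 10 seconds
          req.process(server);
      }
  }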
Re: data/index naming format
The circumstance in which I've most typically seen the index.timestamp directory show up is when an update is sent to a slave server. The replication then appears to preserve the updated slave index in a separate folder while still respecting the correct data from the master.

On Sep 5, 2013, at 8:03 PM, Shawn Heisey s...@elyograg.org wrote:

On 9/5/2013 6:48 PM, Aditya Sakhuja wrote:

I am running Solr 4.1 for now, and am confused about the structure and naming of the contents of the data dir. I do not see the index.properties being generated on a fresh Solr node start either.

Can someone clarify when one should expect to see data/index vs. data/index.timestamp, and the index.properties along with the second version?

I have never seen an index.properties file get created. I've used versions from 1.4.0 through 4.4.0. Generally when you have an index.timestamp directory, it's because you're doing replication. There may be other circumstances when it appears, but I do not know what those are.

As for the other files in the index directory, here's Lucene's file format documentation:

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description

Thanks,
Shawn
Re: SolrCloud Set up
One additional thought here: from a paranoid risk-management perspective it's not a good idea to have two critical services dependent upon a single point of failure if the hardware fails. Obviously risk management is suited to taste, so you may feel the cost/benefit does not merit the separation. But it's good to make that decision consciously…you'd hate to have to justify a failure here after-the-fact as something overlooked :)

On Aug 30, 2013, at 9:40 AM, Shawn Heisey s...@elyograg.org wrote:

On 8/30/2013 9:43 AM, Jared Griffith wrote:

One last thing. Is there any real benefit in running SolrCloud and Zookeeper separately? I am seeing some funkiness with the separation of the two, funkiness I wasn't seeing when running SolrCloud + Zookeeper together as outlined in the Wiki.

For a robust install, you want zookeeper to be a separate process. It can run on the same server as Solr, but the embedded zookeeper (-DzkRun) should not be used except for dev and proof-of-concept work.

The reason is simple. Zookeeper is the central coordinator for SolrCloud. In order for it to remain stable, it should not be restarted without good reason. If you are running zookeeper as part of Solr, then you will be affecting zookeeper operation anytime you restart that instance of Solr.

Making changes to your Solr setup often requires that you restart Solr. This includes upgrading Solr and changing some aspects of its configuration. Some configuration aspects can be changed with just a collection reload, but others require a full application restart.

Thanks,
Shawn
Re: Indexing hangs when more than 1 server in a cluster
Kevin, I wouldn't have considered using softCommits at all based on what I understand from your use case. You appear to be loading in large batches, and softCommits are better aligned to NRT search where there is a steady stream of smaller updates that need to be available immediately. As Erick pointed out, soft commits are all about avoiding constant reopening of the index searcher…where by constant we mean every few seconds. Provided you can wait until your batch is completed, and that frequency is roughly a minute or more, you likely will find an old-fashioned hard commit (with openSearcher=true) will work just fine (YMMV). Jason On Aug 14, 2013, at 4:51 AM, Erick Erickson erickerick...@gmail.com wrote: right, SOLR-5081 is possible but somewhat unlikely given the fact that you actually don't have very many nodes in your cluster. soft commits aren't relevant to the tlog, but here's the thing. Your tlogs may get replayed when you restart solr. If they're large, this may take a long time. When you said you restarted Solr after killing it, you might have triggered this. The way to keep tlogs small is to hard commit more frequently (you should look at their size before worrying about it though!). If you set openSearcher=false, this is pretty inexpensive, all it really does is close the current segment files, open new ones, and start a new tlog file. It does _not_ invalidate caches, do autowarming, all that expensive stuff. Your soft commit does _not_ improve performance! It is just less expensive than a hard commit with openSearcher=true. It _does_ invalidate caches, fire off autowarming, etc. So it does improve performance over doing hard commits with openSearcher=true with the same frequency, but it still isn't free. It's still good to have the soft commit interval as long as you can tolerate. It's perfectly reasonable to have a hard commit interval that's much shorter than your soft commit interval. As Yonik explained once, soft commits are about visibility but hard commits are about durability. Best Erick On Wed, Aug 14, 2013 at 2:20 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: Interesting, that did work. Do you or anyone else have any ideas or what I should look at? While soft commit is not a requirement in my project, my understanding is that it should help performance. On the same index, I will be doing both a large number of queries as well as updates. If I have to disable autoCommit, should I increase the chunk size? Of course, I will have to run a more large scale test tomorrow, but I saw this problem fairly consistently in my smaller test. In a previous experiment, I applied the SOLR-4816 patch that someone indicated might help. I also reduced the CSV upload chunk size to 500. It seemed like things got a little better, but still eventually hung. I also see SOLR-5081, but I don't know if that is my issue or not. At least in my test, the index writes are not parallel as in the ticket. -Kevin On Tue, Aug 13, 2013 at 8:40 PM, Jason Hellman jhell...@innoventsolutions.com wrote: While I don't have a past history of this issue to use as reference, if I were in your shoes I would consider trying your updates with softCommit disabled. My suspicion is you're experiencing some issue with the transaction logging and how it's managed when your hard commit occurs. If you can give that a try and let us know how that fares we might have some further input to share. On Aug 13, 2013, at 11:54 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: I am using Solr Cloud 4.4. 
It is pretty much a base configuration. We have 2 servers and 3 collections. Collection1 is 1 shard and Collection2 and Collection3 both have 2 shards. Both servers are identical. So, here is my process: I do a lot of queries on Collection1 and Collection2. I then do a bunch of inserts into Collection3. I am doing CSV uploads. I am also doing custom shard routing. All the products in a single upload will have the same shard key. All Solr interaction is through SolrJ with full Zookeeper awareness. My uploads are also using soft commits. I tried this on a record set of 936 products. Everything worked fine. I then sent over a record set of 300k products. The upload into Collection3 is chunked. I tried both 1000 and 200,000 with similar results. The first upload to Solr would just hang. There would simply be no response from Solr. A few of the products from this request would make it into the index, but not many. In this state, queries continued to work, but deletes did not. My only solution was to kill each Solr process. As an experiment, I did the large catalog first. First, I reset everything. With a chunk size of 1000, about 110,000 out of 300,000 records made it into Solr before the process hung. Again, queries worked, but deletes did not and I had to kill Solr. It hung after about 30 seconds.
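For reference, a sketch of the commit arrangement suggested above (intervals illustrative; tune against your own batch cadence):

<autoCommit>
  <maxTime>15000</maxTime>            <!-- truncate tlogs and flush segments regularly -->
  <openSearcher>false</openSearcher>  <!-- cheap: no cache invalidation or autowarming -->
</autoCommit>

…which keeps transaction logs small without the expensive searcher work, followed by an explicit hard commit (solr/update?commit=true) once the batch completes to make it all visible.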
Re: Facet field display name
It's been my experience that using the convenient feature to change the output key still doesn't save you from having to map it back to the underlying field name in order to trigger the filter query. With that in mind, it just makes more sense to me to leave the effort in the View portion of the design.

On Aug 12, 2013, at 6:34 AM, Peter Sturge peter.stu...@gmail.com wrote: 2c worth: We do lots of facet lookups to allow 'prettyprint' versions of facet names. We do this on the client-side, though. The reason is that the lookups can then be different for different locations/users etc. - makes it easy for localization. It's also very easy to implement such a lookup, without having to disturb the innards of Solr...

On Mon, Aug 12, 2013 at 2:25 PM, Erick Erickson erickerick...@gmail.com wrote: Have you seen the key parameter here: http://wiki.apache.org/solr/SimpleFacetParameters#key_:_Changing_the_output_key It allows you to label the output key anything you want, and since these are field names, this seems do-able. Best, Erick

On Mon, Aug 12, 2013 at 4:02 AM, Aleksander Akerø aleksan...@gurusoft.no wrote: Hi I wondered if there was some way to configure a display name for facet fields. Either that or some way to display nordic letters without it messing up the faceting. Say I wanted a facet field called område (Norwegian; area in English). Then I would have to create the field something like this in schema.xml: <field name="omrade" type="string" indexed="true" stored="true" required="false"/> But then I would have to do a replace to show a prettier name in the frontend. It would be preferred not to do this sort of hardcoding, as I would have to do this for all the facet fields. Either that or I could try encoding the 'å' like this: <field name="omr&#229;de" type="string" indexed="true" stored="true" required="false"/> Then it will show up with a pretty name, but the faceting will fail. Maybe this is due to encoding issues, seen as the frontend is encoded with ISO-8859-1? So does anyone have a good practice for getting this sort of problem working properly, or a way to define an alternative display name for a facet field that I could display instead of the field.name? *Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no
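As a concrete illustration of the mapping problem (borrowing the omrade field from the thread above; facet value illustrative):

facet.field={!key=Område}omrade

will label the facet Område in the response, but the UI must still map that label back to the real field when it builds the filter link:

fq=omrade:somevalue

…which is why the View layer ends up owning the key-to-field mapping either way.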
Re: Indexing hangs when more than 1 server in a cluster
While I don't have a past history of this issue to use as reference, if I were in your shoes I would consider trying your updates with softCommit disabled. My suspicion is you're experiencing some issue with the transaction logging and how it's managed when your hard commit occurs. If you can give that a try and let us know how that fares we might have some further input to share.

On Aug 13, 2013, at 11:54 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: I am using Solr Cloud 4.4. It is pretty much a base configuration. We have 2 servers and 3 collections. Collection1 is 1 shard and Collection2 and Collection3 both have 2 shards. Both servers are identical. So, here is my process: I do a lot of queries on Collection1 and Collection2. I then do a bunch of inserts into Collection3. I am doing CSV uploads. I am also doing custom shard routing. All the products in a single upload will have the same shard key. All Solr interaction is through SolrJ with full Zookeeper awareness. My uploads are also using soft commits. I tried this on a record set of 936 products. Everything worked fine. I then sent over a record set of 300k products. The upload into Collection3 is chunked. I tried both 1000 and 200,000 with similar results. The first upload to Solr would just hang. There would simply be no response from Solr. A few of the products from this request would make it into the index, but not many. In this state, queries continued to work, but deletes did not. My only solution was to kill each Solr process. As an experiment, I did the large catalog first. First, I reset everything. With a chunk size of 1000, about 110,000 out of 300,000 records made it into Solr before the process hung. Again, queries worked, but deletes did not and I had to kill Solr. It hung after about 30 seconds. Timing-wise, this is at about the second autocommit cycle, given the default autocommit of 15 seconds. I am not sure if this is related or not. As an additional experiment, I ran the entire test with just a single node in the cluster. This time, everything ran fine. Does anyone have any ideas? Everything is pretty default. These servers are Azure VMs, although I have seen similar behavior running two Solr instances on a single internal server as well. I had also noticed similar behavior before with Solr 4.3. It definitely has something to do with the clustering, but I am not sure what. And I don't see any error message (or really anything else) in the Solr logs. Thanks. -- *KEVIN OSBORN* LEAD SOFTWARE ENGINEER CNET Content Solutions OFFICE 949.399.8714 CELL 949.310.4677 SKYPE osbornk 5 Park Plaza, Suite 600, Irvine, CA 92614
Re: Spelling suggestions.
The majority of the behavior outlined in that wiki page should work just fine with 3.5.0. Note that only a few items there are marked as Solr 4.0 only (DirectSolrSpellChecker and WordBreakSolrSpellChecker, for example).

On Aug 9, 2013, at 6:26 AM, Kamaljeet Kaur kamal.kaur...@gmail.com wrote: Hello, I have just configured apache-solr with my django project, and it's working fine with very simple and basic searching. I want to add spelling suggestions if the user misspells any word in the string entered. In this particular mailing-list, I searched for it. Many have given the link: http://wiki.apache.org/solr/SpellCheckComponent#head-78f5afcf43df544832809abc68dd36b98152670c But I am using version 3.5.0, and that page is for version 1.3. Should I follow this tutorial, or is there one available for solr version 3.5.0? Thanks Kamaljeet Kaur -- View this message in context: http://lucene.472066.n3.nabble.com/Spelling-suggestions-tp4083519.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Phrase query with prefix query
Or shingles, presuming you want to tokenize and output unigrams. On Aug 2, 2013, at 11:33 AM, Walter Underwood wun...@wunderwood.org wrote: Search against a field using edge N-grams. --wunder On Aug 2, 2013, at 11:16 AM, T. Kuro Kurosaka wrote: Is there a query parser that supports a phrase query with prefix query at the end, such as San Fran* ? -- - T. Kuro Kurosaka • Senior Software Engineer
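A rough sketch of the edge n-gram approach mentioned here (type name, field wiring, and gram sizes all illustrative):

<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With the city name indexed into such a field, a plain query for san fran matches San Francisco with no wildcard at all, because every prefix of the indexed value was emitted as a term.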
Re: restricting a query by a set of field values
Ben, This could be constructed as so:

fl=date_deposited&fq=date:[2013-07-01T00:00:00Z TO 2013-07-31T23:59:00Z]&fq=collection_id:(1 2 n)&q.op=OR

The parentheses around the 1 2 n set indicate a boolean query, and we're ensuring they are OR'd together via the q.op parameter. This should get you the result set you desire. Be aware that a very large boolean set (your IN(…) parameter) may be expensive to run. Jason

On Jul 29, 2013, at 7:33 AM, Benjamin Ryan benjamin.r...@manchester.ac.uk wrote: Hi, Is it possible to construct a query in SOLR to perform a query that is restricted to only those documents that have a field value in a particular set of values, similar to what would be done in Postgres with the SQL query: SELECT date_deposited FROM stats WHERE date BETWEEN '2013-07-01 00:00:00' AND '2013-07-31 23:59:00' AND collection_id IN () In my SOLR schema.xml date_deposited is a TrieDateField and collection_id is an IntField Regards, Ben -- Dr Ben Ryan Jorum Technical Manager 5.12 Roscoe Building The University of Manchester Oxford Road Manchester M13 9PL Tel: 0160 275 6039 E-mail: benjamin.r...@manchester.ac.uk https://outlook.manchester.ac.uk/owa/redir.aspx?C=b28b5bdd1a91425abf8e32748c93f487&URL=mailto%3abenjamin.ryan%40manchester.ac.uk --
Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1
Nitin, You need to ensure the fields you wish to see are marked stored=true in your schema.xml file, and you should include fields in your fl= parameter (fl=*,score is a good place to start). Jason On Jul 29, 2013, at 8:08 AM, Nitin Agarwal 2nitinagar...@gmail.com wrote: Hi, I am using Solr 4.3.1 with 2 Shards and replication factor of 1, running on apache tomcat 7.0.42 with external zookeeper 3.4.5. When I query select?q=*:* I only get the number of documents found, but no actual document. When I query with rows=0, I do get correct count of documents in the index. Faceting queries as well as group by queries also work with rows=0. However, when rows is not equal to 0 I do not get any documents. When I query the index I see that a query is being sent to both shards, and subsequently I see a query being sent with just ids, however, after that query returns I do not see any documents back. Not sure what do I need to change, please help. Thanks, Nitin
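For reference, the two halves of that advice look like this (field name illustrative):

<field name="title" type="text_general" indexed="true" stored="true"/>

…in schema.xml, and fl=*,score on the select request.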
Re: solr - set fileds as default search field
Or use the copyField technique to a single searchable field and set df= to that field. The example schema does this with the field called text.

On Jul 29, 2013, at 8:35 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi, df is a single valued parameter. Only one field can be a default field. To query multiple fields use the (e)dismax query parser: http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29

From: Mysurf Mail stammail...@gmail.com To: solr-user@lucene.apache.org Sent: Monday, July 29, 2013 6:31 PM Subject: solr - set fileds as default search field The following query works well for me http://[]:8983/solr/vault/select?q=VersionComments%3AWhite returns all the documents where version comments includes White. I try to omit the field name and put it as a default value as follows: In solr config I write:

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">PackageName</str>
    <str name="df">Tag</str>
    <str name="df">VersionComments</str>
    <str name="df">VersionTag</str>
    <str name="df">Description</str>
    <str name="df">SKU</str>
    <str name="df">SKUDesc</str>
  </lst>
</requestHandler>

I restart the solr and create a full import. Then I try using http://[]:8983/solr/vault/select?q=White (where http://[]:8983/solr/vault/select?q=VersionComments%3AWhite still works). But I don't get any document as an answer. What am I doing wrong?
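A sketch of the copyField approach (assuming the catch-all field is named text, as in the example schema):

<copyField source="PackageName" dest="text"/>
<copyField source="Tag" dest="text"/>
<copyField source="VersionComments" dest="text"/>
<!-- ...one line per searchable source field... -->

…and then a single <str name="df">text</str> in the handler defaults.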
Re: solr 4.3, autocommit, maxdocs
Jonathan, Please note the openSearcher=false part of your configuration. This is why you don't see documents. The commits are occurring, and being written to segments on disk, but they are not visible to the search engine because a new Solr searcher has not been opened over them. You can either change the value to true, or alternatively issue an explicit commit at the end of your load (a solr/update?commit=true will default to openSearcher=true). Hope that's of use! Jason

On Jul 15, 2013, at 9:52 AM, Jonathan Rochkind rochk...@jhu.edu wrote: I have a solr 4.3 instance I am in the process of standing up. It started out with an empty index. I have in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running, that has currently added around 400k documents to Solr. I had expected that a 'commit' would be run every 100k documents, from the above configuration, so 4 commits would have been run by now, and I'd see documents in the index. However, when I look in the Solr admin interface, at my core's 'overview' page, it still says num docs 0, segment count 0, when I expected num docs 400k at this point. Is there something I'm misunderstanding about the configuration or the admin interface? Or am I right in my expectations, but something else must be going wrong? Thanks for any advice, Jonathan
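For reference, the explicit commit at the end of a load is a one-liner (core URL illustrative):

curl 'http://localhost:8983/solr/update?commit=true'

or, equivalently, as an update message: curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<commit/>'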
Re: Using the Schema API from SolrJ
Steven, Some information can be gleaned from the system admin request handler: http://localhost:8983/solr/admin/system I am specifically looking at this: lst name=corestr name=schemaexample/str Mind you, that is a manually-set value in the schema file. But just in case you want to get crazy you can also call the file admin request handler: http://localhost:8983/solr/admin/file?file=schema.xml …and parse the whole stinking thing :) Jason On Jul 6, 2013, at 1:59 PM, Steven Glass steven.gl...@zekira.com wrote: Does anyone have any idea how I can access the schema version info using SolrJ? Thanks. On Jul 3, 2013, at 4:16 PM, Steven Glass wrote: I'm using a Solr 4.3 server and accessing it from both a Java based desktop application using SolrJ and an Android based mobile application using my home-grown REST adaptor. I'm trying to make sure that versions of the application are synchronized with updates to the server (too often testers forget to update an app when the server changes). I want to read the schema version from the server and make sure it is the expected value. This was very easy to do using my home-grown REST adaptor. The wiki examples at http://wiki.apache.org/solr/SchemaRESTAPI were sufficient. Unfortunately, I cannot figure out how to do the equivalent with SolrJ. I suspect that there is a really simple approach but I'm just missing it. Thanks in advance for any guidance you can offer. Best regards, Steven Glass
Re: Surprising score?
Also consider using the SweetSpotSimilarityFactory class, which allows you to still engage normalization but control how intrusive it is. This, combined with the ability to set a custom Similarity class on a per-fieldType basis, may be extremely useful. More info: http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html Jason

On Jul 5, 2013, at 5:59 AM, pravesh suyalprav...@yahoo.com wrote: Is there a way to omitNorms and still be able to use {!boost b=boost}? OR you could leave omitNorms=false as usual and have your custom Similarity implementation with the length normalization method overridden to use a constant value of 1. Regards Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/Surprising-score-tp4075436p4075722.html Sent from the Solr - User mailing list archive at Nabble.com.
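A sketch of what that per-fieldType wiring might look like (parameter values purely illustrative; tune them to your own document length distribution):

<fieldType name="text_sweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.SweetSpotSimilarityFactory">
    <!-- documents between min and max terms take no length-norm penalty -->
    <int name="lengthNormMin">1</int>
    <int name="lengthNormMax">50</int>
    <float name="lengthNormSteepness">0.5</float>
  </similarity>
</fieldType>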
Re: 2.1billion+ document
Saqib: At the simplest level:
1) Source the machine
2) Install Java
3) Install a servlet container of your choice
4) Copy your Solr WAR and conf directories as desired (probably a rough mirror of your current single server)
5) Start it up and start sending data there
6) Query both by simply adding: shards=host1/solr/collection,host2/solr/collection (a fuller example follows below)
7) Profit

Or, in shorthand:
1) Install a new Solr instance and start indexing data there
2) Add the shards parameter to your queries with both (or more) servers
3) …
4) Profit

Now…we usually want to be concerned about how to manage the data so that we don't send duplicates. Without SolrCloud it is our responsibility to delegate traffic for updates and deletes. We also like to think a bit more about how to take advantage of our lovely parallelism to improve indexing or query throughput. We should also consider strategies to isolate domain data to single shards, so as to allow isolated queries against dedicated data models. But if you just want the basics, it really is as easy as described above. Jason

On Jul 5, 2013, at 7:36 PM, Ali, Saqib docbook@gmail.com wrote: Hello Otis, I was thinking more in terms of Solr DistributedSearch rather than SolrCloud. I was hoping to add another Solr instance when the time comes. This is a low use application, but with a lot of data. Uptime and query speed are not of importance. However we would like to be able to index more than 2.1b documents when the time comes. Any advice will be highly appreciated. Thanks!!! :) Saqib

On Fri, Jul 5, 2013 at 6:23 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, It's a broad question, but it starts with getting a few servers, putting Solr 4.3.1 on it (soon 4.4), setting up Zookeeper, creating a Solr Collection (index) with N shards and M replicas, and reindexing your old data to this new cluster, which you can expand with new nodes over time. If you have specific questions... Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Fri, Jul 5, 2013 at 8:42 PM, Ali, Saqib docbook@gmail.com wrote: Question regarding the 2.1 billion+ document. I understand that a single instance of solr has a limit of 2.1 billion documents. We currently have a single solr server. If we reach the 2.1 billion document limit, what is involved in moving to Solr DistributedSearch? Thanks! :)
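To make the shards step concrete (hostnames and ports illustrative), the fanned-out query looks like:

http://host1:8983/solr/collection1/select?q=*:*&shards=host1:8983/solr/collection1,host2:8983/solr/collection1

Each listed shard is queried in parallel and the results are merged by the server receiving the request.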
Re: how to replicate Solr Cloud
Kevin, I can imagine this working if you consider your second data center a pure slave relationship to your SolrCloud cluster. I haven't tried it, but I don't see why the solrconfig.xml can't identify the node as a master, allowing you to call any of your cores in the cluster to replicate out. That being said, this idea doesn't facilitate a SolrCloud cluster in the second data center…just a slave that could be a repeater. You say that sending the data in both directions is not ideal, but it works and is conceptually very simple. What is the reasoning behind wanting to get away from that approach? Jason

On Jun 25, 2013, at 10:07 AM, Kevin Osborn kevin.osb...@cbsi.com wrote: We are going to have two datacenters, each with their own SolrCloud and ZooKeeper quorums. The end result will be that they should be replicas of each other. One method that has been mentioned is that we should add documents to each cluster separately. For various reasons, this may not be ideal for us. Instead, we are playing around with the idea of always indexing to one datacenter. And then having that replicate to the other datacenter. And this is where I am having some trouble on how to proceed. The nice thing about SolrCloud is that there are no masters and slaves. Each node is equal, has the same configs, etc. But in this case, I want to have a node in one datacenter poll for changes in another datacenter. Before SolrCloud, I would have used slave/master replication. But in the SolrCloud world, I am not sure how to configure this setup. Or are there any better ideas on how to use replication to push or pull data from one datacenter to another? In my case, NRT is not a requirement. And I will also be dealing with about 3 collections and 5 or 6 shards. Thanks. -- *KEVIN OSBORN* LEAD SOFTWARE ENGINEER CNET Content Solutions OFFICE 949.399.8714 CELL 949.310.4677 SKYPE osbornk 5 Park Plaza, Suite 600, Irvine, CA 92614
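A sketch of the master half of that idea, untested as noted above (file list illustrative):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

…with the core in the second data center configured as a slave (or repeater) whose masterUrl points at the corresponding core in the cluster.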
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Vinay, What autoCommit settings do you have for your indexing process? Jason On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote: Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size(kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 179963 max locked memory(kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 32769 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time(seconds, -t) unlimited max user processes (-u) 14 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited On Mon, Jun 24, 2013 at 12:47 PM, Yago Riveiro yago.rive...@gmail.comwrote: Hi, I have the same issue too, and the deploy is quasi exact like than mine, http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862 With some concurrence and batches of 10 solr apparently have some deadlock distributing updates Can you dump the configuration of the ulimit on your servers?, some people had the same issues because they are reach the ulimit maximum defined for descriptor and process. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, June 24, 2013 at 7:49 PM, Vinay Pothnis wrote: Hello All, I have the following set up of solr cloud. * solr version 4.3.1 * 3 node solr cloud + replciation factor 2 * 3 zoo keepers * load balancer in front of the 3 solr nodes I am seeing this strange behavior when I am indexing a large number of documents (10 mil). When I have more than 3-5 threads sending documents (in batch of 20) to solr, sometimes solr goes into a hung state. After this all the update requests get timed out. What we see via AppDynamics (a performance monitoring tool) is that there are a number of threads that are stalled. The stack trace for one of the threads is shown below. The cluster has to be restarted to recover from this. When I reduce the concurrency to 1, 2, 3 threads, then the indexing goes through smoothly. Any pointers as to what could be wrong here? We send the updates to one of the nodes in the solr cloud through a load balancer. Thanks Vinay Thread Name:qtp2141131052-78 ID:78 Time:Fri Jun 21 23:20:22 GMT 2013 State:WAITING Priority:5 sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks. 
LockSupport.park(LockSupport.java:186) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) java.util.concurrent.Semaphore.acquire(Semaphore.java:317) org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368) org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300) org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96) org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462) org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178) org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179) org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83) org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Vinay, You may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote: I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds. On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Vinay, What autoCommit settings do you have for your indexing process? Jason On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote: Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 179963 max locked memory(kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 32769 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time(seconds, -t) unlimited max user processes (-u) 14 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited On Mon, Jun 24, 2013 at 12:47 PM, Yago Riveiro yago.rive...@gmail.com wrote: Hi, I have the same issue too, and the deploy is quasi exact like than mine, http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862 With some concurrence and batches of 10 solr apparently have some deadlock distributing updates Can you dump the configuration of the ulimit on your servers?, some people had the same issues because they are reach the ulimit maximum defined for descriptor and process. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, June 24, 2013 at 7:49 PM, Vinay Pothnis wrote: Hello All, I have the following set up of solr cloud. * solr version 4.3.1 * 3 node solr cloud + replciation factor 2 * 3 zoo keepers * load balancer in front of the 3 solr nodes I am seeing this strange behavior when I am indexing a large number of documents (10 mil). When I have more than 3-5 threads sending documents (in batch of 20) to solr, sometimes solr goes into a hung state. After this all the update requests get timed out. What we see via AppDynamics (a performance monitoring tool) is that there are a number of threads that are stalled. The stack trace for one of the threads is shown below. The cluster has to be restarted to recover from this. When I reduce the concurrency to 1, 2, 3 threads, then the indexing goes through smoothly. Any pointers as to what could be wrong here? We send the updates to one of the nodes in the solr cloud through a load balancer. Thanks Vinay Thread Name:qtp2141131052-78 ID:78 Time:Fri Jun 21 23:20:22 GMT 2013 State:WAITING Priority:5 sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks. 
LockSupport.park(LockSupport.java:186) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) java.util.concurrent.Semaphore.acquire(Semaphore.java:317) org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418) org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368) org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300) org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96) org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462) org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178) org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179) org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83) org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359
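Tying this back to the advice above, a sketch of the dual-trigger autoCommit (both limits illustrative; whichever threshold is hit first triggers the commit):

<autoCommit>
  <maxTime>30000</maxTime>   <!-- at most 30 seconds between hard commits -->
  <maxDocs>10000</maxDocs>   <!-- ...or at most 10,000 uncommitted documents -->
  <openSearcher>false</openSearcher>
</autoCommit>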
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Scott, My comment was meant to be a bit tongue-in-cheek, but my intent in the statement was to represent hard failure along the lines Vinay is seeing. We're talking about OutOfMemoryException conditions, total cluster paralysis requiring restart, or other similarly disastrous conditions. Where that line sits is impossible to define generically, but trivial to reach. What any of us running Solr has to achieve is a realistic simulation of our desired production load (probably well above peak) and to see what limits are reached. Armed with that information we tweak. In this case, we look at finding the point where data ingestion reaches a natural limit. For some that may be JVM GC, for others memory buffer size on the client load, and for yet others it may be I/O limits on multithreaded reads from a database or file system.

In old Solr days we had a little less to worry about. We might play with a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial commits and rollback recoveries. But with 4.x we now have more durable write options and NRT to consider, and SolrCloud begs to use them. So we have to consider transaction logs, the file handles they leave open until commit operations occur, and how we want to manage writing to all cores simultaneously instead of a narrower master/slave relationship. It's all manageable, all predictable (with some load testing), and all filled with many possibilities to meet our specific needs.

Considering that each person's data model, ingestion pipeline, request processors, and field analysis steps will be different, 5 threads of input at face value doesn't really contemplate the whole problem. We have to measure our actual data against our expectations and find where the weak chain links are in order to strengthen them. The symptoms aren't necessarily predictable in advance of this testing, but they're likely addressable and not difficult to decipher. For what it's worth, SolrCloud is new enough that we're still experiencing some uncharted territory with unknown ramifications, but with continued dialog through channels like these there are fewer territories without good cartography :) Hope that's of use! Jason

On Jun 24, 2013, at 7:12 PM, Scott Lundgren scott.lundg...@carbonblack.com wrote: Jason, Regarding your statement "push you over the edge" - what does that mean? Does it mean uncharted territory with unknown ramifications or something more like specific, known symptoms? I ask because our use is similar to Vinay's in some respects, and we want to be able to push the capabilities of write perf - but not over the edge! In particular, I am interested in knowing the symptoms of failure, to help us troubleshoot the underlying problems if and when they arise. Thanks, Scott

On Monday, June 24, 2013, Jason Hellman wrote: Vinay, You may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason

On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote: I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds.

On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Vinay, What autoCommit settings do you have for your indexing process?
Jason On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote: Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 179963 max locked memory(kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 32769 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time(seconds, -t) unlimited max user processes (-u) 14 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited On Mon, Jun 24, 2013 at 12:47 PM, Yago Riveiro yago.rive...@gmail.com wrote: Hi, I have the same issue too, and the deploy is quasi exact like than mine, http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862 With some concurrence and batches of 10 solr apparently have some deadlock distributing updates Can you dump the configuration of the ulimit on your servers?, some people had the same issues because they are reach the ulimit maximum defined for descriptor and process
Re: Restarting SOLR will remove all cache?
Shalin, There's one point to test without caches, which is to establish how much value a cache actually provides. For me, this primarily means providing a benchmark by which to decide when to stop obsessing over caches. But yes, for load testing I definitely agree :) Jason On Jun 21, 2013, at 11:01 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There are no disk caches as such. There is no point in testing without caches. Also, Lucene has field caches required for sorting which cannot be turned off. On Fri, Jun 21, 2013 at 11:22 PM, Learner bbar...@gmail.com wrote: I have a very simple question. Does restarting SOLR removes all caches (including disk caches if any?). I have disabled all caches in solrconfig.xml but even then I see that there is some caching happening all the time. I am currently doing some performance testing and I dont want cache to play any role now.. -- View this message in context: http://lucene.472066.n3.nabble.com/Restarting-SOLR-will-remove-all-cache-tp4072200.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shalin Shekhar Mangar.
Re: in Solr 3.5, optimization increase the index size to double
And let's not forget the interesting bug in MMapDirectory: http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/store/MMapDirectory.html

NOTE: memory mapping uses up a portion of the virtual memory address space in your process equal to the size of the file being mapped. Before using this class, be sure you have plenty of virtual address space, e.g. by using a 64 bit JRE, or a 32 bit JRE with indexes that are guaranteed to fit within the address space. On 32 bit platforms also consult setMaxChunkSize(int) if you have problems with mmap failing because of fragmented address space. If you get an OutOfMemoryException, it is recommended to reduce the chunk size, until it works. Due to this bug in Sun's JRE, MMapDirectory's IndexInput.close() is unable to close the underlying OS file handle. Only when GC finally collects the underlying objects, which could be quite some time later, will the file handle be closed. This will consume additional transient disk usage: on Windows, attempts to delete or overwrite the files will result in an exception; on other platforms, which typically have a delete on last close semantics, while such operations will succeed, the bytes are still consuming space on disk. For many applications this limitation is not a problem (e.g. if you have plenty of disk space, and you don't rely on overwriting files on Windows) but it's still an important limitation to be aware of.

If you're measuring by directory size (and not explicitly by the viewable files) you may very well be seeing this. Jason

On Jun 16, 2013, at 4:53 AM, Erick Erickson erickerick...@gmail.com wrote: Optimizing will _temporarily_ double the index size, but it shouldn't be permanent. Is it possible that you have inadvertently told Solr to keep an extra snapshot? I think it's numberToKeep in your replication handler, but I'm going from memory here. Best Erick

On Fri, Jun 14, 2013 at 2:15 AM, Montu v Boda montu.b...@highqsolutions.com wrote: Hi, I have replicated my index from 1.4 to 3.5, and after replication I tried to optimize the index in 3.5 with the below URL: http://localhost:9002/solr35/collection1/update?optimize=true&commit=true When I optimize the index in 3.5, it increases the index size to double. In 1.4 the size of the index is 428GB, and after optimization in 3.5 it becomes 791GB. Thanks Regards Montu v Boda -- View this message in context: http://lucene.472066.n3.nabble.com/in-Solr-3-5-optimization-increase-the-index-size-to-double-tp4070433.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering down terms in suggest
Aloke, It may be best to simply run a query to populate the suggestion list. While not as fast as the terms component (and suggester offshoots) it can still be tuned to be very, very fast. In this way, you can generate any fq/q combination required to meet your needs. You can play with wildcard searches, or better yet NGram (EdgeNGram) behavior to get the right suggestion data back. I would suggest an additional core to accomplish this (fed via replication) to avoid cache entry collision with your normal queries. Hope that's useful to you. Jason On Jun 12, 2013, at 7:43 AM, Aloke Ghoshal alghos...@gmail.com wrote: Barani - the fq option doesn't work. Jason - the dynamic field option won't work due to the high number of groups and users. On Wed, Jun 12, 2013 at 1:12 AM, Jason Hellman jhell...@innoventsolutions.com wrote: Aloke, If you do not have a factorial problem in the combination of userid and groupid (which I can imagine you might) you could consider creating a field for each combination (u1g1, u2g2) which can easily be done via dynamic fields. Use CopyField to get data into these various constructs (again, easily configured via wildcard patterns) and then send the suggestion query to the right field. Obviously this will get out of hand if you have too many of these...so this has limits. Jason On Jun 11, 2013, at 8:29 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi, Trying to find a way to filter down the suggested terms set based on the term value of another indexed field? Let's say we have the following documents indexed in Solr: userid:1, groupid:1, content:alpha beta gamma userid:2, groupid:1, content:alternate better garden userid:3, groupid:2, content:altruism bent garner Now a query on (with a dictionary built using terms in the content field): q:groupid:1 AND content:al should suggest alpha alternate, (not altruism, since it has a different groupid). The option to have a separate dictionary per group gets ruled out due to the high number of distinct groups (50K+). Kindly suggest ways to get this working. Thanks, Aloke
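A sketch of such a suggestion query (assuming content is copied into an edge-ngrammed field, here hypothetically named suggest, on the dedicated core):

/solr/suggestcore/select?q=suggest:al&fq=groupid:1&fl=content&rows=10

The fq clause performs the per-group restriction that the terms component cannot.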
Re: Filtering down terms in suggest
Aloke, If you do not have a factorial problem in the combination of userid and groupid (which I can imagine you might) you could consider creating a field for each combination (u1g1, u2g2) which can easily be done via dynamic fields. Use CopyField to get data into these various constructs (again, easily configured via wildcard patterns) and then send the suggestion query to the right field. Obviously this will get out of hand if you have too many of these...so this has limits. Jason On Jun 11, 2013, at 8:29 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi, Trying to find a way to filter down the suggested terms set based on the term value of another indexed field? Let's say we have the following documents indexed in Solr: userid:1, groupid:1, content:alpha beta gamma userid:2, groupid:1, content:alternate better garden userid:3, groupid:2, content:altruism bent garner Now a query on (with a dictionary built using terms in the content field): q:groupid:1 AND content:al should suggest alpha alternate, (not altruism, since it has a different groupid). The option to have a separate dictionary per group gets ruled out due to the high number of distinct groups (50K+). Kindly suggest ways to get this working. Thanks, Aloke
Re: Two instances of solr - the same datadir?
Roman, Could you be more specific as to why replication doesn't meet your requirements? It was geared explicitly for this purpose, including the automatic discovery of changes to the data on the index master. Jason

On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote: OK, so I have verified the two instances can run alongside, sharing the same datadir. All update handlers are inaccessible in the read-only master:

<updateHandler class="solr.DirectUpdateHandler2" enable="${solr.can.write:true}">

java -Dsolr.can.write=false .

And I can reload the index manually: curl http://localhost:5005/solr/admin/cores?wt=json&action=RELOAD&core=collection1 But this is not an ideal solution; I'd like for the read-only server to discover index changes on its own. Any pointers? Thanks, roman

On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello, I need your expert advice. I am thinking about running two instances of solr that share the same data directory. The *reason* being: the indexing instance is constantly rebuilding its cache after every commit (we have a big cache) and this slows it down. But indexing doesn't need much RAM, only the search does (and the server has lots of CPUs). So, it is like having two solr instances: 1. solr-indexing-master 2. solr-read-only-master. In the solrconfig.xml I can disable update components; it should be fine. However, I don't know how to 'trigger' index re-opening on (2) after the commit happens on (1). Ideally, the second instance could monitor the disk and re-open the index after new files appear there. Do I have to implement a custom IndexReaderFactory? Or something else? Please note: I know about replication; this usecase is IMHO slightly different - in fact, the write-only-master (1) is also a replication master. Googling turned out only this http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 - no pointers there. But if I am approaching the problem wrongly, please don't hesitate to 're-educate' me :) Thanks! roman
Re: Can mm (min-match) be specified by field in dismax or edismax?
Well, there is a hack(ish) way to do it:

_query_:"{!type=edismax qf='someField' v='$q' mm='100%'}"

This is clearly not a solrconfig.xml setting, but part of your query string using LocalParams behavior. This is going to get really messy if you have plenty of fields you'd like to search, where you'd need a similar construct for each. I cannot attest to performance at scale with such a construct…but it shows a way you can go about this if you feel compelled enough to do so. Jason

On Jun 3, 2013, at 8:08 AM, Jack Krupansky j...@basetechnology.com wrote: No, but you can with the LucidWorks Search query parser: f1:(cat dog fox bat fish cow)~50% f2:(cat dog fox bat fish zebra)~2 See: http://docs.lucidworks.com/display/lweug/Minimum+Match+for+Simple+Queries -- Jack Krupansky

-Original Message- From: Eric Wilson Sent: Monday, June 03, 2013 10:30 AM To: solr-user@lucene.apache.org Subject: Can mm (min-match) be specified by field in dismax or edismax? I would like to have the min-match set differently for different fields in my dismax handler. Is this possible?
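A sketch of how this multiplies per field (field names and mm values illustrative; $qq dereferences a qq request parameter):

q=_query_:"{!type=edismax qf='title' v=$qq mm='100%'}" OR _query_:"{!type=edismax qf='body' v=$qq mm='50%'}"&qq=cat dog fox

…one clause per field, which is exactly where the messiness comes from.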
Re: Getting tons of EofException with jetty/SolrCloud
Those are the defaults, though autoSoftCommit is commented out by default. Keep in mind about the hard commit running every 15 seconds: it is not updating your searchable data (due to the openSearcher=false setting). In theory, your data should be searchable due to autoSoftCommit running every 1 second. Every 15 seconds the hard commit comes along to truncate the transaction logs and persist the data to Lucene segments, but searches are still being served from a combination of the last hard commit with openSearcher=true plus all the soft-committed data in memory. At some point it's useful to call a hard commit with openSearcher=true. This will essentially set the state of all searchable data to the segment data from Lucene. Also, the 15 second default isn't intended to be a one-size-fits-all policy. You need to find a good balance here, and testing this out with simulated load is the right way to do it. Others reading this thread may be able to provide better empirical or anecdotal suggestions to you on settings, but be sure to test!

On May 31, 2013, at 12:14 PM, ltenny lte...@gmail.com wrote:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

I think these are close to the default values...not sure if I changed them. These mean a hard commit every 15 seconds...right? Seems sort of reasonable since we get a few hundred doc inserts in 15 seconds. Not sure...any advice is very welcome. -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-tons-of-EofException-with-jetty-SolrCloud-tp4067427p4067433.html Sent from the Solr - User mailing list archive at Nabble.com.
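When that moment comes, the deterministic commit can be issued directly (URL illustrative; openSearcher=true is the default for an explicit commit, shown here only for emphasis):

curl 'http://localhost:8983/solr/update?commit=true&openSearcher=true'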
Re: 2 VM setup for SOLRCLOUD?
Jamey, You will need a load balancer on the front end to direct traffic into one of your SolrCore entry points. Technically it doesn't matter which one, though you will find benefits to narrowing traffic to fewer nodes (for purposes of better cache management). Internally, SolrCloud will round-robin requests to the other shards once a query begins execution. But you do need an external entry point defined through your load balancer. Hope this is useful! Jason

On May 30, 2013, at 12:48 PM, James Dulin jdu...@crelate.com wrote: Working to setup SolrCloud in Windows Azure. I have read over the SolrCloud wiki, but am a little confused about some of the deployment options. I am attaching an image of what I am thinking we want to do: 2 VMs that will have 2 shards spanning across them, 4 nodes total across the two machines, and a zookeeper on each VM. I think this is feasible, but I am a little confused about how each node knows how to respond to requests (do I need a load balancer in front, or can we just reference the “collection” etc.). Thanks! Jamey
Re: Nested Facets and distributed shard system.
You have mentioned Pivot Facets, but have you looked at the Path Hierarchy Tokenizer Factory: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PathHierarchyTokenizerFactory This matches your use case, as best as I understand it. Jason On May 28, 2013, at 12:47 PM, vibhoreng04 vibhoren...@gmail.com wrote: Hi Erick and Markus, Any Idea on this ? can we resolve this by group by queries? -- View this message in context: http://lucene.472066.n3.nabble.com/Nested-Facets-and-distributed-shard-system-tp4065847p4066583.html Sent from the Solr - User mailing list archive at Nabble.com.
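A sketch of the field type in question (names and delimiter illustrative):

<fieldType name="text_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

Indexing a value such as Books/NonFiction/Science emits the tokens Books, Books/NonFiction, and Books/NonFiction/Science, which makes level-by-level facet drill-down straightforward, including across shards.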
Re: split document or not
You may wish to explore the concept of using the Result Grouping (Field Collapsing) feature, in which your paragraphs are individual documents that share a field to group them by (the ID of the document/book/article/whatever). http://wiki.apache.org/solr/FieldCollapsing This will net you absolutely isolated results for paragraphs, and give you a great deal of flexibility in how to query the results in cases where you do or do not need them grouped. Jason

On May 28, 2013, at 3:10 PM, Hard_Club meddn...@gmail.com wrote: Thanks, Alexandre. But I need to know which paragraph matched the request. I need it because paragraphs are bound to some extra data that I need to output on the result page, so I need to know the paragraphs' ids. How can I bind such an attribute to a multivalued field? -- View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066629.html Sent from the Solr - User mailing list archive at Nabble.com.
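A sketch of the query side (field names illustrative, assuming each paragraph document carries the id of its parent document):

/select?q=content:whale&group=true&group.field=parent_id&group.limit=3

Each group returns its top-matching paragraphs, and any per-paragraph extra data rides along as ordinary stored fields.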
Re: filter query by string length or word count?
Sam, I would highly suggest counting the words in your external pipeline and sending that value in as a specific field. It can then be queried quite simply with a: wordcount:{80 TO *] (Note the { next to 80, excluding the value of 80) Jason

On May 22, 2013, at 11:37 AM, Sam Lee skyn...@gmail.com wrote: I have schema.xml

<field name="body" type="text_en_html" indexed="true" stored="true" omitNorms="true"/>
...
<fieldType name="text_en_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

How can I query docs whose body has more than 80 words (or 80 characters)?
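The supporting pieces are small (field name illustrative): a schema.xml entry such as

<field name="wordcount" type="int" indexed="true" stored="true"/>

populated by the pipeline at index time, then filtered exactly as above with fq=wordcount:{80 TO *]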
Re: Not able to search Spanish word with ascent in solr
And use the /terms request handler to view what is present in the field: /solr/terms?terms.fl=text_es&terms.prefix=a You're looking to ensure the index does, in fact, have the accented characters present. It's just a sanity check, but could possibly save you a little (sanity, that is). Jason

On May 20, 2013, at 12:51 PM, Jack Krupansky j...@basetechnology.com wrote: Try the Solr Admin UI Analysis page - enter text for both index and query for your field and see whether the final terms still have their accents. -- Jack Krupansky

-Original Message- From: jignesh Sent: Monday, May 20, 2013 10:46 AM To: solr-user@lucene.apache.org Subject: Re: Not able to search Spanish word with ascent in solr Thanks for the reply. I am sending the below type of xml to solr:

<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
<field name="id">15</field>
<field name="id_i">15</field>
<field name="name">Mis nuevos colgantes de PRIMARK</field>
<field name="features">&iquest;Alguna vez os hab&eacute;is pasado por la zona de bisuter&iacute;a de PRIMARK? Cada vez que me doy una vuelta y paso por delante no puedo evitar echar un vistazo a ver si encuentro alg&uacute;n detallito mono. Colgantes, pendientes, pulseras, diademas tienen de todo y siempre est&aacute; bien de precio. Hoy quer&iacute;a ense&ntilde;aros mis dos &uacute;ltimas compras: dos colgantes, uno con forma de b&uacute;ho y otro con un robot fashion. Y lo mejor es que s&oacute;lo me he gastado 5 euros. &iquest;Qu&eacute; os parecen? &iquest;Hab&eacute;is comprado alguna vez en esta tienda?</field>
</doc>

I am using the below url: http://localhost:8983/solr/select/?q=étnico&indent=on&qf=name&qf=features&defType=edismax&start=0&rows=50&wt=json waiting for reply. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Not-able-to-search-Spanish-word-with-ascent-in-solr-tp4064404p4064651.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: multiple cache for same field
Most definitely not the number of unique elements in each segment. My 32 document sample index (built from the default example docs data) has the following: entry#0: 'StandardDirectoryReader(segments_b:29 _8(4.2.1):C32)'='manu_exact',class org.apache.lucene.index.SortedDocValues,0.5=org.apache.lucene.search.FieldCacheImpl$SortedDocValuesImpl#1778857102 There is no chance for there to be 1.8 billion unique elements in that index. On May 20, 2013, at 1:20 PM, Erick Erickson erickerick...@gmail.com wrote: Not sure, never had to worry about what they are.. On Mon, May 20, 2013 at 12:28 PM, J Mohamed Zahoor zah...@indix.com wrote: What is the number at the end? is it the no of unique elements in each segment? ./zahoor On 20-May-2013, at 7:37 PM, Erick Erickson erickerick...@gmail.com wrote: Because the same field is split amongst a number of segments. If you look in the index directory, you should see files like _3fgm.* and _3ffm.*. Each such group represents one segment. The number of segments changes with merging etc. Best Erick On Mon, May 20, 2013 at 6:43 AM, J Mohamed Zahoor zah...@indix.com wrote: Hi Why is that lucene field cache has multiple entries for the same field S_24. It is a dynamic field. 'SegmentCoreReader(owner=_3fgm(4.2.1):C7681)'='S_24',double,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_DOUBLE_PARSER=org.apache.lucene.search.FieldCacheImpl$DoublesFromArray#1174240382 'SegmentCoreReader(owner=_3ffm(4.2.1):C1596758)'='S_24',double,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_DOUBLE_PARSER=org.apache.lucene.search.FieldCacheImpl$DoublesFromArray#83384344 'SegmentCoreReader(owner=_3fgh(4.2.1):C2301)'='S_24',double,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_DOUBLE_PARSER=org.apache.lucene.search.FieldCacheImpl$DoublesFromArray#1281331764 Also, the number at the end.. does it specified the no of entries in that cache bucket? ./zahoor
Re: Upgrading from SOLR 3.5 to 4.2.1 Results.
Rishi, Fantastic! Thank you so very much for sharing the details. Jason On May 17, 2013, at 12:29 PM, Rishi Easwaran rishi.easwa...@aol.com wrote: Hi All, It's Friday 3:00pm, warm and sunny outside, and it was a good week. Figured I'd share some good news. I work for the AOL mail team and we use SOLR for our mail search backend. We have been using it since pre-SOLR 1.4 and are strong supporters of the SOLR community. We deal with millions of indexes and billions of requests a day across our complex. We finished the full rollout of SOLR 4.2.1 into our production last week. Some key highlights: - ~75% reduction in search response times - ~50% reduction in SOLR disk busy, which in turn helped with a ~90% reduction in errors - Total garbage collection stop time reduced by over 50%, moving application throughput into the 99.8% - 99.9% range - ~15% reduction in CPU usage We did not tune our application moving from 3.5 to 4.2.1, nor update Java. For the most part it was a binary upgrade, with patches for our special use case. Now going forward we are looking at prototyping SOLR Cloud for our search system, upgrading Java and Tomcat, and tuning our application further. Lots of fun stuff :) Have a great weekend everyone. Thanks, Rishi.
Re: Deleting an entry from a collection when the key has : in it
The first rule of Solr without a Unique Key is that we don't talk about Solr without a Unique Key. The second rule... On May 16, 2013, at 8:47 PM, Jack Krupansky j...@basetechnology.com wrote: Technically, core Solr does not require a unique key. A lot of features in Solr do require unique keys, and it is recommended that you have unique keys, but it is not an absolute requirement. -- Jack Krupansky -Original Message- From: Daniel Baughman Sent: Thursday, May 16, 2013 1:50 PM To: solr-user@lucene.apache.org Subject: RE: Deleting an entry from a collection when the key has : in it Thanks for the idea http://localhost:8983/solr/docrepo/update/?stream.body=%3Cdelete%3E%3Cquery%3Ekey%3AD\:\\Webdocs\\sw4\\docRepo\\documents\\Hiring%20Manager\\Disciplinary\\asdfasdf\.docx%3C%2Fquery%3E%3C%2Fdelete%3E I do have :'s and \'s escaped, I believe. If in my schema I have the key field set to indexed=false, then is that maybe the issue? I'm going to try to set that to true and rebuild the repository and see if that does it. -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, May 16, 2013 11:20 AM To: solr-user@lucene.apache.org Subject: Re: Deleting an entry from a collection when the key has : in it You need to escape colons in queries, using either a backslash or enclosing the full query term in quotes. In your case, you have backslashes as well in your query, which the query parser will interpret as an escape! So, you need to escape those backslashes as well: D\:\\somedir\\somefile.pdf or "D:\\somedir\\somefile.pdf" -- Jack Krupansky -Original Message- From: Daniel Baughman Sent: Thursday, May 16, 2013 11:33 AM To: solr-user@lucene.apache.org Subject: Deleting an entry from a collection when the key has : in it Hi All, I seem to be really struggling to delete an entry from a search repository that has a : in the key. The key is the path to the file, i.e. D:\somedir\somefile.pdf. I want to use a query to delete it and I just can't seem to make it go away. I've been trying stuff like this: http://localhost:8983/solr/docrepo/update/?stream.body=%3Cdelete%3E%3Cquery%3Ekey%3AD\:\\Webdocs\\sw4\\docRepo\\documents\\Hiring%20Manager\\Disciplinary\\asdfasdf\.docx%3C%2Fquery%3E%3C%2Fdelete%3E http://localhost:8983/solr/docrepo/update/?stream.body=%3Cdelete%3E%3Cquery%3Ekey%3AD\:\\Webdocs\\sw4\\docRepo\\documents\\Hiring%20Manager\\Disciplinary\\asdfasdf\.docx%3C%2Fquery%3E%3C%2Fdelete%3E&version=2.2&start=0&rows=10&indent=on It doesn't throw an error but it doesn't delete the document either. Does anyone have any suggestions? Thanks, Dan
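A footnote on Dan's hunch: indexed=false on the key field is indeed fatal for delete-by-query, since a query can only match what is indexed. Separately, if key is declared as the uniqueKey, delete-by-id sidesteps query-parser escaping entirely, because the value is matched verbatim rather than parsed. A sketch of the unencoded request body (assuming key is a string-typed uniqueKey, which the thread suggests but does not confirm):

<delete><id>D:\Webdocs\sw4\docRepo\documents\Hiring Manager\Disciplinary\asdfasdf.docx</id></delete>

No backslash or colon escaping is needed in the id form; only standard URL encoding applies when sending it via stream.body.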
Re: Aggregate word counts over a subset of documents
David, A Pivot Facet could possibly provide these results by the following syntax: facet.pivot=category,includes We would presume that includes is a tokenized field, and thus a set of facet values would be rendered from the terms resulting from that tokenization. This would be nested in each category…and, of course, the entire set of documents considered for these facets is constrained by the current query. I think this maps to your requirement. Jason On May 16, 2013, at 12:29 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote: Is there a way to get aggregate word counts over a subset of documents? For example, given the following data: { "id": 1, "category": "cat1", "includes": "The green car." }, { "id": 2, "category": "cat1", "includes": "The red car." }, { "id": 3, "category": "cat2", "includes": "The black car." } I'd like to be able to get total term frequency counts per category, e.g.: <category name="cat1"> <lst name="the">2</lst> <lst name="car">2</lst> <lst name="green">1</lst> <lst name="red">1</lst> </category> <category name="cat2"> <lst name="the">1</lst> <lst name="car">1</lst> <lst name="black">1</lst> </category> I was initially hoping to do this within Solr and I tried using the TermFrequencyComponent. This gives term frequencies for individual documents and term frequencies for the entire index but doesn't seem to help with subsets. For example, TermFrequencyComponent would tell me that "car" occurs 3 times over all documents in the index and 1 time in document 1, but not that it occurs 2 times over cat1 documents and 1 time over cat2 documents. Is there a good way to use Solr/Lucene to gather aggregate results like this? I've been focusing on just using Solr with XML files but I could certainly write Java code if necessary. Thanks, David
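Spelled out against the sample data, the request might look like this (a sketch; rows=0 suppresses the documents themselves, and facet.limit=-1 asks for every term):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=category,includes&facet.limit=-1

One caveat: pivot facet counts are document counts, not summed term frequencies. They match the desired numbers in this example only because no term repeats within a single document; a document containing "car car" would still count once for "car".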
Re: Solr - Best Java Combination for performance?
I have run across plenty of implementations using just about every common servlet container on the market, and haven't run across any common problems to dissuade you from any one of them. On the JVM front most people seem to use Oracle because of its ubiquity. But I have also run across a solid minority of OpenJDK deployments and they seem just fine. For that matter, more than a handful of custom JVMs (usually via IBM). The advice I always give on this topic leans heavily on practical considerations: which servlet container and JVM does your team know best how to address if a problem occurs? If you're unsure, I'd stick with Tomcat and Oracle since they are the most common, and you'll find metric tons of help via posts on the internet that may coincide with an issue or optimization you're considering. Hope that's useful! On May 11, 2013, at 4:56 AM, Spadez james_will...@hotmail.com wrote: Hi, I was wondering, what setup have people had the most luck with from a performance point of view? Tomcat vs Jetty Open JDK vs Oracle JDK I haven't been able to find any information online to back up any sort of performance claims. I am planning on using Tomcat with Open JDK; has anyone had any experience with this and is it a wise path to go down? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Best-Java-Combination-for-performance-tp4062554.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Negative Boosting at Recent Versions of Solr?
You learn the gosh-darndest things: http://localhost:8983/solr/browse?q=ipod&bf=product(price,-2)&debugQuery=on …nets: -0.3797992 = (MATCH) sum of: 0.13510442 = (MATCH) max of: 0.045963455 = (MATCH) weight(text:ipod^0.5 in 4) [DefaultSimilarity], result of: 0.045963455 = score(doc=4,freq=3.0 = termFreq=3.0 ), product of: …blah blah blah… -0.5149036 = (MATCH) FunctionQuery(product(float(price),const(-2))), product of: -23.0 = product(float(price)=11.5,const(-2)) 1.0 = boost 0.022387113 = queryNorm …it works! Similarly with boost=: -3.1081805 = (MATCH) boost((id:ipod^10.0 | author:ipod^2.0 | title:ipod^10.0 | text:ipod^0.5 | cat:ipod^1.4 | keywords:ipod^5.0 | manu:ipod^1.1 | description:ipod^5.0 | resourcename:ipod | name:ipod^1.2 | features:ipod | sku:ipod^1.5),product(float(price),const(-2))), product of: 0.13513829 = (MATCH) max of: 0.045974977 = (MATCH) weight(text:ipod^0.5 in 4) [DefaultSimilarity], result of: 0.045974977 = score(doc=4,freq=3.0 = termFreq=3.0 ), product of: …more blah… -23.0 = product(float(price)=11.5,const(-2)) I wonder how fantastically this can be abused now? On May 10, 2013, at 7:22 AM, Dyer, James james.d...@ingramcontent.com wrote: Despite the discussion in SOLR-3823/SOLR-3278, my experience with Solr 4.2 is that it does indeed allow negative boosts on both bf and qf. I think the functionality was added under the radar, possibly with SOLR-4093, not sure though. In disbelief, I did some testing and it seems to really work. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, May 09, 2013 5:41 PM To: solr-user@lucene.apache.org Subject: Re: Negative Boosting at Recent Versions of Solr? Solr does support both additive and multiplicative boosts. Although Solr doesn't support negative multiplicative boosts on query terms, it does support fractional multiplicative boosts (0.25) which do allow you to de-boost a term. The boosts for individual query terms and for the edismax qf parameter cannot be negative, but can be fractional. The edismax bf parameter gives a function query that provides an additive boost, which could be negative. The edismax boost parameter gives a function query that provides a multiplicative boost - which could also be negative, so it's not absolutely true that Solr doesn't support negative boosts. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Thursday, May 09, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: Negative Boosting at Recent Versions of Solr? I know that whilst Lucene allows negative boosts, Solr does not. However, did it change with newer versions of Solr (I use Solr 4.2.1) or is it still the same?
Re: Looking for Best Practice of Spellchecker
Nicholas, Also consider that some misspellings are better handled through synonyms (or injected metadata). You can garner a great deal of value out of the spell checker by following the great advice James is giving here…but you'll find a well-placed helper synonym or metavalue can often save a lot of headache and time. Jason On May 10, 2013, at 7:32 AM, Dyer, James james.d...@ingramcontent.com wrote: Nicholas, It sounds like you might want to use WordBreakSolrSpellChecker, which gets obscure mention in the wiki. Read through this section: http://wiki.apache.org/solr/SpellCheckComponent#Configuration and you will see some information. Also, the Solr example shows how to configure this. See http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/solr/example/solr/collection1/conf/solrconfig.xml Look for... <lst name="spellchecker"> <str name="name">wordbreak</str> ... </lst> ...and... <requestHandler name="/spell" ...> ... </requestHandler> Also, I'd recommend you take a look at each parameter in the /spell request handler and read its section on the SpellCheckComponent wiki page. You probably will want to set many of these parameters as well. You can get a query to return only spell results simply by specifying rows=0. However, it's one less query to just have it return the results also. If there are no results, your application can check for collations and re-issue a collation query. If there are both results and collations returned, you can give the user results with did-you-mean suggestions. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Nicholas Ding [mailto:nicholas...@gmail.com] Sent: Friday, May 10, 2013 8:47 AM To: solr-user@lucene.apache.org Subject: Looking for Best Practice of Spellchecker Hi guys, I'm working on a local search project and I want to integrate a spellchecker for the search. Basically, my search engine is used to search local businesses. For example, a user could search for "wall mart" (note the typo), and I want the spellchecker to give me a collation for "walmart". My problems are: 1. I use DirectSolrSpellChecker on my BusinessNameField and pass "wall mart" as a phrase search, but I can't get a collation from the spellchecker. 2. I tried not passing a phrase search, but passing q=Wall AND Mart to force a 100% match, but the spellchecker can't give me a collation either. I read the documentation about the spellchecker on the Solr wiki, but it's very brief. I'm wondering, is there any best practice for spellcheckers? I believe they're widely used in search, right? And I have another idea, I don't know whether it's valid or not. I want to apply the spellchecker to everything before doing the search, so that I can rely on the spellchecker to tell me whether my search will get results or not. Thanks Nicholas
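For the "wall mart" to "walmart" case specifically, the word-break checker's combineWords feature is the relevant piece. A configuration sketch along the lines of the 4.x example solrconfig (the field name here is taken from the thread and may need adjusting):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">BusinessNameField</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">BusinessNameField</str>
    <str name="combineWords">true</str>  <!-- "wall mart" -> "walmart" -->
    <str name="breakWords">true</str>    <!-- "walmart" -> "wall mart" -->
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>

Then reference both dictionaries at query time, e.g. spellcheck=true&spellcheck.dictionary=default&spellcheck.dictionary=wordbreak&spellcheck.collate=true, so that collations can merge or split words as needed.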
Re: Sharing index data between two Solr instances
Milen, At some point you'll need to call a commit to search your data, either via an AutoCommit policy or deterministically. There are various schools of thought on which way to go, but something needs to do this. If you go the AutoCommit route, be sure to pay attention to the openSearcher value. A value of false will not cause an IndexSearcher to open the new data, and there is a strong use case for this…but if you're not aware you might be caught by surprise. Once the commit fires your search process will automatically see the new data, with no interruption to its queue of queries. You may also want to consider having a Master/Slave relationship via replication for higher availability. It is trivial to set up and works like a charm. Jason On May 10, 2013, at 8:14 AM, milen.ti...@materna.de wrote: Hello together! I've been googling on this topic but still couldn't find a definitive answer to my question. We have a setup of two machines, both running Solr 4.2 within Tomcat. We are considering sharing the index data between both webapps. One of the machines will be configured to update the index periodically; the other one will be accessing it read-only. Using native locking on a network-mounted NTFS, is it possible for the reader to detect when new index data has been imported, or do we need to signal it from the updating webapp and make a commit in order to open a new reader with the updated content? Thanks in advance! Milen Tilev Master of Science Softwareentwickler Business Unit Information MATERNA GmbH Information & Communications Voßkuhle 37 44141 Dortmund Deutschland Telefon: +49 231 5599-8257 Fax: +49 231 5599-98257 E-Mail: milen.ti...@materna.de | www.materna.de | Newsletter: http://www.materna.de/newsletter | Twitter: http://twitter.com/MATERNA_GmbH | XING: http://www.xing.com/companies/MATERNAGMBH | Facebook: http://www.facebook.com/maternagmbh Sitz der MATERNA GmbH: Voßkuhle 37, 44141 Dortmund Geschäftsführer: Dr. Winfried Materna, Helmut an de Meulen, Ralph Hartwig Amtsgericht Dortmund HRB 5839
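A sketch of the relevant solrconfig.xml block on the updating instance (the values are placeholders to adjust):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit every 60s to flush and fsync segments -->
    <openSearcher>false</openSearcher> <!-- do not expose the new data to searches yet -->
  </autoCommit>
</updateHandler>

With openSearcher=false the commit only makes the index durable; a searcher (on this instance or the read-only one) still needs an explicit commit, or a commit with openSearcher=true, before queries see the new documents.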
Re: Sharing index data between two Solr instances
Milen, It is possible to have the configuration shared amongst multiple cores, I have seen this…though I haven't seen multiple separate instances share the same solr.xml core configuration (and, for that matter, separate possible locking policies). It might work. Honestly, I don't like it. Your config is not likely changing often, and keeping these in sync should be relatively trivial for your data ingestion delegate. But all of this is what replication does for you. Of course, as you note, there is latency…and as such you may wish to consider SolrCloud instead. Or an NRT (non-SolrCloud) configuration. You have a lot of options! But the replication master/slave behavior is rock solid and does nearly everything you seek. Jason On May 10, 2013, at 8:40 AM, milen.ti...@materna.de wrote: Hello Jason, Thanks for your quick response! The alternative of using Solr replication is also still pending at this point, so we will consider its pros and cons, too. Fortunately, we are not using AutoCommit in our project, as we need to control the creation of new segments, so I will propose to my colleagues that we issue a manual commit on the read-only node after each successful index update. Just one more question: would it be possible in this case to use the same solrhome/conf directory (shared schema and solrconfig) and solr.xml file within both webapps? I guess we should then signal the read-only side each time the solr.xml has changed (additional cores may be added by the updating machine depending on the imported data). Thanks again and best regards! Milen -Original Message- From: Jason Hellman [mailto:jhell...@innoventsolutions.com] Sent: Friday, May 10, 2013 17:30 To: solr-user@lucene.apache.org Subject: Re: Sharing index data between two Solr instances Milen, At some point you'll need to call a commit to search your data, either via an AutoCommit policy or deterministically. There are various schools of thought on which way to go, but something needs to do this. If you go the AutoCommit route, be sure to pay attention to the openSearcher value. A value of false will not cause an IndexSearcher to open the new data, and there is a strong use case for this…but if you're not aware you might be caught by surprise. Once the commit fires your search process will automatically see the new data, with no interruption to its queue of queries. You may also want to consider having a Master/Slave relationship via replication for higher availability. It is trivial to set up and works like a charm. Jason On May 10, 2013, at 8:14 AM, milen.ti...@materna.de wrote: Hello together! I've been googling on this topic but still couldn't find a definitive answer to my question. We have a setup of two machines, both running Solr 4.2 within Tomcat. We are considering sharing the index data between both webapps. One of the machines will be configured to update the index periodically; the other one will be accessing it read-only. Using native locking on a network-mounted NTFS, is it possible for the reader to detect when new index data has been imported, or do we need to signal it from the updating webapp and make a commit in order to open a new reader with the updated content? Thanks in advance!
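Since replication keeps coming up in this thread, here is a minimal master/slave sketch of the /replication handler for 4.x (the master URL, core name, and poll interval are placeholders). On the updating (master) instance:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,solrconfig.xml</str>
  </lst>
</requestHandler>

On the read-only (slave) instance:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/corename</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

The confFiles entry also speaks to the shared-configuration question above: schema and solrconfig changes ride along with the index, and the slave opens a new searcher automatically after each successful pull.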
Re: SOLR guidance required
One more tip on the use of filter queries. DO: fq=name1:value1&fq=name2:value2&fq=namen:valuen DON'T: fq=name1:value1 AND name2:value2 AND name3:value3 Where OR operators apply, this does not matter. But your Solr cache will be much more savvy with the first construct: each separate fq clause is cached as its own filterCache entry and can be reused independently by later queries, while the ANDed version is cached as a single monolithic entry that only helps queries repeating the exact same combination. Jason On May 10, 2013, at 11:39 AM, pravesh suyalprav...@yahoo.com wrote: Aditya, As suggested by others, definitely you should use the filter queries directly to query SOLR. Just keep your indexes updated. Keep all your fields indexed/stored as per your requirements. Refer to the filter query wiki: http://wiki.apache.org/solr/CommonQueryParameters http://wiki.apache.org/solr/SimpleFacetParameters BTW, almost all the job sites out there (whether small/medium/big) use SOLR/lucene to power their searches :) Best Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-guidance-required-tp4062188p4062422.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Does Distributed Search are Cached Only the By Node That Runs Query?
And for 10,000 documents across n shards, that can be significant! On May 10, 2013, at 11:43 AM, Joel Bernstein joels...@gmail.com wrote: How many shards are in your collection? The query aggregator node will pull back the results from each shard and hold them in memory. Then it will add the results to a priority queue to sort them. This queue will need to be as large as the page that is being generated. After the query is finished this memory should be collectable. On Thu, May 9, 2013 at 8:00 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: You are looking at the JVM heap but attributing it to caching only. Not quite right...there are other things in that JVM heap. Otis Solr & ElasticSearch Support http://sematext.com/ On May 9, 2013 3:55 PM, Furkan KAMACI furkankam...@gmail.com wrote: I have Solr 4.2.1 and run it as SolrCloud. When I do a search on SolrCloud like this: ip_of_node_1:8983/solr/select?q=*:*&rows=1 and when I check the admin page I see that: I have 5 GB Java heap. 616.32 MB is dark gray, 3.13 GB is gray. Before my search it was something like: 150 MB dark gray, 500 MB gray. I understand that when I do a search like that, fields are cached. However when I look at other SolrCloud nodes' admin pages there are no differences. Why is that query cached only by the node that I run the query on?
Re: Use case for storing positions and offsets in index?
Consider further that term vector data and highlighting become very useful if you highlight externally to Solr. That is to say, you have the data stored externally and wish to re-parse positions of terms (especially synonyms) from source material. This is a (not too uncommon) technique used for extremely large articles, where storing the full text in the Lucene index as well would be redundant. On May 8, 2013, at 11:04 PM, Jack Krupansky j...@basetechnology.com wrote: Term positions in the index are used for phrase queries and span queries. There is a separate concept called term vectors that maintains positions as well. It is most useful for highlighting - you want to know exactly where a term started and ended. -- Jack Krupansky -Original Message- From: KnightRider Sent: Tuesday, May 07, 2013 12:58 PM To: solr-user@lucene.apache.org Subject: Use case for storing positions and offsets in index? Can someone please tell me the use case for storing term positions and offsets in the index? I am trying to understand the difference between storing positions/offsets vs indexing positions/offsets. Thanks KR - Thanks -K'Rider -- View this message in context: http://lucene.472066.n3.nabble.com/Use-case-for-storing-positions-and-offsets-in-index-tp4061376.html Sent from the Solr - User mailing list archive at Nabble.com.
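In schema.xml terms, the term vector flavor of positions and offsets is switched on per field; a sketch (the field and type names are placeholders):

<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

This stores positions and offsets alongside the term vectors for fast highlighting lookups, as distinct from the positions the inverted index itself keeps for phrase and span queries.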
Re: Grouping search results by field returning all search results for a given query
Luis, I am presuming you do not have an overarching grouping value here…and simply wish to show a standard search result that shows 1 item per company. You should be able to accomplish your second page of desired items (the second item from each of your 20 represented companies) by using the group.offset parameter. This will shift the position in the returned array of documents to the value provided. Thus: group.limit=1&group.field=companyid&group.offset=1 …would return the second item in each companyid group matching your current query. Jason On May 9, 2013, at 10:30 AM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: Hi, I'm using solr to maintain an index of items that belong to different companies. I want the search results to be returned in a way that is fair to all companies, thus I wish to group the results such that each company has 1 item in each group, and the groups of results should be returned sorted by score. example: -- 20 companies first 100 results 1-20 results - (company1 highest score item, company2 highest score item, etc..) 20-40 results - (company1 second highest score item, company 2 second highest score item, etc..) ... -- I'm trying to use the field collapsing feature but I have only been able to create the first group of results by using group.limit=1,group.field=companyid. If I raise the group.limit value, I would be violating the 'fairness rule' because more than one result of a company would be returned in the first group of results. Can I achieve the desired search result using SOLR, or do I have to look at other options? thank you, Luis Guerrero
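As a full request for page two of this one-item-per-company view, something like the following (the query value is a placeholder; note that group=true is required to enable grouping at all):

http://localhost:8983/solr/select?q=<your query>&group=true&group.field=companyid&group.limit=1&group.offset=1

Groups are ordered by the sort criterion of their best-matching document (score by default), so the fairness ordering described above is preserved across pages.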
Re: 4.3 logging setup
From: http://lucene.apache.org/solr/4_3_0/changes/Changes.html#4.3.0.upgrading_from_solr_4.2.0 Slf4j/logging jars are no longer included in the Solr webapp. All logging jars are now in example/lib/ext. Changing logging impls is now as easy as updating the jars in this folder with those necessary for the logging impl you would like. If you are using another webapp container, these jars will need to go in the corresponding location for that container. In conjunction, the dist-excl-slf4j and dist-war-excl-slf4j build targets have been removed since they are redundant. See the Slf4j documentation, SOLR-3706, and SOLR-4651 for more details. It should just require that you provide your preferred logging jars within an appropriate classpath. On May 9, 2013, at 9:24 AM, richardg richa...@dvdempire.com wrote: On all prior index versions I set up my logging via the logging.properties file in /usr/local/tomcat/conf; it looked like this: # Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler .handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler # Handler specific properties. # Describes specific configuration info for Handlers. 1catalina.org.apache.juli.FileHandler.level = WARNING 1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 1catalina.org.apache.juli.FileHandler.prefix = catalina. 2localhost.org.apache.juli.FileHandler.level = FINE 2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 2localhost.org.apache.juli.FileHandler.prefix = localhost. 3manager.org.apache.juli.FileHandler.level = FINE 3manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 3manager.org.apache.juli.FileHandler.prefix = manager. 4host-manager.org.apache.juli.FileHandler.level = FINE 4host-manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs 4host-manager.org.apache.juli.FileHandler.prefix = host-manager. java.util.logging.ConsoleHandler.level = FINE java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter # Facility specific properties. # Provides extra control for each logger.
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = INFO org.apache.catalina.core.ContainerBase.[Catalina].[localhost].handlers = 2localhost.org.apache.juli.FileHandler org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].level = INFO org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].handlers = 3manager.org.apache.juli.FileHandler org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].level = INFO org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].handlers = 4host-manager.org.apache.juli.FileHandler # For example, set the org.apache.catalina.util.LifecycleBase logger to log # each component that extends LifecycleBase changing state: #org.apache.catalina.util.LifecycleBase.level = FINE # To see debug messages in TldLocationsCache, uncomment the following line: #org.apache.jasper.compiler.TldLocationsCache.level = FINE After upgrading to 4.3 today the files defined above aren't being logged to. I know things have changed for logging with 4.3, but how can I get it set up like it was before? -- View this message in context: http://lucene.472066.n3.nabble.com/4-3-logging-setup-tp4061875.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: More Like This and Caching
Purely from empirical observation, both the DocumentCache and QueryResultCache are being populated and reused in reloads of a simple MLT search. You can see in the cache inserts how much extra-curricular activity is happening to populate the MLT data by how many inserts and lookups occur on the first load. (lifted right out of the MLT wiki http://wiki.apache.org/solr/MoreLikeThis ) http://localhost:8983/solr/select?q=apache&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score There is no activity in the filterCache, fieldCache, or fieldValueCache - and that makes plenty of sense. On May 9, 2013, at 11:12 AM, David Parks davidpark...@yahoo.com wrote: I'm not the expert here, but perhaps what you're noticing is actually the OS's disk cache. The actual solr index isn't cached by solr, but as you read the blocks off disk the OS disk cache probably did cache those blocks for you. On the 2nd run the index blocks were read out of memory. There was a very extensive discussion on this list not long back titled: Re: SolrCloud loadbalancing, replication, and failover look that thread up and you'll get a lot of in-depth on the topic. David -Original Message- From: Giammarco Schisani [mailto:giamma...@schisani.com] Sent: Thursday, May 09, 2013 2:59 PM To: solr-user@lucene.apache.org Subject: More Like This and Caching Hi all, Could anybody explain which Solr cache (e.g. queryResultCache, documentCache, fieldCache, etc.) can be used by the More Like This handler? One of my colleagues had previously suggested that the More Like This handler does not take advantage of any of the Solr caches. However, if I issue two identical MLT requests to the same Solr instance, the second request will execute much faster than the first request (for example, the first request will execute in 200ms and the second request will execute in 20ms). This makes me believe that at least one of the Solr caches is being used by the More Like This handler. I think the documentCache is the cache that is most likely being used, but would you be able to confirm? As information, I am currently using Solr version 3.6.1. Kind regards, Giammarco Schisani
Re: 4.3 logging setup
If you nab the jars in example/lib/ext and place them within the appropriate folder in Tomcat (and this will somewhat depend on which version of Tomcat you are using…let's presume tomcat/lib as a brute-force approach) you should be back in business. On May 9, 2013, at 11:41 AM, richardg richa...@dvdempire.com wrote: Thanks for responding. My issue is I've never changed anything with logging; I have always used the built-in Juli. I've never messed with any jar files, just had to edit the logging.properties file. I don't know where I would get the jars for Juli or where to put them, if that is what is needed. I had read what you posted before; I just can't make any sense of it. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/4-3-logging-setup-tp4061875p4061901.html Sent from the Solr - User mailing list archive at Nabble.com.
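For reference, the logging jars that ship in example/lib/ext for 4.3 are the slf4j API plus bridges and a log4j binding (slf4j-api, slf4j-log4j12, jcl-over-slf4j, jul-to-slf4j, and log4j itself). Copy those into tomcat/lib, then give log4j a configuration on the classpath so it knows where to write. A minimal log4j.properties sketch (the log path is a placeholder; adjust to taste):

# route everything at INFO to a rolling file
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/usr/local/tomcat/logs/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c - %m%n

Dropping that file into tomcat/lib puts it on the classpath, which replaces the old Juli logging.properties approach for Solr's own log output.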
Re: Grouping search results by field returning all search results for a given query
I would think pagination is resolved by obtaining the numFound value for your returned groups. If you have numFound=6 then each page of 20 items (one item per company) would imply a total of 6 pages. You'll have to arbitrate for the variance here…but it would seem to me you need as many pages as the highest value in the numFound field for all groups. This shouldn't require requerying but will definitely require a little intelligence on the web app to handle the groups that are less than the largest size. Hope that's useful! On May 9, 2013, at 12:23 PM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: Thank you for the prompt reply, Jason. The group.offset parameter is working for me; now I can iterate through all items for each company. The problem I'm having right now is pagination. Is there a way this can be implemented out of the box with Solr? Before, I was using group.main=true for easy pagination of results, but it seems like I'll have to ditch that and use the standard grouping format returned by Solr for the group.offset parameter to be useful. Since all groups don't have the same number of items, I'll have to carefully calculate the results that should be returned for each page of 20 items and probably make several Solr calls per page rendered. On Thu, May 9, 2013 at 1:07 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Luis, I am presuming you do not have an overarching grouping value here…and simply wish to show a standard search result that shows 1 item per company. You should be able to accomplish your second page of desired items (the second item from each of your 20 represented companies) by using the group.offset parameter. This will shift the position in the returned array of documents to the value provided. Thus: group.limit=1&group.field=companyid&group.offset=1 …would return the second item in each companyid group matching your current query. Jason On May 9, 2013, at 10:30 AM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: Hi, I'm using solr to maintain an index of items that belong to different companies. I want the search results to be returned in a way that is fair to all companies, thus I wish to group the results such that each company has 1 item in each group, and the groups of results should be returned sorted by score. example: -- 20 companies first 100 results 1-20 results - (company1 highest score item, company2 highest score item, etc..) 20-40 results - (company1 second highest score item, company 2 second highest score item, etc..) ... -- I'm trying to use the field collapsing feature but I have only been able to create the first group of results by using group.limit=1,group.field=companyid. If I raise the group.limit value, I would be violating the 'fairness rule' because more than one result of a company would be returned in the first group of results. Can I achieve the desired search result using SOLR, or do I have to look at other options? thank you, Luis Guerrero -- Luis Carlos Guerrero Covo M.S. Computer Engineering (57) 3183542047
Re: disaster recovery scenarios for solr cloud and zookeeper
I have to imagine I'm quibbling with the original assertion that Solr 4.x is architected with a dependency on Zookeeper when I say the following: Solr 4.x is not architected with a dependency on Zookeeper. SolrCloud, however, is. As such, if a line of reasoning drives greater concern about Zookeeper than about Solr's resiliency, one can simply opt to use Solr 4.x without Zookeeper. I have to further imagine that isn't really the point of the original message. Unfortunately for me, somehow I'm obsessing on saying it :) On May 3, 2013, at 12:21 PM, Dennis Haller dhal...@talenttech.com wrote: Hi, Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is expected to have very high (perfect?) availability. With 3 or 5 zookeeper nodes, it is possible to manage zookeeper maintenance and online availability to be close to 100%. But what is the worst case for Solr if, for some unanticipated reason, all Zookeeper nodes go offline? Could someone comment on a couple of possible scenarios in which all ZK nodes are offline? What would happen to Solr and what would be needed to recover in each case? 1) brief interruption, say 2 minutes, 2) longer downtime, say 60 min Thanks Dennis