Re: Ability to load solrcore.properties from zookeeper
Sorry for not responding back earlier, I went ahead and created a ticket here: https://issues.apache.org/jira/browse/SOLR-7613 It does look somewhat trivial if you just update the current loading mechanism as Chris describes; I can provide a patch for that if you want. Though, if you want to go the refactoring route, I can leave it to Alan to take a crack at it. Thanks, -Steve On Fri, May 29, 2015 at 3:29 AM, Alan Woodward a...@flax.co.uk wrote: Yeah, you could do it like that. But looking at it further, I think solrcore.properties is actually being loaded in entirely the wrong place - it should be done by whatever is creating the CoreDescriptor, and then passed in as a Properties object to the CD constructor. At the moment, you can't refer to a property defined in solrcore.properties within your core.properties file. I'll open a JIRA if Steve hasn't already done so. Alan Woodward www.flax.co.uk On 28 May 2015, at 17:57, Chris Hostetter wrote: : certainly didn't intend to write it like this!). The problem here will : be that CoreDescriptors are currently built entirely from : core.properties files, and the CoreLocators that construct them don't : have any access to zookeeper. But they do have access to the CoreContainer which is passed to the CoreDescriptor constructor -- it has all the ZK access you'd need at the time when loadExtraProperties() is called, correct? As fleshed out in my last email... : patch: IIUC CoreDescriptor.loadExtraProperties is the relevant method ... : it would need to build up the path including the core name and get the : system level resource loader (CoreContainer.getResourceLoader()) to access : it since the core doesn't exist yet so there is no core level : ResourceLoader to use. -Hoss http://www.lucidworks.com/
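For readers following the patch discussion, a rough, untested sketch of the plumbing Hoss describes might look like the following: build the path including the core name and load the properties through the system-level resource loader. The class and method shown here are illustrative assumptions, not code from the eventual SOLR-7613 patch; in particular, how a ZooKeeper-aware loader is obtained in cloud mode is exactly what the patch has to sort out.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;
    import org.apache.solr.core.SolrResourceLoader;

    // Illustrative sketch only; names are not from the actual patch.
    public class SolrCorePropertiesSketch {

        // Load <coreName>/solrcore.properties through the system-level loader
        // (CoreContainer.getResourceLoader()), since no core-level ResourceLoader
        // exists yet while the CoreDescriptor is being built. In SolrCloud a
        // ZooKeeper-aware loader would have to be supplied here instead of the
        // plain filesystem/classpath loader.
        public static Properties loadExtraProperties(SolrResourceLoader systemLoader,
                                                     String coreName) throws IOException {
            Properties props = new Properties();
            try (InputStream in = systemLoader.openResource(coreName + "/solrcore.properties")) {
                props.load(in);
            }
            return props;
        }
    }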
Re: any changes about limitations on huge number of fields lately?
Thanks Toke for the input. I think the plan is to facet only on class_u1, class_u2 for queries from user1, etc. So faceting would not happen on all fields on a single query. But still. I did not design the schema, just found out about the number of fields and advised against that when they asked for a second opinion. We did not get to discuss a different schema, but if we get to this point I will take that plan into consideration for sure. xavi On Sat, May 30, 2015 at 10:17 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: xavi jmlucjav jmluc...@gmail.com wrote: The reason for such a large number of fields: - users create dynamically 'classes' of documents, say one user creates 10 classes on average - for each 'class', the fields are created like this: unique_id_+fieldname - there are potentially hundreds of thousands of users. Switch to a scheme where you control the names of fields outside of Solr, but share the fields internally: User 1 has 10 custom classes: u1_a, u1_b, u1_c, ... u1_j Internally they are mapped to class1, class2, class3, ... class10 User 2 uses 2 classes: u2_horses, u2_elephants Internally they are mapped to class1, class2 When User 2 queries field u2_horses, you rewrite the query to use class1 instead. There is faceting in each user's fields. So this will result in 1M fields, very sparsely populated. If you are faceting on all of them and if you are not using DocValues, this will explode your memory requirements with vanilla Solr: UnInverted faceting maintains a separate map from all documentIDs to field values (ordinals for Strings) for _all_ the facet fields. Even if you only had 10 million documents and even if your 1 million facet fields all had just 1 value, represented by 1 bit, it would still require 10M * 1M * 1 bits in memory, which is 10 terabits, or more than a terabyte of RAM. - Toke Eskildsen
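To make the shared-field scheme Toke describes concrete, here is a small illustrative sketch of rewriting a user-visible field name to a shared internal field before the query is sent to Solr. The field names and the in-memory map are invented for the example; a real system would keep the per-user mapping in its own store outside Solr.

    import java.util.HashMap;
    import java.util.Map;

    public class FieldAliasSketch {
        // Hypothetical per-user mapping kept outside Solr: user "u2" defined
        // classes "horses" and "elephants", mapped to shared internal fields.
        private final Map<String, String> userFieldToInternal = new HashMap<>();

        public FieldAliasSketch() {
            userFieldToInternal.put("u2_horses", "class1");
            userFieldToInternal.put("u2_elephants", "class2");
        }

        // Rewrite "u2_horses:appaloosa" to "class1:appaloosa" before building the
        // Solr query; facet.field parameters would be rewritten the same way.
        public String rewrite(String userField, String value) {
            String internal = userFieldToInternal.getOrDefault(userField, userField);
            return internal + ":" + value;
        }
    }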
Re: Deleting Fields
Faceting on very high cardinality fields can use up memory, no doubt about that. I think the entire delete question was a red herring, but you know that already ;) So I think you can forget about the delete stuff. Although do note that if you do re-index your old documents, the new version won't have the field, and as segments are merged the deleted documents will have all their resources reclaimed, effectively deleting the field from the old docs. So you could gradually re-index your corpus and get this stuff out of there. Best, Erick On Sat, May 30, 2015 at 5:18 AM, Joseph Obernberger j...@lovehorsepower.com wrote: Thank you Erick. I was thinking that it actually went through and removed the index data; thank you for the clarification. What happened was I had some bad data that created a lot of fields (some 8000). I was getting some errors adding new fields where solr could not talk to zookeeper, and I thought it may be because there are so many fields. The index size is some 420 million docs. I'm hesitant to try to re-create as when the shards crash, they leave a write.lock file in HDFS, and I need to manually delete that file (on 27 machines) before bringing them back up. I believe this is the stack trace - but this looks to be related to facets, and I'm not 100% sure that this is the correct trace! Sorry - if it happens again I will update.
ERROR - 2015-05-29 20:39:34.707; [UNCLASS shard9 core_node14 UNCLASS] org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:854)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:463)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:368)
  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
  at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
  at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
  at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
  at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:714)
  at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
  at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:637)
  at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:280)
  at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:106)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:222)
  at
Re: any changes about limitations on huge number of fields lately?
Nothing's really changed in that area lately. Your co-worker is perhaps confusing the statement that Solr has no a-priori limit on the number of distinct fields that can be in a corpus with supporting an infinite number of fields. Not having a built-in limit is much different than supporting it. Whether Solr breaks with thousands and thousands of fields is pretty dependent on what you _do_ with those fields. Simply doing keyword searches isn't going to put the same memory pressure on as, say, faceting on them all (even if in different queries). I'd really ask why so many fields are necessary though. Best, Erick On Sat, May 30, 2015 at 6:18 AM, xavi jmlucjav jmluc...@gmail.com wrote: Hi guys, someone I work with has been advised that currently Solr can support an 'infinite' number of fields. I thought there was a practical limitation of say thousands of fields (for sure less than a million), or things can start to break (I think I remember seeing memory issues reported on the mailing list by several people). Was there any change I missed lately that makes having say 1M fields in Solr practical?? thanks
Re: any changes about limitations on huge number of fields lately?
xavi jmlucjav jmluc...@gmail.com wrote: The reason for such a large number of fields: - users create dynamically 'classes' of documents, say one user creates 10 classes on average - for each 'class', the fields are created like this: unique_id_+fieldname - there are potentially hundreds of thousands of users. Switch to a scheme where you control the names of fields outside of Solr, but share the fields internally: User 1 has 10 custom classes: u1_a, u1_b, u1_c, ... u1_j Internally they are mapped to class1, class2, class3, ... class10 User 2 uses 2 classes: u2_horses, u2_elephants Internally they are mapped to class1, class2 When User 2 queries field u2_horses, you rewrite the query to use class1 instead. There is faceting in each user's fields. So this will result in 1M fields, very sparsely populated. If you are faceting on all of them and if you are not using DocValues, this will explode your memory requirements with vanilla Solr: UnInverted faceting maintains a separate map from all documentIDs to field values (ordinals for Strings) for _all_ the facet fields. Even if you only had 10 million documents and even if your 1 million facet fields all had just 1 value, represented by 1 bit, it would still require 10M * 1M * 1 bits in memory, which is 10 terabits, or more than a terabyte of RAM. - Toke Eskildsen
Re: any changes about limitations on huge number of fields lately?
Anything more than a few hundred seems very suspicious. Anything more than a few dozen or 50 or 75 seems suspicious as well. The point should not be how crazy you can get with Solr, but that craziness should be avoided altogether! Solr's design is optimal for a large number of relatively small documents, not large documents. -- Jack Krupansky On Sat, May 30, 2015 at 3:05 PM, Erick Erickson erickerick...@gmail.com wrote: Nothing's really changed in that area lately. Your co-worker is perhaps confusing the statement that Solr has no a-priori limit on the number of distinct fields that can be in a corpus with supporting an infinite number of fields. Not having a built-in limit is much different than supporting it. Whether Solr breaks with thousands and thousands of fields is pretty dependent on what you _do_ with those fields. Simply doing keyword searches isn't going to put the same memory pressure on as, say, faceting on them all (even if in different queries). I'd really ask why so many fields are necessary though. Best, Erick On Sat, May 30, 2015 at 6:18 AM, xavi jmlucjav jmluc...@gmail.com wrote: Hi guys, someone I work with has been advised that currently Solr can support an 'infinite' number of fields. I thought there was a practical limitation of say thousands of fields (for sure less than a million), or things can start to break (I think I remember seeing memory issues reported on the mailing list by several people). Was there any change I missed lately that makes having say 1M fields in Solr practical?? thanks
Re: any changes about limitations on huge number of fields lately?
xavi jmlucjav jmluc...@gmail.com wrote: I think the plan is to facet only on class_u1, class_u2 for queries from user1, etc. So faceting would not happen on all fields on a single query. I understand that, but most of the created structures stay in memory between calls (DocValues helps here). Your heap will slowly fill up as more and more users perform faceted queries on their content. - Toke Eskildsen
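Since DocValues is what Toke says helps keep those per-field faceting structures off the heap, a hedged illustration of what that could look like in schema.xml follows; the dynamic field pattern and type name are assumptions for the example, not something given in the thread.

    <!-- Illustrative only: shared, docValues-backed facet fields class1, class2, ... -->
    <dynamicField name="class*" type="string" indexed="true" stored="false"
                  docValues="true" multiValued="true"/>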
Re: any changes about limitations on huge number of fields lately?
The reason for such a large number of fields: - users create dynamically 'classes' of documents, say one user creates 10 classes on average - for each 'class', the fields are created like this: unique_id_+fieldname - there are potentially hundreds of thousands of users. There is faceting in each user's fields. So this will result in 1M fields, very sparsely populated. I warned them this did not sound like a good design to me, but apparently someone very knowledgeable in solr said this will work out fine. That is why I wanted to double check... On Sat, May 30, 2015 at 9:22 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Anything more than a few hundred seems very suspicious. Anything more than a few dozen or 50 or 75 seems suspicious as well. The point should not be how crazy you can get with Solr, but that craziness should be avoided altogether! Solr's design is optimal for a large number of relatively small documents, not large documents. -- Jack Krupansky On Sat, May 30, 2015 at 3:05 PM, Erick Erickson erickerick...@gmail.com wrote: Nothing's really changed in that area lately. Your co-worker is perhaps confusing the statement that Solr has no a-priori limit on the number of distinct fields that can be in a corpus with supporting an infinite number of fields. Not having a built-in limit is much different than supporting it. Whether Solr breaks with thousands and thousands of fields is pretty dependent on what you _do_ with those fields. Simply doing keyword searches isn't going to put the same memory pressure on as, say, faceting on them all (even if in different queries). I'd really ask why so many fields are necessary though. Best, Erick On Sat, May 30, 2015 at 6:18 AM, xavi jmlucjav jmluc...@gmail.com wrote: Hi guys, someone I work with has been advised that currently Solr can support an 'infinite' number of fields. I thought there was a practical limitation of say thousands of fields (for sure less than a million), or things can start to break (I think I remember seeing memory issues reported on the mailing list by several people). Was there any change I missed lately that makes having say 1M fields in Solr practical?? thanks
Re: any changes about limitations on huge number of fields lately?
On Sat, May 30, 2015 at 11:15 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: xavi jmlucjav jmluc...@gmail.com wrote: I think the plan is to facet only on class_u1, class_u2 for queries from user1, etc. So faceting would not happen on all fields on a single query. I understand that, but most of the created structures stay in memory between calls (DocValues helps here). Your heap will slowly fill up as more and more users perform faceted queries on their content. got it...priceless info, thanks! - Toke Eskildsen
Re: Storing positions and offsets vs FieldType IndexOptions DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
I also met the same problem, could you tell me why? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Storing-positions-and-offsets-vs-FieldType-IndexOptions-DOCS-AND-FREQS-AND-POSITIONS-AND-OFFSETS-tp4061354p4208875.html Sent from the Solr - User mailing list archive at Nabble.com.
Issues in compiling Solr code.
Hi All, I am trying to build and compile Solr. I have been following the below link: https://wiki.apache.org/solr/HowToCompileSolr I have taken the latest version of code and have run the ant clean compile command, followed by ant dist. Both the steps were successful, but no war was created (as per the wiki, a war should be created which should be copied to the ...\LuceneSolr\solr\example\webapps folder; I cannot find any webapps folder under the example folder either!!). Can someone please help me and let me know what steps should be taken to bring up the server, and am I missing something here? In case there's an updated wiki link, please redirect me to the same. Appreciate the help. Cheers, Aniket
Optimal FS block size for small documents in Solr?
Hi, What would be an optimal FS block size to use? Using Solr 4.7.2, I have a RAID-5 of SSD drives currently configured with a 128KB block size. Can I expect better indexing/query time performance with a smaller block size (say 8K)? Considering my documents are almost always smaller than 8K. I assume all stored fields would fit into one block which is good, but what will Lucene prefer for reading a long posting list and other data structures? Any rules of thumb or anyone that had experimented on this?
Re: Deleting Fields
Thank you Erick. I was thinking that it actually went through and removed the index data; thank you for the clarification. What happened was I had some bad data that created a lot of fields (some 8000). I was getting some errors adding new fields where solr could not talk to zookeeper, and I thought it may be because there are so many fields. The index size is some 420 million docs. I'm hesitant to try to re-create as when the shards crash, they leave a write.lock file in HDFS, and I need to manually delete that file (on 27 machines) before bringing them back up. I believe this is the stack trace - but this looks to be related to facets, and I'm not 100% sure that this is the correct trace! Sorry - if it happens again I will update.
ERROR - 2015-05-29 20:39:34.707; [UNCLASS shard9 core_node14 UNCLASS] org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:854)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:463)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:368)
  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
  at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
  at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
  at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
  at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:714)
  at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
  at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:637)
  at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:280)
  at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:106)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:222)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)
  ... 26 more
Then later:
ERROR - 2015-05-29 21:57:22.370; [UNCLASS shard9 core_node14 UNCLASS] org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:854)
  at
Re: How To: Debuging the whole indexing process
Please help me here. With Regards Aman Tandon On Sat, May 30, 2015 at 12:43 AM, Aman Tandon amantandon...@gmail.com wrote: Thanks Alex, yes, it is for my testing, to understand the code/process flow actually. Any other ideas? With Regards Aman Tandon On Fri, May 29, 2015 at 12:48 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: In production or in test? I assume in test. This level of detail usually implies some sort of Java debugger and java instrumentation enabled. E.g. Chronon, which is commercial but can be tried as a plugin with the IntelliJ IDEA full version trial. Regards, Alex On 29 May 2015 4:38 pm, Aman Tandon amantandon...@gmail.com wrote: Hi, I want to debug the whole indexing process, the life cycle of the indexing process (each and every function call, going from function to function), from the posting of the data.xml to the creation of the various index files (_fnm, _fdt, etc.). So how/what should I set up and start? Please help. I will be thankful to you.
<add>
  <doc>
    <field name="title"><![CDATA[Aman Tandon]]></field>
    <field name="job_role"><![CDATA[Search Engineer]]></field>
  </doc>
</add>
With Regards Aman Tandon
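One common way to get the function-by-function view Aman is asking for is to start the Solr JVM with the standard JDWP debug agent and attach an IDE debugger before posting data.xml. The exact port and breakpoint classes below are assumptions for illustration, not something recommended in the thread.

    # Assumes the Solr 4.x example started via "java -jar start.jar"; port 18983 is arbitrary.
    java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=18983 -jar start.jar

    # Attach a remote debugger from your IDE to port 18983, then set breakpoints in,
    # for example, org.apache.solr.update.DirectUpdateHandler2 (Solr side) and
    # org.apache.lucene.index.IndexWriter (Lucene side), and post data.xml.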
Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Wow, thanks both for the suggestions. Erick: good point on the uneven shard load. I'm not worried about the growth of a particular shard; in that case I'd use shard splitting and, if necessary, add a server to the cluster. But even if I manage to spread docs of a typeA producer evenly on the cluster I could have an uneven query distribution (the two problems are very similar): at time t I could have a shard queried by 11 type A producers while another shard is being queried by a single type A producer, which is not ideal. So I could use a few bits (0 or 1) of the composite id for typeA producers' docs to avoid those kinds of problems. For typeB and typeC producers the problems discussed above seem unlikely, so I'd like to weigh the pros and cons of sharding on userid.
Pros:
- I'm reducing the size of the problem: instead of searching across the whole repository I'm searching only a part of it.
Cons:
- I could have uneven distribution of documents and queries across the cluster (unlikely, there are lots of users of typeB, typeC).
- Docs for one user aren't searched in parallel using more shards; this could be useful if one user produces so many docs as to benefit from sharding (should happen only for typeA).
I think the pro is appealing; under these hypotheses, if users of type B, C increase I can scale the system without many concerns. Do you agree? On 29 May 2015, at 20:18, Reitzel, Charles wrote: Thanks, Erick. I appreciate the sanity check. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, May 28, 2015 5:50 PM To: solr-user@lucene.apache.org Subject: Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting Charles: You raise good points, and I didn't mean to say that co-locating docs due to some criteria was never a good idea. That said, it does add administrative complexity that I'd prefer to avoid unless necessary. I suppose it largely depends on what the load and response SLAs are. If there's 1 query/second peak load, the sharding overhead for queries is probably not noticeable. If there are 1,000 QPS, then it might be worth it. Measure, measure, measure.. I think your composite ID understanding is fine. Best, Erick On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: We have used a similar sharding strategy for exactly the reasons you say. But we are fairly certain that the # of documents per user ID is 5000 and, typically, 500. Thus, we think the overhead of distributed searches clearly outweighs the benefits. Would you agree? We have done some load testing (with 100's of simultaneous users) and performance has been good with data and queries distributed evenly across shards. In Matteo's case, this model appears to apply well to user types B and C. Not sure about user type A, though. At 100,000 docs per user per year, on average, that load seems ok for one node. But, is it enough to benefit significantly from a parallel search? With a 2 part composite ID, each part will contribute 16 bits to a 32 bit hash value, which is then compared to the set of hash ranges for each active shard. Since the user ID will contribute the high-order bytes, it will dominate in matching the target shard(s). But dominance doesn't mean the lower order 16 bits will always be ignored, does it? I.e. if the original shard has been split, perhaps multiple times, isn't it possible that one user ID's documents will be spread over multiple shards?
In Matteo's case, it might make sense to specify fewer bits for the user ID for user category A. I.e. what I described above is the default for userId!docId. But if you use userId/8!docId/24 (8 bits for userId and 24 bits for the document ID), then couldn't one user's docs be split over multiple shards, even without splitting? I'm just making sure I understand how composite ID sharding works correctly. Have I got it right? Has any of this logic changed in 5.x? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, May 21, 2015 11:30 AM To: solr-user@lucene.apache.org Subject: Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting I question your base assumption: bq: So shard by document producer seems a good choice Because what this _also_ does is force all of the work for a query onto one node and all indexing for a particular producer ditto. And will cause you to manually monitor your shards to see if some of them grow out of proportion to others. And I think it would be much less hassle to just let Solr distribute the docs as it may based on the uniqueKey and forget about it. Unless you want, say, to do joins etc There
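For readers new to the syntax under discussion, composite ID routing is expressed entirely in the document ID. A hedged illustration of the two forms mentioned above (the IDs and field values are invented for the example):

    <!-- Default two-part routing: the top 16 bits of the routing hash come from
         "user42", so that user's documents cluster in the same hash range. -->
    <doc>
      <field name="id">user42!doc123</field>
    </doc>

    <!-- Bit-weighted form discussed above: giving the user part fewer bits lets a
         heavy producer's documents spread over more of the hash range, and thus
         potentially over more than one shard. -->
    <doc>
      <field name="id">user42/8!doc123</field>
    </doc>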
Unsubscribe me
Unsubscribe me
Re: Unsubscribe me
Quoting Erik from two days ago: Please follow the instructions here: http://lucene.apache.org/solr/resources.html. Be sure to use the exact same e-mail you used to subscribe. On May 30, 2015, at 6:07 AM, Lalit Kumar 4 lkum...@sapient.com wrote: Please unsubscribe me as well On May 30, 2015 15:23, Neha Jatav neha.ja...@gmail.com wrote: Unsubscribe me
Re: Unsubscribe me
Please unsubscribe me as well On May 30, 2015 15:23, Neha Jatav neha.ja...@gmail.com wrote: Unsubscribe me
Re: Deleting Fields
Hi Joseph, On May 30, 2015, at 8:18 AM, Joseph Obernberger j...@lovehorsepower.com wrote: Thank you Erick. I was thinking that it actually went through and removed the index data; thank you for the clarification. I added more info to the Schema API page about this not being true. Here’s what I’ve got so far - let me know if you think we should add more warnings about this: - Re-index after schema modifications! If you modify your schema, you will likely need to re-index all documents. If you do not, you may lose access to documents, or not be able to interpret them properly, e.g. after replacing a field type. Modifying your schema will never modify any documents that are already indexed. Again, you must re-index documents in order to apply schema changes to them. […] When modifying the schema with the API, a core reload will automatically occur in order for the changes to be available immediately for documents indexed thereafter. Previously indexed documents will not be automatically handled - they must be re-indexed if they used schema elements that you changed. - Steve
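For context, the kind of Schema API call Steve is referring to looks roughly like the following; the collection and field names are invented for the example, and per the warning above, documents indexed before such a change still have to be re-indexed:

    curl -X POST -H 'Content-type:application/json' \
      http://localhost:8983/solr/mycollection/schema -d '{
        "replace-field": { "name": "price", "type": "tfloat", "stored": true }
      }'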
any changes about limitations on huge number of fields lately?
Hi guys, someone I work with has been advised that currently Solr can support an 'infinite' number of fields. I thought there was a practical limitation of say thousands of fields (for sure less than a million), or things can start to break (I think I remember seeing memory issues reported on the mailing list by several people). Was there any change I missed lately that makes having say 1M fields in Solr practical?? thanks
Re: Issues in compiling Solr code.
On 5/30/2015 1:59 AM, Aniket Kumar wrote: Hi All, I am trying to build and compile Solr. I have been following the below link: https://wiki.apache.org/solr/HowToCompileSolr I have taken the latest version of code and have run the ant clean compile command, followed by ant dist. Both the steps were successful, but no war was created (as per the wiki, a war should be created which should be copied to the ...\LuceneSolr\solr\example\webapps folder; I cannot find any webapps folder under the example folder either!!). As of Solr 5.0, things have changed. Solr is still a war, but the war is embedded in the example, and the example has changed quite a lot. Run ant example instead of ant dist and then you will find that you can start solr by typing bin/solr start or bin\solr start on Windows. You might also consider typing ant package which will create entire .zip and .tgz packages similar to what you download from the website. Here's the beginnings of a discussion about why we are changing the packaging: https://wiki.apache.org/solr/WhyNoWar Our startup infrastructure and documentation still need work, especially in the parts describing how to take Solr to production. When I find some free time, I will be doing what I can to make this better. Thanks, Shawn
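Putting Shawn's answer together as a command sequence (the targets are the ones he names; the standard lucene-solr checkout layout is assumed):

    # From the solr/ directory of a lucene-solr checkout:
    ant example      # builds Solr and assembles the runnable example
    bin/solr start   # starts Solr (bin\solr start on Windows)

    # Or build full .zip/.tgz distribution packages:
    ant package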
Re: docValues: Can we apply synonym
What I'm suggesting is that you have two fields, one for searching, one for faceting. You may find you can't use docValues for your field type, in which case Solr will just use caches to improve faceting performance. Upayavira On Sat, May 30, 2015, at 01:50 AM, Aman Tandon wrote: Hi Upayavira, How will copyField help in my scenario when I have to add the synonym in a docValues-enabled field? With Regards Aman Tandon On Sat, May 30, 2015 at 1:18 AM, Upayavira u...@odoko.co.uk wrote: Use copyField to clone the field for faceting purposes. Upayavira On Fri, May 29, 2015, at 08:06 PM, Aman Tandon wrote: Hi Erick, Thanks for the suggestion. We are using this query parser plugin (SynonymExpandingExtendedDismaxQParserPlugin) to manage multi-word synonyms. So it does work slower than edismax, that's why it is not in contrib, right? (I am asking this question because we are using it for all our searches to handle 10 multiword synonyms: ice cube, icecube, etc.) Moreover I thought of a solution for this docValues problem: I need to make the city field multivalued, and by this I mean I will add the synonym (mumbai, bombay) as an extra value to that field if present. Now the searching operation will work fine as before.
<field name="city">mumbai</field>
<field name="city">bombay</field>
The only prob is that we have to remove the 'city alias/synonym' facets when we are providing results to the clients (mumbai, 1000). With Regards Aman Tandon On Fri, May 29, 2015 at 7:26 PM, Erick Erickson erickerick...@gmail.com wrote: Do take time for performance testing with that parser. It can be slow depending on your data as I remember. That said it solves the problem it set out to solve so if it meets your SLAs, it can be a life-saver. Best, Erick On Fri, May 29, 2015 at 2:35 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Even if a little bit outdated, that query parser is really really cool to manage synonyms ! +1 ! 2015-05-29 1:01 GMT+01:00 Aman Tandon amantandon...@gmail.com: Thanks Chris. Yes, we are using it for handling the multiword synonym problem. With Regards Aman Tandon On Fri, May 29, 2015 at 12:38 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Again, I would recommend using Nolan Lawson's SynonymExpandingExtendedDismaxQParserPlugin. http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Wednesday, May 27, 2015 6:42 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Ok, and what synonym processor are you talking about? Maybe it could help? With Regards Aman Tandon On Thu, May 28, 2015 at 4:01 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Sorry, my bad. The synonym processor I mention works differently. It's an extension of the EDisMax query processor and doesn't require field level synonym configs. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Wednesday, May 27, 2015 6:12 PM To: solr-user@lucene.apache.org Subject: RE: docValues: Can we apply synonym But the query analysis isn't on a specific field, it is applied to the query string. -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Wednesday, May 27, 2015 6:08 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Hi Charles, The problem here is that docValues works only with primitive data types like String, int, etc. So how could we apply synonyms on a primitive data type?
With Regards Aman Tandon On Thu, May 28, 2015 at 3:19 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Is there any reason you cannot apply the synonyms at query time? Applying synonyms at indexing time has problems, e.g. polluting the term frequency for synonyms added, preventing distance queries, ... Since city names often have multiple terms, e.g. New York, Den Hague, etc., I would recommend using Nolan Lawson's SynonymExpandingExtendedDismaxQParserPlugin. Tastes great, less filling. http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ We found this to fix synonyms like ny for New York and vice versa. Haven't tried it with docValues, tho. -Original Message- From: Aman Tandon
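A hedged sketch of the two-field arrangement Upayavira suggests (the field and type names are invented; the synonym-aware field type is assumed to exist in the schema): search against an analyzed field whose analyzer applies synonyms, and facet on a separate docValues-backed string copy. Because copyField copies the raw input value, the facet values stay un-analyzed.

    <!-- Illustrative schema.xml snippet; "text_syn" is assumed to be a fieldType
         whose analyzer chain includes SynonymFilterFactory. -->
    <field name="city"       type="text_syn" indexed="true"  stored="true"/>
    <field name="city_facet" type="string"   indexed="false" stored="false" docValues="true"/>
    <copyField source="city" dest="city_facet"/>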
Re: Optimal FS block size for small documents in Solr?
On Sat, May 30, 2015, at 09:51 AM, Gili Nachum wrote: Hi, What would be an optimal FS block size to use? Using Solr 4.7.2, I have a RAID-5 of SSD drives currently configured with a 128KB block size. Can I expect better indexing/query time performance with a smaller block size (say 8K)? Considering my documents are almost always smaller than 8K. I assume all stored fields would fit into one block which is good, but what will Lucene prefer for reading a long posting list and other data structures? Any rules of thumb or anyone that had experimented on this? I'm gonna start this response with the observation that I don't know anything about the topic you are asking about. So, with that out of the way, a Lucene index is write-once, that is, when you do a commit, all of the data that makes up your index is written to disk - that is, all documents making up a single commit are written into a set of files, making a segment. Therefore, it isn't the size of a document that matters, more the number and size of documents making up a single commit. There's a lot more to it too, e.g. whether fields are stored, how they are analysed, etc. You could do a simple experiment. Write a little app that pushes docs to Solr and commits, then look at the file sizes on disk. Then repeat with more documents, and see what impact that has on file sizes. I suspect you can answer your question relatively easily. Upayavira
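A sketch of the little experiment Upayavira suggests, assuming SolrJ 4.x and a core named "test" on the default port (both assumptions), so you can compare index file sizes for different commit sizes:

    import java.util.UUID;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitSizeExperiment {
        public static void main(String[] args) throws Exception {
            // Assumed core URL; adjust for your install. Schema is assumed to have "id" and "title".
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/test");
            int docsPerCommit = args.length > 0 ? Integer.parseInt(args[0]) : 1000; // vary between runs
            for (int i = 0; i < docsPerCommit; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("title", "small document number " + i);
                server.add(doc);
            }
            server.commit(); // one commit produces a new segment; now inspect the index directory
            server.shutdown();
        }
    }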
Re: Optimal FS block size for small documents in Solr?
On 5/30/2015 2:51 AM, Gili Nachum wrote: Hi, What would be an optimal FS block size to use? Using Solr 4.7.2, I have an RAID-5 of SSD drives currently configured with a 128KB block size. Can I expect better indexing/query time performance with a smaller block size (say 8K)? Considering my documents are almost always smaller than 8K. I assume all stored fields would fit into one block which is good, but what will Lucene prefer for reading a long posting list and other data structures. Generally speaking, RAID levels that use striping should have the largest block size you can make, which for most modern RAID controllers is 1MB or 2MB. When you make the stripe size very small, reading and writing even small files requires hitting all the disks. With large stripes, accessing data randomly is more likely to have one read hit one disk while another read hits another disk. For Lucene/Solr, there might be benefits to smaller block sizes, but I believe that they might cause more problems than they solve. There are some additional things to think about: If your server has its memory appropriately sized, then you will have enough RAM to let your operating system cache your index entirely. For queries, you will only rarely be hitting the disk ... so disk speed and layout don't matter much at all, and you will only need to be concerned about *write* speed for indexing. RAID levels 3 through 6 (and any derivations like level 50) are *horrible* if there is very much write activity -- for a Solr install, that means indexing, and to a slightly lesser extent, logging. When you write to a RAID5 array, you slow *everything* down. Even *reads* that happen at the same time as writes are strongly affected by those writes. It is the nature of RAID5. If your system is entirely read-only, then RAID5 is awesome ... but RAID10 is better. RAID10 *is* initially more expensive than RAID5 ... but the performance and reliability benefits are completely worth the additional expense. Additional reading material below. I do highly recommend reading at least the first link: http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt http://www.baarf.com/ The RAID10 stripe size should be at least 1MB if your controller supports blocks that large. Thanks, Shawn