Re: Very long young generation stop the world GC pause
Also curious why such a large heap is required... If it's due to field caches being loaded I'd highly recommend MMapDirectory (if not using it already) and turning on DocValues for all fields you plan to sort/facet/run analytics on. steve On Wed, Dec 21, 2016 at 9:25 AM Pushkar Raste wrote: > You should probably have as small a swap as possible. I still feel long GCs > are either due to swapping or thread contention. > > Did you try to remove all other G1GC tuning parameters except for > ParallelRefProcEnabled? > > On Dec 19, 2016 1:39 AM, "forest_soup" wrote: > > > Sorry for my wrong memory. The swap is 16GB.
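For background, DocValues is a per-field attribute in schema.xml and only takes effect after a full reindex; a minimal sketch with a hypothetical field name:

<field name="category" type="string" indexed="true" stored="true" docValues="true"/>

With docValues enabled, sorting, faceting, and analytics read column-oriented structures from disk (and the OS page cache) instead of building FieldCache entries on the Java heap, which is what lets the heap shrink.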
Re: SOLR Disk Access Latency Problem
That sounds like some SAN vendor BS if you ask me. Breaking up 300GB into smaller chunks would only be relevant if they were caching entire files, not blocks, and I find that hard to believe. Would be interested to know more about the specifics of the problem as the vendor sees it. As Shawn said, local attached storage (preferably SSD) is the way to go.. In addition, using MMapDirectory with lots of RAM will give the best performance. My rule of thumb is to keep a maximum 4:1 ratio between index size and amount of RAM on a box. steve On Wed, Sep 21, 2016 at 8:08 PM Shawn Heisey wrote: > On 9/21/2016 7:52 AM, Kyle Daving wrote: > > We are currently running solr 5.2.1 and attempted to upgrade to 6.2.1. > > We attempted this last week but ran into disk access latency problems > > so reverted back to 5.2.1. We found that after upgrading we overran > > the NVRAM on our SAN and caused a fairly large queue depth for disk > > access (we did not have this problem in 5.2.1). We reached out to our > > SAN vendor and they said that it was due to the size of our optimized > > indexes. It is not uncommon for us to have roughly 300GB single file > > optimized indexes. Our SAN vendor advised that splitting the index > > into smaller fragmented chunks would alleviate the NVRAM/queue depth > > problem. > > How is this filesystem presented to the server? Is it a block device > using a protocol like iSCSI, or is it a network filesystem, like NFS or > SMB? Block devices will appear to the OS as if they are a > completely local filesystem, and local machine memory will be used to > cache data. Network filesystems will usually require memory on the > storage device for caching, and typically those machines do not have a > lot of memory compared to the amount of storage space they have. > > > Why do we not see this problem with the same size index in 5.2.1? Did > > solr change the way it accesses disk in v5 vs v6? > > It's hard to say why you didn't have the problem with the earlier version. > > All the index disk access is handled by Lucene, and from Solr's point of > view, it's a black box, with only minimal configuration available. > Lucene is constantly being improved, but those improvements assume the > general best-case installation -- a machine with a local filesystem and > plenty of spare memory to effectively cache the data that filesystem > contains. > > > Is there a configuration file we should be looking at making > > adjustments in? > > Unless we can figure out why there's a problem, this question cannot be > answered. > > > Since everything worked fine in 5.2.1 there has to be something we are > > overlooking when trying to use 6.2.1. Any comments and thoughts are > > appreciated. > > Best guess (which could be wrong): There's not enough memory to > effectively cache the data in the Lucene indexes. A newer version of > Solr generally has *better* performance characteristics than an earlier > version, but *ONLY* if there's enough memory available to effectively > cache the index data, which assures that data can be accessed very > quickly. When the actual disk must be read, access speed will be slow > ... and the problem may get worse with a different version. > > How much memory is in your Solr server, and how much is assigned to the > Java heap for Solr? Are you running more than one Solr instance per > server? > > When you're dealing with a remote filesystem on a SAN, exactly where to > add memory to boost performance will depend on how the filesystem is > being presented.
> > I strongly recommend against using a network filesystem like NFS or SMB > to hold a Solr index. Solr works best when the filesystem is local to > the server and there's plenty of extra memory for caching. The amount > of memory required for good performance with a 300GB index will be > substantial. > > Thanks, > Shawn > >
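For anyone wanting to force MMapDirectory explicitly rather than rely on the default factory selection, the solrconfig.xml setting is:

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>

Note that on 64-bit JVMs the default (solr.NRTCachingDirectoryFactory wrapping the standard factory) already memory-maps the index, so the real lever is leaving enough free RAM outside the Java heap for the OS page cache to hold the hot parts of the index.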
Re: Full re-index without downtime
There are two options as I see it.. 1. Do something like you describe and create a secondary index, index into it, then switch... I personally would create a completely separate SolrCloud alongside my existing one vs. a new core in the same cloud, as you might see some negative impacts on GC caused by the indexing load. 2. Tag each record with a field (eg "generation") that identifies which generation of data a record is from.. when querying, filter on only the generation of data that is complete.. new records get a new generation.. the only problem with this is that changing field types doesn't really work with the same field names.. but if you used dynamic fields instead of static ones, the field name would change anyway, so that isn't a problem. We use both of these patterns in different applications.. steve On Wed, Jul 6, 2016 at 1:27 PM Steven White wrote: > Hi everyone, > > In my environment, I have use cases where I need to fully re-index my > data. This happens because Solr's schema requires changes based on changes > made to my data source, the DB. For example, my DB schema may change so > that it now has a whole new set of fields added or removed (on records), or > the data type changed (on fields). When that happens, the only solution I > have right now is to drop the current Solr index, update Solr's schema.xml, > and re-index my data (I use Solr's core admin to dynamically do all this). > > The issue with my current solution is during the re-indexing, which right > now takes 10 hours (expect it to take over 30 hours as my data keeps on > growing), search via Solr is not available. Sure, I can enable search while > the data is being re-indexed, but then I get partial results. > > My question is this: how can I avoid this so there is minimal downtime, > under 1 min.? I was thinking of creating a second core (again dynamically) > and re-index into it (after setting up the new schema) and once the > re-index is fully done, switch over to the new core and drop the index from > the old core and then delete the old core, and rename the new core to the > old core (original core). > > Would the above work or is there a better way to do this? How do you guys > solve this problem? > > Again, my goal is to minimize downtime during re-indexing when Solr's > schema is drastically changed (requiring re-indexing). > > Thanks in advanced. > > Steve
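A sketch of the cut-over in option 1 using collection aliases (hypothetical collection names); CREATEALIAS atomically repoints an alias that already exists, so clients that query the alias never see the switch:

curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2"
curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=products_v1"

Index the new schema's data into products_v2 while the "products" alias still points at products_v1, run the first command when indexing completes, then drop the old collection with the second. For option 2, the query-side filter is just something like fq=generation:7 for whichever generation is currently complete.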
Re: deploy solr on cloud providers
Looking deeper into zookeeper as truth mode, I was wrong about existing replicas being recreated once storage is gone.. Seems there is intent for this type of behavior based upon existing tickets.. We'll look at creating a patch for this too.. Steve On Tue, Jul 5, 2016 at 6:00 PM Tomás Fernández Löbbe wrote: > The leader will do the replication before responding to the client, so let's > say the leader gets to update its local copy, but it's terminated before > sending the request to the replicas, the client should get either an HTTP > 500 or no HTTP response. From the client code you can take action (log, > retry, etc). > The "min_rf" is useful for the case where replicas may be down or not > accessible. Again, you can use this for retrying or take any necessary > action on the client side if the desired rf is not achieved. > > Tomás > > On Tue, Jul 5, 2016 at 11:39 AM, Lorenzo Fundaró < > lorenzo.fund...@dawandamail.com> wrote: > > > @Tomas and @Steven > > > > I am a bit skeptical about these two statements: > > > > If a node just disappears you should be fine in terms of data > > > availability, since Solr in "SolrCloud" replicates the data as it comes > > in > > > (before sending the http response) > > > > > > and > > > > > > > > You shouldn't "need" to move the storage as SolrCloud will replicate > all > > > data to the new node and anything in the transaction log will already > be > > > distributed through the rest of the machines.. > > > > > > because according to the official documentation here > > < > > > https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance > > >: > > (Write side fault tolerant -> recovery) > > > > If a leader goes down, it may have sent requests to some replicas and not > > > others. So when a new potential leader is identified, it runs a sync > > > process against the other replicas. If this is successful, everything > > > should be consistent, the leader registers as active, and normal > actions > > > proceed > > > > > > I think there is a possibility that an update is not sent by the leader > but > > is kept on the local disk, and after it comes up again it can sync the > > non-sent data. > > > > Furthermore: > > > > Achieved Replication Factor > > > When using a replication factor greater than one, an update request may > > > succeed on the shard leader but fail on one or more of the replicas. > For > > > instance, consider a collection with one shard and replication factor > of > > > three. In this case, you have a shard leader and two additional > replicas. > > > If an update request succeeds on the leader but fails on both replicas, > > for > > > whatever reason, the update request is still considered successful from > > the > > > perspective of the client. The replicas that missed the update will > sync > > > with the leader when they recover. > > > > > > They have implemented this parameter called *min_rf* that you can use > > (client-side) to make sure that your update was replicated to at least > one > > replica (e.g.: min_rf > 1). > > > > This is why I'm concerned about moving storage around, because then I know > > when the shard leader comes back, SolrCloud will run a sync process for > those > > documents that couldn't be sent to the replicas. > > > > Am I missing something, or have I misunderstood the documentation? > > > > Cheers !
> > > > > > > > > > > > > > > > On 5 July 2016 at 19:49, Davis, Daniel (NIH/NLM) [C] < daniel.da...@nih.gov > > > > > wrote: > > > > > Lorenzo, this probably comes late, but my systems guys just don't want > to > > > give me real disk. Although RAID-5 or LVM on top of JBOD may be > better > > > than Amazon EBS, Amazon EBS is still much closer to real disk in terms > of > > > IOPS and latency than NFS ;) I even ran a mini test (not an official > > > benchmark), and found the response time for random reads to be better. > > > > > > If you are a young/smallish company, this may be all in the cloud, but > if > > > you are in a large organization like mine, you may also need to allow > for > > > other architectures, such as a "virtual" Netapp in the cloud that > > > communicates with a physical Netapp on-premises, and the > > throughput/latency > > > of that. The most important thing is to actually measure the numbers > > you > > > are getting, both for search and for simply raw I/O, or to get your > > > systems/storage guys to measure those numbers. If you get your > > > systems/storage guys to just measure storage - you will want to care > > about > > > three things for indexing primarily: > > > > > > Sequential Write Throughput > > > Random Read Throughput > > > Random Read Response Time/Latency > > > > > > Hope this helps, > > > > > > Dan Davis, Systems/Applications Architect (Contractor), > > > Office of Computer and Communications Systems, > > > National Library of Medicine, NIH > > > > > > > > > > > > -----Original Message----- > > > From: Lorenzo
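For reference, the min_rf check Lorenzo mentions is just a request parameter on the update, with the achieved factor echoed back; a rough sketch against a hypothetical host and collection:

curl "http://localhost:8983/solr/mycollection/update?min_rf=2&commit=true" -H "Content-Type: application/json" -d '[{"id":"doc1"}]'

The response includes an "rf" value; if rf comes back below min_rf, the update reached fewer replicas than requested and the client should log or retry, exactly as Tomás describes.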
Re: deploy solr on cloud providers
You shouldn't "need" to move the storage as SolrCloud will replicate all data to the new node and anything in the transaction log will already be distributed through the rest of the machines.. One option to keep all your data attached to nodes might be to use Amazon EFS (pretty new) to store your data.. However I've not seen any good perf testing done against it, so not sure how it will scale.. steve On Tue, Jul 5, 2016 at 11:46 AM Lorenzo Fundaró < lorenzo.fund...@dawandamail.com> wrote: > On 5 July 2016 at 15:55, Shawn Heisey wrote: > > > On 7/5/2016 1:19 AM, Lorenzo Fundaró wrote: > > > Hi Shawn. Actually what I'm trying to find out is whether this is the > best > > > approach for deploying Solr in the cloud. I believe SolrCloud solves a > > lot > > > of problems in terms of High Availability but when it comes to storage > > > there seems to be a limitation that can be worked around, of course, but > it's > > a > > > bit cumbersome and I was wondering if there is a better option for this > > or > > > if I'm missing something with the way I'm doing it. I wonder if there > is > > > some proven experience about how to solve the storage problem when > > > deploying in the cloud. Any advice or pointer to some enlightening > > > documentation will be appreciated. Thanks. > > > > When you ask whether "this is the best approach" ... you need to define > > what "this" is. You mention a "storage problem" that needs solving ... > > but haven't actually described that problem in a way that I can > > understand. > > > So, I'm trying to put SolrCloud in a cloud provider where a node can > disappear any time > because of hardware failure. In order to preserve any non-replicated > updates I need to > make the storage of that dead node go to the newly spawned node. I am not > having a problem with this > approach actually, I just want to know if there is a better way of doing > this. I know there is HDFS support that makes > all this easier but this is not an option for me. Thank you and I apologise > for the unclear mails. > > > > > > Let's back up and cover some basics: > > > > What steps are you taking? > > What do you expect (or want) to happen? > > What actually happens? > > > > The answers to these questions need to be very detailed. > > > > Thanks, > > Shawn > > > > > > > -- > > -- > Lorenzo Fundaro > Backend Engineer > E-Mail: lorenzo.fund...@dawandamail.com > > Fax + 49 - (0)30 - 25 76 08 52 > Tel+ 49 - (0)179 - 51 10 982 > > DaWanda GmbH > Windscheidstraße 18 > 10627 Berlin > > Geschäftsführer: Claudia Helming und Niels Nüssler > AG Charlottenburg HRB 104695 B http://www.dawanda.com >
Re: stateless solr ?
The ticket in question is https://issues.apache.org/jira/browse/SOLR-9265 We are working on a patch now... will update when we have a working patch / tests.. Shawn is correct that when adding a new node to a SolrCloud cluster it will not automatically add replicas/etc.. The idea behind this patch though is that you'll be hard-coding the node names (eg node1, node2, etc..) that are normally generated from the host/port of the instance, and as such not actually adding a "new" node but replacing an existing node. In this case Solr will see the existing cores and replicate the data down.. At least in theory.. We will need to verify this assumption with a lot of testing, especially around having multiple nodes with the same node name showing up at the same time (which is something that currently cannot really occur, but can once we do this work).. steve On Tue, Jul 5, 2016 at 9:54 AM Shawn Heisey wrote: > On 7/4/2016 7:46 AM, Lorenzo Fundaró wrote: > > I am trying to run Solr on my infrastructure using docker containers > > and Mesos. My problem is that I don't have a shared filesystem. I have > > a cluster of 3 shards and 3 replicas (9 nodes in total) so if I > > distribute well my nodes I always have 2 fallbacks of my data for > > every shard. Every solr node will store the index in its internal > > docker filesystem. My problem is that if I want to relocate a certain > > node (maybe an automatic relocation because of a hardware failure), I > > need to create the core manually in the new node because it's > > expecting to find the core.properties file in the data folder and of > > course it won't because the storage is ephemeral. Is there a way to > > make a new node join the cluster with no manual intervention ? > > The things you're asking sound like SolrCloud. The rest of this message > assumes that you're running cloud. If you're not, then we may need to > start over. > > When you start a new node, it automatically joins the cluster described > by the Zookeeper database that you point it to. > > SolrCloud will **NOT** automatically create replicas when a new node > joins the cluster. There's no way for SolrCloud to know what you > actually want to use that new node for, so anything that it did > automatically might be completely the wrong thing. > > Once you add a new node, you can replicate existing data to it with the > ADDREPLICA action on the Collections API: > > > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica > > If the original problem was a down node, you might also want to use the > DELETEREPLICA action to delete any replicas on the node that you lost > that are marked down: > > > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9 > > Creating cores manually in your situation is not advisable. The > CoreAdmin API should not be used when you're running in cloud mode. > > Thanks, > Shawn > >
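The two Collections API calls Shawn points to look roughly like this (hypothetical collection, shard, and replica names):

curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1"
curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3"

ADDREPLICA places the new replica on a node of Solr's choosing unless you pass an explicit node parameter.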
Re: stateless solr ?
I don't think that's a bad approach with the sidecar.. We run a huge number of Solr instances (~5k), so adding sidecars for each one adds a lot of extra containers.. What I mean by transition is a container dying and a new one being brought online to replace it.. With the mod we are working on you won't need the sidecar to add cores to the new node via the API and remove the old cores.. A new instance would start up with the same node name and just take over the existing cores (of course it will require replication, but that will happen automatically) Steve On Mon, Jul 4, 2016 at 5:27 PM Upayavira <u...@odoko.co.uk> wrote: > What do you mean by a "transition"? > > Can you configure a sidekick container within your orchestrator? Have a > sidekick always run alongside your SolrCloud nodes? In which case, this > would be an app that does the calling of the API for you. > > Upayavira > > On Mon, 4 Jul 2016, at 08:53 PM, Steven Bower wrote: > > My main issue is having to make any solr collection api calls during a > > transition.. It makes integrating with orchestration engines way more > > complex.. > > On Mon, Jul 4, 2016 at 3:40 PM Upayavira <u...@odoko.co.uk> wrote: > > > > > Are you using Solrcloud? With Solrcloud this stuff is easy. You just > add > > > a new replica for a collection, and the data is added to the new host. > > > > > > I'm working on a demo that will show this all working within Docker and > > > Rancher. I've got some code (which I will open source) that handles > > > config uploads, collection creation, etc. You can add a replica by > > > running a container on the same node as you want the replica to reside, > > > it'll do the rest for you. > > > > > > I've got the Solr bit more or less done, I'm now working on everything > > > else (Dockerised Docker Registry/Jenkins, AWS infra build, etc). > > > > > > Let me know if this is interesting to you. If so, I'll post it here > when > > > I'm done with it. > > > > > > Upayavira > > > > > > On Mon, 4 Jul 2016, at 02:46 PM, Lorenzo Fundaró wrote: > > > > Hello guys, > > > > > > > > I am trying to run Solr on my infrastructure using docker containers > and > > > > Mesos. My problem is that I don't have a shared filesystem. I have a > > > > cluster of 3 shards and 3 replicas (9 nodes in total) so if I > distribute > > > > well my nodes I always have 2 fallbacks of my data for every shard. > Every > > > > solr node will store the index in its internal docker filesystem. My > > > > problem is that if I want to relocate a certain node (maybe an > automatic > > > > relocation because of a hardware failure), I need to create the core > > > > manually in the new node because it's expecting to find the > > > > core.properties > > > > file in the data folder and of course it won't because the storage is > > > > ephemeral. Is there a way to make a new node join the cluster with no > > > > manual intervention ? > > > > > > > > Thanks in advance ! > > > > > > > > > > > > -- > > > > > > > > -- > > > > Lorenzo Fundaro > > > > Backend Engineer > > > > E-Mail: lorenzo.fund...@dawandamail.com > > > > > > > > Fax + 49 - (0)30 - 25 76 08 52 > > > > Tel+ 49 - (0)179 - 51 10 982 > > > > > > > > DaWanda GmbH > > > > Windscheidstraße 18 > > > > 10627 Berlin > > > > > > > > Geschäftsführer: Claudia Helming, Niels Nüssler und Michael Pütz > > > > AG Charlottenburg HRB 104695 B http://www.dawanda.com > > > >
Re: stateless solr ?
My main issue is having to make any Solr collection API calls during a transition.. It makes integrating with orchestration engines way more complex.. On Mon, Jul 4, 2016 at 3:40 PM Upayavira wrote: > Are you using Solrcloud? With Solrcloud this stuff is easy. You just add > a new replica for a collection, and the data is added to the new host. > > I'm working on a demo that will show this all working within Docker and > Rancher. I've got some code (which I will open source) that handles > config uploads, collection creation, etc. You can add a replica by > running a container on the same node as you want the replica to reside, > it'll do the rest for you. > > I've got the Solr bit more or less done, I'm now working on everything > else (Dockerised Docker Registry/Jenkins, AWS infra build, etc). > > Let me know if this is interesting to you. If so, I'll post it here when > I'm done with it. > > Upayavira > > On Mon, 4 Jul 2016, at 02:46 PM, Lorenzo Fundaró wrote: > > Hello guys, > > > > I am trying to run Solr on my infrastructure using docker containers and > > Mesos. My problem is that I don't have a shared filesystem. I have a > > cluster of 3 shards and 3 replicas (9 nodes in total) so if I distribute > > well my nodes I always have 2 fallbacks of my data for every shard. Every > > solr node will store the index in its internal docker filesystem. My > > problem is that if I want to relocate a certain node (maybe an automatic > > relocation because of a hardware failure), I need to create the core > > manually in the new node because it's expecting to find the > > core.properties > > file in the data folder and of course it won't because the storage is > > ephemeral. Is there a way to make a new node join the cluster with no > > manual intervention ? > > > > Thanks in advance ! > > > > > > -- > > > > -- > > Lorenzo Fundaro > > Backend Engineer > > E-Mail: lorenzo.fund...@dawandamail.com > > > > Fax + 49 - (0)30 - 25 76 08 52 > > Tel+ 49 - (0)179 - 51 10 982 > > > > DaWanda GmbH > > Windscheidstraße 18 > > 10627 Berlin > > > > Geschäftsführer: Claudia Helming, Niels Nüssler und Michael Pütz > > AG Charlottenburg HRB 104695 B http://www.dawanda.com >
Re: stateless solr ?
We have been working on some changes that should help with this.. The 1st challenge is having the node name remain static regardless of where the node runs (right now it uses host and port, so this won't work unless you are using some sort of tunneled or dynamic networking).. We have a patch we are working on for this.. Once this is in place, and you use the "zookeeper is truth" mode for SolrCloud, this should seamlessly transition into the new node (and replicate).. Will update with the ticket number as I forget it offhand Steve On Mon, Jul 4, 2016 at 9:47 AM Lorenzo Fundaró < lorenzo.fund...@dawandamail.com> wrote: > Hello guys, > > I am trying to run Solr on my infrastructure using docker containers and > Mesos. My problem is that I don't have a shared filesystem. I have a > cluster of 3 shards and 3 replicas (9 nodes in total) so if I distribute > well my nodes I always have 2 fallbacks of my data for every shard. Every > solr node will store the index in its internal docker filesystem. My > problem is that if I want to relocate a certain node (maybe an automatic > relocation because of a hardware failure), I need to create the core > manually in the new node because it's expecting to find the core.properties > file in the data folder and of course it won't because the storage is > ephemeral. Is there a way to make a new node join the cluster with no > manual intervention ? > > Thanks in advance ! > > > -- > > -- > Lorenzo Fundaro > Backend Engineer > E-Mail: lorenzo.fund...@dawandamail.com > > Fax + 49 - (0)30 - 25 76 08 52 > Tel+ 49 - (0)179 - 51 10 982 > > DaWanda GmbH > Windscheidstraße 18 > 10627 Berlin > > Geschäftsführer: Claudia Helming, Niels Nüssler und Michael Pütz > AG Charlottenburg HRB 104695 B http://www.dawanda.com >
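For readers searching later: the "zookeeper is truth" behavior referenced above is toggled via the legacyCloud cluster property (assuming a default local install):

curl "http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=legacyCloud&val=false"

With legacyCloud=false, the cluster state in ZK is authoritative, so a core found on local disk that ZK does not know about is not auto-registered into the cluster.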
Re: Solr cross core join special condition
commenting so this ends up in Dennis' inbox.. On Tue, Oct 13, 2015 at 7:17 PM Yonik Seeley wrote: > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal wrote: > > I developed a join transformer plugin that did that (although it didn't > > flatten the results like that). The one thing that was painful about it > is > > that the TextResponseWriter has references to both the IndexSchema and > > SolrReturnFields objects for the primary core. So when you add a > > SolrDocument from another core it returned the wrong fields. > > We've made some progress on this front in trunk: > > * SOLR-7957: internal/expert - ResultContext was significantly changed > and expanded > to allow for multiple full query results (DocLists) per Solr request. > TransformContext was rendered redundant and was removed. (yonik) > > So ResultContext now has its own searcher, ReturnFields, etc. > > -Yonik >
SolrCloud with local configs
Is it possible to run in cloud mode with zookeeper managing collections/state/etc.. but to read all config files (solrconfig, schema, etc..) from local disk? Obviously this implies that you'd have to keep them in sync.. My thought here is of running Solr in a docker container, but instead of having to manage schema changes/etc via zk I can just build the config into the container.. and then just produce a new docker image with a solr version and the new config and just do rolling restarts of the containers.. Thanks, Steve
Re: Spatial maxDistErr changes
Thanks... I noticed that.. I tried to send a mail to your Mitre address and it got returned... Not sure if you've locked something new down, but if you are interested we are looking to hire for our search team at Bloomberg LP steve On Wed, Apr 2, 2014 at 11:20 AM, David Smiley dsmi...@apache.org wrote: Good question Steve, You'll have to re-index right off. ~ David p.s. Sorry I didn't reply sooner; I just switched jobs and reconfigured my mailing list subscriptions Steven Bower wrote: If I am only indexing point shapes and I want to change the maxDistErr from 0.000009 (1m res) to 0.00045, will this break (as in, searches stop working), or will search work but any performance gain won't be seen until all docs are reindexed? Or will I have to reindex right off? thanks, steve - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book Independent Lucene/Solr search consultant, http://www.linkedin.com/in/davidwsmiley
Spatial maxDistErr changes
If I am only indexing point shapes and I want to change the maxDistErr from 0.000009 (1m res) to 0.00045, will this break (as in, searches stop working), or will search work but any performance gain won't be seen until all docs are reindexed? Or will I have to reindex right off? thanks, steve
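For context, the change under discussion is the maxDistErr attribute on the RPT field type, roughly (mirroring the config style used elsewhere in this archive):

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/> <!-- ~1m grid -->
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.00045" units="degrees"/> <!-- ~50m grid, fewer cells per point -->

A coarser maxDistErr means fewer prefix-tree cells per indexed point, which is where the performance gain comes from; and as David confirms in the reply above, the grid layout is baked into the index, so a full reindex is required.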
IDF maxDocs / numDocs
I am noticing that maxDocs is consistently different between replicas, and since it is used in the idf calculation, idf scores for the same query/doc differ between replicas. Obviously an optimize can normalize the maxDocs values, but that is only temporary.. is there a way to have idf use numDocs instead (as it should be consistent across replicas)? thanks, steve
Re: IDF maxDocs / numDocs
My problem is that both maxDoc() and docCount() report documents that have been deleted in their values. Because of merging/etc.. those numbers can be different per replica (or at least that is what I'm seeing). I need a value that is consistent across replicas... I see in the comment it makes mention of not using IndexReader.numDocs(), but there doesn't seem to be a way to get ahold of the IndexReader within a similarity implementation (as only TermStats, CollectionStats are passed in, and neither contains a ref to the reader) I am contemplating just using a static value for the number of docs, as this won't change dramatically very often.. steve On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in idfExplain but there's also a docCount(). We use docCount in all our custom similarities, also because it allows you to have multiple languages in one index where one is much larger than the other. The small language will have very high IDF scores using maxDoc but they are proportional enough using docCount(). Using docCount() also fixes SolrCloud ranking problems, unless one of your replica's becomes inconsistent ;) https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29 -----Original message----- From: Steven Bower smb-apa...@alcyon.net Sent: Wednesday 12th March 2014 16:08 To: solr-user solr-user@lucene.apache.org Subject: IDF maxDocs / numDocs I am noticing that maxDocs is consistently different between replicas, and since it is used in the idf calculation, idf scores for the same query/doc differ between replicas. Obviously an optimize can normalize the maxDocs values, but that is only temporary.. is there a way to have idf use numDocs instead (as it should be consistent across replicas)? thanks, steve
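A minimal sketch of the docCount() approach Markus describes, written against the Lucene 4.x API (treat exact signatures as assumptions for your version):

import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Feed docCount() rather than maxDoc() into the idf formula. Note that
// docCount() still reflects deleted-but-unmerged docs, so it is per-field
// consistent but not deletion-proof; it can also return -1 on codecs that
// don't store it, which real code should guard against.
public class DocCountSimilarity extends DefaultSimilarity {
  @Override
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    long df = termStats.docFreq();
    long docCount = collectionStats.docCount(); // docs with a value for this field
    float idf = idf(df, docCount);
    return new Explanation(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
  }
}

It would be registered in schema.xml with something like <similarity class="com.example.DocCountSimilarity"/> (hypothetical package name).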
Re: Issue with spatial search
great.. that worked! What does distErrPct actually control, besides the error percentage? Or, maybe better put, how does it impact perf? steve On Mon, Mar 10, 2014 at 11:17 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Correct, Steve. Alternatively you can also put this option in your query after the end of the last parenthesis, as in this example from the wiki: fq=geo:IsWithin(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0 ~ David Steven Bower wrote: Only points in the index.. Am I correct this won't require a reindex? On Monday, March 10, 2014, Smiley, David W. dsmi...@mitre.org wrote: Hi Steven, Set distErrPct to 0 in order to get non-point shapes to always be as accurate as maxDistErr. Point shapes are always that accurate. As long as you only index points, not other shapes (you don't index polygons, etc.) then distErrPct of 0 should be fine. In fact, perhaps a future Solr version should simply use 0 as the default; the last time I did benchmarks the impact of a higher distErrPct was pretty marginal. It's a fairly different story if you are indexing non-point shapes. ~ David From: Steven Bower smb-apa...@alcyon.net Reply-To: solr-user@lucene.apache.org Date: Monday, March 10, 2014 at 4:23 PM To: solr-user@lucene.apache.org Subject: Re: Issue with spatial search Minor edit to the KML to adjust color of polygon On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.net wrote: I am seeing an error when doing a spatial search where a particular point is showing up within a polygon, but by all methods I've tried that point is not within the polygon.. First the point is: 41.2299,29.1345 (lat/lon) The polygon is: 31.2719,32.283 31.2179,32.3681 31.1333,32.3407 30.9356,32.6318 31.0707,34.5196 35.2053,36.9415 37.2959,36.6339 40.8334,30.4273 41.1622,29.1421 41.6484,27.4832 47.0255,13.6342 43.9457,3.17525 37.0029,-5.7017 35.7741,-5.57719 34.801,-4.66201 33.345,10.0157 29.6745,18.9366 30.6592,29.1683 31.2719,32.283 The geo field we are using has this config: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.000009" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" units="degrees"/> The config is basically the same as the one from the docs... The query I am issuing is this: location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719))) and it brings back a result where the location field is 41.2299,29.1345 I've attached a KML with the polygon and the point and you can see from that, visually, that the point is not within the polygon. I also tried in the Google Maps API but after playing around realized that the polygons in Maps are drawn in Euclidean space while the map itself is a Mercator projection.. Loading the KML in Google Earth fixes this issue but the point still lies outside the polygon..
The distance between the edge of the polygon closest to the point and the point itself is ~1.2 miles, which is much larger than the 1-meter accuracy given by the maxDistErr (per the docs). Any thoughts on this? Thanks, Steve - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
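In schema form, the fix David describes is just distErrPct="0" on the field type (the per-query override shown above is the equivalent for a single request):

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" distErrPct="0" maxDistErr="0.000009" units="degrees"/>

Since only points are indexed here, and points are always indexed at maxDistErr precision, this change affects query-shape gridding only and needs no reindex, matching David's answer.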
Issue with spatial search
I am seeing an error when doing a spatial search where a particular point is showing up within a polygon, but by all methods I've tried that point is not within the polygon.. First the point is: 41.2299,29.1345 (lat/lon) The polygon is: 31.2719,32.283 31.2179,32.3681 31.1333,32.3407 30.9356,32.6318 31.0707,34.5196 35.2053,36.9415 37.2959,36.6339 40.8334,30.4273 41.1622,29.1421 41.6484,27.4832 47.0255,13.6342 43.9457,3.17525 37.0029,-5.7017 35.7741,-5.57719 34.801,-4.66201 33.345,10.0157 29.6745,18.9366 30.6592,29.1683 31.2719,32.283 The geo field we are using has this config: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.000009" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" units="degrees"/> The config is basically the same as the one from the docs... The query I am issuing is this: location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719))) and it brings back a result where the location field is 41.2299,29.1345 I've attached a KML with the polygon and the point and you can see from that, visually, that the point is not within the polygon. I also tried in the Google Maps API but after playing around realized that the polygons in Maps are drawn in Euclidean space while the map itself is a Mercator projection.. Loading the KML in Google Earth fixes this issue but the point still lies outside the polygon.. The distance between the edge of the polygon closest to the point and the point itself is ~1.2 miles, which is much larger than the 1-meter accuracy given by the maxDistErr (per the docs). Any thoughts on this? Thanks, Steve solr_map_issue.kml Description: application/vnd.google-earth.kml
Re: Issue with spatial search
Minor edit to the KML to adjust the color of the polygon On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.net wrote: I am seeing an error when doing a spatial search where a particular point is showing up within a polygon, but by all methods I've tried that point is not within the polygon.. First the point is: 41.2299,29.1345 (lat/lon) The polygon is: 31.2719,32.283 31.2179,32.3681 31.1333,32.3407 30.9356,32.6318 31.0707,34.5196 35.2053,36.9415 37.2959,36.6339 40.8334,30.4273 41.1622,29.1421 41.6484,27.4832 47.0255,13.6342 43.9457,3.17525 37.0029,-5.7017 35.7741,-5.57719 34.801,-4.66201 33.345,10.0157 29.6745,18.9366 30.6592,29.1683 31.2719,32.283 The geo field we are using has this config: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.000009" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" units="degrees"/> The config is basically the same as the one from the docs... The query I am issuing is this: location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719))) and it brings back a result where the location field is 41.2299,29.1345 I've attached a KML with the polygon and the point and you can see from that, visually, that the point is not within the polygon. I also tried in the Google Maps API but after playing around realized that the polygons in Maps are drawn in Euclidean space while the map itself is a Mercator projection.. Loading the KML in Google Earth fixes this issue but the point still lies outside the polygon.. The distance between the edge of the polygon closest to the point and the point itself is ~1.2 miles, which is much larger than the 1-meter accuracy given by the maxDistErr (per the docs). Any thoughts on this? Thanks, Steve solr_map_issue.kml Description: application/vnd.google-earth.kml
Re: Issue with spatial search
Weirdly, that same point shows up in the polygon below as well, which in the area around the point doesn't intersect with the polygon in my first msg... 29.0454,41.2198 29.2349,41.1826 31.1107,40.9956 38.437,40.7991 41.1616,40.8988 42.1284,42.2141 40.0919,47.8482 30.4169,47.5783 26.9892,43.6459 27.2095,41.5676 29.0454,41.2198 On Mon, Mar 10, 2014 at 4:23 PM, Steven Bower smb-apa...@alcyon.net wrote: Minor edit to the KML to adjust the color of the polygon On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.net wrote: I am seeing an error when doing a spatial search where a particular point is showing up within a polygon, but by all methods I've tried that point is not within the polygon.. First the point is: 41.2299,29.1345 (lat/lon) The polygon is: 31.2719,32.283 31.2179,32.3681 31.1333,32.3407 30.9356,32.6318 31.0707,34.5196 35.2053,36.9415 37.2959,36.6339 40.8334,30.4273 41.1622,29.1421 41.6484,27.4832 47.0255,13.6342 43.9457,3.17525 37.0029,-5.7017 35.7741,-5.57719 34.801,-4.66201 33.345,10.0157 29.6745,18.9366 30.6592,29.1683 31.2719,32.283 The geo field we are using has this config: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.000009" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" units="degrees"/> The config is basically the same as the one from the docs... The query I am issuing is this: location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719))) and it brings back a result where the location field is 41.2299,29.1345 I've attached a KML with the polygon and the point and you can see from that, visually, that the point is not within the polygon. I also tried in the Google Maps API but after playing around realized that the polygons in Maps are drawn in Euclidean space while the map itself is a Mercator projection.. Loading the KML in Google Earth fixes this issue but the point still lies outside the polygon.. The distance between the edge of the polygon closest to the point and the point itself is ~1.2 miles, which is much larger than the 1-meter accuracy given by the maxDistErr (per the docs). Any thoughts on this? Thanks, Steve
Re: Issue with spatial search
Only points in the index.. Am I correct this won't require a reindex? On Monday, March 10, 2014, Smiley, David W. dsmi...@mitre.org wrote: Hi Steven, Set distErrPct to 0 in order to get non-point shapes to always be as accurate as maxDistErr. Point shapes are always that accurate. As long as you only index points, not other shapes (you don't index polygons, etc.) then distErrPct of 0 should be fine. In fact, perhaps a future Solr version should simply use 0 as the default; the last time I did benchmarks the impact of a higher distErrPct was pretty marginal. It's a fairly different story if you are indexing non-point shapes. ~ David From: Steven Bower smb-apa...@alcyon.net Reply-To: solr-user@lucene.apache.org Date: Monday, March 10, 2014 at 4:23 PM To: solr-user@lucene.apache.org Subject: Re: Issue with spatial search Minor edit to the KML to adjust the color of the polygon On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.net wrote: I am seeing an error when doing a spatial search where a particular point is showing up within a polygon, but by all methods I've tried that point is not within the polygon.. First the point is: 41.2299,29.1345 (lat/lon) The polygon is: 31.2719,32.283 31.2179,32.3681 31.1333,32.3407 30.9356,32.6318 31.0707,34.5196 35.2053,36.9415 37.2959,36.6339 40.8334,30.4273 41.1622,29.1421 41.6484,27.4832 47.0255,13.6342 43.9457,3.17525 37.0029,-5.7017 35.7741,-5.57719 34.801,-4.66201 33.345,10.0157 29.6745,18.9366 30.6592,29.1683 31.2719,32.283 The geo field we are using has this config: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.000009" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" units="degrees"/> The config is basically the same as the one from the docs... The query I am issuing is this: location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719))) and it brings back a result where the location field is 41.2299,29.1345 I've attached a KML with the polygon and the point and you can see from that, visually, that the point is not within the polygon. I also tried in the Google Maps API but after playing around realized that the polygons in Maps are drawn in Euclidean space while the map itself is a Mercator projection.. Loading the KML in Google Earth fixes this issue but the point still lies outside the polygon.. The distance between the edge of the polygon closest to the point and the point itself is ~1.2 miles, which is much larger than the 1-meter accuracy given by the maxDistErr (per the docs). Any thoughts on this? Thanks, Steve
Re: core.properties and solr.xml
For us, we don't fully rely on the cloud/collections API for creating and deploying instances/etc.. we control this via an external mechanism, so this would allow instances to figure out what they should be based on an external system.. we do this now but have to drop core.properties files all over.. I'd like to not have to do that... it's more a desire for cleanliness of my filesystem than anything else, because this is all automated at this point.. On Wed, Jan 15, 2014 at 1:49 PM, Mark Miller markrmil...@gmail.com wrote: What's the benefit? So you can avoid having a simple core properties file? I'd rather see more value than that before exposing something like this to the user. It's a can of worms that I personally have not seen a lot of value in yet. Whether we mark it experimental or not, this adds a burden, and I'm still wondering if the gains are worth it. - Mark On Jan 15, 2014, at 12:04 PM, Alan Woodward a...@flax.co.uk wrote: This is true. But if we slap big warning: experimental messages all over it, then users can't complain too much about backwards-compat breaks. My intention when pulling all this stuff into the CoresLocator interface was to allow other implementations to be tested out, and other suggestions have already come up from time to time on the list. It seems a shame to *not* allow this to be opened up for advanced users. Alan Woodward www.flax.co.uk On 15 Jan 2014, at 16:24, Mark Miller wrote: I think these APIs are pretty new and deep to want to support them for users at this point. It constrains refactoring and can complicate things down the line, especially with SolrCloud. This same discussion has come up in JIRA issues before. At best, I think all the recent refactoring in this area needs to bake. - Mark On Jan 15, 2014, at 11:01 AM, Alan Woodward a...@flax.co.uk wrote: I think solr.xml is the correct place for it, and you can then set up substitution variables to allow it to be set by environment variables, etc. But let's discuss on the JIRA ticket. Alan Woodward www.flax.co.uk On 15 Jan 2014, at 15:39, Steven Bower wrote: I will open up a JIRA... I'm more concerned over the core locator stuff vs the solr.xml.. Should the specification of the core locator go into the solr.xml or via some other method? steve On Tue, Jan 14, 2014 at 5:06 PM, Alan Woodward a...@flax.co.uk wrote: Hi Steve, I think this is a great idea. Currently the implementation of CoresLocator is picked depending on the type of solr.xml you have (new- vs old-style), but it should be easy enough to extend the new-style logic to optionally look up and instantiate a plugin implementation. Core loading and new core creation is all done through the CL now, so as long as the plugin implemented all methods, it shouldn't break the Collections API either. Do you want to open a JIRA? Alan Woodward www.flax.co.uk On 14 Jan 2014, at 19:20, Erick Erickson wrote: The work done as part of new style solr.xml, particularly by romseygeek, should make this a lot easier. But no, there's no formal support for such a thing. There's also a desire to make ZK the one source of truth in Solr 5, although that effort is in early stages. Which is a long way of saying that I think this would be a good thing to add. Currently there's no formal way to specify one though. We'd have to give some thought as to what abstract methods are required. The current old style and new style classes . There's also the chicken-and-egg question; how does one specify the new class?
This seems like something that would be in a (very small) solr.xml or specified as a sysprop. And knowing where to load the class from could be interesting. A pluggable SolrConfig I think is a stickier wicket, it hasn't been broken out into nice interfaces like CoresLocator has been. And it's used all over the place, passed in and recorded in constructors etc, as well as being possibly unique for each core. There's been some talk of sharing a single config object, and there's also talk about using config sets that might address some of those concerns, but neither one has gotten very far in 4x land. FWIW, Erick On Tue, Jan 14, 2014 at 1:41 PM, Steven Bower smb-apa...@alcyon.net wrote: Are there any plans/tickets to allow for pluggable SolrConf and CoreLocator? In my use case my solr.xml is totally static, I have a separate dataDir and my core.properties are derived from a separate configuration (living in ZK) but totally outside of the SolrCloud.. I'd like to be able to not have any instance directories and/or no solr.xml or core.properties files lying around, as right now I just regenerate them on startup each time in my start scripts.. Obviously I can just hack my stuff in and clearly this could break the write side of the collections API (which I don't
Re: core.properties and solr.xml
I will open up a JIRA... I'm more concerned over the core locator stuff vs the solr.xml.. Should the specification of the core locator go into the solr.xml or via some other method? steve On Tue, Jan 14, 2014 at 5:06 PM, Alan Woodward a...@flax.co.uk wrote: Hi Steve, I think this is a great idea. Currently the implementation of CoresLocator is picked depending on the type of solr.xml you have (new- vs old-style), but it should be easy enough to extend the new-style logic to optionally look up and instantiate a plugin implementation. Core loading and new core creation is all done through the CL now, so as long as the plugin implemented all methods, it shouldn't break the Collections API either. Do you want to open a JIRA? Alan Woodward www.flax.co.uk On 14 Jan 2014, at 19:20, Erick Erickson wrote: The work done as part of new style solr.xml, particularly by romseygeek, should make this a lot easier. But no, there's no formal support for such a thing. There's also a desire to make ZK the one source of truth in Solr 5, although that effort is in early stages. Which is a long way of saying that I think this would be a good thing to add. Currently there's no formal way to specify one though. We'd have to give some thought as to what abstract methods are required. The current old style and new style classes . There's also the chicken-and-egg question; how does one specify the new class? This seems like something that would be in a (very small) solr.xml or specified as a sysprop. And knowing where to load the class from could be interesting. A pluggable SolrConfig I think is a stickier wicket, it hasn't been broken out into nice interfaces like CoresLocator has been. And it's used all over the place, passed in and recorded in constructors etc, as well as being possibly unique for each core. There's been some talk of sharing a single config object, and there's also talk about using config sets that might address some of those concerns, but neither one has gotten very far in 4x land. FWIW, Erick On Tue, Jan 14, 2014 at 1:41 PM, Steven Bower smb-apa...@alcyon.net wrote: Are there any plans/tickets to allow for pluggable SolrConf and CoreLocator? In my use case my solr.xml is totally static, I have a separate dataDir and my core.properties are derived from a separate configuration (living in ZK) but totally outside of the SolrCloud.. I'd like to be able to not have any instance directories and/or no solr.xml or core.properties files lying around, as right now I just regenerate them on startup each time in my start scripts.. Obviously I can just hack my stuff in and clearly this could break the write side of the collections API (which I don't care about for my case)... but having a way to plug these would be nice.. steve
core.properties and solr.xml
Are there any plans/tickets to allow for pluggable SolrConf and CoreLocator? In my use case my solr.xml is totally static, I have a separate dataDir and my core.properties are derived from a separate configuration (living in ZK) but totally outside of the SolrCloud.. I'd like to be able to not have any instance directories and/or no solr.xml or core.properties files lying around, as right now I just regenerate them on startup each time in my start scripts.. Obviously I can just hack my stuff in and clearly this could break the write side of the collections API (which I don't care about for my case)... but having a way to plug these would be nice.. steve
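A rough sketch of what such a plugin might look like against the 4.x CoresLocator interface; the method set here is recalled from that API and the ZK-backed lookup is hypothetical, so verify both against your Solr version:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.CoresLocator;

// Read core definitions from an external system instead of scanning
// the disk for core.properties files.
public class ExternalCoresLocator implements CoresLocator {

  @Override
  public List<CoreDescriptor> discover(CoreContainer cc) {
    List<CoreDescriptor> cores = new ArrayList<CoreDescriptor>();
    // hypothetical: pull name + instanceDir from your external config store
    cores.add(new CoreDescriptor(cc, "core1", "/data/solr/core1"));
    return cores;
  }

  // Write-side operations are no-ops: the external store is the source of
  // truth, which is exactly the Collections API trade-off described above.
  @Override public void create(CoreContainer cc, CoreDescriptor... cds) {}
  @Override public void persist(CoreContainer cc, CoreDescriptor... cds) {}
  @Override public void delete(CoreContainer cc, CoreDescriptor... cds) {}
  @Override public void rename(CoreContainer cc, CoreDescriptor oldCD, CoreDescriptor newCD) {}
  @Override public void swap(CoreContainer cc, CoreDescriptor cd1, CoreDescriptor cd2) {}
}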
Index Sizes
I was looking at the code for getIndexSize() on the ReplicationHandler to get at the size of the index on disk. From what I can tell, because this does directory.listAll() to get all the files in the directory, the size on disk includes not only what is searchable at the moment but potentially also files that are being created by background merges/etc.. I am wondering if there is an API that would give me the size of the currently searchable index files (doubt this exists, but maybe).. If not, what is the most appropriate way to get a list of the segments/files that are currently in use by the active searcher, such that I could then ask the directory implementation for the size of all those files? For a more complete picture of what I'm trying to accomplish: I am looking at building a quota/monitoring component that will trigger when index size on disk gets above a certain size. I don't want it to trigger if the index is doing a merge and ephemerally using disk for that process. If anyone has any suggestions/recommendations here I'd be interested.. Thanks, steve
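One way to approach this is to size only the files referenced by the current searcher's index commit; a sketch of a helper method against the Solr 4.x API:

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

// Sum the sizes of only the files belonging to the searcher's commit
// point, ignoring transient files written by in-flight merges.
public static long searchableIndexSize(SolrCore core) throws IOException {
  RefCounted<SolrIndexSearcher> ref = core.getSearcher();
  try {
    DirectoryReader reader = ref.get().getIndexReader();
    Directory dir = reader.directory();
    long total = 0;
    for (String file : reader.getIndexCommit().getFileNames()) {
      total += dir.fileLength(file);
    }
    return total;
  } finally {
    ref.decref(); // always release the searcher reference
  }
}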
SolrCoreAware
Under what circumstances will a handler that implements SolrCoreAware have its inform() method called? thanks, steve
Re: SolrCoreAware
So it's something that can happen multiple times during the lifetime of a process, but I'm guessing not something occurring very often? Also, is there a way to hook the shutdown of the core? steve On Fri, Nov 15, 2013 at 12:08 PM, Alan Woodward a...@flax.co.uk wrote: Hi Steven, It's called when the handler is created, either at SolrCore construction time (solr startup or core reload) or the first time the handler is requested if it's a lazy-loading handler. Alan Woodward www.flax.co.uk On 15 Nov 2013, at 15:40, Steven Bower wrote: Under what circumstances will a handler that implements SolrCoreAware have its inform() method called? thanks, steve
Re: SolrCoreAware
it should be called only once during the lifetime of a given plugin, usually not long after construction -- but it could be called many, many times in the lifetime of the solr process. So for a given instance of a handler it will only be called once during the lifetime of that handler? Also, when the core is passed in as part of inform() is it guaranteed to be ready to go? (i.e. I can start feeding content at this point?) thanks, steve On Fri, Nov 15, 2013 at 12:52 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : So its something that can happen multiple times during the lifetime of : process, but i'm guessing something not occuring very often? it should be called only once during the lifetime of a given plugin, usually not long after construction -- but it could be called many, many times in the lifetime of the solr process. : Also is there a way to hook the shutdown of the core? any object (SolrCoreAware or otherwise) can ask the SolrCore to add a CloseHook at anytime... https://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/core/SolrCore.html#addCloseHook%28org.apache.solr.core.CloseHook%29 -Hoss
Re: SolrCoreAware
And the close hook will basically only be fired once during shutdown? On Fri, Nov 15, 2013 at 1:07 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : So for a given instance of a handler it will only be called once during the : lifetime of that handler? correct (unless there is a bug somewhere) : Also, when the core is passed in as part of inform() is it guaranteed to be : ready to go? (ie I can start feeding content at this point?) Right, that's the point of the interface: way back in the day we had people writing plugins that were trying to use SolrCore from their init() methods and the SolrCore wasn't fully initialized yet (didn't have DirectUpdateHandler yet, didn't have all of the RequestHandlers initialized, didn't have an openSearcher, etc...) the inform(SolrCore) method is called after the SolrCore is initialized, and all plugins hanging off of it have been init()ed ... you can still get into trouble if you write FooA.inform(SolrCore) such that it asks the SolrCore for a pointer to some FooB plugin and expect that FooB's inform(SolrCore) method has already been called -- because there is no guaranteed order -- but the basic functionality and basic plugin initialization has all been done at that point. -Hoss
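Putting the two answers together, a minimal sketch of a handler wiring up a CloseHook from inform(), written against the 4.x API (class and hook bodies are hypothetical):

import org.apache.solr.core.CloseHook;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.util.plugin.SolrCoreAware;

// inform() runs once per handler instance, after the core is fully
// initialized, so it is a safe place to register shutdown hooks.
public class MyHandler extends RequestHandlerBase implements SolrCoreAware {

  @Override
  public void inform(SolrCore core) {
    core.addCloseHook(new CloseHook() {
      @Override public void preClose(SolrCore c) { /* stop feeding, flush state */ }
      @Override public void postClose(SolrCore c) { /* release external resources */ }
    });
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {}

  @Override public String getDescription() { return "example handler"; }
  @Override public String getSource() { return ""; }
}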
Re: How to set a condition over stats result
Check out: https://issues.apache.org/jira/browse/SOLR-5302 -- it can do this using query facets On Fri, Jul 12, 2013 at 11:35 AM, Jack Krupansky j...@basetechnology.com wrote: sum(x, y, z) = x + y + z (sums those specific field values for the current document) sum(x, y) = x + y (sum of those two specific field values for the current document) sum(x) = field(x) = x (the specific field value for the current document) The sum function in function queries is not an aggregate function. Ditto for min and max. -- Jack Krupansky -Original Message- From: mihaela olteanu Sent: Friday, July 12, 2013 1:44 AM To: solr-user@lucene.apache.org Subject: Re: How to set a condition over stats result What if you perform sub(sum(myfieldvalue),100) > 0 using frange? From: Jack Krupansky j...@basetechnology.com To: solr-user@lucene.apache.org Sent: Friday, July 12, 2013 7:44 AM Subject: Re: How to set a condition over stats result None that I know of, short of writing a custom search component. Seriously, you could hack up a copy of the stats component with your own logic. Actually... this may be a case for the new, proposed Script Request Handler, which would let you execute a query and then you could do any custom JavaScript logic you wanted. When we get that feature, it might be interesting to implement a variation of the standard stats component as a JavaScript script, and then people could easily hack it such as in your request. Fascinating. -- Jack Krupansky -Original Message- From: Matt Lieber Sent: Thursday, July 11, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: How to set a condition over stats result Hello, I am trying to see how I can test the sum of values of an attribute across docs. I.e. whether sum(myfieldvalue) > 100. I know I can use the stats module which compiles the sum of my attributes on a certain facet, but how can I perform a test on this result (i.e. is sum > 100) within my stats query? From what I read, it's not supported yet to perform a function on the stats module.. Any other way to do this? Cheers, Matt
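Spelled out, the per-document frange form mihaela suggests would look something like this (standard function-range query parser syntax; myfieldvalue is the field from the thread). Note it matches documents whose own value exceeds 100 -- it is not an aggregate over the result set, which is Jack's point:

    fq={!frange l=0 incl=false}sub(myfieldvalue,100)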
Re: StatsComponent with median
Check out: https://issues.apache.org/jira/browse/SOLR-5302 -- it supports median values On Wed, Jul 3, 2013 at 12:11 PM, William Bell billnb...@gmail.com wrote: If you are a programmer, you can modify it and attach a patch in Jira... On Tue, Jun 4, 2013 at 4:25 AM, Marcin Rzewucki mrzewu...@gmail.com wrote: Hi there, StatsComponent currently does not have median on the list of results. Is there a plan to add it in the next release(s)? Shall I add a ticket in Jira for this? Regards. -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: bucket count for facets
Understood, what I need is a count of the unique values in a field and that field is multi-valued (which makes stats component a non-option) On Fri, Sep 6, 2013 at 4:22 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Stats Component can give you a count of non-null values in a field. See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower smb-apa...@alcyon.net wrote: Is there a way to get the count of buckets (ie unique values) for a field facet? the rudimentary approach of course is to get back all buckets, but in some cases this is a huge amount of data. thanks, steve -- Regards, Shalin Shekhar Mangar.
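For reference, the rudimentary approach spelled out (standard facet parameters; myfield is a placeholder): request every bucket with no limit and count the entries client-side, which is exactly the huge-response problem for high-cardinality fields.

    q=*:*&rows=0&facet=true&facet.field=myfield&facet.limit=-1&facet.mincount=1

The unique-value count is then the number of entries under facet_counts/facet_fields/myfield in the response.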
bucket count for facets
Is there a way to get the count of buckets (ie unique values) for a field facet? the rudimentary approach of course is to get back all buckets, but in some cases this is a huge amount of data. thanks, steve
AND not working
I have a query like: q=foo AND bar defType=edismax qf=field1 qf=field2 qf=field3 with debug on I see it parsing to this: (+(DisjunctionMaxQuery((field1:foo | field2:foo | field3:foo)) DisjunctionMaxQuery((field1:and | field2:and | field3:and)) DisjunctionMaxQuery((field1:bar | field2:bar | field3:bar/no_coord basically it seems to be treating the AND as a term... any thoughts? thx, steve
Re: AND not working
@Yonik that was exactly the issue... I'll file a ticket... there def should be an exception thrown for something like this.. It would seem to me that eating any sort of exception is a really bad thing... steve On Thu, Aug 15, 2013 at 5:59 PM, Yonik Seeley yo...@lucidworks.com wrote: I can reproduce something like this by specifying a field that doesn't exist for a qf param. This seems like a bug... if a field doesn't exist, we should throw an exception (since it's a parameter error not related to the q string where we avoid throwing any errors). -Yonik http://lucidworks.com On Thu, Aug 15, 2013 at 5:19 PM, Steven Bower smb-apa...@alcyon.net wrote: I have query like: q=foo AND bar defType=edismax qf=field1 qf=field2 qf=field3 with debug on I see it parsing to this: (+(DisjunctionMaxQuery((field1:foo | field2:foo | field3:foo)) DisjunctionMaxQuery((field1:and | field2:and | field3:and)) DisjunctionMaxQuery((field1:bar | field2:bar | field3:bar/no_coord basically it seems to be treating the AND as a term... any thoughts? thx, steve
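For comparison, with every qf field actually present in the schema the same query parses with AND as a boolean operator rather than a term -- roughly like this (hand-reconstructed, so treat the exact rendering as approximate):

    (+(+DisjunctionMaxQuery((field1:foo | field2:foo | field3:foo)) +DisjunctionMaxQuery((field1:bar | field2:bar | field3:bar))))/no_coord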
Re: AND not working
https://issues.apache.org/jira/browse/SOLR-5163 On Thu, Aug 15, 2013 at 6:04 PM, Steven Bower smb-apa...@alcyon.net wrote: @Yonik that was exactly the issue... I'll file a ticket... there def should be an exception thrown for something like this.. It would seem to me that eating any sort of exception is a really bad thing... steve On Thu, Aug 15, 2013 at 5:59 PM, Yonik Seeley yo...@lucidworks.comwrote: I can reproduce something like this by specifying a field that doesn't exist for a qf param. This seems like a bug... if a field doesn't exist, we should throw an exception (since it's a parameter error not related to the q string where we avoid throwing any errors). -Yonik http://lucidworks.com On Thu, Aug 15, 2013 at 5:19 PM, Steven Bower smb-apa...@alcyon.net wrote: I have query like: q=foo AND bar defType=edismax qf=field1 qf=field2 qf=field3 with debug on I see it parsing to this: (+(DisjunctionMaxQuery((field1:foo | field2:foo | field3:foo)) DisjunctionMaxQuery((field1:and | field2:and | field3:and)) DisjunctionMaxQuery((field1:bar | field2:bar | field3:bar/no_coord basically it seems to be treating the AND as a term... any thoughts? thx, steve
Schema Lint
Is there an easy way, in code or from the command line, to lint a Solr config (or even just a Solr schema)? Steve
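There's no dedicated lint tool that I know of, but one approximation (a sketch assuming Solr 4.x APIs; the exact CoreContainer bootstrap varies a bit across 4.x releases) is to load the config in an embedded CoreContainer and let parse failures surface:

    import org.apache.solr.core.CoreContainer;

    public class ConfigLint {
        public static void main(String[] args) {
            // args[0]: path to a solr home containing the config under test
            CoreContainer container = new CoreContainer(args[0]);
            try {
                // A broken solrconfig.xml/schema.xml typically shows up here as
                // an exception or a recorded core-initialization failure.
                container.load();
                System.out.println("config loaded OK");
            } finally {
                container.shutdown();
            }
        }
    }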
Re: Performance question on Spatial Search
So after re-feeding our data with a new boolean field that is true when data exists and false when it doesn't, our search times have gone from an avg of about 20s to around 150ms... pretty amazing change in perf... It seems like https://issues.apache.org/jira/browse/SOLR-5093 might alleviate many people's pain in doing this kind of query (if I have some time I may take a look at it).. Anyway we are in pretty good shape at this point.. the only remaining issue is that the first queries after commits are taking 5-6s... This is caused by the loading of 2 FieldCaches (one long and one int, uninverted) that are used for sorting.. I suspect that docvalues will greatly help this load performance? thanks, steve On Wed, Jul 31, 2013 at 4:32 PM, Steven Bower smb-apa...@alcyon.net wrote: the list of IDs does change relatively frequently, but this doesn't seem to have very much impact on the performance of the query as far as I can tell. attached are the stacks thanks, steve On Wed, Jul 31, 2013 at 6:33 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Wed, Jul 31, 2013 at 1:10 AM, Steven Bower sbo...@alcyon.net wrote: not sure what you mean by good hit raitio? I mean such queries are really expensive (even on cache hit), so if the list of ids changes every time, it never hit cache and hence executes these heavy queries every time. It's well known performance problem. Here are the stacks... they seems like hotspots, and shows index reading that's reasonable. But I can't see what caused these readings, to get that I need whole stack of hot thread. Name Time (ms) Own Time (ms) org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext, Bits) 300879 203478 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc() 45539 19 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() 45519 40 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput, int[], int[], int, boolean) 24352 0 org.apache.lucene.store.DataInput.readVInt() 24352 24352 org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[], int[]) 21126 14976 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 6150 0 java.nio.DirectByteBuffer.get(byte[], int, int) 6150 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int) 35342 421 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() 34920 27939 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState) 6980 6980 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next() 14129 1053 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock() 5948 261 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() 5686 199 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 3606 0 java.nio.DirectByteBuffer.get(byte[], int, int) 3606 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState) 1879 80 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1798 0 java.nio.DirectByteBuffer.get(byte[], int, int) 1798 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1798 1798 
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next() 4010 3324 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf() 685 685 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() 3117 144 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState) 1090 19 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1070 0 java.nio.DirectByteBuffer.get(byte[], int, int) 1070 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput() 20 0org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0 org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0 org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 20 0 org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0
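For reference, the exists-flag change behind the numbers above, sketched out (pp is the field from the thread; has_pp is a hypothetical name): add a boolean field to the schema, set it to true at index time whenever pp is populated, and filter on it instead of scanning every term with +pp:*.

    <field name="has_pp" type="boolean" indexed="true" stored="false"/>

Then the +pp:* clause becomes a cheap fq=has_pp:true filter.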
Re: Performance question on Spatial Search
the list of IDs does change relatively frequently, but this doesn't seem to have very much impact on the performance of the query as far as I can tell. attached are the stacks thanks, steve On Wed, Jul 31, 2013 at 6:33 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Wed, Jul 31, 2013 at 1:10 AM, Steven Bower sbo...@alcyon.net wrote: not sure what you mean by good hit raitio? I mean such queries are really expensive (even on cache hit), so if the list of ids changes every time, it never hit cache and hence executes these heavy queries every time. It's well known performance problem. Here are the stacks... they seems like hotspots, and shows index reading that's reasonable. But I can't see what caused these readings, to get that I need whole stack of hot thread. Name Time (ms) Own Time (ms) org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext, Bits) 300879 203478 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc() 45539 19 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() 45519 40 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput, int[], int[], int, boolean) 24352 0 org.apache.lucene.store.DataInput.readVInt() 24352 24352 org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[], int[]) 21126 14976 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 6150 0 java.nio.DirectByteBuffer.get(byte[], int, int) 6150 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int) 35342 421 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() 34920 27939 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState) 6980 6980 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next() 14129 1053 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock() 5948 261 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() 5686 199 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 3606 0 java.nio.DirectByteBuffer.get(byte[], int, int) 3606 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState) 1879 80 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1798 0java.nio.DirectByteBuffer.get(byte[], int, int) 1798 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1798 1798 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next() 4010 3324 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf() 685 685 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() 3117 144 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState) 1090 19 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1070 0 java.nio.DirectByteBuffer.get(byte[], int, int) 1070 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070 
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput() 20 0org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0 org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0 org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 20 0 org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0 org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object, ReferenceQueue) 20 0 java.lang.System.identityHashCode(Object) 20 20 org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int) 1485 527 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int) 957 0 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() 957 513 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState) 443 443 org.apache.lucene.index.FilteredTermsEnum.next() 874 324 org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef) 368 0 org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object, Object) 368
Re: Performance question on Spatial Search
Until I get the data re-fed: there was another field (a date field) that was present and absent exactly when the geo field was/was not... I tried that field:* and query times come down to 2.5s .. also just removing that filter brings the query down to 30ms.. so I'm very hopeful that with just a boolean I'll be down in that sub-100ms range.. steve On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower sbo...@alcyon.net wrote: Will give the boolean thing a shot... makes sense... On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. dsmi...@mitre.org wrote: I see the problem -- it's +pp:*. It may look innocent but it's a performance killer. What you're telling Lucene to do is iterate over *every* term in this index to find all documents that have this data. Most fields are pretty slow to do that. Lucene/Solr does not have some kind of cache for this. Instead, you should index a new boolean field indicating whether or not 'pp' is populated and then do a simple true check against that field. Another approach you could do right now without reindexing is to simplify the last 2 clauses of your 3-clause boolean query by using the IsDisjointTo predicate. But unfortunately Lucene doesn't have a generic filter cache capability and so this predicate has no place to cache the whole-world query it does internally (each and every time it's used), so it will be slower than the boolean field I suggested you add. Nevermind on LatLonType; it doesn't support JTS/Polygons. There is something close called SpatialPointVectorFieldType that could be modified trivially but it doesn't support it now. ~ David On 7/30/13 11:32 AM, Steven Bower sbo...@alcyon.net wrote: #1 Here is my query: sort=vid asc start=0 rows=1000 defType=edismax q=*:* fq=recordType:xxx fq=vt:X12B AND fq=(cls:3 OR cls:8) fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z] fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48 OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33 OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44 OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87 OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94 OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67 OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13 OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99 OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41 OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98 OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28 OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10 OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58 OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56 OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10 OR vid:96XXX54 ) fq=gp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0))) AND NOT pp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0))) AND +pp:* Basically looking for a set of records by vid then if its gp is in one polygon and is pp is not in another (and it has a pp)... essentially looking to see if a record moved between two polygons (gp=current, pp=prev) during a time period. 
#2 Yes on JTS (unless from my query above I don't) however this is only an initial use case and I suspect we'll need more complex stuff in the future #3 The data is distributed globally but along generally fixed paths and then clustering around certain areas... for example the polygon above has about 11k points (with no date filtering). So basically some areas will be very dense and most areas not, the majority of searches will be around the dense areas #4 Its very likely to be less than 1M results (with filters) .. is there any functinoality loss with LatLonType fields? Thanks, steve On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Steve, (1) Can you give a specific example of how your are specifying the spatial query? I'm looking to ensure you are not using IsWithin, which is not meant for point data. If your query shape is a circle or the bounding box of a circle, you should use the geofilt query parser, otherwise use the quirky syntax that allows you to specify the spatial predicate with Intersects. (2) Do you actually need JTS? i.e. are you using Polygons, etc. (3) How dense would you estimate the data is at the 50m resolution you've configured the data? If It's very dense then I'll tell you how to raise the prefix grid scan level to a # closer to max-levels. (4) Do all of your searches find less than a million points, considering all filters? If so then it's worth comparing
Re: Performance question on Spatial Search
I am curious why the field:* walks the entire terms list.. could this be discovered from a field cache / docvalues? steve On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower sbo...@alcyon.net wrote: Until I get the data re-fed: there was another field (a date field) that was present and absent exactly when the geo field was/was not... I tried that field:* and query times come down to 2.5s .. also just removing that filter brings the query down to 30ms.. so I'm very hopeful that with just a boolean I'll be down in that sub-100ms range.. steve On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower sbo...@alcyon.net wrote: Will give the boolean thing a shot... makes sense... On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. dsmi...@mitre.org wrote: I see the problem -- it's +pp:*. It may look innocent but it's a performance killer. What you're telling Lucene to do is iterate over *every* term in this index to find all documents that have this data. Most fields are pretty slow to do that. Lucene/Solr does not have some kind of cache for this. Instead, you should index a new boolean field indicating whether or not 'pp' is populated and then do a simple true check against that field. Another approach you could do right now without reindexing is to simplify the last 2 clauses of your 3-clause boolean query by using the IsDisjointTo predicate. But unfortunately Lucene doesn't have a generic filter cache capability and so this predicate has no place to cache the whole-world query it does internally (each and every time it's used), so it will be slower than the boolean field I suggested you add. Nevermind on LatLonType; it doesn't support JTS/Polygons. There is something close called SpatialPointVectorFieldType that could be modified trivially but it doesn't support it now. ~ David On 7/30/13 11:32 AM, Steven Bower sbo...@alcyon.net wrote: #1 Here is my query: sort=vid asc start=0 rows=1000 defType=edismax q=*:* fq=recordType:xxx fq=vt:X12B AND fq=(cls:3 OR cls:8) fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z] fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48 OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33 OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44 OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87 OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94 OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67 OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13 OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99 OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41 OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98 OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28 OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10 OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58 OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56 OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10 OR vid:96XXX54 ) fq=gp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0))) AND NOT pp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0))) AND +pp:* Basically looking for a set of records by vid then if its gp is in one polygon and is pp is not in another (and it has a pp)... 
essentially looking to see if a record moved between two polygons (gp=current, pp=prev) during a time period. #2 Yes on JTS (unless from my query above I don't) however this is only an initial use case and I suspect we'll need more complex stuff in the future #3 The data is distributed globally but along generally fixed paths and then clustering around certain areas... for example the polygon above has about 11k points (with no date filtering). So basically some areas will be very dense and most areas not, the majority of searches will be around the dense areas #4 Its very likely to be less than 1M results (with filters) .. is there any functinoality loss with LatLonType fields? Thanks, steve On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Steve, (1) Can you give a specific example of how your are specifying the spatial query? I'm looking to ensure you are not using IsWithin, which is not meant for point data. If your query shape is a circle or the bounding box of a circle, you should use the geofilt query parser, otherwise use the quirky syntax that allows you to specify the spatial predicate with Intersects. (2) Do you actually need JTS? i.e. are you using Polygons, etc. (3) How dense would you estimate the data is at the 50m resolution you've configured the data? If It's very dense then I'll tell
Re: Performance question on Spatial Search
org.apache.lucene.store.ByteBufferIndexInput.clone() 19 0 org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 19 0 org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 19 0 org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object, ReferenceQueue) 19 0 java.lang.System.identityHashCode(Object) 19 19 org.apache.lucene.util.FixedBitSet.init(int) 28 28 On Tue, Jul 30, 2013 at 4:18 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Tue, Jul 30, 2013 at 12:45 AM, Steven Bower smb-apa...@alcyon.net wrote: - Most of my time (98%) is being spent in java.nio.Bits.copyToByteArray(long,Object,long,long) which is being Steven, please see http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html. My benchmarking experience shows that NIO is a turtle, absolutely. also, are you sure that fq=(vid:86XXX73 OR vid:86XXX20 . has good hit ratio? otherwise it's a well known beast. could you also show deeper stack, to make sure what causes to excessive reading? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Performance question on Spatial Search
@David I will certainly update when we get the data refed... and if you have things you'd like to investigate or try out please let me know.. I'm happy to eval things at scale here... we will be taking this index from its current 45m records to 6-700m over the next few months as well.. steve On Tue, Jul 30, 2013 at 5:10 PM, Steven Bower sbo...@alcyon.net wrote: Very good read... Already using MMap... verified using pmap and vsz from top.. not sure what you mean by good hit raitio? Here are the stacks... Name Time (ms) Own Time (ms) org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext, Bits) 300879 203478 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc() 45539 19 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() 45519 40 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput, int[], int[], int, boolean) 24352 0 org.apache.lucene.store.DataInput.readVInt() 24352 24352 org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[], int[]) 21126 14976 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 6150 0 java.nio.DirectByteBuffer.get(byte[], int, int) 6150 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int) 35342 421 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() 34920 27939 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState) 6980 6980 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next() 14129 1053 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock() 5948 261 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() 5686 199 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 3606 0 java.nio.DirectByteBuffer.get(byte[], int, int) 3606 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState) 1879 80 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1798 0java.nio.DirectByteBuffer.get(byte[], int, int) 1798 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1798 1798 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next() 4010 3324 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf() 685 685 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() 3117 144 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput, FieldInfo, BlockTermState) 1090 19 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 1070 0 java.nio.DirectByteBuffer.get(byte[], int, int) 1070 0 java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput() 20 0org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0 org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0 org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, 
long) 20 0 org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0 org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object, ReferenceQueue) 20 0 java.lang.System.identityHashCode(Object) 20 20 org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int) 1485 527 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits, DocsEnum, int) 957 0 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() 957 513 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo, BlockTermState) 443 443 org.apache.lucene.index.FilteredTermsEnum.next() 874 324 org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef) 368 0 org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object, Object) 368 368 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next() 160 0 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock() 160 0 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() 160 0 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 120 0 org.apache.lucene.codecs.lucene41
Re: Transaction Logs Leaking FileDescriptors
Looking at the timestamps on the tlog files they seem to have all been created around the same time (04:55).. starting around this time I start seeing the exception below (there were 1628).. in fact it's getting tons of these (200k+) but most of the time inside regular commits... 2013-15-05 04:55:06.634 ERROR UpdateLog [recoveryExecutor-6-thread-7922] - java.lang.ArrayIndexOutOfBoundsException: 2603 at org.apache.lucene.codecs.lucene40.BitVector.get(BitVector.java:146) at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc(Lucene41PostingsReader.java:492) at org.apache.lucene.index.BufferedDeletesStream.applyTermDeletes(BufferedDeletesStream.java:407) at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:273) at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2973) at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2964) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2704) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2839) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2819) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:536) at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1339) at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1163) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) On Thu, May 16, 2013 at 9:35 AM, Yonik Seeley yo...@lucidworks.com wrote: See https://issues.apache.org/jira/browse/SOLR-3939 Do you see these log messages from this in your logs? log.info("I may be the new leader - try and sync"); How reproducible is this bug for you? It would be great to know if the patch in the issue fixes things. -Yonik http://lucidworks.com On Wed, May 15, 2013 at 6:04 PM, Steven Bower sbo...@alcyon.net wrote: They are visible to ls... On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley yo...@lucidworks.com wrote: On Wed, May 15, 2013 at 5:20 PM, Steven Bower sbo...@alcyon.net wrote: when the TransactionLog objects are dereferenced their RandomAccessFile object is not closed.. Have the files been deleted (unlinked from the directory), or are they still visible via ls? -Yonik http://lucidworks.com
Re: Transaction Logs Leaking FileDescriptors
Created https://issues.apache.org/jira/browse/SOLR-4831 to capture this issue On Thu, May 16, 2013 at 10:10 AM, Steven Bower sbo...@alcyon.net wrote: Looking at the timestamps on the tlog files they seem to have all been created around the same time (04:55).. starting around this time I start seeing the exception below (there were 1628).. in fact its getting tons of these (200k+) but most of the time inside regular commits... 2013-15-05 04:55:06.634 ERROR UpdateLog [recoveryExecutor-6-thread-7922] - java.lang.ArrayIndexOutOfBoundsException: 2603 at org.apache.lucene.codecs.lucene40.BitVector.get(BitVector.java:146) at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc(Lucene41PostingsReader.java:492) at org.apache.lucene.index.BufferedDeletesStream.applyTermDeletes(BufferedDeletesStream.java:407) at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:273) at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2973) at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2964) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2704) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2839) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2819) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:536) at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1339) at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1163) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) On Thu, May 16, 2013 at 9:35 AM, Yonik Seeley yo...@lucidworks.comwrote: See https://issues.apache.org/jira/browse/SOLR-3939 Do you see these log messages from this in your logs? log.info(I may be the new leader - try and sync); How reproducible is this bug for you? It would be great to know if the patch in the issue fixes things. -Yonik http://lucidworks.com On Wed, May 15, 2013 at 6:04 PM, Steven Bower sbo...@alcyon.net wrote: They are visible to ls... On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley yo...@lucidworks.com wrote: On Wed, May 15, 2013 at 5:20 PM, Steven Bower sbo...@alcyon.net wrote: when the TransactionLog objects are dereferenced their RandomAccessFile object is not closed.. Have the files been deleted (unlinked from the directory), or are they still visible via ls? -Yonik http://lucidworks.com
Transaction Logs Leaking FileDescriptors
We have a system in which a client is sending 1 record at a time (via REST) followed by a commit. This has produced ~65k tlog files and the JVM has run out of file descriptors... I grabbed a heap dump from the JVM and I can see ~52k unreachable FileDescriptors... This leads me to believe that the TransactionLog is not properly closing all of its files before getting rid of the object... I've verified with lsof that indeed there are ~60k tlog files that are open currently.. This is Solr 4.3.0 Thanks, steve
Re: Transaction Logs Leaking FileDescriptors
Most definitely understand the don't commit after each record... unfortunately the data is being fed by another team which I cannot control... Limiting the number of potential tlog files is good but I think there is also an issue in that when the TransactionLog objects are dereferenced their RandomAccessFile object is not closed.. thus delaying release of the descriptor until the object is GC'd... I'm hunting through the UpdateHandler code to try and find where this happens now.. steve On Wed, May 15, 2013 at 5:13 PM, Yonik Seeley yo...@lucidworks.com wrote: Hmmm, we keep open a number of tlog files based on the number of records in each file (so we always have a certain amount of history), but IIRC, the number of tlog files is also capped. Perhaps there is a bug when the limit to tlog files is reached (as opposed to the number of documents in the tlog files). I'll see if I can create a test case to reproduce this. Separately, you'll get a lot better performance if you don't commit per update of course (or at least use something like commitWithin). -Yonik http://lucidworks.com On Wed, May 15, 2013 at 5:06 PM, Steven Bower sbo...@alcyon.net wrote: We have a system in which a client is sending 1 record at a time (via REST) followed by a commit. This has produced ~65k tlog files and the JVM has run out of file descriptors... I grabbed a heap dump from the JVM and I can see ~52k unreachable FileDescriptors... This leads me to believe that the TransactionLog is not properly closing all of its files before getting rid of the object... I've verified with lsof that indeed there are ~60k tlog files that are open currently.. This is Solr 4.3.0 Thanks, steve
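A sketch of the commitWithin alternative Yonik mentions, from the feeding side (SolrJ 4.x; the URL and the 10s window are placeholders) -- each add carries a commit deadline, so the client never issues explicit commits:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinFeed {
        public static void main(String[] args) throws IOException, SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            // commit within 10s: no per-record commit, far fewer tlog rollovers
            server.add(doc, 10000);
            server.shutdown();
        }
    }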
Re: Transaction Logs Leaking FileDescriptors
There seem to be quite a few places where the RecentUpdates class is used but is not properly created/closed throughout the code... For example in RecoveryStrategy it does this correctly: UpdateLog.RecentUpdates recentUpdates = null; try { recentUpdates = ulog.getRecentUpdates(); recentVersions = recentUpdates.getVersions(ulog.numRecordsToKeep); } catch (Throwable t) { SolrException.log(log, "Corrupt tlog - ignoring. core=" + coreName, t); recentVersions = new ArrayList<Long>(0); } finally { if (recentUpdates != null) { recentUpdates.close(); } } But in a number of other places it's used more like this: UpdateLog.RecentUpdates recentUpdates = ulog.getRecentUpdates(); try { ... some code ... } finally { recentUpdates.close(); } The problem it would seem is that UpdateLog.getRecentUpdates() can fail when it calls update() as it is doing IO on the log itself.. in that case you'll get orphaned references to the log... I'm not 100% sure this is my problem.. I'm scouring the logs to see if this codepath was triggered... steve On Wed, May 15, 2013 at 5:26 PM, Walter Underwood wun...@wunderwood.org wrote: Maybe we need a flag in the update handler to ignore commit requests. I just enabled a similar thing for our JVM, because something, somewhere was calling System.gc(). You can completely ignore explicit GC calls or you can turn them into requests for a concurrent GC. A similar setting for Solr might map commit requests to hard commit (default), soft commit, or none. wunder On May 15, 2013, at 2:20 PM, Steven Bower wrote: Most definitely understand the don't commit after each record... unfortunately the data is being fed by another team which I cannot control... Limiting the number of potential tlog files is good but I think there is also an issue in that when the TransactionLog objects are dereferenced their RandomAccessFile object is not closed.. thus delaying release of the descriptor until the object is GC'd... I'm hunting through the UpdateHandler code to try and find where this happens now.. steve On Wed, May 15, 2013 at 5:13 PM, Yonik Seeley yo...@lucidworks.com wrote: Hmmm, we keep open a number of tlog files based on the number of records in each file (so we always have a certain amount of history), but IIRC, the number of tlog files is also capped. Perhaps there is a bug when the limit to tlog files is reached (as opposed to the number of documents in the tlog files). I'll see if I can create a test case to reproduce this. Separately, you'll get a lot better performance if you don't commit per update of course (or at least use something like commitWithin). -Yonik http://lucidworks.com On Wed, May 15, 2013 at 5:06 PM, Steven Bower sbo...@alcyon.net wrote: We have a system in which a client is sending 1 record at a time (via REST) followed by a commit. This has produced ~65k tlog files and the JVM has run out of file descriptors... I grabbed a heap dump from the JVM and I can see ~52k unreachable FileDescriptors... This leads me to believe that the TransactionLog is not properly closing all of its files before getting rid of the object... I've verified with lsof that indeed there are ~60k tlog files that are open currently.. This is Solr 4.3.0 Thanks, steve
Re: Transaction Logs Leaking FileDescriptors
They are visible to ls... On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley yo...@lucidworks.com wrote: On Wed, May 15, 2013 at 5:20 PM, Steven Bower sbo...@alcyon.net wrote: when the TransactionLog objects are dereferenced their RandomAccessFile object is not closed.. Have the files been deleted (unlinked from the directory), or are they still visible via ls? -Yonik http://lucidworks.com
Re: Per Shard Replication Factor
This approach would work to satisfy the requirement, but I think it would generally be nice to have the ability to control this within a single collection (so you don't give up any functionality when querying across the collections, and to make the management of the system easier). Anyway I'll create a ticket and take a look at how this might work.. steve On Thu, May 9, 2013 at 8:23 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Could these just be different collections? Then sharding and replication is independent. And you can reduce replication factor as the index ages. Otis Solr ElasticSearch Support http://sematext.com/ On May 9, 2013 1:43 AM, Steven Bower smb-apa...@alcyon.net wrote: Is it currently possible to have per-shard replication factor? A bit of background on the use case... If you are hashing content to shards by a known factor (lets say date ranges, 12 shards, 1 per month) it might be the case that most of your search traffic would be directed to one particular shard (eg. the current month shard) and having increased query capacity in that shard would be useful... this could be extended to many use cases such as data hashed by organization, type, etc. Thanks, steve
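For reference, a sketch of the separate-collections workaround Otis describes, using the 4.x Collections API (collection names, shard counts, and replication factors here are illustrative): give each time slice its own replicationFactor, then fan queries out across slices with the collection parameter.

    # hot (current month) slice: extra replicas for query capacity
    http://host:8983/solr/admin/collections?action=CREATE&name=logs_2013_05&numShards=2&replicationFactor=3
    # cold (older month) slice: fewer replicas
    http://host:8983/solr/admin/collections?action=CREATE&name=logs_2013_04&numShards=2&replicationFactor=1

    # query both slices at once
    http://host:8983/solr/logs_2013_05/select?q=*:*&collection=logs_2013_05,logs_2013_04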
Per Shard Replication Factor
Is it currently possible to have per-shard replication factor? A bit of background on the use case... If you are hashing content to shards by a known factor (lets say date ranges, 12 shards, 1 per month) it might be the case that most of your search traffic would be directed to one particular shard (eg. the current month shard) and having increased query capacity in that shard would be useful... this could be extended to many use cases such as data hashed by organization, type, etc. Thanks, steve