Re: Very long young generation stop the world GC pause

2016-12-21 Thread Steven Bower
Also curious why such a large heap is required... If it's due to field
caches being loaded, I'd highly recommend MMapDirectory (if you're not using
it already) and turning on DocValues for all fields you plan to
sort/facet/run analytics on.

steve

On Wed, Dec 21, 2016 at 9:25 AM Pushkar Raste 
wrote:

> You should probably have as small a swap as possible. I still feel long GCs
> are either due to swapping or thread contention.
>
> Did you try to remove all other G1GC tuning parameters except for the
> ParallelRefProcEnabled?
>
> On Dec 19, 2016 1:39 AM, "forest_soup"  wrote:
>
> > Sorry for my wrong memory. The swap is 16GB.
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> > nabble.com/Very-long-young-generation-stop-the-world-GC-
> > pause-tp4308911p4310301.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: SOLR Disk Access Latency Problem

2016-09-21 Thread Steven Bower
That sounds like some SAN vendor BS if you ask me. Breaking up 300GB into
smaller chunks would only be relevant if they were caching entire files, not
blocks, and I find that hard to believe. I'd be interested to know more
about the specifics of the problem as the vendor sees it.

As Shawn said, local attached storage (preferably SSD) is the way to go.. In
addition, using MMapDirectory with lots of RAM will give the best
performance. My rule of thumb is to keep at most a 4:1 ratio between
index size and the amount of RAM on a box (so roughly 75GB of RAM for a 300GB index).

steve

On Wed, Sep 21, 2016 at 8:08 PM Shawn Heisey  wrote:

> On 9/21/2016 7:52 AM, Kyle Daving wrote:
> > We are currently running solr 5.2.1 and attempted to upgrade to 6.2.1.
> > We attempted this last week but ran into disk access latency problems
> > so reverted back to 5.2.1. We found that after upgrading we overran
> > the NVRAM on our SAN and caused a fairly large queue depth for disk
> > access (we did not have this problem in 5.2.1). We reached out to our
> > SAN vendor and they said that it was due to the size of our optimized
> > indexes. It is not uncommon for us to have roughly 300GB single file
> > optimized indexes. Our SAN vendor advised that splitting the index
> > into smaller fragmented chunks would alleviate the NVRAM/queue depth
> > problem.
>
> How is this filesystem presented to the server?  Is it a block device
> using a protocol like iSCSI, or is it a network filesystem, like NFS or
> SMB?  Block filesystems will appear to the OS as if they are a
> completely local filesystem, and local machine memory will be used to
> cache data.  Network filesystems will usually require memory on the
> storage device for caching, and typically those machines do not have a
> lot of memory compared to the amount of storage space they have.
>
> > Why do we not see this problem with the same size index in 5.2.1? Did
> > solr change the way it accesses disk in v5 vs v6?
>
> It's hard to say why you didn't have the problem with the earlier version.
>
> All the index disk access is handled by Lucene, and from Solr's point of
> view, it's a black box, with only minimal configuration available.
> Lucene is constantly being improved, but those improvements assume the
> general best-case installation -- a machine with a local filesystem and
> plenty of spare memory to effectively cache the data that filesystem
> contains.
>
> > Is there a configuration file we should be looking at making
> > adjustments in?
>
> Unless we can figure out why there's a problem, this question cannot be
> answered.
>
> > Since everything worked fine in 5.2.1 there has to be something we are
> > overlooking when trying to use 6.2.1. Any comments and thoughts are
> > appreciated.
>
> Best guess (which could be wrong):  There's not enough memory to
> effectively cache the data in the Lucene indexes.  A newer version of
> Solr generally has *better* performance characteristics than an earlier
> version, but *ONLY* if there's enough memory available to effectively
> cache the index data, which assures that data can be accessed very
> quickly.  When the actual disk must be read, access speed will be slow
> ... and the problem may get worse with a different version.
>
> How much memory is in your Solr server, and how much is assigned to the
> Java heap for Solr?  Are you running more than one Solr instance per
> server?
>
> When you're dealing with a remote filesystem on a SAN, exactly where to
> add memory to boost performance will depend on how the filesystem is
> being presented.
>
> I strongly recommend against using a network filesystem like NFS or SMB
> to hold a Solr index.  Solr works best when the filesystem is local to
> the server and there's plenty of extra memory for caching.  The amount
> of memory required for good performance with a 300GB index will be
> substantial.
>
> Thanks,
> Shawn
>
>


Re: Full re-index without downtime

2016-07-06 Thread Steven Bower
There are two options as I see it..

1. Do something like you describe and create a secondary index, index into
it, then switch... I personally would create a completely separate SolrCloud
alongside my existing one rather than a new core in the same cloud, as you might
see some negative impacts on GC caused by the indexing load.

2. Tag each record with a field (eg "generation") that identifies which
generation of data a record is from.. when querying, filter on only the
generation of data that is complete.. new records get a new generation..
The only problem with this is that changing field types doesn't really work with
the same field names.. but if you use dynamic fields instead of static ones, the
field name changes anyway, so that isn't a problem.
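
A rough SolrJ sketch of the query side of option 2 (untested; the field name
"generation" and the example query are just illustrative):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GenerationFilterExample {

  // query only the generation that has finished indexing; records from older or
  // still-loading generations are filtered out rather than deleted
  public static QueryResponse queryLiveGeneration(SolrClient client, long liveGeneration)
      throws Exception {
    SolrQuery q = new SolrQuery("title:foo");
    q.addFilterQuery("generation:" + liveGeneration);
    return client.query(q);
  }
}

Flipping the "live" generation is then just a matter of changing the filter value
once the new load completes, and old generations can be cleaned up later with a
delete-by-query.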

We use both of these patterns in different applications..

steve

On Wed, Jul 6, 2016 at 1:27 PM Steven White  wrote:

> Hi everyone,
>
> In my environment, I have use cases where I need to fully re-index my
> data.  This happens because Solr's schema requires changes based on changes
> made to my data source, the DB.  For example, my DB schema may change so
> that it now has a whole new set of fields added or removed (on records), or
> the data type changed (on fields).  When that happens, the only solution I
> have right now is to drop the current Solr index, update Solr's schema.xml,
> re-index my data (I use Solr's core admin to dynamically do all this).
>
> The issue with my current solution is that during the re-indexing, which right
> now takes 10 hours (I expect it to take over 30 hours as my data keeps on
> growing), search via Solr is not available.  Sure, I can enable search while
> the data is being re-indexed, but then I get partial results.
>
> My question is this: how can I avoid this so there is minimal downtime,
> under 1 min.?  I was thinking of creating a second core (again dynamically)
> and re-index into it (after setting up the new schema) and once the
> re-index is fully done, switch over to the new core and drop the index from
> the old core and then delete the old core, and rename the new core to the
> old core (original core).
>
> Would the above work or is there a better way to do this?  How do you guys
> solve this problem?
>
> Again, my goal is to minimize downtime during re-indexing when Solr's
> schema is drastically changed (requiring re-indexing).
>
> Thanks in advance.
>
> Steve
>


Re: deploy solr on cloud providers

2016-07-05 Thread Steven Bower
Looking deeper into the "zookeeper as truth" mode, I was wrong about existing
replicas being recreated once storage is gone.. It seems there is intent for
that type of behavior based upon existing tickets.. We'll look at creating a
patch for this too..

Steve
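
On the min_rf point in the discussion quoted below, a rough SolrJ sketch of
requesting and checking the achieved replication factor (untested; the "min_rf"
parameter and the helper method are the ones documented for Solr 5/6-era clients,
so verify against your version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfExample {

  // ask Solr to report the achieved replication factor for this update; the update
  // still "succeeds" even if it falls short, so the client decides whether to retry
  public static void indexWithMinRf(CloudSolrClient client, String collection,
                                    SolrInputDocument doc) throws Exception {
    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setParam("min_rf", "2");
    UpdateResponse rsp = req.process(client, collection);
    int achieved = client.getMinAchievedReplicationFactor(collection, rsp.getResponse());
    if (achieved < 2) {
      // log / retry / queue for later -- whatever the application needs
    }
  }
}
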
On Tue, Jul 5, 2016 at 6:00 PM Tomás Fernández Löbbe 
wrote:

> The leader will do the replication before responding to the client, so let's
> say the leader gets to update its local copy but is terminated before
> sending the request to the replicas; the client should get either an HTTP
> 500 or no http response. From the client code you can take action (log,
> retry, etc).
> The "min_rf" is useful for the case where replicas may be down or not
> accessible. Again, you can use this for retrying or take any necessary
> action on the client side if the desired rf is not achieved.
>
> Tomás
>
> On Tue, Jul 5, 2016 at 11:39 AM, Lorenzo Fundaró <
> lorenzo.fund...@dawandamail.com> wrote:
>
> > @Tomas and @Steven
> >
> > I am a bit skeptical about this two statements:
> >
> > If a node just disappears you should be fine in terms of data
> > > availability, since Solr in "SolrCloud" replicates the data as it comes
> > in
> > > (before sending the http response)
> >
> >
> > and
> >
> > >
> > > You shouldn't "need" to move the storage as SolrCloud will replicate
> all
> > > data to the new node and anything in the transaction log will already
> be
> > > distributed through the rest of the machines..
> >
> >
> > because according to the official documentation here
> > <
> >
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> > >:
> > (Write side fault tolerant -> recovery)
> >
> > If a leader goes down, it may have sent requests to some replicas and not
> > > others. So when a new potential leader is identified, it runs a synch
> > > process against the other replicas. If this is successful, everything
> > > should be consistent, the leader registers as active, and normal
> actions
> > > proceed
> >
> >
> > I think there is a possibility that an update is not sent by the leader
> but
> > is kept on the local disk, and after it comes up again it can sync the
> > unsent data.
> >
> > Furthermore:
> >
> > Achieved Replication Factor
> > > When using a replication factor greater than one, an update request may
> > > succeed on the shard leader but fail on one or more of the replicas.
> For
> > > instance, consider a collection with one shard and replication factor
> of
> > > three. In this case, you have a shard leader and two additional
> replicas.
> > > If an update request succeeds on the leader but fails on both replicas,
> > for
> > > whatever reason, the update request is still considered successful from
> > the
> > > perspective of the client. The replicas that missed the update will
> sync
> > > with the leader when they recover.
> >
> >
> > They have implemented this parameter called *min_rf* that you can use
> > (client-side) to make sure that your update was replicated to at least
> one
> > replica (e.g.: min_rf > 1).
> >
> > This is why my concern about moving storage around, because then I know
> > when the shard leader comes back, solrcloud will run sync process for
> those
> > documents that couldn't be sent to the replicas.
> >
> > Am I missing something or misunderstood the documentation ?
> >
> > Cheers !
> >
> >
> >
> >
> >
> >
> >
> > On 5 July 2016 at 19:49, Davis, Daniel (NIH/NLM) [C] <
> daniel.da...@nih.gov
> > >
> > wrote:
> >
> > > Lorenzo, this probably comes late, but my systems guys just don't want
> to
> > > give me real disk.   Although RAID-5 or LVM on-top of JBOD may be
> better
> > > than Amazon EBS, Amazon EBS is still much closer to real disk in terms
> of
> > > IOPS and latency than NFS ;)I even ran a mini test (not an official
> > > benchmark), and found the response time for random reads to be better.
> > >
> > > If you are a young/smallish company, this may be all in the cloud, but
> if
> > > you are in a large organization like mine, you may also need to allow
> for
> > > other architectures, such as a "virtual" Netapp in the cloud that
> > > communicates with a physical Netapp on-premises, and the
> > throughput/latency
> > > of that.   The most important thing is to actually measure the numbers
> > you
> > > are getting, both for search and for simply raw I/O, or to get your
> > > systems/storage guys to measure those numbers. If you get your
> > > systems/storage guys to just measure storage - you will want to care
> > about
> > > three things for indexing primarily:
> > >
> > > Sequential Write Throughput
> > > Random Read Throughput
> > > Random Read Response Time/Latency
> > >
> > > Hope this helps,
> > >
> > > Dan Davis, Systems/Applications Architect (Contractor),
> > > Office of Computer and Communications Systems,
> > > National Library of Medicine, NIH
> > >
> > >
> > >
> > > -Original Message-
> > > From: Lorenzo 

Re: deploy solr on cloud providers

2016-07-05 Thread Steven Bower
You shouldn't "need" to move the storage as SolrCloud will replicate all
data to the new node and anything in the transaction log will already be
distributed through the rest of the machines..

One option to keep all your data attached to nodes might be to use Amazon
EFS (pretty new) to store your data.. However I've not seen any good perf
testing done against it so not sure how it will scale..

steve

On Tue, Jul 5, 2016 at 11:46 AM Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> On 5 July 2016 at 15:55, Shawn Heisey  wrote:
>
> > On 7/5/2016 1:19 AM, Lorenzo Fundaró wrote:
> > > Hi Shawn. Actually what I'm trying to find out is whether this is the
> best
> > > approach for deploying solr in the cloud. I believe solrcloud solves a
> > lot
> > > of problems in terms of High Availability but when it comes to storage
> > > there seems to be a limitation that can be workaround of course but
> it's
> > a
> > > bit cumbersome and I was wondering if there is a better option for this
> > or
> > > if I'm missing something with the way I'm doing it. I wonder if there
> are
> > > some proved experience about how to solve the storage problem when
> > > deploying in the cloud. Any advise or point to some enlightening
> > > documentation will be appreciated. Thanks.
> >
> > When you ask whether "this is the best approach" ... you need to define
> > what "this" is.  You mention a "storage problem" that needs solving ...
> > but haven't actually described that problem in a way that I can
> > understand.
>
>
> So, I'm trying to put SolrCloud in a cloud provider where a node can
> disappear any time
> because of hardware failure. In order to preserve any non-replicated
> updates I need to
> make the storage of that dead node go to the newly spawned node. I am not
> having a problem with this
> approach actually, I just want to know if there is a better way of doing
> this. I know there is HDFS support that makes
> all this easier but this is not an option for me. Thank you and I apologise
> for the unclear mails.
>
>
> >
> > Let's back up and cover some basics:
> >
> > What steps are you taking?
>
> What do you expect (or want) to happen?
>
> What actually happens?
> >
> > The answers to these questions need to be very detailed.
> >
> > Thanks,
> > Shawn
> >
> >
>
>
> --
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax   + 49 - (0)30 - 25 76 08 52
> Tel+ 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming und Niels Nüssler
> AG Charlottenburg HRB 104695 B http://www.dawanda.com
>


Re: stateless solr ?

2016-07-05 Thread Steven Bower
The ticket in question is https://issues.apache.org/jira/browse/SOLR-9265

We are working on a patch now... will update when we have a working patch /
tests..

Shawn is correct that when adding a new node to a SolrCloud cluster it will
not automatically add replicas/etc..

The idea behind this patch though is that you'll be hard-coding the
node-names (eg node1, node2, etc..) that are normally generated from
the host/port of the instance and as such not actually adding a "new" node
but replacing an existing node. In this case Solr will see the existing
cores and replicate the data down.. At least in theory.. We will need to
verify this assumption with a lot of testing, especially around having
multiple nodes with the same node name showing up at the same time (which
is something that currently cannot really occur, but can once we do this
work)..

steve

On Tue, Jul 5, 2016 at 9:54 AM Shawn Heisey  wrote:

> On 7/4/2016 7:46 AM, Lorenzo Fundaró wrote:
> > I am trying to run Solr on my infrastructure using docker containers
> > and Mesos. My problem is that I don't have a shared filesystem. I have
> > a cluster of 3 shards and 3 replicas (9 nodes in total) so if I
> > distribute well my nodes I always have 2 fallbacks of my data for
> > every shard. Every solr node will store the index in its internal
> > docker filesystem. My problem is that if I want to relocate a certain
> > node (maybe an automatic relocation because of a hardware failure), I
> > need to create the core manually in the new node because it's
> > expecting to find the core.properties file in the data folder and of
> > course it won't because the storage is ephemeral. Is there a way to
> > make a new node join the cluster with no manual intervention ?
>
> The things you're asking sound like SolrCloud.  The rest of this message
> assumes that you're running cloud.  If you're not, then we may need to
> start over.
>
> When you start a new node, it automatically joins the cluster described
> by the Zookeeper database that you point it to.
>
> SolrCloud will **NOT** automatically create replicas when a new node
> joins the cluster.  There's no way for SolrCloud to know what you
> actually want to use that new node for, so anything that it did
> automatically might be completely the wrong thing.
>
> Once you add a new node, you can replicate existing data to it with the
> ADDREPLICA action on the Collections API:
>
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica
>
> If the original problem was a down node, you might also want to use the
> DELETEREPLICA action to delete any replicas on the node that you lost
> that are marked down:
>
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9
>
> Creating cores manually in your situation is not advisable.  The
> CoreAdmin API should not be used when you're running in cloud mode.
>
> Thanks,
> Shawn
>
>


Re: stateless solr ?

2016-07-04 Thread Steven Bower
I don't think that's a bad approach with the sidecar.. We run a huge number
of Solr instances (~5k), so adding sidecars for each one adds a lot of extra
containers..

What I mean by a transition is a container dying and a new one being brought
online to replace it.. With the mod we are working on you won't need the
sidecar to add cores to the new node via the API and remove the old cores..
A new instance would start up with the same node name and just take over
the existing cores (of course it will require replication, but that will happen
automatically)

Steve
On Mon, Jul 4, 2016 at 5:27 PM Upayavira <u...@odoko.co.uk> wrote:

> What do you mean by a "transition"?
>
> Can you configure a sidekick container within your orchestrator? Have a
> sidekick always run alongside your SolrCloud nodes? In which case, this
> would be an app that does the calling of the API for you.
>
> Upayavira
>
> On Mon, 4 Jul 2016, at 08:53 PM, Steven Bower wrote:
> > My main issue is having to make any solr collection api calls during a
> > transition.. It makes integrating with orchestration engines way more
> > complex..
> > On Mon, Jul 4, 2016 at 3:40 PM Upayavira <u...@odoko.co.uk> wrote:
> >
> > > Are you using Solrcloud? With Solrcloud this stuff is easy. You just
> add
> > > a new replica for a collection, and the data is added to the new host.
> > >
> > > I'm working on a demo that will show this all working within Docker and
> > > Rancher. I've got some code (which I will open source) that handles
> > > config uploads, collection creation, etc. You can add a replica by
> > > running a container on the same node as you want the replica to reside,
> > > it'll do the rest for you.
> > >
> > > I've got the Solr bit more or less done, I'm now working on everything
> > > else (Dockerised Docker Registry/Jenkins, AWS infra build, etc).
> > >
> > > Let me know if this is interesting to you. If so, I'll post it here
> when
> > > I'm done with it.
> > >
> > > Upayavira
> > >
> > > On Mon, 4 Jul 2016, at 02:46 PM, Lorenzo Fundaró wrote:
> > > > Hello guys,
> > > >
> > > > I am trying to run Solr on my infrastructure using docker containers
> and
> > > > Mesos. My problem is that I don't have a shared filesystem. I have a
> > > > cluster of 3 shards and 3 replicas (9 nodes in total) so if I
> distribute
> > > > well my nodes I always have 2 fallbacks of my data for every shard.
> Every
> > > > solr node will store the index in its internal docker filesystem. My
> > > > problem is that if I want to relocate a certain node (maybe an
> automatic
> > > > relocation because of a hardware failure), I need to create the core
> > > > manually in the new node because it's expecting to find the
> > > > core.properties
> > > > file in the data folder and of course it won't because the storage is
> > > > ephemeral. Is there a way to make a new node join the cluster with no
> > > > manual intervention ?
> > > >
> > > > Thanks in advance !
> > > >
> > > >
> > > > --
> > > >
> > > > --
> > > > Lorenzo Fundaro
> > > > Backend Engineer
> > > > E-Mail: lorenzo.fund...@dawandamail.com
> > > >
> > > > Fax   + 49 - (0)30 - 25 76 08 52
> > > > Tel+ 49 - (0)179 - 51 10 982
> > > >
> > > > DaWanda GmbH
> > > > Windscheidstraße 18
> > > > 10627 Berlin
> > > >
> > > > Geschäftsführer: Claudia Helming, Niels Nüssler und Michael Pütz
> > > > AG Charlottenburg HRB 104695 B http://www.dawanda.com
> > >
>


Re: stateless solr ?

2016-07-04 Thread Steven Bower
My main issue is having to make any solr collection api calls during a
transition.. It makes integrating with orchestration engines way more
complex..
On Mon, Jul 4, 2016 at 3:40 PM Upayavira  wrote:

> Are you using Solrcloud? With Solrcloud this stuff is easy. You just add
> a new replica for a collection, and the data is added to the new host.
>
> I'm working on a demo that will show this all working within Docker and
> Rancher. I've got some code (which I will open source) that handles
> config uploads, collection creation, etc. You can add a replica by
> running a container on the same node as you want the replica to reside,
> it'll do the rest for you.
>
> I've got the Solr bit more or less done, I'm now working on everything
> else (Dockerised Docker Registry/Jenkins, AWS infra build, etc).
>
> Let me know if this is interesting to you. If so, I'll post it here when
> I'm done with it.
>
> Upayavira
>
> On Mon, 4 Jul 2016, at 02:46 PM, Lorenzo Fundaró wrote:
> > Hello guys,
> >
> > I am trying to run Solr on my infrastructure using docker containers and
> > Mesos. My problem is that I don't have a shared filesystem. I have a
> > cluster of 3 shards and 3 replicas (9 nodes in total) so if I distribute
> > well my nodes I always have 2 fallbacks of my data for every shard. Every
> > solr node will store the index in its internal docker filesystem. My
> > problem is that if I want to relocate a certain node (maybe an automatic
> > relocation because of a hardware failure), I need to create the core
> > manually in the new node because it's expecting to find the
> > core.properties
> > file in the data folder and of course it won't because the storage is
> > ephemeral. Is there a way to make a new node join the cluster with no
> > manual intervention ?
> >
> > Thanks in advance !
> >
> >
> > --
> >
> > --
> > Lorenzo Fundaro
> > Backend Engineer
> > E-Mail: lorenzo.fund...@dawandamail.com
> >
> > Fax   + 49 - (0)30 - 25 76 08 52
> > Tel+ 49 - (0)179 - 51 10 982
> >
> > DaWanda GmbH
> > Windscheidstraße 18
> > 10627 Berlin
> >
> > Geschäftsführer: Claudia Helming, Niels Nüssler und Michael Pütz
> > AG Charlottenburg HRB 104695 B http://www.dawanda.com
>


Re: stateless solr ?

2016-07-04 Thread Steven Bower
We have been working on some changes that should help with this.. The 1st
challenge is having the node name remain static regardless of where the
node runs (right now it uses host and port, so this won't work unless you
are using some sort of tunneled or dynamic networking).. We have a patch we
are working on for this.. Once this is in place and you use the "zookeeper
is truth" mode for SolrCloud, this should seamlessly transition into
the new node (and replicate).. Will update with the ticket number as I
forget it off hand

Steve
On Mon, Jul 4, 2016 at 9:47 AM Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> Hello guys,
>
> I am trying to run Solr on my infrastructure using docker containers and
> Mesos. My problem is that I don't have a shared filesystem. I have a
> cluster of 3 shards and 3 replicas (9 nodes in total) so if I distribute
> well my nodes I always have 2 fallbacks of my data for every shard. Every
> solr node will store the index in its internal docker filesystem. My
> problem is that if I want to relocate a certain node (maybe an automatic
> relocation because of a hardware failure), I need to create the core
> manually in the new node because it's expecting to find the core.properties
> file in the data folder and of course it won't because the storage is
> ephemeral. Is there a way to make a new node join the cluster with no
> manual intervention ?
>
> Thanks in advance !
>
>
> --
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax   + 49 - (0)30 - 25 76 08 52
> Tel+ 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming, Niels Nüssler und Michael Pütz
> AG Charlottenburg HRB 104695 B http://www.dawanda.com
>


Re: Solr cross core join special condition

2015-11-11 Thread Steven Bower
commenting so this ends up in Dennis' inbox..

On Tue, Oct 13, 2015 at 7:17 PM Yonik Seeley  wrote:

> On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal  wrote:
> > I developed a join transformer plugin that did that (although it didn't
> > flatten the results like that).  The one thing that was painful about it
> is
> > that the TextResponseWriter has references to both the IndexSchema and
> > SolrReturnFields objects for the primary core.  So when you add a
> > SolrDocument from another core it returned the wrong fields.
>
> We've made some progress on this front in trunk:
>
> * SOLR-7957: internal/expert - ResultContext was significantly changed
> and expanded
>   to allow for multiple full query results (DocLists) per Solr request.
>   TransformContext was rendered redundant and was removed. (yonik)
>
> So ResultContext now has its own searcher, ReturnFields, etc.
>
> -Yonik
>


SolrCloud with local configs

2015-05-21 Thread Steven Bower
Is it possible to run in cloud mode with zookeeper managing
collections/state/etc.. but to read all config files (solrconfig, schema,
etc..) from local disk?

Obviously this implies that you'd have to keep them in sync..

My thought here is running Solr in a docker container, but instead of
having to manage schema changes/etc via ZK I can just build the config into
the container.. and then just produce a new docker image with a Solr
version and the new config and do rolling restarts of the containers..

Thanks,

Steve


Re: Spatial maxDistErr changes

2014-04-03 Thread Steven Bower
Thanks...

I noticed that.. I tried to send a mail to your Mitre address and it got
returned...

Not sure if you've locked something new down but if you are interested we
are looking to hire for our search team at Bloomberg LP

steve


On Wed, Apr 2, 2014 at 11:20 AM, David Smiley dsmi...@apache.org wrote:

 Good question Steve,

 You'll have to re-index right off.

 ~ David
 p.s. Sorry I didn't reply sooner; I just switched jobs and reconfigured my
 mailing list subscriptions



 Steven Bower wrote
  If am only indexing point shapes and I want to change the maxDistErr from
  0.09 (1m res) to 0.00045 will this break as in searches stop
 working
  or will search work but any performance gain won't be seen until all docs
  are reindexed? Or will I have to reindex right off?
 
  thanks,
 
  steve





 -
  Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
  Independent Lucene/Solr search consultant,
 http://www.linkedin.com/in/davidwsmiley
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Spatial-maxDistErr-changes-tp4124836p4128620.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Spatial maxDistErr changes

2014-03-17 Thread Steven Bower
If I am only indexing point shapes and I want to change the maxDistErr from
0.09 (1m res) to 0.00045, will this break (as in, searches stop working),
or will search work but any performance gain won't be seen until all docs
are reindexed? Or will I have to reindex right off?

thanks,

steve


IDF maxDocs / numDocs

2014-03-12 Thread Steven Bower
I am noticing the maxDocs between replicas is consistently different, and
since it is used in the idf calculation, that causes idf scores for the same
query/doc to differ between replicas. Obviously an optimize can
normalize the maxDocs values, but that is only temporary.. Is there a way
to have idf use numDocs instead (as it should be consistent across
replicas)?

thanks,

steve


Re: IDF maxDocs / numDocs

2014-03-12 Thread Steven Bower
My problem is that maxDoc() and docCount() both include documents that
have been deleted in their values. Because of merging/etc.. those numbers
can be different per replica (or at least that is what I'm seeing). I need
a value that is consistent across replicas... I see the comment makes
mention of not using IndexReader.numDocs(), but there doesn't seem to be a
way to get ahold of the IndexReader within a similarity implementation (as
only TermStats and CollectionStats are passed in, and neither contains a ref
to the reader).

I am contemplating just using a static value for the number of docs, as
this won't change dramatically very often..
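
Another option might be a custom similarity that overrides idfExplain() and feeds
docCount() into idf() instead of maxDoc() -- a rough sketch against the Lucene 4.x
API (untested; it would still need to be registered as the similarity in schema.xml):

import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class DocCountIdfSimilarity extends DefaultSimilarity {

  @Override
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    final long df = termStats.docFreq();
    // docCount() can return -1 when the codec doesn't track it, so fall back to maxDoc()
    final long docCount = collectionStats.docCount() == -1
        ? collectionStats.maxDoc() : collectionStats.docCount();
    final float idf = idf(df, docCount);
    return new Explanation(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
  }
}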

steve


On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in
 idfExplain but there's also a docCount(). We use docCount in all our custom
 similarities, also because it allows you to have multiple languages in one
 index where one is much larger than the other. The small language will have
 very high IDF scores using maxDoc but they are proportional enough using
 docCount(). Using docCount() also fixes SolrCloud ranking problems, unless
 one of your replicas becomes inconsistent ;)


 https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29



 -Original message-
  From:Steven Bower smb-apa...@alcyon.net
  Sent: Wednesday 12th March 2014 16:08
  To: solr-user solr-user@lucene.apache.org
  Subject: IDF maxDocs / numDocs
 
  I am noticing the maxDocs between replicas is consistently different and
  that in the idf calculation it is used which causes idf scores for the
 same
  query/doc between replicas to be different. obviously an optimize can
  normalize the maxDocs scores, but that is only temporary.. is there a way
  to have idf use numDocs instead (as it should be consistent across
  replicas)?
 
  thanks,
 
  steve
 



Re: Issue with spatial search

2014-03-11 Thread Steven Bower
Great.. that worked!

What does distErrPct actually control, besides the error
percentage? Or, maybe better put, how does it impact performance?

steve


On Mon, Mar 10, 2014 at 11:17 PM, David Smiley (@MITRE.org) 
dsmi...@mitre.org wrote:

 Correct, Steve. Alternatively you can also put this option in your query
 after the end of the last parenthesis, as in this example from the wiki:

   fq=geo:IsWithin(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))
 distErrPct=0

 ~ David


 Steven Bower wrote
  Only points in the index.. Am I correct this won't require a reindex?
 
  On Monday, March 10, 2014, Smiley, David W. dsmi...@mitre.org wrote:
 
  Hi Steven,
 
  Set distErrPct to 0 in order to get non-point shapes to always be as
  accurate as maxDistErr.  Point shapes are always that accurate.  As long
  as
  you only index points, not other shapes (you don't index polygons, etc.)
  then distErrPct of 0 should be fine.  In fact, perhaps a future Solr
  version should simply use 0 as the default; the last time I did
  benchmarks
  it was pretty marginal impact of higher distErrPct.
 
  It's a fairly different story if you are indexing non-point shapes.
 
  ~ David
 
  From: Steven Bower smb-apa...@alcyon.net
  Reply-To: solr-user@lucene.apache.org
  Date: Monday, March 10, 2014 at 4:23 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Issue with spatial search
 
  Minor edit to the KML to adjust color of polygon
 
 
  On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.net wrote:
  I am seeing a error when doing a spatial search where a particular
  point
  is showing up within a polygon, but by all methods I've tried that point
  is
  not within the polygon..
 
  First the point is: 41.2299,29.1345 (lat/lon)
 
  The polygon is:
 
  31.2719,32.283
  31.2179,32.3681
  31.1333,32.3407
  30.9356,32.6318
  31.0707,34.5196
  35.2053,36.9415
  37.2959,36.6339
  40.8334,30.4273
  41.1622,29.1421
  41.6484,27.4832
  47.0255,13.6342
  43.9457,3.17525
  37.0029,-5.7017
  35.7741,-5.57719
  34.801,-4.66201
  33.345,10.0157
  29.6745,18.9366
  30.6592,29.1683
  31.2719,32.283
 
  The geo field we are using has this config:
 
 
  <fieldType name="location_rpt"
      class="solr.SpatialRecursivePrefixTreeFieldType"
      distErrPct="0.025"
      maxDistErr="0.09"
      spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
      units="degrees"/>
 
  The config is basically the same as the one from the docs...
 
  They query I am issuing is this:
 
  location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407
  31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339
  37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342
  47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201
  34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283
  31.2719)))
 
  and it brings back a result where the location field is
 41.2299,29.1345
 
  I've attached a KML with the polygon and the point and you can see from
  that, visually, that the point is not within the polygon. I also tried
 in
  google maps API but after playing around realize that the polygons in
  maps
  are draw in Euclidian space while the map itself is a Mercator
  projection..
  Loading the kml in earth fixes this issue but the point still lays
  outside
  the polygon.. The distance between the edge of the polygon closes to the
  point and the point itself is ~1.2 miles which is much larger than the
  1meter accuracy given by the maxDistErr (per the docs).
 
  Any thoughts on this?
 
  Thanks,
 
  Steve
 
 





 -
  Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Issue-with-spatial-search-tp4122690p4122744.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Issue with spatial search

2014-03-10 Thread Steven Bower
I am seeing an error when doing a spatial search where a particular point
is showing up within a polygon, but by all methods I've tried that point is
not within the polygon..

First the point is: 41.2299,29.1345 (lat/lon)

The polygon is:

31.2719,32.283
31.2179,32.3681
31.1333,32.3407
30.9356,32.6318
31.0707,34.5196
35.2053,36.9415
37.2959,36.6339
40.8334,30.4273
41.1622,29.1421
41.6484,27.4832
47.0255,13.6342
43.9457,3.17525
37.0029,-5.7017
35.7741,-5.57719
34.801,-4.66201
33.345,10.0157
29.6745,18.9366
30.6592,29.1683
31.2719,32.283

The geo field we are using has this config:

<fieldType name="location_rpt"
    class="solr.SpatialRecursivePrefixTreeFieldType"
    distErrPct="0.025"
    maxDistErr="0.09"
    spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
    units="degrees"/>

The config is basically the same as the one from the docs...

The query I am issuing is this:

location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407
31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339
37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342
47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201
34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719)))

and it brings back a result where the location field is 41.2299,29.1345

I've attached a KML with the polygon and the point, and you can see from
that, visually, that the point is not within the polygon. I also tried the
Google Maps API, but after playing around realized that its polygons
are drawn in Euclidean space while the map itself is a Mercator projection..
Loading the KML in Google Earth fixes this issue, but the point still lies outside
the polygon.. The distance between the edge of the polygon closest to the
point and the point itself is ~1.2 miles, which is much larger than the
1-meter accuracy given by the maxDistErr (per the docs).

Any thoughts on this?

Thanks,

Steve
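
For anyone who wants to reproduce the check outside Solr, here is a small sketch
using JTS directly (the same geometry library JtsSpatialContextFactory pulls in);
note that WKT is x y, i.e. lon lat, so the lat/lon point above becomes (29.1345 41.2299):

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import com.vividsolutions.jts.io.WKTReader;

public class PointInPolygonCheck {

  public static void main(String[] args) throws Exception {
    String polygonWkt = "POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407 31.1333, "
        + "32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339 37.2959, "
        + "30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342 47.0255, "
        + "3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201 34.801, "
        + "10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719))";
    Geometry polygon = new WKTReader().read(polygonWkt);
    Point point = new GeometryFactory().createPoint(new Coordinate(29.1345, 41.2299));
    // JTS evaluates this in planar (Euclidean) coordinates, like the RPT field above
    System.out.println("intersects: " + polygon.intersects(point));
  }
}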


solr_map_issue.kml
Description: application/vnd.google-earth.kml


Re: Issue with spatial search

2014-03-10 Thread Steven Bower
Minor edit to the KML to adjust color of polygon


On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.net wrote:

 I am seeing a error when doing a spatial search where a particular point
 is showing up within a polygon, but by all methods I've tried that point is
 not within the polygon..

 First the point is: 41.2299,29.1345 (lat/lon)

 The polygon is:

 31.2719,32.283
 31.2179,32.3681
 31.1333,32.3407
 30.9356,32.6318
 31.0707,34.5196
 35.2053,36.9415
 37.2959,36.6339
 40.8334,30.4273
 41.1622,29.1421
 41.6484,27.4832
 47.0255,13.6342
 43.9457,3.17525
 37.0029,-5.7017
 35.7741,-5.57719
 34.801,-4.66201
 33.345,10.0157
 29.6745,18.9366
 30.6592,29.1683
 31.2719,32.283

 The geo field we are using has this config:

 <fieldType name="location_rpt"
     class="solr.SpatialRecursivePrefixTreeFieldType"
     distErrPct="0.025"
     maxDistErr="0.09"
     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
     units="degrees"/>

 The config is basically the same as the one from the docs...

 They query I am issuing is this:

 location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407
 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339
 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342
 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201
 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719)))

 and it brings back a result where the location field is 41.2299,29.1345

 I've attached a KML with the polygon and the point and you can see from
 that, visually, that the point is not within the polygon. I also tried in
 google maps API but after playing around realize that the polygons in maps
 are draw in Euclidian space while the map itself is a Mercator projection..
 Loading the kml in earth fixes this issue but the point still lays outside
 the polygon.. The distance between the edge of the polygon closes to the
 point and the point itself is ~1.2 miles which is much larger than the
 1meter accuracy given by the maxDistErr (per the docs).

 Any thoughts on this?

 Thanks,

 Steve



solr_map_issue.kml
Description: application/vnd.google-earth.kml


Re: Issue with spatial search

2014-03-10 Thread Steven Bower
Weirdly that same point shows up in the polygon below as well, which in the
area around the point doesn't intersect with the polygon in my first msg...

29.0454,41.2198
29.2349,41.1826
31.1107,40.9956
38.437,40.7991
41.1616,40.8988
42.1284,42.2141
40.0919,47.8482
30.4169,47.5783
26.9892,43.6459
27.2095,41.5676
29.0454,41.2198



On Mon, Mar 10, 2014 at 4:23 PM, Steven Bower smb-apa...@alcyon.net wrote:

 Minor edit to the KML to adjust color of polygon


 On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.netwrote:

 I am seeing a error when doing a spatial search where a particular
 point is showing up within a polygon, but by all methods I've tried that
 point is not within the polygon..

 First the point is: 41.2299,29.1345 (lat/lon)

 The polygon is:

 31.2719,32.283
 31.2179,32.3681
 31.1333,32.3407
 30.9356,32.6318
 31.0707,34.5196
 35.2053,36.9415
 37.2959,36.6339
 40.8334,30.4273
 41.1622,29.1421
 41.6484,27.4832
 47.0255,13.6342
 43.9457,3.17525
 37.0029,-5.7017
 35.7741,-5.57719
 34.801,-4.66201
 33.345,10.0157
 29.6745,18.9366
 30.6592,29.1683
 31.2719,32.283

 The geo field we are using has this config:

 <fieldType name="location_rpt"
     class="solr.SpatialRecursivePrefixTreeFieldType"
     distErrPct="0.025"
     maxDistErr="0.09"
     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
     units="degrees"/>

 The config is basically the same as the one from the docs...

 They query I am issuing is this:

 location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407
 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339
 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342
 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201
 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719)))

 and it brings back a result where the location field is 41.2299,29.1345

 I've attached a KML with the polygon and the point and you can see from
 that, visually, that the point is not within the polygon. I also tried in
 google maps API but after playing around realize that the polygons in maps
 are draw in Euclidian space while the map itself is a Mercator projection..
 Loading the kml in earth fixes this issue but the point still lays outside
 the polygon.. The distance between the edge of the polygon closes to the
 point and the point itself is ~1.2 miles which is much larger than the
 1meter accuracy given by the maxDistErr (per the docs).

 Any thoughts on this?

 Thanks,

 Steve





Re: Issue with spatial search

2014-03-10 Thread Steven Bower
Only points in the index.. Am I correct this won't require a reindex?

On Monday, March 10, 2014, Smiley, David W. dsmi...@mitre.org wrote:

 Hi Steven,

 Set distErrPct to 0 in order to get non-point shapes to always be as
 accurate as maxDistErr.  Point shapes are always that accurate.  As long as
 you only index points, not other shapes (you don't index polygons, etc.)
 then distErrPct of 0 should be fine.  In fact, perhaps a future Solr
 version should simply use 0 as the default; the last time I did benchmarks
 it was pretty marginal impact of higher distErrPct.

 It's a fairly different story if you are indexing non-point shapes.

 ~ David

 From: Steven Bower smb-apa...@alcyon.net
 Reply-To: solr-user@lucene.apache.org
 Date: Monday, March 10, 2014 at 4:23 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Issue with spatial search

 Minor edit to the KML to adjust color of polygon


 On Mon, Mar 10, 2014 at 4:21 PM, Steven Bower smb-apa...@alcyon.net wrote:
 I am seeing a error when doing a spatial search where a particular point
 is showing up within a polygon, but by all methods I've tried that point is
 not within the polygon..

 First the point is: 41.2299,29.1345 (lat/lon)

 The polygon is:

 31.2719,32.283
 31.2179,32.3681
 31.1333,32.3407
 30.9356,32.6318
 31.0707,34.5196
 35.2053,36.9415
 37.2959,36.6339
 40.8334,30.4273
 41.1622,29.1421
 41.6484,27.4832
 47.0255,13.6342
 43.9457,3.17525
 37.0029,-5.7017
 35.7741,-5.57719
 34.801,-4.66201
 33.345,10.0157
 29.6745,18.9366
 30.6592,29.1683
 31.2719,32.283

 The geo field we are using has this config:

 <fieldType name="location_rpt"
     class="solr.SpatialRecursivePrefixTreeFieldType"
     distErrPct="0.025"
     maxDistErr="0.09"
     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
     units="degrees"/>

 The config is basically the same as the one from the docs...

 They query I am issuing is this:

 location:Intersects(POLYGON((32.283 31.2719, 32.3681 31.2179, 32.3407
 31.1333, 32.6318 30.9356, 34.5196 31.0707, 36.9415 35.2053, 36.6339
 37.2959, 30.4273 40.8334, 29.1421 41.1622, 27.4832 41.6484, 13.6342
 47.0255, 3.17525 43.9457, -5.7017 37.0029, -5.57719 35.7741, -4.66201
 34.801, 10.0157 33.345, 18.9366 29.6745, 29.1683 30.6592, 32.283 31.2719)))

 and it brings back a result where the location field is 41.2299,29.1345

 I've attached a KML with the polygon and the point and you can see from
 that, visually, that the point is not within the polygon. I also tried in
 google maps API but after playing around realize that the polygons in maps
 are draw in Euclidian space while the map itself is a Mercator projection..
 Loading the kml in earth fixes this issue but the point still lays outside
 the polygon.. The distance between the edge of the polygon closes to the
 point and the point itself is ~1.2 miles which is much larger than the
 1meter accuracy given by the maxDistErr (per the docs).

 Any thoughts on this?

 Thanks,

 Steve




Re: core.properties and solr.xml

2014-01-23 Thread Steven Bower
For us, we don't fully rely on the cloud/collections API for creating and
deploying instances/etc.. We control this via an external mechanism, so this
would allow me to have instances figure out what they should be based on an
external system.. We do this now but have to drop core.properties files all
over.. I'd like to not have to do that... It's more of a desire for
cleanliness of my filesystem than anything else, because this is all
automated at this point..


On Wed, Jan 15, 2014 at 1:49 PM, Mark Miller markrmil...@gmail.com wrote:

 What’s the benefit? So you can avoid having a simple core properties file?
 I’d rather see more value than that prompt exposing something like this to
 the user. It’s a can of worms that I personally have not seen a lot of
 value in yet.

 Whether we mark it experimental or not, this adds a burden, and I’m still
 wondering if the gains are worth it.

 - Mark

 On Jan 15, 2014, at 12:04 PM, Alan Woodward a...@flax.co.uk wrote:

  This is true.  But if we slap big warning: experimental messages all
 over it, then users can't complain too much about backwards-compat breaks.
  My intention when pulling all this stuff into the CoresLocator interface
 was to allow other implementations to be tested out, and other suggestions
 have already come up from time to time on the list.  It seems a shame to
 *not* allow this to be opened up for advanced users.
 
  Alan Woodward
  www.flax.co.uk
 
 
  On 15 Jan 2014, at 16:24, Mark Miller wrote:
 
  I think these APIs are pretty new and deep to want to support them for
 users at this point. It constrains refactoring and can complicate things
 down the line, especially with SolrCloud. This same discussion has come up
 in JIRA issues before. At best, I think all the recent refactoring in this
 area needs to bake.
 
  - Mark
 
  On Jan 15, 2014, at 11:01 AM, Alan Woodward a...@flax.co.uk wrote:
 
  I think solr.xml is the correct place for it, and you can then set up
 substitution variables to allow it to be set by environment variables, etc.
  But let's discuss on the JIRA ticket.
 
  Alan Woodward
  www.flax.co.uk
 
 
  On 15 Jan 2014, at 15:39, Steven Bower wrote:
 
  I will open up a JIRA... I'm more concerned over the core locator
 stuff vs
  the solr.xml.. Should the specification of the core locator go into
 the
  solr.xml or via some other method?
 
  steve
 
 
  On Tue, Jan 14, 2014 at 5:06 PM, Alan Woodward a...@flax.co.uk
 wrote:
 
  Hi Steve,
 
  I think this is a great idea.  Currently the implementation of
  CoresLocator is picked depending on the type of solr.xml you have
 (new- vs
  old-style), but it should be easy enough to extend the new-style
 logic to
  optionally look up and instantiate a plugin implementation.
 
  Core loading and new core creation is all done through the CL now,
 so as
  long as the plugin implemented all methods, it shouldn't break the
  Collections API either.
 
  Do you want to open a JIRA?
 
  Alan Woodward
  www.flax.co.uk
 
 
  On 14 Jan 2014, at 19:20, Erick Erickson wrote:
 
  The work done as part of new style solr.xml, particularly by
  romsegeek should make this a lot easier. But no, there's no formal
  support for such a thing.
 
  There's also a desire to make ZK the one source of truth in Solr
 5,
  although that effort is in early stages.
 
  Which is a long way of saying that I think this would be a good
 thing
  to add. Currently there's no formal way to specify one though. We'd
  have to give some thought as to what abstract methods are required.
  The current old style and new style classes . There's also the
  chicken-and-egg question; how does one specify the new class? This
  seems like something that would be in a (very small) solr.xml or
  specified as a sysprop. And knowing where to load the class from
 could
  be interesting.
 
  A pluggable SolrConfig I think is a stickier wicket, it hasn't been
  broken out into nice interfaces like coreslocator has been. And it's
  used all over the place, passed in and recorded in constructors etc,
  as well as being possibly unique for each core. There's been some
 talk
  of sharing a single config object, and there's also talk about using
  config sets that might address some of those concerns, but neither
  one has gotten very far in 4x land.
 
  FWIW,
  Erick
 
  On Tue, Jan 14, 2014 at 1:41 PM, Steven Bower 
 smb-apa...@alcyon.net
  wrote:
  Are there any plans/tickets to allow for pluggable SolrConf and
  CoreLocator? In my use case my solr.xml is totally static, i have a
  separate dataDir and my core.properties are derived from a separate
  configuration (living in ZK) but totally outside of the SolrCloud..
 
  I'd like to be able to not have any instance directories and/or no
  solr.xml
  or core.properties files laying around as right now I just
 regenerate
  them
  on startup each time in my start scripts..
 
  Obviously I can just hack my stuff in and clearly this could break
 the
  write side of the collections API (which i don't

Re: core.properties and solr.xml

2014-01-15 Thread Steven Bower
I will open up a JIRA... I'm more concerned about the core locator stuff vs
the solr.xml.. Should the specification of the core locator go into
solr.xml, or come in via some other method?

steve


On Tue, Jan 14, 2014 at 5:06 PM, Alan Woodward a...@flax.co.uk wrote:

 Hi Steve,

 I think this is a great idea.  Currently the implementation of
 CoresLocator is picked depending on the type of solr.xml you have (new- vs
 old-style), but it should be easy enough to extend the new-style logic to
 optionally look up and instantiate a plugin implementation.

 Core loading and new core creation is all done through the CL now, so as
 long as the plugin implemented all methods, it shouldn't break the
 Collections API either.

 Do you want to open a JIRA?

 Alan Woodward
 www.flax.co.uk


 On 14 Jan 2014, at 19:20, Erick Erickson wrote:

  The work done as part of new style solr.xml, particularly by
  romsegeek should make this a lot easier. But no, there's no formal
  support for such a thing.
 
  There's also a desire to make ZK the one source of truth in Solr 5,
  although that effort is in early stages.
 
  Which is a long way of saying that I think this would be a good thing
  to add. Currently there's no formal way to specify one though. We'd
  have to give some thought as to what abstract methods are required.
  The current old style and new style classes . There's also the
  chicken-and-egg question; how does one specify the new class? This
  seems like something that would be in a (very small) solr.xml or
  specified as a sysprop. And knowing where to load the class from could
  be interesting.
 
  A pluggable SolrConfig I think is a stickier wicket, it hasn't been
  broken out into nice interfaces like coreslocator has been. And it's
  used all over the place, passed in and recorded in constructors etc,
  as well as being possibly unique for each core. There's been some talk
  of sharing a single config object, and there's also talk about using
  config sets that might address some of those concerns, but neither
  one has gotten very far in 4x land.
 
  FWIW,
  Erick
 
  On Tue, Jan 14, 2014 at 1:41 PM, Steven Bower smb-apa...@alcyon.net
 wrote:
  Are there any plans/tickets to allow for pluggable SolrConf and
  CoreLocator? In my use case my solr.xml is totally static, i have a
  separate dataDir and my core.properties are derived from a separate
  configuration (living in ZK) but totally outside of the SolrCloud..
 
  I'd like to be able to not have any instance directories and/or no
 solr.xml
  or core.properties files laying around as right now I just regenerate
 them
  on startup each time in my start scripts..
 
  Obviously I can just hack my stuff in and clearly this could break the
  write side of the collections API (which i don't care about for my
 case)...
  but having a way to plug these would be nice..
 
  steve




core.properties and solr.xml

2014-01-14 Thread Steven Bower
Are there any plans/tickets to allow for pluggable SolrConf and
CoreLocator? In my use case my solr.xml is totally static, I have a
separate dataDir and my core.properties are derived from a separate
configuration (living in ZK) but totally outside of the SolrCloud..

I'd like to be able to not have any instance directories and/or no solr.xml
or core.properties files laying around as right now I just regenerate them
on startup each time in my start scripts..

Obviously I can just hack my stuff in and clearly this could break the
write side of the collections API (which I don't care about for my case)...
but having a way to plug these would be nice..

steve


Index Sizes

2014-01-07 Thread Steven Bower
I was looking at the code for getIndexSize() on the ReplicationHandler to
get at the size of the index on disk. From what I can tell, because this
does directory.listAll() to get all the files in the directory, the size on
disk includes not only what is searchable at the moment but potentially
also files that are being created by background merges/etc.. I am wondering
if there is an API that would give me the size of the currently
searchable index files (doubt this exists, but maybe)..

If not what is the most appropriate way to get a list of the segments/files
that are currently in use by the active searcher such that I could then ask
the directory implementation for the size of all those files?

For a more complete picture of what I'm trying to accomplish, I am looking
at building a quota/monitoring component that will trigger when index size
on disk gets above a certain size. I don't want it to trigger if the index is
doing a merge and ephemerally using disk for that process. If anyone has any
suggestions/recommendations here too I'd be interested..

Thanks,

steve
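
For reference, one way to measure only the files referenced by the commit point the
current searcher is reading -- a rough sketch against the Solr/Lucene 4.x API
(untested); in-flight merge output is not part of that commit, so it is not counted:

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

public class SearchableIndexSize {

  // sum the sizes of only the files the active searcher's commit point references
  public static long searchableBytes(SolrCore core) throws IOException {
    RefCounted<SolrIndexSearcher> ref = core.getSearcher();
    try {
      DirectoryReader reader = ref.get().getIndexReader();
      Directory dir = reader.directory();
      long total = 0;
      for (String file : reader.getIndexCommit().getFileNames()) {
        total += dir.fileLength(file);
      }
      return total;
    } finally {
      ref.decref();
    }
  }
}
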


SolrCoreAware

2013-11-15 Thread Steven Bower
Under what circumstances will a handler that implements SolrCoreAware have
its inform() method called?

thanks,

steve


Re: SolrCoreAware

2013-11-15 Thread Steven Bower
So it's something that can happen multiple times during the lifetime of a
process, but I'm guessing something not occurring very often?

Also is there a way to hook the shutdown of the core?

steve


On Fri, Nov 15, 2013 at 12:08 PM, Alan Woodward a...@flax.co.uk wrote:

 Hi Steven,

 It's called when the handler is created, either at SolrCore construction
 time (solr startup or core reload) or the first time the handler is
 requested if it's a lazy-loading handler.

 Alan Woodward
 www.flax.co.uk


 On 15 Nov 2013, at 15:40, Steven Bower wrote:

  Under what circumstances will a handler that implements SolrCoreAware
 have
  its inform() method called?
 
  thanks,
 
  steve




Re: SolrCoreAware

2013-11-15 Thread Steven Bower
 it should be called only once during the lifetime of a given plugin,
 usually not long after construction -- but it could be called many, many
 times in the lifetime of the solr process.

So for a given instance of a handler it will only be called once during the
lifetime of that handler?

Also, when the core is passed in as part of inform() is it guaranteed to be
ready to go? (ie I can start feeding content at this point?)

thanks,

steve




On Fri, Nov 15, 2013 at 12:52 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


  : So it's something that can happen multiple times during the lifetime of
  : the process, but I'm guessing something not occurring very often?

  it should be called only once during the lifetime of a given plugin,
 usually not long after construction -- but it could be called many, many
 times in the lifetime of the solr process.

 : Also is there a way to hook the shutdown of the core?

 any object (SolrCoreAware or otherwise) can ask the SolrCore to add a
 CloseHook at anytime...


 https://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/core/SolrCore.html#addCloseHook%28org.apache.solr.core.CloseHook%29


 -Hoss



Re: SolrCoreAware

2013-11-15 Thread Steven Bower
And the close hook will basically only be fired once during shutdown?


On Fri, Nov 15, 2013 at 1:07 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : So for a given instance of a handler it will only be called once during
 the
 : lifetime of that handler?

 correct (unless there is a bug somewhere)

 : Also, when the core is passed in as part of inform() is it guaranteed to
 be
 : ready to go? (ie I can start feeding content at this point?)

 Right, that's the point of the interface: way back in the day we had
  people writing plugins that were trying to use SolrCore from their init()
  methods and the SolrCore wasn't fully initialized yet (didn't have
  DirectUpdateHandler yet, didn't have all of the RequestHandlers
  initialized, didn't have an openSearcher, etc...)

  the inform(SolrCore) method is called after the SolrCore is initialized,
  and all plugins hanging off of it have been init()ed ... you can still get
  into trouble if you write FooA.inform(SolrCore) such that it asks the
  SolrCore for a pointer to some FooB plugin and expect that FooB's
  inform(SolrCore) method has already been called -- because there is no
  guaranteed order -- but the basic functionality and basic plugin
 initialization has all been done at that point.


 -Hoss
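
A minimal sketch of the pattern discussed above (hypothetical class name; a
real plugin would also implement or extend whatever handler/component type it
is registered as): do core-dependent setup in inform(), and register a
CloseHook there for shutdown work.

    import org.apache.solr.core.CloseHook;
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.util.plugin.SolrCoreAware;

    public class MyCoreAwarePlugin implements SolrCoreAware {
      @Override
      public void inform(SolrCore core) {
        // Called once per plugin instance, after the core's update handler,
        // other plugins and searcher have been initialized.
        core.addCloseHook(new CloseHook() {
          @Override
          public void preClose(SolrCore c) {
            // runs as the core begins shutting down
          }
          @Override
          public void postClose(SolrCore c) {
            // runs after the core has closed
          }
        });
      }
    }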



Re: How to set a condition over stats result

2013-10-04 Thread Steven Bower
Check out: https://issues.apache.org/jira/browse/SOLR-5302 can do this
using query facets


On Fri, Jul 12, 2013 at 11:35 AM, Jack Krupansky j...@basetechnology.comwrote:

 sum(x, y, z) = x + y + z (sums those specific fields values for the
 current document)

 sum(x, y) = x + y (sum of those two specific field values for the current
 document)

 sum(x) = field(x) = x (the specific field value for the current document)

 The sum function in function queries is not an aggregate function. Ditto
 for min and max.

 -- Jack Krupansky

 -Original Message- From: mihaela olteanu
 Sent: Friday, July 12, 2013 1:44 AM
 To: solr-user@lucene.apache.org

 Subject: Re: How to set a condition over stats result

 What if you perform sub(sum(myfieldvalue),100) > 0 using frange?


 ________________________________
 From: Jack Krupansky j...@basetechnology.com
 To: solr-user@lucene.apache.org
 Sent: Friday, July 12, 2013 7:44 AM
 Subject: Re: How to set a condition over stats result


 None that I know of, short of writing a custom search component.
 Seriously, you could hack up a copy of the stats component with your own
 logic.

 Actually... this may be a case for the new, proposed Script Request
 Handler, which would let you execute a query and then you could do any
 custom JavaScript logic you wanted.

 When we get that feature, it might be interesting to implement a variation
 of the standard stats component as a JavaScript script, and then people
 could easily hack it such as in your request. Fascinating.

 -- Jack Krupansky

 -Original Message- From: Matt Lieber
 Sent: Thursday, July 11, 2013 6:08 PM
 To: solr-user@lucene.apache.org
 Subject: How to set a condition over stats result


  Hello,

 I am trying to see how I can test the sum of values of an attribute across
 docs.
 I.e. Whether sum(myfieldvalue) > 100.

 I know I can use the stats module which compiles the sum of my attributes
 on a certain facet, but how can I perform a test on this result (i.e. is
 sum > 100) within my stats query? From what I read, it's not supported yet
 to perform a function on the stats module..
 Any other way to do this ?

 Cheers,
 Matt





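For completeness: frange (like all function queries) evaluates per document,
so it can test a function of each document's own values but not an aggregate
over the result set. For example, with hypothetical fields fieldA and fieldB,
the following keeps documents whose own fieldA + fieldB exceeds 100, which is
not the same thing as testing the sum over all matching documents:

    fq={!frange l=100 incl=false}sum(fieldA,fieldB)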



Re: StatsComponent with median

2013-10-04 Thread Steven Bower
Check out: https://issues.apache.org/jira/browse/SOLR-5302 it supports
median value


On Wed, Jul 3, 2013 at 12:11 PM, William Bell billnb...@gmail.com wrote:

 If you are a programmer, you can modify it and attach a patch in Jira...




 On Tue, Jun 4, 2013 at 4:25 AM, Marcin Rzewucki mrzewu...@gmail.com
 wrote:

  Hi there,
 
  StatsComponent currently does not have median on the list of results. Is
  there a plan to add it in the next release(s) ? Shall I add a ticket in
  Jira for this ?
 
  Regards.
 



 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Re: bucket count for facets

2013-09-06 Thread Steven Bower
Understood, what I need is a count of the unique values in a field and that
field is multi-valued (which makes stats component a non-option)


On Fri, Sep 6, 2013 at 4:22 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Stats Component can give you a count of non-null values in a field.

 See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

 On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower smb-apa...@alcyon.net
 wrote:
  Is there a way to get the count of buckets (ie unique values) for a field
  facet? the rudimentary approach of course is to get back all buckets, but
  in some cases this is a huge amount of data.
 
  thanks,
 
  steve



 --
 Regards,
 Shalin Shekhar Mangar.
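
For reference, the rudimentary approach mentioned above amounts to asking for
every bucket and counting them on the client, e.g. (hypothetical field name):

    q=*:*&rows=0&facet=true&facet.field=myfield&facet.limit=-1&facet.mincount=1

which is exactly what becomes expensive for high-cardinality fields, since the
full bucket list has to be built and returned just to get its length.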



bucket count for facets

2013-09-05 Thread Steven Bower
Is there a way to get the count of buckets (ie unique values) for a field
facet? the rudimentary approach of course is to get back all buckets, but
in some cases this is a huge amount of data.

thanks,

steve


AND not working

2013-08-15 Thread Steven Bower
I have query like:

q=foo AND bar
defType=edismax
qf=field1
qf=field2
qf=field3

with debug on I see it parsing to this:

(+(DisjunctionMaxQuery((field1:foo | field2:foo | field3:foo))
DisjunctionMaxQuery((field1:and | field2:and | field3:and))
DisjunctionMaxQuery((field1:bar | field2:bar | field3:bar/no_coord

basically it seems to be treating the AND as a term... any thoughts?

thx,

steve


Re: AND not working

2013-08-15 Thread Steven Bower
@Yonik that was exactly the issue... I'll file a ticket... there def should
be an exception thrown for something like this..

It would seem to me that eating any sort of exception is a really bad
thing...

steve


On Thu, Aug 15, 2013 at 5:59 PM, Yonik Seeley yo...@lucidworks.com wrote:

 I can reproduce something like this by specifying a field that doesn't
 exist for a qf param.
 This seems like a bug... if a field doesn't exist, we should throw an
 exception (since it's a parameter error not related to the q string where
 we avoid throwing any errors).

 -Yonik
 http://lucidworks.com


 On Thu, Aug 15, 2013 at 5:19 PM, Steven Bower smb-apa...@alcyon.net
 wrote:

  I have query like:
 
  q=foo AND bar
  defType=edismax
  qf=field1
  qf=field2
  qf=field3
 
  with debug on I see it parsing to this:
 
  (+(DisjunctionMaxQuery((field1:foo | field2:foo | field3:foo))
  DisjunctionMaxQuery((field1:and | field2:and | field3:and))
  DisjunctionMaxQuery((field1:bar | field2:bar | field3:bar/no_coord
 
  basically it seems to be treating the AND as a term... any thoughts?
 
  thx,
 
  steve
 



Re: AND not working

2013-08-15 Thread Steven Bower
https://issues.apache.org/jira/browse/SOLR-5163


On Thu, Aug 15, 2013 at 6:04 PM, Steven Bower smb-apa...@alcyon.net wrote:

 @Yonik that was exactly the issue... I'll file a ticket... there def
 should be an exception thrown for something like this..

 It would seem to me that eating any sort of exception is a really bad
 thing...

 steve


 On Thu, Aug 15, 2013 at 5:59 PM, Yonik Seeley yo...@lucidworks.comwrote:

 I can reproduce something like this by specifying a field that doesn't
 exist for a qf param.
 This seems like a bug... if a field doesn't exist, we should throw an
 exception (since it's a parameter error not related to the q string
 where
 we avoid throwing any errors).

 -Yonik
 http://lucidworks.com


 On Thu, Aug 15, 2013 at 5:19 PM, Steven Bower smb-apa...@alcyon.net
 wrote:

  I have query like:
 
  q=foo AND bar
  defType=edismax
  qf=field1
  qf=field2
  qf=field3
 
  with debug on I see it parsing to this:
 
  (+(DisjunctionMaxQuery((field1:foo | field2:foo | field3:foo))
  DisjunctionMaxQuery((field1:and | field2:and | field3:and))
  DisjunctionMaxQuery((field1:bar | field2:bar | field3:bar/no_coord
 
  basically it seems to be treating the AND as a term... any thoughts?
 
  thx,
 
  steve
 





Schema Lint

2013-08-06 Thread Steven Bower
Is there an easy way in code / command line to lint a solr config (or even
just a solr schema)?

Steve
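
One rough way to "lint" a config without a dedicated tool is simply to load it
the same way a core would and let parse/validation errors surface. A sketch
against the 4.x constructors (untested; signatures differ in later versions):

    import org.apache.solr.core.SolrConfig;
    import org.apache.solr.core.SolrResourceLoader;
    import org.apache.solr.schema.IndexSchema;

    // Sketch: "lint" a conf dir by loading it; exceptions indicate problems.
    public class SchemaLint {
      public static void main(String[] args) throws Exception {
        String instanceDir = args[0];  // e.g. /path/to/collection1
        SolrResourceLoader loader = new SolrResourceLoader(instanceDir);
        SolrConfig config = new SolrConfig(loader, "solrconfig.xml", null);
        IndexSchema schema = new IndexSchema(config, "schema.xml", null);
        System.out.println("Loaded schema: " + schema.getSchemaName());
      }
    }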


Re: Performance question on Spatial Search

2013-08-05 Thread Steven Bower
So after re-feeding our data with a new boolean field that is true when
data exists and false when it doesn't, our search times have gone from avg
of about 20s to around 150ms... pretty amazing change in perf... It seems
like https://issues.apache.org/jira/browse/SOLR-5093 might alleviate many
people's pain in doing this kind of query (if I have some time I may take a
look at it)...

Anyway we are in pretty good shape at this point.. the only remaining issue
is that the first queries after commits are taking 5-6s... This is caused by
the loading of 2 (one long and one int) FieldCaches (uninvert) that are
used for sorting.. I'm suspecting that docvalues will greatly help this
load performance?

thanks,

steve
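
For anyone following along, the change boils down to something like this in
schema.xml (field names here are placeholders, and the boolean/tlong/tint
field types are assumed to be defined as in the example schema):

    <!-- set at index time: true when the spatial 'pp' field is populated -->
    <field name="has_pp" type="boolean" indexed="true" stored="false"/>

    <!-- sort fields; docValues avoids the FieldCache uninvert after commits -->
    <field name="sort_long" type="tlong" indexed="true" stored="false" docValues="true"/>
    <field name="sort_int"  type="tint"  indexed="true" stored="false" docValues="true"/>

and the filter then becomes has_pp:true combined with the two Intersects
clauses, instead of the +pp:* term walk.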


On Wed, Jul 31, 2013 at 4:32 PM, Steven Bower smb-apa...@alcyon.net wrote:

 the list of IDs does change relatively frequently, but this doesn't seem
 to have very much impact on the performance of the query as far as I can
 tell.

 attached are the stacks

 thanks,

 steve


 On Wed, Jul 31, 2013 at 6:33 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 On Wed, Jul 31, 2013 at 1:10 AM, Steven Bower sbo...@alcyon.net wrote:

 
   not sure what you mean by good hit ratio?
 

 I mean such queries are really expensive (even on cache hit), so if the
  list of ids changes every time, it never hits the cache and hence executes
  these heavy queries every time. It's a well known performance problem.


  Here are the stacks...
 
  they seem like hotspots, and show index reading, which is reasonable. But I
  can't see what caused these reads; to get that I need the whole stack of
  the hot thread.


 
Name Time (ms) Own Time (ms)
 
 
 org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext,
  Bits) 300879 203478
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc()
  45539 19
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs()
  45519 40
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput,
  int[], int[], int, boolean) 24352 0
  org.apache.lucene.store.DataInput.readVInt() 24352 24352
  org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[],
  int[]) 21126 14976
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  6150 0  java.nio.DirectByteBuffer.get(byte[], int, int)
  6150 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
  DocsEnum, int) 35342 421
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
  34920 27939
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
  BlockTermState) 6980 6980
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
  14129 1053
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
  5948 261
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
  5686 199
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  3606 0  java.nio.DirectByteBuffer.get(byte[], int, int)
  3606 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
  FieldInfo, BlockTermState) 1879 80
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  1798 0java.nio.DirectByteBuffer.get(byte[], int, int)
  1798 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 1798 1798
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next()
  4010 3324
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf()
  685 685
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
  3117 144
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861
  0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
  FieldInfo, BlockTermState) 1090 19
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  1070 0  java.nio.DirectByteBuffer.get(byte[], int, int)
  1070 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()
  20 0org.apache.lucene.store.ByteBufferIndexInput.clone()
  20 0
  org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0
  org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 20
  0
  org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0

Re: Performance question on Spatial Search

2013-07-31 Thread Steven Bower
the list of IDs does change relatively frequently, but this doesn't seem to
have very much impact on the performance of the query as far as I can tell.

attached are the stacks

thanks,

steve


On Wed, Jul 31, 2013 at 6:33 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 On Wed, Jul 31, 2013 at 1:10 AM, Steven Bower sbo...@alcyon.net wrote:

 
   not sure what you mean by good hit ratio?
 

 I mean such queries are really expensive (even on cache hit), so if the
  list of ids changes every time, it never hits the cache and hence executes these
  heavy queries every time. It's a well known performance problem.


  Here are the stacks...
 
  they seem like hotspots, and show index reading, which is reasonable. But I
  can't see what caused these reads; to get that I need the whole stack of the
  hot thread.


 
Name Time (ms) Own Time (ms)
 
 
 org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext,
  Bits) 300879 203478
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc()
  45539 19
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs()
  45519 40
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput,
  int[], int[], int, boolean) 24352 0
  org.apache.lucene.store.DataInput.readVInt() 24352 24352
  org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[],
  int[]) 21126 14976
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  6150 0  java.nio.DirectByteBuffer.get(byte[], int, int)
  6150 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
  DocsEnum, int) 35342 421
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
  34920 27939
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
  BlockTermState) 6980 6980
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
  14129 1053
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
  5948 261
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
  5686 199
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  3606 0  java.nio.DirectByteBuffer.get(byte[], int, int)
  3606 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
  FieldInfo, BlockTermState) 1879 80
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  1798 0java.nio.DirectByteBuffer.get(byte[], int, int)
  1798 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 1798 1798
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next()
  4010 3324
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf()
  685 685
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
  3117 144
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861
  0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
  FieldInfo, BlockTermState) 1090 19
  org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
  1070 0  java.nio.DirectByteBuffer.get(byte[], int, int)
  1070 0
  java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()
  20 0org.apache.lucene.store.ByteBufferIndexInput.clone()
  20 0
  org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0
  org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 20
  0
  org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0
 
 org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object,
  ReferenceQueue) 20 0
  java.lang.System.identityHashCode(Object) 20 20
  org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int)
  1485 527
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
  DocsEnum, int) 957 0
 
 
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
  957 513
 
 
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
  BlockTermState) 443 443
  org.apache.lucene.index.FilteredTermsEnum.next() 874 324
 
 
 org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef)
  368 0
 
 
 org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object,
  Object) 368

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
Until I get the data re-fed: there was another field (a date field) that
was present exactly when the geo field was, and absent when it was not... I
tried field:* on that field and query times come down to 2.5s .. also just
removing that filter brings the query down to 30ms.. so I'm very hopeful that
with just a boolean I'll be down in that sub-100ms range..

steve


On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower sbo...@alcyon.net wrote:

 Will give the boolean thing a shot... makes sense...


 On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. dsmi...@mitre.orgwrote:

  I see the problem -- it's +pp:*. It may look innocent but it's a
  performance killer.  What you're telling Lucene to do is iterate over
  *every* term in this index to find all documents that have this data.
  Most fields are pretty slow to do that.  Lucene/Solr does not have some
  kind of cache for this. Instead, you should index a new boolean field
  indicating whether or not 'pp' is populated and then do a simple true check
 against that field.  Another approach you could do right now without
 reindexing is to simplify the last 2 clauses of your 3-clause boolean
 query by using the IsDisjointTo predicate.  But unfortunately Lucene
 doesn't have a generic filter cache capability and so this predicate has
 no place to cache the whole-world query it does internally (each and every
 time it's used), so it will be slower than the boolean field I suggested
 you add.


 Nevermind on LatLonType; it doesn't support JTS/Polygons.  There is
 something close called SpatialPointVectorFieldType that could be modified
 trivially but it doesn't support it now.

 ~ David

 On 7/30/13 11:32 AM, Steven Bower sbo...@alcyon.net wrote:

 #1 Here is my query:
 
 sort=vid asc
 start=0
 rows=1000
 defType=edismax
 q=*:*
 fq=recordType:xxx
 fq=vt:X12B AND
 fq=(cls:3 OR cls:8)
 fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
 fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR
 vid:89XXX48
 OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR
 vid:90XXX33
 OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR
 vid:90XXX44
 OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR
 vid:91XXX87
 OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR
 vid:91XXX94
 OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR
 vid:91XXX67
 OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR
 vid:92XXX13
 OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR
 vid:92XXX99
 OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR
 vid:92XXX41
 OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR
 vid:93XXX98
 OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR
 vid:93XXX28
 OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR
 vid:94XXX10
 OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR
 vid:94XXX58
 OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR
 vid:94XXX56
 OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR
 vid:96XXX10
 OR vid:96XXX54 )
 fq=gp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0,
 47.0
 30.0))) AND NOT pp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0,
 52.0 30.0, 47.0 30.0))) AND +pp:*
 
 Basically looking for a set of records by vid then if its gp is in one
  polygon and its pp is not in another (and it has a pp)... essentially
 looking to see if a record moved between two polygons (gp=current,
 pp=prev)
 during a time period.
 
 #2 Yes on JTS (unless from my query above I don't) however this is only
 an
 initial use case and I suspect we'll need more complex stuff in the
 future
 
 #3 The data is distributed globally but along generally fixed paths and
 then clustering around certain areas... for example the polygon above has
 about 11k points (with no date filtering). So basically some areas will
 be
 very dense and most areas not, the majority of searches will be around
 the
 dense areas
 
  #4 It's very likely to be less than 1M results (with filters) .. is there
  any functionality loss with LatLonType fields?
 
 Thanks,
 
 steve
 
 
 On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) 
 dsmi...@mitre.org wrote:
 
  Steve,
   (1)  Can you give a specific example of how you are specifying the
 spatial
  query?  I'm looking to ensure you are not using IsWithin, which is
 not
  meant for point data.  If your query shape is a circle or the bounding
 box
  of a circle, you should use the geofilt query parser, otherwise use the
  quirky syntax that allows you to specify the spatial predicate with
  Intersects.
  (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
  (3) How dense would you estimate the data is at the 50m resolution
 you've
   configured the data?  If it's very dense then I'll tell you how to
 raise
  the
  prefix grid scan level to a # closer to max-levels.
  (4) Do all of your searches find less than a million points,
 considering
  all
  filters?  If so then it's worth comparing

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
I am curious why the field:* walks the entire terms list.. could this be
discovered from a field cache / docvalues?

steve


On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower sbo...@alcyon.net wrote:

  Until I get the data re-fed: there was another field (a date field) that
  was present exactly when the geo field was, and absent when it was not... I
  tried field:* on that field and query times come down to 2.5s .. also just
  removing that filter brings the query down to 30ms.. so I'm very hopeful
  that with just a boolean I'll be down in that sub-100ms range..

 steve


 On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower sbo...@alcyon.net wrote:

 Will give the boolean thing a shot... makes sense...


 On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. dsmi...@mitre.orgwrote:

  I see the problem -- it's +pp:*. It may look innocent but it's a
  performance killer.  What you're telling Lucene to do is iterate over
  *every* term in this index to find all documents that have this data.
  Most fields are pretty slow to do that.  Lucene/Solr does not have some
  kind of cache for this. Instead, you should index a new boolean field
  indicating whether or not 'pp' is populated and then do a simple true
 check
 against that field.  Another approach you could do right now without
 reindexing is to simplify the last 2 clauses of your 3-clause boolean
 query by using the IsDisjointTo predicate.  But unfortunately Lucene
 doesn't have a generic filter cache capability and so this predicate has
 no place to cache the whole-world query it does internally (each and
 every
 time it's used), so it will be slower than the boolean field I suggested
 you add.


 Nevermind on LatLonType; it doesn't support JTS/Polygons.  There is
 something close called SpatialPointVectorFieldType that could be modified
 trivially but it doesn't support it now.

 ~ David

 On 7/30/13 11:32 AM, Steven Bower sbo...@alcyon.net wrote:

 #1 Here is my query:
 
 sort=vid asc
 start=0
 rows=1000
 defType=edismax
 q=*:*
 fq=recordType:xxx
 fq=vt:X12B AND
 fq=(cls:3 OR cls:8)
 fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
 fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR
 vid:89XXX48
 OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR
 vid:90XXX33
 OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR
 vid:90XXX44
 OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR
 vid:91XXX87
 OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR
 vid:91XXX94
 OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR
 vid:91XXX67
 OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR
 vid:92XXX13
 OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR
 vid:92XXX99
 OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR
 vid:92XXX41
 OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR
 vid:93XXX98
 OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR
 vid:93XXX28
 OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR
 vid:94XXX10
 OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR
 vid:94XXX58
 OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR
 vid:94XXX56
 OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR
 vid:96XXX10
 OR vid:96XXX54 )
 fq=gp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0,
 47.0
 30.0))) AND NOT pp:Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0
 27.0,
 52.0 30.0, 47.0 30.0))) AND +pp:*
 
 Basically looking for a set of records by vid then if its gp is in one
  polygon and its pp is not in another (and it has a pp)... essentially
 looking to see if a record moved between two polygons (gp=current,
 pp=prev)
 during a time period.
 
 #2 Yes on JTS (unless from my query above I don't) however this is only
 an
 initial use case and I suspect we'll need more complex stuff in the
 future
 
 #3 The data is distributed globally but along generally fixed paths and
 then clustering around certain areas... for example the polygon above
 has
 about 11k points (with no date filtering). So basically some areas will
 be
 very dense and most areas not, the majority of searches will be around
 the
 dense areas
 
  #4 It's very likely to be less than 1M results (with filters) .. is there
  any functionality loss with LatLonType fields?
 
 Thanks,
 
 steve
 
 
 On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) 
 dsmi...@mitre.org wrote:
 
  Steve,
   (1)  Can you give a specific example of how you are specifying the
 spatial
  query?  I'm looking to ensure you are not using IsWithin, which is
 not
  meant for point data.  If your query shape is a circle or the bounding
 box
  of a circle, you should use the geofilt query parser, otherwise use
 the
  quirky syntax that allows you to specify the spatial predicate with
  Intersects.
  (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
  (3) How dense would you estimate the data is at the 50m resolution
 you've
   configured the data?  If it's very dense then I'll tell

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
org.apache.lucene.store.ByteBufferIndexInput.clone() 19 0
org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 19
0
org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 19 0
org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object,
ReferenceQueue) 19 0
java.lang.System.identityHashCode(Object) 19 19
org.apache.lucene.util.FixedBitSet.init(int) 28 28


On Tue, Jul 30, 2013 at 4:18 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 On Tue, Jul 30, 2013 at 12:45 AM, Steven Bower smb-apa...@alcyon.net
 wrote:

 
  - Most of my time (98%) is being spent in
  java.nio.Bits.copyToByteArray(long,Object,long,long) which is being


  Steven, please see
  http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html. My
  benchmarking experience shows that NIO is a turtle, absolutely.

  also, are you sure that fq=(vid:86XXX73 OR vid:86XXX20 . has a good hit
  ratio? otherwise it's a well known beast.

  could you also show a deeper stack, to make sure what causes the excessive
  reading?



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com
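
For reference, the directory implementation can be pinned explicitly in
solrconfig.xml; on a 64-bit JVM the default factory normally selects MMap
already, but forcing it is a one-line change (a sketch):

    <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>

Whether it is actually in effect can be confirmed the way described elsewhere
in this thread (pmap / the process's virtual size roughly tracking index size).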



Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
@David I will certainly update when we get the data re-fed... and if you
have things you'd like to investigate or try out please let me know.. I'm
happy to eval things at scale here... we will be taking this index from its
current 45m records to 600-700m over the next few months as well..

steve


On Tue, Jul 30, 2013 at 5:10 PM, Steven Bower sbo...@alcyon.net wrote:

 Very good read... Already using MMap... verified using pmap and vsz from
 top..

  not sure what you mean by good hit ratio?

 Here are the stacks...

Name Time (ms) Own Time (ms)
 org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext,
 Bits) 300879 203478
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc()
 45539 19
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs()
 45519 40
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput,
 int[], int[], int, boolean) 24352 0
 org.apache.lucene.store.DataInput.readVInt() 24352 24352
 org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[],
 int[]) 21126 14976
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 6150 0  java.nio.DirectByteBuffer.get(byte[], int, int) 6150 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
 DocsEnum, int) 35342 421
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
 34920 27939
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
 BlockTermState) 6980 6980
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
 14129 1053
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
 5948 261
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 5686 199
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 3606 0  java.nio.DirectByteBuffer.get(byte[], int, int) 3606 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
 FieldInfo, BlockTermState) 1879 80
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 1798 0java.nio.DirectByteBuffer.get(byte[], int, int) 1798
 0  java.nio.Bits.copyToArray(long, Object, long, long,
 long) 1798 1798
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next()
 4010 3324
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf()
 685 685
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 3117 144
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
 FieldInfo, BlockTermState) 1090 19
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 1070 0  java.nio.DirectByteBuffer.get(byte[], int, int) 1070 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()
 20 0org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0
 org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0
 org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 20 0
 org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0
 org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object,
 ReferenceQueue) 20 0
 java.lang.System.identityHashCode(Object) 20 20
 org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int) 1485
 527
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
 DocsEnum, int) 957 0
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
 957 513
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
 BlockTermState) 443 443
 org.apache.lucene.index.FilteredTermsEnum.next() 874 324
 org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef)
 368 0
 org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object,
 Object) 368 368
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
 160 0
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
 160 0
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 160 0
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 120 0
 org.apache.lucene.codecs.lucene41

Re: Transaction Logs Leaking FileDescriptors

2013-05-16 Thread Steven Bower
Looking at the timestamps on the tlog files they seem to have all been
created around the same time (04:55).. starting around this time I start
seeing the exception below (there were 1628).. in fact it's getting tons of
these (200k+) but most of the time inside regular commits...

2013-15-05 04:55:06.634 ERROR UpdateLog [recoveryExecutor-6-thread-7922] -
java.lang.ArrayIndexOutOfBoundsException: 2603
at
org.apache.lucene.codecs.lucene40.BitVector.get(BitVector.java:146)
at
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc(Lucene41PostingsReader.java:492)
at
org.apache.lucene.index.BufferedDeletesStream.applyTermDeletes(BufferedDeletesStream.java:407)
at
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:273)
at
org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2973)
at
org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2964)
at
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2704)
at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2839)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2819)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:536)
at
org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1339)
at
org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1163)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)



On Thu, May 16, 2013 at 9:35 AM, Yonik Seeley yo...@lucidworks.com wrote:

 See https://issues.apache.org/jira/browse/SOLR-3939

 Do you see these log messages from this in your logs?
   log.info(I may be the new leader - try and sync);

 How reproducible is this bug for you?  It would be great to know if
 the patch in the issue fixes things.

 -Yonik
 http://lucidworks.com


 On Wed, May 15, 2013 at 6:04 PM, Steven Bower sbo...@alcyon.net wrote:
  They are visible to ls...
 
 
  On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley yo...@lucidworks.com
 wrote:
 
  On Wed, May 15, 2013 at 5:20 PM, Steven Bower sbo...@alcyon.net
 wrote:
   when the TransactionLog objects are dereferenced
   their RandomAccessFile object is not closed..
 
  Have the files been deleted (unlinked from the directory), or are they
  still visible via ls?
 
  -Yonik
  http://lucidworks.com
 



Re: Transaction Logs Leaking FileDescriptors

2013-05-16 Thread Steven Bower
Created https://issues.apache.org/jira/browse/SOLR-4831 to capture this
issue


On Thu, May 16, 2013 at 10:10 AM, Steven Bower sbo...@alcyon.net wrote:

 Looking at the timestamps on the tlog files they seem to have all been
 created around the same time (04:55).. starting around this time I start
  seeing the exception below (there were 1628).. in fact it's getting tons of
 these (200k+) but most of the time inside regular commits...

 2013-15-05 04:55:06.634 ERROR UpdateLog [recoveryExecutor-6-thread-7922] -
 java.lang.ArrayIndexOutOfBoundsException: 2603
 at
 org.apache.lucene.codecs.lucene40.BitVector.get(BitVector.java:146)
 at
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc(Lucene41PostingsReader.java:492)
 at
 org.apache.lucene.index.BufferedDeletesStream.applyTermDeletes(BufferedDeletesStream.java:407)
 at
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:273)
 at
 org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2973)
 at
 org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2964)
 at
 org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2704)
 at
 org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2839)
 at
 org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2819)
 at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:536)
 at
 org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1339)
 at
 org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1163)
 at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)



 On Thu, May 16, 2013 at 9:35 AM, Yonik Seeley yo...@lucidworks.comwrote:

 See https://issues.apache.org/jira/browse/SOLR-3939

 Do you see these log messages from this in your logs?
   log.info(I may be the new leader - try and sync);

 How reproducible is this bug for you?  It would be great to know if
 the patch in the issue fixes things.

 -Yonik
 http://lucidworks.com


 On Wed, May 15, 2013 at 6:04 PM, Steven Bower sbo...@alcyon.net wrote:
  They are visible to ls...
 
 
  On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley yo...@lucidworks.com
 wrote:
 
  On Wed, May 15, 2013 at 5:20 PM, Steven Bower sbo...@alcyon.net
 wrote:
   when the TransactionLog objects are dereferenced
   their RandomAccessFile object is not closed..
 
  Have the files been deleted (unlinked from the directory), or are they
  still visible via ls?
 
  -Yonik
  http://lucidworks.com
 





Transaction Logs Leaking FileDescriptors

2013-05-15 Thread Steven Bower
We have a system in which a client is sending 1 record at a time (via REST)
followed by a commit. This has produced ~65k tlog files and the JVM has run
out of file descriptors... I grabbed a heap dump from the JVM and I can see
~52k unreachable FileDescriptors... This leads me to believe that the
TransactionLog is not properly closing all of its files before getting rid
of the object...

I've verified with lsof that indeed there are ~60k tlog files that are open
currently..

This is Solr 4.3.0

Thanks,

steve


Re: Transaction Logs Leaking FileDescriptors

2013-05-15 Thread Steven Bower
Most definitely understand the don't commit after each record...
unfortunately the data is being fed by another team which I cannot
control...

Limiting the number of potential tlog files is good but I think there is
also an issue in that when the TransactionLog objects are dereferenced
their RandomAccessFile object is not closed.. thus delaying release of the
descriptor until the object is GC'd...

I'm hunting through the UpdateHandler code to try and find where this
happens now..

steve


On Wed, May 15, 2013 at 5:13 PM, Yonik Seeley yo...@lucidworks.com wrote:

 Hmmm, we keep open a number of tlog files based on the number of
 records in each file (so we always have a certain amount of history),
 but IIRC, the number of tlog files is also capped.  Perhaps there is a
 bug when the limit to tlog files is reached (as opposed to the number
 of documents in the tlog files).

 I'll see if I can create a test case to reproduce this.

 Separately, you'll get a lot better performance if you don't commit
 per update of course (or at least use something like commitWithin).

 -Yonik
 http://lucidworks.com

 On Wed, May 15, 2013 at 5:06 PM, Steven Bower sbo...@alcyon.net wrote:
  We have a system in which a client is sending 1 record at a time (via
 REST)
  followed by a commit. This has produced ~65k tlog files and the JVM has
 run
  out of file descriptors... I grabbed a heap dump from the JVM and I can
 see
  ~52k unreachable FileDescriptors... This leads me to believe that the
   TransactionLog is not properly closing all of its files before getting
 rid
  of the object...
 
  I've verified with lsof that indeed there are ~60k tlog files that are
 open
  currently..
 
  This is Solr 4.3.0
 
  Thanks,
 
  steve
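
As a practical note on the commit-per-record pattern: the usual mitigations
are commitWithin on the update request, or autoCommit/autoSoftCommit in the
updateHandler section of solrconfig.xml, so the client never needs to send
commits at all. Illustrative values only:

    <autoCommit>
      <maxTime>60000</maxTime>            <!-- hard commit every 60s; flushes and rolls tlogs -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>5000</maxTime>             <!-- new searcher (visibility) every 5s -->
    </autoSoftCommit>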



Re: Transaction Logs Leaking FileDescriptors

2013-05-15 Thread Steven Bower
There seem to be quite a few places where the RecentUpdates class is used
but is not properly created/closed throughout the code...

For example in RecoveryStrategy it does this correctly:

    UpdateLog.RecentUpdates recentUpdates = null;
    try {
      recentUpdates = ulog.getRecentUpdates();
      recentVersions = recentUpdates.getVersions(ulog.numRecordsToKeep);
    } catch (Throwable t) {
      SolrException.log(log, "Corrupt tlog - ignoring. core=" + coreName, t);
      recentVersions = new ArrayList<Long>(0);
    } finally {
      if (recentUpdates != null) {
        recentUpdates.close();
      }
    }

But in a number of other places its used more like this:

    UpdateLog.RecentUpdates recentUpdates = ulog.getRecentUpdates();
    try {
      // ... some code ...
    } finally {
      recentUpdates.close();
    }

The problem it would seem is that RecentUpdates.getRecentUpdates() can fail
when it calls update() as it is doing IO on the log itself.. in that case
you'll get orphaned references to the log...

I'm not 100% sure this is my problem.. I'm scouring the logs to see if this
codepath was triggered...

steve


On Wed, May 15, 2013 at 5:26 PM, Walter Underwood wun...@wunderwood.orgwrote:

 Maybe we need a flag in the update handler to ignore commit requests.

 I just enabled a similar thing for our JVM, because something, somewhere
 was calling System.gc(). You can completely ignore explicit GC calls or you
 can turn them into requests for a concurrent GC.

 A similar setting for Solr might map commit requests to hard commit
 (default), soft commit, or none.

 wunder

 On May 15, 2013, at 2:20 PM, Steven Bower wrote:

   Most definitely understand the don't commit after each record...
  unfortunately the data is being fed by another team which I cannot
  control...
 
  Limiting the number of potential tlog files is good but I think there is
  also an issue in that when the TransactionLog objects are dereferenced
  their RandomAccessFile object is not closed.. thus delaying release of
 the
  descriptor until the object is GC'd...
 
  I'm hunting through the UpdateHandler code to try and find where this
  happens now..
 
  steve
 
 
  On Wed, May 15, 2013 at 5:13 PM, Yonik Seeley yo...@lucidworks.com
 wrote:
 
  Hmmm, we keep open a number of tlog files based on the number of
  records in each file (so we always have a certain amount of history),
  but IIRC, the number of tlog files is also capped.  Perhaps there is a
  bug when the limit to tlog files is reached (as opposed to the number
  of documents in the tlog files).
 
  I'll see if I can create a test case to reproduce this.
 
  Separately, you'll get a lot better performance if you don't commit
  per update of course (or at least use something like commitWithin).
 
  -Yonik
  http://lucidworks.com
 
  On Wed, May 15, 2013 at 5:06 PM, Steven Bower sbo...@alcyon.net
 wrote:
  We have a system in which a client is sending 1 record at a time (via
  REST)
  followed by a commit. This has produced ~65k tlog files and the JVM has
  run
  out of file descriptors... I grabbed a heap dump from the JVM and I can
  see
  ~52k unreachable FileDescriptors... This leads me to believe that the
   TransactionLog is not properly closing all of its files before getting
  rid
  of the object...
 
  I've verified with lsof that indeed there are ~60k tlog files that are
  open
  currently..
 
  This is Solr 4.3.0
 
  Thanks,
 
  steve
 







Re: Transaction Logs Leaking FileDescriptors

2013-05-15 Thread Steven Bower
They are visible to ls...


On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley yo...@lucidworks.com wrote:

 On Wed, May 15, 2013 at 5:20 PM, Steven Bower sbo...@alcyon.net wrote:
  when the TransactionLog objects are dereferenced
  their RandomAccessFile object is not closed..

 Have the files been deleted (unlinked from the directory), or are they
 still visible via ls?

 -Yonik
 http://lucidworks.com



Re: Per Shard Replication Factor

2013-05-10 Thread Steven Bower
This approach would work to satisfy the requirement but I think it would
generally be nice to have the ability to control this within a single
collection (so you don't give up any functionality when querying between
the collections and to make the management of the system easier).

Anyway I'll create a ticket and take a look at how this might work..

steve


On Thu, May 9, 2013 at 8:23 PM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Could these just be different collections? Then sharding and replication is
 independent.  And you can reduce replication factor as the index ages.

 Otis
  Solr & ElasticSearch Support
 http://sematext.com/
 On May 9, 2013 1:43 AM, Steven Bower smb-apa...@alcyon.net wrote:

  Is it currently possible to have per-shard replication factor?
 
  A bit of background on the use case...
 
   If you are hashing content to shards by a known factor (let's say date
  ranges, 12 shards, 1 per month) it might be the case that most of your
  search traffic would be directed to one particular shard (eg. the current
  month shard) and having increased query capacity in that shard would be
  useful... this could be extended to many use cases such as data hashed by
  organization, type, etc.
 
  Thanks,
 
  steve
 



Per Shard Replication Factor

2013-05-08 Thread Steven Bower
Is it currently possible to have per-shard replication factor?

A bit of background on the use case...

If you are hashing content to shards by a known factor (let's say date
ranges, 12 shards, 1 per month) it might be the case that most of your
search traffic would be directed to one particular shard (eg. the current
month shard) and having increased query capacity in that shard would be
useful... this could be extended to many use cases such as data hashed by
organization, type, etc.

Thanks,

steve