Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
So unloading a core doesn't delete the data?  That is good to know.
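
For the archives, the plan is to use the CoreAdmin UNLOAD call with
deleteIndex left false.  A rough SolrJ sketch of what I mean is below; it is
untested, the core name is made up, and the class names are from SolrJ 5.x
(on 4.x the client class would be HttpSolrServer):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    SolrClient solr = new HttpSolrClient("http://localhost:8983/solr");

    // UNLOAD with deleteIndex=false closes the core (and its searcher and
    // writer) but leaves the data directory on disk, so the core can be
    // created again later.
    CoreAdminRequest.Unload unload = new CoreAdminRequest.Unload(false);
    unload.setCoreName("core_12345");
    unload.process(solr);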

On Mon, Aug 3, 2015 at 6:22 PM, Erick Erickson erickerick...@gmail.com
wrote:

 This doesn't work in SolrCloud, but it really sounds like you want the
 "lots of cores" feature, which is designed to keep the most recent N
 cores loaded and auto-unload older ones, see:
 http://wiki.apache.org/solr/LotsOfCores
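
 From memory, the relevant knobs are a transientCacheSize in solr.xml plus
 marking each core transient; roughly something like the sketch below
 (double-check the wiki page, the values here are made up):

     <!-- solr.xml -->
     <solr>
       <int name="transientCacheSize">20</int>
     </solr>

     # each core's core.properties
     name=core_12345
     transient=true
     loadOnStartup=false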

 Best,
 Erick

 On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:
  Is there an easy way for a client to tell Solr to close or release the
  IndexSearcher and/or IndexWriter for a core?

  I have a use case where we're creating a lot of cores with not that many
  documents per zone (a few hundred to maybe tens of thousands).  Writes
  come in batches, and reads also tend to be bursty, if less so than the
  writes.

  And we're having problems with RAM usage on the server.  Poking around a
  heap dump, the problem is that every IndexSearcher or IndexWriter being
  opened takes up a large amount of memory.

  I've looked at the unload call, and while it is unclear, it seems like it
  deletes the data on disk as well.  I don't want to delete the data on
  disk; I just want to unload the searcher and writer and free up the
  memory.

  So I'm wondering: is there a call I can make, when I know or suspect that
  the core isn't going to be used in the near future, to release these
  objects and return the memory?  Or a configuration option I can set to do
  so after, say, being idle for 5 seconds?  It's OK for there to be a
  performance hit the first time I reopen the core.
 
  Thanks,
 
  Brian



Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Some further information:

The main things using memory that I see in my heap dump are:

1. Arrays of org.apache.lucene.util.fst.FST$Arc objects, which mainly seem
to hold nulls.  The ones I've investigated are held by
org.apache.lucene.util.fst.FST objects.  With 38 cores open, I have over
121,000 of these arrays, taking up over 126M of space.

2. Byte arrays, of which I have 384,000, taking up 106M of space.
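
(Back-of-the-envelope: 126M plus 106M spread across 38 cores works out to
roughly 6M of heap per core just for these two object types, assuming the
usage is spread evenly.)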

When I trace the chain of references up, I always end up at an
IndexSearcher or IndexWriter, which leads me to assume the problem is
simply that I have too many cores open, but I could be mistaken.

This was on a freshly started system without many cores having been
touched yet, so the memory usage, while larger than I expect, isn't
critical yet.  It does become critical as the number of cores increases.

Thanks,

Brian



On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:


 Is there an easy way for a client to tell Solr to close or release the
 IndexSearcher and/or IndexWriter for a core?

 I have a use case where we're creating a lot of cores with not that many
 documents per zone (a few hundred to maybe tens of thousands).  Writes
 come in batches, and reads also tend to be bursty, if less so than the
 writes.

 And we're having problems with RAM usage on the server.  Poking around a
 heap dump, the problem is that every IndexSearcher or IndexWriter being
 opened takes up a large amount of memory.

 I've looked at the unload call, and while it is unclear, it seems like it
 deletes the data on disk as well.  I don't want to delete the data on
 disk; I just want to unload the searcher and writer and free up the
 memory.

 So I'm wondering: is there a call I can make, when I know or suspect that
 the core isn't going to be used in the near future, to release these
 objects and return the memory?  Or a configuration option I can set to do
 so after, say, being idle for 5 seconds?  It's OK for there to be a
 performance hit the first time I reopen the core.

 Thanks,

 Brian




Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Is there an easy way for a client to tell Solr to close or release the
IndexSearcher and/or IndexWriter for a core?

I have a use case where we're creating a lot of cores with not that many
documents per zone (a few hundred to maybe tens of thousands).  Writes come
in batches, and reads also tend to be bursty, if less so than the writes.

And we're having problems with RAM usage on the server.  Poking around a
heap dump, the problem is that every IndexSearcher or IndexWriter being
opened takes up a large amount of memory.

I've looked at the unload call, and while it is unclear, it seems like it
deletes the data on disk as well.  I don't want to delete the data on disk;
I just want to unload the searcher and writer and free up the memory.

So I'm wondering: is there a call I can make, when I know or suspect that
the core isn't going to be used in the near future, to release these
objects and return the memory?  Or a configuration option I can set to do
so after, say, being idle for 5 seconds?  It's OK for there to be a
performance hit the first time I reopen the core.

Thanks,

Brian


Does core create copy the config directory on non-cloud Solr?

2013-08-02 Thread Brian Hurt
I seem to recall somewhere in the documentation that the create function on
non-cloud Solr doesn't copy the config files in; you have to copy them in
by hand.  Is this correct?  If so, can anyone point me to where in the docs
it says this, and whether there are any plans to change this?  Thanks.
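
For reference, the workaround I'm assuming is to copy a template conf
directory into the new core's instance dir by hand and then issue the
create.  A rough sketch (the paths, names, and the commons-io dependency
are all just for illustration):

    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    File template = new File("/var/solr/template/conf");
    File instanceDir = new File("/var/solr/cores/newcore");

    // copy the config files in by hand first...
    FileUtils.copyDirectory(template, new File(instanceDir, "conf"));

    // ...then ask Solr to create the core pointing at that instance dir
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    CoreAdminRequest.createCore("newcore", instanceDir.getAbsolutePath(), server);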

Brian


Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
I have a situation which is common in our current use case, where I need to
get a large number (many hundreds) of documents by id.  What I'm doing
currently is creating a large query of the form id:12345 OR id:23456 OR
... and sending it off.  Unfortunately, this query is taking a long time,
especially the first time it's executed.  I'm seeing times of 4+ seconds
for this query to return 847 documents.

So, my question is: what should I be looking at to improve the performance
here?

Brian


Re: Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
Thanks everyone for the response.

On Thu, Jul 18, 2013 at 11:22 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 You could start from doing id:(12345 23456) to reduce the query length and
 possibly speed up parsing.


I didn't know about this syntax; it looks useful.


 You could also move the query from 'q' parameter to 'fq' parameter, since
 you probably don't care about ranking ('fq' does not rank).


Yes, I don't care about rank, so this helps.


 If these are unique every time, you could probably look at not caching
 (can't remember exact syntax).


 That's all I can think of at the moment without digging deep into why you
 need to do this at all.


Short version of a long story: I'm implementing a graph database on top of
Solr, which is not what Solr is designed for, I know.  This is a case where
I'm following a set of edges from a given node to its 847 children, and I
need to get the children.  And yes, I've looked at neo4j; it doesn't help.
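
Combining the suggestions above, I think the SolrJ call would look roughly
like the sketch below; the client setup is made up, and {!cache=false} is
my guess at the "not caching" syntax:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");

    SolrQuery q = new SolrQuery("*:*");
    // ids go in a filter query: no ranking, and {!cache=false} keeps the
    // one-off id list out of the filter cache
    q.addFilterQuery("{!cache=false}id:(12345 23456 34567)");
    q.setRows(1000);  // at least as many rows as ids requested

    QueryResponse rsp = server.query(q);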



 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:

  I have a situation which is common in our current use case, where I need
  to get a large number (many hundreds) of documents by id.  What I'm
  doing currently is creating a large query of the form id:12345 OR
  id:23456 OR ... and sending it off.  Unfortunately, this query is taking
  a long time, especially the first time it's executed.  I'm seeing times
  of 4+ seconds for this query to return 847 documents.

  So, my question is: what should I be looking at to improve the
  performance here?
 
  Brian
 



Noob question: why doesn't this query work?

2013-04-24 Thread Brian Hurt
So, I'm executing the following query:
id:6178dB=@Fm AND i_0:613OFS AND (i_3:6111 OR i_3:1yyy\~) AND (NOT
id:6178ZwWj5m OR numfields:[* TO 6114] OR d_4:false OR NOT
i_4:6142E=m)

It's machine generated, which explains the redundancies.  The problem is
that the query returns no results, but there is a document that should
match: it has an id of 6178dB=@Fm, an i_0 field of 613OFS, an i_3 field
of 6111, a numfields of 611A, a d_4 of true (but this shouldn't
matter), and an i_4 of 6142F1S.

The problem seems to be with the negations.  I did try to replace the NOT's
with -'s, so, for example, NOT id:6178ZwWj5m would become
-id:6178ZwWj5m, and this didn't seem to work.

Help?  What's wrong with the query?  Thanks.

Brian


Re: Noob question: why doesn't this query work?

2013-04-24 Thread Brian Hurt
Thanks for your response.  You've given me some solid leads.


On Wed, Apr 24, 2013 at 11:25 AM, Shawn Heisey s...@elyograg.org wrote:

 On 4/24/2013 8:59 AM, Brian Hurt wrote:
  So, I'm executing the following query:
  id:6178dB=@Fm AND i_0:613OFS AND (i_3:6111 OR i_3:1yyy\~) AND
 (NOT
  id:6178ZwWj5m OR numfields:[* TO 6114] OR d_4:false OR NOT
  i_4:6142E=m)
 
  It's machine generated, which explains the redundancies.  The problem is
  that the query returns no results, but there is a document that should
  match: it has an id of 6178dB=@Fm, an i_0 field of 613OFS, an i_3 field
  of 6111, a numfields of 611A, a d_4 of true (but this shouldn't
  matter), and an i_4 of 6142F1S.
 
  The problem seems to be with the negations.  I did try to replace the
 NOT's
  with -'s, so, for example, NOT id:6178ZwWj5m would become
  -id:6178ZwWj5m, and this didn't seem to work.
 
  Help?  What's wrong with the query?  Thanks.

 It looks like you might have meant to negate all of the query clauses
 inside the last set of parentheses.  That's not what your actual query
 says. If you change your negation so that the NOT is outside the
 parentheses, so that it reads AND NOT (... OR ...), that should fix
 that part of it.


No, I meant the NOT to only bind to the next id.  So the query I wanted was:

id:6178dB=@Fm AND i_0:613OFS AND (i_3:6111 OR i_3:1yyy\~) AND ((NOT
id:6178ZwWj5m) OR numfields:[* TO 6114] OR d_4:false OR (NOT
i_4:6142E=m))



 If the boolean layout you have is really what you want, then you need to
 change the negation queries to (*:* -query) instead, because pure
 negative queries are not supported.  That syntax says all documents
 except those that match the query.  For simple negation queries, Solr
 can figure out that it needs to add the *:* internally, but this query
 is more complex.


This could be the problem.  This query is machine generated, so I don't
care how ugly it is.  Does this apply even to inner clauses?  I.e., should
that last clause be (*:* -i_4:6142E=m) instead of (NOT i_4:6142E=m)?
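
For concreteness, the full rewrite I'm guessing at would be:

id:6178dB=@Fm AND i_0:613OFS AND (i_3:6111 OR i_3:1yyy\~) AND
((*:* -id:6178ZwWj5m) OR numfields:[* TO 6114] OR d_4:false OR
(*:* -i_4:6142E=m))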


 A few other possible problems:

 A backslash is a special character used to escape other special
 characters, so you *might* need two of them - one to say 'the next
 character is literal' and one to actually be the backslash.  If you
 follow the advice in the next paragraph, I can guarantee this will be
 the case.  For that reason, you might want to keep the quotes on fields
 that might contain characters that have special meaning to the Solr
 query parser.


I always wash all strings through ClientUtils.escapeQueryChars, so this
isn't a problem.  That string should just be 1yyy~; the ~ was getting
escaped.


 Don't use quotes unless you really are after phrase queries or you can't
 escape special characters.  You might actually need phrase queries for
 some of this, but I would try simple one-field queries without the
 quotes to see whether you need them.  I have no idea what happens if you
 include quotes inside a range query (the 6114), but it might not do
 what you expect.  I would definitely remove the quotes from that part of
 the query.


This is another solid possibility, although it might raise some
difficulties for me: I need to be able to support literal string
comparisons, so I'm not sure how well this would handle queries like
s_7 = some string with spaces.  But some experimentation here is
definitely in order.
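
To make that concrete, I believe these are the two forms I'd be choosing
between (field name and value taken from my example above); I haven't
tested which one behaves as a true literal match on a string field:

    s_7:"some string with spaces"
    s_7:some\ string\ with\ spaces

The second is what escapeQueryChars should produce, since (if I'm reading
it right) it escapes whitespace as well as the reserved characters.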


 Thanks,
 Shawn




Re: Help getting a document by unique ID

2013-03-19 Thread Brian Hurt
On Mon, Mar 18, 2013 at 7:08 PM, Jack Krupansky j...@basetechnology.com wrote:
 Hmmm... if query by your unique key field is killing your performance, maybe
 you have some larger problem to address.

This is almost certainly true.  I'm well outside the use cases
targeted by Solr/Lucene, and it's a testament to the quality of the
product that it works at all.  Among other things, I'm implementing a
graph database on top of Solr (it being easier to build a graph
database on top of Solr than it is to implement Solr on top of a graph
database).

Which is the problem: you might think that 60ms unique key accesses
(what I'm seeing) are more than good enough, and for most use cases
you'd be right.  But it's not unusual for a single web-page hit to
generate many dozens, if not low hundreds, of calls to get a document
by id.  At that point, 60ms hits pile up fast.
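
(Quick arithmetic: a page that makes 100 of those calls serially is looking
at 6 seconds in id lookups alone.)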

The current plan is to just cache the documents as files in the local
file system (or possibly other systems), and have the get document
calls go there instead, while complicated searches still go to Solr.
Fortunately, this isn't complicated.

 How bad is it? Are you using the
 string field type? How long are your ids?

My ids start at 100 million and go up like a kite from there, hence the
string representation.


 The only thing the real-time GET API gives you is more immediate access to
 recently added, uncommitted data. Accessing older, committed data will be no
 faster. But if accessing that recent data is what you are after, real-time
 GET may do the trick.

OK, so this is good to know.  This answers question #1: GET isn't the
function I should be calling.  Thanks.

Brian


Help getting a document by unique ID

2013-03-18 Thread Brian Hurt
So here's the problem I'm trying to solve: in my use case, all my
documents have a unique id associated with them (a string), and I very
often need to get them by id.  Currently I'm doing a search on id, and
this takes long enough that it's killing my performance.  Now, it looks
like there is a GET call in the REST interface which does exactly what
I need, but I'm using the solrj interface.

So my two questions are:

1. Is GET the right function I should be using?  Or should I be using
some other function, or storing copies of the documents somewhere
else entirely for fast id-based retrieval?

2. How do I call GET with solrj?  I've googled for how to do this, and
haven't come up with anything.
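
The closest guess I have so far is to point a SolrJ query at the /get
handler, roughly as in the sketch below, but I haven't been able to confirm
that this is the intended way to do it (the core URL and id are made up):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");

    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/get");       // real-time get handler
    q.set("id", "100000123");          // or "ids" with a comma-separated list

    QueryResponse rsp = server.query(q);
    SolrDocument doc = (SolrDocument) rsp.getResponse().get("doc");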

Thanks.

Brian