Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael Sokolov
If you have very large documents (many MB) that can lead to slow 
highlighting, even with FVH.


See https://issues.apache.org/jira/browse/LUCENE-3234

and try setting phraseLimit=1 (or some bigger number, but not infinite, 
which is the default)
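In Solr that is the hl.phraseLimit parameter for the FastVectorHighlighter; a
request sketch (the query and field name are placeholders):

http://localhost:8983/solr/select?q=foo
    &hl=true
    &hl.fl=content
    &hl.useFastVectorHighlighter=true
    &hl.phraseLimit=5000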


-Mike


On 6/14/13 4:52 PM, Andy Brown wrote:

Bryan,

For specifics, I'll refer you back to my original email where I
specified all the fields/field types/handlers I use. Here's a general
overview.
  
I really only have 3 fields that I index and search against: name,
description, and content. All of which are just general text
(string) fields. I have a catch-all field called text that is only
used for querying. It's indexed but not stored. The name,
description, and content fields are copied into the text field.
  
For partial word matching, I have 4 more fields: name_par,
description_par, content_par, and text_par. The text_par field
has the same relationship to the *_par fields as text does to the
others (only used for querying). Those partial word matching fields are
of type text_general_partial which I created. That field type is
analyzed differently from the regular text field in that it goes through
an EdgeNGramFilterFactory with minGramSize=2 and maxGramSize=7
at index time.
I query against both the text and text_par fields using the edismax defType
with my qf set to "text^2 text_par^1" to give full word matches a higher
score. This part returns very fast, as previously stated. It's when
I turn on highlighting that I take the huge performance hit.
  
Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
name_par description description_par content content_par" so that it
returns highlights for full and partial word matches. All of those
fields have indexed, stored, termPositions, termVectors, and termOffsets
set to true.
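(For illustration only: such a field declaration in schema.xml looks roughly
like the following; the field and type names are assumptions based on the
description above, the rest are standard attributes.)

<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
<field name="content_par" type="text_general_partial" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>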
  
It all seems redundant just to allow for partial word
matching/highlighting but I didn't know of a better way. Does anything
stand out to you that could be the culprit? Let me know if you need any
more clarification.
  
Thanks!
  
- Andy


-Original Message-
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Wednesday, May 29, 2013 5:44 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using
FastVectorHighlighter

Andy,


I don't understand why it's taking 7 secs to return highlights. The size
of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
1024 for this verification purpose and that should be more than enough.
The processor is plenty powerful enough as well.

Running VisualVM shows all my CPU time being taken by mainly these 3
methods:

org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()

org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()

org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

That is a strange and interesting set of things to be spending most of
your CPU time on. The implication, I think, is that the number of term
matches in the document for terms in your query (or, at least, terms
matching exact words or the beginning of phrases in your query) is
extremely high. Perhaps that's coming from this partial word match you
mention -- how does that work?

-- Bryan


My guess is that this has something to do with how I'm handling partial
word matches/highlighting. I have set up another request handler that
only searches the whole word fields and it returns in 850 ms with
highlighting.

Any ideas?

- Andy


-Original Message-
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Monday, May 20, 2013 1:39 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using
FastVectorHighlighter

My guess is that the problem is those 200M documents.
FastVectorHighlighter is fast at deciding whether a match, especially a
phrase, appears in a document, but it still starts out by walking the
entire list of term vectors, and ends by breaking the document into
candidate-snippet fragments, both processes that are proportional to the
length of the document.

It's hard to do much about the first, but for the second you could choose
to expose FastVectorHighlighter's FieldPhraseList representation, and
return offsets to the caller rather than fragments, building up your own
snippets from a separate store of indexed files. This would also permit
you to set stored=false, improving your memory/core size ratio, which
I'm guessing could use some improving. It would require some work, and it
would require you to store a representation of what was indexed outside
the Solr core, in some constant-bytes-to-character representation that you
can use offsets with (e.g. UTF-16, or ASCII+entity references).

However, you may not need to do this -- it may be that you just need more
memory for your search machine. Not JVM memory, but memory that the O/S
can use as a file cache.

Re: yet another optimize question

2013-06-15 Thread Otis Gospodnetic
Hi Robi,

I'm going to guess you are also seeing a smaller heap simply because you
restarted the JVM recently (hm, you don't say you restarted, maybe I'm
making this up). If you are indeed indexing continuously then you
shouldn't optimize. Lucene will merge segments itself. Lower
mergeFactor will force it to do it more often (it means slower
indexing, bigger IO hit when segments are merged, more per-segment
data that Lucene/Solr need to read from the segment for faceting and
such, etc.) so maybe you shouldn't mess with that.  Do you know what
your caches are like in terms of size, hit %, evictions?  We've
recently seen people set those to a few hundred K or even higher,
which can eat a lot of heap.  We have had luck with G1 recently, too.
Maybe you can run jstat and see which of the memory pools get filled
up and change/increase appropriate JVM param based on that?  How many
fields do you index, facet, or group on?
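For example, something like the following prints GC and heap-pool
utilization every 5 seconds for the Solr JVM (the pid is a placeholder):

jstat -gcutil <solr-pid> 5000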

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Solr & ElasticSearch Support -- http://sematext.com/





On Fri, Jun 14, 2013 at 8:04 PM, Petersen, Robert
robert.peter...@mail.rakuten.com wrote:
 Hi guys,

 We're on solr 3.6.1 and I've read the discussions about whether to optimize 
 or not to optimize.  I decided to try not optimizing our index as was 
 recommended.  We have a little over 15 million docs in our biggest index and 
 a 32gb heap for our jvm.  So without the optimizes the index folder seemed to 
 grow in size and quantity of files.  There seemed to be an upper limit but 
 eventually it hit 300 files consuming 26gb of space and that seemed to push 
 our slave farm over the edge and we started getting the dreaded OOMs.  We 
 have continuous indexing activity, so I stopped the indexer and manually ran 
 an optimize which made the index become 9 files consuming 15gb of space and 
 our slave farm started having acceptable memory usage.  Our merge factor is 
 10, we're on java 7.  Before optimizing, I tried on one slave machine to go 
 with the latest JVM and tried switching from the CMS GC to the G1GC but it 
 hit OOM condition even faster.  So it seems like I have to continue to 
 schedule a regular optimize.  Right now it has been a couple of days since 
 running the optimize and the index is slowly growing bigger, now up to a bit 
 over 19gb.  What do you guys think?  Did I miss something that would make us 
 able to run without doing an optimize?

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department


Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Grant Ingersoll

On Jun 13, 2013, at 11:24 AM, Walter Underwood wun...@wunderwood.org wrote:

 That was my thought exactly. Contribute a REST request handler. --wunder
 

+1.  The bits are already in place for a lot of it now that RESTlet is in.  

That being said, it truly amazes me that people were ever able to implement 
Solr, given some of the FUD in this thread.  I guess those tens of thousands of 
deployments out there were all done by above average devs...

-Grant

Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael McCandless
You could also try the new[ish] PostingsHighlighter:
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

Mike McCandless

http://blog.mikemccandless.com


On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 If you have very large documents (many MB) that can lead to slow
 highlighting, even with FVH.

 See https://issues.apache.org/jira/browse/LUCENE-3234

 and try setting phraseLimit=1 (or some bigger number, but not infinite,
 which is the default)

 -Mike



 On 6/14/13 4:52 PM, Andy Brown wrote:

 Bryan,

 For specifics, I'll refer you back to my original email where I
 specified all the fields/field types/handlers I use. Here's a general
 overview.
   I really only have 3 fields that I index and search against: name,
 description, and content. All of which are just general text
 (string) fields. I have a catch-all field called text that is only
 used for querying. It's indexed but not stored. The name,
 description, and content fields are copied into the text field.
   For partial word matching, I have 4 more fields: name_par,
 description_par, content_par, and text_par. The text_par field
 has the same relationship to the *_par fields as text does to the
 others (only used for querying). Those partial word matching fields are
 of type text_general_partial which I created. That field type is
 analyzed different than the regular text field in that it goes through
 an EdgeNGramFilterFactory with the minGramSize=2 and maxGramSize=7
 at index time.
   I query against both text and text_par fields using edismax deftype
 with my qf set to text^2 text_par^1 to give full word matches a higher
 score. This part returns back very fast as previously stated. It's when
 I turn on highlighting that I take the huge performance hit.
   Again, I'm using the FastVectorHighlighting. The hl.fl is set to name
 name_par description description_par content content_par so that it
 returns highlights for full and partial word matches. All of those
 fields have indexed, stored, termPositions, termVectors, and termOffsets
 set to true.
   It all seems redundant just to allow for partial word
 matching/highlighting but I didn't know of a better way. Does anything
 stand out to you that could be the culprit? Let me know if you need any
 more clarification.
   Thanks!
   - Andy

 -Original Message-
 From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
 Sent: Wednesday, May 29, 2013 5:44 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Slow Highlighter Performance Even Using
 FastVectorHighlighter

 Andy,

 I don't understand why it's taking 7 secs to return highlights. The size
 of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
 1024 for this verification purpose and that should be more than enough.
 The processor is plenty powerful enough as well.

 Running VisualVM shows all my CPU time being taken by mainly these 3
 methods:


 org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()

 org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()

 org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

 That is a strange and interesting set of things to be spending most of
 your CPU time on. The implication, I think, is that the number of term
 matches in the document for terms in your query (or, at least, terms
 matching exact words or the beginning of phrases in your query) is
 extremely high. Perhaps that's coming from this partial word match you
 mention -- how does that work?

 -- Bryan

 My guess is that this has something to do with how I'm handling partial
 word matches/highlighting. I have setup another request handler that
 only searches the whole word fields and it returns in 850 ms with
 highlighting.

 Any ideas?

 - Andy


 -Original Message-
 From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
 Sent: Monday, May 20, 2013 1:39 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Slow Highlighter Performance Even Using
 FastVectorHighlighter

 My guess is that the problem is those 200M documents.
 FastVectorHighlighter is fast at deciding whether a match, especially a
 phrase, appears in a document, but it still starts out by walking the
 entire list of term vectors, and ends by breaking the document into
 candidate-snippet fragments, both processes that are proportional to the
 length of the document.

 It's hard to do much about the first, but for the second you could choose
 to expose FastVectorHighlighter's FieldPhraseList representation, and
 return offsets to the caller rather than fragments, building up your own
 snippets from a separate store of indexed files. This would also permit
 you to set stored=false, improving your memory/core size ratio, which
 I'm guessing could use some improving. It would require some work, and it
 would require you to store a 

Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Alexandre Rafalovitch
On Sat, Jun 15, 2013 at 10:35 AM, Grant Ingersoll gsing...@apache.org wrote:
 That being said, it truly amazes me that people were ever able to implement 
 Solr, given some of the FUD in this thread.  I guess those tens of thousands 
 of deployments out there were all done by above average devs...

I would not classify the thread as FUD. More like confusion about the
best practices. And, from my tech support days, this usually means
lack of documentation, outdated explanations or missing representative
examples.

So, if there is a magic solution just under our noses that we are
missing, it would be good to know. Might be as easy to solve for the
next person as adding a Wiki link in the correct place.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Grant Ingersoll

On Jun 15, 2013, at 12:54 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 On Sat, Jun 15, 2013 at 10:35 AM, Grant Ingersoll gsing...@apache.org wrote:
 That being said, it truly amazes me that people were ever able to implement 
 Solr, given some of the FUD in this thread.  I guess those tens of thousands 
 of deployments out there were all done by above average devs...
 
 I would not classify the thread as FUD.

I was just referring to the part about how Solr isn't something average devs 
can do, which I think is FUD.

At any rate, I think the ExtractingReqHandler could be updated to allow for 
metadata, etc. to be passed in with the raw document itself and a patch would 
be welcome.  It's something the literals stand in for now as a lightweight 
proxy, but clearly there is an opportunity for more to be passed in.
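For reference, the literal-based approach today looks roughly like this (a
sketch only; the core URL, field names, and file are placeholders):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.author=Jane+Doe&commit=true" \
     -F "myfile=@report.pdf"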

Managing SolrCloud

2013-06-15 Thread Furkan KAMACI
I want to design a controlling mechanism for my SolrCloud. I have two
choices.

The first is controlling every Solr node from a single point: when I want to
start and stop Jetty remotely, I will connect to my nodes via an SSH library
in Java. I will send the backup command and drive the recovery process the
same way.

The second is running another jar on all my Solr nodes: I will send the
backup command to those custom jars, they will call backup via
HttpSolrServer, and the recovery process will be managed through those jars
as well. Starting and stopping Jetty will be done from those jars too.

There are some pros and cons using each of them.

First one's pros:
* When I start my custom management server there is no need to push small
jars onto every Solr node machine.

First one's cons:
* Everything has to be watched from a single point.
* I have to expose the /admin URLs of every Solr node to the outside, because
my custom management server must reach them. If anybody else can reach those
URLs, they can run a delete-all-documents command. (If I put my management
server inside the same environment as the Solr nodes, I think I can get
around that issue.)
* Connecting to every Solr node via an SSH library may be resource-consuming.

Second one's pros:
* I can distribute the workload. If I have hundreds of Solr nodes in my
SolrCloud, I can send backup/recovery commands to my custom jars and each of
them can handle its own processes.
* I can block all Solr admin pages from the outside environment of the Solr
nodes. My custom jars can run only the commands I have defined inside them,
and each one accesses the Solr node it is responsible for via HttpSolrServer.
* These custom jars may also serve further purposes (advice is welcome).

Second one's cons:
* I have to ship a small jar to each Solr node.

What do folks think about these scenarios, and what would you suggest?
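For the second option, a minimal SolrJ sketch of the backup call those jars
would issue (Solr 4.x SolrJ assumed; host, core name, and backup location are
placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class BackupTrigger {
    public static void main(String[] args) throws Exception {
        // Talk to one node's core directly; in option two this runs next to the node.
        SolrServer solr = new HttpSolrServer("http://solr-node-1:8983/solr/collection1");

        // The replication handler accepts command=backup, optionally with a location.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "backup");
        params.set("location", "/backups/solr");

        QueryRequest request = new QueryRequest(params);
        request.setPath("/replication");   // send to /replication instead of /select

        System.out.println(solr.request(request));
        solr.shutdown();
    }
}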


Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Jack Krupansky
[My apologies to Roland for hijacking his original thread for this rant! 
Look what you started!!]


And I will stand by my statement: Solr is too much of a beast for average 
app developers to master.


And the key word there, in case a too-casual reader missed it, is "master" - 
not "use" in the sense of hacking something together or solving a niche 
application for a typical Solr deployment, but "master" in the sense of having 
a high level of confidence about the vast bulk (even if not absolutely 100%) 
of the subject matter, Solr itself.


I mean, generally, on average what percentage of Solr's many features  has 
the average Solr app-deployer actually mastered?


And, what I am really referring to is not what expertise the pioneers and 
expert Solr solution consultants have had, but the level of expertise 
required for those who are to come in the years ahead who simply want to 
focus on their application without needing to become a Solr expert first.


The context of my statement was the application devs referenced earlier in 
this thread who were struggling because the Solr API was not 100% pure 
RESTful. As the respondent indicated, they were much happier to have a 
cleaner, more RESTful API that they as app developers can deal with, so that 
they wouldn't have to master all of the bizarre inconsistencies of Solr 
itself (e.g., just the knowledge that SolrCell doesn't support 
partial/atomic update.)


And, the real focus of my statement, again in this particular context is 
the actual application devs, the guys focused on the actual application 
subject matter itself, not the Solr Experts or Solr solution architects 
who do have a lot higher mastery of Solr than the average application 
devs.


And if my statement were in fact false, questions such as began this thread 
would never have come up. The level of traffic for Solr User would be 
essentially zero if it were really true that average application developers 
can easily master Solr.


And there would be zero need for so many of these Solr training classes if Solr 
were so easy to master. In fact, the very existence of so many Solr 
training classes effectively proves my point. And that's just for basic 
Solr, not any of the many esoteric points such as at the heart of this 
particular thread (i.e., SolrCell not supporting partial/atomic update.)


And, in conclusion, my real interest is in helping the many average 
application developers who post inquiries on this Solr user list for the 
simple reason that they ARE in fact struggling with Solr.


Personally, I would suggest that a typical (average) successful deployer of 
Solr would be more readily characterized as having survived the Solr 
deployment process rather than having achieved a truly deep mastery of 
Solr. They may have achieved confidence about exactly what they have 
deployed, but do they also have great confidence that they know exactly what 
will happen if they make slight and subtle changes or what exactly the fix 
will be for certain runtime errors? For the average application developer 
I'm talking about, not the elite expert Solr consultants.


One final way of putting it. If a manager or project leader wanted to staff 
a dev position to be the in-house Solr expert, can they just hire any old 
average Java programmer with no Solr experience and expect that he will 
rapidly master Solr?


I mean, why would so many recruiters be looking for a Solr expert or 
engaging the services of Solr consultancies if mastery of Solr by average 
application developers were a reality?!


[I want to hear Otis' take on this!]

-- Jack Krupansky

-Original Message- 
From: Grant Ingersoll

Sent: Saturday, June 15, 2013 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Adding pdf/word file using JSON/XML


On Jun 15, 2013, at 12:54 PM, Alexandre Rafalovitch arafa...@gmail.com 
wrote:


On Sat, Jun 15, 2013 at 10:35 AM, Grant Ingersoll gsing...@apache.org 
wrote:
That being said, it truly amazes me that people were ever able to 
implement Solr, given some of the FUD in this thread.  I guess those tens 
of thousands of deployments out there were all done by above average 
devs...


I would not classify the thread as FUD.


I was just referring to the part about how Solr isn't something average devs 
can do, which I think is FUD.


At any rate, I think the ExtractingReqHandler could be updated to allow for 
metadata, etc. to be passed in with the raw document itself and a patch 
would be welcome.  It's something the literals stand in for now as a 
lightweight proxy, but clearly there is an opportunity for more to be passed 
in.



Re: Suggest and Filtering

2013-06-15 Thread Brendan Grainger
Hi Otis and Jorge,

I probably wasn't phrasing my question too well, but I think I was looking
for FuzzySuggest. Messing around with the configs found here seems to be
doing what I want:

http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/solrconfig-phrasesuggest.xml
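A rough sketch of a fuzzy suggester configuration along those lines (component,
handler, and field names here are placeholders, not the exact contents of that
file):

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory</str>
    <str name="field">suggest_phrases</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>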

Thanks
Brendan


On Fri, Jun 14, 2013 at 11:50 AM, Brendan Grainger 
brendan.grain...@gmail.com wrote:

 Hi Otis,

 Sorry, I was a bit tired when I wrote that. I think what I'd like is to be
 able to spellcheck the suggestions. For example, if a user types in brayk (as
 opposed to brake), I'd still get the following suggestions, say:

 brake line
 brake condition

 Does that make sense?

 Thanks
 Brendan



 On Thu, Jun 13, 2013 at 8:53 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

 Hi,

 I think you are talking about wanting instant search?

 See https://github.com/fergiemcdowall/solrstrap

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Thu, Jun 13, 2013 at 7:43 PM, Brendan Grainger
 brendan.grain...@gmail.com wrote:
  Hi Solr Guru's
 
  I am trying to implement auto suggest where solr would suggest several
  phrases that would return results as the user types in a query (as
 distinct
  from autocomplete). e.g. say the user starts typing 'br' and we have
  documents that contain brake pads and left disc brake, solr would
  suggest both of those phrases with brake pads first. I also want to
 only
  look at documents that match a given filter query. So say I have a
 bunch of
  documents for a toyota cressida that contain the bi-gram brake pads,
  while the documents for a honda accord don't have any brake pad
 articles.
  If the user is filtering on the honda accord I wouldn't want brake
 pads
  as a suggestion.
 
  Right now, I've played with the suggest component and using faceting.
 
  Any thoughts?
 
  Thanks
  Brendan
 
  --
  Brendan Grainger
  www.kuripai.com




 --
 Brendan Grainger
 www.kuripai.com




-- 
Brendan Grainger
www.kuripai.com


Solr large boolean filter

2013-06-15 Thread Igor Kustov
I know I'm not the first one with this problem.

I'm currently using Solr 4.2.1 with approximately 10 million documents in the
index.

The index is updated frequently.

The filter query is just one big boolean OR query by id.

fq=id:(1 2 3 4 ... 50950)

The ids list is always different and not sequential.

The problem is that query performance is not so good, as you can imagine.

In some particular cases I'm able to do filtering based on different fields,
but in some cases (like 30-40% of all queries) I still end up with this
large id filter.

I'm looking for ways to improve this query's performance.

It doesn't seem like a Solr join could be applied here.

Another option that I found is to somehow use Lucene's FieldCacheTermsFilter.
Is it worth a try?
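For reference, wiring that in would be roughly a small custom query parser; an
untested sketch follows (class, parser, and field names are made up, and the
signatures should be checked against the 4.2.1 jars):

import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.FieldCacheTermsFilter;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

// Parses a whitespace-separated id list into a FieldCacheTermsFilter-backed query.
// Register in solrconfig.xml, e.g. <queryParser name="idfilter" class="IdFilterQParserPlugin"/>,
// then query with fq={!idfilter}1 2 3 4
public class IdFilterQParserPlugin extends QParserPlugin {

  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() {
        // Split the raw id list and build a constant-score filter query on the id field.
        String[] ids = getString().trim().split("\\s+");
        return new ConstantScoreQuery(new FieldCacheTermsFilter("id", ids));
      }
    };
  }
}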

Maybe I've missed some other options?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-tp4070747.html
Sent from the Solr - User mailing list archive at Nabble.com.