New operator.

2013-06-15 Thread Yanis Kakamaikis
Hi all, I want to add a new operator to my Solr. I need that operator
to call my proprietary engine and build an answer vector for Solr, so
that the vector becomes part of the boolean query at the next step. How
do I do that?
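I'm guessing the extension point is a custom QParserPlugin; a minimal
sketch of what I mean (the class and the engine call are just stubs, not
working code):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;

  public class EngineQParserPlugin extends QParserPlugin {
    @Override
    public void init(NamedList args) {}

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() {
          // Ask the proprietary engine for its answer vector, then express
          // it as an OR over matching ids so Solr can combine it with the
          // rest of the boolean query.
          BooleanQuery bq = new BooleanQuery();
          for (String id : callEngine(qstr)) {
            bq.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
          }
          return bq;
        }
      };
    }

    // Stub standing in for the proprietary engine call.
    private String[] callEngine(String query) {
      return new String[0];
    }
  }

Once registered in solrconfig.xml with <queryParser name="engine"
class="com.example.EngineQParserPlugin"/>, it could be mixed into a larger
query, e.g. q=title:foo AND _query_:"{!engine}bar". Is that the right
direction?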
Thanks


Solr large boolean filter

2013-06-15 Thread Igor Kustov
I know I'm not the first one with this problem.

I'm currently using Solr 4.2.1 with approximately 10 million documents in
the index.

The index is updated frequently.

The filter query is just one big boolean OR query by id.

fq=id:(1 2 3 4 ... 50950)

The list of ids is always different and not sequential.

The problem is that query performance is not so good, as you can imagine.

In some cases I'm able to filter based on different fields, but in others
(30-40% of all queries) I still end up with this large id filter.

I'm looking for ways to improve this query's performance.

It doesn't seem like solr join could be applied there.

Another option that I found is to somehow use Lucene's FieldCacheTermsFilter.
Is it worth a try?
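At the Lucene level I imagine it would look something like this (just a
sketch; exposing it as an fq would still need a small custom query parser):

  import org.apache.lucene.search.ConstantScoreQuery;
  import org.apache.lucene.search.FieldCacheTermsFilter;
  import org.apache.lucene.search.Query;

  // Match documents whose "id" is in the set via the FieldCache,
  // instead of parsing and scoring a 50,000-clause BooleanQuery.
  String[] ids = { "1", "2", "3", "50950" }; // in practice, the incoming id list
  Query idFilter = new ConstantScoreQuery(new FieldCacheTermsFilter("id", ids));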

Maybe I've missed some other options?







Re: Suggest and Filtering

2013-06-15 Thread Brendan Grainger
Hi Otis and Jorge,

I probably wasn't phrasing my question too well, but I think I was looking
for FuzzySuggest. Messing around with the configs found here seems to be
doing what I want:

http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/solrconfig-phrasesuggest.xml
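The relevant piece seems to boil down to something like this (paraphrasing
that test config from memory, so treat it as a sketch; "text" stands in for
my field):

  <searchComponent name="suggest" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory</str>
      <str name="field">text</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>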

Thanks
Brendan


On Fri, Jun 14, 2013 at 11:50 AM, Brendan Grainger <
brendan.grain...@gmail.com> wrote:

> Hi Otis,
>
> Sorry, I was a bit tired when I wrote that. I think what I'd like is to be
> able to spellcheck the suggestions. For example, if a user types in brayk (as
> opposed to brake) I'd still get the following suggestions, say:
>
> brake line
> brake condition
>
> Does that make sense?
>
> Thanks
> Brendan
>
>
>
> On Thu, Jun 13, 2013 at 8:53 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi,
>>
>> I think you are talking about wanting instant search?
>>
>> See https://github.com/fergiemcdowall/solrstrap
>>
>> Otis
>> --
>> Solr & ElasticSearch Support
>> http://sematext.com/
>>
>>
>>
>>
>>
>> On Thu, Jun 13, 2013 at 7:43 PM, Brendan Grainger
>>  wrote:
>> > Hi Solr Guru's
>> >
>> > I am trying to implement auto suggest where solr would suggest several
>> > phrases that would return results as the user types in a query (as distinct
>> > from autocomplete). e.g. say the user starts typing 'br' and we have
>> > documents that contain "brake pads" and "left disc brake", solr would
>> > suggest both of those phrases with "brake pads" first. I also want to only
>> > look at documents that match a given filter query. So say I have a bunch of
>> > documents for a toyota cressida that contain the bi-gram "brake pads",
>> > while the documents for a honda accord don't have any brake pad articles.
>> > If the user is filtering on the honda accord I wouldn't want "brake pads"
>> > as a suggestion.
>> >
>> > Right now, I've played with the suggest component and using faceting.
>> >
>> > Any thoughts?
>> >
>> > Thanks
>> > Brendan
>> >
>> > --
>> > Brendan Grainger
>> > www.kuripai.com
>>
>
>
>
> --
> Brendan Grainger
> www.kuripai.com
>



-- 
Brendan Grainger
www.kuripai.com


Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Jack Krupansky
[My apologies to Roland for "hijacking" his original thread for this rant! 
Look what you started!!]


And I will stand by my statement: "Solr is too much of a beast for average 
app developers to master."


And the key word there, in case a too-casual reader missed it, is "master" - 
not "use" in the sense of hacking something together or solving a niche 
application for a typical Solr deployment, but master in the sense of having 
a high level of confidence about the vast bulk (even if not absolutely 100%) 
of the subject matter, Solr itself.


I mean, generally, what percentage of Solr's many features has the average 
Solr app-deployer actually "mastered"?


And, what I am really referring to is not what expertise the pioneers and 
"expert" Solr solution consultants have had, but the level of expertise 
required for those who are to come in the years ahead who simply want to 
focus on their application without needing to become a "Solr expert" first.


The context of my statement was the application "devs" referenced earlier in 
this thread who were struggling because the Solr API was not 100% purely 
RESTful. As the respondent indicated, they were much happier to have a 
cleaner, more RESTful API that they as app developers could deal with, so that 
they wouldn't have to "master" all of the bizarre inconsistencies of Solr 
itself (e.g., just the knowledge that SolrCell doesn't support 
partial/atomic update).


And, the real focus of my statement, again in this particular context, is 
the actual application devs, the guys focused on the actual application 
subject matter itself, not the "Solr Experts" or "Solr solution architects" 
who do have a much higher mastery of Solr than the "average" application 
devs.


And if my statement were in fact false, questions such as the one that began 
this thread would never come up. The level of traffic on the Solr User list 
would be essentially zero if it were really true that average application 
developers can easily "master" Solr.


And there would be zero need for so many of these Solr training classes if 
Solr were so easy to "master". In fact, the very existence of so many Solr 
training classes effectively proves my point. And that's just for "basic" 
Solr, not any of the many esoteric points such as the one at the heart of 
this particular thread (i.e., SolrCell not supporting partial/atomic update).


And, in conclusion, my real interest is in helping the many "average" 
application developers who post inquiries on this Solr user list for the 
simple reason that they ARE in fact "struggling" with Solr.


Personally, I would suggest that a typical (average) successful deployer of 
Solr would be more readily characterized as having "survived" the Solr 
deployment process than as having achieved a truly deep "mastery" of 
Solr. They may have achieved confidence about exactly what they have 
deployed, but do they also have great confidence that they know exactly what 
will happen if they make slight and subtle changes, or what exactly the fix 
will be for certain runtime errors? I mean the "average application 
developer" here, not the elite expert Solr consultants.


One final way of putting it: if a manager or project leader wanted to staff 
a dev position for an "in-house Solr expert", could they just hire any old 
average Java programmer with no Solr experience and expect that he would 
rapidly "master" Solr?


I mean, why would so many recruiters be looking for a "Solr expert" or 
engaging the services of Solr consultancies if mastery of Solr by "average 
application developers" were a reality?!


[I want to hear Otis' take on this!]

-- Jack Krupansky

-Original Message- 
From: Grant Ingersoll

Sent: Saturday, June 15, 2013 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Adding pdf/word file using JSON/XML


On Jun 15, 2013, at 12:54 PM, Alexandre Rafalovitch  
wrote:


On Sat, Jun 15, 2013 at 10:35 AM, Grant Ingersoll  
wrote:
That being said, it truly amazes me that people were ever able to 
implement Solr, given some of the FUD in this thread.  I guess those tens 
of thousands of deployments out there were all done by above average 
devs...


I would not classify the thread as FUD.


I was just referring to the part about how Solr isn't something average devs 
can do, which I think is FUD.


At any rate, I think the ExtractingReqHandler could be updated to allow for 
metadata, etc. to be passed in with the raw document itself and a patch 
would be welcome.  It's something the literals stand in for now as a 
lightweight proxy, but clearly there is an opportunity for more to be passed 
in.



Managing SolrCloud

2013-06-15 Thread Furkan KAMACI
I want to design a control mechanism for my SolrCloud. I have two
choices.

The first is controlling every Solr node from a single point: when I want
to start and stop Jetty remotely, I will connect to my nodes via an SSH
library in Java. I will send the backup command and run the recovery
process the same way.

The second is running another jar on all my Solr nodes: I will send the
backup command to those custom jars, and they will trigger the backup via
HttpSolrServer; the recovery process will be managed by those jars as well.
Starting and stopping Jetty will be done from those jars too.
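As a rough sketch of that second choice (URL, core name, and backup
location below are placeholders), the agent jar could trigger a backup
through the node's own ReplicationHandler:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class NodeBackupAgent {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
      try {
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "backup");             // snapshot the index
        params.set("location", "/var/backups/solr"); // placeholder path
        QueryRequest backup = new QueryRequest(params);
        backup.setPath("/replication");              // route to the ReplicationHandler
        solr.request(backup);
      } finally {
        solr.shutdown();
      }
    }
  }

This way only the agent talks to Solr, on localhost, and my management
server only talks to the agents.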

There are some pros and cons using each of them.

First one's pros:
* When I start my custom management server, there is no need to push small
jars to every Solr node machine.

First one's cons:
* Everything must be watched from a single point.
* I have to expose the /admin URLs of every Solr node to the outside,
because my custom management server must reach them. If anybody else can
reach those URLs, then he/she can run a delete-all-documents command. (If I
put my management server inside the same environment as the Solr nodes, I
think I can overcome that issue.)
* Connecting to every Solr node via an SSH library may be resource
consuming.

Second one's pros:
* I can distribute the work load. If I have hundreds of Solr nodes in my
SolrCloud, I can send backup/recovery commands to my custom jars and each
of them can handle the process locally.
* I can forbid access to all Solr admin pages from outside the Solr nodes'
environment. My custom jars can run only the commands I have defined
inside them, and they access the Solr node they are responsible for via
HttpSolrServer.
* These custom jars may be used for further purposes (advice is welcome).

Second one's cons:
* I have to push a small jar to each Solr node.

What do folks think about these scenarios, and what would you suggest?


Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Grant Ingersoll

On Jun 15, 2013, at 12:54 PM, Alexandre Rafalovitch  wrote:

> On Sat, Jun 15, 2013 at 10:35 AM, Grant Ingersoll  wrote:
>> That being said, it truly amazes me that people were ever able to implement 
>> Solr, given some of the FUD in this thread.  I guess those tens of thousands 
>> of deployments out there were all done by above average devs...
> 
> I would not classify the thread as FUD.

I was just referring to the part about how Solr isn't something average devs 
can do, which I think is FUD.

At any rate, I think the ExtractingReqHandler could be updated to allow for 
metadata, etc. to be passed in with the raw document itself and a patch would 
be welcome.  It's something the literals stand in for now as a lightweight 
proxy, but clearly there is an opportunity for more to be passed in.
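For context, the literals proxy looks roughly like this from SolrJ (a
sketch; URL, file, and field names are made up):

  import java.io.File;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

  public class ExtractExample {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
      ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
      req.addFile(new File("report.pdf"), "application/pdf"); // the raw document
      req.setParam("literal.id", "doc-1");     // metadata rides along as literal.* params
      req.setParam("literal.author", "jdoe");
      req.setParam("commit", "true");
      solr.request(req);
      solr.shutdown();
    }
  }

Passing the metadata inside the document payload itself, rather than as
request parameters, is the part that would need the handler change.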

Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Alexandre Rafalovitch
On Sat, Jun 15, 2013 at 10:35 AM, Grant Ingersoll  wrote:
> That being said, it truly amazes me that people were ever able to implement 
> Solr, given some of the FUD in this thread.  I guess those tens of thousands 
> of deployments out there were all done by above average devs...

I would not classify the thread as FUD. More like confusion about the
best practices. And, from my tech support days, this usually means
lack of documentation, outdated explanations or missing representative
examples.

So, if there is a magic solution just under our noses that we are
missing, it would be good to know. Might be as easy to solve for the
next person as adding a Wiki link in the correct place.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael McCandless
You could also try the new[ish] PostingsHighlighter:
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

Mike McCandless

http://blog.mikemccandless.com


On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov
 wrote:
> If you have very large documents (many MB) that can lead to slow
> highlighting, even with FVH.
>
> See https://issues.apache.org/jira/browse/LUCENE-3234
>
> and try setting phraseLimit=1 (or some bigger number, but not infinite,
> which is the default)
>
> -Mike
>
>
>
> On 6/14/13 4:52 PM, Andy Brown wrote:
>>
>> Bryan,
>>
>> For specifics, I'll refer you back to my original email where I
>> specified all the fields/field types/handlers I use. Here's a general
>> overview.
>>   I really only have 3 fields that I index and search against: "name",
>> "description", and "content". All of which are just general text
>> (string) fields. I have a catch-all field called "text" that is only
>> used for querying. It's indexed but not stored. The "name",
>> "description", and "content" fields are copied into the "text" field.
>>   For partial word matching, I have 4 more fields: "name_par",
>> "description_par", "content_par", and "text_par". The "text_par" field
>> has the same relationship to the "*_par" fields as "text" does to the
>> others (only used for querying). Those partial word matching fields are
>> of type "text_general_partial" which I created. That field type is
>> analyzed differently than the regular text field in that it goes through
>> an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
>> at index time.
>>   I query against both "text" and "text_par" fields using edismax deftype
>> with my qf set to "text^2 text_par^1" to give full word matches a higher
>> score. This part returns back very fast as previously stated. It's when
>> I turn on highlighting that I take the huge performance hit.
>>   Again, I'm using the FastVectorHighlighting. The hl.fl is set to "name
>> name_par description description_par content content_par" so that it
>> returns highlights for full and partial word matches. All of those
>> fields have indexed, stored, termPositions, termVectors, and termOffsets
>> set to "true".
>>   It all seems redundant just to allow for partial word
>> matching/highlighting but I didn't know of a better way. Does anything
>> stand out to you that could be the culprit? Let me know if you need any
>> more clarification.
>>   Thanks!
>>   - Andy
>>
>> -Original Message-
>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
>> Sent: Wednesday, May 29, 2013 5:44 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Slow Highlighter Performance Even Using
>> FastVectorHighlighter
>>
>> Andy,
>>
>>> I don't understand why it's taking 7 secs to return highlights. The size
>>> of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
>>> 1024 for this verification purpose and that should be more than enough.
>>> The processor is plenty powerful enough as well.
>>>
>>> Running VisualVM shows all my CPU time being taken by mainly these 3
>>> methods:
>>>
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
>>
>> That is a strange and interesting set of things to be spending most of
>> your CPU time on. The implication, I think, is that the number of term
>> matches in the document for terms in your query (or, at least, terms
>> matching exact words or the beginning of phrases in your query) is
>> extremely high. Perhaps that's coming from this "partial word match" you
>> mention -- how does that work?
>>
>> -- Bryan
>>
>>> My guess is that this has something to do with how I'm handling partial
>>> word matches/highlighting. I have set up another request handler that
>>> only searches the whole word fields and it returns in 850 ms with
>>> highlighting.
>>>
>>> Any ideas?
>>>
>>> - Andy
>>>
>>>
>>> -Original Message-
>>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
>>> Sent: Monday, May 20, 2013 1:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: Slow Highlighter Performance Even Using
>>> FastVectorHighlighter
>>>
>>> My guess is that the problem is those 200M documents.
>>> FastVectorHighlighter is fast at deciding whether a match, especially a
>>> phrase, appears in a document, but it still starts out by walking the
>>> entire list of term vectors, and ends by breaking the document into
>>> candidate-snippet fragments, both processes that are proportional to the
>>> length of the document.
>>>
>>> It's hard to do much about the first, but for the second you could choose
>>> to expose FastVectorHighlighter's FieldPhraseList representation, and
>>> return offsets to the caller rather than fragments ...

Re: Adding pdf/word file using JSON/XML

2013-06-15 Thread Grant Ingersoll

On Jun 13, 2013, at 11:24 AM, Walter Underwood  wrote:

> That was my thought exactly. Contribute a REST request handler. --wunder
> 

+1.  The bits are already in place for a lot of it now that RESTlet is in.  

That being said, it truly amazes me that people were ever able to implement 
Solr, given some of the FUD in this thread.  I guess those tens of thousands of 
deployments out there were all done by above average devs...

-Grant

Re: yet another optimize question

2013-06-15 Thread Otis Gospodnetic
Hi Robi,

I'm going to guess you are seeing a smaller heap also simply because you
restarted the JVM recently (hm, you don't say you restarted, maybe I'm
making this up). If you are indeed indexing continuously, then you
shouldn't optimize. Lucene will merge segments itself. Lower
mergeFactor will force it to do it more often (it means slower
indexing, bigger IO hit when segments are merged, more per-segment
data that Lucene/Solr need to read from the segment for faceting and
such, etc.) so maybe you shouldn't mess with that.  Do you know what
your caches are like in terms of size, hit %, evictions?  We've
recently seen people set those to a few hundred K or even higher,
which can eat a lot of heap.  We have had luck with G1 recently, too.
Maybe you can run jstat and see which of the memory pools get filled
up and change/increase appropriate JVM param based on that?  How many
fields do you index, facet, or group on?
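Those cache settings live in solrconfig.xml; for reference, a moderate
filterCache looks something like this (sizes purely illustrative):

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>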

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Solr & ElasticSearch Support -- http://sematext.com/





On Fri, Jun 14, 2013 at 8:04 PM, Petersen, Robert
 wrote:
> Hi guys,
>
> We're on solr 3.6.1 and I've read the discussions about whether to optimize 
> or not to optimize.  I decided to try not optimizing our index as was 
> recommended.  We have a little over 15 million docs in our biggest index and 
> a 32gb heap for our jvm.  So without the optimizes the index folder seemed to 
> grow in size and quantity of files.  There seemed to be an upper limit but 
> eventually it hit 300 files consuming 26gb of space and that seemed to push 
> our slave farm over the edge and we started getting the dreaded OOMs.  We 
> have continuous indexing activity, so I stopped the indexer and manually ran 
> an optimize which made the index become 9 files consuming 15gb of space and 
> our slave farm started having acceptable memory usage.  Our merge factor is 
> 10, we're on java 7.  Before optimizing, I tried on one slave machine to go 
> with the latest JVM and tried switching from the CMS GC to the G1GC but it 
> hit OOM condition even faster.  So it seems like I have to continue to 
> schedule a regular optimize.  Right now it has been a couple of days since 
> running the optimize and the index is slowly growing bigger, now up to a bit 
> over 19gb.  What do you guys think?  Did I miss something that would make us 
> able to run without doing an optimize?
>
> Robert (Robi) Petersen
> Senior Software Engineer
> Search Department


Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael Sokolov
If you have very large documents (many MB), that can lead to slow
highlighting, even with FVH.


See https://issues.apache.org/jira/browse/LUCENE-3234

and try setting phraseLimit=1 (or some bigger number, but not infinite, 
which is the default)
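On the Solr side that should map onto the FVH request parameters, something
like this (a sketch; double-check the parameter names against the 4.x docs):

  q=...&hl=true&hl.useFastVectorHighlighter=true&hl.phraseLimit=1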


-Mike


On 6/14/13 4:52 PM, Andy Brown wrote:

Bryan,

For specifics, I'll refer you back to my original email where I
specified all the fields/field types/handlers I use. Here's a general
overview.

I really only have 3 fields that I index and search against: "name",
"description", and "content". All of which are just general text
(string) fields. I have a catch-all field called "text" that is only
used for querying. It's indexed but not stored. The "name",
"description", and "content" fields are copied into the "text" field.

For partial word matching, I have 4 more fields: "name_par",
"description_par", "content_par", and "text_par". The "text_par" field
has the same relationship to the "*_par" fields as "text" does to the
others (only used for querying). Those partial word matching fields are
of type "text_general_partial" which I created. That field type is
analyzed differently than the regular text field in that it goes through
an EdgeNGramFilterFactory with minGramSize="2" and maxGramSize="7"
at index time.
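For reference, the field type is roughly this (typed from memory, so treat
it as a sketch; analyzer details trimmed):

  <fieldType name="text_general_partial" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="7"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>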
  
I query against both "text" and "text_par" fields using the edismax defType
with my qf set to "text^2 text_par^1" to give full word matches a higher
score. This part returns very fast as previously stated. It's when
I turn on highlighting that I take the huge performance hit.

Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
name_par description description_par content content_par" so that it
returns highlights for full and partial word matches. All of those
fields have indexed, stored, termPositions, termVectors, and termOffsets
set to "true".

It all seems redundant just to allow for partial word
matching/highlighting but I didn't know of a better way. Does anything
stand out to you that could be the culprit? Let me know if you need any
more clarification.

Thanks!

- Andy


-Original Message-
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Wednesday, May 29, 2013 5:44 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using
FastVectorHighlighter

Andy,


I don't understand why it's taking 7 secs to return highlights. The size
of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
1024 for this verification purpose and that should be more than enough.
The processor is plenty powerful enough as well.

Running VisualVM shows all my CPU time being taken by mainly these 3
methods:

org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

That is a strange and interesting set of things to be spending most of
your CPU time on. The implication, I think, is that the number of term
matches in the document for terms in your query (or, at least, terms
matching exact words or the beginning of phrases in your query) is
extremely high. Perhaps that's coming from this "partial word match" you
mention -- how does that work?

-- Bryan


My guess is that this has something to do with how I'm handling partial
word matches/highlighting. I have set up another request handler that
only searches the whole word fields and it returns in 850 ms with
highlighting.

Any ideas?

- Andy


-Original Message-
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Monday, May 20, 2013 1:39 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using
FastVectorHighlighter

My guess is that the problem is those 200M documents.
FastVectorHighlighter is fast at deciding whether a match, especially a
phrase, appears in a document, but it still starts out by walking the
entire list of term vectors, and ends by breaking the document into
candidate-snippet fragments, both processes that are proportional to the
length of the document.

It's hard to do much about the first, but for the second you could choose
to expose FastVectorHighlighter's FieldPhraseList representation, and
return offsets to the caller rather than fragments, building up your own
snippets from a separate store of indexed files. This would also permit
you to set stored="false", improving your memory/core size ratio, which
I'm guessing could use some improving. It would require some work, and it
would require you to store a representation of what was indexed outside
the Solr core, in some constant-bytes-to-character representation that you
can use offsets with (e.g. UTF-16, or ASCII+entity references).

However, you may not need to do this -- it may be that you just need more
memory for your search machine. Not JVM memory ...