Re: Use case for the Shingle Filter

2017-03-05 Thread Ryan Josal
I thought new versions of solr didn't split on whitespace at the query
parser anymore, so this should work?

That being said, I think I remember it having a problem coming after a
synonym filter.  IIRC, if your input is "Foo Bar" and you have a synonym
"foo <=> baz" you would get foobaz bazbar instead of foobar and bazbar.  I
wrote a custom shingler to account for that.
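
For what it's worth, the query-side behavior this depends on looks roughly
like the following. This is only a sketch: it assumes a Solr version and
parser that honor the split-on-whitespace "sow" parameter (added around 6.5,
defaulting to false in 7.0) and a hypothetical field whose query analyzer
includes a ShingleFilterFactory:

q=green tea&defType=edismax&qf=title_shingles&sow=false

With sow=false the whole string reaches the field's query analyzer, so the
shingle filter can see adjacent tokens and emit "green tea" as a single term.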

Ryan

On Sun, Mar 5, 2017 at 02:48 Markus Jelsma 
wrote:

> Hello - we use it for text classification and online near-duplicate
> document detection/filtering. Using shingles means you want to consider
> order in the text. It is analogous to using bigrams and trigrams when doing
> language detection, you cannot distinguish between Danish and Norwegian
> solely on single characters.
>
> Markus
>
>
>
> -Original message-
> > From:Ryan Yacyshyn 
> > Sent: Sunday 5th March 2017 5:57
> > To: solr-user@lucene.apache.org
> > Subject: Use case for the Shingle Filter
> >
> > Hi everyone,
> >
> > I was thinking of using the Shingle Filter to help solve an issue I'm
> > facing. I can see this working in the analysis panel in the Solr admin,
> but
> > not when I make my queries.
> >
> > I found out it's because of the query parser splitting up the tokens on
> > white space before passing them along.
> >
> > This made me wonder what a practical use case can be, for using the
> shingle
> > filter?
> >
> > Any enlightenment on this would be much appreciated!
> >
> > Thanks,
> > Ryan
> >
>


Re: Forking Solr

2015-10-16 Thread Ryan Josal
Thanks for the feedback, forking lucene/solr is my last resort indeed.

1) It's not about creating fresh new plugins.  It's about modifying
existing ones or core solr code.
2) I want to submit the patch to modify core solr or lucene code, but I
also want to run it in prod before its accepted and released publicly.
Also I think this helps solidify the patch over time.
3) I have to do this all the time, and I agree it's better than forking,
but doing this repeatedly over time has diminishing returns because it
increases the cost of upgrading solr.  It also requires some ugly reflection
in most cases, and in others copying verbatim a pile of other classes.

I will send my questions to lucene-dev, thanks!
Ryan

On Friday, October 16, 2015, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Ryan,
>
> From a "solr-user" perspective :) I would advise against forking Solr. Some
> of our consulting business is "people who forked Solr, need to upgrade, and
> now have gotten themselves into hot water."
>
> I would try, in the following order
> 1. Creating a plugin (sounds like you can't do this)
> 2. Submitting a patch to Solr that makes it easier to create the plugin you
> need
> 3. Copy-pasting code to create a plugin. I once had to do this for a
> highlighter. It's ugly, but its better than forking.
> 4
> 999. Hiring Yonik :)
> 1000. Forking Solr
>
> 999 is a prereq for 1000 :)
>
> Even very heavily customized versions of Solr sold by major vendors that
> staff committers are entirely plugin driven.
>
> Cheers
> -Doug
>
>
> On Fri, Oct 16, 2015 at 3:30 PM, Alexandre Rafalovitch  >
> wrote:
>
> > I suspect these questions should go the Lucene Dev list instead. This
> > one is more for those who build on top of standard Solr.
> >
> > Regards,
> >    Alex.
> >
> > 
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> >
> > On 16 October 2015 at 12:07, Ryan Josal >
> wrote:
> > > Hi guys, I'd like to get your tips on how to run a Solr fork at my
> > > company.  I know Yonik has a "heliosearch" fork, and I'm sure many
> others
> > > have a fork.  There have been times where I want to add features to an
> > > existing core plugin, and subclassing isn't possible so I end up
> copying
> > > the source code into my repo, then using some crazy reflection to get
> it
> > to
> > > work.  Sometimes there's a little bug in something and I have to do the
> > > same thing.  Sometimes there's something I want to do deeper in core
> Solr
> > > code that isn't pluggable and I end up doing an interesting workaround.
> > > Sometimes I want to apply a patch from JIRA.  I also think forking solr
> > > will make it easier for me to contribute patches back.  So here are my
> > > questions:
> > >
> > > *) how do I properly fork it outside of github to my own company's git
> > > system?
> > > *) how do I pull new changes?  I think I would expect to sync new
> changes
> > > when there is a new public release.  What branches do I need to work
> > > with/on?
> > > *) how do I test my changes?  What part of the test suites do I run for
> > > what changes?
> > > *) how do I build a new version when I'm ready to go to prod?  This is
> > > slightly more unclear to me now that it isn't just a war.
> > >
> > > Thanks,
> > > Ryan
> >
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>


Forking Solr

2015-10-16 Thread Ryan Josal
Hi guys, I'd like to get your tips on how to run a Solr fork at my
company.  I know Yonik has a "heliosearch" fork, and I'm sure many others
have a fork.  There have been times where I want to add features to an
existing core plugin, and subclassing isn't possible so I end up copying
the source code into my repo, then using some crazy reflection to get it to
work.  Sometimes there's a little bug in something and I have to do the
same thing.  Sometimes there's something I want to do deeper in core Solr
code that isn't pluggable and I end up doing an interesting workaround.
Sometimes I want to apply a patch from JIRA.  I also think forking solr
will make it easier for me to contribute patches back.  So here are my
questions:

*) how do I properly fork it outside of github to my own company's git
system?
*) how do I pull new changes?  I think I would expect to sync new changes
when there is a new public release.  What branches do I need to work
with/on?
*) how do I test my changes?  What part of the test suites do I run for
what changes?
*) how do I build a new version when I'm ready to go to prod?  This is
slightly more unclear to me now that it isn't just a war.
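
One possible sketch of such a workflow, purely as an illustration with
hypothetical internal URLs and versions (the tag names and ant targets are
what I'd expect from the lucene-solr build, but treat them as assumptions to
verify):

git clone https://git.mycompany.example/search/lucene-solr.git
cd lucene-solr
git remote add upstream https://git-wip-us.apache.org/repos/asf/lucene-solr.git
git fetch upstream --tags
git checkout -b company/5.3 releases/lucene-solr/5.3.1   # branch off a release tag
# ...apply local patches; when a new release ships:
git fetch upstream --tags && git merge releases/lucene-solr/5.4.0

cd solr
ant test       # run the Solr test suite
ant package    # build the distribution artifacts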

Thanks,
Ryan


Re: Solr cross core join special condition

2015-10-07 Thread Ryan Josal
I developed a join transformer plugin that did that (although it didn't
flatten the results like that).  The one thing that was painful about it is
that the TextResponseWriter has references to both the IndexSchema and
SolrReturnFields objects for the primary core.  So when you add a
SolrDocument from another core, it returns the wrong fields.  I worked
around that by transforming the SolrDocument to a NamedList.  Then when it
gets to processing the IndexableFields it uses the wrong IndexSchema, I
worked around that by transforming each field to a hard Java object
(through the IndexSchema and FieldType of the correct core).  I think it
would be great to patch TextResponseWriter with multi core writing
abilities, but there is one question, how can it tell which core a
SolrDocument or IndexableField is from?  Seems we'd have to add an
attribute for that.
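
For illustration, a rough sketch (not the actual plugin; the class name is
just an example) of that flattening step, assuming Solr 4.x/5.x APIs:
resolve each stored value through the joined core's own IndexSchema/FieldType
so the primary core's schema is never consulted.

import org.apache.lucene.index.IndexableField;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;

public class JoinedDocFlattener {
  /** Convert a document from the joined core into plain values keyed by field name. */
  public static NamedList<Object> toNamedList(SolrDocument doc, IndexSchema joinedSchema) {
    NamedList<Object> out = new NamedList<>();
    for (String name : doc.getFieldNames()) {
      SchemaField sf = joinedSchema.getFieldOrNull(name);
      for (Object value : doc.getFieldValues(name)) {
        if (sf != null && value instanceof IndexableField) {
          // stored field still wrapped by Lucene: decode it with the joined core's FieldType
          out.add(name, sf.getType().toObject((IndexableField) value));
        } else {
          out.add(name, value);
        }
      }
    }
    return out;
  }
}

The transformer can then attach the resulting NamedList to the primary
SolrDocument instead of the foreign SolrDocument itself.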

The other possibly simpler thing to do is execute the join at index time
with an update processor.

Ryan

On Tuesday, October 6, 2015, Mikhail Khludnev 
wrote:

> On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian  > wrote:
>
> > it
> > seems there is not any way to do that right now and it should be
> developed
> > somehow. Am I right?
> >
>
> yep
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> >
>


Bug in query elevation transformers SOLR-7953

2015-08-20 Thread Ryan Josal
Hey guys, I just logged this bug and I wanted to raise awareness.  If you
use the QueryElevationComponent, and ask for fl=[elevated], you'll get only
false if solr is using LazyDocuments.  This looks even stranger when you
request exclusive=true and you only get back elevated documents, and they
all say false.  I'm not sure how often LazyDocuments are used, but it's
probably not an uncommon issue.

Ryan


Re: rq breaks wildcard search?

2015-04-22 Thread Ryan Josal
Awesome thanks!  I was on 4.10.2

Ryan

> On Apr 22, 2015, at 16:44, Joel Bernstein  wrote:
> 
> For your own implementation you'll need to implement the following methods:
> 
> public Query rewrite(IndexReader reader) throws IOException
> public void extractTerms(Set<Term> terms)
> 
> You can review the 4.10.3 version of the ReRankQParserPlugin to see how it
> implements these methods.
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
>> On Wed, Apr 22, 2015 at 7:33 PM, Joel Bernstein  wrote:
>> 
>> Just confirmed that wildcard queries work with Re-Ranking following
>> SOLR-6323.
>> 
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>> 
>> On Wed, Apr 22, 2015 at 7:26 PM, Joel Bernstein 
>> wrote:
>> 
>>> This should be resolved in
>>> https://issues.apache.org/jira/browse/SOLR-6323.
>>> 
>>> Solr 4.10.3
>>> 
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>> 
>>>> On Wed, Apr 15, 2015 at 6:23 PM, Ryan Josal  wrote:
>>>> 
>>>> Using edismax, supplying a rq= param, like {!rerank ...} is causing an
>>>> UnsupportedOperationException because the Query doesn't implement
>>>> createWeight.  This is for WildcardQuery in particular.  From some
>>>> preliminary debugging it looks like without rq, somehow the qf Queries
>>>> might turn into ConstantScore instead of WildcardQuery.  I don't think
>>>> this
>>>> is related to the RankQuery implementation as my own subclass has the
>>>> same
>>>> issue.  Anyway the effect is that all q's containing ? or * return http
>>>> 500
>>>> because I always have rq on.  Can anyone confirm if this is a bug?  I
>>>> will
>>>> log it in Jira if so.
>>>> 
>>>> Also, does anyone know how I can work around it?  Specifically, can I
>>>> disable edismax from making WildcardQueries?
>>>> 
>>>> Ryan
>> 


rq breaks wildcard search?

2015-04-15 Thread Ryan Josal
Using edismax, supplying a rq= param, like {!rerank ...} is causing an
UnsupportedOperationException because the Query doesn't implement
createWeight.  This is for WildcardQuery in particular.  From some
preliminary debugging it looks like without rq, somehow the qf Queries
might turn into ConstantScore instead of WildcardQuery.  I don't think this
is related to the RankQuery implementation as my own subclass has the same
issue.  Anyway the effect is that all q's containing ? or * return http 500
because I always have rq on.  Can anyone confirm if this is a bug?  I will
log it in Jira if so.
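
For reference, the failing requests are roughly of this shape (parameter
values are only illustrative):

q=shoe*&defType=edismax&qf=title description&rq={!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}&rqq=onSale:true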

Also, does anyone know how I can work around it?  Specifically, can I
disable edismax from making WildcardQueries?

Ryan


Re: omitTermFreqAndPositions issue

2015-04-09 Thread Ryan Josal
Thanks a lot Erick, your suggestion on using similarity will work great; I
wasn't aware you could define similarity on a field by field basis until
now, and that solution works perfectly.
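
For anyone who lands here later, a minimal sketch of that kind of similarity,
assuming the Lucene 4.x DefaultSimilarity API (class and package names are
just examples):

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoTfSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    // ignore how often the term occurs; matching at all counts once
    return freq > 0 ? 1.0f : 0.0f;
  }
}

It gets wired to the title field type in schema.xml with something like
<similarity class="com.example.NoTfSimilarity"/>, with
solr.SchemaSimilarityFactory as the global similarity so per-field
similarities take effect.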

Sorry what I said was a little misleading. I should have said "I don't want
it to issue phrase queries to that specific field ever, because it has
positions turned off and so phrase queries cause exceptions".  Because I DO
want to run phrase queries on the "title" data, I just have another field
for that.

The problem is the one described here:
http://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/

It still seems a bit off that you can't use an omitTermFreqAndPositions
field with edismax's qf; but I can't think of a situation that defining a
custom similarity wouldn't be the right solution.

Thanks again,
Ryan

On Wed, Apr 8, 2015 at 5:29 PM, Erick Erickson 
wrote:

> Ryan:
>
> bq:  I don't want it to issue phrase queries to that field ever
>
> This is one of those requirements that you'd have to enforce at the
> app layer. Having Solr (or Lucene) enforce a rule like this for
> everyone would be terrible.
>
> So if you're turning off TF but also saying title is "one of the
> primary components of score". Since TF in integral to calculating
> scores, I'm not quite sure what that means.
>
> You could write a custom similarity class that returns whatever you
> want (1.0 comes to mind) from the tf() method.
>
> Best,
> Erick
>
> On Wed, Apr 8, 2015 at 4:50 PM, Ryan Josal  wrote:
> > Thanks for your thought Shawn, I don't think fq will be helpful here.
> The
> > field for which I want to turn TF off is "title", which is actually one
> of
> > the primary components of score, so I really need it in qf.  I just don't
> > want the TF portion of the score for that field only.  I don't want it to
> > issue phrase queries to that field ever, but if the user quotes
> something,
> > it does, and I don't know how to make it stop.  To me it seems
> potentially
> > more appropriate to send that to the pf fields, although I can think of a
> > couple good reasons to put it against qf.  That's fine as long as it
> > doesn't try to build a phrase query against a no TF no pos field.
> >
> > Ryan
> >
> > On Wednesday, April 8, 2015, Shawn Heisey  wrote:
> >
> >> On 4/8/2015 5:06 PM, Ryan Josal wrote:
> >> > The error:
> >> > IllegalStateException: field "foo" indexed without position data;
> cannot
> >> > run PhraseQuery.
> >> >
> >> > It would actually be ok for us to index position data but there isn't
> an
> >> > option for that without term frequencies.  No TF is important for us
> when
> >> > it comes to searching product titles.
> >> >
> >> > I should say that only a small fraction of user queries contained
> quoted
> >> > phrases that trigger this error, so it works much of the time, but
> we'd
> >> > also like to continue supporting user quoted phrase queries.
> >> >
> >> > So how can I index a field without TF and use it in edismax qf?
> >>
> >> If you omit positions, you can't do phrase queries.  As far as I know,
> >> there is no option in Solr to omit only frequencies and not positions.
> >>
> >> I think there is a way that you can achieve what you want, though.  What
> >> you are looking for is filters.  The fq parameter (filter query) will
> >> restrict the result set to only entries that match the query, but will
> >> not affect the relevancy score *at all*.  Here is an example of a filter
> >> query that restricts the results to items that are in stock, assuming
> >> you have the appropriate schema:
> >>
> >> fq=inStock:true
> >>
> >> Queries specified in fq will default to the lucene query parser, but you
> >> can override that if you need to.  This query would be equivalent to the
> >> previous one, but it would be parsed using edismax:
> >>
> >> fq={!edismax}inStock:true
> >>
> >> Here's another example of a useful filter, using yet another query
> parser:
> >>
> >> fq={!terms f=userId}bob,alice,susan
> >>
> >> Remember, the reason I have suggested filters is that they do not
> >> influence score.
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Group by score

2015-04-09 Thread Ryan Josal
You can use Result Grouping by a function using query(), but you'll need a
version of Lucene with this bug fixed:

https://issues.apache.org/jira/browse/SOLR-7046
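
A sketch of what that can look like, assuming edismax and that $q carries the
user's query; rows that score identically for the query end up in one group:

group=true&group.main=true&group.func=query({!type=edismax v=$q})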

Ryan

On Thursday, April 9, 2015, Jens Mayer  wrote:

> Hey everybody,
> I have the following situation in my search application: I've been
> searching street sources. By executing a search I receive several matches,
> and the first 10 matches are displayed. But in this situation some of the
> results are nearly the same. For example, if I search for Berlin I'll receive
> every zip code as a separate result in my top 10, but my application only
> shows the zip code if you explicitly search for it, e.g. 14089 Berlin. So if
> you only search for Berlin you receive Berlin ten times, without a zip code,
> as the result. I'd like to avoid this situation. I've taken a look at my
> results and noticed that these results always have the same score.
> So I would like to implement grouping by score, but it seems Solr doesn't
> know this field. Yet I'm wondering about the fact that inside a grouped set
> of data I can still sort by score.
> Does someone have an idea how I can resolve this?
> Greetings
>


Re: omitTermFreqAndPositions issue

2015-04-08 Thread Ryan Josal
Thanks for your thought Shawn, I don't think fq will be helpful here.  The
field for which I want to turn TF off is "title", which is actually one of
the primary components of score, so I really need it in qf.  I just don't
want the TF portion of the score for that field only.  I don't want it to
issue phrase queries to that field ever, but if the user quotes something,
it does, and I don't know how to make it stop.  To me it seems potentially
more appropriate to send that to the pf fields, although I can think of a
couple good reasons to put it against qf.  That's fine as long as it
doesn't try to build a phrase query against a no TF no pos field.

Ryan

On Wednesday, April 8, 2015, Shawn Heisey  wrote:

> On 4/8/2015 5:06 PM, Ryan Josal wrote:
> > The error:
> > IllegalStateException: field "foo" indexed without position data; cannot
> > run PhraseQuery.
> >
> > It would actually be ok for us to index position data but there isn't an
> > option for that without term frequencies.  No TF is important for us when
> > it comes to searching product titles.
> >
> > I should say that only a small fraction of user queries contained quoted
> > phrases that trigger this error, so it works much of the time, but we'd
> > also like to continue supporting user quoted phrase queries.
> >
> > So how can I index a field without TF and use it in edismax qf?
>
> If you omit positions, you can't do phrase queries.  As far as I know,
> there is no option in Solr to omit only frequencies and not positions.
>
> I think there is a way that you can achieve what you want, though.  What
> you are looking for is filters.  The fq parameter (filter query) will
> restrict the result set to only entries that match the query, but will
> not affect the relevancy score *at all*.  Here is an example of a filter
> query that restricts the results to items that are in stock, assuming
> you have the appropriate schema:
>
> fq=inStock:true
>
> Queries specified in fq will default to the lucene query parser, but you
> can override that if you need to.  This query would be equivalent to the
> previous one, but it would be parsed using edismax:
>
> fq={!edismax}inStock:true
>
> Here's another example of a useful filter, using yet another query parser:
>
> fq={!terms f=userId}bob,alice,susan
>
> Remember, the reason I have suggested filters is that they do not
> influence score.
>
>
> https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter
>
> Thanks,
> Shawn
>
>


omitTermFreqAndPositions issue

2015-04-08 Thread Ryan Josal
Hey guys, it seems that omitTermFreqAndPositions is not very usable with
edismax, and I'm wondering if this is intended behavior, and how I can get
around the problem.

The setup:
define field "foo" with omitTermFreqAndPositions=true

The query:
q="ground coffee"&qf=foo bar baz

The error:
IllegalStateException: field "foo" indexed without position data; cannot
run PhraseQuery.

It would actually be ok for us to index position data but there isn't an
option for that without term frequencies.  No TF is important for us when
it comes to searching product titles.

I should say that only a small fraction of user queries contained quoted
phrases that trigger this error, so it works much of the time, but we'd
also like to continue supporting user quoted phrase queries.

So how can I index a field without TF and use it in edismax qf?

Thanks for your help!
Ryan


Re: sort on facet.index?

2015-04-02 Thread Ryan Josal
Awesome, I didn't know this feature was going to add so much power!
Looking forward to using it.
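
For reference, a reversed index-order sort with the JSON Facet API described
on that page looks something like this (the field name is only an example):

json.facet={ categories : { type : terms, field : category, sort : "index desc" } }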

On Thursday, April 2, 2015, Yonik Seeley  wrote:

> On Thu, Apr 2, 2015 at 10:25 AM, Ryan Josal  > wrote:
> > Sorting the result set or the facets?  For the facets there is
> > facet.sort=index (lexicographically) and facet.sort=count.  So maybe you
> > are asking if you can sort by index, but reversed?  I don't think this is
> > possible, and it's a good question.
>
> The new facet module that will be in Solr 5.1 supports sorting both
> directions on both count and index order (as well as by statistics /
> bucket aggregations).
> http://yonik.com/json-facet-api/
>
> -Yonik
>


Re: sort on facet.index?

2015-04-02 Thread Ryan Josal
Sorting the result set or the facets?  For the facets there is
facet.sort=index (lexicographically) and facet.sort=count.  So maybe you
are asking if you can sort by index, but reversed?  I don't think this is
possible, and it's a good question.  I wanted to chime in on this one
because I wanted my own facet.sort=rank, but there is no nice pluggable way
to implement a new sort.  I'd love to be able to add a Comparator for a new
sort.  I ended up subclassing FacetComponent to sort of hack on the rank
sort implementation but it isn't very pretty and I'm sure not as efficient
as it could be if FacetComponent was designed for more sorts.

Ryan

On Thursday, April 2, 2015, Derek Poh  wrote:

> Is sorting on facet index supported?
>
> I would like to sort on the below facet index
>
> 
> 14
> 8
> 12
> 349
> 81
> 8
> 12
> 
>
> to
>
> 
> 12
> 8
> 81
> 349
> ...
> ...
> ...
> 
>
> -Derek
>


DocTransformer#setContext

2015-03-20 Thread Ryan Josal
Hey guys, I wanted to ask if I'm using the DocTransformer API as intended.
There is a setContext( TransformerContext c ) method which is called by the
TextResponseWriter before it calls transform on any docs.  That context
object contains a DocIterator reference.  I want to use a DocTransformer to
add info from DynamoDB based on the uniquekeys of docs, so I figured this
would be the way to go to get all needed data from DDB in a batch before
transform.

Turns out if you call nextDoc on that iterator, that doc will not be
transformed because the iterator is not reset or regenerated in any way
before transformations start being called.  In some cases, if the Collector
collected extra docs, the DocSlice will have more docids to return even
after hasNext, and the code doesn't check that, so it will transform
those.  Then eventually it may throw an IndexOutOfBoundsException.  My gut
says this is not intended.  Why not give the DocList in the
TransformContext?

So in the example solrconfig, I think there is a suggestion to use
DocTransformers to get data from external DBs, but has anyone done this,
and how do they handle making a single/batch request instead of doing one
for every transform call?

Ryan


Re: rankquery usage bug?

2015-02-24 Thread Ryan Josal
Ticket filed, thanks!
https://issues.apache.org/jira/browse/SOLR-7152

On Fri, Feb 20, 2015 at 9:29 PM, Joel Bernstein  wrote:

> Ryan,
>
> This looks like a good jira ticket to me.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
> On Fri, Feb 20, 2015 at 6:40 PM, Ryan Josal  wrote:
>
> > Hey guys, I put a rq in defaults but I can't figure out how to override
> it
> > with no rankquery.  Looks like one option might be checking for empty
> > string before trying to use it in QueryComponent?  I can work around it
> in
> > the prep method of an earlier searchcomponent for now.
> >
> > Ryan
> >
>


Re: Solr synonyms logic

2015-02-21 Thread Ryan Josal
What you are describing is hyponymy.  Pastry is the hypernym.  You can
accomplish this by not using expansion, for example:
cannelloni => cannelloni, pastry

This has the result of adding pastry to the index.
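
Applied to the earlier synonyms.txt, that would look something like:

lasagne => lasagne, pastry
penne => penne, pastry
cannelloni => cannelloni, pastry

Each dish is still indexed as itself plus the broader term, while a query for
the dish stays narrow.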

Ryan

On Saturday, February 21, 2015, Mikhail Khludnev 
wrote:

> Hello,
>
> usually debugQuery=true output explains a lot of such details.
>
> On Sat, Feb 21, 2015 at 10:52 AM, davym >
> wrote:
>
> > Hi all,
> >
> > I'm querying a recipe database in Solr. By using synonyms, I'm trying to
> > make my search a little smarter.
> >
> > What I'm trying to do here, is that a search for pastry returns all
> > lasagne,
> > penne & cannelloni recipes.
> > However a search for lasagne should only return lasagne recipes.
> >
> > In my synonyms.txt, I have these lines:
> > -
> > lasagne,pastry
> > penne,pastry
> > cannelloni,pastry
> > -
> >
> > Filter in my schema.xml looks like this:
> > <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > ignoreCase="true" expand="true"
> > tokenizerFactory="solr.WhitespaceTokenizerFactory" />
> > Only in the index analyzer, not in the query.
> >
> > When using the Solr analysis tool, I can see that my index for lasagne
> has
> > a
> > synonym pastry and my query only queries lasagne. Same for penne and
> > cannelloni, they both have the synonym pastry.
> >
> > Currently my Solr query for lasagne also returns all penne and cannelloni
> > recipes. I cannot understand why this is the case.
> >
> > Can someone explain this behaviour to me please?
> >
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Solr-synonyms-logic-tp4187827.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> >
>


rankquery usage bug?

2015-02-20 Thread Ryan Josal
Hey guys, I put a rq in defaults but I can't figure out how to override it
with no rankquery.  Looks like one option might be checking for empty
string before trying to use it in QueryComponent?  I can work around it in
the prep method of an earlier searchcomponent for now.
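
For illustration, a rough sketch of that workaround (untested, assuming the
4.x SearchComponent API; the class name is just an example), registered ahead
of the query component so an empty rq= on the request cleanly cancels the
default:

import java.io.IOException;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class StripEmptyRqComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    if ("".equals(params.get("rq"))) {          // rq supplied but empty: drop it
      ModifiableSolrParams cleaned = new ModifiableSolrParams(params);
      cleaned.remove("rq");
      rb.req.setParams(cleaned);
    }
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do at process time
  }

  @Override
  public String getDescription() {
    return "Removes an empty rq parameter so it does not reach QueryComponent";
  }

  @Override
  public String getSource() {
    return null;
  }
}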

Ryan


Custom facet.sort

2015-02-16 Thread Ryan Josal
Hey guys, I have a desire to order (field) facets by their order of
appearance in the search results.

When I first thought about it, I figured there would be some way to plug a
custom Comparator into FacetComponent and link it to facet.sort=rank or
something like that, but not only is there no real way to plug in a custom
sort (nor is subclassing the component feasible), the complexity is further
compounded by the fact faceting really only operates on the docset and so
scores aren't available.  If max(score) was an attribute of a facetcount
object this type of sort could be done.  sum(score) might also be
interesting for a weighted approach.  I can imagine performance concerns
with doing this though.  Operating on the doclist isn't enough because it's
only a slice of the results.  What if I reduce my scope to only needing the
top 2 facets in order?  It still seems to be just as complex because you
have to start from the first page and request an extra long docslice from
QueryComponent by hacking the start/rows params, and for all you know you
need to get to the last document to get all the facets.

So does anyone have any ideas of how to implement this?  Maybe it isn't
even through faceting.

Ryan


Re: An interesting approach to grouping

2015-01-27 Thread Ryan Josal
This is great, thanks Jim.  Your patch worked and the sorting solution
meets the goal, although group.limit seems like it could cut various
results out of the middle of the result set.  I will play around with it
and see if it proves helpful.  Can you let me know the Jira so I can keep
an eye on it?

Ryan

On Tuesday, January 27, 2015, Jim.Musil  wrote:

> Interestingly, you can do something like this:
>
> group=true&
> group.main=true&
> group.func=rint(scale(query({!type=edismax v=$q}),0,20))& // puts into
> buckets
> group.limit=20& // gives you 20 from each bucket
> group.sort=category asc  // this will sort by category within each bucket,
> but this can be a function as well.
>
>
>
> Jim Musil
>
>
>
> On 1/27/15, 10:14 AM, "Jim.Musil" >
> wrote:
>
> >When using group.main=true, the results are not mixed as you expect:
> >
> >"If true, the result of the last field grouping command is used as the
> >main result list in the response, using group.format=simple"
> >
> >https://wiki.apache.org/solr/FieldCollapsing
> >
> >
> >Jim
> >
> >On 1/27/15, 9:22 AM, "Ryan Josal" >
> wrote:
> >
> >>Thanks a lot!  I'll try this out later this morning.  If group.func and
> >>group.field don't combine the way I think they might, I'll try to look
> >>for
> >>a way to put it all in group.func.
> >>
> >>On Tuesday, January 27, 2015, Jim.Musil  > wrote:
> >>
> >>> I'm not sure the query you provided will do what you want, BUT I did
> >>>find
> >>> the bug in the code that is causing the NullPointerException.
> >>>
> >>> The variable context is supposed to be global, but when prepare() is
> >>> called, it is only defined in the scope of that function.
> >>>
> >>> Here's the simple patch:
> >>>
> >>> Index: core/src/java/org/apache/solr/search/Grouping.java
> >>> ===
> >>> --- core/src/java/org/apache/solr/search/Grouping.java  (revision
> >>>1653358)
> >>> +++ core/src/java/org/apache/solr/search/Grouping.java  (working copy)
> >>> @@ -926,7 +926,7 @@
> >>>   */
> >>>  @Override
> >>>  protected void prepare() throws IOException {
> >>> -  Map context = ValueSource.newContext(searcher);
> >>> +  context = ValueSource.newContext(searcher);
> >>>groupBy.createWeight(context, searcher);
> >>>actualGroupsToFind = getMax(offset, numGroups, maxDoc);
> >>>  }
> >>>
> >>>
> >>> I'll search for a Jira issue and open if I can't find one.
> >>>
> >>> Jim Musil
> >>>
> >>>
> >>>
> >>> On 1/26/15, 6:34 PM, "Ryan Josal" 
> >
> >>>wrote:
> >>>
> >>> >I have an index of products, and these products have a "category"
> >>>which we
> >>> >can say for now is a good approximation of its location in the store.
> >>>I'm
> >>> >investigating altering the ordering of the results so that the
> >>>categories
> >>> >aren't interlaced as much... so that the results are a little bit more
> >>> >grouped by category, but not *totally* grouped by category.  It's
> >>> >interesting because it's an approach that sort of compares results to
> >>> >near-scored/ranked results.  One of the hoped outcomes of this would
> >>>that
> >>> >there would be somewhat fewer categories represented in the top
> >>>results
> >>> >for
> >>> >a given query, although it is questionable if this is a good
> >>>measurement
> >>> >to
> >>> >determine the effectiveness of the implementation.
> >>> >
> >>> >My first attempt was to
> >>>
> >>>>group=true&group.main=true&group.field=category&group.func=rint(scale(q
> >>>>u
> >>>>er
> >>> >y({!type=edismax
> >>> >v=$q}),0,20))
> >>> >
> >>> >Or some FunctionQuery like that, so that in order to become a member
> >>>of a
> >>> >group, the doc would have to have the same category, and be dropped
> >>>into
> >>> >the same score bucket (20 in this case).

Re: An interesting approach to grouping

2015-01-27 Thread Ryan Josal
Thanks a lot!  I'll try this out later this morning.  If group.func and
group.field don't combine the way I think they might, I'll try to look for
a way to put it all in group.func.

On Tuesday, January 27, 2015, Jim.Musil  wrote:

> I'm not sure the query you provided will do what you want, BUT I did find
> the bug in the code that is causing the NullPointerException.
>
> The variable context is supposed to be global, but when prepare() is
> called, it is only defined in the scope of that function.
>
> Here's the simple patch:
>
> Index: core/src/java/org/apache/solr/search/Grouping.java
> ===
> --- core/src/java/org/apache/solr/search/Grouping.java  (revision 1653358)
> +++ core/src/java/org/apache/solr/search/Grouping.java  (working copy)
> @@ -926,7 +926,7 @@
>   */
>  @Override
>  protected void prepare() throws IOException {
> -  Map context = ValueSource.newContext(searcher);
> +  context = ValueSource.newContext(searcher);
>groupBy.createWeight(context, searcher);
>actualGroupsToFind = getMax(offset, numGroups, maxDoc);
>  }
>
>
> I'll search for a Jira issue and open if I can't find one.
>
> Jim Musil
>
>
>
> On 1/26/15, 6:34 PM, "Ryan Josal" > wrote:
>
> >I have an index of products, and these products have a "category" which we
> >can say for now is a good approximation of its location in the store.  I'm
> >investigating altering the ordering of the results so that the categories
> >aren't interlaced as much... so that the results are a little bit more
> >grouped by category, but not *totally* grouped by category.  It's
> >interesting because it's an approach that sort of compares results to
> >near-scored/ranked results.  One of the hoped outcomes of this would be that
> >there would be somewhat fewer categories represented in the top results
> >for
> >a given query, although it is questionable if this is a good measurement
> >to
> >determine the effectiveness of the implementation.
> >
> >My first attempt was to
> >group=true&group.main=true&group.field=category&group.func=rint(scale(quer
> >y({!type=edismax
> >v=$q}),0,20))
> >
> >Or some FunctionQuery like that, so that in order to become a member of a
> >group, the doc would have to have the same category, and be dropped into
> >the same score bucket (20 in this case).  This doesn't work out of the
> >gate
> >due to an NPE (solr 4.10.2) (although I'm not sure it would work anyway):
> >
> >java.lang.NullPointerException\n\tat
> >org.apache.lucene.queries.function.valuesource.ScaleFloatFunction.getValue
> >s(ScaleFloatFunction.java:104)\n\tat
> >org.apache.solr.search.DoubleParser$Function.getValues(ValueSourceParser.j
> >ava:)\n\tat
> >org.apache.lucene.search.grouping.function.FunctionFirstPassGroupingCollec
> >tor.setNextReader(FunctionFirstPassGroupingCollector.java:82)\n\tat
> >org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:
> >113)\n\tat
> >org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)\n\ta
> >t
> >org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)\n\ta
> >t
> >org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:451)\n
> >\tat
> >org.apache.solr.search.Grouping.execute(Grouping.java:368)\n\tat
> >org.apache.solr.handler.component.QueryComponent.process(QueryComponent.ja
> >va:459)\n\tat
> >org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHa
> >ndler.java:218)\n\tat
> >
> >
> >Has anyone tried something like this before, and does anyone have any
> >novel
> >ideas for how to approach it, no matter how different?  How about a
> >workaround for the group.func error here?  I'm very open-minded about
> >where
> >to go on this one.
> >
> >Thanks,
> >Ryan
>
>


An interesting approach to grouping

2015-01-26 Thread Ryan Josal
I have an index of products, and these products have a "category" which we
can say for now is a good approximation of its location in the store.  I'm
investigating altering the ordering of the results so that the categories
aren't interlaced as much... so that the results are a little bit more
grouped by category, but not *totally* grouped by category.  It's
interesting because it's an approach that sort of compares results to
near-scored/ranked results.  One of the hoped outcomes of this would be that
there would be somewhat fewer categories represented in the top results for
a given query, although it is questionable if this is a good measurement to
determine the effectiveness of the implementation.

My first attempt was to
group=true&group.main=true&group.field=category&group.func=rint(scale(query({!type=edismax
v=$q}),0,20))

Or some FunctionQuery like that, so that in order to become a member of a
group, the doc would have to have the same category, and be dropped into
the same score bucket (20 in this case).  This doesn't work out of the gate
due to an NPE (solr 4.10.2) (although I'm not sure it would work anyway):

java.lang.NullPointerException\n\tat
org.apache.lucene.queries.function.valuesource.ScaleFloatFunction.getValues(ScaleFloatFunction.java:104)\n\tat
org.apache.solr.search.DoubleParser$Function.getValues(ValueSourceParser.java:)\n\tat
org.apache.lucene.search.grouping.function.FunctionFirstPassGroupingCollector.setNextReader(FunctionFirstPassGroupingCollector.java:82)\n\tat
org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)\n\tat
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)\n\tat
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)\n\tat
org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:451)\n\tat
org.apache.solr.search.Grouping.execute(Grouping.java:368)\n\tat
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:459)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat


Has anyone tried something like this before, and does anyone have any novel
ideas for how to approach it, no matter how different?  How about a
workaround for the group.func error here?  I'm very open-minded about where
to go on this one.

Thanks,
Ryan


Re: Dynamically loaded core.properties file

2014-08-21 Thread Ryan Josal
Thanks Erick, I tested that and it does work, providing a solution to my 
problem!  So property expansion does work in core.properties, I did not 
know that, and I got the impression from Chris' comment that that would 
open up a can of worms when it comes to persisting core.properties.  I 
guess while the can's open, I'll eat up.
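
To make that concrete, a hypothetical example of the setup being described
(file and property names are only placeholders): a core.properties containing

name=products
properties=custom.${solr.env}.properties

with custom.dev.properties and custom.prod.properties sitting next to it, and
Solr started with -Dsolr.env=dev or -Dsolr.env=prod so the matching file is
pulled in.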


Just for fun I tried property expansion in my referenced subproperties 
file and it didn't work, which is fine for me.


Ryan

On 08/20/2014 04:11 PM, Erick Erickson wrote:

OK, not quite sure if this would work, but

In each core.properties file, put in a line similar to what Chris suggested:
properties=${env}/custom.properties

You might be able to now define your sys var like
-Drelative_or_absolute_path_to_dev_custom.properties file.
or
-Drelative_or_absolute_path_to_prod_custom.properties file.
on Solr startup. Then in the custom.properties file you have whatever
you need to define to make the prod/dev distinction you need.

WARNING: I'm not entirely sure that relative pathing works here, which
just means I haven't tried it.

Best,
Erick

On Wed, Aug 20, 2014 at 3:11 PM, Ryan Josal  wrote:

Thanks Erick, that mirrors my thoughts exactly.  If core.properties had
property expansion it would work for this, but I agree with not supporting
that for the complexities it introduces, and I'm not sure it's the right way
to solve it anyway.  So, it doesn't really handle my problem.

I think because the properties file I want to load is not actually related
to any core, it makes it easier to solve.  So if solr.xml is no longer
rewritten then it seems like a global properties file could safely be
specified there using property expansion.  Or maybe there is some way to
write some code that could get executed before schema and solrconfig are
parsed, although I'm not sure how that would work given how you need
solrconfig to load the libraries and define plugins.

Ryan


On 08/20/2014 01:07 PM, Erick Erickson wrote:

Hmmm, I was going to make a code change to do this, but Chris
Hostetter saved me from the madness that ensues. Here's his comment on
the JIRA that I did open (but then closed), does this handle your
problem?

I don't think we want to make the name of core.properties be variable
... that way leads to madness and confusion.

the request on the user list was about being able to dynamically load
a property file with diff values between dev & production like you
could do in the old style solr.xml – that doesn't mean core.properties
needs to have a configurable name, it just means there needs to be a
configurable way to load properties.

we already have a properties option which can be specified in
core.properties to point to an additional external file that should
also be loaded ... if variable substitution was in play when parsing
core.properties then you could have something like
properties=custom.${env}.properties in core.properties ... but
introducing variable substitution into the core.properties (which solr
both reads & writes based on CoreAdmin calls) brings back the host of
complexities involved when we had persistence of solr.xml as a
feature, with the questions about persisting the original values with
variables in them, vs the values after evaluating variables.

Best,
Erick

On Wed, Aug 20, 2014 at 11:36 AM, Ryan Josal 
wrote:

Hi all, I have a question about dynamically loading a core properties
file
with the new core discovery method of defining cores.  The concept is
that I
can have a dev.properties file and a prod.properties file, and specify
which
one to load with -Dsolr.env=dev.  This way I can have one file which
specifies a bunch of runtime properties like external servers a plugin
might
use, etc.

Previously I was able to do this in solr.xml because it can do system
property substitution when defining which properties file to use for a
core.

Now I'm not sure how to do this with core discovery, since the core is
discovered based on this file, and now the file needs to contain things
that
are specific to that core, like name, which previously were defined in
the
xml definition.

Is there a way I can plugin some code that gets run before any schema or
solrconfigs are parsed?  That way I could write a property loader that
adds
properties from ${solr.env}.properties to the JVM system properties.

Thanks!
Ryan






Re: Dynamically loaded core.properties file

2014-08-20 Thread Ryan Josal
Thanks Erick, that mirrors my thoughts exactly.  If core.properties had 
property expansion it would work for this, but I agree with not 
supporting that for the complexities it introduces, and I'm not sure 
it's the right way to solve it anyway.  So, it doesn't really handle my 
problem.


I think because the properties file I want to load is not actually 
related to any core, it makes it easier to solve.  So if solr.xml is no 
longer rewritten then it seems like a global properties file could 
safely be specified there using property expansion.  Or maybe there is 
some way to write some code that could get executed before schema and 
solrconfig are parsed, although I'm not sure how that would work given 
how you need solrconfig to load the libraries and define plugins.


Ryan

On 08/20/2014 01:07 PM, Erick Erickson wrote:

Hmmm, I was going to make a code change to do this, but Chris
Hostetter saved me from the madness that ensues. Here's his comment on
the JIRA that I did open (but then closed), does this handle your
problem?

I don't think we want to make the name of core.properties be variable
... that way leads to madness and confusion.

the request on the user list was about being able to dynamically load
a property file with diff values between dev & production like you
could do in the old style solr.xml – that doesn't mean core.properties
needs to have a configurable name, it just means there needs to be a
configurable way to load properties.

we already have a properties option which can be specified in
core.properties to point to an additional external file that should
also be loaded ... if variable substitution was in play when parsing
core.properties then you could have something like
properties=custom.${env}.properties in core.properties ... but
introducing variable substitution into the core.properties (which solr
both reads & writes based on CoreAdmin calls) brings back the host of
complexities involved when we had persistence of solr.xml as a
feature, with the questions about persisting the original values with
variables in them, vs the values after evaluating variables.

Best,
Erick

On Wed, Aug 20, 2014 at 11:36 AM, Ryan Josal  wrote:

Hi all, I have a question about dynamically loading a core properties file
with the new core discovery method of defining cores.  The concept is that I
can have a dev.properties file and a prod.properties file, and specify which
one to load with -Dsolr.env=dev.  This way I can have one file which
specifies a bunch of runtime properties like external servers a plugin might
use, etc.

Previously I was able to do this in solr.xml because it can do system
property substitution when defining which properties file to use for a core.

Now I'm not sure how to do this with core discovery, since the core is
discovered based on this file, and now the file needs to contain things that
are specific to that core, like name, which previously were defined in the
xml definition.

Is there a way I can plugin some code that gets run before any schema or
solrconfigs are parsed?  That way I could write a property loader that adds
properties from ${solr.env}.properties to the JVM system properties.

Thanks!
Ryan




Dynamically loaded core.properties file

2014-08-20 Thread Ryan Josal
Hi all, I have a question about dynamically loading a core properties 
file with the new core discovery method of defining cores.  The concept 
is that I can have a dev.properties file and a prod.properties file, and 
specify which one to load with -Dsolr.env=dev.  This way I can have one 
file which specifies a bunch of runtime properties like external servers 
a plugin might use, etc.


Previously I was able to do this in solr.xml because it can do system 
property substitution when defining which properties file to use for a core.


Now I'm not sure how to do this with core discovery, since the core is 
discovered based on this file, and now the file needs to contain things 
that are specific to that core, like name, which previously were defined 
in the xml definition.


Is there a way I can plugin some code that gets run before any schema or 
solrconfigs are parsed?  That way I could write a property loader that 
adds properties from ${solr.env}.properties to the JVM system properties.


Thanks!
Ryan


RE: Correct way for getting SolrCore?

2013-02-06 Thread Ryan Josal
This is perfect, thanks!  I'm surprised it eluded me for so long.
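
For anyone finding this thread later, a rough sketch of the shape this takes
(not the actual plugin, and the class name is just an example; it assumes the
usual 4.x APIs): configuration is read in init(), and index access waits for
inform(), which runs once the core has finished loading.

import org.apache.lucene.index.IndexReader;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
import org.apache.solr.util.RefCounted;
import org.apache.solr.util.plugin.SolrCoreAware;

public class MyUpdateProcessorFactory extends UpdateRequestProcessorFactory
    implements SolrCoreAware {

  @Override
  public void init(NamedList args) {
    // only read configuration arguments here; the core is not available yet
  }

  @Override
  public void inform(SolrCore core) {
    // called after the core has loaded, so the index is safe to read
    RefCounted<SolrIndexSearcher> ref = core.getSearcher();
    try {
      IndexReader reader = ref.get().getIndexReader();
      // build the per-core in-memory structures from the reader here
    } finally {
      ref.decref();
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return next; // placeholder; a real processor would consult the prebuilt state
  }
}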

From: Mark Miller [markrmil...@gmail.com]
Sent: Tuesday, February 05, 2013 4:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Correct way for getting SolrCore?

The SolrCoreAware interface?

- Mark

On Feb 5, 2013, at 5:42 PM, Ryan Josal  wrote:

> By way of the deprecated SolrCore.getSolrCore method,
>
> SolrCore.getSolrCore().getCoreDescriptor().getCoreContainer().getCores()
>
> Solr starts up in an infinite recursive loop of loading cores.  I understand 
> now that the UpdateProcessorFactory is initialized as part of the core 
> initialization, so I expect there is no way to read the index of a core if 
> the core has not been initialized yet.  I still feel a bit uneasy about 
> initialization on the first update request, so is there some other place I 
> can plugin initialization code that runs after the core is loaded?  I suppose 
> I'd be using SolrCore.getSearcher().get().getIndexReader() to get the 
> IndexReader, but if that happens after a good point of plugging in this 
> initialization, then I guess SolrCore.getIndexReaderFactory() is the way to 
> go.
>
> Thanks,
> Ryan
> 
> From: Ryan Josal [rjo...@rim.com]
> Sent: Tuesday, February 05, 2013 1:27 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Correct way for getting SolrCore?
>
> Is there any way I can get the cores and do my initialization in the 
> @Override public void init(final NamedList args) method?  I could wait for 
> the first request, but I imagine I'd have to deal with indexing requests 
> piling up while I iterate over every document in every index.
>
> Ryan
> 
> From: Mark Miller [markrmil...@gmail.com]
> Sent: Tuesday, February 05, 2013 1:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Correct way for getting SolrCore?
>
> The request should give you access to the core - the core to the core 
> descriptor, the descriptor to the core container, which knows about all the 
> cores.
>
> - Mark
>
> On Feb 5, 2013, at 4:09 PM, Ryan Josal  wrote:
>
>> Hey guys,
>>
>> I am writing an UpdateRequestProcessorFactory plugin which needs to have 
>> some initialization code in the init method.  I need to build some 
>> information about each SolrCore in memory so that when an update comes in 
>> for a particular SolrCore, I can use the data for the appropriate core.  
>> Ultimately, I need a lucene IndexReader for each core.  I figure I'd get 
>> this through a SolrCore, CoreContainer, or CoreDescriptor.  I've looked 
>> around for awhile and I always end up going in circles.  So how can I 
>> iterate over cores that have been loaded?
>>
>> Ryan

RE: Correct way for getting SolrCore?

2013-02-05 Thread Ryan Josal
By way of the deprecated SolrCore.getSolrCore method,

SolrCore.getSolrCore().getCoreDescriptor().getCoreContainer().getCores()

Solr starts up in an infinite recursive loop of loading cores.  I understand 
now that the UpdateProcessorFactory is initialized as part of the core 
initialization, so I expect there is no way to read the index of a core if the 
core has not been initialized yet.  I still feel a bit uneasy about 
initialization on the first update request, so is there some other place I can 
plugin initialization code that runs after the core is loaded?  I suppose I'd 
be using SolrCore.getSearcher().get().getIndexReader() to get the IndexReader, 
but if that happens after a good point of plugging in this initialization, then 
I guess SolrCore.getIndexReaderFactory() is the way to go.

Thanks,
Ryan

From: Ryan Josal [rjo...@rim.com]
Sent: Tuesday, February 05, 2013 1:27 PM
To: solr-user@lucene.apache.org
Subject: RE: Correct way for getting SolrCore?

Is there any way I can get the cores and do my initialization in the @Override 
public void init(final NamedList args) method?  I could wait for the first 
request, but I imagine I'd have to deal with indexing requests piling up while 
I iterate over every document in every index.

Ryan

From: Mark Miller [markrmil...@gmail.com]
Sent: Tuesday, February 05, 2013 1:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Correct way for getting SolrCore?

The request should give you access to the core - the core to the core 
descriptor, the descriptor to the core container, which knows about all the 
cores.

- Mark

On Feb 5, 2013, at 4:09 PM, Ryan Josal  wrote:

> Hey guys,
>
>  I am writing an UpdateRequestProcessorFactory plugin which needs to have 
> some initialization code in the init method.  I need to build some 
> information about each SolrCore in memory so that when an update comes in for 
> a particular SolrCore, I can use the data for the appropriate core.  
> Ultimately, I need a lucene IndexReader for each core.  I figure I'd get this 
> through a SolrCore, CoreContainer, or CoreDescriptor.  I've looked around for 
> awhile and I always end up going in circles.  So how can I iterate over cores 
> that have been loaded?
>
> Ryan


RE: Correct way for getting SolrCore?

2013-02-05 Thread Ryan Josal
Is there any way I can get the cores and do my initialization in the @Override 
public void init(final NamedList args) method?  I could wait for the first 
request, but I imagine I'd have to deal with indexing requests piling up while 
I iterate over every document in every index.

Ryan

From: Mark Miller [markrmil...@gmail.com]
Sent: Tuesday, February 05, 2013 1:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Correct way for getting SolrCore?

The request should give you access to the core - the core to the core 
descriptor, the descriptor to the core container, which knows about all the 
cores.

- Mark

On Feb 5, 2013, at 4:09 PM, Ryan Josal  wrote:

> Hey guys,
>
>  I am writing an UpdateRequestProcessorFactory plugin which needs to have 
> some initialization code in the init method.  I need to build some 
> information about each SolrCore in memory so that when an update comes in for 
> a particular SolrCore, I can use the data for the appropriate core.  
> Ultimately, I need a lucene IndexReader for each core.  I figure I'd get this 
> through a SolrCore, CoreContainer, or CoreDescriptor.  I've looked around for 
> awhile and I always end up going in circles.  So how can I iterate over cores 
> that have been loaded?
>
> Ryan


Correct way for getting SolrCore?

2013-02-05 Thread Ryan Josal
Hey guys,

  I am writing an UpdateRequestProcessorFactory plugin which needs to have some 
initialization code in the init method.  I need to build some information about 
each SolrCore in memory so that when an update comes in for a particular 
SolrCore, I can use the data for the appropriate core.  Ultimately, I need a 
Lucene IndexReader for each core.  I figure I'd get this through a SolrCore, 
CoreContainer, or CoreDescriptor.  I've looked around for a while and I always 
end up going in circles.  So how can I iterate over cores that have been loaded?

Ryan


RE: SolrJ DirectXmlRequest

2013-01-23 Thread Ryan Josal
Thanks Hoss,

The issue mentioned describes behavior similar to what I observed, but not quite 
the same.  Commons-fileupload creates java.io.File objects for the temp files, and 
when those Files are garbage collected, the temp file is deleted.  I've 
verified this by letting the temp files build up and then forcing a full 
collection which clears all of them.  So I think the reason a percentage of 
temp files built up in my system was that under heavy load, some of the 
java.io.Files made it into old gen in the heap.  I switched to G1, and the 
problem went away.
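
For reference, the G1 switch mentioned above is just a JVM startup flag; a minimal
example, assuming Tomcat picks up its options from a setenv.sh or similar:

JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"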

Regarding how the XML files are being sent, I have verified that each XML 
file is sent as a single request, by aligning the access log of my Solr master 
server with the processing log of my SolrJ server.  I didn't test the requests 
to see if the MIME type is multipart, but I suppose it is possible if some 
other form data or instruction needed to be passed with it.  Either way, I 
suppose it would go through fileupload anyway, because somebody's got to make a 
temp file for large files, right?

Ryan

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Wednesday, January 16, 2013 6:06 PM
To: solr-user@lucene.apache.org
Subject: RE: SolrJ DirectXmlRequest

: DirectXmlRequest is part of the SolrJ library, so I guess that means it
: is not commonly used.  My use case is that I'm applying an XSLT to the
: raw XML on the client side, instead of leaving that up to the Solr
: master (although even if I applied the XSLT on the Solr server, I'd

I think Otis's point was that most people don't have Solr XML files lying
around that they send to Solr, nor do they build up XML strings in Java
in the Solr input format (with XSLT or otherwise) ... most people using
SolrJ build up SolrInputDocument objects and pass those to their
SolrServer instance.
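
A minimal sketch of that usual SolrJ path, for contrast (field names are made up;
"server" is assumed to be an existing SolrServer such as HttpSolrServer):

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
doc.addField("title_t", "example title");
server.add(doc);    // sent as a single content stream, so no multipart upload
server.commit();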

: I've done some research and I'm fairly confident that apache
: commons-fileupload library is responsible for the temp files.  There's

I believe you are correct ... searching for "solr fileupload temp files"
led me to this issue, which seems to have fallen by the wayside...

https://issues.apache.org/jira/browse/SOLR-1953

...if you could try that patch out and/or post your comments, it would be
helpful.

Something that seems really odd to me however is how/why your basic
updates are even causing multipart/file-upload functionality to be used
... a quick skim of the client code suggests that that should only happen
if you try to send multiple ContentStreams in a single request: I can
understand why that wouldn't typically happen for most users building up
multiple SolrInputDocuments (they would get added to a single stream); and
I can understand why that would typically happen for users sending
multiple binary files to something like ExtractingRequestHandler -- but if
you are using DirectXmlRequest in the way you described each xml file
should be sent as a single stream in a single request and the XML should
be sent in the raw POST body -- the commons-fileupload code shouldn't even
come into play.  (Either that, or I'm missing something, or you're using
an older version of Solr that used fileupload even if there was only a
single content stream)


-Hoss



RE: SolrJ DirectXmlRequest

2013-01-09 Thread Ryan Josal
Thanks Otis,

DirectXmlRequest is part of the SolrJ library, so I guess that means it is not 
commonly used.  My use case is that I'm applying an XSLT to the raw XML on the 
client side, instead of leaving that up to the Solr master (although even if I 
applied the XSLT on the Solr server, I'd still use DirectXmlRequest to get the 
raw XML there).  This does lead me to the idea that parsing the XML without the 
XSLT is probably better than copying some of XMLLoader to parse Solr XML as a 
workaround, and might be a good idea to do anyway.

I've done some research and I'm fairly confident that the Apache commons-fileupload 
library is responsible for the temp files.  There's an explanation for how 
files are cleaned up at http://commons.apache.org/fileupload/using.html in the 
"Resource cleanup" section.  I have observed that forcing a garbage collection 
over JMX results in all temporary files being purged.  This implies that many 
of the java.io.File objects are being promoted to the old generation of the heap, 
where they survive long enough (only a few minutes in my case) to use up all tmp 
disk space.

I think this can probably be solved by GC tuning, or, failing that, introducing 
a (less desirable) System.gc() somewhere in the updateRequestProcessorChain.

Thanks for your help, and hopefully this will be useful if someone else runs 
into a similar problem.

Ryan

From: Otis Gospodnetic [otis.gospodne...@gmail.com]
Sent: Wednesday, January 09, 2013 11:53 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrJ DirectXmlRequest

Hi Ryan,

One typically uses a Solr client library to talk to Solr instead of sending
raw XML.  For example, if your application is written in Java then you
would use SolrJ.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Jan 9, 2013 at 12:03 PM, Ryan Josal  wrote:

> I also don't know what's creating them.  Maybe Solr, but also maybe
> Tomcat, maybe apache commons.  I could change java.io.tmpdir to one with
> more space, but the problem is that many of the temp files end up
> permanent, so eventually it would still run out of space.  I also
> considered setting the tmpdir to /dev/null, but that would defeat the
> purpose of whatever is writing those log files in the first place.  I could
> periodically clean up the tmpdir myself, but that feels the hackiest.
>
> Is it fairly common to send XML to Solr this way from a remote host?  If
> it is, then that would lead me to believe Solr and any of its libraries
> aren't causing it, and I should inspect Tomcat.  I'm using Tomcat 7.
>
> Ryan
> 
> From: Otis Gospodnetic [otis.gospodne...@gmail.com]
> Sent: Tuesday, January 08, 2013 7:29 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrJ DirectXmlRequest
>
> Hi Ryan,
>
> I'm not sure what is creating those upload files: something in Solr? Or
> Tomcat?
>
> Why not specify a different temp dir via system property command line
> parameter?
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Jan 8, 2013 12:17 PM, "Ryan Josal"  wrote:
>
> > I have encountered an issue where using DirectXmlRequest to index data on
> > a remote host results in eventually running out of temp disk space in
> the
> > java.io.tmpdir directory.  This occurs when I process a sufficiently
> large
> > batch of files.  About 30% of the temporary files end up permanent.  The
> > filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp.
>  Has
> > anyone else had this happen before?  The relevant code is:
> >
> > DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
> > up.process(solr);
> >
> > where `xml` is a String containing Solr formatted XML, and `solr` is the
> > SolrServer.  When disk space is eventually exhausted, this is the error
> > message that is repeatedly seen on the master host:
> >
> > 2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR
> > org.apache.solr.servlet.SolrDispatchFilter  [] -
> > org.apache.commons.fileupload.FileUploadBase$IOFileUploadException:
> > Processing of multipart/form-data request failed. No space left on device
> > at
> >
> org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
> > at
> >
> org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
> > at
> >
> org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
> > at
> >
> org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
>

RE: SolrJ DirectXmlRequest

2013-01-09 Thread Ryan Josal
I also don't know what's creating them.  Maybe Solr, but also maybe Tomcat, 
maybe Apache Commons.  I could change java.io.tmpdir to one with more space, 
but the problem is that many of the temp files end up permanent, so eventually 
it would still run out of space.  I also considered setting the tmpdir to 
/dev/null, but that would defeat the purpose of whatever is writing those log 
files in the first place.  I could periodically clean up the tmpdir myself, but 
that feels the hackiest.

Is it fairly common to send XML to Solr this way from a remote host?  If it is, 
then that would lead me to believe Solr and any of its libraries aren't 
causing it, and I should inspect Tomcat.  I'm using Tomcat 7.

Ryan

From: Otis Gospodnetic [otis.gospodne...@gmail.com]
Sent: Tuesday, January 08, 2013 7:29 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrJ DirectXmlRequest

Hi Ryan,

I'm not sure what is creating those upload files: something in Solr? Or
Tomcat?

Why not specify a different temp dir via system property command line
parameter?
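
For example (the directory itself is hypothetical), as an extra JVM argument:

-Djava.io.tmpdir=/path/with/more/space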

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jan 8, 2013 12:17 PM, "Ryan Josal"  wrote:

> I have encountered an issue where using DirectXmlRequest to index data on
> a remote host results in eventually running out of temp disk space in the
> java.io.tmpdir directory.  This occurs when I process a sufficiently large
> batch of files.  About 30% of the temporary files end up permanent.  The
> filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp.  Has
> anyone else had this happen before?  The relevant code is:
>
> DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
> up.process(solr);
>
> where `xml` is a String containing Solr formatted XML, and `solr` is the
> SolrServer.  When disk space is eventually exhausted, this is the error
> message that is repeatedly seen on the master host:
>
> 2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR
> org.apache.solr.servlet.SolrDispatchFilter  [] -
> org.apache.commons.fileupload.FileUploadBase$IOFileUploadException:
> Processing of multipart/form-data request failed. No space left on device
> at
> org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
> at
> org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
> at
> org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
> at
> org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
> at
> org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> ... truncated stack trace
>
> I am running Solr 3.6 on an Ubuntu 12.04 server.  I am considering working
> around this by pulling out as much as I can from XMLLoader into my client,
> and processing the XML myself into SolrInputDocuments for indexing, but
> this is certainly not ideal.
>
> Ryan



SolrJ DirectXmlRequest

2013-01-08 Thread Ryan Josal
I have encountered an issue where using DirectXmlRequest to index data on a 
remote host results in eventually running out of temp disk space in the 
java.io.tmpdir directory.  This occurs when I process a sufficiently large 
batch of files.  About 30% of the temporary files end up permanent.  The 
filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp.  Has 
anyone else had this happen before?  The relevant code is:

DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
up.process(solr);

where `xml` is a String containing Solr formatted XML, and `solr` is the 
SolrServer.  When disk space is eventually exhausted, this is the error message 
that is repeatedly seen on the master host:

2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR 
org.apache.solr.servlet.SolrDispatchFilter  [] - 
org.apache.commons.fileupload.FileUploadBase$IOFileUploadException: Processing 
of multipart/form-data request failed. No space left on device
at 
org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
at 
org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
at 
org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
at 
org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
at 
org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
... truncated stack trace

I am running Solr 3.6 on an Ubuntu 12.04 server.  I am considering working 
around this by pulling out as much as I can from XMLLoader into my client, and 
processing the XML myself into SolrInputDocuments for indexing, but this is 
certainly not ideal.
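
For reference, the "Solr formatted XML" being sent here is the standard update
format; a minimal example with made-up field names:

<add>
  <doc>
    <field name="id">1</field>
    <field name="title_t">example</field>
  </doc>
</add>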

Ryan