Highlight: simple.pre/post not being applied always

2013-10-31 Thread Andy Pickler
Solr: 4.5.1

I'm sending in a query of "july" and getting back the results and
highlighting I expect with one exception:




@@@hl@@@Julie@@@endhl@@@ A




#Month:July




The simple.pre of @@@hl@@@ and the simple.post of @@@endhl@@@ are not being
applied in the one case of the field "#Month:July", even though that field is
included in the highlighting section.  I've tried changing various
highlighting parameters to no avail.  Could someone help me figure out where
to look for why the pre/post aren't being applied?

Thanks,
Andy Pickler
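
For reference, a minimal sketch of a request that sets the two markers
explicitly.  In Solr the parameters are spelled hl.simple.pre and
hl.simple.post.  The /select path and the hl.fl value of "*" are assumptions,
since the original request is not shown in the message.

```python
from urllib.parse import urlencode, parse_qs

def highlight_query(q, pre, post, fields="*"):
    """Build a /select query string with explicit highlight markers."""
    params = {
        "q": q,
        "hl": "true",
        "hl.fl": fields,  # assumption: highlight every matching field
        "hl.simple.pre": pre,
        "hl.simple.post": post,
    }
    # urlencode percent-escapes the "@" characters in the markers
    return "/select?" + urlencode(params)

url = highlight_query("july", "@@@hl@@@", "@@@endhl@@@")
print(url)
```

Printing the encoded URL makes it easy to compare against what the client
actually sends, for example from the Solr request log.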


Re: Join Query Behavior

2013-10-25 Thread Andy Pickler
If it helps to clarify any, here's the full query:

/select
?
q=*:*
&
fq=type:ProjectGroup
&
fq={!join from=project_id_i to=project_id_im}user_id_i:65615 -role_id_i:18
type:UserRole

We have two Solr servers that were indexed from the same database.  One of
the servers is running Solr 4.2, while the other (test server) is running
4.5.

Solr 4.2:


Solr 4.5.1:


Solr 4.2 returns the expected result with the project IDs "filtered" out
from the join query, while the 4.5 query shows *all* results (2642
records).  I can leave off the join query in 4.5 and get the same results,
which tells me obviously it is having no effect.

Is there a change to the join query behavior between these releases, or
could I have configured something differently in my 4.5.1 install?

Thanks,
Andy Pickler
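
A minimal sketch of how the request above could be assembled and URL-encoded
before sending it to both servers.  It assumes the line break inside the join
fq is only mail wrapping, so the three clauses are joined with single spaces.
Nothing here explains the 4.2/4.5 difference; it only rules out encoding
mistakes when comparing the two installs.

```python
from urllib.parse import urlencode, parse_qsl

# Repeated "fq" keys are kept as separate filter queries.
params = [
    ("q", "*:*"),
    ("fq", "type:ProjectGroup"),
    ("fq", "{!join from=project_id_i to=project_id_im}"
           "user_id_i:65615 -role_id_i:18 type:UserRole"),
]

query_string = urlencode(params)
print("/select?" + query_string)

# Round-trip check: decoding gives back exactly what was encoded.
decoded = parse_qsl(query_string)
print(decoded == params)
```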

On Thu, Oct 24, 2013 at 2:42 PM, Andy Pickler wrote:

> We're attempting to upgrade from Solr 4.2 to 4.5 but are finding that 4.5
> is not "honoring" this join query:
>
> ...
> &
> fq={!join from=project_id_i to=project_id_im}user_id_i:65615 -role_id_i:18
> type:UserRole
> &
> 
>
> On our Solr 4.2 instance adding/removing that query gives us different
> (and expected) results, while the query doesn't affect the results at all
> in 4.5.  Are there any known differences/fixes in join query behavior
> between 4.2 and 4.5 that might explain this, or should I be looking at
> other factors?
>
> Thanks,
> Andy Pickler
>
>


Join Query Behavior

2013-10-24 Thread Andy Pickler
We're attempting to upgrade from Solr 4.2 to 4.5 but are finding that 4.5
is not "honoring" this join query:

...
&
fq={!join from=project_id_i to=project_id_im}user_id_i:65615 -role_id_i:18
type:UserRole
&


On our Solr 4.2 instance adding/removing that query gives us different (and
expected) results, while the query doesn't affect the results at all in
4.5.  Are there any known differences/fixes in join query behavior between
4.2 and 4.5 that might explain this, or should I be looking at other factors?

Thanks,
Andy Pickler


Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-06 Thread Andy Pickler
That's exactly what turned out to be the problem.  We thought we had
already tried that permutation but apparently hadn't.  I know it's obvious
in retrospect.  Thanks for the suggestion.

Thanks,
Andy Pickler
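
The mismatch described in this thread (a field's column attribute naming the
database column instead of the SELECT alias) can be caught with a small check.
The sketch below is only an illustration: the entity names "block" and "reply"
and the exact field elements are hypothetical, since the real configuration
was not posted, and the alias detection is a simple regex that only
understands "expr AS alias".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, reduced DIH config; entity names are invented for illustration.
DATA_CONFIG = """
<document>
  <entity name="block" transformer="HTMLStripTransformer"
          query="SELECT id AS blockId, content AS content FROM engagement_block">
    <field column="content" stripHTML="true"/>
    <entity name="reply" transformer="HTMLStripTransformer"
            query="SELECT br.other_content AS replyContent FROM block_reply br">
      <field column="other_content" stripHTML="true"/>
    </entity>
  </entity>
</document>
"""

def select_aliases(query):
    """Return the output column names (the AS aliases) of a simple SELECT."""
    return set(re.findall(r"\bAS\s+(\w+)", query, flags=re.IGNORECASE))

def mismatched_fields(config_xml):
    """List (entity, column) pairs whose column is not a SELECT alias."""
    problems = []
    for entity in ET.fromstring(config_xml).iter("entity"):
        aliases = select_aliases(entity.get("query", ""))
        # findall("field") only looks at this entity's direct children
        for field in entity.findall("field"):
            if field.get("column") not in aliases:
                problems.append((entity.get("name"), field.get("column")))
    return problems

print(mismatched_fields(DATA_CONFIG))
```

With this sample config the check reports the sub-entity field, whose column
would need to be "replyContent" to match the alias.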

On Wed, Jul 3, 2013 at 2:38 PM, Alexandre Rafalovitch wrote:

> On Tue, Jul 2, 2013 at 10:59 AM, Andy Pickler wrote:
>
> > SELECT
> >   br.other_content AS replyContent
> > FROM block_reply
> > ">
> >  *THIS DOESN'T WORK!*
> >
>
> shouldn't it be
> column="replyContent"
> since you are renaming it in SELECT?
>
> Regards,
>Alex.
>
>
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>


Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Andy Pickler
Thanks for the quick reply.  Unfortunately, I don't believe my company
would want me sharing our exact production schema in a public forum,
although I realize it makes it harder to diagnose the problem.  The
sub-entity is a multi-valued field that indeed does have a relationship to
the outer entity.  I just left off the 'where' clause from the sub-entity,
as I didn't believe it was helpful in the context of this problem.  We use
the convention of..

SELECT dbColumnName AS solrFieldName

...so that we can relate the database column name to what we want it to be
named in the Solr index.

I don't think any of this helps you identify my problem, but I tried to
address your questions.

Thanks,
Andy

On Tue, Jul 2, 2013 at 9:14 AM, Gora Mohanty wrote:

> On 2 July 2013 20:29, Andy Pickler wrote:
> > Solr 4.1.0
> >
> > We've been using the DIH to pull data in from a MySQL database for quite
> > some time now.  We're now wanting to strip all the HTML content out of
> > many fields using the HTMLStripTransformer (
> > http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
> > Unfortunately, while it seems to be working fine for "top-level"
> > entities, we can't seem to get it to work for sub-entities:
> >
> > (not exact schema, reduced for example purposes)
>
> Please do not do that. This DIH configuration file does
> not make sense (please see comments below), and we
> are left guessing in the dark. If the file is too large,
> you can share it on something like pastebin.com
>
> >  > transformer="HTMLStripTransformer" query="
> >   SELECT
> > id as blockId,
> > name as blockTitle,
> > content as content
> >   FROM engagement_block
> >   ">
> > *THIS WORKS!*
> >> transformer="HTMLStripTransformer" query="
> > SELECT
> >   br.other_content AS replyContent
> > FROM block_reply
> > ">
> >  *THIS DOESN'T WORK!*
> [...]
>
> (a) You SELECT replyContent, but the column attribute
>  in the field is named "other_content". Nothing should
>  be getting indexed into the field.
> (b) Why are your entities nested if the inner entity has no
>  relationship to the outer one?
>
> Regards,
> Gora
>


DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Andy Pickler
Solr 4.1.0

We've been using the DIH to pull data in from a MySQL database for quite
some time now.  We're now wanting to strip all the HTML content out of many
fields using the HTMLStripTransformer (
http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
 Unfortunately, while it seems to be working fine for "top-level" entities,
we can't seem to get it to work for sub-entities:

(not exact schema, reduced for example purposes)


*THIS WORKS!*
  
 *THIS DOESN'T WORK!*
  


We've tried several different permutations of putting the sub-entity column
in different nest levels of the XML to no avail.  I'm curious if we're
trying something that is just not supported or whether we are just trying
the wrong things.

Thanks,
Andy Pickler


Re: MoreLikeThis - No Results

2013-05-22 Thread Andy Pickler
Answered my own question...

mlt.mintf: Minimum Term Frequency - the frequency below which terms will be
ignored in the source doc

Our "source doc" is a set of limited terms...not a large content field.  So
in our case I need to set that value to 1 (rather than the default of 2).
 Now I'm getting results...and they indeed are relevant.

Thanks,
Andy Pickler
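
To illustrate why the default threshold produced no results here, a minimal
sketch of the minimum-term-frequency filter.  This is a deliberate
simplification: the real MoreLikeThis implementation also runs the field's
analyzer and applies other thresholds (such as minimum document frequency),
which this example ignores.

```python
from collections import Counter

def interesting_terms(text, mintf):
    """Keep terms whose frequency in the source document is at least mintf."""
    counts = Counter(text.lower().split())
    return {term: tf for term, tf in counts.items() if tf >= mintf}

source = "Healthcare Cost Trends"

# Every term occurs once, so a threshold of 2 discards all of them.
print(interesting_terms(source, mintf=2))

# Lowering the threshold to 1 keeps all three terms.
print(interesting_terms(source, mintf=1))
```

On the request itself this corresponds to adding mlt.mintf=1 to the /mlt
query shown in the quoted message below.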

On Wed, May 22, 2013 at 12:20 PM, Andy Pickler wrote:

> I'm developing a recommendation feature in our app using the
> MoreLikeThisHandler <http://wiki.apache.org/solr/MoreLikeThisHandler>,
> and so far it is doing a great job.  We're using a user's "competency
> keywords" as the MLT field list and the user's corresponding document in
> Solr as the "comparison document".  I have found that for one user I'm not
> receiving any recommendations, and I'm not sure why.
>
> Solr: 4.1.0
>
> *relevant schema*:
>
>  stored="true" multiValued="true" termVectors="true"/>
>
>  positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   
> 
> 
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
> 
>   
>   
> 
> 
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
> 
>   
> 
>
> *user's values*:
>
> 
> Healthcare Cost Trends
> 
>
> Is it possible that, among the ~40,000 users in this index (about 500
> of whom have the same competency keywords), the words "healthcare",
> "cost" and "trends" are just judged by Lucene to not be "significant"?  I
> realize that I may not understand how the MLT Handler is doing things under
> the covers...I've only been guessing until now based on the (otherwise
> excellent) results I've been seeing.
>
> Thanks,
> Andy Pickler
>
> P.S.  For some additional information, the following query:
>
>
> /mlt?q=objectId:user91813&mlt.fl=competencyKeywords&mlt.interestingTerms=details&debugQuery=true&mlt.match.include=false
>
> ...produces the following results...
>
> 
> 
> 0
> 2
> 
> 
> 
> 
> objectId:user91813
> objectId:user91813
> 
> 
> 
> 
> 
>


MoreLikeThis - No Results

2013-05-22 Thread Andy Pickler
I'm developing a recommendation feature in our app using the
MoreLikeThisHandler <http://wiki.apache.org/solr/MoreLikeThisHandler>, and
so far it is doing a great job.  We're using a user's "competency keywords"
as the MLT field list and the user's corresponding document in Solr as the
"comparison document".  I have found that for one user I'm not receiving
any recommendations, and I'm not sure why.

Solr: 4.1.0

*relevant schema*:




  





  
  





  


*user's values*:


Healthcare Cost Trends


Is it possible that, among the ~40,000 users in this index (about 500 of
whom have the same competency keywords), the words "healthcare",
"cost" and "trends" are just judged by Lucene to not be "significant"?  I
realize that I may not understand how the MLT Handler is doing things under
the covers...I've only been guessing until now based on the (otherwise
excellent) results I've been seeing.

Thanks,
Andy Pickler

P.S.  For some additional information, the following query:

/mlt?q=objectId:user91813&mlt.fl=competencyKeywords&mlt.interestingTerms=details&debugQuery=true&mlt.match.include=false

...produces the following results...



0
2




objectId:user91813
objectId:user91813







Re: Top 10 Terms in Index (by date)

2013-04-02 Thread Andy Pickler
A key problem with those approaches as well as Lucene's HighFreqTerms class
(
http://lucene.apache.org/core/4_2_0/misc/org/apache/lucene/misc/HighFreqTerms.html)
is that none of them seem to have the ability to combine with a date range
query...which is key in my scenario.  I'm kinda thinking that what I'm
asking to do just isn't supported by Lucene or Solr, and that I'll have to
pursue another avenue.  If anyone has any other suggestions, I'm all ears.
I'm starting to wonder if I need to have some nightly batch job that
executes against my database and builds up "that day's top terms" in a
table or something.

Thanks,
Andy Pickler

On Tue, Apr 2, 2013 at 7:16 AM, Tomás Fernández Löbbe wrote:

> Oh, I see, essentially you want to get the sum of the term frequencies for
> every term in a subset of documents (instead of the document frequency as
> the FacetComponent would give you). I don't know of an easy/out of the box
> solution for this. I know the TermVectorComponent will give you the tf for
> every term in a document, but I'm not sure if you can filter or sort on it.
> Maybe you can do something like:
> https://issues.apache.org/jira/browse/LUCENE-2393
> or what's suggested here:
> http://search-lucene.com/m/of5Fn1PUOHU/
> but I have never used something like that.
>
> Tomás
>
>
>
> On Mon, Apr 1, 2013 at 9:58 PM, Andy Pickler 
> wrote:
>
> > I need "total number of occurrences" across all documents for each term.
> > Imagine this...
> >
> > Post #1: "I think, therefore I am like you"
> > Reply #1: "You think too much"
> > Reply #2: "I think that I think much as you"
> >
> > Each of those "documents" are put into 'content'.  Pretending I don't
> have
> > stop words, the top term query (not considering dateCreated in this
> > example) would result in something like...
> >
> > "think": 4
> > "I": 4
> > "you": 3
> > "much": 2
> > ...
> >
> > Thus, just a "number of documents" approach doesn't work, because if a
> word
> > occurs more than one time in a document it needs to be counted that many
> > times.  That seemed to rule out faceting like you mentioned as well as
> the
> > TermsComponent (which as I understand also only counts "documents").
> >
> > Thanks,
> > Andy Pickler
> >
> > On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe <
> > tomasflo...@gmail.com
> > > wrote:
> >
> > > So you have one document per user comment? Why not use faceting plus
> > > filtering on the "dateCreated" field? That would count "number of
> > > documents" for each term (so, in your case, if a term is used twice in
> > one
> > > comment it would only count once). Is that what you are looking for?
> > >
> > > Tomás
> > >
> > >
> > > On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler 
> > > wrote:
> > >
> > > > Our company has an application that is "Facebook-like" for usage by
> > > > enterprise customers.  We'd like to do a report of "top 10 terms
> > entered
> > > by
> > > > users over (some time period)".  With that in mind I'm using the
> > > > DataImportHandler to put all the relevant data from our database
> into a
> > > > Solr 'content' field:
> > > >
> > > >  stored="false"
> > > > multiValued="false" required="true" termVectors="true"/>
> > > >
> > > > Along with the content is the 'dateCreated' for that content:
> > > >
> > > >  > > > multiValued="false" required="true"/>
> > > >
> > > > I'm struggling with the TermVectorComponent documentation to
> understand
> > > how
> > > > I can put together a query that answers the 'report' mentioned above.
> > >  For
> > > > each document I need each term counted however many times it is
> entered
> > > > (content of "I think what I think" would report 'think' as used
> twice).
> > > >  Does anyone have any insight as to whether I'm headed in the right
> > > > direction and then what my query would be?
> > > >
> > > > Thanks,
> > > > Andy Pickler
> > > >
> > >
> >
>


Re: Top 10 Terms in Index (by date)

2013-04-01 Thread Andy Pickler
I need "total number of occurrences" across all documents for each term.
Imagine this...

Post #1: "I think, therefore I am like you"
Reply #1: "You think too much"
Reply #2: "I think that I think much as you"

Each of those "documents" is put into 'content'.  Pretending I don't have
stop words, the top term query (not considering dateCreated in this
example) would result in something like...

"think": 4
"I": 4
"you": 3
"much": 2
...

Thus, just a "number of documents" approach doesn't work, because if a word
occurs more than one time in a document it needs to be counted that many
times.  That seemed to rule out faceting like you mentioned as well as the
TermsComponent (which as I understand also only counts "documents").

Thanks,
Andy Pickler
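
The counting described above can be sketched outside Solr as follows.  This
is not a Solr feature, just a client-side illustration of "total occurrences
per term within a date range".  The dates attached to the three example
documents are made up, and the tokenizer is a plain lowercase word split
with no stop words or stemming.

```python
import re
from collections import Counter
from datetime import date

# (dateCreated, content) pairs; the dates are invented for illustration.
docs = [
    (date(2013, 4, 1), "I think, therefore I am like you"),
    (date(2013, 4, 1), "You think too much"),
    (date(2013, 4, 2), "I think that I think much as you"),
]

def top_terms(docs, start, end, n=10):
    """Sum term occurrences over documents created in [start, end]."""
    totals = Counter()
    for created, content in docs:
        if start <= created <= end:
            # every occurrence counts, not just one per document
            totals.update(re.findall(r"\w+", content.lower()))
    return totals.most_common(n)

top = top_terms(docs, date(2013, 4, 1), date(2013, 4, 2), n=4)
print(top)
```

Over the full date range this gives think=4, i=4, you=3 and much=2, the same
totals as in the message.  Narrowing the range changes the totals, which is
the part a plain document-count facet cannot express.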

On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe wrote:

> So you have one document per user comment? Why not use faceting plus
> filtering on the "dateCreated" field? That would count "number of
> documents" for each term (so, in your case, if a term is used twice in one
> comment it would only count once). Is that what you are looking for?
>
> Tomás
>
>
> On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler 
> wrote:
>
> > Our company has an application that is "Facebook-like" for usage by
> > enterprise customers.  We'd like to do a report of "top 10 terms entered
> by
> > users over (some time period)".  With that in mind I'm using the
> > DataImportHandler to put all the relevant data from our database into a
> > Solr 'content' field:
> >
> >  > multiValued="false" required="true" termVectors="true"/>
> >
> > Along with the content is the 'dateCreated' for that content:
> >
> >  > multiValued="false" required="true"/>
> >
> > I'm struggling with the TermVectorComponent documentation to understand
> how
> > I can put together a query that answers the 'report' mentioned above.
>  For
> > each document I need each term counted however many times it is entered
> > (content of "I think what I think" would report 'think' as used twice).
> >  Does anyone have any insight as to whether I'm headed in the right
> > direction and then what my query would be?
> >
> > Thanks,
> > Andy Pickler
> >
>


Top 10 Terms in Index (by date)

2013-04-01 Thread Andy Pickler
Our company has an application that is "Facebook-like" for usage by
enterprise customers.  We'd like to do a report of "top 10 terms entered by
users over (some time period)".  With that in mind I'm using the
DataImportHandler to put all the relevant data from our database into a
Solr 'content' field:



Along with the content is the 'dateCreated' for that content:



I'm struggling with the TermVectorComponent documentation to understand how
I can put together a query that answers the 'report' mentioned above.  For
each document I need each term counted however many times it is entered
(content of "I think what I think" would report 'think' as used twice).
 Does anyone have any insight as to whether I'm headed in the right
direction and then what my query would be?

Thanks,
Andy Pickler
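
As a starting point for experimenting, a sketch of a TermVectorComponent
request restricted by date.  The /tvrh handler name is an assumption (it is
the name used in Solr's example solrconfig.xml and may differ in your
install), and the date range is hypothetical.  The response reports term
frequencies per document, so the per-term totals would still have to be
summed on the client.

```python
from urllib.parse import urlencode, parse_qs

def term_vector_query(start, end, rows=10):
    """Build a term-vector request limited to a dateCreated range."""
    params = {
        "q": "*:*",
        "fq": f"dateCreated:[{start} TO {end}]",
        "tv": "true",        # enable the TermVectorComponent
        "tv.tf": "true",     # return per-document term frequency
        "tv.fl": "content",  # field with termVectors="true"
        "rows": rows,
    }
    return "/tvrh?" + urlencode(params)

# Hypothetical one-month window in Solr's date format.
url = term_vector_query("2013-03-01T00:00:00Z", "2013-04-01T00:00:00Z")
print(url)
```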