PageRank sort

2009-04-23 Thread Marcus Herou
Hi.

I've posted before but here it goes again:

I have BlogData documents which are more or less 100% static, except for one
field: the PageRank.
I would like to sort on that field, and on the Lucene list I got these
answers:

1. Use two indexes and a ParallelReader
2. Use a FieldScoreQuery containing the PageRank field.
3. Use a CustomScoreQuery which uses the FieldScoreQuery combined with other
Queries (the actual search).

I think I could use this pattern as well:
1. Use two indexes and a ParallelReader
2. Normal search and Sort on the PageRank column (perhaps consuming more
memory)

Does anyone have an idea of how to implement these patterns in Solr?
I have never extended Solr, but I am not afraid of doing so if someone pushes
me in the right direction.

Kindly

//Marcus




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Re: PageRank sort

2009-04-24 Thread Grant Ingersoll

How often are you updating the rank?

You might also be able to keep the rank info in a flat file via the  
ExternalFileField and the FileFloatSource and do FunctionQuery stuff  
that way.   However, I don't know how that handles refreshing data or  
if it would be efficient in your case.




--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: PageRank sort

2009-04-24 Thread Marcus Herou
Hi.

Comments inline.

On Fri, Apr 24, 2009 at 1:00 PM, Grant Ingersoll wrote:

> How often are you updating the rank?


The goal is to optimize the PageRank calculation algorithm so that we can have
continuous updates (one blog at a time, 24/7), but more likely we'll end up
refreshing the index once a week or so (hopefully each night).


>
>
> You might also be able to keep the rank info in a flat file via the
> ExternalFileField and the FileFloatSource and do FunctionQuery stuff that
> way.   However, I don't know how that handles refreshing data or if it would
> be efficient in your case.


Great! That seems like something that could work. It depends on how that field
gets re-read/indexed, I guess. Or is it used solely at query time? I feel
that googling ExternalFileField does not really give me the "meat" I need to
narrow this down. Any pointers and/or pseudocode?

It seems to be a generic issue with Lucene, since it is not really built in
a way that lets one plug in an external scoring mechanism (it has a very fast
internal one instead), but hopefully I'll sort this one out.

Thanks for the reply, really appreciated.

Kindly

//Marcus





Re: PageRank sort

2009-04-24 Thread Yonik Seeley
On Fri, Apr 24, 2009 at 1:39 PM, Marcus Herou
 wrote:
> Great! That seems like something that could work. Depends on how that field
> get's re-read/indexed I guess.

http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

It's a separate *text* file that just contains id/value pairs for a field.
Calculate your custom score and save it to that file.  Then call
commit so the file is re-read and all your scores will be updated (and
usable in a function query).

So the short answer is, you should be able to do what you want with no
Solr customization in Java... it's all built in.

-Yonik
http://www.lucidimagination.com
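
A minimal sketch of the "calculate your score and save it to that file" step
described above. It assumes the usual ExternalFileField layout: a text file
named external_<fieldname> in the index data directory, one key=value pair per
line, where the key is the value of the configured keyField. The helper name
write_external_scores and the sample scores are made up for illustration, and
the commit that makes Solr re-read the file still has to be sent separately.

```python
import os
import tempfile


def write_external_scores(data_dir, field_name, scores):
    """Write key=value lines to <data_dir>/external_<field_name>.

    `scores` maps document keys (strings) to float scores. The file is
    written to a temporary name first and then renamed, so a reader never
    sees a half-written file.
    """
    target = os.path.join(data_dir, "external_" + field_name)
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".external_tmp_")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for doc_key, score in sorted(scores.items()):
            out.write(f"{doc_key}={score}\n")
    os.replace(tmp_path, target)  # atomic rename on the same filesystem
    return target


if __name__ == "__main__":
    # Hypothetical data directory and scores, for demonstration only.
    demo_dir = tempfile.mkdtemp()
    path = write_external_scores(demo_dir, "blogRank", {"1": 1.0, "2": 2.0})
    print(open(path, encoding="utf-8").read())
```

Writing to a temporary file and renaming it is a design choice, not something
the field type requires; it just avoids a commit picking up a partial file.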


Re: PageRank sort

2009-04-24 Thread Marcus Herou
And I published the setup here:
http://dev.tailsweep.com/solr-external-scoring/en/

/M



Re: PageRank sort

2009-04-24 Thread Marcus Herou
Works like a charm!

Thank you sir.

//Marcus



Re: PageRank sort

2009-04-24 Thread Marcus Herou
That is fantastic. I am creating a really small index right now, trying to
figure out how to implement the FunctionQuery for this.

//Marcus



Re: PageRank sort

2009-04-24 Thread Yonik Seeley
You probably want to mix the custom score with the normal relevancy
score... to add, use a normal boolean query.  To multiply, check out
boosted query:
http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html

For other options, use a more complex function query with the new
query() capability (need to use 1.4 trunk for that though).

Examples:

&q={!boost b=myScore v=$qq}&qq=my normal lucene query
  OR for a dismax relevancy query,
&q={!boost b=myScore v=$qq}&qq={!dismax qf=text_all pf=text_all}solr rocks

If the {! type of syntax looks new, check out
http://wiki.apache.org/solr/LocalParams
powerful stuff!

-Yonik
http://www.lucidimagination.com
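
The boost examples above contain spaces, braces and a "$", so they need URL
encoding when sent over HTTP. A small sketch of building such a request URL
follows; the base URL, the field name blogRank and the helper name
boost_query_url are placeholders chosen for illustration, not part of Solr.

```python
from urllib.parse import urlencode


def boost_query_url(base_url, boost_field, user_query, rows=10):
    """Build a select URL that multiplies the relevancy score of
    `user_query` by the value of `boost_field` via the {!boost} parser."""
    params = {
        # The local-params prefix reads the real query from the qq parameter.
        "q": "{!boost b=%s v=$qq}" % boost_field,
        "qq": user_query,
        "rows": rows,
    }
    return base_url + "?" + urlencode(params)


# Placeholder host and core path; adjust to the actual Solr instance.
url = boost_query_url("http://localhost:8983/solr/select", "blogRank", "title:solr")

if __name__ == "__main__":
    print(url)
```

Keeping the user's query in a separate qq parameter means it never has to be
escaped inside the local-params block itself.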


Re: PageRank sort

2009-04-24 Thread Marcus Herou
That seems wise... PageRank * Text-based Scoring.

So you mean in my stupid case that:
GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=*:*'
would yield the same results as:
GET 
"http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q=*:*_val_:\"log(blogRank)\""

since I have no text data

but if I introduce a tokenized textfield (title).
Example:

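
<add>
<doc><field name="blogId">1</field><field name="title">solr solr solr</field></doc>
<doc><field name="blogId">2</field><field name="title">solr</field></doc>
</add>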

where blogId=1 has a blogRank of 1
and blogId=2 has a blogRank of 2,
and if I searched for "solr":
GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr'

I might get blogId=1 as number one in the results, even though it has the lower
blogRank, due to the higher frequency of the term "solr"?

Did I understand this correctly?

//Marcus




Re: PageRank sort

2009-04-24 Thread Marcus Herou
Cool!

GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'



Re: PageRank sort

2009-04-24 Thread Marcus Herou
Meant this part:
m...@mahe-laptop:~$ GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'




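responseHeader:
  status: 0
  QTime: 121
  params:
    qq: title:solr
    start: 0
    indent: on
    q: {!boost b=blogRank v=$qq}
    debugQuery: on
    rows: 100

result (five documents in ranked order, two field values each):
  3, 4
  1, 1
  1, 2
  4, 5
  2, 3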
 


debug:
  rawquerystring: {!boost b=blogRank v=$qq}
  querystring: {!boost b=blogRank v=$qq}
  parsedquery: BoostedQuery(boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)))
  parsedquery_toString: boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/))
  explain:

5.7488723 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.9162908 = (MATCH) fieldWeight(title:solr in 13), product of:
    2.0 = tf(termFreq(title:solr)=4)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.5 = fieldNorm(field=title, doc=13)
  3.0 = float(blogRank{type=blogRankFile,properties=})=3.0

3.8325815 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.9162908 = (MATCH) fieldWeight(title:solr in 10), product of:
    1.0 = tf(termFreq(title:solr)=1)
    1.9162908 = idf(docFreq=5, numDocs=5)
    1.0 = fieldNorm(field=title, doc=10)
  2.0 = float(blogRank{type=blogRankFile,properties=})=2.0

3.3875556 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.6937778 = (MATCH) fieldWeight(title:solr in 11), product of:
    1.4142135 = tf(termFreq(title:solr)=2)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.625 = fieldNorm(field=title, doc=11)
  2.0 = float(blogRank{type=blogRankFile,properties=})=2.0

1.8746685 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.8746685 = (MATCH) fieldWeight(title:solr in 14), product of:
    2.236068 = tf(termFreq(title:solr)=5)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.4375 = fieldNorm(field=title, doc=14)
  1.0 = float(blogRank{type=blogRankFile,properties=})=1.0

1.6595565 = (MATCH) boost(title:solr,FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)), product of:
  1.6595565 = (MATCH) fieldWeight(title:solr in 12), product of:
    1.7320508 = tf(termFreq(title:solr)=3)
    1.9162908 = idf(docFreq=5, numDocs=5)
    0.5 = fieldNorm(field=title, doc=12)
  1.0 = float(blogRank{type=blogRankFile,properties=})=1.0

  QParser: LuceneQParser
  boost_str: blogRank
  boost_parsed: org.apache.solr.search.function.FileFloatSource:FileFloatSource(field=blogRank,keyField=blogId,defVal=0.0,dataDir=/srv/solr/test/data/)
  timing:
    time: 106.0
    prepare:
      time: 55.0
      per-component times: 53.0, 0.0, 0.0, 0.0, 0.0, 0.0
    process:
      time: 50.0
      per-component times: 42.0, 0.0, 0.0, 0.0, 0.0, 7.0
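
The explain section above can be checked by hand: each score is
tf * idf * fieldNorm * blogRank, with tf being the square root of the term
frequency. The sketch below redoes that arithmetic. The idf, fieldNorm and
blogRank values are copied from the explain output rather than recomputed, and
the names boosted_score and DOCS are only illustrative.

```python
import math

# idf value copied from the explain output (identical for every document).
IDF = 1.9162908

# (termFreq, fieldNorm, blogRank) per document, copied from the explain
# output in the order the results were returned.
DOCS = [
    (4, 0.5, 3.0),
    (1, 1.0, 2.0),
    (2, 0.625, 2.0),
    (5, 0.4375, 1.0),
    (3, 0.5, 1.0),
]


def boosted_score(term_freq, field_norm, blog_rank, idf=IDF):
    """Text relevancy (tf * idf * fieldNorm) multiplied by the external rank."""
    tf = math.sqrt(term_freq)  # tf as shown in the explain: sqrt(termFreq)
    return tf * idf * field_norm * blog_rank


scores = [boosted_score(*doc) for doc in DOCS]

if __name__ == "__main__":
    for doc, score in zip(DOCS, scores):
        print(doc, round(score, 7))
```

This also answers the earlier question: the document with the highest term
frequency (5) has a blogRank of 1 and ends up fourth, while a document with a
single occurrence but a blogRank of 2 is second, so in this small index the
rank multiplier outweighs term frequency.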

  
 




On Sat, Apr 25, 2009 at 12:49 AM, Marcus Herou
wrote:

> Cool!
>
> GET '
> http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boostb=blogRank
>  v=$qq}&qq=title:solr&debugQuery=on'
>
> On Sat, Apr 25, 2009 at 12:43 AM, Marcus Herou  > wrote:
>
>> That seems wise... PageRank * Text-based Scoring.
>>
>> So you mean in my stupid case that:
>> GET '
>> http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boostb=blogRank
>>  v=$qq}&qq=*:*'
>> would yield the same results as:
>> GET "
>> http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q=*:*_val_:\"log(blogRank)\""
>>
>> since I have no text data
>>
>> but if I introduce a tokenized textfield (title).
>> Example:
>>
>> 
>> 1solr solr 
>> solr
>>
>>
>> 2solr/field>
>> 
>>
>> where blogId=1 had blogRank of 1
>> where blogId=2 had blogRank of 2
>> and if I searched for "solr"
>> GET '
>> http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boostb=blogRank
>>  v=$qq}&qq=title:solr'
>>
>> I might get blogId=1 as nr1 in the results even though it had lower
>> blogRank due to the higher frequency of the term "solr" ?
>>
>> Did I understand this correctly ?
>>
>> //Marcus
>>
>>
>>
>> On Sat, Apr 25, 2009 at 12:07 AM, Yonik Seeley <
>> yo...@lucidimagination.com> wrote:
>>
>>> You probably want to mix the custom score with the normal relevancy
>>> score... to add, use a normal boolean query.  To multiply, check out
>>> boosted query:
>>>
>>> http://lucene.apache.org/solr/api/org/apache/solr/search/BoostQParserPlugin.html
>>>
>>> For other options, use a more complex function que