Re: processing documents in solr

2013-07-29 Thread Joe Zhang
I'll try reindexing the timestamp.

The id-creation approach suggested by Erick sounds attractive, but the
nutch/solr integration seems rather tight. I don't know where to break in to
insert the id into Solr.


On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson erickerick...@gmail.com wrote:

 No, SolrJ doesn't provide this automatically. You'd be providing the
 counter by inserting it into the document as you created new docs.

 You could do this with any kind of document creation you are
 using.

 Best
 Erick

 On Mon, Jul 29, 2013 at 2:51 AM, Aditya findbestopensou...@gmail.com
 wrote:
  Hi,
 
  The easiest solution would be to have timestamp indexed. Is there any
 issue
  in doing re-indexing?
  If you want to process records in batch then you need an ordered list and
  a bookmark. You require a field to sort on and maintain a counter / last id
  as a bookmark. This is mandatory to solve your problem.
 
  If you don't want to re-index, then you need to maintain information
   related to visited nodes. Have a database / solr core which maintains a
  list of IDs which were already processed. Fetch records from Solr; for each
  record, check that DB to see whether it was already processed.
 
  Regards
  Aditya
  www.findbestopensource.com
 
 
 
 
 
  On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang smartag...@gmail.com
 wrote:
 
  Basically, I was thinking about running a range query like Shawn
 suggested
  on the tstamp field, but unfortunately it was not indexed. Range queries
  only work on indexed fields, right?
 
 
  On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com
 wrote:
 
    I've been thinking about the tstamp solution in the past few days, but
    too bad, the field is available but not indexed...
  
   I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
   counter value. If yes, that would be equivalent to an autoincrement
 id.
  I'm
   indexing from Nutch though; don't know how to feed in such counter...
  
  
   On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
  
   Why wouldn't a simple timestamp work for the ordering? Although
   I guess simple timestamp isn't really simple if the time settings
   change.
  
   So how about a simple counter field in your documents? Assuming
    you're indexing from SolrJ, your setup is to query q=*:*&sort=counter
    desc.
   Take the counter from the first document returned. Increment for
   each doc for the life of the indexing run. Now you've got, for all
  intents
   and purposes, an identity field albeit manually maintained.
  
   Then use your counter field as Shawn suggests for pulling all the
   data out.
  
   FWIW,
   Erick
  
   On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
   mcucchi...@apache.org wrote:
 In both cases, for better performance, first I'd load just all the
   IDs; afterwards, during processing, I'd load each document.
 For what concerns the incremental requirement, it should not be
    difficult to write a hash function which maps a non-numerical id to a value.
 On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com
 wrote:
   
Dear list:
   
I have an ever-growing solr repository, and I need to process
 every
   single
document to extract statistics. What would be a reasonable process
  that
satisfies the following properties:
   
- Exhaustive: I have to traverse every single document
- Incremental: in other words, it has to allow me to divide and
   conquer ---
if I have processed the first 20k docs, next time I can start with
   20001.
   
A simple *:* query would satisfy the 1st but not the 2nd
 property.
  In
fact, given that the processing will take very long, and the
  repository
keeps growing, it is not even clear that the exhaustiveness is
   achieved.
   
I'm running solr 3.6.2 in a single-machine setting; no hadoop
   capability
yet. But I guess the same issues still hold even if I have the
 solr
   cloud
environment, right, say in each shard?
   
Any help would be greatly appreciated.
   
Joe
   
  
  
  
 



Re: processing documents in solr

2013-07-28 Thread Joe Zhang
I've been thinking about the tstamp solution in the past few days, but too
bad, the field is available but not indexed...

I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
counter value. If yes, that would be equivalent to an autoincrement id. I'm
indexing from Nutch though; don't know how to feed in such counter...


On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com wrote:

 Why wouldn't a simple timestamp work for the ordering? Although
 I guess simple timestamp isn't really simple if the time settings
 change.

 So how about a simple counter field in your documents? Assuming
 you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
 Take the counter from the first document returned. Increment for
 each doc for the life of the indexing run. Now you've got, for all intents
 and purposes, an identity field albeit manually maintained.

 Then use your counter field as Shawn suggests for pulling all the
 data out.

 FWIW,
 Erick
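
 A minimal SolrJ sketch of the counter scheme described above. The field
 name "counter" is an assumption (it would be a stored, indexed long field),
 it presumes a single indexing client, and it is untested:

    import java.util.Collections;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.common.SolrInputDocument;

    public class CounterIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Fetch the highest counter value indexed so far.
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1);
            q.addSortField("counter", SolrQuery.ORDER.desc);
            SolrDocumentList top = server.query(q).getResults();
            long counter = top.isEmpty() ? 0L : (Long) top.get(0).getFieldValue("counter");

            // Stamp an incremented counter onto every document added in this run.
            List<SolrInputDocument> newDocs = buildDocs(); // hypothetical source of new docs
            for (SolrInputDocument doc : newDocs) {
                doc.addField("counter", ++counter);
            }
            server.add(newDocs);
            server.commit();
        }

        private static List<SolrInputDocument> buildDocs() {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "http://example.com/doc1"); // illustrative only
            return Collections.singletonList(doc);
        }
    }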

 On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
 mcucchi...@apache.org wrote:
  In both cases, for better performance, first I'd load just all the IDs;
  afterwards, during processing, I'd load each document.
  For what concerns the incremental requirement, it should not be difficult
  to write a hash function which maps a non-numerical id to a value.
   On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote:
 
  Dear list:
 
  I have an ever-growing solr repository, and I need to process every
 single
  document to extract statistics. What would be a reasonable process that
  satisfies the following properties:
 
  - Exhaustive: I have to traverse every single document
  - Incremental: in other words, it has to allow me to divide and conquer
 ---
  if I have processed the first 20k docs, next time I can start with
 20001.
 
  A simple *:* query would satisfy the 1st but not the 2nd property. In
  fact, given that the processing will take very long, and the repository
  keeps growing, it is not even clear that the exhaustiveness is achieved.
 
  I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
  yet. But I guess the same issues still hold even if I have the solr
 cloud
  environment, right, say in each shard?
 
  Any help would be greatly appreciated.
 
  Joe
 



Re: processing documents in solr

2013-07-28 Thread Joe Zhang
Basically, I was thinking about running a range query like Shawn suggested
on the tstamp field, but unfortunately it was not indexed. Range queries
only work on indexed fields, right?


On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com wrote:

 I've been thinking about the tstamp solution in the past few days, but too
 bad, the field is available but not indexed...

 I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
 counter value. If yes, that would be equivalent to an autoincrement id. I'm
 indexing from Nutch though; don't know how to feed in such counter...


 On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Why wouldn't a simple timestamp work for the ordering? Although
 I guess simple timestamp isn't really simple if the time settings
 change.

 So how about a simple counter field in your documents? Assuming
  you're indexing from SolrJ, your setup is to query q=*:*&sort=counter
  desc.
 Take the counter from the first document returned. Increment for
 each doc for the life of the indexing run. Now you've got, for all intents
 and purposes, an identity field albeit manually maintained.

 Then use your counter field as Shawn suggests for pulling all the
 data out.

 FWIW,
 Erick

 On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
 mcucchi...@apache.org wrote:
  In both cases, for better performance, first I'd load just all the IDs;
  afterwards, during processing, I'd load each document.
  For what concerns the incremental requirement, it should not be
  difficult to write a hash function which maps a non-numerical id to a value.
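
  One possible reading of this hint (my interpretation, not Maurizio's
  code): hash each non-numeric id into a fixed number of stable buckets,
  then walk the index one bucket at a time, so an interrupted run only has
  to redo its current bucket. A small Java sketch:

    import java.util.zip.CRC32;

    public class IdBucket {
        // Map an arbitrary string id onto one of numBuckets stable buckets.
        // CRC32 is used instead of String.hashCode to make the mapping
        // explicit and stable across JVMs and library versions.
        static int bucketOf(String id, int numBuckets) throws Exception {
            CRC32 crc = new CRC32();
            crc.update(id.getBytes("UTF-8"));
            return (int) (crc.getValue() % numBuckets);
        }

        public static void main(String[] args) throws Exception {
            System.out.println(bucketOf("http://example.com/doc1", 16));
        }
    }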
   On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote:
 
  Dear list:
 
  I have an ever-growing solr repository, and I need to process every
 single
  document to extract statistics. What would be a reasonable process that
   satisfies the following properties:
 
  - Exhaustive: I have to traverse every single document
  - Incremental: in other words, it has to allow me to divide and
 conquer ---
  if I have processed the first 20k docs, next time I can start with
 20001.
 
  A simple *:* query would satisfy the 1st but not the 2nd property. In
  fact, given that the processing will take very long, and the repository
  keeps growing, it is not even clear that the exhaustiveness is
 achieved.
 
  I'm running solr 3.6.2 in a single-machine setting; no hadoop
 capability
  yet. But I guess the same issues still hold even if I have the solr
 cloud
  environment, right, say in each shard?
 
  Any help would be greatly appreciated.
 
  Joe
 





Re: processing documents in solr

2013-07-27 Thread Joe Zhang
On a related note, inspired by what you said, Shawn, an auto-increment id
seems perfect here. Yet I found there is no such support in Solr. The UUID
only guarantees uniqueness.


On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang smartag...@gmail.com wrote:

 Thanks for your kind reply, Shawn.

 On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/26/2013 11:02 PM, Joe Zhang wrote:
  I have an ever-growing solr repository, and I need to process every
 single
  document to extract statistics. What would be a reasonable process that
   satisfies the following properties:
 
  - Exhaustive: I have to traverse every single document
  - Incremental: in other words, it has to allow me to divide and conquer
 ---
  if I have processed the first 20k docs, next time I can start with
 20001.

 If your index isn't very big, a *:* query with rows and start parameters
 is perfectly acceptable.  Performance is terrible for this method when
 the index gets huge, though.


  == Essentially we are doing pagination here, right? If performance is
 not the concern, given that the index is dynamic, does the order of
 entries remain stable over time?



 If id is your uniqueKey field, here's how you can do it.  If that's
 not your uniqueKey field, substitute your uniqueKey field for id.  This
 method doesn't work properly if you don't use a field with values that
 are guaranteed to be unique.

 For the first query, send a query with these parameters, where NN is
 the number of docs you want to retrieve at once:
  q=*:*&rows=NN&sort=id asc

 For each subsequent query, use the following parameters, where XXX is
 the highest id value seen in the previous query:
  q={XXX TO *}&rows=NN&sort=id asc

 == This approach seems to require that the id field is numerical, right?
 I have a text-based id that is unique.

 == I'm not sure I understand the q={XXX TO *} part -- wouldn't query
 be matched against the default search field, which could be content, for
 example? How would that do the job?


 As soon as you see a numFound value less than NN, you will know that
 there's no more data.

 Generally speaking, you'd want to avoid updating the index while doing
 these queries.  If you never replace existing documents and you can
 guarantee that the value in the uniqueKey field for new documents will
 always be higher than any previous value, then you could continue
 updating the index.  A database autoincrement field would qualify for
 that condition.

 Thanks,
 Shawn





Re: processing documents in solr

2013-07-27 Thread Joe Zhang
On Fri, Jul 26, 2013 at 11:18 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/26/2013 11:50 PM, Joe Zhang wrote:
   == Essentially we are doing pagination here, right? If performance is
 not
  the concern, given that the index is dynamic, does the order of
  entries remain stable over time?

 Yes, it's pagination.  Just like the other method that I've described in
 detail, you'd have to avoid updating the index while you were getting
 information.  Unless you can come up with a sort parameter that's
 guaranteed to make sure that new documents are at the end, any changes
 to the index during the retrieval process will make it impossible to
 retrieve every document.

== What I can guarantee is that there is no deletion, but I guess this is
not equivalent to "newly added docs are at the end", right?

== I believe you are right about performance. The retrieved set becomes
larger and larger.


  == This approach seems to require that the id field is numerical,
 right?
  I have a text-based id that is unique.

 StrField types work perfectly with range queries.  As long as it's not a
 tokenized field, TextField works properly with range queries too.
 KeywordTokenizer is OK, as long you don't use filters that create
 additional tokens.  Some examples that create additional tokens are
 WordDelimiterFilter and EdgeNgramFilter.


== so a url field would work fine?




 == I'm not sure I understand the q={XXX TO *} part -- wouldn't query
 be
  matched against the default search field, which could be content, for
  example? How would that do the job?

 You are correct, I was too hasty in constructing the query.  That should
 be:
  q=id:{XXX TO *}&rows=NN&sort=id asc

 You could speed things up if you don't need to see all stored fields in
 the response by using the fl parameter to only return the fields that
 you need.

 Responding to your additional message about an autoincrement field -
 that would only be possible if you are importing from a data source that
 supports autoincrement, like MySQL.  Solr itself has no support for
 autoincrement.

 Thanks,
 Shawn
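
 A rough SolrJ sketch of the whole loop Shawn describes (endpoint, page
 size, and the processing hook are illustrative; untested):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class ExhaustiveWalk {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            final int rows = 1000;
            String lastId = null; // persist this value to resume a later run
            while (true) {
                // First page: everything. Later pages: only ids above the last one seen.
                // escapeQueryChars handles ':' and '/' in URL-valued ids.
                String qs = (lastId == null)
                        ? "*:*"
                        : "id:{" + ClientUtils.escapeQueryChars(lastId) + " TO *}";
                SolrQuery q = new SolrQuery(qs);
                q.setRows(rows);
                q.addSortField("id", SolrQuery.ORDER.asc);
                q.setFields("id"); // fl: fetch only the fields you actually need
                SolrDocumentList page = server.query(q).getResults();
                for (SolrDocument doc : page) {
                    lastId = (String) doc.getFieldValue("id");
                    // ... extract statistics from doc here ...
                }
                if (page.size() < rows) break; // short page means no more data
            }
        }
    }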




Re: processing documents in solr

2013-07-27 Thread Joe Zhang
Thanks.


On Fri, Jul 26, 2013 at 11:34 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/27/2013 12:30 AM, Joe Zhang wrote:
  == so a url field would work fine?

 As long as it's guaranteed unique on every document (especially if it is
 your uniqueKey) and goes into the index as a single token, that should
 work just fine for the range queries I've described.

 Thanks,
 Shawn




Re: processing documents in solr

2013-07-27 Thread Joe Zhang
Thanks for sharing, Roman. I'll look into your code.

One more thought on your suggestion, Shawn. In fact, for the id, we need
more than unique and rangeable; we also need some sense of atomic
values. Your approach might run into trouble with a text-based id field, say:

the id/key has values 'a', 'c', 'f', 'g', and our pagesize is 2. Your
suggestion would work fine. But with newly added documents, there is no
guarantee that they are not going to use the key value 'b'. And this new
document would be missed in your algorithm, right?


On Sat, Jul 27, 2013 at 5:32 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Dear list,
 I've written a special processor exactly for this kind of operation


 https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch

 This is how we use it
 http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch

 It is capable of processing an index of 200GB in a few minutes;
 copying/streaming large amounts of data is normal

 If there is general interest, we can create a jira issue - but given my
 current workload, it will take longer, and somebody else will also
 *have to* invest their time and energy in testing it, reporting, etc. Of
 course, feel free to create the jira yourself or reuse the code -
 hopefully, you will improve it and let me know ;-)

 Roman
 On 27 Jul 2013 01:03, Joe Zhang smartag...@gmail.com wrote:

  Dear list:
 
  I have an ever-growing solr repository, and I need to process every
 single
  document to extract statistics. What would be a reasonable process that
   satisfies the following properties:
 
  - Exhaustive: I have to traverse every single document
  - Incremental: in other words, it has to allow me to divide and conquer
 ---
  if I have processed the first 20k docs, next time I can start with 20001.
 
  A simple *:* query would satisfy the 1st but not the 2nd property. In
  fact, given that the processing will take very long, and the repository
  keeps growing, it is not even clear that the exhaustiveness is achieved.
 
  I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
  yet. But I guess the same issues still hold even if I have the solr cloud
  environment, right, say in each shard?
 
  Any help would be greatly appreciated.
 
  Joe
 



Re: processing documents in solr

2013-07-27 Thread Joe Zhang
I have a constantly growing index, so not updating the index isn't
practical...

Going back to the beginning of this thread: when we use the vanilla
*:* + pagination approach, would the ordering of documents remain stable?
The index is dynamic: updates/insertions only, no deletions.


On Sat, Jul 27, 2013 at 10:28 AM, Shawn Heisey s...@elyograg.org wrote:

 On 7/27/2013 11:17 AM, Joe Zhang wrote:
  Thanks for sharing, Roman. I'll look into your code.
 
  One more thought on your suggestion, Shawn. In fact, for the id, we need
  more than unique and rangeable; we also need some sense of atomic
   values. Your approach might run into trouble with a text-based id field,
 say:
 
  the id/key has values 'a', 'c', 'f', 'g', and our pagesize is 2. Your
  suggestion would work fine. But with newly added documents, there is no
  guarantee that they are not going to use the key value 'b'. And this new
  document would be missed in your algorithm, right?

 That's why I said that you would either have to not update the index or
 ensure that (in your example) a 'b' document never gets added.  Because
 you can't make that kind of guarantee in most situations, not updating
 the index is safer.

 Thanks,
 Shawn




processing documents in solr

2013-07-26 Thread Joe Zhang
Dear list:

I have an ever-growing solr repository, and I need to process every single
document to extract statistics. What would be a reasonable process that
satisfies the following properties:

- Exhaustive: I have to traverse every single document
- Incremental: in other words, it has to allow me to divide and conquer ---
if I have processed the first 20k docs, next time I can start with 20001.

A simple *:* query would satisfy the 1st but not the 2nd property. In
fact, given that the processing will take very long, and the repository
keeps growing, it is not even clear that the exhaustiveness is achieved.

I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
yet. But I guess the same issues still hold even if I have the solr cloud
environment, right, say in each shard?

Any help would be greatly appreciated.

Joe


Re: processing documents in solr

2013-07-26 Thread Joe Zhang
Thanks for your kind reply, Shawn.

On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/26/2013 11:02 PM, Joe Zhang wrote:
  I have an ever-growing solr repository, and I need to process every
 single
  document to extract statistics. What would be a reasonable process that
   satisfies the following properties:
 
  - Exhaustive: I have to traverse every single document
  - Incremental: in other words, it has to allow me to divide and conquer
 ---
  if I have processed the first 20k docs, next time I can start with 20001.

 If your index isn't very big, a *:* query with rows and start parameters
 is perfectly acceptable.  Performance is terrible for this method when
 the index gets huge, though.


== Essentially we are doing pagination here, right? If performance is not
the concern, given that the index is dynamic, does the order of
entries remain stable over time?



 If id is your uniqueKey field, here's how you can do it.  If that's
 not your uniqueKey field, substitute your uniqueKey field for id.  This
 method doesn't work properly if you don't use a field with values that
 are guaranteed to be unique.

 For the first query, send a query with these parameters, where NN is
 the number of docs you want to retrieve at once:
 q=*:*&rows=NN&sort=id asc

 For each subsequent query, use the following parameters, where XXX is
 the highest id value seen in the previous query:
 q={XXX TO *}&rows=NN&sort=id asc

 == This approach seems to require that the id field is numerical, right?
I have a text-based id that is unique.

== I'm not sure I understand the q={XXX TO *} part -- wouldn't query be
matched against the default search field, which could be content, for
example? How would that do the job?


 As soon as you see a numFound value less than NN, you will know that
 there's no more data.

 Generally speaking, you'd want to avoid updating the index while doing
 these queries.  If you never replace existing documents and you can
 guarantee that the value in the uniqueKey field for new documents will
 always be higher than any previous value, then you could continue
 updating the index.  A database autoincrement field would qualify for
 that condition.

 Thanks,
 Shawn
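
 To make the request sequence concrete (page size and id values are
 illustrative; note the id: field prefix, per Shawn's correction that
 appears earlier in this digest):

     first request:  q=*:*&rows=100&sort=id asc
     next request:   q=id:{doc00199 TO *}&rows=100&sort=id asc
                     (doc00199 being the highest id in the previous page)

 Repeat, carrying the highest id forward, until a page comes back with
 fewer than 100 documents.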




Re: Question about field boost

2013-07-23 Thread Joe Zhang
I'm not sure I understand, Erick. I don't have a text field in my schema;
title and content are both legal fields.


On Tue, Jul 23, 2013 at 5:15 AM, Erick Erickson erickerick...@gmail.com wrote:

 this isn't doing what you think.
 title^10 content
 is actually parsed as

  text:title^10 text:content

 where text is my default search field.

 assuming title is a field. If you look a little
 farther up the debug output you'll see that.

 You probably want
 title:content^100 or some such?

 Erick

 On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com
 wrote:
  That means that for that document china occurs in the title vs.
 snowden
  found in a document but not in the title.
 
 
  -- Jack Krupansky
 
  -Original Message- From: Joe Zhang
  Sent: Tuesday, July 23, 2013 12:52 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Question about field boost
 
 
  Is my reading correct that the boost is only applied on china but not
  snowden? How can that be?
 
  My query is: q=china+snowden&qf=title^10 content
 
 
  On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote:
 
  Thanks for your hint, Jack. Here are the debug results, which I'm having
  a hard time deciphering (the two terms are china and snowden)...
 
  0.26839527 = (MATCH) sum of:
0.26839527 = (MATCH) sum of:
  0.26757246 = (MATCH) max of:
7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
  0.019873314 = queryWeight(content:china), product of:
1.6649085 = idf(docFreq=46832, maxDocs=91058)
0.01193658 = queryNorm
  0.039825942 = (MATCH) fieldWeight(content:china in 249), product
  of:
4.8989797 = tf(termFreq(content:china)=24)
1.6649085 = idf(docFreq=46832, maxDocs=91058)
0.0048828125 = fieldNorm(field=content, doc=249)
0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
  0.5836803 = queryWeight(title:china^10.0), product of:
10.0 = boost
4.8898454 = idf(docFreq=1861, maxDocs=91058)
0.01193658 = queryNorm
  0.45842302 = (MATCH) fieldWeight(title:china in 249), product
 of:
1.0 = tf(termFreq(title:china)=1)
4.8898454 = idf(docFreq=1861, maxDocs=91058)
0.09375 = fieldNorm(field=title, doc=249)
  8.2282536E-4 = (MATCH) max of:
8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
  0.03407834 = queryWeight(content:snowden), product of:
2.8549502 = idf(docFreq=14246, maxDocs=91058)
0.01193658 = queryNorm
  0.024145111 = (MATCH) fieldWeight(content:snowden in 249),
 product
  of:
1.7320508 = tf(termFreq(content:snowden)=3)
2.8549502 = idf(docFreq=14246, maxDocs=91058)
0.0048828125 = fieldNorm(field=content, doc=249)
 
 
  On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky
  j...@basetechnology.comwrote:
 
  Maybe you're not doing anything wrong - other than having an artificial
  expectation of what the true relevance of your data actually is. Many
  factors go into relevance scoring. You need to look at all aspects of
  your
  data.
 
  Maybe your terms don't occur in your titles the way you think they do.
 
  Maybe you need a boost of 500 or more...
 
  Lots of potential maybes.
 
  Relevancy tuning is an art and craft, hardly a science.
 
  Step one: Know your data, inside and out.
 
  Use the debugQuery=true parameter on your queries and see how much of
 the
  score is dominated by your query terms in the non-title fields.
 
  -- Jack Krupansky
 
  -Original Message- From: Joe Zhang
  Sent: Monday, July 22, 2013 11:06 PM
  To: solr-user@lucene.apache.org
  Subject: Question about field boost
 
 
  Dear Solr experts:
 
  Here is my query:
 
  defType=dismax&q=term1+term2&qf=title^100 content
 
  Apparently (at least I thought) my intention is to boost the title
 field.
  While I'm getting some non-trivial results, I'm surprised that the
  documents with both term1 and term2 in title (I know such docs do exist
  in
  my repository) were not returned (or maybe ranked very low). The
  situation
  does not change even when I use much larger boost factors.
 
  What am I doing wrong?
 
 
 
 



Question about field boost

2013-07-22 Thread Joe Zhang
Dear Solr experts:

Here is my query:

defType=dismax&q=term1+term2&qf=title^100 content

Apparently (at least I thought) my intention is to boost the title field.
While I'm getting some non-trivial results, I'm surprised that the
documents with both term1 and term2 in title (I know such docs do exist in
my repository) were not returned (or maybe ranked very low). The situation
does not change even when I use much larger boost factors.

What am I doing wrong?


Re: Question about field boost

2013-07-22 Thread Joe Zhang
Thanks for your hint, Jack. Here are the debug results, which I'm having a
hard time deciphering (the two terms are china and snowden)...

0.26839527 = (MATCH) sum of:
  0.26839527 = (MATCH) sum of:
0.26757246 = (MATCH) max of:
  7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
0.019873314 = queryWeight(content:china), product of:
  1.6649085 = idf(docFreq=46832, maxDocs=91058)
  0.01193658 = queryNorm
0.039825942 = (MATCH) fieldWeight(content:china in 249), product of:
  4.8989797 = tf(termFreq(content:china)=24)
  1.6649085 = idf(docFreq=46832, maxDocs=91058)
  0.0048828125 = fieldNorm(field=content, doc=249)
  0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
0.5836803 = queryWeight(title:china^10.0), product of:
  10.0 = boost
  4.8898454 = idf(docFreq=1861, maxDocs=91058)
  0.01193658 = queryNorm
0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
  1.0 = tf(termFreq(title:china)=1)
  4.8898454 = idf(docFreq=1861, maxDocs=91058)
  0.09375 = fieldNorm(field=title, doc=249)
8.2282536E-4 = (MATCH) max of:
  8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
0.03407834 = queryWeight(content:snowden), product of:
  2.8549502 = idf(docFreq=14246, maxDocs=91058)
  0.01193658 = queryNorm
0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product
of:
  1.7320508 = tf(termFreq(content:snowden)=3)
  2.8549502 = idf(docFreq=14246, maxDocs=91058)
  0.0048828125 = fieldNorm(field=content, doc=249)


On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky j...@basetechnology.com wrote:

 Maybe you're not doing anything wrong - other than having an artificial
 expectation of what the true relevance of your data actually is. Many
 factors go into relevance scoring. You need to look at all aspects of your
 data.

 Maybe your terms don't occur in your titles the way you think they do.

 Maybe you need a boost of 500 or more...

 Lots of potential maybes.

 Relevancy tuning is an art and craft, hardly a science.

 Step one: Know your data, inside and out.

 Use the debugQuery=true parameter on your queries and see how much of the
 score is dominated by your query terms in the non-title fields.

 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Monday, July 22, 2013 11:06 PM
 To: solr-user@lucene.apache.org
 Subject: Question about field boost


 Dear Solr experts:

 Here is my query:

 defType=dismax&q=term1+term2&qf=title^100 content

 Apparently (at least I thought) my intention is to boost the title field.
 While I'm getting some non-trivial results, I'm surprised that the
 documents with both term1 and term2 in title (I know such docs do exist in
 my repository) were not returned (or maybe ranked very low). The situation
 does not change even when I use much larger boost factors.

 What am I doing wrong?



Re: Question about field boost

2013-07-22 Thread Joe Zhang
Is my reading correct that the boost is only applied on china but not
snowden? How can that be?

My query is: q=china+snowden&qf=title^10 content


On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote:

 Thanks for your hint, Jack. Here are the debug results, which I'm having a
 hard time deciphering (the two terms are china and snowden)...

 0.26839527 = (MATCH) sum of:
   0.26839527 = (MATCH) sum of:
 0.26757246 = (MATCH) max of:
   7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
 0.019873314 = queryWeight(content:china), product of:
   1.6649085 = idf(docFreq=46832, maxDocs=91058)
   0.01193658 = queryNorm
 0.039825942 = (MATCH) fieldWeight(content:china in 249), product
 of:
   4.8989797 = tf(termFreq(content:china)=24)
   1.6649085 = idf(docFreq=46832, maxDocs=91058)
   0.0048828125 = fieldNorm(field=content, doc=249)
   0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
 0.5836803 = queryWeight(title:china^10.0), product of:
   10.0 = boost
   4.8898454 = idf(docFreq=1861, maxDocs=91058)
   0.01193658 = queryNorm
 0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
   1.0 = tf(termFreq(title:china)=1)
   4.8898454 = idf(docFreq=1861, maxDocs=91058)
   0.09375 = fieldNorm(field=title, doc=249)
 8.2282536E-4 = (MATCH) max of:
   8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
 0.03407834 = queryWeight(content:snowden), product of:
   2.8549502 = idf(docFreq=14246, maxDocs=91058)
   0.01193658 = queryNorm
 0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product
 of:
   1.7320508 = tf(termFreq(content:snowden)=3)
   2.8549502 = idf(docFreq=14246, maxDocs=91058)
   0.0048828125 = fieldNorm(field=content, doc=249)


 On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky 
  j...@basetechnology.com wrote:

 Maybe you're not doing anything wrong - other than having an artificial
 expectation of what the true relevance of your data actually is. Many
 factors go into relevance scoring. You need to look at all aspects of your
 data.

 Maybe your terms don't occur in your titles the way you think they do.

 Maybe you need a boost of 500 or more...

 Lots of potential maybes.

 Relevancy tuning is an art and craft, hardly a science.

 Step one: Know your data, inside and out.

 Use the debugQuery=true parameter on your queries and see how much of the
 score is dominated by your query terms in the non-title fields.

 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Monday, July 22, 2013 11:06 PM
 To: solr-user@lucene.apache.org
 Subject: Question about field boost


 Dear Solr experts:

 Here is my query:

 defType=dismax&q=term1+term2&qf=title^100 content

 Apparently (at least I thought) my intention is to boost the title field.
 While I'm getting some non-trivial results, I'm surprised that the
 documents with both term1 and term2 in title (I know such docs do exist in
 my repository) were not returned (or maybe ranked very low). The situation
 does not change even when I use much larger boost factors.

 What am I doing wrong?





zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
when I search a keyword (such as apple), most of the docs carry 0.0 as
score. Here is an example from explain:

<str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the
formula for computing fieldNorm?

Thanks!
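
For reference, in the DefaultSimilarity used by Lucene/Solr 3.x, the value
reported as fieldNorm is, to the best of my understanding:

    fieldNorm = documentBoost * fieldBoost * lengthNorm(field)
    lengthNorm(field) = 1 / sqrt(number of terms in the field)

The product is encoded in a single byte, which is why the values look
quantized; a document boost of 0.0 therefore forces fieldNorm, and with it
the whole score, to 0.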


Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
Yes, you are right, the boost on these documents are 0. I didn't provide
them, though.

I suppose the boost scores come from Nutch (yes, my solr indexes crawled
web docs). What could be wrong?

again, what exactly is the formula for fieldNorm?


On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote:

 Did you put a boost of 0.0 on the documents, as opposed to the default of
 1.0?

 x * 0.0 = 0.0

 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Friday, July 12, 2013 10:31 PM
 To: solr-user@lucene.apache.org
 Subject: zero-valued retrieval scores


 when I search a keyword (such as apple), most of the docs carry 0.0 as
 score. Here is an example from explain:

 <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
 0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
 Can somebody help me understand why fieldNorm is 0? What exactly is the
 formula for computing fieldNorm?

 Thanks!



Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
Thanks, Jack!


On Fri, Jul 12, 2013 at 9:37 PM, Jack Krupansky j...@basetechnology.com wrote:

 For the calculation of norm, see note number 6:

 http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

 You would need to talk to the Nutch guys to see why THEY are setting
 document boost to 0.0.


 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Friday, July 12, 2013 11:57 PM
 To: solr-user@lucene.apache.org
 Subject: Re: zero-valued retrieval scores


 Yes, you are right, the boost on these documents are 0. I didn't provide
 them, though.

 I suppose the boost scores come from Nutch (yes, my solr indexes crawled
 web docs). What could be wrong?

 again, what exactly is the formula for fieldNorm?


 On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote:

  Did you put a boost of 0.0 on the documents, as opposed to the default of
 1.0?

 x * 0.0 = 0.0

 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Friday, July 12, 2013 10:31 PM
 To: solr-user@lucene.apache.org
 Subject: zero-valued retrieval scores


 when I search a keyword (such as apple), most of the docs carry 0.0 as
 score. Here is an example from explain:

 <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
 0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
 Can somebody help me understand why fieldNorm is 0? What exactly is the
 formula for computing fieldNorm?

 Thanks!





Re: document id in nutch/solr

2013-06-23 Thread Joe Zhang
Can somebody help with this one, please?


On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote:

 A quite standard configuration of nutch seems to automatically map url
 to id. Two questions:

 - Where is such mapping defined? I can't find it anywhere in
 nutch-site.xml or schema.xml. The latter does define the id field as well
 as its uniqueness, but not the mapping.

 - Given that Nutch has already defined such an id, can I ask Solr to
 redefine id as UUID?
 <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

 - This leads to a related question: do solr and nutch have to have
 IDENTICAL schema.xml?



Re: what does a zero score mean?

2013-06-21 Thread Joe Zhang
So, the reason is that I'm getting zero values on fieldNorm. The
documentation tells me that there are 3 factors in play here:

- LengthNorm -- can this be zero?
- index-time boost -- is this the boost value we get from nutch?
- field-boost -- none specified.

Can somebody help here?


On Tue, Jun 18, 2013 at 7:22 AM, Upayavira u...@odoko.co.uk wrote:

 debugQuery=true adds an extra block of XML to the bottom that will give
 you extra info.

 Alternatively, add fl=*,[explain] to your URL. That'll give you an extra
 field in your output. Then, view the source to see it structured
 properly.

 Upayavira

 On Tue, Jun 18, 2013, at 02:52 PM, Joe Zhang wrote:
  I did include debugQuery=on in the query, but nothing extra showed up
  in
  the response.
 
 
  On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty g...@mimirtech.com
  wrote:
 
   On 18 June 2013 10:49, Joe Zhang smartag...@gmail.com wrote:
I issued a simple query (apple) to my collection and got 201
 documents
back, all of which are scored 0. What does this mean? --- The
 documents
   do
contain the query words.
  
   My guess is that the float-valued score is getting
   converted to an integer. You could also try your
   query with the parameter debugQuery=on
   to get an explanation of the scoring:
   http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
  
   Regards,
   Gora
  



document id in nutch/solr

2013-06-21 Thread Joe Zhang
A quite standard configuration of nutch seems to automatically map url to
id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml
or schema.xml. The latter does define the id field as well as its
uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to
redefine id as UUID?
<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do solr and nutch have to have
IDENTICAL schema.xml?


Re: what does a zero score mean?

2013-06-18 Thread Joe Zhang
I did include debugQuery=on in the query, but nothing extra showed up in
the response.


On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty g...@mimirtech.com wrote:

 On 18 June 2013 10:49, Joe Zhang smartag...@gmail.com wrote:
  I issued a simple query (apple) to my collection and got 201 documents
  back, all of which are scored 0. What does this mean? --- The documents
 do
  contain the query words.

 My guess is that the float-valued score is getting
 converted to an integer. You could also try your
 query with the parameter debugQuery=on
 to get an explanation of the scoring:
 http://wiki.apache.org/solr/CommonQueryParameters#debugQuery

 Regards,
 Gora



what does a zero score mean?

2013-06-17 Thread Joe Zhang
I issued a simple query (apple) to my collection and got 201 documents
back, all of which are scored 0. What does this mean? --- The documents do
contain the query words.


Re: Internal statistics in Solr index?

2012-12-21 Thread Joe Zhang
Thank you very much! This is a good starting point!

On Fri, Dec 21, 2012 at 6:15 AM, Erick Erickson erickerick...@gmail.com wrote:

 Have you seen the functions here:
 http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions

 Best
 Erick


 On Thu, Dec 20, 2012 at 1:18 PM, Joe Zhang smartag...@gmail.com wrote:

  Dear list,
 
  Is there any way to access things such as word frequency, doc frequency
 in
  solr index?
 
  Thanks!
 



Re: behavior of solr.KeepWordFilterFactory

2012-12-03 Thread Joe Zhang
across-the-board case-sensitive indexing is not what I want...

Let me make sure I understand your suggestion:

   <fieldType name="text1" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

   <fieldType name="text2" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
   </fieldType>


And define content1 as text1, content2 as text2?
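
A minimal schema sketch of that two-field setup (field names here are
illustrative; the copyField feeds the case-sensitive copy automatically):

    <field name="content"    type="text1" indexed="true" stored="true"/>
    <field name="content_cs" type="text2" indexed="true" stored="false"/>
    <copyField source="content" dest="content_cs"/>

A dismax query with qf=content content_cs^10 would then let exact-case
matches on content_cs outrank the lowercased matches on content.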
On Mon, Dec 3, 2012 at 1:09 AM, Xi Shen davidshe...@gmail.com wrote:

 Solr index is case-sensitive by default, unless you used the lower case
 filter. I remember I saw this topic on Solr, and the solution is simple:

  copy the field;
 use a new analyzer/tokenizer to process this field, and do not use lower
 case filter

 when query, make sure both fields are included.


 On Mon, Dec 3, 2012 at 3:04 PM, Joe Zhang smartag...@gmail.com wrote:

   In other words, what I wanted to achieve is case-sensitive indexing on a
  small set of words. Can anybody help?
 
  On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang smartag...@gmail.com wrote:
 
   To be more specific, this is the data type I was using:
  
   <fieldType name="textspecial" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
               catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldType>
  
  
   On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang smartag...@gmail.com
 wrote:
  
    yes, that is the correct behavior. But how do I achieve my goal, i.e.,
    special treatment on a list of uppercase/special words, normal
    treatment on everything else?
  
  
   On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com
 wrote:
  
   By the definition on
  
  
 
 https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
   ,
   I am pretty sure it is the correct behavior of this filter :)
  
    I guess you are trying to use this filter to index some special words in
    Chinese?
  
  
   On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com
  wrote:
  
I defined the following data type in my solr schema.xml
   
     <fieldtype name="testkeep" class="solr.TextField">
        <analyzer>
          <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
                  ignoreCase="false"/>
        </analyzer>
     </fieldtype>
   
when I use the type testkeep to index a test field, my true
    expectation
was to make sure solr indexes the uppercase form of a small list of
   words
in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of
  securing
   the
closed list is achieved, but NO OTHER WORD outside the list is
  indexed!
   
Can anybody help? Thanks in advance!
   
Joe
   
  
  
  
   --
   Regards,
   David Shen
  
   http://about.me/davidshen
   https://twitter.com/#!/davidshen84
  
  
  
  
 



 --
 Regards,
 David Shen

 http://about.me/davidshen
 https://twitter.com/#!/davidshen84



search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang
I have a field type like this:

<fieldType name="text_cs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

When I query COST, it gives reasonable results (n1);
When I query CoSt, however, it gives me n2 (> n1) results, and I can't
locate actual occurrences of CoSt in the docs at all. Can anybody advise?


Re: search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang
haha, makes perfect sense! Thanks a lot!

On Mon, Dec 3, 2012 at 9:25 PM, Jack Krupansky j...@basetechnology.com wrote:

  CoSt was split into two terms and the query parser generated an OR of
  them. Adding the autoGeneratePhraseQueries="true" attribute to your
  field type should fix the problem.

  You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to
  avoid the term splitting issue.

 Be sure to completely reindex in either case.
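
  A sketch of where the attribute goes (on the fieldType element; the
  analyzer chain below it stays unchanged):

      <fieldType name="text_cs" class="solr.TextField"
                 positionIncrementGap="100" autoGeneratePhraseQueries="true">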

 -- Jack Krupansky

 -Original Message- From: Joe Zhang
 Sent: Monday, December 03, 2012 11:10 PM
 To: solr-user@lucene.apache.org
 Subject: search behavior on a case-sensitive field


  I have a field type like this:

 <fieldType name="text_cs" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="0" splitOnCaseChange="1"/>
     <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>

 When I query COST, it gives reasonable results (n1);
 When I query CoSt, however, it gives me n2 (> n1) results, and I can't
 locate actual occurrences of CoSt in the docs at all. Can anybody advise?



Re: multiple indexes?

2012-12-02 Thread Joe Zhang
This is very helpful. Thanks a lot, Shaun and Dikchant!

So in the default single-core situation, the index would live in data/index,
correct?

On Fri, Nov 30, 2012 at 11:02 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/30/2012 10:11 PM, Joe Zhang wrote:

 May I ask: how to set up multiple indexes, and specify which index to send
 the docs to at indexing time, and later on, how to specify which index to
 work with?

 A related question: what is the storage location and structure of solr
 indexes?

 When you index or query data, you'll use a base URL specific to the index
 (core).  Everything goes through that base URL, which includes the name of
 the core:

  http://server:port/solr/corename

  The file called solr.xml tells Solr about multiple cores. Each core has an
 instanceDir and a dataDir.

  http://wiki.apache.org/solr/CoreAdmin

 In the dataDir, Solr will create an index dir, which contains the Lucene
 index.  Here are the file formats for recent versions:

  http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html
  http://lucene.apache.org/core/3_6_1/fileformats.html
  http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html

 Thanks,
 Shawn
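
 A minimal legacy-style solr.xml sketch for two cores (core names are
 illustrative):

     <solr persistent="true">
       <cores adminPath="/admin/cores">
         <core name="core0" instanceDir="core0"/>
         <core name="core1" instanceDir="core1"/>
       </cores>
     </solr>

 Each core's index then lives under its own dataDir (core0/data/index and
 core1/data/index by default) and is reached at
 http://localhost:8983/solr/core0 and http://localhost:8983/solr/core1.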




behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
I defined the following data type in my solr schema.xml

<fieldtype name="testkeep" class="solr.TextField">
   <analyzer>
     <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
             ignoreCase="false"/>
   </analyzer>
</fieldtype>

when I use the type testkeep to index a test field, my true expectation
was to make sure solr indexes the uppercase form of a small list of words
in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the
closed list is achieved, but NO OTHER WORD outside the list is indexed!

Can anybody help? Thanks in advance!

Joe


Re: duplicated URL sent from Nutch to solr index

2012-12-02 Thread Joe Zhang
Thanks!

On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen davidshe...@gmail.com wrote:

 If the value for the id field is the same, the old entry will be updated; if
 it is new, a new entry will be created & indexed.

 This is my experience. :)


 On Mon, Dec 3, 2012 at 1:45 PM, Joe Zhang smartag...@gmail.com wrote:

  Dear list,
 
  I just want to confirm an expected behavior of solr:
 
  Assuming we have <uniqueKey>id</uniqueKey> in schema.xml for Solr,
 when
  we send the same URL from nutch to solr multiple times. would there be
 ONLY
  ONE entry for that URL, but the content (if changed) and timestamp would
 be
  updated?
 
 
  Thanks!
 
  Joe
 



 --
 Regards,
 David Shen

 http://about.me/davidshen
 https://twitter.com/#!/davidshen84



Re: duplicated URL sent from Nutch to solr index

2012-12-02 Thread Joe Zhang
Sorry I didn't make it perfectly clear. The id field is the URL.

On Sun, Dec 2, 2012 at 11:33 PM, Joe Zhang smartag...@gmail.com wrote:

 Thanks!


 On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen davidshe...@gmail.com wrote:

 If the value for the id field is the same, the old entry will be updated; if
 it is new, a new entry will be created & indexed.

 This is my experience. :)


 On Mon, Dec 3, 2012 at 1:45 PM, Joe Zhang smartag...@gmail.com wrote:

  Dear list,
 
  I just want to confirm an expected behavior of solr:
 
  Assuming we have <uniqueKey>id</uniqueKey> in schema.xml for Solr,
 when
  we send the same URL from nutch to solr multiple times. would there be
 ONLY
  ONE entry for that URL, but the content (if changed) and timestamp
 would be
  updated?
 
 
  Thanks!
 
  Joe
 



 --
 Regards,
 David Shen

 http://about.me/davidshen
 https://twitter.com/#!/davidshen84





Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
yes, that is the correct behavior. But how do I achieve my goal, i.e.,
special treatment on a list of uppercase/special words, normal treatment on
everything else?

On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote:

 By the definition on

 https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
 ,
 I am pretty sure it is the correct behavior of this filter :)

  I guess you are trying to use this filter to index some special words in
  Chinese?


 On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com wrote:

  I defined the following data type in my solr schema.xml
 
  <fieldtype name="testkeep" class="solr.TextField">
     <analyzer>
       <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
               ignoreCase="false"/>
     </analyzer>
  </fieldtype>
 
  when I use the type testkeep to index a test field, my true expectation
  was to make sure solr indexes the uppercase form of a small list of words
  in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing
 the
  closed list is achieved, but NO OTHER WORD outside the list is indexed!
 
  Can anybody help? Thanks in advance!
 
  Joe
 



 --
 Regards,
 David Shen

 http://about.me/davidshen
 https://twitter.com/#!/davidshen84



Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
To be more specific, this is the data type I was using:

   <fieldType name="textspecial" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
               catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldType>


On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang smartag...@gmail.com wrote:

 yes, that is the correct behavior. But how do I achieve my goal, i.e.,
 special treatment on a list of uppercase/special words, normal treatment on
 everything else?


 On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote:

 By the definition on

 https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
 ,
 I am pretty sure it is the correct behavior of this filter :)

 I guess you are trying to use this filter to index some special words in
 Chinese?


 On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com wrote:

  I defined the following data type in my solr schema.xml
 
  <fieldtype name="testkeep" class="solr.TextField">
     <analyzer>
       <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
               ignoreCase="false"/>
     </analyzer>
  </fieldtype>
 
  when I use the type testkeep to index a test field, my true expectation
  was to make sure solr indexes the uppercase form of a small list of
 words
  in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing
 the
  closed list is achieved, but NO OTHER WORD outside the list is indexed!
 
  Can anybody help? Thanks in advance!
 
  Joe
 



 --
 Regards,
 David Shen

 http://about.me/davidshen
 https://twitter.com/#!/davidshen84





Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
In other words, what I wanted to achieve is case-sensitive indexing on a
small set of words. Can anybody help?

On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang smartag...@gmail.com wrote:

 To be more specific, this is the data type I was using:

    <fieldType name="textspecial" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


 On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang smartag...@gmail.com wrote:

  yes, that is the correct behavior. But how do I achieve my goal, i.e.,
  special treatment on a list of uppercase/special words, normal treatment on
  everything else?


 On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen davidshe...@gmail.com wrote:

 By the definition on

 https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
 ,
 I am pretty sure it is the correct behavior of this filter :)

  I guess you are trying to use this filter to index some special words in
  Chinese?


 On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang smartag...@gmail.com wrote:

  I defined the following data type in my solr schema.xml
 
   <fieldtype name="testkeep" class="solr.TextField">
      <analyzer>
        <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
                ignoreCase="false"/>
      </analyzer>
   </fieldtype>
 
  when I use the type testkeep to index a test field, my true
  expectation
  was to make sure solr indexes the uppercase form of a small list of
 words
  in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing
 the
  closed list is achieved, but NO OTHER WORD outside the list is indexed!
 
  Can anybody help? Thanks in advance!
 
  Joe
 



 --
 Regards,
 David Shen

 http://about.me/davidshen
 https://twitter.com/#!/davidshen84






Re: stopwords in solr

2012-11-27 Thread Joe Zhang
that is really strange. so basic stopwords such as 'a' and 'the' are not
eliminated from the index?

On Tue, Nov 27, 2012 at 11:16 PM, 曹霖 cao...@babytree-inc.com wrote:

  just no stopwords are considered in that case

 2012/11/28 Joe Zhang smartag...@gmail.com

  t no stopwords are considered in
  this case