Re: processing documents in solr

2013-07-29 Thread Joe Zhang
I'll try reindexing the timestamp.

The id-creation approach suggested by Erick sounds attractive, but the
nutch/solr integration seems rather tight. I don't know where to break in to
insert the id into Solr.


On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson wrote:

> No, SolrJ doesn't provide this automatically. You'd be providing the
> counter by inserting it into the document as you created new docs.
>
> You could do this with any kind of document creation you are
> using.
>
> Best
> Erick
>
> On Mon, Jul 29, 2013 at 2:51 AM, Aditya 
> wrote:
> > Hi,
> >
> > The easiest solution would be to have the timestamp indexed. Is there any
> > issue in doing re-indexing?
> > If you want to process records in batches, then you need an ordered list
> > and a bookmark: a field to sort on, plus a counter / last id maintained
> > as the bookmark. This is essential to solving your problem.
> >
> > If you don't want to re-index, then you need to maintain information
> > about the visited documents. Have a database / solr core which maintains
> > the list of IDs already processed. Fetch records from Solr and, for each
> > record, check that DB to see whether it was already processed.
> >
> > Regards
> > Aditya
> > www.findbestopensource.com
> >
> >
> >
> >
> >
> > On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang 
> wrote:
> >
> >> Basically, I was thinking about running a range query like Shawn
> suggested
> >> on the tstamp field, but unfortunately it was not indexed. Range queries
> >> only work on indexed fields, right?
> >>
> >>
> >> On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang 
> wrote:
> >>
> >> > I've been thinking about the tstamp solution in the past few days. Too
> >> > bad, the field is available but not indexed...
> >> >
> >> > I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
> >> > counter value. If yes, that would be equivalent to an autoincrement
> id.
> >> I'm
> >> > indexing from Nutch though; don't know how to feed in such a counter...
> >> >
> >> >
> >> > On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson <
> erickerick...@gmail.com
> >> >wrote:
> >> >
> >> >> Why wouldn't a simple timestamp work for the ordering? Although
> >> >> I guess "simple timestamp" isn't really simple if the time settings
> >> >> change.
> >> >>
> >> >> So how about a simple counter field in your documents? Assuming
> >> >> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter
> >> >> desc.
> >> >> Take the counter from the first document returned. Increment for
> >> >> each doc for the life of the indexing run. Now you've got, for all
> >> intents
> >> >> and purposes, an identity field albeit manually maintained.
> >> >>
> >> >> Then use your counter field as Shawn suggests for pulling all the
> >> >> data out.
> >> >>
> >> >> FWIW,
> >> >> Erick
> >> >>
> >> >> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
> >> >>  wrote:
> >> >> > In both cases, for better performance, I'd first load just all the
> >> IDs,
> >> >> > then load each document during processing.
> >> >> > As for the incremental requirement, it should not be
> >> >> difficult to
> >> >> > write a hash function which maps a non-numerical id to a value.
> >> >> >  On Jul 27, 2013 7:03 AM, "Joe Zhang" 
> wrote:
> >> >> >
> >> >> >> Dear list:
> >> >> >>
> >> >> >> I have an ever-growing solr repository, and I need to process
> every
> >> >> single
> >> >> >> document to extract statistics. What would be a reasonable process
> >> that
> >> >> >> satisfies the following properties:
> >> >> >>
> >> >> >> - Exhaustive: I have to traverse every single document
> >> >> >> - Incremental: in other words, it has to allow me to divide and
> >> >> conquer ---
> >> >> >> if I have processed the first 20k docs, next time I can start with
> >> >> 20001.
> >> >> >>
> >> >> >> A simple "*:*" query would satisfy the 1st but not the 2nd
> property.
> >> In
> >> >> >> fact, given that the processing will take very long, and the
> >> repository
> >> >> >> keeps growing, it is not even clear that exhaustiveness can ever be
> >> >> achieved.
> >> >> >>
> >> >> >> I'm running solr 3.6.2 in a single-machine setting; no hadoop
> >> >> capability
> >> >> >> yet. But I guess the same issues still hold even if I have the
> solr
> >> >> cloud
> >> >> >> environment, right, say in each shard?
> >> >> >>
> >> >> >> Any help would be greatly appreciated.
> >> >> >>
> >> >> >> Joe
> >> >> >>
> >> >>
> >> >
> >> >
> >>
>
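
A minimal SolrJ sketch of the counter-field approach Erick describes above,
assuming a sortable long field named "counter" has been added to the schema
(the field name, core URL, and sample document are hypothetical; SolrJ 3.x API):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class CounterIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Resume the counter: fetch the highest value indexed so far.
        SolrQuery q = new SolrQuery("*:*");
        q.setSortField("counter", SolrQuery.ORDER.desc);
        q.setRows(1);
        SolrDocumentList top = solr.query(q).getResults();
        long next = top.isEmpty()
                ? 1L
                : ((Number) top.get(0).getFieldValue("counter")).longValue() + 1;

        // Stamp every document created during this indexing run with
        // the next counter value before adding it.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/some-page");
        doc.addField("counter", next++);
        solr.add(doc);
        solr.commit();
    }
}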


Re: processing documents in solr

2013-07-28 Thread Joe Zhang
Basically, I was thinking about running a range query like Shawn suggested
on the tstamp field, but unfortunately it was not indexed. Range queries
only work on indexed fields, right?


On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang  wrote:

> I've been thinking about the tstamp solution in the past few days. Too
> bad, the field is available but not indexed...
>
> I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
> counter value. If yes, that would be equivalent to an autoincrement id. I'm
> indexing from Nutch though; don't know how to feed in such a counter...
>
>
> On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson 
> wrote:
>
>> Why wouldn't a simple timestamp work for the ordering? Although
>> I guess "simple timestamp" isn't really simple if the time settings
>> change.
>>
>> So how about a simple counter field in your documents? Assuming
>> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter
>> desc.
>> Take the counter from the first document returned. Increment for
>> each doc for the life of the indexing run. Now you've got, for all intents
>> and purposes, an identity field albeit manually maintained.
>>
>> Then use your counter field as Shawn suggests for pulling all the
>> data out.
>>
>> FWIW,
>> Erick
>>
>> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
>>  wrote:
>> > In both cases, for better performance, I'd first load just all the IDs,
>> > then load each document during processing.
>> > As for the incremental requirement, it should not be
>> difficult to
>> > write a hash function which maps a non-numerical id to a value.
>> >  On Jul 27, 2013 7:03 AM, "Joe Zhang"  wrote:
>> >
>> >> Dear list:
>> >>
>> >> I have an ever-growing solr repository, and I need to process every
>> single
>> >> document to extract statistics. What would be a reasonable process that
>> >> satisfies the following properties:
>> >>
>> >> - Exhaustive: I have to traverse every single document
>> >> - Incremental: in other words, it has to allow me to divide and
>> conquer ---
>> >> if I have processed the first 20k docs, next time I can start with
>> 20001.
>> >>
>> >> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
>> >> fact, given that the processing will take very long, and the repository
>> >> keeps growing, it is not even clear that exhaustiveness can ever be
>> achieved.
>> >>
>> >> I'm running solr 3.6.2 in a single-machine setting; no hadoop
>> capability
>> >> yet. But I guess the same issues still hold even if I have the solr
>> cloud
>> >> environment, right, say in each shard?
>> >>
>> >> Any help would be greatly appreciated.
>> >>
>> >> Joe
>> >>
>>
>
>


Re: processing documents in solr

2013-07-28 Thread Joe Zhang
I've been thinking about the tstamp solution in the past few days. Too
bad, the field is available but not indexed...

I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
counter value. If yes, that would be equivalent to an autoincrement id. I'm
indexing from Nutch though; don't know how to feed in such a counter...


On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson wrote:

> Why wouldn't a simple timestamp work for the ordering? Although
> I guess "simple timestamp" isn't really simple if the time settings
> change.
>
> So how about a simple counter field in your documents? Assuming
> you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
> Take the counter from the first document returned. Increment for
> each doc for the life of the indexing run. Now you've got, for all intents
> and purposes, an identity field albeit manually maintained.
>
> Then use your counter field as Shawn suggests for pulling all the
> data out.
>
> FWIW,
> Erick
>
> On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
>  wrote:
> > In both cases, for better performance, I'd first load just all the IDs,
> > then load each document during processing.
> > As for the incremental requirement, it should not be difficult
> to
> > write a hash function which maps a non-numerical id to a value.
> >  On Jul 27, 2013 7:03 AM, "Joe Zhang"  wrote:
> >
> >> Dear list:
> >>
> >> I have an ever-growing solr repository, and I need to process every
> single
> >> document to extract statistics. What would be a reasonable process that
> >> satisfies the following properties:
> >>
> >> - Exhaustive: I have to traverse every single document
> >> - Incremental: in other words, it has to allow me to divide and conquer
> ---
> >> if I have processed the first 20k docs, next time I can start with
> 20001.
> >>
> >> A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> >> fact, given that the processing will take very long, and the repository
> >> keeps growing, it is not even clear that exhaustiveness can ever be achieved.
> >>
> >> I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> >> yet. But I guess the same issues still hold even if I have the solr
> cloud
> >> environment, right, say in each shard?
> >>
> >> Any help would be greatly appreciated.
> >>
> >> Joe
> >>
>


Re: processing documents in solr

2013-07-27 Thread Joe Zhang
I have a constantly growing index, so not updating the index isn't
practical...

Going back to the beginning of this thread: when we use the vanilla
"*:*"+pagination approach, would the ordering of documents remain stable?
The index is dynamic: updates/insertions only, no deletions.


On Sat, Jul 27, 2013 at 10:28 AM, Shawn Heisey  wrote:

> On 7/27/2013 11:17 AM, Joe Zhang wrote:
> > Thanks for sharing, Roman. I'll look into your code.
> >
> > One more thought on your suggestion, Shawn. In fact, for the id, we need
> > more than "unique" and "rangeable"; we also need some sense of atomic
> > values. Your approach might run into risk with a text-based id field,
> say:
> >
> > the id/key has values 'a', 'c', 'f', 'g', and our pagesize is 2. Your
> > suggestion would work fine. But with newly added documents, there is no
> > guarantee that they are not going to use the key value 'b'. And this new
> > document would be missed in your algorithm, right?
>
> That's why I said that you would either have to not update the index or
> ensure that (in your example) a 'b' document never gets added.  Because
> you can't make that kind of guarantee in most situations, not updating
> the index is safer.
>
> Thanks,
> Shawn
>
>


Re: processing documents in solr

2013-07-27 Thread Joe Zhang
Thanks for sharing, Roman. I'll look into your code.

One more thought on your suggestion, Shawn. In fact, for the id, we need
more than "unique" and "rangeable"; we also need some sense of atomic
values. Your approach might run into risk with a text-based id field, say:

the id/key has values 'a', 'c', 'f', 'g', and our pagesize is 2. Your
suggestion would work fine. But with newly added documents, there is no
guarantee that they are not going to use the key value 'b'. And this new
document would be missed in your algorithm, right?


On Sat, Jul 27, 2013 at 5:32 AM, Roman Chyla  wrote:

> Dear list,
> I've written a special processor exactly for this kind of operation
>
>
> https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch
>
> This is how we use it
> http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch
>
> It is capable of processing an index of 200 GB in a few minutes;
> copying/streaming large amounts of data is normal
>
> If there is general interest, we can create a jira issue - but given my
> current workload, it will take longer, and somebody else will
> *have to* invest their time and energy in testing it, reporting, etc. Of
> course, feel free to create the jira yourself or reuse the code -
> hopefully, you will improve it and let me know ;-)
>
> Roman
> On 27 Jul 2013 01:03, "Joe Zhang"  wrote:
>
> > Dear list:
> >
> > I have an ever-growing solr repository, and I need to process every
> single
> > document to extract statistics. What would be a reasonable process that
> > satisfies the following properties:
> >
> > - Exhaustive: I have to traverse every single document
> > - Incremental: in other words, it has to allow me to divide and conquer
> ---
> > if I have processed the first 20k docs, next time I can start with 20001.
> >
> > A simple "*:*" query would satisfy the 1st but not the 2nd property. In
> > fact, given that the processing will take very long, and the repository
> > keeps growing, it is not even clear that exhaustiveness can ever be achieved.
> >
> > I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
> > yet. But I guess the same issues still hold even if I have the solr cloud
> > environment, right, say in each shard?
> >
> > Any help would be greatly appreciated.
> >
> > Joe
> >
>


Re: processing documents in solr

2013-07-26 Thread Joe Zhang
Thanks.


On Fri, Jul 26, 2013 at 11:34 PM, Shawn Heisey  wrote:

> On 7/27/2013 12:30 AM, Joe Zhang wrote:
> > ==> so a "url" field would work fine?
>
> As long as it's guaranteed unique on every document (especially if it is
> your uniqueKey) and goes into the index as a single token, that should
> work just fine for the range queries I've described.
>
> Thanks,
> Shawn
>
>


Re: processing documents in solr

2013-07-26 Thread Joe Zhang
On Fri, Jul 26, 2013 at 11:18 PM, Shawn Heisey  wrote:

> On 7/26/2013 11:50 PM, Joe Zhang wrote:
> > ==> Essentially we are doing pagination here, right? If performance is
> not
> > a concern, given that the index is dynamic, does the order of
> > entries remain stable over time?
>
> Yes, it's pagination.  Just like the other method that I've described in
> detail, you'd have to avoid updating the index while you were getting
> information.  Unless you can come up with a sort parameter that's
> guaranteed to make sure that new documents are at the end, any changes
> to the index during the retrieval process will make it impossible to
> retrieve every document.
>
==> What I can guarantee is that there is no deletion, but I guess this is
not equivalent to "newly added docs are at the end", right?

==> I believe you are right about performance. The retrieved set becomes
larger and larger.

>
> >> ==> This approach seems to require that the id field is numerical,
> right?
> > I have a text-based id that is unique.
>
> StrField types work perfectly with range queries.  As long as it's not a
> tokenized field, TextField works properly with range queries too.
> KeywordTokenizer is OK, as long as you don't use filters that create
> additional tokens.  Some examples that create additional tokens are
> WordDelimiterFilter and EdgeNgramFilter.
>
>
==> so a "url" field would work fine?


>
>
> ==> I'm not sure I understand the "q={XXX TO *}" part --> wouldn't query
> be
> > matched against the default search field, which could be "content", for
> > example? How would that do the job?
>
> You are correct, I was too hasty in constructing the query.  That should
> be:
> q=id:{XXX TO *}&rows=NN&sort=id asc
>
> You could speed things up if you don't need to see all stored fields in
> the response by using the fl parameter to only return the fields that
> you need.
>
> Responding to your additional message about an autoincrement field -
> that would only be possible if you are importing from a data source that
> supports autoincrement, like MySQL.  Solr itself has no support for
> autoincrement.
>
> Thanks,
> Shawn
>
>


Re: processing documents in solr

2013-07-26 Thread Joe Zhang
On a related note, inspired by what you said, Shawn: an auto-increment id
seems perfect here. Yet I found there is no such support in Solr. The UUID
only guarantees uniqueness.


On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang  wrote:

> Thanks for your kind reply, Shawn.
>
> On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey  wrote:
>
>> On 7/26/2013 11:02 PM, Joe Zhang wrote:
>> > I have an ever-growing solr repository, and I need to process every
>> single
>> > document to extract statistics. What would be a reasonable process that
>> > satisfies the following properties:
>> >
>> > - Exhaustive: I have to traverse every single document
>> > - Incremental: in other words, it has to allow me to divide and conquer
>> ---
>> > if I have processed the first 20k docs, next time I can start with
>> 20001.
>>
>> If your index isn't very big, a *:* query with rows and start parameters
>> is perfectly acceptable.  Performance is terrible for this method when
>> the index gets huge, though.
>>
>
> ==> Essentially we are doing pagination here, right? If performance is
> not a concern, given that the index is dynamic, does the order of
> entries remain stable over time?
>
>
>
>> If "id" is your uniqueKey field, here's how you can do it.  If that's
>> not your uniqueKey field, substitute your uniqueKey field for id.  This
>> method doesn't work properly if you don't use a field with values that
>> are guaranteed to be unique.
>>
>> For the first query, send a query with these parameters, where NN is
>> the number of docs you want to retrieve at once:
>> q=*:*&rows=NN&sort=id asc
>>
>> For each subsequent query, use the following parameters, where XXX is
>> the highest id value seen in the previous query:
>> q={XXX TO *}&rows=NN&sort=id asc
>>
>> ==> This approach seems to require that the id field is numerical, right?
> I have a text-based id that is unique.
>
> ==> I'm not sure I understand the "q={XXX TO *}" part --> wouldn't query
> be matched against the default search field, which could be "content", for
> example? How would that do the job?
>
>
>> As soon as you see a numFound value less than NN, you will know that
>> there's no more data.
>>
>> Generally speaking, you'd want to avoid updating the index while doing
>> these queries.  If you never replace existing documents and you can
>> guarantee that the value in the uniqueKey field for new documents will
>> always be higher than any previous value, then you could continue
>> updating the index.  A database autoincrement field would qualify for
>> that condition.
>>
>> Thanks,
>> Shawn
>>
>>
>


Re: processing documents in solr

2013-07-26 Thread Joe Zhang
Thanks for your kind reply, Shawn.

On Fri, Jul 26, 2013 at 10:27 PM, Shawn Heisey  wrote:

> On 7/26/2013 11:02 PM, Joe Zhang wrote:
> > I have an ever-growing solr repository, and I need to process every
> single
> > document to extract statistics. What would be a reasonable process that
> > satisfies the following properties:
> >
> > - Exhaustive: I have to traverse every single document
> > - Incremental: in other words, it has to allow me to divide and conquer
> ---
> > if I have processed the first 20k docs, next time I can start with 20001.
>
> If your index isn't very big, a *:* query with rows and start parameters
> is perfectly acceptable.  Performance is terrible for this method when
> the index gets huge, though.
>

==> Essentially we are doing pagination here, right? If performance is not
a concern, given that the index is dynamic, does the order of
entries remain stable over time?



> If "id" is your uniqueKey field, here's how you can do it.  If that's
> not your uniqueKey field, substitute your uniqueKey field for id.  This
> method doesn't work properly if you don't use a field with values that
> are guaranteed to be unique.
>
> For the first query, send a query with these parameters, where NN is
> the number of docs you want to retrieve at once:
> q=*:*&rows=NN&sort=id asc
>
> For each subsequent query, use the following parameters, where XXX is
> the highest id value seen in the previous query:
> q={XXX TO *}&rows=NN&sort=id asc
>
> ==> This approach seems to require that the id field is numerical, right?
I have a text-based id that is unique.

==> I'm not sure I understand the "q={XXX TO *}" part --> wouldn't query be
matched against the default search field, which could be "content", for
example? How would that do the job?


> As soon as you see a numFound value less than NN, you will know that
> there's no more data.
>
> Generally speaking, you'd want to avoid updating the index while doing
> these queries.  If you never replace existing documents and you can
> guarantee that the value in the uniqueKey field for new documents will
> always be higher than any previous value, then you could continue
> updating the index.  A database autoincrement field would qualify for
> that condition.
>
> Thanks,
> Shawn
>
>
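
A sketch of Shawn's uniqueKey walk in SolrJ terms, assuming "id" is the
uniqueKey; the core URL, page size, and the escaping of URL-valued ids are
assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class IndexWalker {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        final int pageSize = 1000;
        String lastId = null;  // persist this bookmark between runs

        while (true) {
            // First page: everything. Later pages: ids strictly above the bookmark.
            String query = (lastId == null)
                    ? "*:*"
                    : "id:{" + ClientUtils.escapeQueryChars(lastId) + " TO *}";
            SolrQuery q = new SolrQuery(query);
            q.setSortField("id", SolrQuery.ORDER.asc);
            q.setRows(pageSize);

            SolrDocumentList page = solr.query(q).getResults();
            for (SolrDocument doc : page) {
                lastId = (String) doc.getFieldValue("id");
                // ... extract statistics from doc here ...
            }
            if (page.size() < pageSize) {
                break;  // fewer than pageSize docs returned: no more data
            }
        }
    }
}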


processing documents in solr

2013-07-26 Thread Joe Zhang
Dear list:

I have an ever-growing solr repository, and I need to process every single
document to extract statistics. What would be a reasonable process that
satisfies the following properties:

- Exhaustive: I have to traverse every single document
- Incremental: in other words, it has to allow me to divide and conquer ---
if I have processed the first 20k docs, next time I can start with 20001.

A simple "*:*" query would satisfy the 1st but not the 2nd property. In
fact, given that the processing will take very long, and the repository
keeps growing, it is not even clear that exhaustiveness can ever be achieved.

I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
yet. But I guess the same issues still hold even if I have the solr cloud
environment, right, say in each shard?

Any help would be greatly appreciated.

Joe


Re: Question about field boost

2013-07-23 Thread Joe Zhang
I'm not sure I understand, Erick. I don't have a "text" field in my schema;
"title" and "content" are both legal fields.


On Tue, Jul 23, 2013 at 5:15 AM, Erick Erickson wrote:

> this isn't doing what you think.
> title^10 content
> is actually parsed as
>
> text:title^10 text:content
>
> where "text" is my default search field.
>
> assuming title is a field. If you look a little
> farther up the debug output you'll see that.
>
> You probably want
> title:content^100 or some such?
>
> Erick
>
> On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky 
> wrote:
> > That means that for that document, "china" occurs in the title, whereas
> > "snowden" is found in the document but not in the title.
> >
> >
> > -- Jack Krupansky
> >
> > -Original Message- From: Joe Zhang
> > Sent: Tuesday, July 23, 2013 12:52 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question about field boost
> >
> >
> > Is my reading correct that the boost is only applied on "china" but not
> > "snowden"? How can that be?
> >
> > My query is: q=china+snowden&qf=title^10 content
> >
> >
> > On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang  wrote:
> >
> >> Thanks for your hint, Jack. Here are the debug results, which I'm having
> a
> >> hard time deciphering (the two terms are "china" and "snowden")...
> >>
> >> 0.26839527 = (MATCH) sum of:
> >>   0.26839527 = (MATCH) sum of:
> >> 0.26757246 = (MATCH) max of:
> >>   7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
> >> 0.019873314 = queryWeight(content:china), product of:
> >>   1.6649085 = idf(docFreq=46832, maxDocs=91058)
> >>   0.01193658 = queryNorm
> >> 0.039825942 = (MATCH) fieldWeight(content:china in 249), product
> >> of:
> >>   4.8989797 = tf(termFreq(content:china)=24)
> >>   1.6649085 = idf(docFreq=46832, maxDocs=91058)
> >>   0.0048828125 = fieldNorm(field=content, doc=249)
> >>   0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
> >> 0.5836803 = queryWeight(title:china^10.0), product of:
> >>   10.0 = boost
> >>   4.8898454 = idf(docFreq=1861, maxDocs=91058)
> >>   0.01193658 = queryNorm
> >> 0.45842302 = (MATCH) fieldWeight(title:china in 249), product
> of:
> >>   1.0 = tf(termFreq(title:china)=1)
> >>   4.8898454 = idf(docFreq=1861, maxDocs=91058)
> >>   0.09375 = fieldNorm(field=title, doc=249)
> >> 8.2282536E-4 = (MATCH) max of:
> >>   8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
> >> 0.03407834 = queryWeight(content:snowden), product of:
> >>   2.8549502 = idf(docFreq=14246, maxDocs=91058)
> >>   0.01193658 = queryNorm
> >> 0.024145111 = (MATCH) fieldWeight(content:snowden in 249),
> product
> >> of:
> >>   1.7320508 = tf(termFreq(content:snowden)=3)
> >>   2.8549502 = idf(docFreq=14246, maxDocs=91058)
> >>   0.0048828125 = fieldNorm(field=content, doc=249)
> >>
> >>
> >> On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky
> >> wrote:
> >>
> >>> Maybe you're not doing anything wrong - other than having an artificial
> >>> expectation of what the true relevance of your data actually is. Many
> >>> factors go into relevance scoring. You need to look at all aspects of
> >>> your
> >>> data.
> >>>
> >>> Maybe your terms don't occur in your titles the way you think they do.
> >>>
> >>> Maybe you need a boost of 500 or more...
> >>>
> >>> Lots of potential maybes.
> >>>
> >>> Relevancy tuning is an art and craft, hardly a science.
> >>>
> >>> Step one: Know your data, inside and out.
> >>>
> >>> Use the debugQuery=true parameter on your queries and see how much of
> the
> >>> score is dominated by your query terms in the non-title fields.
> >>>
> >>> -- Jack Krupansky
> >>>
> >>> -Original Message- From: Joe Zhang
> >>> Sent: Monday, July 22, 2013 11:06 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Question about field boost
> >>>
> >>>
> >>> Dear Solr experts:
> >>>
> >>> Here is my query:
> >>>
> >>> defType=dismax&q=term1+term2&qf=title^100 content
> >>>
> >>> Apparently (at least I thought) my intention is to boost the title
> field.
> >>> While I'm getting some non-trivial results, I'm surprised that the
> >>> documents with both term1 and term2 in title (I know such docs do exist
> >>> in
> >>> my repository) were not returned (or maybe ranked very low). The
> >>> situation
> >>> does not change even when I use much larger boost factors.
> >>>
> >>> What am I doing wrong?
> >>>
> >>
> >>
> >
>
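
Two things in this thread are worth spelling out. In the debug output,
dismax scores each term as a "max of" over the qf fields, so the title boost
shows up only under "china" because document 249 has "china" in its title but
not "snowden"; the boost was applied to both terms, but only "china" produced
a title match. And, per Erick's reply, qf is only honored by the
dismax/edismax query parsers; with the default lucene parser the terms go to
the default search field and qf is silently ignored. A hedged pair of example
requests (host and core are hypothetical):

http://localhost:8983/solr/select?defType=dismax&q=china+snowden&qf=title^10+content&debugQuery=true
http://localhost:8983/solr/select?q=china+snowden&qf=title^10+content   (qf ignored)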


Re: Question about field boost

2013-07-22 Thread Joe Zhang
Is my reading correct that the boost is only applied on "china" but not
"snowden"? How can that be?

My query is: q=china+snowden&qf=title^10 content


On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang  wrote:

> Thanks for your hint, Jack. Here are the debug results, which I'm having a
> hard time deciphering (the two terms are "china" and "snowden")...
>
> 0.26839527 = (MATCH) sum of:
>   0.26839527 = (MATCH) sum of:
> 0.26757246 = (MATCH) max of:
>   7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
> 0.019873314 = queryWeight(content:china), product of:
>   1.6649085 = idf(docFreq=46832, maxDocs=91058)
>   0.01193658 = queryNorm
> 0.039825942 = (MATCH) fieldWeight(content:china in 249), product
> of:
>   4.8989797 = tf(termFreq(content:china)=24)
>   1.6649085 = idf(docFreq=46832, maxDocs=91058)
>   0.0048828125 = fieldNorm(field=content, doc=249)
>   0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
> 0.5836803 = queryWeight(title:china^10.0), product of:
>   10.0 = boost
>   4.8898454 = idf(docFreq=1861, maxDocs=91058)
>   0.01193658 = queryNorm
> 0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
>   1.0 = tf(termFreq(title:china)=1)
>   4.8898454 = idf(docFreq=1861, maxDocs=91058)
>   0.09375 = fieldNorm(field=title, doc=249)
> 8.2282536E-4 = (MATCH) max of:
>   8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
> 0.03407834 = queryWeight(content:snowden), product of:
>   2.8549502 = idf(docFreq=14246, maxDocs=91058)
>   0.01193658 = queryNorm
> 0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product
> of:
>   1.7320508 = tf(termFreq(content:snowden)=3)
>   2.8549502 = idf(docFreq=14246, maxDocs=91058)
>   0.0048828125 = fieldNorm(field=content, doc=249)
>
>
> On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky 
> wrote:
>
>> Maybe you're not doing anything wrong - other than having an artificial
>> expectation of what the true relevance of your data actually is. Many
>> factors go into relevance scoring. You need to look at all aspects of your
>> data.
>>
>> Maybe your terms don't occur in your titles the way you think they do.
>>
>> Maybe you need a boost of 500 or more...
>>
>> Lots of potential maybes.
>>
>> Relevancy tuning is an art and craft, hardly a science.
>>
>> Step one: Know your data, inside and out.
>>
>> Use the debugQuery=true parameter on your queries and see how much of the
>> score is dominated by your query terms in the non-title fields.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Joe Zhang
>> Sent: Monday, July 22, 2013 11:06 PM
>> To: solr-user@lucene.apache.org
>> Subject: Question about field boost
>>
>>
>> Dear Solr experts:
>>
>> Here is my query:
>>
>> defType=dismax&q=term1+term2&qf=title^100 content
>>
>> Apparently (at least I thought) my intention is to boost the title field.
>> While I'm getting some non-trivial results, I'm surprised that the
>> documents with both term1 and term2 in title (I know such docs do exist in
>> my repository) were not returned (or maybe ranked very low). The situation
>> does not change even when I use much larger boost factors.
>>
>> What am I doing wrong?
>>
>
>


Re: Question about field boost

2013-07-22 Thread Joe Zhang
Thanks for your hint, Jack. Here are the debug results, which I'm having a
hard time deciphering (the two terms are "china" and "snowden")...

0.26839527 = (MATCH) sum of:
  0.26839527 = (MATCH) sum of:
0.26757246 = (MATCH) max of:
  7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
0.019873314 = queryWeight(content:china), product of:
  1.6649085 = idf(docFreq=46832, maxDocs=91058)
  0.01193658 = queryNorm
0.039825942 = (MATCH) fieldWeight(content:china in 249), product of:
  4.8989797 = tf(termFreq(content:china)=24)
  1.6649085 = idf(docFreq=46832, maxDocs=91058)
  0.0048828125 = fieldNorm(field=content, doc=249)
  0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
0.5836803 = queryWeight(title:china^10.0), product of:
  10.0 = boost
  4.8898454 = idf(docFreq=1861, maxDocs=91058)
  0.01193658 = queryNorm
0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
  1.0 = tf(termFreq(title:china)=1)
  4.8898454 = idf(docFreq=1861, maxDocs=91058)
  0.09375 = fieldNorm(field=title, doc=249)
8.2282536E-4 = (MATCH) max of:
  8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
0.03407834 = queryWeight(content:snowden), product of:
  2.8549502 = idf(docFreq=14246, maxDocs=91058)
  0.01193658 = queryNorm
0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product
of:
  1.7320508 = tf(termFreq(content:snowden)=3)
  2.8549502 = idf(docFreq=14246, maxDocs=91058)
  0.0048828125 = fieldNorm(field=content, doc=249)


On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky wrote:

> Maybe you're not doing anything wrong - other than having an artificial
> expectation of what the true relevance of your data actually is. Many
> factors go into relevance scoring. You need to look at all aspects of your
> data.
>
> Maybe your terms don't occur in your titles the way you think they do.
>
> Maybe you need a boost of 500 or more...
>
> Lots of potential maybes.
>
> Relevancy tuning is an art and craft, hardly a science.
>
> Step one: Know your data, inside and out.
>
> Use the debugQuery=true parameter on your queries and see how much of the
> score is dominated by your query terms in the non-title fields.
>
> -- Jack Krupansky
>
> -Original Message- From: Joe Zhang
> Sent: Monday, July 22, 2013 11:06 PM
> To: solr-user@lucene.apache.org
> Subject: Question about field boost
>
>
> Dear Solr experts:
>
> Here is my query:
>
> defType=dismax&q=term1+term2&qf=title^100 content
>
> Apparently (at least I thought) my intention is to boost the title field.
> While I'm getting some non-trivial results, I'm surprised that the
> documents with both term1 and term2 in title (I know such docs do exist in
> my repository) were not returned (or maybe ranked very low). The situation
> does not change even when I use much larger boost factors.
>
> What am I doing wrong?
>


Question about field boost

2013-07-22 Thread Joe Zhang
Dear Solr experts:

Here is my query:

defType=dismax&q=term1+term2&qf=title^100 content

Apparently (at least I thought) my intention is to boost the title field.
While I'm getting some non-trivial results, I'm surprised that the
documents with both term1 and term2 in title (I know such docs do exist in
my repository) were not returned (or maybe ranked very low). The situation
does not change even when I use much larger boost factors.

What am I doing wrong?


Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
Thanks, Jack!


On Fri, Jul 12, 2013 at 9:37 PM, Jack Krupansky wrote:

> For the calculation of norm, see note number 6:
>
> http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>
> You would need to talk to the Nutch guys to see why THEY are setting
> document boost to 0.0.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Joe Zhang
> Sent: Friday, July 12, 2013 11:57 PM
> To: solr-user@lucene.apache.org
> Subject: Re: zero-valued retrieval scores
>
>
> Yes, you are right, the boost on these documents is 0. I didn't provide
> them, though.
>
> I suppose the boost scores come from Nutch (yes, my solr indexes crawled
> web docs). What could be wrong?
>
> again, what exactly is the formula for fieldNorm?
>
>
> On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky wrote:
>
>> Did you put a boost of 0.0 on the documents, as opposed to the default of
>> 1.0?
>>
>> x * 0.0 = 0.0
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Joe Zhang
>> Sent: Friday, July 12, 2013 10:31 PM
>> To: solr-user@lucene.apache.org
>> Subject: zero-valued retrieval scores
>>
>>
>> when I search a keyword (such as "apple"), most of the docs carry 0.0 as
>> score. Here is an example from explain:
>>
>> <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
>> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
>>  1.0 = tf(termFreq(content:appl)=1)
>>  2.096877 = idf(docFreq=5190, maxDocs=15546)
>>  0.0 = fieldNorm(field=content, doc=51)
>> Can somebody help me understand why fieldNorm is 0? What exactly is the
>> formula for computing fieldNorm?
>>
>> Thanks!
>>
>>
>
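
For the recurring "what is the formula for fieldNorm" question: per the
TFIDFSimilarity javadoc Jack links above, the index-time norm that surfaces
as fieldNorm in the explain output is, roughly (Lucene's default similarity):

norm(t,d) = doc.getBoost() x lengthNorm x (product of f.getBoost() for each
            instance f of that field in d)
lengthNorm = 1 / sqrt(number of terms in the field)

So a document boost of 0.0 -- which Nutch can assign at index time from its
own scoring filters -- zeroes the whole product, which is why fieldNorm comes
out as 0.0 in the explain above.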


Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
Yes, you are right, the boost on these documents is 0. I didn't provide
them, though.

I suppose the boost scores come from Nutch (yes, my solr indexes crawled
web docs). What could be wrong?

again, what exactly is the formula for fieldNorm?


On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky wrote:

> Did you put a boost of 0.0 on the documents, as opposed to the default of
> 1.0?
>
> x * 0.0 = 0.0
>
> -- Jack Krupansky
>
> -Original Message- From: Joe Zhang
> Sent: Friday, July 12, 2013 10:31 PM
> To: solr-user@lucene.apache.org
> Subject: zero-valued retrieval scores
>
>
> when I search a keyword (such as "apple"), most of the docs carry 0.0 as
> score. Here is an example from explain:
>
> <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
>  1.0 = tf(termFreq(content:appl)=1)
>  2.096877 = idf(docFreq=5190, maxDocs=15546)
>  0.0 = fieldNorm(field=content, doc=51)
> Can somebody help me understand why fieldNorm is 0? What exactly is the
> formula for computing fieldNorm?
>
> Thanks!
>


zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
when I search a keyword (such as "apple"), most of the docs carry 0.0 as
score. Here is an example from explain:

<str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html">
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the
formula for computing fieldNorm?

Thanks!


Re: document id in nutch/solr

2013-06-23 Thread Joe Zhang
Can somebody help with this one, please?


On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang  wrote:

> A quite standard configuration of nutch seems to automatically map "url"
> to "id". Two questions:
>
> - Where is such mapping defined? I can't find it anywhere in
> nutch-site.xml or schema.xml. The latter does define the "id" field as well
> as its uniqueness, but not the mapping.
>
> - Given that Nutch has already defined such an id, can I ask Solr to
> redefine id as UUID?
> 
>
> - This leads to a related question: do solr and nutch have to have
> IDENTICAL schema.xml?
>


document id in nutch/solr

2013-06-21 Thread Joe Zhang
A quite standard configuration of nutch seems to automatically map "url" to
"id". Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml
or schema.xml. The latter does define the "id" field as well as its
uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to
redefine id as UUID?


- This leads to a related question: do solr and nutch have to have
IDENTICAL schema.xml?
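
For reference, Solr 3.x can generate the id itself via solr.UUIDField with a
default of NEW; a hedged schema.xml sketch (note that replacing the url-based
id this way gives up the overwrite-by-url de-duplication that the uniqueKey
otherwise provides):

<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
<uniqueKey>id</uniqueKey>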


Re: what does a zero score mean?

2013-06-21 Thread Joe Zhang
So, the reason is that I'm getting zero values on fieldNorm. The documentation
tells me that there are 3 factors in play here:

- LengthNorm --> can this be zero?
- index-time boost --> is this the boost value we get from nutch?
- field-boost --> none specified.

Can somebody help here?


On Tue, Jun 18, 2013 at 7:22 AM, Upayavira  wrote:

> debugQuery=true adds an extra block of XML to the bottom that will give
> you extra info.
>
> Alternatively, add fl=*,[explain] to your URL. That'll give you an extra
> field in your output. Then, view the source to see it structured
> properly.
>
> Upayavira
>
> On Tue, Jun 18, 2013, at 02:52 PM, Joe Zhang wrote:
> > I did include "debugQuery=on" in the query, but nothing extra showed up
> > in
> > the response.
> >
> >
> > On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty 
> > wrote:
> >
> > > On 18 June 2013 10:49, Joe Zhang  wrote:
> > > > I issued a simple query ("apple") to my collection and got 201
> documents
> > > > back, all of which are scored 0. What does this mean? --- The
> documents
> > > do
> > > > contain the query words.
> > >
> > > My guess is that the float-valued score is getting
> > > converted to an integer. You could also try your
> > > query with the parameter &debugQuery=on
> > > to get an explanation of the scoring:
> > > http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
> > >
> > > Regards,
> > > Gora
> > >
>
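
One caveat on the suggestions above: the [explain] document transformer only
exists in Solr 4.x, so on 3.6.2 the debugQuery route is the one to use, and
scores are only returned when fl asks for them. A hedged example (host and
core are hypothetical):

http://localhost:8983/solr/select?q=apple&fl=*,score&debugQuery=true

The per-document scoring breakdown then appears in the <lst name="explain">
section near the bottom of the response.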


Re: what does a zero score mean?

2013-06-18 Thread Joe Zhang
I did include "debugQuery=on" in the query, but nothing extra showed up in
the response.


On Mon, Jun 17, 2013 at 10:29 PM, Gora Mohanty  wrote:

> On 18 June 2013 10:49, Joe Zhang  wrote:
> > I issued a simple query ("apple") to my collection and got 201 documents
> > back, all of which are scored 0. What does this mean? --- The documents
> do
> > contain the query words.
>
> My guess is that the float-valued score is getting
> converted to an integer. You could also try your
> query with the parameter &debugQuery=on
> to get an explanation of the scoring:
> http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
>
> Regards,
> Gora
>


what does a zero score mean?

2013-06-17 Thread Joe Zhang
I issued a simple query ("apple") to my collection and got 201 documents
back, all of which are scored 0. What does this mean? --- The documents do
contain the query words.


Re: Internal statistics in Solr index?

2012-12-21 Thread Joe Zhang
Thank you very much! This is a good starting point!

On Fri, Dec 21, 2012 at 6:15 AM, Erick Erickson wrote:

> Have you seen the functions here:
> http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions
>
> Best
> Erick
>
>
> On Thu, Dec 20, 2012 at 1:18 PM, Joe Zhang  wrote:
>
> > Dear list,
> >
> > Is there any way to access things such as word frequency, doc frequency
> in
> > solr index?
> >
> > Thanks!
> >
>
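
One caveat plus an example: the relevance functions on that wiki page
(termfreq, docfreq, idf, tf, and friends) were added in Solr 4.0, so they
won't work on the 3.6.2 instance mentioned elsewhere in this archive. On
4.x, a hedged request that returns per-document term statistics alongside
each hit (field and term are hypothetical):

http://localhost:8983/solr/select?q=*:*&fl=id,termfreq(content,'apple'),docfreq(content,'apple')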


Re: search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang
haha, makes perfect sense! Thanks a lot!

On Mon, Dec 3, 2012 at 9:25 PM, Jack Krupansky wrote:

> "CoSt" was split into two terms and the query parser generated an OR of
> them. Adding the autoGeneratePhraseQueries="true" attribute to your
> field type should fix the problem.
>
> You can also change splitOnCaseChange="1" to splitOnCaseChange="0" to
> avoid the term splitting issue.
>
> Be sure to completely reindex in either case.
>
> -- Jack Krupansky
>
> -Original Message- From: Joe Zhang
> Sent: Monday, December 03, 2012 11:10 PM
> To: solr-user@lucene.apache.org
> Subject: search behavior on a case-sensitive field
>
>
> I have a field type like this:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> When I query "COST", it gives reasonable results (n1);
> When I query "CoSt", however, it gives me n2 (>n1) results, and I can't
> locate actual occurrences of "CoSt" in the docs at all. Can anybody advise?
>


search behavior on a case-sensitive field

2012-12-03 Thread Joe Zhang
I have a field type like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

When I query "COST", it gives reasonable results (n1);
When I query "CoSt", however, it gives me n2 (>n1) results, and I can't
locate actual occurrences of "CoSt" in the docs at all. Can anybody advise?


Re: behavior of solr.KeepWordFilterFactory

2012-12-03 Thread Joe Zhang
across-the-board case-sensitive indexing is not what I want...

Let me make sure I understand your suggestion:

   <fieldType name="text1" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

   <fieldType name="text2" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     </analyzer>
   </fieldType>

And define content1 as text1, content2 as text2?
On Mon, Dec 3, 2012 at 1:09 AM, Xi Shen  wrote:

> Solr index is case-sensitive by default, unless you used the lower case
> filter. I remember I saw this topic on Solr, and the solution is simple:
>
> copy the field;
> use a new analyzer/tokenizer to process this field, and do not use lower
> case filter
>
> when querying, make sure both fields are included.
>
>
> On Mon, Dec 3, 2012 at 3:04 PM, Joe Zhang  wrote:
>
> > In other words, what I wanted to achieve is case-sensitive indexing on a
> > small set of words. Can anybody help?
> >
> > On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang  wrote:
> >
> > > To be more specific, this is the data type I was using:
> > >
> > > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer>
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
> > >             catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
> > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > > 
> > >
> > >
> > > On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang 
> wrote:
> > >
> > >> yes, that is the correct behavior. But how do I achieve my goal, i.e.,
> > >> special treatment on a list of uppercase/special words, normal
> > treatment on
> > >> everything else?
> > >>
> > >>
> > >> On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen 
> wrote:
> > >>
> > >>> By the definition on
> > >>>
> > >>>
> >
> https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
> > >>> ,
> > >>> I am pretty sure it is the correct behavior of this filter :)
> > >>>
> > >>> I guess you are trying to use this filter to index some special words in
> > >>> Chinese?
> > >>>
> > >>>
> > >>> On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang 
> > wrote:
> > >>>
> > >>> > I defined the following data type in my solr schema.xml
> > >>> >
> > >>> > <fieldType name="testkeep" class="solr.TextField">
> > >>> >   <analyzer>
> > >>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>> >     <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
> > >>> >             ignoreCase="false"/>
> > >>> >   </analyzer>
> > >>> > </fieldType>
> > >>> >
> > >>> > when I use the type "testkeep" to index a test field, my true
> > >>> expectation
> > >>> > was to make sure solr indexes the uppercase form of a small list of
> > >>> words
> > >>> > in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of
> > securing
> > >>> the
> > >>> > closed list is achieved, but NO OTHER WORD outside the list is
> > indexed!
> > >>> >
> > >>> > Can anybody help? Thanks in advance!
> > >>> >
> > >>> > Joe
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Regards,
> > >>> David Shen
> > >>>
> > >>> http://about.me/davidshen
> > >>> https://twitter.com/#!/davidshen84
> > >>>
> > >>
> > >>
> > >
> >
>
>
>
> --
> Regards,
> David Shen
>
> http://about.me/davidshen
> https://twitter.com/#!/davidshen84
>
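
A schema.xml sketch of Xi Shen's copy-field suggestion, using the text1/text2
types from the message above (field names are hypothetical):

<field name="content"    type="text1" indexed="true" stored="true"/>
<field name="content_cs" type="text2" indexed="true" stored="false"/>
<copyField source="content" dest="content_cs"/>

At query time, search both fields, e.g. with dismax:
q=AAPL&defType=dismax&qf=content+content_cs. A case-sensitive keep-list term
then matches in content_cs, while ordinary words still match in the lowercased
content field.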


Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
In other words, what I wanted to achieve is case-sensitive indexing on a
small set of words. Can anybody help?

On Sun, Dec 2, 2012 at 11:56 PM, Joe Zhang  wrote:

> To be more specific, this is the data type I was using:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang  wrote:
>
>> yes, that is the correct behavior. But how do I achieve my goal, i.e.,
>> special treatment on a list of uppercase/special words, normal treatment on
>> everything else?
>>
>>
>> On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen  wrote:
>>
>>> By the definition on
>>>
>>> https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
>>> ,
>>> I am pretty sure it is the correct behavior of this filter :)
>>>
>>> I guess you are trying to use this filter to index some special words in
>>> Chinese?
>>>
>>>
>>> On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang  wrote:
>>>
>>> > I defined the following data type in my solr schema.xml
>>> >
>>> > <fieldType name="testkeep" class="solr.TextField">
>>> >   <analyzer>
>>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> >     <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
>>> >   </analyzer>
>>> > </fieldType>
>>> >
>>> > when I use the type "testkeep" to index a test field, my true
>>> expectation
>>> > was to make sure solr indexes the uppercase form of a small list of
>>> words
>>> > in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing
>>> the
>>> > closed list is achieved, but NO OTHER WORD outside the list is indexed!
>>> >
>>> > Can anybody help? Thanks in advance!
>>> >
>>> > Joe
>>> >
>>>
>>>
>>>
>>> --
>>> Regards,
>>> David Shen
>>>
>>> http://about.me/davidshen
>>> https://twitter.com/#!/davidshen84
>>>
>>
>>
>


Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
To be more specific, this is the data type I was using:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="tickers.txt" ignoreCase="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

On Sun, Dec 2, 2012 at 11:51 PM, Joe Zhang  wrote:

> yes, that is the correct behavior. But how do I achieve my goal, i.e.,
> special treatment on a list of uppercase/special words, normal treatment on
> everything else?
>
>
> On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen  wrote:
>
>> By the definition on
>>
>> https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
>> ,
>> I am pretty sure it is the correct behavior of this filter :)
>>
>> I guess you are trying to use this filter to index some special words in
>> Chinese?
>>
>>
>> On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang  wrote:
>>
>> > I defined the following data type in my solr schema.xml
>> >
>> > <fieldType name="testkeep" class="solr.TextField">
>> >   <analyzer>
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >     <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> > when I use the type "testkeep" to index a test field, my true expectation
>> > was to make sure solr indexes the uppercase form of a small list of
>> words
>> > in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing
>> the
>> > closed list is achieved, but NO OTHER WORD outside the list is indexed!
>> >
>> > Can anybody help? Thanks in advance!
>> >
>> > Joe
>> >
>>
>>
>>
>> --
>> Regards,
>> David Shen
>>
>> http://about.me/davidshen
>> https://twitter.com/#!/davidshen84
>>
>
>


Re: behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
yes, that is the correct behavior. But how do I achieve my goal, i.e.,
special treatment on a list of uppercase/special words, normal treatment on
everything else?

On Sun, Dec 2, 2012 at 11:46 PM, Xi Shen  wrote:

> By the definition on
>
> https://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/KeepWordFilter.html
> ,
> I am pretty sure it is the correct behavior of this filter :)
>
> I guess you are trying to use this filter to index some special words in
> Chinese?
>
>
> On Mon, Dec 3, 2012 at 1:54 PM, Joe Zhang  wrote:
>
> > I defined the following data type in my solr schema.xml
> >
> > <fieldType name="testkeep" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
> >   </analyzer>
> > </fieldType>
> >
> > when I use the type "testkeep" to index a test field, my true expectation
> > was to make sure solr indexes the uppercase form of a small list of words
> > in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing
> the
> > closed list is achieved, but NO OTHER WORD outside the list is indexed!
> >
> > Can anybody help? Thanks in advance!
> >
> > Joe
> >
>
>
>
> --
> Regards,
> David Shen
>
> http://about.me/davidshen
> https://twitter.com/#!/davidshen84
>


Re: duplicated URL sent from Nutch to solr index

2012-12-02 Thread Joe Zhang
Sorry I didn't make it perfectly clear. The "id" field is the URL.

On Sun, Dec 2, 2012 at 11:33 PM, Joe Zhang  wrote:

> Thanks!
>
>
> On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen  wrote:
>
>> If the value for the "id" field is the same, the old entry will be updated; if
>> it is new, a new entry will be created & indexed.
>>
>> This is my experience. :)
>>
>>
>> On Mon, Dec 3, 2012 at 1:45 PM, Joe Zhang  wrote:
>>
>> > Dear list,
>> >
>> > I just want to confirm an expected behavior of solr:
>> >
>> > Assuming we have "<uniqueKey>id</uniqueKey>" in schema.xml for solr,
>> when
>> > we send the same URL from nutch to solr multiple times. would there be
>> ONLY
>> > ONE entry for that URL, but the content (if changed) and timestamp
>> would be
>> > updated?
>> >
>> >
>> > Thanks!
>> >
>> > Joe
>> >
>>
>>
>>
>> --
>> Regards,
>> David Shen
>>
>> http://about.me/davidshen
>> https://twitter.com/#!/davidshen84
>>
>
>


Re: duplicated URL sent from Nutch to solr index

2012-12-02 Thread Joe Zhang
Thanks!

On Sun, Dec 2, 2012 at 11:20 PM, Xi Shen  wrote:

> If the value for the "id" field is the same, the old entry will be updated; if
> it is new, a new entry will be created & indexed.
>
> This is my experience. :)
>
>
> On Mon, Dec 3, 2012 at 1:45 PM, Joe Zhang  wrote:
>
> > Dear list,
> >
> > I just want to confirm an expected behavior of solr:
> >
> > Assuming we have "<uniqueKey>id</uniqueKey>" in schema.xml for solr,
> when
> > we send the same URL from nutch to solr multiple times. would there be
> ONLY
> > ONE entry for that URL, but the content (if changed) and timestamp would
> be
> > updated?
> >
> >
> > Thanks!
> >
> > Joe
> >
>
>
>
> --
> Regards,
> David Shen
>
> http://about.me/davidshen
> https://twitter.com/#!/davidshen84
>


behavior of solr.KeepWordFilterFactory

2012-12-02 Thread Joe Zhang
I defined the following data type in my solr schema.xml:

<fieldType name="testkeep" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldType>

when I use the type "testkeep" to index a test field, my true expectation
was to make sure solr indexes the uppercase form of a small list of words
in the file, AND TREAT EVERY OTHER WORD AS USUAL. The goal of securing the
closed list is achieved, but NO OTHER WORD outside the list is indexed!

Can anybody help? Thanks in advance!

Joe


Re: multiple indexes?

2012-12-02 Thread Joe Zhang
This is very helpful. Thanks a lot, Shawn and Dikchant!

So in the default single-core situation, the index would live in data/index,
correct?

On Fri, Nov 30, 2012 at 11:02 PM, Shawn Heisey  wrote:

> On 11/30/2012 10:11 PM, Joe Zhang wrote:
>
>> May I ask: how to set up multiple indexes, and specify which index to send
>> the docs to at indexing time, and later on, how to specify which index to
>> work with?
>>
>> A related question: what is the storage location and structure of solr
>> indexes?
>>
> When you index or query data, you'll use a base URL specific to the index
> (core).  Everything goes through that base URL, which includes the name of
> the core:
>
> http://server:port/solr/corename
>
> The file called solr.xml tells Solr about multiple cores. Each core has an
> instanceDir and a dataDir.
>
> http://wiki.apache.org/solr/CoreAdmin
>
> In the dataDir, Solr will create an index dir, which contains the Lucene
> index.  Here are the file formats for recent versions:
>
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html
> http://lucene.apache.org/core/3_6_1/fileformats.html
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>
> Thanks,
> Shawn
>
>
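
A minimal multi-core solr.xml sketch in the legacy (pre-SolrCloud) format
Shawn describes, with hypothetical core names:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>

Each core then answers at http://server:port/solr/core0 and .../core1, and
each keeps its own index under its instanceDir's data/index by default.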


Re: stopwords in solr

2012-11-27 Thread Joe Zhang
That is really strange. So basic stopwords such as "a" and "the" are not
eliminated from the index?

On Tue, Nov 27, 2012 at 11:16 PM, 曹霖  wrote:

> just that no stopwords are considered in that case
>
> 2012/11/28 Joe Zhang 
>
> > t no stopwords are considered in
> > this case
> >
>