Re: tomcat install

2006-09-19 Thread Nick Snels

Hi James,

now you should make a FilterFactory; there are a few examples in
c:\solr-nightly\src\java\org\apache\solr\analysis. You should also place
your FilterFactory in that directory and rerun 'ant dist'. I made a
DutchStemFilterFactory class; this is the code:

package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.nl.DutchStemFilter;

import java.io.File;
import java.util.Map;

public class DutchStemFilterFactory extends BaseTokenFilterFactory {

  public TokenStream create(TokenStream input) {
    // load the stem dictionary overrides from <instanceDir>/conf/stems.txt
    File file = new File(org.apache.solr.core.Config.getInstanceDir() + "conf/stems.txt");
    Map stemdict = org.apache.lucene.analysis.nl.WordlistLoader.getStemDict(file);

    // alternative: org.apache.lucene.analysis.nl.DutchStemFilter.setStemDictionary(stemdict);

    // null = no stem-exclusion set
    return new DutchStemFilter(input, null, stemdict);
  }
}

Then I could change text_lu in schema.xml to:

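(Along these lines; the tokenizer and the other filters are just the usual ones from the
example schema, the new part is the DutchStemFilterFactory at the end:)

<fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.DutchStemFilterFactory"/>
  </analyzer>
</fieldtype>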

Notice how DutchStemFilterFactory gets called. Hope that solves your
problem.

Kind regards,

Nick

On 9/19/06, James liu <[EMAIL PROTECTED]> wrote:


Thank you. With your steps, plus adding junit, it is OK now.

So you can analyze your language?

I modified the schema:

  








but nothing changed.



2006/9/19, Nick Snels <[EMAIL PROTECTED]>:
>
> Hi James,
>
> I also needed the DutchAnalyzer from Lucene in my Solr project. I did it the
> following way, which is probably the hard way, because my Java knowledge
> isn't that great.
>
> 1. I unzipped the solr-nightly build
> 2. I downloaded the latest code from Lucene, preferably from svn:
> http://svn.apache.org/viewvc/lucene/java/ and all necessary analyzers from
> the Lucene sandbox
> 3. I put it into c:\solr-nightly\src\java\org\apache\lucene
> 4. I installed ant (unzip it and add ANT_HOME to your path)
> 5. Then open a DOS prompt, go to c:\solr-nightly and run 'ant dist'; this
> makes a new solr-1.0.war file in c:\solr-nightly\dist. That war file
> also contains the Lucene code along with your analyzers
>
> This is how I did it; I don't know if this is the right or the easiest way
> to do it.
>
> Kind regards,
>
> Nick
>
>
> On 9/18/06, James liu <[EMAIL PROTECTED]> wrote:
> >
> > Hi Nick,
> >
> > It is very funny: when I rebooted my PC it was OK, and I did nothing.
> >
> > My new question is how to add lucene-analyzers-2.0.0.jar to Tomcat or
> > jetty.
> >
> > I added the classes to the solr.war at
> > "C:\cygwin\tmp\solr-nightly\example\webapps\solr.war", but it has no
> > effect...
> >
> > Do you know how to solve it?
> >
> >
> > Regards,
> >
> > JL
> >
> > 2006/9/18, Nick Snels <[EMAIL PROTECTED]>:
> > >
> > > Hi James,
> > >
> > > the problem is most likely an XML error in either schema.xml or
> > > solrconfig.xml. Go through your Tomcat logs; if it is an XML error you
> > > should find the line where the XML parsing went wrong.
> > >
> > > Kind regards,
> > >
> > > Nick
> > >
> > > On 9/18/06, James liu <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Thanks, Nick.
> > > >
> > > > I did what you told me and I can see the admin page.
> > > >
> > > > But when I click search, I get this error:
> > > >
> > > > java.lang.NullPointerException
> > > > at org.apache.solr.search.SolrQueryParser.<init>(SolrQueryParser.java:37)
> > > > at org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:47)
> > > > at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:94)
> > > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:586)
> > > > at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:91)
> > > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> > > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> > > > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
> > > > at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
> > > > at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
> > > > at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
> > > > at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
> > > > at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
> > > > at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
> > > > at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
> > > > at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
> > > > at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http1

Re: tomcat install

2006-09-19 Thread James liu
Today it is still not OK. I checked the CJK source (CJKAnalyzer.java and CJKTokenizer.java,
both from the Lucene 2.0 source code) and your code, and I wrote CJKJLFilterFactory.java and
CJKJLTokenizerFactory.java. ant is OK. I copied the new solr.war to Tomcat's webapps and
modified schema.xml.

On the admin page I use http://localhost:8484/solr/admin/analysis.jsp?highlight=on
to check the word analysis, and it shows me *()*)&*^&* ... oh my god, I am failing.

I looked at the org.apache.lucene.analysis.nl code and I found something different: your
Tokenizer is the same as StandardTokenizer, so I have to write my own.

Thank you very much; without your code I think I might have given up. I only used Delphi and
PHP, no Java or Unix, before I met Lucene. I use Lucene well and I think I can use Solr well
too. Thank you again.

My MSN: [EMAIL PROTECTED], maybe we can be friends.


Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Yonik Seeley

On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:
> For cases like "author", if there is only one value per document, then
> a possible fix is to use the field cache.  If there can be multiple
> occurrences, there doesn't seem to be a good way that preserves exact
> counts, except maybe if the number of documents matching a query is
> low.
>
I have one value per document (I have fields for authors, last_author
and first_author, and I'm doing faceted search on first and last authors
fields). How would I use the field cache to fix my problem?


Unless you want to dive into Solr development, you don't :-)
It requires extensive changes to the faceting code and doing things a
different way in some cases.

The FieldCache is the fastest way to "uninvert" single valued
fields... it's currently only used for Sorting, where one needs to
quickly know the field value given the document id.
The downside is high memory use, and that it's not a general
solution... it can't handle fields with multiple tokens (tokenized
fields or multi-valued fields).

So the strategy would be to step through the documents, get the value
for the field from the FieldCache, increment a counter for that value,
then find the top counters when we are done.
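Roughly, in untested, illustrative code (not the Solr implementation), assuming a
single-valued, untokenized field called "last_author":

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import java.util.HashMap;
import java.util.Map;

public class FieldCacheFacetSketch {
  /** counts field values for the docs that matched the query */
  public static Map countAuthors(IndexReader reader, int[] matchingDocs) throws java.io.IOException {
    // one value per document id; only works for single-valued, untokenized fields
    String[] authorByDoc = FieldCache.DEFAULT.getStrings(reader, "last_author");
    Map counts = new HashMap();  // author -> Integer count
    for (int i = 0; i < matchingDocs.length; i++) {
      String author = authorByDoc[matchingDocs[i]];
      if (author == null) continue;
      Integer old = (Integer) counts.get(author);
      counts.put(author, new Integer(old == null ? 1 : old.intValue() + 1));
    }
    return counts;  // the caller would then pick the top N entries
  }
}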


Also, would
it be better to store a unique number (for each possible author) in an
int field along with the string, and do the faceted searching on the int
field?


It won't really help.  It wouldn't be faster, and it would require
only slightly less memory.


>> Just a little follow-up - I did a little more testing, and the query
>> takes 20 seconds no matter what - If there's one document in the results
>> set, or if I do a query that returns all 13 documents.
>
> Yes, currently the same strategy is always used.
>   intersection_count(docs_matching_query, docs_matching_author1)
>   intersection_count(docs_matching_query, docs_matching_author2)
>   intersection_count(docs_matching_query, docs_matching_author3)
>   etc...
>
> Normally, the docsets will be cached, but since the number of authors
> is greater than the size of the filtercache, the effective cache hit
> rate will be 0%
>
> -Yonik
So more memory would fix the problem?


Yes, if your collection size isn't that large...  it's not a practical
solution for many cases though.


Also, I was under the impression
that it was only searching / sorting for authors that it knows are in
the result set...


That's the problem... it's not necessarily easy to know *what* authors
are in the result set.  If we could quickly determine that, we could
just count them and not do any intersections or anything at all.


in the case of only one document (1 result), it seems
strange that it takes the same time as for 130 000 results. It should
just check the results, see that there's only one author, and return
that? And in the case of 2 documents, just sort 2 authors (or 1 if
they're the same)? I understand your answer (it does intersections), but
I wonder why it's intersecting over the whole document set at first, and
not docs_matching_query like you said.


It is just intersecting docs_matching_query.  The problem is that it's
intersecting that set with all possible author sets since it doesn't
know ahead of time what authors are in the docs that match the query.

There could be optimizations when docs_matching_query.size() is small,
so we start somehow with terms in the documents rather than terms in
the index.  That requires termvectors to be stored (medium speed), or
requires that the field be stored and that we re-analyze it (very
slow).

More optimization of special cases hasn't been done simply because no
one has done it yet... (as you note, faceting is a new feature).


-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Joachim Martin

Michael Imbeault wrote:

Also, are there any plans to add an option not to run a facet search if
the result set is too big? To avoid 40-second queries if the docset
is too large...



You could run one query with facet=false, check the result size and then 
run it again (should be fast because it is cached) with 
facet=true&rows=0 to get facet results only.
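For example (the query and field names here are made up; facet.field would be whatever
you are faceting on):

  http://localhost:8983/solr/select?q=microsoft&facet=false&rows=10
  ...and then, if numFound is small enough:
  http://localhost:8983/solr/select?q=microsoft&facet=true&facet.field=last_author&facet.field=first_author&rows=0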


I would think that the decision to run/not run facets would be highly 
custom to your collection and not easily developed as a configurable 
feature.


--Joachim


Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Yonik Seeley

I just updated the comments in solrconfig.xml:

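In short: size, initialSize, and autowarmCount are all counts of cache entries, not
kilobytes. Roughly, for the stock filterCache entry:

  <!-- size: maximum number of entries the cache will hold
       initialSize: initial capacity of the underlying map
       autowarmCount: number of entries to pre-populate from the old cache
                      when a new searcher is opened -->
  <filterCache
    class="solr.search.LRUCache"
    size="512"
    initialSize="512"
    autowarmCount="256"/>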

On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

Another followup: I bumped all the caches in solrconfig.xml to

  size="1600384"
  initialSize="400096"
  autowarmCount="400096"

It seemed to fix the problem on a very small index (facets on last and
first author fields, plus 12 date range facets, sub 0.3 seconds for
queries). I'll check on the full index tomorrow (it's indexing right
now, 400 docs/sec!). However, I still don't have a clear idea of what these
values represent, or how I should estimate what to set
them to. Originally I thought it was the size of the cache in KB, and
someone on the list told me it was the number of items, but I don't quite
get it. Better documentation on that would be welcome :)

Also, are there any plans to add an option not to run a facet search if
the result set is too big? To avoid 40-second queries if the docset is
too large...


I'd like to speed up certain corner cases, but you can always set
timeouts in whatever frontend is making the request to Solr too.

-Yonik


Re: tomcat install

2006-09-19 Thread Nick Snels

Hi James,

don't give up, you're very close to having it work. If you can get CJKAnalyzer
and CJKTokenizer to work in Lucene, you should also be able to get them to
work in Solr. Look at the bright side: at least ant doesn't throw any
errors. And my code isn't going to work, since it really can't handle
Chinese or Japanese characters. You should have a look at how you did things
in Lucene.

I have gone through my archives and found that people have also used
something similar to:

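(roughly; the field type name here is just an example, the key line is the analyzer class)

<fieldtype name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldtype>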
Maybe you can try it this way, and forget about the FilterFactories. Let me
know how it goes.

Kind regards,

Nick


On 9/19/06, James liu <[EMAIL PROTECTED]> wrote:


Today it is still not OK.

I checked the CJK source (CJKAnalyzer.java and CJKTokenizer.java, both from
the Lucene 2.0 source code)

and your code, and I wrote CJKJLFilterFactory.java and
CJKJLTokenizerFactory.java

ant is OK. I copied the new solr.war to Tomcat's webapps and modified schema.xml

On the admin page I use http://localhost:8484/solr/admin/analysis.jsp?highlight=on
to check the word analysis

and it shows me *()*)&*^&*, oh my god, I am failing.

I looked at the org.apache.lucene.analysis.nl code and I found something different:
your Tokenizer is the same as StandardTokenizer, so I have to write my own.


Thank you very much; without your code I think I might have given up. I only used Delphi
and PHP, no Java or Unix, before I met Lucene.

I use Lucene well and I think I can use Solr well too.

Thank you again.

My MSN: [EMAIL PROTECTED], maybe we can be friends.



copyField to a dynamic field

2006-09-19 Thread Paul Terray
Hi,

 

I know this is a complex one, but it would help me to be able to make a dynamic
copyField, like:

   
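For example, something along these lines (the field suffixes are only an example):

  <copyField source="*_s" dest="*_t"/>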

 

The goal is to have, for each string field, a tokenized counterpart.

 

It does not seem possible at the moment, but will it be in the foreseeable
future?

 

Thanks anyway !

 


Paul Terray
Consultant Avant-Vente
SOLLAN
27, bis rue du Progrès
93100 Montreuil - France
Tel: +33 (0)1 48 51 15 44
Fax: +33 (0)1 48 51 15 48
[EMAIL PROTECTED]
www.sollan.com

STRICTLY PERSONAL AND CONFIDENTIAL. This email may contain confidential and
proprietary material for the sole use of the intended recipient. Any review
or distribution by others is strictly prohibited. If you are not the
intended recipient please contact the sender and delete all copies.

 



Multivalued vs single valued

2006-09-19 Thread Paul Terray
Hi,

 

Using a lot of dynamic fields, I'd like to simplify the field types. I had a
question on this: is there an advantage to having a field declared as single-valued,
as opposed to multi-valued?

 

Thanks

 


Paul Terray
Consultant Avant-Vente
SOLLAN
27, bis rue du Progrès
93100 Montreuil - France
Tel: +33 (0)1 48 51 15 44
Fax: +33 (0)1 48 51 15 48
[EMAIL PROTECTED]
www.sollan.com

STRICTLY PERSONAL AND CONFIDENTIAL. This email may contain confidential and
proprietary material for the sole use of the intended recipient. Any review
or distribution by others is strictly prohibited. If you are not the
intended recipient please contact the sender and delete all copies.

 



Re: Multivalued vs single valued

2006-09-19 Thread Yonik Seeley

On 9/19/06, Paul Terray <[EMAIL PROTECTED]> wrote:

Using a lot of dynamic fields, I'd like to simplify the field types. I had a
question on this: is there an advantage to having a field declared as single-valued,
as opposed to multi-valued?


The response of single valued fields is smaller (no encapsulating
array), you can sort on it if you need to, and (upcoming)
optimizations for faceted browsing on a single-valued field are
easier.
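For instance, a stored single-valued field comes back roughly as

  <str name="title">Microsoft Reorganizes</str>

while the multi-valued version of the same field is wrapped in an array:

  <arr name="title"><str>Microsoft Reorganizes</str></arr>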


From the perspective of pure full-text searching only, there is no advantage.


-Yonik


strange highlighting behavior

2006-09-19 Thread Brian Lucas
I’m experiencing some unusual behavior when I perform a search with 
highlighting enabled.

 

I’ve set up “id” as “sint” and indexed properly, but performing a search gives 
the following result:

 



(results excerpt, markup omitted; only the field values remain)

  score 3.0647626 ... 2 ... 369845 ... 1 ... Microsoft Reorganizes ... Microsoft Reorganizes

  score 3.0647626 ... 2 ... 369850 ... 1 ... Microsoft Moment ... Microsoft Moment

  ...

(highlighting excerpt, markup omitted; each lst name="..." held one of these snippets)

  Microsoft Reorganizes
  Microsoft Moment
  NASCAR with Microsoft
 

The unusual characters in lst name="…" are what I can't figure out, as they are
DEFINITELY not the id. I've tried indexing id as "integer", "sint", and "string",
all with the same result.

 

Using Solr-9-18 and Tomcat 5.5.17.

 

Any way to see where it's getting these strange names from? My understanding is
that those should be the numeric IDs given above.

Brian



Re: strange highlighting behavior

2006-09-19 Thread Yonik Seeley

On 9/19/06, Brian Lucas <[EMAIL PROTECTED]> wrote:

The unusual characters in lst name="…" are what I can't figure out, as they are
DEFINITELY not the id. I've tried indexing id as "integer", "sint", and "string",
all with the same result.


Yes, looks like you hit a bug where you are seeing the "indexed" form
of sint (which is more of a binary format that allows terms to be
ordered in numeric order).  The fix would be to use
FieldType.indexedToReadable() to convert the indexed form back to a
readable form.

It should have worked with "integer" or "string" since the indexed and
readable forms are identical... I suspect the old documents with an
sint ID still exist in your index and that is what you are seeing.


-Yonik


Re: strange highlighting behavior

2006-09-19 Thread Yonik Seeley

On 9/19/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

The fix would be to use
FieldType.indexedToReadable() to convert the indexed form back to a
readable form.


Oops, that should be storedToReadable since the id is obtained from
the stored fields, not from the index.

Hmmm, a quick look at the code suggests this is already being done:

String printId = searcher.getSchema().printableUniqueKey(doc);
fragments.add(printId == null ? null : printId, docSummaries);

What you are seeing may be due to indexing documents with one version
of the schema and viewing them with another.  Try deleting the
solr/data/index directory and then reindexing everything.

-Yonik


RE: strange highlighting behavior

2006-09-19 Thread Brian Lucas
Yonik, thanks for the tip. 

Converting to 'integer' and deleting/reindexing fixed it.  Can 'sint' be
used for the id with highlighting, or does one need to use integer or string
for that?  Just trying to figure out if it's a bug with sint, or possibly
due to the fact that I changed sint to integer without deleting the
data.
 
-B

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Tuesday, September 19, 2006 11:55 AM
To: solr-user@lucene.apache.org
Subject: Re: strange highlighting behavior

On 9/19/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> The fix would be to use
> FieldType.indexedToReadable() to convert the indexed form back to a
> readable form.

Oops, that should be storedToReadable since the id is obtained from
the stored fields, not from the index.

Hmmm, a quick look at the code suggests this is already being done:

 String printId = searcher.getSchema().printableUniqueKey(doc);
 fragments.add(printId == null ? null : printId, docSummaries);

What you are seeing may be due to indexing documents with one version
of the schema and viewing them with another.  Try deleting the
solr/data/index directory and then reindexing everything.

-Yonik



Re: strange highlighting behavior

2006-09-19 Thread Yonik Seeley

On 9/19/06, Brian Lucas <[EMAIL PROTECTED]> wrote:

Converting to 'integer' and deleting/reindexing fixed it. Can 'sint' be
used for the id with highlighting, or does one need to use integer or string
for that?


It should be usable (but I personally haven't tested that).
If it's not, it's a bug and will be fixed :-)


 Just trying to figure out if it's a bug with sint, or possibly
due to the fact I could have changed sint to integer without deleting the
data.


The latter would be my guess.

-Yonik


Re: tomcat install

2006-09-19 Thread Chris Hostetter

: I have gone through my archives and found that people have also used
: something similar to:
:
: <fieldtype name="text_cjk" class="solr.TextField">
:   <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
: </fieldtype>

Correct.  If you want to use a Lucene analyzer "as is" all you need to do
is specify the class name.  If you want to make an analyzer on the fly
from a tokenizer and some token filters -- you need factories for each.

I would start by trying to use the CJKAnalyzer as is with the syntax
described above.  Once you get that working, then look at what it takes to
write factories for the tokenizer so you can mix/match it with other
token filters.
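For reference, such a factory would probably look a lot like the DutchStemFilterFactory
earlier in this thread -- an untested sketch, and the class name is just a suggestion:

package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

import java.io.Reader;

public class CJKTokenizerFactory extends BaseTokenizerFactory {
  public TokenStream create(Reader input) {
    // CJKTokenizer does the CJK-aware tokenizing; the factory just wires it in
    return new CJKTokenizer(input);
  }
}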


-Hoss



Re: no example to CollectionDistribution?

2006-09-19 Thread Chris Hostetter

: maybe I should get cron through cygwin..
:
: My system is win2003, not Unix.
:
: Today I tried ./snappuller, but it seems wrong, and I set the master port,
: directory, and snap directory

The CollectionDistribution scripts may not work well on Windows -- many of
them require hardlinks, which may or may not be supported by Windows
or cygwin (I've heard different things) ... snappuller in particular
requires that you have rsync running.



-Hoss



Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Chris Hostetter

Quick question: did you say you are faceting on the first name field
separately from the last name field? ... why?

You'll probably see a sharp increase in performance if you have a single
untokenized author field containing the full name and you facet on that --
there will be far fewer unique terms to use when computing DocSets and
intersections.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index is ...
once you made that change, did your filterCache hit rate increase? .. do you
have any evictions (you can check on the "Statistics" page)

: > Also, I was under the impression
: > that it was only searching / sorting for authors that it knows are in
: > the result set...
:
: That's the problem... it's not necessarily easy to know *what* authors
: are in the result set.  If we could quickly determine that, we could
: just count them and not do any intersections or anything at all.

another way to look at it is that by looking at all the authors, the work
done for generating the facet counts for query A can be completely reused
for the next query B -- presuming your filterCache is large enough to hold
all of the author filters.

: There could be optimizations when docs_matching_query.size() is small,
: so we start somehow with terms in the documents rather than terms in
: the index.  That requires termvectors to be stored (medium speed), or
: requires that the field be stored and that we re-analyze it (very
: slow).
:
: More optimization of special cases hasn't been done simply because no
: one has done it yet... (as you note, faceting is a new feature).

the optimization I anticipated from the beginning would
probably be useful in the situation Michael is describing ... if there is
a "long tail" of authors (and in my experience, there typically is) we
can cache an ordered list of the top N most prolific authors, along with
the count of how many documents they have in the index (this info is easy
to get from TermEnum.docFreq).  when we facet on the authors, we start with
that list and go in order, generating their facet constraint count using
the DocSet intersection just like we currently do ... if we reach our
facet.limit before we reach the end of the list, and the lowest constraint
count is higher than the total doc count of the last author in the list,
then we know we don't need to bother testing any other author, because no
other author can possibly have a higher facet constraint count than the
ones on our list (since they haven't even written that many documents)
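in rough, untested java, where authorsByDocFreq/docFreqs stand for the cached list and
countFor() stands in for the DocSet intersection we already do (ties on the counts are
glossed over):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class TopAuthorsSketch {
  // authorsByDocFreq: author terms sorted by docFreq, highest first
  // docFreqs:         the matching docFreq for each author, same order
  static List topAuthors(String[] authorsByDocFreq, int[] docFreqs, int facetLimit) {
    TreeMap kept = new TreeMap();                    // count -> author, smallest count first
    for (int i = 0; i < authorsByDocFreq.length; i++) {
      if (kept.size() >= facetLimit) {
        int smallestKept = ((Integer) kept.firstKey()).intValue();
        // docFreq is an upper bound on the constraint count, and the list is sorted by
        // docFreq, so once it drops to the smallest kept count nothing later can win
        if (docFreqs[i] <= smallestKept) break;
      }
      int count = countFor(authorsByDocFreq[i]);     // the DocSet intersection we already do
      kept.put(new Integer(count), authorsByDocFreq[i]);
      if (kept.size() > facetLimit) kept.remove(kept.firstKey());
    }
    return new ArrayList(kept.values());             // the facet.limit best authors
  }

  static int countFor(String author) { return 0; }   // placeholder for the real intersection
}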



-Hoss



Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Yonik Seeley

On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:


Quick question: did you say you are faceting on the first name field
separately from the last name field? ... why?

You'll probably see a sharp increase in performance if you have a single
untokenized author field containing the full name and you facet on that --
there will be far fewer unique terms to use when computing DocSets and
intersections.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index is ...
once you made that change, did your filterCache hit rate increase? .. do you
have any evictions (you can check on the "Statistics" page)

: > Also, I was under the impression
: > that it was only searching / sorting for authors that it knows are in
: > the result set...
:
: That's the problem... it's not necessarily easy to know *what* authors
: are in the result set.  If we could quickly determine that, we could
: just count them and not do any intersections or anything at all.

another way to look at it is that by looking at all the authors, the work
done for generating the facet counts for query A can be completely reused
for the next query B -- presuming your filterCache is large enough to hold
all of the author filters.

: There could be optimizations when docs_matching_query.size() is small,
: so we start somehow with terms in the documents rather than terms in
: the index.  That requires termvectors to be stored (medium speed), or
: requires that the field be stored and that we re-analyze it (very
: slow).
:
: More optimization of special cases hasn't been done simply because no
: one has done it yet... (as you note, faceting is a new feature).

the optimization I anticipated from the beginning would
probably be useful in the situation Michael is describing ... if there is
a "long tail" of authors (and in my experience, there typically is)



we
can cache an ordered list of the top N most prolific authors, along with
the count of how many documents they have in the index (this info is easy
to get from TermEnum.docFreq).


Yeah, I've thought about a fieldInfoCache too.  It could also cache
the total number of terms in order to make decisions about what
faceting strategy to follow.


when we facet on the authors, we start with
that list and go in order, generating their facet constraint count using
the DocSet intersection just like we currently do ... if we reach our
facet.limit before we reach the end of the list, and the lowest constraint
count is higher than the total doc count of the last author in the list,
then we know we don't need to bother testing any other author, because no
other author can possibly have a higher facet constraint count than the
ones on our list


This works OK if the intersection counts are high (as a percentage of
the facet sets).  I'm not sure how often this will be the case though.

Another tradeoff is to allow getting inexact counts with multi-token fields by:
- simply faceting on the most popular values
  OR
- do some sort of statistical sampling by reading term vectors for a
fraction of the matching docs.

-Yonik


relational design in solr?

2006-09-19 Thread Joachim Martin
I am trying to integrate Solr search results with results from an RDBMS
query.  It's working OK, but it is fairly complicated due to the large size of
the results from the database and many different sort requirements.


I know that solr/lucene was not designed to intelligently handle 
multiple document types in the same collection, i.e. provide join 
features, but I'm wondering if anyone on this list has any thoughts on 
how to do it in lucene, and how it might be integrated into a custom 
solr deployment.  I can't see going back to vanilla lucene after solr!


My basic idea is to add an objType field that would be used to define a
"table".  There would be one main objType; any related objTypes would
have a field pointing back to the main objects via id, like a foreign key.
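Concretely, something like this (the field names are just placeholders):

  <field name="objType" type="string" indexed="true" stored="true"/>
  <!-- "foreign key" on the related doc types, holding the id of the main doc -->
  <field name="main_id" type="string" indexed="true" stored="true"/>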


I'd run multiple parallel searches and merge the results based on 
foreign keys, either using a Filter or just using custom code.  I'm 
anticipating that iterating through the results to retrieve the foreign 
key values will be too slow.


Our data is highly textual, temporal and spatial, which pretty much 
correspond to the 3 tables I would have.  I can de-normalize a lot of 
the data, but the combination of times, locations and textual 
representations would be way too large to fully flatten.


I'm about to start experimenting with different strategies, and I would 
appreciate any insight anyone can provide.  Would the faceting code help 
here somehow?


Thanks --Joachim







Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Chris Hostetter

: > when we facet on the authors, we start with
: > that list and go in order, generating their facet constraint count using
: > the DocSet intersection just like we currently do ... if we reach our
: > facet.limit before we reach the end of the list, and the lowest constraint
: > count is higher than the total doc count of the last author in the list,
: > then we know we don't need to bother testing any other author, because no
: > other author can possibly have a higher facet constraint count than the
: > ones on our list
:
: This works OK if the intersection counts are high (as a percentage of
: the facet sets).  I'm not sure how often this will be the case though.

well, keep in mind "N" could be very big, big enough to store the full
list of Terms sorted in docFreq order (it shouldn't take up much space
since it's just the Term and an int) ... for any query that returns a
"large" number of results, you probably won't need to reach the end of the
list before you can tell that all the remaining Terms have a lower docFreq
than the current last constraint count in your facet.limit list.  For
queries that return a "small" number of results, it wouldn't be as
useful, but that's where a switch could be flipped to start with the values
mapped to the docs (using FieldCache -- assuming single-valued fields)

: Another tradeoff is to allow getting inexact counts with multi-token fields 
by:
:  - simply faceting on the most popular values
:OR
:  - do some sort of statistical sampling by reading term vectors for a
: fraction of the matching docs.

I loathe inexact counts ... I think of them as "Astrology" to the Astronomy
of true Faceted Searching ... but I'm sure they would be "good enough" for
some people's use cases.



-Hoss



Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Chris Hostetter

: I just updated the comments in solrconfig.xml:

I've tweaked the SolrCaching wiki page to include some of this info as
well, feel free to add any additional info you think would be helpful to
other people (or ask any questions about it if any of it still doesn't seem
clear to you)...

http://wiki.apache.org/solr/SolrCaching

: > now, 400docs/sec!). However, I still don't have an idea what are these
: > values representing, and how I should estimate what values I should set
: > them to. Originally I thought it was the size of the cache in kb, and
: > someone on the list told me it was number of items, but I don't quite
: > get it. Better documentation on that would be welcomed :)



-Hoss



wana use CJKAnalyzer

2006-09-19 Thread James liu
My steps to support CJK:

1: add lucene-analyzers-2.0.0.jar to "C:\cygwin\tmp\solr-nightly\lib"
2: in cmd: cd C:\cygwin\tmp\solr-nightly, then "ant dist"
3: copy "C:\cygwin\tmp\solr-nightly\dist\solr-1.0.war" to "C:\cygwin\tmp\solr-nightly\example\webapps\solr.war"
4: modify the schema (conf/schema.xml) like yours, just ""
5: delete solr/data/index
6: start jetty: java -jar start.jar
7: no errors.
8: go to http://localhost:8983/solr/admin, click the analysis link, and try to analyze a Chinese word, but nothing happened.
9: I use xml.php to add documents (English works fine), and it shows me OK.
10: I try lukeall.jar to look at Solr's index data, but it shows me what you see in my attachments. xml.php may be wrong even though it shows no error.

I wrote jl.xml in example/exampledocs and, in cygwin, ran: sh post.sh jl.xml, with no errors. Then I used lukeall.jar to look, and nothing changed. I am stuck. Maybe someone can give me some advice to solve it.

--
regards
jl


The attached jl.xml contains three test docs (ids and Chinese text):

  111  姓名是刘平
  112  姓名是小王
  113  老婆不在家

Re: no example to CollectionDistribution?

2006-09-19 Thread James liu

I see, thank you.

2006/9/20, Chris Hostetter <[EMAIL PROTECTED]>:



: maybe I should get cron through cygwin..
:
: My system is win2003, not Unix.
:
: Today I tried ./snappuller, but it seems wrong, and I set the master port,
: directory, and snap directory

The CollectionDistribution scripts may not work well on Windows -- many of
them require hardlinks, which may or may not be supported by Windows
or cygwin (I've heard different things) ... snappuller in particular
requires that you have rsync running.



-Hoss





--
regards
jl


Re: tomcat install

2006-09-19 Thread James liu

I'd like to hear more about "I would start by trying to use the CJKAnalyzer as is with
the syntax described above."

If you need a tester, call me.




2006/9/20, Chris Hostetter <[EMAIL PROTECTED]>:



: I have gone through my archives and found that people have also used
: something similar to:
:
: <fieldtype name="text_cjk" class="solr.TextField">
:   <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
: </fieldtype>

Correct.  If you want to use a Lucene analyzer "as is" all you need to do
is specify the class name.  If you want to make an analyzer on the fly
from a tokenizer and some token filters -- you need factories for each.

I would start by trying to use the CJKAnalyzer as is with the syntax
described above.  Once you get that working, then look at what it takes to
write factories for the tokenizer so you can mix/match it with other
token filters.


-Hoss





--
regards
jl