Re: SolrCloud result correctness compared with single core

2015-01-29 Thread Yandong Yao
Pretty helpful, thanks Erick!

2015-01-24 9:48 GMT+08:00 Erick Erickson :

> you might, but probably not enough to notice. At 50G, the tf/idf
> stats will _probably_ be close enough you won't be able to tell.
>
> That said, recently distributed tf/idf has been implemented but
> you need to ask for it, see SOLR-1632. This is Solr 5.0 though.
>
> I've rarely seen it matter except in fairly specialized situations.
> Consider a single core. Deleted documents still count towards
> some of the tf/idf stats. So your scoring could theoretically
> change after, say, an optimize.
>
> The so-called "bottom line" is that yes, the scoring may change, but
> IMO not any more radically than was possible with single cores,
> and I wouldn't worry about it unless I had evidence that it was
> biting me.
>
> Best
> Erick
>
> On Fri, Jan 23, 2015 at 2:52 PM, Yandong Yao  wrote:
>
> > Hi Guys,
> >
> > As the main scoring mechanism is based on tf/idf, will the same query
> > running against SolrCloud return different results than running it
> > against a single core with the same data set, since idf will only
> > count df inside one core?
> >
> > eg: Assume I have 100GB data:
> > A) Index those data using single core
> > B) Index those data using SolrCloud with two cores (each has 50GB data
> > index)
> >
> > Then if I query both with the same query, like 'apple', will I get
> > different results for A and B?
> >
> >
> > Regards,
> > Yandong
> >
>


SolrCloud result correctness compared with single core

2015-01-23 Thread Yandong Yao
Hi Guys,

As the main scoring mechanism is based on tf/idf, will the same query running
against SolrCloud return different results than running it against a single
core with the same data set, since idf will only count df inside one core?

eg: Assume I have 100GB data:
A) Index those data using single core
B) Index those data using SolrCloud with two cores (each has 50GB data
index)

Then if I query both with the same query, like 'apple', will I get
different results for A and B?


Regards,
Yandong
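The df skew Erick describes is easy to see on a toy corpus. The sketch below is an illustration only (Lucene's actual Similarity formula differs in detail): it computes a classic idf on the full corpus and on each of two shards.

```python
import math

# Toy corpus: 6 docs split across two "shards" of 3 docs each.
docs = [
    "apple iphone case", "apple macbook", "banana smoothie",
    "apple watch band", "orange juice", "grape soda",
]
shards = [docs[:3], docs[3:]]

def idf(term, corpus):
    # Classic idf variant: 1 + ln(N / (df + 1)), df = docs containing term.
    df = sum(1 for d in corpus if term in d.split())
    return 1.0 + math.log(len(corpus) / (df + 1))

global_idf = idf("apple", docs)                 # df counted over all 6 docs
shard_idfs = [idf("apple", s) for s in shards]  # df counted per shard: 2 vs 1
```

With 'apple' in two of shard 1's docs but only one of shard 2's, the two shards score the same term differently, which is the distributed-idf effect SOLR-1632 addresses; at 50GB per shard the term statistics usually even out, as Erick notes.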


Re: Index optimize takes more than 40 minutes for 18M documents

2013-02-21 Thread Yandong Yao
Thanks Walter for the info, we will disable optimize then and do more testing.

Regards,
Yandong

2013/2/22 Walter Underwood 

> That seems fairly fast. We index about 3 million documents in about half
> that time. We are probably limited by the time it takes to get the data
> from MySQL.
>
> Don't optimize. Solr automatically merges index segments as needed.
> Optimize forces a full merge. You'll probably never notice the difference,
> either in disk space or speed.
>
> It might make sense to force merge (optimize) if you reindex everything
> once per day and have no updates in between. But even then it may be a
> waste of time.
>
> You need lots of free disk space for merging, whether a forced merge or
> automatic. Free space equal to the size of the index is usually enough, but
> worst case can need double the size of the index.
>
> wunder
>
> On Feb 21, 2013, at 9:20 AM, Yandong Yao wrote:
>
> > Hi Guys,
> >
> > I am using Solr 4.1 and have indexed 18M documents using solrj
> > ConcurrentUpdateSolrServer (each document contains 5 fields, and average
> > length is less than 1k).
> >
> > 1) It takes 70 minutes to index those documents without optimize on my
> > Mac (OS X 10.8). Is that performance slow, fast, or typical?
> >
> > 2) It takes about 40 minutes to optimize those documents; following is
> > the top output, and there are lots of FAULTS. What does this mean?
> >
> > Processes: 118 total, 2 running, 8 stuck, 108 sleeping, 719 threads
> >
> >   00:56:52
> > Load Avg: 1.48, 1.56, 1.73  CPU usage: 6.63% user, 6.40% sys, 86.95% idle
> > SharedLibs: 31M resident, 0B data, 6712K linkedit.
> > MemRegions: 34734 total, 5801M resident, 39M private, 638M shared.
> PhysMem:
> > 982M wired, 3600M active, 3567M inactive, 8150M used, 38M free.
> > VM: 254G vsize, 1285M framework vsize, 1469887(368) pageins, 1095550(0)
> > pageouts.  Networks: packets: 14842595/9661M in, 14777685/9395M out.
> > Disks: 820048/43G read, 523814/53G written.
> >
> > PID   COMMAND  %CPU  TIME #TH  #WQ  #POR #MRE RPRVT  RSHRD  RSIZE
> > VPRVT  VSIZE  PGRP PPID STATE   UID  FAULTS   COW  MSGSENT  MSGRECV
> SYSBSD
> >   SYSMACH
> > 4585  java 11.7  02:52:01 32   1483  342  3866M+ 6724K
>  3856M+
> > 4246M  6908M  4580 4580 sleepin 501  1490340+ 402  3000781+ 231785+
> > 15044055+ 10033109+
> >
> > 3) If I don't run optimize, what is the impact? A bigger index on disk or
> > slower query performance?
> >
> > Following is my index config in  solrconfig.xml:
> >
> > 100
> > 10
> > 
> >   10
> >   30
> >   false
> > 
> >
> > Thanks very much in advance!
> >
> > Regards,
> > Yandong
>
>
>
>
>
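If, despite Walter's advice, a forced merge is still wanted (e.g. after a once-a-day full reindex), it is issued through the update handler. A minimal sketch of building that request URL — the `optimize`, `maxSegments`, and `waitSearcher` parameter names are Solr's update-request parameters; the base URL is just the stock example's:

```python
from urllib.parse import urlencode

def optimize_url(base="http://localhost:8983/solr/collection1",
                 max_segments=None, wait_searcher=True):
    """Build the URL for an explicit (forced) merge.

    If you do force a merge, maxSegments > 1 bounds the rewrite I/O better
    than a full single-segment optimize.
    """
    params = {"optimize": "true", "waitSearcher": str(wait_searcher).lower()}
    if max_segments is not None:
        params["maxSegments"] = max_segments
    return f"{base}/update?{urlencode(params)}"
```

Remember the free-disk rule from the thread: a forced merge can transiently need up to twice the index size.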


Re: How to run many MoreLikeThis request efficiently?

2013-01-09 Thread Yandong Yao
Hi Otis,

Really appreciate your help on this!!  Will go with multi-threading first,
and then write a custom component if performance is not good enough.

Regards,
Yandong

2013/1/10 Otis Gospodnetic 

> Patience, young Yandong :)
>
> Multi-threading *in your application* is the way to go. Alternatively, one
> could write a custom SearchComponent that is called once and inside of
> which the whole work is done after just one call to it. This component
> could then write the output somewhere, like in a new index since making a
> blocking call to it may time out.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Jan 9, 2013 6:07 PM, "Yandong Yao"  wrote:
>
> > Any comments on this? Thanks very much in advance!
> >
> > 2013/1/9 Yandong Yao 
> >
> > > Hi Solr Guru,
> > >
> > > I have two sets of documents in one SolrCore; each set has about 1M
> > > documents with a different document type, say 'type1' and 'type2'.
> > >
> > > Many documents in the first set are very similar to 1 or 2 documents in
> > > the second set. What I want is: for each document in set 2, return the
> > > most similar document in set 1 using either 'MoreLikeThisHandler' or
> > > 'MoreLikeThisComponent'.
> > >
> > > Currently I use the following code to get the result, but it sends far
> > > too many requests to the Solr server serially.  Is there any way to
> > > improve this besides using multi-threading?  Thanks very much!
> > >
> > > for each document in set 2 whose type is 'type2'
> > >     run MoreLikeThis request against Solr server and get the most
> > >     similar document
> > > end.
> > >
> > > Regards,
> > > Yandong
> > >
> >
>
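Otis's multi-threading suggestion can be sketched with a small worker pool. `fetch_similar` is a placeholder for whatever client call hits Solr's MoreLikeThis handler; only the fan-out pattern is the point here:

```python
from concurrent.futures import ThreadPoolExecutor

def most_similar_batch(doc_ids, fetch_similar, max_workers=8):
    """Run one MoreLikeThis-style lookup per 'type2' document in parallel.

    fetch_similar(doc_id) is an assumed callable that queries Solr's /mlt
    handler and returns the id of the most similar 'type1' document.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_similar, doc_ids)
    return dict(zip(doc_ids, results))
```

A thread pool of 8–16 workers usually saturates a single Solr node for this kind of read-only workload; the custom SearchComponent Otis describes avoids the per-request HTTP overhead entirely.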


Re: How to run many MoreLikeThis request efficiently?

2013-01-09 Thread Yandong Yao
Any comments on this? Thanks very much in advance!

2013/1/9 Yandong Yao 

> Hi Solr Guru,
>
> I have two sets of documents in one SolrCore; each set has about 1M
> documents with a different document type, say 'type1' and 'type2'.
>
> Many documents in the first set are very similar to 1 or 2 documents in the
> second set. What I want is: for each document in set 2, return the
> most similar document in set 1 using either 'MoreLikeThisHandler' or
> 'MoreLikeThisComponent'.
>
> Currently I use the following code to get the result, but it sends far
> too many requests to the Solr server serially.  Is there any way to improve
> this besides using multi-threading?  Thanks very much!
>
> for each document in set 2 whose type is 'type2'
>     run MoreLikeThis request against Solr server and get the most similar
>     document
> end.
>
> Regards,
> Yandong
>


How to run many MoreLikeThis request efficiently?

2013-01-08 Thread Yandong Yao
Hi Solr Guru,

I have two sets of documents in one SolrCore; each set has about 1M
documents with a different document type, say 'type1' and 'type2'.

Many documents in the first set are very similar to 1 or 2 documents in the
second set. What I want is: for each document in set 2, return the
most similar document in set 1 using either 'MoreLikeThisHandler' or
'MoreLikeThisComponent'.

Currently I use the following code to get the result, but it sends far
too many requests to the Solr server serially.  Is there any way to improve
this besides using multi-threading?  Thanks very much!

for each document in set 2 whose type is 'type2'
    run MoreLikeThis request against Solr server and get the most similar
    document
end.

Regards,
Yandong


Re: mergeindex: what happens if there is deletion during index merging

2012-08-21 Thread Yandong Yao
Hi Shalin,

Thanks very much for your detailed explanation!

Regards,
Yandong

2012/8/21 Shalin Shekhar Mangar 

> On Tue, Aug 21, 2012 at 8:47 AM, Yandong Yao  wrote:
>
> > Hi guys,
> >
> > From http://wiki.apache.org/solr/MergingSolrIndexes,  it said 'Using
> > "srcCore", care is taken to ensure that the merged index is not corrupted
> > even if writes are happening in parallel on the source index'.
> >
> > What does it mean? If there are deletion requests during merging, will
> > those deletions be processed correctly after the merge finishes?
> >
>
> Solr keeps an instance of the IndexReader for each srcCore which is a
> static snapshot of the index at the time of the merge request. This static
> snapshot is merged to the target core. Therefore any insert/delete request
> made to the srcCores after the merge request will not affect the merged
> index.
>
>
> >
> > 1)
> > eg:  I have an existing core 'core0', and I want to merge core 'core1'
> and
> > 'core2' to core 'core0', so I will use
> >
> >
> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
> > ,
> >
> > If, while the merge is happening, core0, core1, and core2 receive
> > requests to delete some old documents, will the final core 'core0'
> > contain all content from 'core1' and 'core2', with all documents matching
> > the deletion criteria deleted?
> >
>
> The final core0 will not have documents deleted by requests made on core0.
> However, documents deleted on core1 and core2 will still be in core0 if the
> merge started before those requests were made.
>
>
> >
> > 2)
> > And if core0, core1, and core2 are processing deletion requests when a
> > core merge request comes in at the same time, what will happen? Will the
> > merge request block until the deletions finish on all cores?
> >
>
> I believe core0 will continue to process deletion requests concurrently
> with the merge. As for core1 and core2, since a merge reserves their
> IndexReader, the answer depends on when a commit happens on core1 and
> core2. If, for example, 2 deletions were made on core1 and then a commit
> was issued (or autoCommit happened) and then the merge was triggered then
> the final core0 will not have those documents but it may still have docs
> deleted after the commit.
>
>
> >
> > Thanks very much in advance!
> >
> > Regards,
> > Yandong
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
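The mergeindexes call from the thread can be built programmatically. Per Shalin's explanation, issue a commit on each source core first, so that pending deletes are part of the static snapshot that gets merged. A minimal sketch (the CoreAdmin path and parameters match the URL quoted above):

```python
from urllib.parse import urlencode

def merge_indexes_url(host, target_core, src_cores):
    """Build the CoreAdmin mergeindexes URL; srcCore repeats once per source."""
    params = [("action", "mergeindexes"), ("core", target_core)]
    params += [("srcCore", c) for c in src_cores]
    return f"http://{host}/solr/admin/cores?{urlencode(params)}"
```

Passing a list of tuples to `urlencode` is what lets `srcCore` appear once per source core in the query string.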


mergeindex: what happens if there is deletion during index merging

2012-08-20 Thread Yandong Yao
Hi guys,

From http://wiki.apache.org/solr/MergingSolrIndexes,  it said 'Using
"srcCore", care is taken to ensure that the merged index is not corrupted
even if writes are happening in parallel on the source index'.

What does it mean? If there are deletion requests during merging, will those
deletions be processed correctly after the merge finishes?

1)
eg:  I have an existing core 'core0', and I want to merge core 'core1' and
'core2' to core 'core0', so I will use
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
,

If, while the merge is happening, core0, core1, and core2 receive requests to
delete some old documents, will the final core 'core0' contain all content
from 'core1' and 'core2', with all documents matching the deletion criteria
deleted?

2)
And if core0, core1, and core2 are processing deletion requests when a core
merge request comes in at the same time, what will happen? Will the merge
request block until the deletions finish on all cores?

Thanks very much in advance!

Regards,
Yandong


Count is inconsistent between facet and stats

2012-07-18 Thread Yandong Yao
Hi Guys,

Steps to reproduce:

1) Download apache-solr-4.0.0-ALPHA
2) cd example;  java -jar start.jar
3) cd exampledocs;  ./post.sh *.xml
4) Use statsComponent to get the stats info for field 'popularity' based on
facet 'cat'.  And the 'count' for 'electronics' is 3
http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&stats=true&stats.field=popularity&stats.facet=cat

{
  "stats_fields": {
    "popularity": {
      "min": 0, "max": 10, "count": 14, "missing": 0,
      "sum": 75, "sumOfSquares": 503,
      "mean": 5.357142857142857, "stddev": 2.7902892835178013,
      "facets": {
        "cat": {
          "music":         { "min": 10, "max": 10, "count": 1, "missing": 0,
                             "sum": 10, "sumOfSquares": 100, "mean": 10, "stddev": 0 },
          "monitor":       { "min": 6, "max": 6, "count": 2, "missing": 0,
                             "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0 },
          "hard drive":    { "min": 6, "max": 6, "count": 2, "missing": 0,
                             "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0 },
          "scanner":       { "min": 6, "max": 6, "count": 1, "missing": 0,
                             "sum": 6, "sumOfSquares": 36, "mean": 6, "stddev": 0 },
          "memory":        { "min": 0, "max": 7, "count": 3, "missing": 0,
                             "sum": 12, "sumOfSquares": 74, "mean": 4, "stddev": 3.605551275463989 },
          "graphics card": { "min": 7, "max": 7, "count": 2, "missing": 0,
                             "sum": 14, "sumOfSquares": 98, "mean": 7, "stddev": 0 },
          "electronics":   { "min": 1, "max": 7, "count": 3, "missing": 0,
                             "sum": 9, "sumOfSquares": 51, "mean": 3, "stddev": 3.4641016151377544 }
        }
      }
    }
  }
}
5) Facet on 'cat', and the count for 'electronics' is 14.
http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&facet=true&facet.field=cat

{
  "cat": [
    "electronics", 14,
    "memory", 3,
    "connector", 2,
    "graphics card", 2,
    "hard drive", 2,
    "monitor", 2,
    "camera", 1,
    "copier", 1,
    "multifunction printer", 1,
    "music", 1,
    "printer", 1,
    "scanner", 1,
    "currency", 0,
    "search", 0,
    "software", 0
  ]
}



So from the StatsComponent the count for the 'electronics' cat is 3, while
the FacetComponent reports 14 for 'electronics'. Is this a bug?

Following is the field definition for 'cat'.


Thanks,
Yandong
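One way such counts can diverge is that 'cat' is multivalued: the FacetComponent counts a document once for every cat value it holds, while a per-bucket stats count may see each document only once (stats.facet historically assumed a single-valued facet field). The sketch below uses an assumed single-bucket simplification, not the actual StatsComponent implementation, purely to show the mismatch direction:

```python
# Toy docs with a multivalued 'cat' field, as in the example schema.
docs = [
    {"cat": ["electronics", "memory"]},
    {"cat": ["memory", "electronics"]},
    {"cat": ["electronics"]},
]

# FacetComponent-style counting: a doc counts once for EVERY cat value.
facet_counts = {}
for d in docs:
    for c in d["cat"]:
        facet_counts[c] = facet_counts.get(c, 0) + 1

# Assumed single-bucket simplification (each doc bucketed under one value
# only) to show how per-bucket counts can fall below facet counts.
bucket_counts = {}
for d in docs:
    first = d["cat"][0]
    bucket_counts[first] = bucket_counts.get(first, 0) + 1
```

Here 'electronics' facets to 3 but buckets to 2: the second document is counted under 'memory' instead.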


Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-23 Thread Yandong Yao
Hi Mark, Darren

Thanks very much for your help, Will try collection for each customer then.

Regards,
Yandong


2012/5/22 Mark Miller 

> I think the key is this: you want to think of a SolrCore on a single node
> Solr installation as a collection on a multi node SolrCloud installation.
>
> So if you would use multiple SolrCore's with a std Solr setup, you should
> be using multiple collections in SolrCloud. If you were going to try to do
> everything in one SolrCore, that would be like putting everything in one
> collection in SolrCloud. I don't think it generally makes sense to try and
> work at the SolrCore level when working with SolrCloud. This will be made
> more clear once we add a simple collections api.
>
> So I think your choice should be similar to using a single node - do you
> want to put everything in one 'collection' and use a filter to separate
> customers (with all its caveats and limitations) or do you want to use a
> collection per customer. You can always start up more clusters if you reach
> any limits.
>
>
>
> On May 22, 2012, at 10:08 AM, Darren Govoni wrote:
>
> > I'm curious what the SolrCloud experts say, but my suggestion is to try
> not to over-engineer the search architecture on SolrCloud. For example,
> what is the benefit of managing which cores are indexed and searched?
> Having to know those details, in my mind, works against the automation in
> SolrCloud, but maybe there's a good reason you want to do it this way.
> >
> > --- Original Message ---
> > On 5/22/2012  07:35 AM Yandong Yao wrote:Hi Darren,
> > 
> > Thanks very much for your reply.
> > 
> > The reason I want to control core indexing/searching is that I want to
> > use one core to store one customer's data (all customers share the same
> > config): e.g. customer 1 uses coreForCustomer1 and customer 2
> > uses coreForCustomer2.
> >
> > Is there any better way than using a different core for each customer?
> >
> > Another way may be to use a different collection for each customer,
> > though I am not sure how many collections SolrCloud could support. Which
> > way is better in terms of flexibility/scalability? (Suppose there are
> > tens of thousands of customers.)
> > 
> > Regards,
> > Yandong
> > 
> > 2012/5/22 Darren Govoni 
> > 
> > > Why do you want to control what gets indexed into a core and then
> > > knowing what core to search? That's the kind of "knowing" that
> SolrCloud
> > > solves. In SolrCloud, it handles the distribution of documents
> across
> > > shards and retrieves them regardless of which node is searched
> from.
> > > That is the point of "cloud", you don't know the details of where
> > > exactly documents are being managed (i.e. they are cloudy). It can
> > > change and re-balance from time to time. SolrCloud performs the
> > > distributed search for you, therefore when you try to search a
> node/core
> > > with no documents, all the results from the "cloud" are retrieved
> > > regardless. This is considered "A Good Thing".
> > >
> > > It requires a change in thinking about indexing and searching
> > >
> > > On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
> > > > Hi Guys,
> > > >
> > > > I use following command to start solr cloud according to solr
> cloud wiki.
> > > >
> > > > yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
> > > > -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar
> start.jar
> > > > yydzero:example2 bjcoe$ java -Djetty.port=7574
> -DzkHost=localhost:9983
> > > -jar
> > > > start.jar
> > > >
> > > > Then I have created several cores using CoreAdmin API (
> > > > http://localhost:8983/solr/admin/cores?action=CREATE&name=
> > > > &collection=collection1), and clusterstate.json show
> following
> > > > topology:
> > > >
> > > >
> > > > collection1:
> > > > -- shard1:
> > > >   -- collection1
> > > >   -- CoreForCustomer1
> > > >   -- CoreForCustomer3
> > > >   -- CoreForCustomer5
> > > > -- shard2:
> > > >   -- collection1
> > > >   -- CoreForCustomer2
> > > >   -- CoreForCustomer4
> > > >
> > > >
> > > > 1) Index:
> > > >
> > > > Using

Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-22 Thread Yandong Yao
Hi Darren,

Thanks very much for your reply.

The reason I want to control core indexing/searching is that I want to
use one core to store one customer's data (all customers share the same
config): e.g. customer 1 uses coreForCustomer1 and customer 2
uses coreForCustomer2.

Is there any better way than using a different core for each customer?

Another way may be to use a different collection for each customer, though I
am not sure how many collections SolrCloud could support. Which way is better
in terms of flexibility/scalability? (Suppose there are tens of thousands of
customers.)

Regards,
Yandong

2012/5/22 Darren Govoni 

> Why do you want to control what gets indexed into a core and then
> knowing what core to search? That's the kind of "knowing" that SolrCloud
> solves. In SolrCloud, it handles the distribution of documents across
> shards and retrieves them regardless of which node is searched from.
> That is the point of "cloud", you don't know the details of where
> exactly documents are being managed (i.e. they are cloudy). It can
> change and re-balance from time to time. SolrCloud performs the
> distributed search for you, therefore when you try to search a node/core
> with no documents, all the results from the "cloud" are retrieved
> regardless. This is considered "A Good Thing".
>
> It requires a change in thinking about indexing and searching
>
> On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
> > Hi Guys,
> >
> > I use following command to start solr cloud according to solr cloud wiki.
> >
> > yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
> > -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
> > yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983
> -jar
> > start.jar
> >
> > Then I have created several cores using CoreAdmin API (
> > http://localhost:8983/solr/admin/cores?action=CREATE&name=
> > &collection=collection1), and clusterstate.json show following
> > topology:
> >
> >
> > collection1:
> > -- shard1:
> >   -- collection1
> >   -- CoreForCustomer1
> >   -- CoreForCustomer3
> >   -- CoreForCustomer5
> > -- shard2:
> >   -- collection1
> >   -- CoreForCustomer2
> >   -- CoreForCustomer4
> >
> >
> > 1) Index:
> >
> > Using following command to index mem.xml file in exampledocs directory.
> >
> > yydzero:exampledocs bjcoe$ java -Durl=
> > http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
> > SimplePostTool: version 1.4
> > SimplePostTool: POSTing files to
> > http://localhost:8983/solr/coreForCustomer3/update..
> > SimplePostTool: POSTing file mem.xml
> > SimplePostTool: COMMITting Solr index changes.
> >
> > And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
> > 'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and other 2
> > core has 0 documents.
> >
> > *Question 1:*  Is this expected behavior? How do I index documents into
> > a specific core?
> >
> > *Question 2*:  If SolrCloud doesn't support this yet, how could I extend
> > it to support this feature (indexing a document to a particular core)?
> > Where should I start, the hashing algorithm?
> >
> > *Question 3*:  Why are the documents also indexed into 'coreForCustomer1'
> > and 'coreForCustomer5'?  The default replication factor for documents is
> > 1, right?
> >
> > Then I try to index some document to 'coreForCustomer2':
> >
> > $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
> > post.jar ipod_video.xml
> >
> > However, 'coreForCustomer2' still has 0 documents, and the documents in
> > ipod_video.xml are indexed to the cores for customers 1/3/5.
> >
> > *Question 4*:  Why this happens?
> >
> > 2) Search: I use "
> > http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml"; to
> > search against 'CoreForCustomer2', but it returns all documents in
> > the whole collection even though this core has no documents at all.
> >
> > Then I use "
> >
> http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2
> ",
> > and it will return 0 documents.
> >
> > *Question 5*: So if I want to search against a particular core, I need to
> > use the 'shards' parameter with the SolrCore name as its value, right?
> >
> >
> > Thanks very much in advance!
> >
> > Regards,
> > Yandong
>
>
>
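Following Mark's advice, a collection per customer keeps routing trivial: the client simply addresses that customer's collection in the request URL, with no `shards` parameter and no knowledge of core placement. A sketch, assuming a hypothetical 'customer_<id>' naming convention:

```python
from urllib.parse import urlencode

def tenant_select_url(host, customer_id, query, rows=10):
    """Query one customer's dedicated collection.

    The 'customer_<id>' collection naming is an assumption for illustration;
    SolrCloud routes the query within that collection automatically.
    """
    collection = f"customer_{customer_id}"
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"http://{host}/solr/{collection}/select?{params}"
```

This is the multi-tenant shape Mark describes: isolation comes from the collection boundary, not from hand-picking cores.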


SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-21 Thread Yandong Yao
Hi Guys,

I use following command to start solr cloud according to solr cloud wiki.

yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar
start.jar

Then I have created several cores using CoreAdmin API (
http://localhost:8983/solr/admin/cores?action=CREATE&name=
&collection=collection1), and clusterstate.json show following
topology:


collection1:
-- shard1:
  -- collection1
  -- CoreForCustomer1
  -- CoreForCustomer3
  -- CoreForCustomer5
-- shard2:
  -- collection1
  -- CoreForCustomer2
  -- CoreForCustomer4


1) Index:

Using following command to index mem.xml file in exampledocs directory.

yydzero:exampledocs bjcoe$ java -Durl=
http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to
http://localhost:8983/solr/coreForCustomer3/update..
SimplePostTool: POSTing file mem.xml
SimplePostTool: COMMITting Solr index changes.

And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and other 2
core has 0 documents.

*Question 1:*  Is this expected behavior? How do I index documents into
a specific core?

*Question 2*:  If SolrCloud doesn't support this yet, how could I extend it
to support this feature (indexing a document to a particular core)? Where
should I start, the hashing algorithm?

*Question 3*:  Why are the documents also indexed into 'coreForCustomer1'
and 'coreForCustomer5'?  The default replication factor for documents is 1, right?

Then I try to index some document to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
post.jar ipod_video.xml

However, 'coreForCustomer2' still has 0 documents, and the documents in
ipod_video.xml are indexed to the cores for customers 1/3/5.

*Question 4*:  Why this happens?

2) Search: I use "
http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml"; to
search against 'CoreForCustomer2', but it returns all documents in
the whole collection even though this core has no documents at all.

Then I use "
http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2";,
and it will return 0 documents.

*Question 5*: So if I want to search against a particular core, I need to
use the 'shards' parameter with the SolrCore name as its value, right?


Thanks very much in advance!

Regards,
Yandong


Re: Faster Solr Indexing

2012-03-11 Thread Yandong Yao
I have similar issues using DIH,
and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
consumes most of the time when indexing 10K rows (each row is about 70K):
-  DIH nextRow takes about 10 seconds in total
-  If the index uses a whitespace tokenizer and lowercase filter, the
addDoc() method takes about 80 seconds
-  If the index uses a whitespace tokenizer, lowercase filter, and WDF, then
addDoc takes about 112 seconds
-  If the index uses a whitespace tokenizer, lowercase filter, WDF, and a
Porter stemmer, then addDoc takes about 145 seconds

We have more than a million rows in total, and I am wondering whether I am
doing something wrong or whether there is any way to improve the performance
of addDoc()?

Thanks very much in advance!


Following is the configure:
1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml  (almost copied from solr's  example/solr directory.)

  

false

10

64



2147483647
1000
1

native
  

2012/3/11 Peyman Faratin 

> Hi
>
> I am trying to index 12MM docs faster than is currently happening in Solr
> (using solrj). We have identified solr's add method as the bottleneck (and
> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and
> jvm ram).
>
> Adding 1000 docs is taking approximately 25 seconds. We are making sure we
> add and commit in batches. And we've tried both CommonsHttpSolrServer and
> EmbeddedSolrServer (assuming removing http overhead would speed things up
> with embedding) but the difference is marginal.
>
> The docs being indexed are on average 20 fields long, mostly indexed but
> none stored. The major size contributors are two fields:
>
>- content, and
>- shingledContent (populated using copyField of content).
>
> The length of the content field is (likely) gaussian distributed (few
> large docs 50-80K tokens, but majority around 2k tokens). We use
> shingledContent to support phrase queries and content for unigram queries
> (following the advice of Solr Enterprise search server advice - p. 305,
> section "The Solution: Shingling").
>
> Clearly the size of the docs is a contributor to the slow adds (confirmed
> by removing these 2 fields resulting in halving the indexing time). We've
> tried compressed=true also but that is not working.
>
> Any guidance on how to support our application logic (without having to
> change the schema too much) and speed the indexing speed (from current 212
> days for 12MM docs) would be much appreciated.
>
> thank you
>
> Peyman
>
>
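Independent of the analysis-chain cost measured above, sending documents in batches (rather than one add per request) is the usual first lever, as Peyman's mail also notes. A sketch of the batching pattern, with `add_batch` standing in for a solrj/HTTP bulk add; committing is left to autoCommit or a single commit at the end:

```python
def index_in_batches(docs, add_batch, batch_size=1000):
    """Send documents to Solr in fixed-size batches instead of one at a time.

    add_batch(batch) is an assumed callable wrapping the actual client call
    (e.g. ConcurrentUpdateSolrServer.add(collection_of_docs) in solrj).
    Returns the total number of documents sent.
    """
    batch = []
    sent = 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            add_batch(batch)
            sent += len(batch)
            batch = []
    if batch:            # flush the final partial batch
        add_batch(batch)
        sent += len(batch)
    return sent
```

Batch sizes of a few hundred to a few thousand docs are a common starting point; beyond that, multiple indexing threads are the next lever.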


How to use nested query in fq?

2012-02-07 Thread Yandong Yao
Hi Guys,

I am using Solr 3.5, and would like to use a fq like
'getField(getDoc(uuid:workspace_${workspaceId})),  "isPublic"):true?

- workspace_${workspaceId}:  workspaceId is indexed field.
- getDoc(uuid:concat("workspace_", workspaceId):  return the document whose
uuid is "workspace_${workspaceId}"
- getField(getDoc(uuid:workspace_${workspaceId})),  "isPublic"):  return
the matched document's isPublic field

The use case is that I have workspace objects, and a workspace contains many
sub-objects, such as work files, comments, datasets and so on. A
workspace has an 'isPublic' field. If this field is true, then all
registered users could access this workspace and all its sub-objects.
Otherwise, only workspace members could access this workspace and its
sub-objects.

So I want to use fq to determine whether the document in question belongs to
a public workspace or not.  Is this possible?

If not, how can I implement a similar feature? By implementing a
ValueSourcePlugin? Any guidance or example on this?

Or is there any better solutions?


It is possible to add an 'isPublic' field to all sub-objects, but that makes
index updates more complex, so I am trying to find a better solution.

Thanks very much in advance!

Regards,
Yandong
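Absent a field on the sub-objects, a two-pass workaround is possible: query the workspace documents once for the public ids, then build an fq over workspaceId for the sub-object query. The helper below is hypothetical (only the field names are taken from the mail); note also that later Solr releases added a join query parser that can express this in one request.

```python
def build_public_fq(workspaces, max_clauses=1024):
    """Two-pass alternative: resolve public workspace ids first (one query
    over the workspace docs), then filter sub-object queries by those ids.

    'workspaces' is a list of dicts like {"id": 7, "isPublic": True} as
    returned by the first query; this helper is illustrative only.
    """
    public_ids = [w["id"] for w in workspaces if w["isPublic"]]
    if not public_ids:
        return None  # no public workspaces: caller falls back to membership fq
    if len(public_ids) > max_clauses:
        raise ValueError("too many clauses; raise maxBooleanClauses or chunk")
    return "workspaceId:(" + " OR ".join(str(i) for i in public_ids) + ")"
```

The clause-count guard matters because Solr/Lucene caps boolean queries via maxBooleanClauses (1024 by default).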


Re: Need help for Solr searching case insensitive item

2010-10-26 Thread yandong yao
Sounds like a WordDelimiterFilter config issue; please refer to
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
.

Also it will help if you could provide:
1) Tokenizers/Filters config in schema file
2) analysis.jsp output in admin page.

2010/10/26 wu liu 

> Hi all,
>
> I just noticed a weird thing happening in my Solr search results.
> If I do a search for "ecommons", it does not return the results for
> "eCommons"; instead,
> if I do a search for "eCommons", I only get the matches for
> "eCommons", but not "ecommons".
>
> I cannot figure out why.
>
> please help me
>
> Thanks very much in advance
>
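The usual cause of exactly this symptom is a missing LowerCaseFilterFactory on one (or both) of the index and query analyzer chains. With lowercasing applied on both sides, matching behaves like this toy sketch (a minimal model, not Solr's actual analyzer):

```python
def analyze(text):
    """Minimal sketch of a whitespace tokenizer followed by a lowercase
    filter -- applied identically at index time and query time."""
    return [tok.lower() for tok in text.split()]

def matches(query, document):
    """True if every analyzed query term appears among the doc's terms."""
    doc_terms = set(analyze(document))
    return all(term in doc_terms for term in analyze(query))
```

Because both sides lowercase, "ecommons" and "eCommons" normalize to the same term, and either query form finds either document form.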


Re: A question on WordDelimiterFilterFactory

2010-09-14 Thread yandong yao
After upgrading to 1.4.1, it is fixed.

Thanks very much for your help!

Regards,
Yandong Yao

2010/9/14 yandong yao 

> Hi Robert,
>
> I am using solr 1.4, will try with 1.4.1 tomorrow.
>
> Thanks very much!
>
> Regards,
> Yandong Yao
>
> 2010/9/14 Robert Muir 
>
> did you index with solr 1.4 (or are you using solr 1.4) ?
>>
>> at a quick glance, it looks like it might be this:
>> https://issues.apache.org/jira/browse/SOLR-1852 , which was fixed in
>> 1.4.1
>>
>> On Tue, Sep 14, 2010 at 5:40 AM, yandong yao  wrote:
>>
>> > Hi Guys,
>> >
>> > I encountered a problem when enabling WordDelimiterFilterFactory for
>> > both index and query (the relevant part of schema.xml is pasted at the
>> > bottom of this email).
>> >
>> > *1. Steps to reproduce:*
>> > 1.1 The indexed sample document contains only one sentence: "This is
>> > a TechNote."
>> > 1.2 Query is: q=TechNote
>> > 1.3 Result: no matches are returned, although the above sentence
>> > clearly contains the word 'TechNote'.
>> >
>> > *
>> > 2. Output when enabling debugQuery*
>> > By turning on debugQuery
>> >
>> >
>> http://localhost:7111/solr/test/select?indent=on&version=2.2&q=TechNote&fq=&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=id%3A001&hl.fl=
>> > ,
>> > get following information:
>> >
>> > rawquerystring: TechNote
>> > querystring: TechNote
>> > parsedquery: PhraseQuery(all:"tech note")
>> > parsedquery_toString: all:"tech note"
>> > explainOther: id:001
>> >
>> > 0.0 = fieldWeight(all:"tech note" in 0), product of:
>> >   0.0 = tf(phraseFreq=0.0)
>> >   0.61370564 = idf(all: tech=1 note=1)
>> >   0.25 = fieldNorm(field=all, doc=0)
>> >
>> >
>> > Seems that the raw query string is converted to the phrase query "tech
>> > note", whose term frequency is 0, so there are no matches.
>> >
>> > *3. Result from admin/analysis.jsp page*
>> >
>> > From analysis.jsp, seems the query 'TechNote' matches the input
>> document,
>> > see below words marked by RED color.
>> >
>> > Index Analyzer org.apache.solr.analysis.WhitespaceTokenizerFactory {}
>>  term
>> > position 1234 term text ThisisaTechNote. term type wordwordwordword
>> source
>> > start,end 0,45,78,910,19 payload
>> >
>> >
>> >
>> >  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
>> > expand=true, ignoreCase=true}  term position 1234 term text
>> > ThisisaTechNote. term
>> > type wordwordwordword source start,end 0,45,78,910,19 payload
>> >
>> >
>> >
>> >  org.apache.solr.analysis.WordDelimiterFilterFactory
>> {splitOnCaseChange=1,
>> > generateNumberParts=1, catenateWords=1, generateWordParts=1,
>> catenateAll=0,
>> > catenateNumbers=1}  term position 12345 term text ThisisaTechNote
>> TechNote
>> > term
>> > type wordwordwordwordword word source start,end 0,45,78,910,1414,18
>> 10,18
>> > payload
>> >
>> >
>> >
>> >
>> >
>> >  org.apache.solr.analysis.LowerCaseFilterFactory {}  term position 12345
>> > term
>> > text thisisatechnote technote term type wordwordwordwordword word source
>> > start,end 0,45,78,910,1414,18 10,18 payload
>> >
>> >
>> >
>> >
>> >
>> >  org.apache.solr.analysis.SnowballPorterFilterFactory
>> > {protected=protwords.txt, language=English}  term position 12345 term
>> text
>> > thisisa*tech**note* technot term type wordwordwordwordword word source
>> > start,end 0,45,78,910,1414,18 10,18 payload
>> >
>> >
>> >
>> >
>> >
>> >  Query Analyzer org.apache.solr.analysis.WhitespaceTokenizerFactory {}
>> >  term
>> > position 1 term text TechNote term type word source start,end 0,8
>> payload
>> >  org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
>> > expand=true, ignoreCase=true}  term position 1 term text TechNote term
>> type
>> > word source start,end 0,8 payload
>> >  org.apache.solr.analysis.WordDelimiterFilterFactory
>> {splitOnCaseChange=1,
>> > generateNumberParts=1, catenateWords=0, generateWordParts=1,
>> catenateAll=0,
>> > catena

Re: A question on WordDelimiterFilterFactory

2010-09-14 Thread yandong yao
Hi Robert,

I am using solr 1.4, will try with 1.4.1 tomorrow.

Thanks very much!

Regards,
Yandong Yao

2010/9/14 Robert Muir 

> did you index with solr 1.4 (or are you using solr 1.4) ?
>
> at a quick glance, it looks like it might be this:
> https://issues.apache.org/jira/browse/SOLR-1852 , which was fixed in 1.4.1
>
> On Tue, Sep 14, 2010 at 5:40 AM, yandong yao  wrote:

A question on WordDelimiterFilterFactory

2010-09-14 Thread yandong yao
Hi Guys,

I encountered a problem when enabling WordDelimiterFilterFactory for both
index and query analysis (the relevant part of schema.xml is pasted at the
bottom of this email).

*1. Steps to reproduce:*
1.1 The indexed sample document contains only one sentence: "This is a
TechNote."
1.2 Query is: q=TechNote
1.3  Result: no matches return, while the above sentence contains word
'TechNote' absolutely.

*2. Output when enabling debugQuery*
Turning on debugQuery via
http://localhost:7111/solr/test/select?indent=on&version=2.2&q=TechNote&fq=&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=id%3A001&hl.fl=
I get the following information:

rawquerystring: TechNote
querystring: TechNote
parsedquery: PhraseQuery(all:"tech note")
parsedquery_toString: all:"tech note"
otherQuery: id:001
explainOther for id:001:
0.0 = fieldWeight(all:"tech note" in 0), product of:
  0.0 = tf(phraseFreq=0.0)
  0.61370564 = idf(all: tech=1 note=1)
  0.25 = fieldNorm(field=all, doc=0)



It seems the raw query string is converted to the phrase query "tech note",
whose term frequency is 0, so nothing matches.

*3. Result from admin/analysis.jsp page*

From analysis.jsp, it seems the query 'TechNote' does match the input
document; see the marked words below.

Index Analyzer:

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position   1     2    3    4
  term text  This  is   a    TechNote.
  start,end  0,4   5,7  8,9  10,19

org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
  position   1     2    3    4
  term text  This  is   a    TechNote.
  start,end  0,4   5,7  8,9  10,19

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
  position   1     2    3    4      5
  term text  This  is   a    Tech   Note
  start,end  0,4   5,7  8,9  10,14  14,18
  (plus catenated token TechNote at position 5, start,end 10,18)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  position   1     2    3    4      5
  term text  this  is   a    tech   note
  start,end  0,4   5,7  8,9  10,14  14,18
  (plus catenated token technote at position 5, start,end 10,18)

org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
  position   1     2    3    4       5
  term text  this  is   a    *tech*  *note*
  start,end  0,4   5,7  8,9  10,14   14,18
  (plus catenated token technot at position 5, start,end 10,18;
   tokens marked with * matched the query)

Query Analyzer:

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position   1
  term text  TechNote
  start,end  0,8

org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
  position   1
  term text  TechNote
  start,end  0,8

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
  position   1     2
  term text  Tech  Note
  start,end  0,4   4,8

org.apache.solr.analysis.LowerCaseFilterFactory {}
  position   1     2
  term text  tech  note
  start,end  0,4   4,8

org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
  position   1     2
  term text  tech  note
  start,end  0,4   4,8


*4. My questions are:*
4.1: Why do debugQuery and analysis.jsp give different results?
4.2: From my understanding, during indexing the word 'TechNote' is
converted to 1) 'technote' and 2) 'tech note' according to my config in
schema.xml, and at query time 'TechNote' is converted to 'tech note',
so it SHOULD match. Am I right?
4.3: Why is the phrase frequency of 'tech note' 0 in the debugQuery
output (0.0 = tf(phraseFreq=0.0))?

Any suggestions/comments are absolutely welcome!
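To make question 4.2 concrete, here is a rough Python approximation of my
own (a simplification, not Solr's actual code) of what
WordDelimiterFilterFactory with splitOnCaseChange=1 produces for
'TechNote' on each side:

```python
import re

def word_delimiter(token, catenate_words):
    # Approximate splitOnCaseChange=1: break the token on case transitions.
    parts = re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", token)
    tokens = list(parts)
    if catenate_words and len(parts) > 1:
        tokens.append("".join(parts))  # catenateWords=1 also emits the joined form
    return tokens

# Index side uses catenateWords=1, query side uses catenateWords=0:
print(word_delimiter("TechNote", catenate_words=True))
print(word_delimiter("TechNote", catenate_words=False))
```

So the index side emits both the parts and the catenated 'technote', while
the query side emits only the parts, which become the phrase "tech note".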


*5. fieldType definition in schema.xml*
<!-- tag contents were lost in the archive; attributes reconstructed from
     the filter parameters shown in section 3; the fieldType name and
     positionIncrementGap are assumed -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>


Thanks very much!
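As an aside for anyone debugging this: besides analysis.jsp, the token
streams can be fetched programmatically from Solr 1.4's
FieldAnalysisRequestHandler, if it is registered in solrconfig.xml. A
sketch (the handler path and field type name are assumptions):

```python
from urllib.parse import urlencode

# Parameters for FieldAnalysisRequestHandler; register the handler in
# solrconfig.xml first (the /analysis/field path below is an assumption).
params = urlencode({
    "analysis.fieldtype": "text",                 # assumed field type name
    "analysis.fieldvalue": "This is a TechNote.", # text to run through the index analyzer
    "analysis.query": "TechNote",                 # text to run through the query analyzer
    "analysis.showmatch": "true",                 # highlight tokens that match the query
})
url = "http://localhost:7111/solr/test/analysis/field?" + params
print(url)
```

Fetching this URL returns, per filter stage, the same position/term/offset
data that analysis.jsp renders.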


Re: how to support "implicit trailing wildcards"

2010-08-11 Thread yandong yao
Hi Jan,

It seems q=mount OR mount* gives a different sort order than q=mount for
documents containing 'mount'. I changed it to
q=mount^100 OR (mount?* -mount)^1.0, and it tests well.

Thanks very much!
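For reference, a small sketch (the helper name is mine, and a standard
select handler is assumed) that builds this boosted query, so exact
matches always outrank prefix-only matches:

```python
from urllib.parse import urlencode

def boosted_prefix_query(term: str) -> str:
    # Exact match gets a high boost; prefix matches that are NOT the exact
    # term get a low constant boost, so exact hits always sort first.
    return f"{term}^100 OR ({term}?* -{term})^1.0"

params = urlencode({
    "q": boosted_prefix_query("mount"),
    "fl": "*,score",
})
# Ready to append to e.g. http://localhost:8983/solr/select?
print(params)
```

Because wildcard queries score with a constant, the explicit boosts make
the ordering deterministic regardless of the wildcard clause's score.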

2010/8/10 Jan Høydahl / Cominvent 

> Hi,
>
> You don't need to duplicate the content into two fields to achieve this.
> Try this:
>
> q=mount OR mount*
>
> The exact match will always get higher score than the wildcard match
> because wildcard matches uses "constant score".
>
> Making this work for multi term queries is a bit trickier, but something
> along these lines:
>
> q=(mount OR mount*) AND (everest OR everest*)
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:
>
> > you could satisfy this by making 2 fields:
> > 1. exactmatch
> > 2. wildcardmatch
> >
> > use copyfield in your schema to copy 1 --> 2 .
> >
> > q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
> > this would score exact matches above (solely) wildcard matches
> >
> > Geert-Jan
> >
> > 2010/8/10 yandong yao 


Re: how to support "implicit trailing wildcards"

2010-08-09 Thread yandong yao
Hi Bastian,

Sorry for not making it clear: I also want exact matches to score higher
than wildcard matches. That means if I search for 'mount', documents with
'mount' should score higher than documents with 'mountain', while 'mount*'
seems to treat 'mount' and 'mountain' the same.

Besides, I also want the query to be processed by the analyzer, but according to
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F,
Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer. The
rationale is that if I search for 'mounted', I also want documents with
'mount' to match.

So it seems the built-in wildcard search cannot satisfy my requirements,
if I understand it correctly.

Thanks very much!


2010/8/9 Bastian Spitzer 

> Wildcard-Search is already built in, just use:
>
> ?q=umoun*
> ?q=mounta*
>
> -Original Message-
> From: yandong yao [mailto:yydz...@gmail.com]
> Sent: Monday, 9 August 2010 15:57
> To: solr-user@lucene.apache.org
> Subject: how to support "implicit trailing wildcards"


how to support "implicit trailing wildcards"

2010-08-09 Thread yandong yao
Hi everyone,


How can I support an implicit trailing wildcard '*' in Solr? E.g., as with
Google: searching for 'umoun' should match 'umount', and searching for
'mounta' should match 'mountain'.

From my point of view, there are several ways, each with disadvantages:

1) Use EdgeNGramFilterFactory, so that 'umount' is indexed as 'u', 'um',
'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size
increases dramatically; b) terms match even when unrelated, e.g. 'mount'
will also match 'mountain'.

2) Use two-pass searching: the first pass searches the term dictionary
through TermsComponent with the given keyword, then the first matching term
from the dictionary is used to search again. E.g., when a user enters
'umoun', TermsComponent matches 'umount', which is then used to search. The
disadvantages are: a) the query string must be parsed to recognize meta
keywords such as 'AND', 'OR', '+', '-', '"' (this is more complex as I am
using a PHP client); b) the returned hit count is not for the original
search string, which affects other components such as an auto-suggest
component based on user search history and hit counts.

3) Write a custom SearchComponent, though I have no idea where/how to start.

Is there any other way to do this in Solr? Any feedback/suggestions are
welcome!

Thanks very much in advance!
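For completeness, option 1 would look roughly like this in schema.xml (a
sketch only; the field type name, gram sizes, and tokenizer choice are
illustrative, not from the original mail):

```xml
<!-- Edge n-grams on the index side only: 'umount' is indexed as
     u, um, umo, umou, umoun, umount; the query side stays untouched,
     so q=umoun matches without any wildcard. -->
<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This trades index size for query-time simplicity, which is exactly the
disadvantage a) noted above.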