Re: SolrCloud result correctness compared with single core

2015-01-29 Thread Yandong Yao
Pretty helpful, thanks Erick!

2015-01-24 9:48 GMT+08:00 Erick Erickson erickerick...@gmail.com:

 You might, but probably not enough to notice. At 50G, the tf/idf
 stats will _probably_ be close enough you won't be able to tell.

 That said, recently distributed tf/idf has been implemented but
 you need to ask for it, see SOLR-1632. This is Solr 5.0 though.

 I've rarely seen it matter except in fairly specialized situations.
 Consider a single core. Deleted documents still count towards
 some of the tf/idf stats. So your scoring could theoretically
 change after, say, an optimize.
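The shard-local idf effect under discussion can be sketched numerically. This is a toy illustration of the general idea (an assumed classic idf formula, made-up counts), not Lucene's exact scoring:

```python
import math

def idf(doc_count, doc_freq):
    # Classic Lucene-style idf: 1 + ln(N / (df + 1))
    return 1.0 + math.log(doc_count / (doc_freq + 1.0))

# Single 100GB core: say 1,000,000 docs, 'apple' appears in 10,000
single = idf(1_000_000, 10_000)

# Two 50GB shards without distributed idf: each shard only sees its
# local df, so a doc's score depends on which shard happens to hold it.
shard_a = idf(500_000, 9_000)   # 'apple' skews toward shard A
shard_b = idf(500_000, 1_000)

print(round(single, 3), round(shard_a, 3), round(shard_b, 3))
```

With large shards and evenly distributed terms the three values converge, which is why the difference is rarely noticeable at this scale.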

 So the bottom line is that yes, the scoring may change, but
 IMO not any more radically than was possible with single cores,
 and I wouldn't worry about it unless I had evidence that it was
 biting me.

 Best
 Erick

 On Fri, Jan 23, 2015 at 2:52 PM, Yandong Yao yydz...@gmail.com wrote:

  Hi Guys,
 
  As the main scoring mechanism is based on tf/idf, will the same query
  running against SolrCloud return different results than running it against
  a single core with the same data set, since idf will only count df inside
  one core?
 
  eg: Assume I have 100GB data:
  A) Index those data using single core
  B) Index those data using SolrCloud with two cores (each has 50GB data
  index)
 
  Then if I run the same query, e.g. 'apple', will I get
  different results for A and B?
 
 
  Regards,
  Yandong
 



SolrCloud result correctness compared with single core

2015-01-23 Thread Yandong Yao
Hi Guys,

As the main scoring mechanism is based on tf/idf, will the same query
running against SolrCloud return different results than running it against a
single core with the same data set, since idf will only count df inside one core?

eg: Assume I have 100GB data:
A) Index those data using single core
B) Index those data using SolrCloud with two cores (each has 50GB data
index)

Then if I run the same query, e.g. 'apple', will I get
different results for A and B?


Regards,
Yandong


Re: Index optimize takes more than 40 minutes for 18M documents

2013-02-21 Thread Yandong Yao
Thanks Walter for the info, we will disable optimize then and do more testing.

Regards,
Yandong

2013/2/22 Walter Underwood wun...@wunderwood.org

 That seems fairly fast. We index about 3 million documents in about half
 that time. We are probably limited by the time it takes to get the data
 from MySQL.

 Don't optimize. Solr automatically merges index segments as needed.
 Optimize forces a full merge. You'll probably never notice the difference,
 either in disk space or speed.

 It might make sense to force merge (optimize) if you reindex everything
 once per day and have no updates in between. But even then it may be a
 waste of time.

 You need lots of free disk space for merging, whether a forced merge or
 automatic. Free space equal to the size of the index is usually enough, but
 worst case can need double the size of the index.

 wunder

 On Feb 21, 2013, at 9:20 AM, Yandong Yao wrote:

  Hi Guys,
 
  I am using Solr 4.1 and have indexed 18M documents using solrj
  ConcurrentUpdateSolrServer (each document contains 5 fields, and average
  length is less than 1k).
 
  1) It takes 70 minutes to index those documents without optimize on my
  Mac (OS X 10.8). Is this performance slow, fast, or typical?
 
  2) It takes about 40 minutes to optimize those documents. Following is the
  top output, and there are lots of FAULTS; what does this mean?
 
  Processes: 118 total, 2 running, 8 stuck, 108 sleeping, 719 threads        00:56:52
  Load Avg: 1.48, 1.56, 1.73  CPU usage: 6.63% user, 6.40% sys, 86.95% idle
  SharedLibs: 31M resident, 0B data, 6712K linkedit.
  MemRegions: 34734 total, 5801M resident, 39M private, 638M shared.
  PhysMem: 982M wired, 3600M active, 3567M inactive, 8150M used, 38M free.
  VM: 254G vsize, 1285M framework vsize, 1469887(368) pageins, 1095550(0) pageouts.
  Networks: packets: 14842595/9661M in, 14777685/9395M out.
  Disks: 820048/43G read, 523814/53G written.

  PID   COMMAND  %CPU  TIME      #TH  #WQ  #POR  #MRE  RPRVT   RSHRD  RSIZE   VPRVT  VSIZE  PGRP  PPID  STATE    UID  FAULTS    COW  MSGSENT   MSGRECV  SYSBSD     SYSMACH
  4585  java     11.7  02:52:01  32   1483  342   3866M+  6724K  3856M+  4246M  6908M  4580  4580  sleepin  501  1490340+  402  3000781+  231785+  15044055+  10033109+
 
  3) If I don't run optimize, what is the impact? Larger disk usage or slower
  query performance?
 
  Following is my index config in solrconfig.xml:

  <autoCommit>
    <maxDocs>100000</maxDocs><!-- 100K docs -->
    <maxTime>300000</maxTime><!-- 5 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
 
  Thanks very much in advance!
 
  Regards,
  Yandong







Re: How to run many MoreLikeThis request efficiently?

2013-01-09 Thread Yandong Yao
Any comments on this? Thanks very much in advance!

2013/1/9 Yandong Yao yydz...@gmail.com

 Hi Solr Guru,

  I have two sets of documents in one SolrCore; each set has about 1M
 documents with different document type, say 'type1' and 'type2'.

  Many documents in the first set are very similar to 1 or 2 documents in the
  second set. What I want to get is: for each document in set 2, return the
  most similar document in set 1 using either 'MoreLikeThisHandler' or
  'MoreLikeThisComponent'.

  Currently I use the following code to get the result, but it sends far
  too many requests to the Solr server serially. Is there any way to improve
  this besides using multi-threading?  Thanks very much!

 for each document in set 2 whose type is 'type2'
 run MoreLikeThis request against Solr server and get the most similar
 document
 end.

 Regards,
 Yandong



Re: How to run many MoreLikeThis request efficiently?

2013-01-09 Thread Yandong Yao
Hi Otis,

Really appreciate your help on this!!  Will go with multi-threading first,
and then provide a custom component if performance is not good enough.

Regards,
Yandong

2013/1/10 Otis Gospodnetic otis.gospodne...@gmail.com

 Patience, young Yandong :)

 Multi-threading *in your application* is the way to go. Alternatively, one
 could write a custom SearchComponent that is called once and inside of
 which the whole work is done after just one call to it. This component
 could then write the output somewhere, like in a new index since making a
 blocking call to it may time out.
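The client-side multi-threading approach can be sketched with a thread pool. The `most_similar` helper below is a hypothetical stand-in for the actual MoreLikeThis HTTP call, so the example focuses on the fan-out pattern only:

```python
from concurrent.futures import ThreadPoolExecutor

def most_similar(doc_id):
    # Placeholder: in the real application this would issue an HTTP
    # request to the MoreLikeThisHandler for the given 'type2' document
    # and return the top hit from the 'type1' set.
    return ("best-match-for-" + doc_id, 0.9)

def match_all(type2_ids, workers=8):
    # Fan the serial per-document requests out across a thread pool;
    # pool.map preserves the input order of the ids.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(type2_ids, pool.map(most_similar, type2_ids)))

results = match_all(["doc-1", "doc-2", "doc-3"])
```

The per-request latency is mostly network and Solr-side work, so even a modest `workers` count recovers most of the serial loss.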

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Jan 9, 2013 6:07 PM, Yandong Yao yydz...@gmail.com wrote:

  Any comments on this? Thanks very much in advance!
 
  2013/1/9 Yandong Yao yydz...@gmail.com
 
   Hi Solr Guru,
  
    I have two sets of documents in one SolrCore; each set has about 1M
    documents with different document type, say 'type1' and 'type2'.

    Many documents in the first set are very similar to 1 or 2 documents in
    the second set. What I want to get is: for each document in set 2,
    return the most similar document in set 1 using either
    'MoreLikeThisHandler' or 'MoreLikeThisComponent'.

    Currently I use the following code to get the result, but it sends far
    too many requests to the Solr server serially. Is there any way to
    improve this besides using multi-threading?  Thanks very much!
  
   for each document in set 2 whose type is 'type2'
   run MoreLikeThis request against Solr server and get the most
 similar
   document
   end.
  
   Regards,
   Yandong
  
 



How to run many MoreLikeThis request efficiently?

2013-01-08 Thread Yandong Yao
Hi Solr Guru,

I have two sets of documents in one SolrCore; each set has about 1M
documents with different document type, say 'type1' and 'type2'.

Many documents in the first set are very similar to 1 or 2 documents in the
second set. What I want to get is: for each document in set 2, return the
most similar document in set 1 using either 'MoreLikeThisHandler' or
'MoreLikeThisComponent'.

Currently I use the following code to get the result, but it sends far
too many requests to the Solr server serially. Is there any way to improve
this besides using multi-threading?  Thanks very much!

for each document in set 2 whose type is 'type2'
run MoreLikeThis request against Solr server and get the most similar
document
end.

Regards,
Yandong


Re: mergeindex: what happens if there is deletion during index merging

2012-08-21 Thread Yandong Yao
Hi Shalin,

Thanks very much for your detailed explanation!

Regards,
Yandong

2012/8/21 Shalin Shekhar Mangar shalinman...@gmail.com

 On Tue, Aug 21, 2012 at 8:47 AM, Yandong Yao yydz...@gmail.com wrote:

  Hi guys,
 
  From http://wiki.apache.org/solr/MergingSolrIndexes, it says 'Using
  srcCore, care is taken to ensure that the merged index is not corrupted
  even if writes are happening in parallel on the source index'.

  What does this mean? If there are deletion requests during merging, will
  the deletions be processed correctly after merging finishes?
 

 Solr keeps an instance of the IndexReader for each srcCore which is a
 static snapshot of the index at the time of the merge request. This static
 snapshot is merged to the target core. Therefore any insert/delete request
 made to the srcCores after the merge request will not affect the merged
 index.
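The snapshot semantics described here can be mimicked in a toy model: the merge copies a static snapshot of each source taken at request time, so deletes issued on the sources afterwards never reach the target (an illustration of the behavior, not Solr's internals):

```python
def merge_indexes(target, *src_cores):
    # Take a static snapshot of each source at merge-request time,
    # mirroring how Solr reserves an IndexReader per srcCore.
    snapshots = [set(core) for core in src_cores]
    for snap in snapshots:
        target |= snap
    return target

core0, core1 = set(), {"d1", "d2"}
merged = merge_indexes(core0, core1)
core1.discard("d2")        # delete on core1 after the merge snapshot
assert "d2" in merged      # the merged core0 still has the document
```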


 
  1)
  eg:  I have an existing core 'core0', and I want to merge core 'core1'
 and
  'core2' to core 'core0', so I will use
 
 
  http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
  ,
 
  If, while the merge happens, core0, core1, and core2 receive deletion
  requests for some old documents, will the final core 'core0' contain
  all content from 'core1' and 'core2', with all documents matching the
  deletion criteria deleted?
 

 The final core0 will not have documents deleted by requests made on core0.
 However, documents deleted on core1 and core2 will still be in core0 if the
 merge started before those requests were made.


 
  2)
  And if core0, core1, and core2 are processing deletion requests when a
  core merge request comes in, what will happen? Will the merge request
  block until the deletions finish on all cores?
 

 I believe core0 will continue to process deletion requests concurrently
 with the merge. As for core1 and core2, since a merge reserves their
 IndexReader, the answer depends on when a commit happens on core1 and
 core2. If, for example, 2 deletions were made on core1 and then a commit
 was issued (or autoCommit happened) and then the merge was triggered then
 the final core0 will not have those documents but it may still have docs
 deleted after the commit.


 
  Thanks very much in advance!
 
  Regards,
  Yandong
 



 --
 Regards,
 Shalin Shekhar Mangar.



mergeindex: what happens if there is deletion during index merging

2012-08-20 Thread Yandong Yao
Hi guys,

From http://wiki.apache.org/solr/MergingSolrIndexes, it says 'Using
srcCore, care is taken to ensure that the merged index is not corrupted
even if writes are happening in parallel on the source index'.

What does this mean? If there are deletion requests during merging, will
the deletions be processed correctly after merging finishes?

1)
eg:  I have an existing core 'core0', and I want to merge core 'core1' and
'core2' to core 'core0', so I will use
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
,

If, while the merge happens, core0, core1, and core2 receive deletion
requests for some old documents, will the final core 'core0' contain
all content from 'core1' and 'core2', with all documents matching the
deletion criteria deleted?

2)
And if core0, core1, and core2 are processing deletion requests when a core
merge request comes in, what will happen? Will the merge request block
until the deletions finish on all cores?

Thanks very much in advance!

Regards,
Yandong


Count is inconsistent between facet and stats

2012-07-18 Thread Yandong Yao
Hi Guys,

Steps to reproduce:

1) Download apache-solr-4.0.0-ALPHA
2) cd example;  java -jar start.jar
3) cd exampledocs;  ./post.sh *.xml
4) Use statsComponent to get the stats info for field 'popularity' based on
facet 'cat'.  And the 'count' for 'electronics' is 3
http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&stats=true&stats.field=popularity&stats.facet=cat

{
  "stats_fields": {
    "popularity": {
      "min": 0,
      "max": 10,
      "count": 14,
      "missing": 0,
      "sum": 75,
      "sumOfSquares": 503,
      "mean": 5.357142857142857,
      "stddev": 2.7902892835178013,
      "facets": {
        "cat": {
          "music":         { "min": 10, "max": 10, "count": 1, "missing": 0, "sum": 10, "sumOfSquares": 100, "mean": 10, "stddev": 0 },
          "monitor":       { "min": 6,  "max": 6,  "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72,  "mean": 6,  "stddev": 0 },
          "hard drive":    { "min": 6,  "max": 6,  "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72,  "mean": 6,  "stddev": 0 },
          "scanner":       { "min": 6,  "max": 6,  "count": 1, "missing": 0, "sum": 6,  "sumOfSquares": 36,  "mean": 6,  "stddev": 0 },
          "memory":        { "min": 0,  "max": 7,  "count": 3, "missing": 0, "sum": 12, "sumOfSquares": 74,  "mean": 4,  "stddev": 3.605551275463989 },
          "graphics card": { "min": 7,  "max": 7,  "count": 2, "missing": 0, "sum": 14, "sumOfSquares": 98,  "mean": 7,  "stddev": 0 },
          "electronics":   { "min": 1,  "max": 7,  "count": 3, "missing": 0, "sum": 9,  "sumOfSquares": 51,  "mean": 3,  "stddev": 3.4641016151377544 }
        }
      }
    }
  }
}
5)  Facet on 'cat' and the count is 14.
http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&facet=true&facet.field=cat

{
  "cat": [
    "electronics", 14,
    "memory", 3,
    "connector", 2,
    "graphics card", 2,
    "hard drive", 2,
    "monitor", 2,
    "camera", 1,
    "copier", 1,
    "multifunction printer", 1,
    "music", 1,
    "printer", 1,
    "scanner", 1,
    "currency", 0,
    "search", 0,
    "software", 0
  ]
}



So from StatsComponent the count for the 'electronics' cat is 3, while
FacetComponent reports 14 for 'electronics'. Is this a bug?

Following is the field definition for 'cat':

<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
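For what it's worth, the stats bucket counts above sum to 14 (the number of matching docs), which suggests stats.facet is bucketing each multiValued document under a single 'cat' value, whereas faceting counts a document once per value it carries. That cause is an assumption, not a confirmed diagnosis, and the `docs` sample below is made up; the sketch only contrasts the two counting styles:

```python
docs = [
    {"cat": ["electronics", "memory"], "popularity": 7},
    {"cat": ["electronics", "monitor"], "popularity": 6},
    {"cat": ["electronics"], "popularity": 1},
]

def facet_counts(docs, field):
    # FacetComponent-style counting: every doc increments the bucket of
    # every value it carries, so one doc can land in several buckets.
    counts = {}
    for d in docs:
        for v in d[field]:
            counts[v] = counts.get(v, 0) + 1
    return counts

def single_value_bucket_counts(docs, field):
    # Models a component that buckets each doc under only ONE of its
    # values: bucket counts then sum to the number of matching docs,
    # so individual buckets come out smaller than the facet counts.
    counts = {}
    for d in docs:
        v = d[field][-1]  # arbitrarily, the doc's last value
        counts[v] = counts.get(v, 0) + 1
    return counts

print(facet_counts(docs, "cat"))                # 'electronics' counted per doc
print(single_value_bucket_counts(docs, "cat"))  # buckets sum to len(docs)
```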

Thanks,
Yandong


Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-23 Thread Yandong Yao
Hi Mark, Darren

Thanks very much for your help. Will try a collection for each customer then.

Regards,
Yandong


2012/5/22 Mark Miller markrmil...@gmail.com

 I think the key is this: you want to think of a SolrCore on a single node
 Solr installation as a collection on a multi node SolrCloud installation.

 So if you would use multiple SolrCore's with a std Solr setup, you should
 be using multiple collections in SolrCloud. If you were going to try to do
 everything in one SolrCore, that would be like putting everything in one
 collection in SolrCloud. I don't think it generally makes sense to try and
 work at the SolrCore level when working with SolrCloud. This will be made
 more clear once we add a simple collections api.

 So I think your choice should be similar to using a single node - do you
 want to put everything in one 'collection' and use a filter to separate
 customers (with all its caveats and limitations) or do you want to use a
 collection per customer. You can always start up more clusters if you reach
 any limits.



 On May 22, 2012, at 10:08 AM, Darren Govoni wrote:

  I'm curious what the solrcloud experts say, but my suggestion is to try
 not to over-engineer the search architecture on solrcloud. For example,
 what is the benefit of managing which cores are indexed and searched?
 Having to know those details, in my mind, works against the automation in
 solrcloud, but maybe there's a good reason you want to do it this way.
 
  --- Original Message ---
  On 5/22/2012 07:35 AM Yandong Yao wrote:

  Hi Darren,

  Thanks very much for your reply.

  The reason I want to control core indexing/searching is that I want to
  use one core to store one customer's data (all customers share the same
  config): such as customer 1 uses coreForCustomer1 and customer 2
  uses coreForCustomer2.

  Is there any better way than using a different core for each customer?

  Another way may be to use a different collection for each customer, while
  I am not sure how many collections solr cloud could support. Which way is
  better in terms of flexibility/scalability? (suppose there are tens of
  thousands of customers).

  Regards,
  Yandong

  2012/5/22 Darren Govoni dar...@ontrenet.com

   Why do you want to control what gets indexed into a core and then
   knowing what core to search? That's the kind of knowing that SolrCloud
   solves. In SolrCloud, it handles the distribution of documents across
   shards and retrieves them regardless of which node is searched from.
   That is the point of cloud, you don't know the details of where
   exactly documents are being managed (i.e. they are cloudy). It can
   change and re-balance from time to time. SolrCloud performs the
   distributed search for you, therefore when you try to search a node/core
   with no documents, all the results from the cloud are retrieved
   regardless. This is considered A Good Thing.

   It requires a change in thinking about indexing and searching

   On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
    Hi Guys,

    I use the following command to start solr cloud according to the solr
    cloud wiki.

    yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
    -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
    yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983
    -jar start.jar

    Then I have created several cores using CoreAdmin API (
    http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1),
    and clusterstate.json shows the following topology:

    collection1:
    -- shard1:
      -- collection1
      -- CoreForCustomer1
      -- CoreForCustomer3
      -- CoreForCustomer5
    -- shard2:
      -- collection1
      -- CoreForCustomer2
      -- CoreForCustomer4

    1) Index:

    Using the following command to index the mem.xml file in the
    exampledocs directory:

    yydzero:exampledocs bjcoe$ java -Durl=
    http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
    SimplePostTool: version 1.4
    SimplePostTool: POSTing files to
    http://localhost:8983/solr/coreForCustomer3/update..
    SimplePostTool: POSTing file mem.xml
    SimplePostTool: COMMITting Solr index changes.

    And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
    'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the
    other 2 cores have 0 documents.

    *Question 1:* Is this expected behavior? How do I index documents into
    a specific core?

    *Question 2:* If SolrCloud doesn't support this yet, how could I extend
    it to support this feature (index document to particular core), where
    should i

Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-22 Thread Yandong Yao
Hi Darren,

Thanks very much for your reply.

The reason I want to control core indexing/searching is that I want to
use one core to store one customer's data (all customers share the same
config): such as customer 1 uses coreForCustomer1 and customer 2
uses coreForCustomer2.

Is there any better way than using a different core for each customer?

Another way may be to use a different collection for each customer, while
I am not sure how many collections solr cloud could support. Which way is
better in terms of flexibility/scalability? (suppose there are tens of
thousands of customers).

Regards,
Yandong

2012/5/22 Darren Govoni dar...@ontrenet.com

 Why do you want to control what gets indexed into a core and then
 knowing what core to search? That's the kind of knowing that SolrCloud
 solves. In SolrCloud, it handles the distribution of documents across
 shards and retrieves them regardless of which node is searched from.
 That is the point of cloud, you don't know the details of where
 exactly documents are being managed (i.e. they are cloudy). It can
 change and re-balance from time to time. SolrCloud performs the
 distributed search for you, therefore when you try to search a node/core
 with no documents, all the results from the cloud are retrieved
 regardless. This is considered A Good Thing.

 It requires a change in thinking about indexing and searching

 On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
  Hi Guys,
 
  I use following command to start solr cloud according to solr cloud wiki.
 
  yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
  -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
  yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983
 -jar
  start.jar
 
  Then I have created several cores using CoreAdmin API (
  http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1),
  and clusterstate.json shows the following topology:
 
 
  collection1:
  -- shard1:
-- collection1
-- CoreForCustomer1
-- CoreForCustomer3
-- CoreForCustomer5
  -- shard2:
-- collection1
-- CoreForCustomer2
-- CoreForCustomer4
 
 
  1) Index:
 
  Using following command to index mem.xml file in exampledocs directory.
 
  yydzero:exampledocs bjcoe$ java -Durl=
  http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
  SimplePostTool: version 1.4
  SimplePostTool: POSTing files to
  http://localhost:8983/solr/coreForCustomer3/update..
  SimplePostTool: POSTing file mem.xml
  SimplePostTool: COMMITting Solr index changes.
 
  And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
  'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the
  other 2 cores have 0 documents.

  *Question 1:* Is this expected behavior? How do I index documents into
  a specific core?

  *Question 2:* If SolrCloud doesn't support this yet, how could I extend it
  to support this feature (index document to particular core)? Where should
  I start? The hashing algorithm?

  *Question 3:* Why are the documents also indexed into 'coreForCustomer1'
  and 'coreForCustomer5'? The default replication factor is 1, right?
 
  Then I try to index some document to 'coreForCustomer2':
 
  $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
  post.jar ipod_video.xml
 
  While 'coreForCustomer2' still has 0 documents, the documents in
  ipod_video.xml are indexed to the cores for customers 1/3/5.

  *Question 4:* Why does this happen?

  2) Search: I use
  http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to
  search against 'CoreForCustomer2', while it returns all documents in
  the whole collection even though this core has no documents at all.

  Then I use
  http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2,
  and it returns 0 documents.

  *Question 5:* So if I want to search against a particular core, I need to
  use the 'shards' parameter with the SolrCore name as the value, right?
 
 
  Thanks very much in advance!
 
  Regards,
  Yandong





SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-21 Thread Yandong Yao
Hi Guys,

I use the following command to start solr cloud according to the solr cloud wiki.

yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar
start.jar

Then I have created several cores using CoreAdmin API (
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1),
and clusterstate.json shows the following topology:


collection1:
-- shard1:
  -- collection1
  -- CoreForCustomer1
  -- CoreForCustomer3
  -- CoreForCustomer5
-- shard2:
  -- collection1
  -- CoreForCustomer2
  -- CoreForCustomer4


1) Index:

Using the following command to index the mem.xml file in the exampledocs directory:

yydzero:exampledocs bjcoe$ java -Durl=
http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to
http://localhost:8983/solr/coreForCustomer3/update..
SimplePostTool: POSTing file mem.xml
SimplePostTool: COMMITting Solr index changes.

And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the other 2
cores have 0 documents.

*Question 1:* Is this expected behavior? How do I index documents into
a specific core?

*Question 2:* If SolrCloud doesn't support this yet, how could I extend it
to support this feature (index document to particular core)? Where should I
start? The hashing algorithm?
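For context on question 2: document placement in SolrCloud is hash-based. A simplified model of uniqueKey-to-shard routing (not Solr's actual hash function, just the mechanism) looks like:

```python
import zlib

def shard_for(doc_id, num_shards):
    # Toy stand-in for SolrCloud's hash routing: a stable hash of the
    # uniqueKey modulo the shard count. Solr's real implementation
    # differs, but the consequence is the same: the core URL you post
    # to does not decide placement, the hash of the document id does.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# The same id always routes to the same shard, regardless of which
# core received the update request.
print(shard_for("mem-doc-1", 2), shard_for("mem-doc-2", 2))
```

This is why posting to coreForCustomer2's URL still lands documents on whichever shard their ids hash to.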

*Question 3:* Why are the documents also indexed into 'coreForCustomer1'
and 'coreForCustomer5'? The default replication factor is 1, right?

Then I try to index some document to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
post.jar ipod_video.xml

While 'coreForCustomer2' still has 0 documents, the documents in ipod_video.xml
are indexed to the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search: I use
http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to
search against 'CoreForCustomer2', while it returns all documents in
the whole collection even though this core has no documents at all.

Then I use
http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2,
and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to
use the 'shards' parameter with the SolrCore name as the value, right?


Thanks very much in advance!

Regards,
Yandong


Re: Faster Solr Indexing

2012-03-11 Thread Yandong Yao
I have similar issues using DIH,
and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
consumes most of the time when indexing 10K rows (each row is about 70K):
-  DIH nextRow takes about 10 seconds in total
-  If the index uses whitespace tokenizer and lowercase filter, then the
addDoc() method takes about 80 seconds
-  If the index uses whitespace tokenizer, lowercase filter, and WDF, then
addDoc takes about 112 seconds
-  If the index uses whitespace tokenizer, lowercase filter, WDF, and Porter
stemmer, then addDoc takes about 145 seconds

We have more than a million rows in total, and I am wondering whether I am
doing something wrong or whether there is any way to improve the performance
of addDoc().
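Spelling out the incremental cost each filter adds in the timings above, which points at the analysis chain rather than DIH itself as the main lever:

```python
# Total addDoc time for the same 10K rows under the three analyzer
# chains reported above (seconds):
timings = {
    "whitespace + lowercase": 80,
    "whitespace + lowercase + WDF": 112,
    "whitespace + lowercase + WDF + Porter": 145,
}

# Incremental cost of each extra filter for the 10K-row sample:
wdf_cost = timings["whitespace + lowercase + WDF"] - timings["whitespace + lowercase"]
stemmer_cost = (timings["whitespace + lowercase + WDF + Porter"]
                - timings["whitespace + lowercase + WDF"])

print(wdf_cost, stemmer_cost)  # 32 and 33 seconds per 10K rows
```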

Thanks very much in advance!


Following is the configure:
1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (almost copied from solr's example/solr directory):

  <indexDefaults>

    <useCompoundFile>false</useCompoundFile>

    <mergeFactor>10</mergeFactor>
    <!-- Sets the amount of RAM that may be used by Lucene indexing
         for buffering added documents and deletions before they are
         flushed to the Directory.  -->
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
         Lucene will flush based on whichever limit is hit first.
      -->
    <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->

    <maxFieldLength>2147483647</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>

    <lockType>native</lockType>
  </indexDefaults>

2012/3/11 Peyman Faratin pey...@robustlinks.com

 Hi

 I am trying to index 12MM docs faster than is currently happening in Solr
 (using solrj). We have identified solr's add method as the bottleneck (and
 not commit - which is tuned ok through mergeFactor and maxRamBufferSize and
 jvm ram).

 Adding 1000 docs is taking approximately 25 seconds. We are making sure we
 add and commit in batches. And we've tried both CommonsHttpSolrServer and
 EmbeddedSolrServer (assuming removing http overhead would speed things up
 with embedding) but the differences is marginal.

 The docs being indexed are on average 20 fields long, mostly indexed but
 none stored. The major size contributors are two fields:

- content, and
- shingledContent (populated using copyField of content).

 The length of the content field is (likely) gaussian distributed (few
 large docs 50-80K tokens, but majority around 2k tokens). We use
 shingledContent to support phrase queries and content for unigram queries
 (following the advice of Solr Enterprise search server advice - p. 305,
 section The Solution: Shingling).

 Clearly the size of the docs is a contributor to the slow adds (confirmed
 by removing these 2 fields resulting in halving the indexing time). We've
 tried compressed=true also but that is not working.

 Any guidance on how to support our application logic (without having to
 change the schema too much) and speed the indexing speed (from current 212
 days for 12MM docs) would be much appreciated.

 thank you

 Peyman




How to use nested query in fq?

2012-02-07 Thread Yandong Yao
Hi Guys,

I am using Solr 3.5, and would like to use an fq like
'getField(getDoc(uuid:workspace_${workspaceId}), isPublic):true'.

- workspace_${workspaceId}: workspaceId is an indexed field.
- getDoc(uuid:concat(workspace_, workspaceId)): return the document whose
uuid is workspace_${workspaceId}
- getField(getDoc(uuid:workspace_${workspaceId}), isPublic): return
the matched document's isPublic field

The use case is that I have workspace objects and workspace contains many
sub-objects, such as work files, comments, datasets and so on. And
workspace has a 'isPublic' field. If this field is true, then all
registered user could access this workspace and all its sub-objects.
Otherwise, only workspace member could access this workspace and its
sub-objects.

So I want to use fq to determine whether document in question belongs to
public workspace or not.  Is it possible?

If not, how to implement similar feature like this? implement a
ValueSourcePlugin? any guidance or example on this?

Or is there any better solutions?


It is possible to add an 'isPublic' field to all sub-objects, but that makes
index updates more complex, so I am trying to find a better solution.

Thanks very much in advance!

Regards,
Yandong


Re: Need help for solr searching case insensative item

2010-10-26 Thread yandong yao
Sounds like a WordDelimiterFilter config issue; please refer to
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
.

Also it will help if you could provide:
1) Tokenizers/Filters config in schema file
2) analysis.jsp output in admin page.

2010/10/26 wu liu wul...@mail.usask.ca

 Hi all,

 I just noticed a weird thing happening in my solr search results:
 if I do a search for ecommons, it does not get the results for eCommons;
 instead, if I do a search for eCommons, I only get the matches for
 eCommons, but not ecommons.

 I cannot figure it out why?

 please help me

 Thanks very much in advance



A question on WordDelimiterFilterFactory

2010-09-14 Thread yandong yao
Hi Guys,

I encountered a problem when enabling WordDelimiterFilterFactory for both
index and query (the relevant part of schema.xml is pasted at the bottom of this email).

*1. Steps to reproduce:*
1.1 The indexed sample document contains only one sentence: This is a
TechNote.
1.2 Query: q=TechNote
1.3 Result: no matches returned, even though the sentence clearly contains
the word 'TechNote'.

*
2. Output when enabling debugQuery*
By turning on debugQuery
(http://localhost:7111/solr/test/select?indent=on&version=2.2&q=TechNote&fq=&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=id%3A001&hl.fl=),
I get the following information:

<str name="rawquerystring">TechNote</str>
<str name="querystring">TechNote</str>
<str name="parsedquery">PhraseQuery(all:"tech note")</str>
<str name="parsedquery_toString">all:"tech note"</str>
<lst name="explain"/>
<str name="otherQuery">id:001</str>
<lst name="explainOther">
  <str name="001">
0.0 = fieldWeight(all:"tech note" in 0), product of:
  0.0 = tf(phraseFreq=0.0)
  0.61370564 = idf(all: tech=1 note=1)
  0.25 = fieldNorm(field=all, doc=0)
  </str>
</lst>

It seems the raw query string is converted to the phrase query "tech note",
whose phrase frequency is 0, hence no matches.

*3. Result from admin/analysis.jsp page*

From analysis.jsp, it seems the query 'TechNote' does match the input
document; the matching index tokens ('tech' and 'note') were highlighted in
red on the original page.

Index Analyzer (all term types are "word"; offsets are start,end):

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  pos 1: This (0,4)   pos 2: is (5,7)   pos 3: a (8,9)   pos 4: TechNote. (10,19)

org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
  pos 1: This (0,4)   pos 2: is (5,7)   pos 3: a (8,9)   pos 4: TechNote. (10,19)

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
  pos 1: This (0,4)   pos 2: is (5,7)   pos 3: a (8,9)   pos 4: Tech (10,14)   pos 5: Note (14,18) and TechNote (10,18)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  pos 1: this (0,4)   pos 2: is (5,7)   pos 3: a (8,9)   pos 4: tech (10,14)   pos 5: note (14,18) and technote (10,18)

org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
  pos 1: this (0,4)   pos 2: is (5,7)   pos 3: a (8,9)   pos 4: tech (10,14)   pos 5: note (14,18) and technot (10,18)

Query Analyzer:

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  pos 1: TechNote (0,8)

org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
  pos 1: TechNote (0,8)

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
  pos 1: Tech (0,4)   pos 2: Note (4,8)

org.apache.solr.analysis.LowerCaseFilterFactory {}
  pos 1: tech (0,4)   pos 2: note (4,8)

org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
  pos 1: tech (0,4)   pos 2: note (4,8)


*
4. My questions are:*
 4.1: Why do debugQuery and analysis.jsp give different results?
 4.2: From my understanding, during indexing the word 'TechNote' will be
converted to both 1) 'technote' and 2) 'tech note' according to my config in
schema.xml, and at query time 'TechNote' will be converted to 'tech note',
so it SHOULD match.  Am I right?
 4.3: Why is the phrase frequency of 'tech note' 0 in the debugQuery
output (0.0 = tf(phraseFreq=0.0))?

Any suggestion/comments are absolutely welcome!


*5. fieldType definition in schema.xml*

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

Re: A question on WordDelimiterFilterFactory

2010-09-14 Thread yandong yao
Hi Robert,

I am using Solr 1.4; I will try 1.4.1 tomorrow.

Thanks very much!

Regards,
Yandong Yao

2010/9/14 Robert Muir rcm...@gmail.com

 did you index with solr 1.4 (or are you using solr 1.4) ?

 at a quick glance, it looks like it might be this:
 https://issues.apache.org/jira/browse/SOLR-1852 , which was fixed in 1.4.1

 On Tue, Sep 14, 2010 at 5:40 AM, yandong yao yydz...@gmail.com wrote:


Re: A question on WordDelimiterFilterFactory

2010-09-14 Thread yandong yao
After upgrading to 1.4.1, it is fixed.

Thanks very much for your help!

Regards,
Yandong Yao


Re: how to support implicit trailing wildcards

2010-08-11 Thread yandong yao
Hi Jan,

It seems q=mount OR mount* sorts documents containing 'mount' differently
than plain q=mount does.
I changed the query to q=mount^100 OR (mount?* -mount)^1.0, and it tests well.
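Written out as request parameters, the reworked query looks like this (a sketch; field and core names omitted, and the value must be URL-encoded when sent over HTTP):

```text
# exact matches get a large boost; the wildcard clause excludes exact
# 'mount' so its constant score cannot disturb the exact-match ordering
q=mount^100 OR (mount?* -mount)^1.0
```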

Thanks very much!

2010/8/10 Jan Høydahl / Cominvent jan@cominvent.com

 Hi,

 You don't need to duplicate the content into two fields to achieve this.
 Try this:

 q=mount OR mount*

 The exact match will always get higher score than the wildcard match
 because wildcard matches uses constant score.

 Making this work for multi term queries is a bit trickier, but something
 along these lines:

 q=(mount OR mount*) AND (everest OR everest*)

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Training in Europe - www.solrtraining.com

 On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:

  you could satisfy this by making 2 fields:
  1. exactmatch
  2. wildcardmatch
 
  use copyfield in your schema to copy 1 -- 2 .
 
  q=exactmatch:mount+wildcardmatch:mount*q.op=OR
  this would score exact matches above (solely) wildcard matches
 
  Geert-Jan
 




how to support implicit trailing wildcards

2010-08-09 Thread yandong yao
Hi everyone,


How can I support an implicit trailing wildcard '*' in Solr? For example, in
Google, searching 'umoun' matches 'umount', and searching 'mounta' matches
'mountain'.

From my point of view, there are several ways, each with disadvantages:

1) Use EdgeNGramFilterFactory, so 'umount' is indexed as 'u', 'um', 'umo',
'umou', 'umoun', 'umount'. The disadvantages are: a) the index size
increases dramatically; b) unrelated terms match, e.g. 'mount' will also
match 'mountain'.

2) Use two-pass searching: the first pass searches the term dictionary via
TermsComponent using the given keyword, and the first matching term from the
dictionary is then used to search again. E.g., when the user enters 'umoun',
TermsComponent matches 'umount', which is then used for the search. The
disadvantages are: a) the query string must be parsed so that meta keywords
such as 'AND', 'OR', '+', '-' can be recognized (this is more complex since
I am using a PHP client); b) the returned hit count is not for the original
search string, which affects other components such as an auto-suggest
component based on user search history and hit counts.

3) Write a custom SearchComponent, though I have no idea where or how to start.
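For completeness, option 1 above would be configured roughly like this (a sketch only; the field type name and gram sizes are illustrative):

```xml
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every leading prefix of each token -->
    <filter class="solr.EdgeNGramFilterFactory" side="front"
            minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- query side is NOT n-grammed, so 'umoun' matches an indexed prefix -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This makes the index-size and false-match trade-offs of option 1 concrete: 'mountain' produces the gram 'mount', which is why a query for 'mount' also matches it.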

Is there any other way in Solr to do this? Any feedback/suggestions are
welcome!

Thanks very much in advance!


Re: how to support implicit trailing wildcards

2010-08-09 Thread yandong yao
Hi Bastian,

Sorry for not making it clear: I also want exact matches to score higher than
wildcard matches. That means when searching 'mount', documents containing
'mount' should score higher than documents containing 'mountain', while
'mount*' seems to treat 'mount' and 'mountain' the same.

Besides, I also want the query to be processed by the analyzer, but according to
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F,
wildcard, prefix, and fuzzy queries are not passed through the analyzer. The
reason this matters is that when searching 'mounted', I also want documents
containing 'mount' to match.

So it seems built-in wildcard search cannot satisfy my requirements, if I
understand correctly.

Thanks very much!


2010/8/9 Bastian Spitzer bspit...@magix.net

 Wildcard-Search is already built in, just use:

 ?q=umoun*
 ?q=mounta*
