Re: CollapseFilter with the latest Solr in trunk

2009-04-19 Thread climbingrose
Ok, here is how I fixed this problem:

  public DocListAndSet getDocListAndSet(Query query, List<Query> filterList,
      DocSet docSet, Sort lsort, int offset, int len, int flags) throws IOException {

    //DocListAndSet ret = new DocListAndSet();
    //getDocListC(ret, query, filterList, docSet, lsort, offset, len, flags |= GET_DOCSET);

    // Turn the filter list into a DocSet and intersect it with the supplied DocSet,
    // so we never have to set both filter and filterList on the same QueryCommand.
    DocSet theFilt = getDocSet(filterList);
    if (docSet != null) theFilt = (theFilt != null) ? theFilt.intersection(docSet) : docSet;

    QueryCommand qc = new QueryCommand();
    qc.setQuery(query).setFilter(theFilt);
    qc.setSort(lsort).setOffset(offset).setLen(len).setFlags(flags | GET_DOCSET);

    QueryResult result = new QueryResult();
    getDocListC(result, qc);

    return result.getDocListAndSet();
  }


There is also an off-by-one error in CollapseFilter, for which you can find a solution on
Jira.

Cheers,
Cuong

On Sat, Apr 18, 2009 at 4:41 AM, Jeff Newburn jnewb...@zappos.com wrote:

 We are currently trying to do the same thing.  With the patch unaltered we
 can use fq as long as collapsing is turned on.  If we just send a normal
 document level query with an fq parameter it blows up.

 Additionally, it does not appear that the collapse.facet option works at
 all.

 --
 Jeff Newburn
 Software Engineer, Zappos.com
 jnewb...@zappos.com - 702-943-7562


  From: climbingrose climbingr...@gmail.com
  Reply-To: solr-user@lucene.apache.org
  Date: Fri, 17 Apr 2009 16:53:00 +1000
  To: solr-user solr-user@lucene.apache.org
  Subject: CollapseFilter with the latest Solr in trunk
 
  Hi all,
 
  Has anyone tried to use CollapseFilter with the latest version of Solr in
  trunk? It looks like Solr 1.4 doesn't allow calling setFilterList()
  and setFilter() on the same instance of QueryCommand. I modified the code in
  QueryCommand to allow this:
 
   public QueryCommand setFilterList(Query f) {
   //  if( filter != null ) {
   //    throw new IllegalArgumentException( "Either filter or filterList may be set in the QueryCommand, but not both." );
   //  }
     filterList = null;
     if (f != null) {
       filterList = new ArrayList<Query>(2);
       filterList.add(f);
     }
     return this;
   }
 
  However, I still have a problem which prevents query filters from working
  when used in conjunction with CollapseFilter. In other words, query filters
  don't seem to have any effect on the result set when CollapseFilter is
  used.
 
  The other problem is related to OpenBitSet:
 
  java.lang.ArrayIndexOutOfBoundsException: 2183
  at org.apache.lucene.util.OpenBitSet.fastSet(OpenBitSet.java:242)
  at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:202)
  at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:161)
  at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:141)
  at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:217)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
  at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
  at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
  at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
  at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
  at java.lang.Thread.run(Thread.java:619)
 
  I think CollapseFilter is a rather important function in Solr that gets
  used quite frequently. Does anyone have a solution for this?
 
  --
  Regards,
 
  Cuong Hoang




-- 
Regards,

Cuong Hoang


Best way to return ExternalFileField in the results

2008-07-15 Thread climbingrose
Hi all,
I've been trying to return a field of type ExternalFileField in the search
results. Upon examining the XMLWriter class, it seems like Solr can't do this out
of the box. Therefore, I've tried to hack Solr to enable this behaviour.
The goal is to call ExternalFileField.getValueSource(SchemaField
field, QParser parser) in the XMLWriter.writeDoc(String name, Document
document,...) method. There are two issues with doing this:

1) I need to create an instance of QParser in the writeDoc method. What is the
best way to do this? And what is the overhead of creating a new QParser for
every document returned?

2) I have to modify writeDoc method to include the internal Lucene document
Id because I need it to retrieve the ExternalFileField:

fileField.getValueSource(schemaField,
qparser).getValues(request.getSearcher().getIndexReader()).floatVal(docId)

The immediate effect is that it breaks the writeVal() method (because that
method calls writeDoc()). A rough sketch of what I have in mind is below.
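
(Only a sketch: the field name "popularity", the helper method itself and the exception
handling are made up for illustration; the getValueSource()/getValues()/floatVal() chain is
the one mentioned above.)

private float externalFieldValue(SolrQueryRequest request, int docId) throws Exception {
  SchemaField schemaField = request.getSchema().getField("popularity");
  ExternalFileField fileField = (ExternalFileField) schemaField.getType();
  // Build a QParser from the current request; this is the part whose per-document cost worries me
  QParser qparser = QParser.getParser(request.getParams().get("q"), null, request);
  return fileField.getValueSource(schemaField, qparser)
      .getValues(request.getSearcher().getIndexReader())
      .floatVal(docId);   // docId is the internal Lucene id passed into writeDoc()
}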

Any comments?

Thanks in advance.


-- 
Regards,

Cuong Hoang


Re: Document rating/popularity and scoring

2008-07-14 Thread climbingrose
Hi Yonik,

I have had a look at ExternalFileField. However, I couldn't figure out how
to include the externally referenced field in the search results. Also,
sorting on this type of field isn't possible, right?

Thanks.

On Sat, Jul 12, 2008 at 2:28 AM, climbingrose [EMAIL PROTECTED]
wrote:

 Thanks Yonik. I will try it out. Btw, what cache should we use for
 multivalued, untokenised fields with a large number of terms? Faceted search
 on these fields seems to be noticeably slower even though I have allocated a large enough
 filterCache. There seems to be a lot of cache lookups for each query.

 On Sat, Jul 12, 2008 at 1:58 AM, Yonik Seeley [EMAIL PROTECTED] wrote:

 See ExternalFileField and BoostedQuery

 -Yonik

 On Fri, Jul 11, 2008 at 11:47 AM, climbingrose [EMAIL PROTECTED]
 wrote:
  Hi all,
  Has anyone tried to factor rating/popularity into Solr scoring? For
 example,
  I want documents with more page views to be ranked higher in the search
  results. From what I can see, the most difficult thing is that we have
 to
  update the number of page views for each document. With Solr-139,
 document
  can be updated at field level. However, it still have to retrieve the
  document and then do a reindex. With high traffic sites, the overhead
 might
  be too high.
 
  I'm thinking of using relational database to track page views / ratings
 and
  then do a daily sync with Solr. Is there a way for Solr to retrieve data
  from external sources (database server) and use the data for determining
  document ranking?
 
  Thanks.
 
  --
  Regards,
 
  Cuong Hoang
 




 --
 Regards,

 Cuong Hoang



Document rating/popularity and scoring

2008-07-11 Thread climbingrose
Hi all,
Has anyone tried to factor rating/popularity into Solr scoring? For example,
I want documents with more page views to be ranked higher in the search
results. From what I can see, the most difficult thing is that we have to
update the number of page views for each document. With Solr-139, a document
can be updated at the field level. However, it still has to retrieve the
document and then reindex it. With high-traffic sites, the overhead might
be too high.

I'm thinking of using relational database to track page views / ratings and
then do a daily sync with Solr. Is there a way for Solr to retrieve data
from external sources (database server) and use the data for determining
document ranking?

Thanks.

-- 
Regards,

Cuong Hoang


Re: Document rating/popularity and scoring

2008-07-11 Thread climbingrose
Thanks Yonik. I will try it out. Btw, what cache should we use for
multivalued, untokenised fields with a large number of terms? Faceted search
on these fields seems to be noticeably slower even though I have allocated a large enough
filterCache. There seems to be a lot of cache lookups for each query.
On Sat, Jul 12, 2008 at 1:58 AM, Yonik Seeley [EMAIL PROTECTED] wrote:

 See ExternalFileField and BoostedQuery

 -Yonik

 On Fri, Jul 11, 2008 at 11:47 AM, climbingrose [EMAIL PROTECTED]
 wrote:
  Hi all,
  Has anyone tried to factor rating/popularity into Solr scoring? For
 example,
  I want documents with more page views to be ranked higher in the search
  results. From what I can see, the most difficult thing is that we have to
  update the number of page views for each document. With Solr-139,
 document
  can be updated at field level. However, it still have to retrieve the
  document and then do a reindex. With high traffic sites, the overhead
 might
  be too high.
 
  I'm thinking of using relational database to track page views / ratings
 and
  then do a daily sync with Solr. Is there a way for Solr to retrieve data
  from external sources (database server) and use the data for determining
  document ranking?
 
  Thanks.
 
  --
  Regards,
 
  Cuong Hoang
 




-- 
Regards,

Cuong Hoang


Re: Do I need Searcher on indexing machine

2008-07-10 Thread climbingrose
You do, I think. Have a look at DirectUpdateHandler2 class.

On Thu, Jul 10, 2008 at 9:16 PM, Gudata [EMAIL PROTECTED] wrote:


 Hi,
 I want (if possible) to dedicate one machine only for indexing and to be
 optimized only for that.

 In solrconfig.xml, I have:
 - commented all cache statements
 - set to use cold searchers.
 - set 1



 In the log files I see this all the time:

 INFO: Registered new searcher [EMAIL PROTECTED] main
 Jul 10, 2008 12:49:59 PM org.apache.solr.search.SolrIndexSearcher close
 INFO: Closing [EMAIL PROTECTED] main

 Why Solr is registering new searcher all the time. Is this overhead, and if
 yes, how to stop it?

 --
 View this message in context:
 http://www.nabble.com/Do-I-need-Searcher-on-indexing-machine-tp18380669p18380669.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,

Cuong Hoang


Re: Limit Porter stemmer to plural stemming only?

2008-07-01 Thread climbingrose
Attached is the modified Snowball source code for a plural-only English
stemmer. You need to compile it to Java using the instructions here:
http://snowball.tartarus.org/runtime/use.html. Essentially, you need to:

1) Download the Snowball code (compiler, algorithms, and libstemmer
library) from http://snowball.tartarus.org/dist/snowball_code.tgz and
compile the Snowball compiler itself using this command: gcc -O -o snowball
compiler/*.c.
2) Compile the attached file to Java:
./snowball stem_ISO_8859_1.sbl -java -o EnglishStemmer -name EnglishStemmer

You can change EnglishStemmer to whatever you like, for example,
PluralEnglishStemmer. After that, you need to modify the generated Java
class so that it references the appropriate classes in the net.sf.snowball.*
package instead of the ones from the Snowball website. I think the only two classes you
need to import are Among and SnowballProgram.

Once you have the new stemmer ready, write something similar to
EnglishPorterFilterFactory to use it within Solr.
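
As a quick sanity check of the generated class (a sketch only: PluralEnglishStemmer is just
whatever name you gave the generated stemmer, and setCurrent()/stem()/getCurrent() are the
methods it inherits from SnowballProgram):

PluralEnglishStemmer stemmer = new PluralEnglishStemmer();
stemmer.setCurrent("accountants");
stemmer.stem();
System.out.println(stemmer.getCurrent());   // expect "accountant"; "accountant" itself is left untouched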

Hope this helps.

Cheers,
Cuong


On Tue, Jul 1, 2008 at 6:07 PM, Guillaume Smet [EMAIL PROTECTED]
wrote:

 Hi Cuong,

 On Tue, Jul 1, 2008 at 4:45 AM, climbingrose [EMAIL PROTECTED]
 wrote:
  I modified the original English Stemmer written in Snowball language and
  regenerate the Java implementation using Snowball compiler. It's been
  working for me  so far. I certainly can share the modified Snowball
 English
  Stemmer if anyone wants to use it.

 Yeah, it would be nice. A step by step explanation of how to
 regenerate the Java files would be nice too (or a pointer to such a
 documentation if you found one).

 Thanks,

 --
 Guillaume



Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
Hi all,
Porter stemmer in general is really good. However, there are some cases
where it doesn't work. For example, accountant matches Accountant as
well as Account Manager which isn't desirable. Is it possible to use this
analyser for plural words only? For example:
+Accountant - accountant
+Accountants - accountant
+Account - Account
+Accounts - account

Thanks.

-- 
Regards,

Cuong Hoang


Re: Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
Ok, it looks like step 1a in Porter algo does what I need.
On Mon, Jun 30, 2008 at 6:39 PM, climbingrose [EMAIL PROTECTED]
wrote:

 Hi all,
 Porter stemmer in general is really good. However, there are some cases
 where it doesn't work. For example, "accountant" matches "Accountant" as
 well as "Account Manager", which isn't desirable. Is it possible to use this
 analyser for plural words only? For example:
 +Accountant -> accountant
 +Accountants -> accountant
 +Account -> Account
 +Accounts -> account

 Thanks.

 --
 Regards,

 Cuong Hoang




-- 
Regards,

Cuong Hoang


Re: Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
I modified the original English Stemmer written in the Snowball language and
regenerated the Java implementation using the Snowball compiler. It's been
working for me so far. I certainly can share the modified Snowball English
Stemmer if anyone wants to use it.

Cheers,
Cuong

On Tue, Jul 1, 2008 at 4:12 AM, Mike Klaas [EMAIL PROTECTED] wrote:

 If you find a solution that works well, I encourage you to contribute it
 back to Solr.  Plural-only stemming is probably a common need (I've
 definitely wanted to use it before).

 cheers,
 -Mike


 On 30-Jun-08, at 2:25 AM, climbingrose wrote:

  Ok, it looks like step 1a in Porter algo does what I need.
 On Mon, Jun 30, 2008 at 6:39 PM, climbingrose [EMAIL PROTECTED]
 wrote:

  Hi all,
 Porter stemmer in general is really good. However, there are some cases
  where it doesn't work. For example, "accountant" matches "Accountant" as
  well as "Account Manager", which isn't desirable. Is it possible to use this
  analyser for plural words only? For example:
  +Accountant -> accountant
  +Accountants -> accountant
  +Account -> Account
  +Accounts -> account

 Thanks.

 --
 Regards,

 Cuong Hoang




 --
 Regards,

 Cuong Hoang





-- 
Regards,

Cuong Hoang


Re: Suggestion for short text matching using dictionary

2008-06-27 Thread climbingrose
Thanks Grant. I did try Secondstring before and found that it wasn't
particularly good for doing a lot of text matching. I'm leaning toward the
combination of Lucene and Secondstring. Googling around a bit, I came across
this project: http://datamining.anu.edu.au/projects/linkage.html. Looks
interesting, but the implementation is in Python. I think they use a
Hidden Markov Model to label training data and then match records
probabilistically.
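
For reference, here is a rough SolrJ sketch of the "index the pre-defined names and take the
top-scoring hit" idea from my original mail (quoted below). The URL, the "name" field and the
lack of exception handling and query escaping are all placeholders for illustration:

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery query = new SolrQuery("name:(Black Nokia N95 8GB Free bluetooth headset)");
query.setRows(1);                                     // we only care about the best match
SolrDocumentList hits = server.query(query).getResults();
String normalized = hits.isEmpty()
    ? null                                            // no match: keep the feed name as-is
    : (String) hits.get(0).getFieldValue("name");     // top-scoring dictionary entry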

On Fri, Jun 27, 2008 at 10:12 PM, Grant Ingersoll [EMAIL PROTECTED]
wrote:

 below



 On Jun 27, 2008, at 1:18 AM, climbingrose wrote:

  Firstly, my apologies for being off topic. I'm asking this question
 because
 I think there are some machine learning and text processing experts on
 this
 mailing list.

 Basically, my task is to normalize a fairly unstructured set of short
 texts
 using a dictionary. We have a pre-defined list of products and
 periodically
 receive product feeds from various websites. Basically, our site is
 similar
 to a shopping comparison engine but on a different domain. We would like
 to
 normalize the products' names in the feeds to using our pre-defined list.
 For example:

  "Nokia N95 8GB Black" --> "Nokia N95 8GB"
  "Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"

 My original idea is to index the list of pre-defined names and then query
 the index using the product's name. The highest scored result will be used
 to normalize the product.

 The problem with this is sometimes you get wrong matches because of noise.
  For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match
  "Nokia Bluetooth Headset" which is desirable.



 I assume you mean not desirable here given the context...

 Your approach is worth trying.  At a deeper level, you may want to look
 into a topic called record linkage and an open source project called
 Second String by William Cohen's group at Carnegie Mellon (
 http://secondstring.sourceforge.net/) which has a whole bunch of
 implementations of fuzzy string matching algorithms like Jaro-Winkler,
 Levenstein, etc. that can then be used to implement what you are after.

 You could potentially use the spell checking functionality to simulate some
 of this a bit better than just a pure vector match.  Index your dictionary
 into a spelling index (see SOLR-572) and then send in spell checking
 queries.  In fact, you probably could integrate Second String into the spell
 checker pretty easily since one can now plugin the distance measure into the
 spell checker.

 You may find some help on this by searching http://lucene.markmail.org for
 things like record linkage or record matching or various other related
 terms.

 Another option is to write up a NormalizingTokenFilter that analyzes the
 tokens as they come in to see if they match your dictionary list.

 As with all of these, there is going to be some trial and error here to
 come up with something that hits most of the time, as it will never be
 perfect.

 Good luck,
 Grant


 --
 Grant Ingersoll
 http://www.lucidimagination.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ










-- 
Regards,

Cuong Hoang


Re: searching only within allowed documents

2008-06-11 Thread climbingrose
It depends on your query. The second query is better if you know that the
fieldb:bar filter query will be reused often, since it will be cached
separately from the main query. The first query occupies one cache entry while
the second one occupies two cache entries, one in queryCache and one in
filterCache. Therefore, if you're not going to reuse fieldb:bar, the
second query is better.
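
In SolrJ terms the two styles look like this (just a sketch, using the field names from the
thread below):

SolrQuery oneEntry = new SolrQuery("fielda:foo AND fieldb:bar"); // whole query cached once in queryCache
SolrQuery reusable = new SolrQuery("fielda:foo");
reusable.addFilterQuery("fieldb:bar");   // fieldb:bar cached on its own in filterCache and reusable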

On Wed, Jun 11, 2008 at 10:53 PM, Geoffrey Young [EMAIL PROTECTED]
wrote:



  Solr allows you to specify filters in separate parameters that are
 applied to the main query, but cached separately.

  q=the user query&fq=folder:f13&fq=folder:f24


 I've been wanting more explanation around this for a while, so maybe now is
 a good time to ask :)

 the cached separately verbiage here is the same as in the twiki, but I
 don't really understand what it means.  more precisely, I'm wondering what
 the real performance, caching, etc differences are between

   q=fielda:foo+fieldb:bar&mm=100%

 and

   q=fielda:foo&fq=fieldb:bar

 my situation is similar to the original poster's in that documents matching
 fielda is very large and common (say theaters across the world) while fieldb
 would narrow it considerably (one by country, then one by zipcode, etc).

 thanks

 --Geoff





-- 
Regards,

Cuong Hoang


Re: searching only within allowed documents

2008-06-11 Thread climbingrose
Just to correct myself: in the last sentence, the first query is better if
fieldb:bar isn't reused often.

On Thu, Jun 12, 2008 at 2:02 PM, climbingrose [EMAIL PROTECTED]
wrote:

 It depends on your query. The second query is better if you know that
 fieldb:bar filtered query will be reused often since it will be cached
  separately from the query. The first query occupies one cache entry while
  the second one occupies two cache entries, one in queryCache and one in
  filterCache. Therefore, if you're not going to reuse fieldb:bar, the
 second query is better.


 On Wed, Jun 11, 2008 at 10:53 PM, Geoffrey Young 
 [EMAIL PROTECTED] wrote:



  Solr allows you to specify filters in separate parameters that are
 applied to the main query, but cached separately.

  q=the user query&fq=folder:f13&fq=folder:f24


 I've been wanting more explanation around this for a while, so maybe now
 is a good time to ask :)

 the cached separately verbiage here is the same as in the twiki, but I
 don't really understand what it means.  more precisely, I'm wondering what
 the real performance, caching, etc differences are between

   q=fielda:foo+fieldb:bar&mm=100%

 and

   q=fielda:foo&fq=fieldb:bar

 my situation is similar to the original poster's in that documents
 matching fielda is very large and common (say theaters across the world)
 while fieldb would narrow it considerably (one by country, then one by
 zipcode, etc).

 thanks

 --Geoff





 --
 Regards,

 Cuong Hoang




-- 
Regards,

Cuong Hoang


Re: Multiple Schema File

2008-06-04 Thread climbingrose
Hi Sachit,

I think what you could do is to create all the core fields of your models,
such as username, role, title, body, images... You can name them with a prefix,
like user.username, user.role, article.title, article.body... If you want to
dynamically add more fields to your schema, you can use dynamic fields and
keep a mapping between your model's properties and these fields somewhere.
Have a look at the default schema.xml for examples. I used this approach
in a previous project and it worked fine for me. A small sketch is below.
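
For example, on the indexing side the prefixed names just become ordinary field names (a SolrJ
sketch; the ids, field names and the matching field/dynamicField declarations in schema.xml are
all assumptions for illustration):

SolrInputDocument userDoc = new SolrInputDocument();
userDoc.addField("id", "user-42");
userDoc.addField("user.username", "jsmith");
userDoc.addField("user.role", "editor");

SolrInputDocument articleDoc = new SolrInputDocument();
articleDoc.addField("id", "article-7");
articleDoc.addField("article.title", "Multiple models in one index");
articleDoc.addField("article.body", "...");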

Cheers,
Cuong

On Thu, Jun 5, 2008 at 3:43 PM, Sachit P. Menon [EMAIL PROTECTED]
wrote:

 Hi folks,



 I have a scenario as follows:



 I have a CMS where in I'm storing all the contents. I need to index all
 these
 contents and have a search on these indexes. For indexing, I can define a
 schema for all the contents. Some of the properties are like title,
 headline,
 body, keywords, images, etc.

 Now I have a user management wherein I store all the user information. I
 need
 to index this also. This may have properties like user name, role, joining
 date, etc.



 I want to use only one Solr instance. That means I can have only one schema
 file.

 How can I define all these totally different properties in one schema file?

 The unique id storage for content and user management may also be
 different.
 How can I achieve this?





 Thanks and Regards

 Sachit P. Menon| Programmer Analyst| MindTree Ltd. |West Campus, Phase-1,
 Global Village, RVCE Post, Mysore Road, Bangalore-560 059, INDIA |Voice +91
 80 26264000 |Extn  64872|Fax +91 80 26264100 | Mob : +91
 9986747356|www.mindtree.com
 
 https://indiamail.mindtree.com/exchweb/bin/redir.asp?URL=http://www.mindtree
 .com/  |









-- 
Regards,

Cuong Hoang


Ideas on how to implement sponsored results

2008-06-03 Thread climbingrose
Hi all,

I'm trying to implement sponsored results in Solr search results similar
to that of Google. We index products from various sites and would like to
allow certain sites to promote their products. My approach is to query a
slave instance to get sponsored results for user queries in addition to the
normal search results. This part is easy. However, since the number of
products indexed for each site can be very different (100, 1,000, 10,000 or
60,000 products), we need a way to fairly distribute the sponsored results
among sites.

My initial thought is to utilise the field collapsing patch to collapse the
search results on the siteId field. You can imagine that this will create a
series of "buckets" of results, each bucket representing results from a
site. After that, 2 or 3 buckets will be selected at random, from which I will
randomly pick one or two results. However, since I want these
sponsored results to be relevant to user queries, I'd like to keep only
the first 30 results in each bucket. Roughly, the selection would look like the sketch below.
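
(A toy sketch of that selection step; Map<String, List<SolrDocument>> buckets is assumed to
already hold the top 30 hits per siteId, and the numbers are only examples:)

List<String> siteIds = new ArrayList<String>(buckets.keySet());
Collections.shuffle(siteIds);                          // pick the 2-3 sites at random
List<SolrDocument> sponsored = new ArrayList<SolrDocument>();
Random random = new Random();
for (String siteId : siteIds.subList(0, Math.min(3, siteIds.size()))) {
  List<SolrDocument> bucket = buckets.get(siteId);
  sponsored.add(bucket.get(random.nextInt(bucket.size())));   // one random doc per chosen bucket
}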

Obviously, it's desirable that if the user refreshes the page, new sponsored
results will be displayed. On the other hand, I also want to have the
advantages of Solr cache.

What would be the best way to implement this functionality? Thanks.

Cheers,
Cuong


Re: Ideas on how to implement sponsored results

2008-06-03 Thread climbingrose
Hi Alexander,

Thanks for your suggestion. I think my problem is a bit different from
yours. We don't have any sponsored words but we have to retrieve sponsored
results directly from the index. This is because a site can have 60,000
products, which makes it hard to insert/update keywords. I can live with that by
issuing a separate query to fetch sponsored results. My problem is how to
equally distribute sponsored results between sites so that each site will
have an opportunity to show their sponsored results no matter how many
products they have. For example, if site A has 60,000 products and site B has
only 2,000, then sponsored products from site B will have a very small chance
of being displayed.


On Wed, Jun 4, 2008 at 2:56 AM, Alexander Ramos Jardim 
[EMAIL PROTECTED] wrote:

 Cuong,

 I have implemented sponsored words for a client. I don't know if my working
 can help you but I will expose it and let you decide.

 I have an index containing products entries that I created a field called
 sponsored words. What I do is to boost this field , so when these words are
 matched in the query that products appear first on my result.

 2008/6/3 climbingrose [EMAIL PROTECTED]:

  Hi all,
 
  I'm trying to implement sponsored results in Solr search results
 similar
  to that of Google. We index products from various sites and would like to
  allow certain sites to promote their products. My approach is to query a
  slave instance to get sponsored results for user queries in addition to
 the
  normal search results. This part is easy. However, since the number of
  products indexed for each sites can be very different (100, 1000, 1
 or
  6 products), we need a way to fairly distribute the sponsored results
  among sites.
 
  My initial thought is utilising field collapsing patch to collapse the
  search results on siteId field. You can imagine that this will create a
  series of buckets of results, each bucket representing results from a
  site. After that, 2 or 3 buckets will randomly be selected from which I
  will
  randomly select one or two results from. However, since I want these
  sponsored results to be relevant to user queries, I'd like only want to
  have
  the first 30 results in each buckets.
 
  Obviously, it's desirable that if the user refreshes the page, new
  sponsored
  results will be displayed. On the other hand, I also want to have the
  advantages of Solr cache.
 
  What would be the best way to implement this functionality? Thanks.
 
  Cheers,
  Cuong
 



 --
 Alexander Ramos Jardim




-- 
Regards,

Cuong Hoang


Re: Announcement of Solr Javascript Client

2008-05-25 Thread climbingrose
Hi Matthias,

How would you prevent the Solr server from being exposed to the outside world with
this javascript client? I prefer running Solr behind a firewall and accessing it
from server-side code.

Cheers.

On Mon, May 26, 2008 at 7:27 AM, Matthias Epheser [EMAIL PROTECTED]
wrote:

 Hi users,

 As initially described in this thread [1] I am currently working on a
 javascript client library for solr. The idea is based on a demo [2] that
 introduces a reusable javascript widget client.

 I spent the last weeks evaluating the best fitting technologies that ensure
 a clean generic port of the demo into the solr project. The goal is to make
 it easy to use and include in webpages on the one hand, and creating a clean
 interface to the solr server on the other hand.

 With this announcement, I want to ask the community for their experience
 with solr and javascript and would appreciate feedback about this proposal:

 - javascript toolkit: JQuery, because it is already shipped with the solr
 webapp

 - Using a manager object on the client that holds all widgets and takes
 care of the communication to the solr server.

 - Using the JSONResponsewriter to get the data to the widgets so they could
 update their ui.

 These technologies seem to be the currently best ones IMHO, any
 feedback/experiences welcome.

 Regards,
 matthias








 [1]
 http://www.nabble.com/-GSOC-proposal-%3A-Solr-javascript-client-library-to16422808.html#a16430329
 [2] http://lovo.test.dev.indoqa.com/mepheser/moobrowser/




-- 
Regards,

Cuong Hoang


Re: query for number of field entries in a multivalued field?

2008-05-23 Thread climbingrose
Probably the easiest way to do this is to keep track of the number of items
yourself at index time and then retrieve it later on, for example:
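
(A SolrJ sketch; the field names "tag"/"tagCount", the tags list and the server instance are
made up for illustration:)

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
for (String tag : tags) {
  doc.addField("tag", tag);             // the multivalued field
}
doc.addField("tagCount", tags.size());  // query, sort or run function queries against tagCount
server.add(doc);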

On Wed, May 21, 2008 at 7:57 AM, Brian Whitman [EMAIL PROTECTED]
wrote:

 Any way to query how many items are in a multivalued field? (Or use a
 functionquery against that # or anything?)




-- 
Regards,

Cuong Hoang


Re: Simple Solr POST using java

2008-05-10 Thread climbingrose
Agree. I've been using Solrj on a production site for 9 months without any
problem at all. You should probably give it a try instead of dealing with
all those low-level details.


On Sun, May 11, 2008 at 4:14 AM, Chris Hostetter [EMAIL PROTECTED]
wrote:


  : please post a snippet of Java code to add a document to the Solr index that
  : includes the URL reference as a String?

 you mean like this one...   :)


 http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java?view=markup

 FWIW: if you want to talk to Solr from a Java app, the SolrJ client API
 is probably worth looking into rather then dealing with the HTTP
 connections and XML formating directly...

 http://wiki.apache.org/solr/Solrj


 -Hoss




-- 
Regards,

Cuong Hoang


Re: Minimum should match and PhraseQuery

2008-03-23 Thread climbingrose
Thanks Chris. I'll probably have to repost this on the Lucene mailing list.


On Sun, Mar 23, 2008 at 9:49 AM, Chris Hostetter [EMAIL PROTECTED]
wrote:


 the topic has come up before on the lucene java lists (allthough i can't
 think of any good search terms to find the old threads .. I can't really
 remember how people have discribed this idea in the past)

 I don't remember anyone ever suggesting/sharing a general purpose
 solution intrinsicly more efficient then if you just generated all the
 permutations yourself

  : 2) I also want to relax PhraseQuery a bit so that it not only matches
  : "Senior Java Developer"~2 but also matches "Java Developer"~2 but of course with
  : a lower score. I can programmatically generate all the combinations but it's
  : not gonna be efficient if the user issues a query with many terms.



 -Hoss




-- 
Regards,

Cuong Hoang


Minimum should match and PhraseQuery

2008-03-19 Thread climbingrose
Hi all,

I thought many people would encounter the situation I'm having here.
Basically, we'd like to have a PhraseQuery with a "minimum should match"
property similar to BooleanQuery. Consider the query "Senior Java
Developer":

1) I'd like to do a PhraseQuery on "Senior Java Developer" with a slop of,
say, 2, so that the query only matches documents with these words located in
proximity. I don't want to match documents like "Senior [huge block of text]
Java [huge block of text] Developer".
2) I also want to relax the PhraseQuery a bit so that it not only matches "Senior
Java Developer"~2 but also matches "Java Developer"~2, but of course with a
lower score. I can programmatically generate all the combinations but it's not
going to be efficient if the user issues a query with many terms.

Is it possible to do this with Solr and Lucene?

-- 
Cheers,

Cuong Hoang


Re: Accented search

2008-03-11 Thread climbingrose
Hi Peter,

It looks like a very promising approach for us. I'm going to implement a
custom Tokeniser based on your suggestions and see how it goes; a rough
sketch of what I have in mind is below. Thank you all for your comments!
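
(Only a sketch against the Lucene 2.3-era Token API that Solr ships with; the class name, the
stripAccents() helper and the exact method names are from memory, so treat this as pseudocode
to adapt rather than working code:)

public class AccentFoldingFilter extends TokenFilter {
  private Token pending;   // stripped twin of the previous token, waiting to be emitted

  public AccentFoldingFilter(TokenStream input) { super(input); }

  public Token next() throws IOException {
    if (pending != null) { Token t = pending; pending = null; return t; }
    Token t = input.next();
    if (t == null) return null;
    String stripped = stripAccents(t.termText());
    if (!stripped.equals(t.termText())) {
      pending = new Token(stripped, t.startOffset(), t.endOffset());
      pending.setPositionIncrement(0);   // same position as the accented original, as Peter describes
    }
    return t;
  }

  private String stripAccents(String s) {
    // Naive folding; Vietnamese letters such as "đ" would need extra handling
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
  }
}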

Cheers

On Wed, Mar 12, 2008 at 2:37 AM, Binkley, Peter [EMAIL PROTECTED]
wrote:

 We've done this in a pre-Solr Lucene context by using the position
 increment: when a token contains accented characters, you add a stripped
 version of that token with a zero increment, so that for matching purposes
 the original and the stripped version are at the same position. Accents are
 not stripped from queries. The effect is that an accented search matches
 your Doc A, and an unaccented search matches Docs A and B. We do that after
 lower-casing the token.

 There are some limitations: users might start to expect that they can
 freely add accents to restrict their search to accented hits, but if they
 don't match the accents exactly they won't get any hits: e.g. if a word
 contains two accented characters and the user only accents one of them in
 their query, they won't match the accented or the unaccented version.

 Peter

 Peter Binkley
 Digital Initiatives Technology Librarian
 Information Technology Services
 4-30 Cameron Library
 University of Alberta Libraries
 Edmonton, Alberta
 Canada T6G 2J8
 Phone: (780) 492-3743
 Fax: (780) 492-9243
 e-mail: [EMAIL PROTECTED]

 ~ The code is willing, but the data is weak. ~


 -Original Message-
 From: climbingrose [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 10, 2008 10:01 PM
 To: solr-user@lucene.apache.org
 Subject: Accented search

 Hi guys,

 I'm running to some problems with accented (UTF-8) language. I'd love to
 hear some ideas about how to use Solr with those languages. Basically, I
 want to achieve what Google did with UTF-8 language.

 My requirements including:
 1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters Lập Trình Viên, then Doc B is also matched and Lập
 Trình Viên is highlighted.
  On the other hand, if the query is Lap Trinh Vien, Doc A is also
 matched.
 2) Assign proper scores to accented or non-accented searches:
  if the user enters Lập Trình Viên, then Doc A should be given higher
 score than DOC B.
  if the query is Lap Trinh Vien, Doc A should be given higher score.

 Any ideas guys? Thanks in advance!

 --
 Regards,

 Cuong Hoang




-- 
Regards,

Cuong Hoang


Accented search

2008-03-10 Thread climbingrose
Hi guys,

I'm running into some problems with an accented (UTF-8) language. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google does with UTF-8 languages.

My requirements including:
1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters Lập Trình Viên, then Doc B is also matched and Lập
Trình Viên is highlighted.
  On the other hand, if the query is Lap Trinh Vien, Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters Lập Trình Viên, then Doc A should be given higher
score than DOC B.
  if the query is Lap Trinh Vien, Doc A should be given higher score.

Any ideas guys? Thanks in advance!

-- 
Regards,

Cuong Hoang


Re: solr 1.3

2008-01-20 Thread climbingrose
I don't think they (the Solr developers) have a time frame for the 1.3 release.
However, I've been using the latest code from the trunk and I can tell you
it's quite stable. The only problem is that the documentation sometimes doesn't
cover the latest changes in the code. You'll probably have to dig into the code
itself or post a question here, and many people will be happy to help you.

On Jan 21, 2008 12:07 PM, anuvenk [EMAIL PROTECTED] wrote:


 when will this be released? where can i find the list of
 improvements/enhancements in 1.3 if its been documented already?
 --
 View this message in context:
 http://www.nabble.com/solr-1.3-tp14989395p14989395.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,

Cuong Hoang


Re: solr 1.3

2008-01-20 Thread climbingrose
I'm using code pulled directly from Subversion.

On Jan 21, 2008 12:34 PM, anuvenk [EMAIL PROTECTED] wrote:


 Thanks. Would this be the latest code from the trunk that you mentioned?
 http://people.apache.org/builds/lucene/solr/nightly/solr-2008-01-19.zip


 climbingrose wrote:
 
  I don't think they (Solr developers) have a time frame for 1.3 release.
  However, I've been using the latest code from the trunk and I can tell
 you
  it's quite stable. The only problem is the documentation sometimes
 doesn't
  cover lastest changes in the code. You'll probably have to dig into the
  code
  itself or post a question here and many people will be happy to help
 you.
 
  On Jan 21, 2008 12:07 PM, anuvenk [EMAIL PROTECTED] wrote:
 
 
  when will this be released? where can i find the list of
  improvements/enhancements in 1.3 if its been documented already?
  --
  View this message in context:
  http://www.nabble.com/solr-1.3-tp14989395p14989395.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
  --
  Regards,
 
  Cuong Hoang
 
 

 --
 View this message in context:
 http://www.nabble.com/solr-1.3-tp14989395p14989689.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,

Cuong Hoang


Merry Christmas and happy new year

2007-12-24 Thread climbingrose
Good day all Solr users  developers,

May I wish you and your family a merry Xmas and happy new year. Hope that
new year brings you all health, wealth and peace. It's been my pleasure to
be on this mailing list and working with Solr. Thank you all!

-- 
Cheers,

Cuong Hoang


Re: Issues with postOptimize

2007-12-17 Thread climbingrose
Make sure that the user running Solr has permission to execute snapshooter.
Also, try ./snapshooter instead of snapshooter.

Good luck.

On Dec 18, 2007 10:57 AM, Sunny Bassan [EMAIL PROTECTED] wrote:

 I've set up solrconfig.xml to create a snap shot of an index after doing
 a optimize, but the snap shot cannot be created because of permission
 issues. I've set permissions to the bin, data and log directories to
 read/write/execute for all users. Even with these settings I cannot seem
 to be able to run snapshooter on the postOptimize event. Any ideas?
 Could it be a java permissions issue? Thanks.

 Sunny

 Config settings:

 <listener event="postOptimize" class="solr.RunExecutableListener">
   <str name="exe">snapshooter</str>
   <str name="dir">/search/replication_test/0/index/solr/bin</str>
   <bool name="wait">true</bool>
 </listener>

 Error:

 Dec 17, 2007 7:45:19 AM org.apache.solr.core.RunExecutableListener exec
 FINE: About to exec snapshooter
 Dec 17, 2007 7:45:19 AM org.apache.solr.core.SolrException log
 SEVERE: java.io.IOException: Cannot run program "snapshooter" (in
 directory "/search/replication_test/0/index/solr/bin"):
 java.io.IOException: error=13, Permission denied
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
  at java.lang.Runtime.exec(Runtime.java:593)
  at org.apache.solr.core.RunExecutableListener.exec(RunExecutableListener.java:70)
  at org.apache.solr.core.RunExecutableListener.postCommit(RunExecutableListener.java:97)
  at org.apache.solr.update.UpdateHandler.callPostOptimizeCallbacks(UpdateHandler.java:105)
  at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:516)
  at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:214)
  at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
  at java.lang.Thread.run(Thread.java:619)
 Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
  at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
  ... 23 more






-- 
Regards,

Cuong Hoang


Re: Replication hooks

2007-12-10 Thread climbingrose
I think there is an event listener interface for hooking into Solr events
such as post-commit, post-optimise and opening a new searcher. I can't remember
its name off the top of my head, but if you do a search for *EventListener in Eclipse,
you'll find it.
The Wiki shows how to trigger snapshooter after each commit and optimise.
You should be able to follow that example to create your own listener.
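
A minimal sketch of such a listener (assuming the Solr 1.2-era SolrEventListener interface; wire
it up with a <listener event="postCommit" class="..."/> entry in solrconfig.xml):

public class MyReplicationListener implements SolrEventListener {
  public void init(NamedList args) { }
  public void postCommit() {
    // e.g. exec snapshooter or notify the slaves that a new snapshot is ready
  }
  public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) { }
}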

On Dec 11, 2007 1:03 PM, Tracy Flynn [EMAIL PROTECTED]
wrote:

 Hi,

 I'm interested in setting up simple replication. I've reviewed all the
 Wiki information, looked at the scripts etc. and understand most of
 what I see.

 There are some references to  'hooks in the code'  for both the master
 and slave nodes for handling replication. I've searched the 1.2 and
 trunk code bases for obvious phrases, but I can't identify these hooks.

 Can someone please point me to the correct place(s) to look?

 Thanks,

 Tracy Flynn




-- 
Regards,

Cuong Hoang


Re: solr + maven?

2007-12-05 Thread climbingrose
Hi Ryan,

I'm using Solr with Maven 2 in our project. Here is what the relevant part of my pom.xml
looks like:

<!-- Solrj -->
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>1.3.0</version>
</dependency>

Since all of solrj's dependencies are already declared by other artifacts in my
project, I don't need to declare them again. You'll probably have to add the
commons-httpclient artifact yourself.

On Dec 5, 2007 10:08 AM, Ryan McKinley [EMAIL PROTECTED] wrote:

 Is anyone managing solr projects with maven?  I see:
 https://issues.apache.org/jira/browse/SOLR-19
 but that is 1 year old

 If someone has a current pom.xml, can you post it on SOLR-19?

 I just started messing with maven, so I don't really know what I am
 doing yet.

 thanks
 ryan




-- 
Regards,

Cuong Hoang


Re: SOLR sorting - question

2007-12-04 Thread climbingrose
I don't think you have to. Just try the query on the REST interface and you
will know.

On Dec 5, 2007 9:56 AM, Kasi Sankaralingam [EMAIL PROTECTED] wrote:

 Do I need to select the fields in the query that I am trying to sort on?,
 for example if I want sort on update date then do I need to select that
 field?

 Thanks,




-- 
Regards,

Cuong Hoang


Access to SolrIndexSearcher in UpdateProcessor

2007-12-02 Thread climbingrose
Hi all,

I'm trying to implement a custom UpdateProcessor which requires access to
SolrIndexSearcher. However, I'm constantly running into a "Too many open
files" exception. I'm confused about which is the correct way to get access
to SolrIndexSearcher in an UpdateProcessor:

1) req.getSearcher()
2) req.getCore().getSearcher()
3) req.getCore().newSearcher(MyCustomerProcessorFactory);

I have tried 1) and 3) but both produce "Too many open files". The weird thing
with 3) is that the SolrIndexSearcher created gets set to null automatically by
Solr, so I didn't have a chance to call the searcher.close() method. I suspect
all searchers opened this way are set to null when a commit is made. Any
recommendations?

-- 
Regards,

Cuong Hoang


Re: Get last updated/committed document

2007-11-23 Thread climbingrose
Assuming that you have the timestamp field defined:
q=*:*&sort=timestamp desc
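
Or, with SolrJ (a sketch; it assumes a SolrServer instance named server, a "timestamp" field
populated at index time, e.g. with default="NOW" in schema.xml, and a stored unique key "id"):

SolrQuery q = new SolrQuery("*:*");
q.setSortField("timestamp", SolrQuery.ORDER.desc);
q.setRows(1);
SolrDocument last = server.query(q).getResults().get(0);
String lastId = (String) last.getFieldValue("id");   // id of the most recently indexed committed document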

On Nov 23, 2007 10:43 PM, Thorsten Scherler
[EMAIL PROTECTED] wrote:
 Hi all,

 I need to ask solr to return me the id of the last committed document.

 Is there a way to archive this via a standard lucene query or do I need
 a custom connector that gives me this information?

 TIA for any information

 salu2
 --
 Thorsten Scherler thorsten.at.apache.org
 Open Source Java  consulting, training and solutions





-- 
Regards,

Cuong Hoang


Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
The duplication detection mechanism in Nutch is quite primitive. I
think it uses an MD5 signature generated from the content of a field.
The generation algorithm is described here:
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.

The problem with this approach is that an MD5 hash is very sensitive: a
one-letter difference will generate a completely different hash. You
probably have to roll your own near-duplication detection algorithm.
My advice is to have a look at the existing literature on near-duplication
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash; a toy illustration is
below. I read the paper and found it very interesting. I'm not sure if you can
implement their exact algorithm because they have patented it. That said, there
is plenty of literature on near-dup detection, so you should be able to find an
approach you can use for free!
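
(A toy sketch of the minhash idea only, not any patented implementation: documents whose shingle
sets overlap heavily tend to agree on many of the per-hash minima, so comparing signatures
approximates Jaccard similarity.)

public class MinHashSketch {
  // Build a fixed-length signature from a document's set of shingles (e.g. word n-grams)
  public static int[] signature(java.util.Set<String> shingles, int numHashes, long seed) {
    int[] sig = new int[numHashes];
    java.util.Arrays.fill(sig, Integer.MAX_VALUE);
    java.util.Random rnd = new java.util.Random(seed);
    for (int i = 0; i < numHashes; i++) {
      int a = rnd.nextInt() | 1;   // cheap "permutation": h(x) = a*x + b with an odd multiplier
      int b = rnd.nextInt();
      for (String s : shingles) {
        int h = a * s.hashCode() + b;
        if (h < sig[i]) sig[i] = h;
      }
    }
    return sig;
  }

  // Estimated similarity = fraction of positions where two signatures agree
  public static double similarity(int[] s1, int[] s2) {
    int same = 0;
    for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) same++;
    return (double) same / s1.length;
  }
}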

On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
 Otis,

 Thanks for your response.

 I just gave a quick look to the Nutch Forum and find that there is an
 implementation to obtain de-duplicate documents/pages but none for Near
 Duplicates documents. Can you guide me a little further as to where exactly
 under Nutch I should be concentrating, regarding near duplicate documents?

 Regards,
 Rishabh

 On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
 wrote:


  To whomever started this thread: look at Nutch.  I believe something
  related to this already exists in Nutch for near-duplicate detection.
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
  - Original Message 
  From: Mike Klaas [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Sunday, November 18, 2007 11:08:38 PM
  Subject: Re: Near Duplicate Documents
 
  On 18-Nov-07, at 8:17 AM, Eswar K wrote:
 
   Is there any idea implementing that feature in the up coming
   releases?
 
  Not currently.  Feel free to contribute something if you find a good
  solution g.
 
  -Mike
 
 
   On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
  
   On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
   We have a scenario, where we want to find out documents which are
   similar in
   content. To elaborate a little more on what we mean here, lets
   take an
   example.
  
   The example of this email chain in which we are interacting on,
   can be
   best
   used for illustrating the concept of near dupes (We are not getting
   confused
   with threads, they are two different things.). Each email in this
   thread
   is
   treated as a document by the system. A reply to the original mail
   also
   includes the original mail in which case it becomes a near
   duplicate of
   the
   orginal mail (depending on the percentage of similarity).
   Similarly it
   goes
   on. The near dupes need not be limited to emails.
  
   I think this is what's known as shingling.  See
   http://en.wikipedia.org/wiki/W-shingling
   Lucene (and therefore Solr) does not implement shingling.  The
   MoreLikeThis query might be close enough, however.
  
   -Stuart
  
 
 
 
 
 




-- 
Regards,

Cuong Hoang


Re: Help with Debian solr/jetty install?

2007-11-21 Thread climbingrose
Make sure you have the JDK installed, not just the JRE. Also try setting
the JAVA_HOME environment variable.

apt-get install sun-java5-jdk




On Nov 21, 2007 5:50 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 Phillip,

 I won't go into details, but I'll point out that the Java compiler is called 
 javac and if memory serves me well, it is defined in one of Jetty's XML 
 config files in its etc/ dir.  The java compiler is used to compile JSPs that 
 Solr uses for the admin UI.  So, make sure you have javac and make sure Jetty 
 can find it.

 Otis

 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 - Original Message 
 From: Phillip Farber [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Tuesday, November 20, 2007 5:55:27 PM
 Subject: Help with Debian solr/jetty install?


 Hi,

 I've successfully run as far as the example admin page on Debian linux
  2.6.

 So I installed the solr-jetty packaged for Debian testing which gives
  me
 Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the

 Solr home page at http://localhost:8280/solr

 But I get an error when I try to run http://localhost:8280/solr/admin

 HTTP ERROR: 500
 No Java compiler available

 I have sun-java6-jre and sun-java6-jdk packages installed.  I'm new to
 servlet containers and java webapps.  What should I be looking for to
 fix this or what information could I provide the list to get me moving
 forward from here?

 I've included the trace from the Jetty log, and the java properties
  dump
 from the example below.

 Thanks,
 Phil

 ---

 Java properties (from the example):
 --

 sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
 java.vm.version = 1.6.0-b105
 java.vm.name = Java HotSpot(TM) Client VM
 user.dir = /tmp/apache-solr-1.2.0/example
 java.runtime.version = 1.6.0-b105
 os.arch = i386
 java.io.tmpdir = /tmp

 java.library.path =
 /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
 java.class.version = 50.0
 jetty.home = /tmp/apache-solr-1.2.0/example
 sun.management.compiler = HotSpot Client Compiler
 os.version = 2.6.22-2-686
 java.class.path =
 /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
 java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
 java.version = 1.6.0
 java.ext.dirs =
 /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
 sun.boot.class.path =
 /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes




 Jetty log (from the error under Debian Solr/Jetty):
 

 org.apache.jasper.JasperException: No Java compiler available
  at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
  at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
  at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
  at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
  at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
  at org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
  at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
  at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
  at org.mortbay.jetty.servlet.Default.service(Default.java:223)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
  at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
  at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821)
  at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:471)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
  at 

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
Hi Ken,

It's correct that uncommon words are most likely not showing up in the
signature. However, I was trying to say that if two documents have 99%
of their tokens in common and differ in one token whose quantised frequency
changes, the two resulting hashes are completely different. If you
want true near-dup detection, what you would like to have is two
hashes that differ in only 1-2 bytes. That way, the signatures will
truly reflect the content of the documents they represent. However, with
this approach, you need a bit more work to cluster near-dup documents.
Basically, once you have a hash function as I describe above,
finding similar documents comes down to the Hamming distance problem: two
docs are near dups if their hashes differ in at most k positions (with k
small, perhaps 3).


On Nov 22, 2007 2:35 AM, Ken Krugler [EMAIL PROTECTED] wrote:
 The duplication detection mechanism in Nutch is quite primitive. I
 think it uses a MD5 signature generated from the content of a field.
 The generation algorithm is described here:
 http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.
 
 The problem with this approach is MD5 hash is very sensitive: one
 letter difference will generate completely different hash.

 I'm confused by your answer, assuming it's based on the page
 referenced by the URL you provided.

 The approach by TextProfileSignature would only generate a different
 MD5 hash with a single letter change if that change resulted in a
 change in the quantized frequency for that word. And if it's an
 uncommon word, then it wouldn't even show up in the signature.

 -- Ken


 You
 probably have to roll your own near duplication detection algorithm.
 My advice is have a look at existing literature on near duplication
 detection techniques and then implement one of them. I know Google has
 some papers that describe a technique called minhash. I read the paper
 and found it's very interesting. I'm not sure if you can implement the
 algorithm because they have patented it. That said, there are plenty
 literature on near dup detection so you should be able to get one for
 free!
 
 On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
   Otis,
 
   Thanks for your response.
 
I just gave a quick look to the Nutch Forum and find that there is an
   implementation to obtain de-duplicate documents/pages but none for Near
   Duplicates documents. Can you guide me a little further as to where 
  exactly
under Nutch I should be concentrating, regarding near duplicate 
  documents?
   
Regards,
   Rishabh
 
   On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
   wrote:
 
 
To whomever started this thread: look at Nutch.  I believe something
related to this already exists in Nutch for near-duplicate detection.
   
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
   
- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents
   
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
   
 Is there any idea implementing that feature in the up coming
 releases?
   
  Not currently.  Feel free to contribute something if you find a good
 solution <g>.

-Mike
   
   
 On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:

 On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
 We have a scenario, where we want to find out documents which are
 similar in
 content. To elaborate a little more on what we mean here, lets
 take an
 example.

 The example of this email chain in which we are interacting on,
 can be
 best
 used for illustrating the concept of near dupes (We are not getting
 confused
 with threads, they are two different things.). Each email in this
 thread
 is
 treated as a document by the system. A reply to the original mail
 also
 includes the original mail in which case it becomes a near
 duplicate of
 the
 orginal mail (depending on the percentage of similarity).
 Similarly it
 goes
 on. The near dupes need not be limited to emails.

 I think this is what's known as shingling.  See
 http://en.wikipedia.org/wiki/W-shingling
 Lucene (and therefore Solr) does not implement shingling.  The
 MoreLikeThis query might be close enough, however.

  -Stuart

 --
 Ken Krugler
 Krugle, Inc.
 +1 530-210-6378
 If you can't find it, you can't fix it




-- 
Regards,

Cuong Hoang


Re: Pagination with Solr

2007-11-19 Thread climbingrose
Hi David,

Do you use one of the Solr clients available at
http://wiki.apache.org/solr/IntegratingSolr? These clients should
already do all the XML parsing work for you. I speak from
Solrj experience.

IMO, your approach is probably the most commonly used when it comes to
pagination. Solr's caching mechanisms should speed up the request for the
next page.
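
A rough Solrj sketch of that flow (the URL and the query string are
placeholders, and the class names are from the Solrj version I've been using):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class PagingExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int rows = 10;
        SolrQuery query = new SolrQuery("video");
        query.setStart(0);
        query.setRows(rows);

        QueryResponse rsp = server.query(query);
        SolrDocumentList page = rsp.getResults();
        long numFound = page.getNumFound();   // total hits, no string parsing needed

        // Fetch the next page only if there is one.
        if (numFound > rows) {
            query.setStart(rows);
            SolrDocumentList nextPage = server.query(query).getResults();
            System.out.println("second page size: " + nextPage.size());
        }
    }
}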

Cheers,

On Nov 20, 2007 10:27 AM, Dave C. [EMAIL PROTECTED] wrote:
 Hello again,

 I'm trying to accomplish very basic pagination with my Solr search results.

 What I'm trying is to parse the response for numFound:some number and if 
 this number is greater than the rows parameter, I send another search 
 request to Solr with a new start parameter.
 Is there a better way to do this?  Specifically, is there another way to 
 obtain the numFound rather than parsing the response stream/string?

 Thanks a lot,
 David

 _
 Share life as it happens with the new Windows Live.Download today it's FREE!
 http://www.windowslive.com/share.html?ocid=TXT_TAGLM_Wave2_sharelife_112007



-- 
Regards,

Cuong Hoang


Re: Finding all possible synonyms for a word

2007-11-19 Thread climbingrose
One approach is to extend SynonymFilter so that it reads synonyms from a
database instead of a file. SynonymFilter is just a Java class so you
can do whatever you want with it :D. From what I remember, the filter
initialises a list of all input synonyms and stores them in memory.
Therefore, you need to make sure that all the synonyms can fit into
memory at runtime.
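
A hedged sketch of the database-loading half of that idea (the table and
column names here are made up, and wiring the resulting map into SynonymFilter
itself is left out):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DbSynonymLoader {
    // Loads every (term, synonym) pair into an in-memory map, mirroring the
    // way the file-based filter keeps all synonyms in memory.
    public static Map<String, List<String>> load(String jdbcUrl) throws SQLException {
        Map<String, List<String>> synonyms = new HashMap<String, List<String>>();
        Connection conn = DriverManager.getConnection(jdbcUrl);
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT term, synonym FROM synonyms");
            while (rs.next()) {
                String term = rs.getString(1).toLowerCase();
                List<String> list = synonyms.get(term);
                if (list == null) {
                    list = new ArrayList<String>();
                    synonyms.put(term, list);
                }
                list.add(rs.getString(2).toLowerCase());
            }
        } finally {
            conn.close();
        }
        return synonyms;
    }
}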

On Nov 20, 2007 1:54 AM, Kishore AVK. Veleti [EMAIL PROTECTED] wrote:
 Hi Eswar,

 Thanks for the update.

 I have gone through the below link provided by you and what I understood from 
 it is, we need to have all possible synonyms in a text file. This file need 
 to be given as input for SynonymFilterFactory to work. If my understanding 
 is right then the approach may not suit my requirement. Reason is I need to 
 find synonyms of all the keywords in category description and store those 
 synonyms in the above said input file. The file may be too big.

 Let me know if my understanding is wrong.


 Thanks,
 Kishore Veleti A.V.K.




 -Original Message-
 From: Eswar K [mailto:[EMAIL PROTECTED]
 Sent: Monday, November 19, 2007 11:22 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Finding all possible synonyms for a word

 Kishore,

 Solr has a SynonymFilterFactory which might be off use to you (
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46)


 Regards,
 Eswar

 On Nov 18, 2007 10:39 PM, Kishore AVK. Veleti [EMAIL PROTECTED]
 wrote:

  Hi All,
 
  I am new to Lucene / SOLR and developing a POC as part of research. Check
  below my requirement and problem statement. Need help on how I can index the
  data such data I have a very good search functionality in my POC.
 
  --
  Requirement:
  --
 
  Assume my web application is an Online book store and it sell all
  categories of books like Computers, Social Studies, Physical Sciences etc.
  Each of these categories has sub-categories. For example Computers has
  sub-categories like Software Engineering, Java, SQL Server etc
 
  I have a database table called Categories and it contains both Parent
  Category descriptions and also Child Category descriptions.
 
  Data structure of Category table is:
 
  Category_ID_Primay_Key  integer
  Parent_Category_ID  integer
  Category_Name varchar(100)
  Category_Description varchar(1000)
 
 
  --
  My Search UI:
  --
 
  My search page is very simple. We have a text field with Search button.
 
  --
  User Action:
  --
 
  User enter below search text in above text field and clicks on Search
  button.
 
  Books on Data Center
 
  --
  What is my expected behavior:
  --
 
  Since the word Data Center more relevant computers I should show books
  related to computers.
 
  --
  My Problem statement and Question to you all:
  --
 
  To have a better search in my web applications what kind of strategy
  should I have and index the data accordingly in SOLR/Lucene.
 
  In my Lucene Index I may or may not have the word data center. Still I
  should be able to return data center
 
  One thought I have is as follows:
 
  Modify the Category table by adding one more column to it:
 
  Category_ID_Primay_Key  integer
  Parent_Category_ID  integer
  Category_Name varchar(100)
  Category_Description varchar(1000)
  Category_Description_Keywords varchar(8000)
 
  Now take each word in Category_description, find synonyms of it and
  store that data in Category_Description_Keywords column. After doing it,
  index the Category table records in SOLR/Lucene.
 
  Below are my questions to you all:
 
  Question 1:
  Need your feedbacks on above approach or any other approach which help me
  to make my search better that returns most relevant results to the user.
 
  Question 2:
  Can you suggest me Java based best Open Source or commercial synonym
  engines. I want such a best synonym engine that gives me all possible
  synonyms of a word.
 
 
 
  Thanks in Advance,
  Kishore Veleti A.V.K.
 




-- 
Regards,

Cuong Hoang


Re: multiple delete by id in one delete command?

2007-11-18 Thread climbingrose
The easiest solution I know is:
<delete><query>id:1 OR id:2 OR ...</query></delete>
If you know that all of these ids can be found by issuing a query, you
can do a delete by query:
<delete><query>YOUR_DELETE_QUERY_HERE</query></delete>
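
If you happen to be calling Solr from Java, Solrj can do the same thing without
hand-building the XML (a hedged sketch; the ids and URL are placeholders, and
the list form of deleteById may not exist in older Solrj versions):

import java.util.Arrays;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Several ids in one call...
        server.deleteById(Arrays.asList("1", "2", "3"));
        // ...or everything matching a query.
        server.deleteByQuery("YOUR_DELETE_QUERY_HERE");
        server.commit();
    }
}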

Cheers

On Nov 19, 2007 4:18 PM, Norberto Meijome [EMAIL PROTECTED] wrote:
 Hi everyone,

 I'm trying to issue, via curl to SOLR (testing at the moment), 3 deletes by 
 id.
 I tried sending :

 <delete><id>1</id><id>2</id><id>3</id></delete>

 and solr didn't like it at all.

 When I changed it to :

 <delete><id>1</id></delete><delete><id>2</id></delete><delete><id>3</id></delete>

 as in :

 curl http://localhost:8983/vcs/update -H "Content-Type: text/xml"
 --data-binary
 '<delete><id>816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3</id></delete><delete><id>53f3f80e65482a5be353e7110f5308949d51dfa93dbe3c1eca169edd19b3</id></delete>'

 only the 1st (id = 1, or id =
 816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3) gets deleted
 (after a commit, of course).

 So i figure I will have to issue a series of independent
 <delete><id>xxx</id></delete> commands... Is it not possible to bunch them
 all together as it's possible with <add><doc>...</doc><doc>...</doc></add> ?


 thanks!!
 Beto
 _
 {Beto|Norberto|Numard} Meijome

 Imagination is more important than knowledge.
   Albert Einstein, On Science

 I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
 Reading disclaimers makes you go blind. Writing them is worse. You have been 
 Warned.




-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-10-11 Thread climbingrose
Hi all,

I've been so busy the last few days so I haven't replied to this email. I
modified SpellCheckerHandler a while ago to include support for multiword
queries. To be honest, I didn't have time to write unit tests for the code.
However, I deployed it in a production environment and it has been working
for me so far. My version, however, makes two assumptions:

1) I assume that when a user enters a misspelled multiword query, we should
only check the words that are actually misspelled. For example, if a user
enters "life expectancy calculatar", which has "calculator" misspelled, we
should only spellcheck "calculatar".
2) I only return the best string for a misspelled query.

I guess I can just directly paste the code here so that others can adapt it for
their own purposes. If you have any questions, just send me an email. I'll be
happy to help you.

StringBuffer buf = null;
if (null != words && !"".equals(words.trim())) {
    Analyzer analyzer = req.getSchema().getField(field).getType().getAnalyzer();

    TokenStream source = analyzer.tokenStream(field, new StringReader(words));
    Token t;
    boolean hasSuggestion = false;
    boolean termExists = false;
    while (true) {
        try {
            t = source.next();
        } catch (IOException e) {
            t = null;
        }
        if (t == null)
            break;

        String termText = t.termText();
        String[] suggestions = spellChecker.suggestSimilar(termText,
                numSug, req.getSearcher().getReader(), restrictToField, true);
        if (suggestions != null && suggestions.length > 0) {
            if (!suggestions[0].equals(termText)) {
                hasSuggestion = true;
            }
            if (buf == null) {
                buf = new StringBuffer(suggestions[0]);
            } else
                buf.append(" ").append(suggestions[0]);
        } else if (spellChecker.exist(termText)) {
            termExists = true;
            if (buf == null) {
                buf = new StringBuffer(termText);
            } else
                buf.append(" ").append(termText);
        } else {
            hasSuggestion = false;
            termExists = false;
            break;
        }
    }
    try {
        source.close();
    } catch (IOException e) {
        // ignore
    }
    // String[] suggestions = spellChecker.suggestSimilar(words, numSug,
    //         nullReader, restrictToField, onlyMorePopular);
    if (hasSuggestion || (!hasSuggestion && termExists))
        rsp.add("suggestions", buf.toString());
    else
        rsp.add("suggestions", null);
}



On 10/11/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Hoss,

 I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)

 For now, this is done.

 I created the changes, created JavaDoc comments on the various settings
 and their expected output, created a JUnit test for the
 SpellCheckerRequestHandler
 which tests various components of the handler, and I also created the
 supporting configuration files for the JUnit tests (schema and solrconfig
 files).

 I attached the patch to the JIRA issue so now we just have to wait until
 it gets
 added back in to the main code stream.

 For anyone who is interested, here is a link to the JIRA:
 https://issues.apache.org/jira/browse/SOLR-375

 Could someone please drop me a hint on how to update the wiki or any other
 documentation that could benefit to being updated; I'll like to help out
 as much
 as possible, but first I need to know how. ;-)

 When these changes do get committed back in to the daily build, please
 review the generated JavaDoc for information on how to utilize these new
 features.
 If anyone has any questions, or comments, please do not hesitate to ask.

 As a general note of a self-critique on these changes, I am not 100% sure
 of the way I
 implemented the nested structure when the multiWords parameter is
 used.  My interest
 is that it should work smoothly with some other technology such as
 Prototype using the
 JSon output type.  Unfortunately, I will not be getting a chance to start
 on that coding until
 next week so it is up in the air as to if this structure will be conducive
 or not.  I am planning
 on providing more details in the documentations as far as how to utilize
 these modifications
 in Prototype and AJax when I get a chance (even provide links to a
 production site so you
 can see it in action and view the source if interested).  So stay tuned...

Thanks for everyones time,
   Scott Tabar

  Chris Hostetter [EMAIL PROTECTED] wrote:

 : If you like, I can post the source code changes that I made to the
 : SpellCheckerRequestHandler, but at 

Re: Spell Check Handler

2007-10-11 Thread climbingrose
Just to clarify this line of code:

String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
req.getSearcher().getReader(), restrictToField, true);

I only return suggestions if they are more popular than termText. You
probably need to use code in Scott's patch to make this behaviour
configurable.

On 10/11/07, climbingrose [EMAIL PROTECTED] wrote:

 Hi all,

 I've been so busy the last few days so I haven't replied to this email. I
 modified SpellCheckerHandler a while ago to include support for multiword
 query. To be honest, I didn't have time to write unit test for the code.
 However, I deployed it in a production environment and it has been working
 for me so far. My version, however, has two assumptions:

 1) I assumpt that when user enter a misspelled multiword query, we should
 only check for words that are actually misspelled. For example, if user
 enter life expectancy calculatar, which has calculator misspelled, we
 should only spellcheck calculatar.
 2) I only return the best string for a mispelled query.

 I guess I can just directly paste the code here so that others can adapt
 for their own purposes. If you have any question, just send me an email.
 I'll happy to help  you.

 StringBuffer buf = null;
 if (null != words  !.equals(words.trim())) {
 Analyzer analyzer = req.getSchema
 ().getField(field).getType().getAnalyzer();

 TokenStream source = analyzer.tokenStream(field, new
 StringReader(words));
 Token t;
 boolean hasSuggestion = false;
 boolean termExists = false;
 while (true) {
 try {
 t = source.next();
 } catch (IOException e) {
 t = null;
 }
 if (t == null)
 break;

 String termText = t.termText();
 String[] suggestions = spellChecker.suggestSimilar(termText,
 numSug, req.getSearcher().getReader(), restrictToField, true);
 if (suggestions != null  suggestions.length  0) {
 if (!suggestions[0].equals(termText)) {
 hasSuggestion = true;
 }
 if (buf == null) {
 buf = new StringBuffer(suggestions[0]);
 } else
 buf.append( ).append(suggestions[0]);
 } else if (spellChecker.exist(termText)){
 termExists = true;
 if (buf == null) {
 buf = new StringBuffer(termText);
 } else
 buf.append( ).append(termText);
 } else {
 hasSuggestion = false;
 termExists= false;
 break;
 }
 }
 try {
 source.close();
 } catch (IOException e) {
 // ignore
 }
 // String[] suggestions = spellChecker.suggestSimilar(words,
 numSug,
 // nullReader, restrictToField, onlyMorePopular);
 if (hasSuggestion || (!hasSuggestion  termExists))
 rsp.add(suggestions, buf.toString());
 else
 rsp.add(suggestions, null);



 On 10/11/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
  Hoss,
 
  I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)
 
  For now, this is done.
 
  I created the changes, created JavaDoc comments on the various settings
  and their expected output, created a JUnit test for the
  SpellCheckerRequestHandler
  which tests various components of the handler, and I also created the
  supporting configuration files for the JUnit tests (schema and
  solrconfig files).
 
  I attached the patch to the JIRA issue so now we just have to wait until
  it gets
  added back in to the main code stream.
 
  For anyone who is interested, here is a link to the JIRA:
  https://issues.apache.org/jira/browse/SOLR-375
 
  Could someone please drop me a hint on how to update the wiki or any
  other
  documentation that could benefit to being updated; I'll like to help out
  as much
  as possible, but first I need to know how. ;-)
 
  When these changes do get committed back in to the daily build, please
  review the generated JavaDoc for information on how to utilize these new
  features.
  If anyone has any questions, or comments, please do not hesitate to ask.
 
 
  As a general note of a self-critique on these changes, I am not 100%
  sure of the way I
  implemented the nested structure when the multiWords parameter is
  used.  My interest
  is that it should work smoothly with some other technology such as
  Prototype using the
  JSon output type.  Unfortunately, I will not be getting a chance to
  start on that coding until
  next week so it is up in the air as to if this structure will be
  conducive or not.  I am

Re: Solr replication

2007-10-01 Thread climbingrose
1)On solr.master:
+Edit scripts.conf:
solr_hostname=localhost
solr_port=8983
rsyncd_port=18983
+Enable and start rsync:
rsyncd-enable; rsyncd-start
+Run snapshooter:
snapshooter
After running this, you should be able to see a new folder named snapshot.*
in data/index folder.
You can configure solrconfig.xml to trigger snapshooter after a commit or
optimise.

2) On slave:
+Edit scripts.conf:
solr_hostname=solr.master
solr_port=8986
rsyncd_port=18986
data_dir=
webapp_name=solr
master_host=localhost
master_data_dir=$MASTER_SOLR_HOME/data/
master_status_dir=$MASTER_SOLR_HOME/logs/clients/
+Run snappuller:
snappuller -P 18983
+Run snapinstaller:
snapinstaller

You should setup crontab to run snappuller and snapinstaller periodically.



On 10/1/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Hi !

 I'm really new to Solr !

 Could anybody please explain me with a short example how I can setup a
 simple Solr replication with 3 machines (a master node and 2 slaves) ?

 This is my conf:

 * master (linux 2.6.20) :
 - Hostname solr.master with IP 192.168.1.1
 * 2 slaves (linux 2.6.20) :
 - Hostname solr.slave1 with IP 192.168.1.2
 - Hostname solr.slave2 with IP 192.168.1.3

 N.B: sorry if the question was already asked before, but I could't find
 anything better than the CollectionDistribution on the Wiki.

 Regards
 Y.




-- 
Regards,

Cuong Hoang


Re: Re: Re: Solr replication

2007-10-01 Thread climbingrose
Running the bin/commit script should trigger a refresh. However, this command is
executed as part of snapinstaller, so you shouldn't have to run it manually.

On 10/1/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 One more question about replication.

 Now that the replication is working, how can I see the changes on slave
 nodes ?

 The page statistics :

 http://solr.slave1:8983/solr/admin/stats.jsp;

 doesn't reflect the correct number of indexed documents and still shows
 numDocs=0.

 Is there any command to tell Solr (on slave node) to sync itself with
 disk ?

 cheers
 Y.

 Message d'origine
 De: [EMAIL PROTECTED]
 A: solr-user@lucene.apache.org
 Sujet: Re: Re: Solr replication
 Date: Mon,  1 Oct 2007 15:00:46 +0200
 
 Works like a charm. Thanks very much.
 
 cheers
 Y.
 
 Message d'origine
 Date: Mon, 1 Oct 2007 21:55:30 +1000
 De: climbingrose
 A: solr-user@lucene.apache.org
 Sujet: Re: Solr replication
 
 1)On solr.master:
 +Edit scripts.conf:
 solr_hostname=localhost
 solr_port=8983
 rsyncd_port=18983
 +Enable and start rsync:
 rsyncd-enable; rsyncd-start
 +Run snapshooter:
 snapshooter
 After running this, you should be able to see a new folder named
 snapshot.*
 in data/index folder.
 You can can solrconfig.xml to trigger snapshooter after a commit or
 optimise.
 
 2) On slave:
 +Edit scripts.conf:
 solr_hostname=solr.master
 solr_port=8986
 rsyncd_port=18986
 data_dir=
 webapp_name=solr
 master_host=localhost
 master_data_dir=$MASTER_SOLR_HOME/data/
 master_status_dir=$MASTER_SOLR_HOME/logs/clients/
 +Run snappuller:
 snappuller -P 18983
 +Run snapinstaller:
 snapinstaller
 
 You should setup crontab to run snappuller and snapinstaller
 periodically.
 
 
 
 On 10/1/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
  Hi !
 
  I'm really new to Solr !
 
  Could anybody please explain me with a short example how I can setup a
  simple Solr replication with 3 machines (a master node and 2 slaves) ?
 
  This is my conf:
 
  * master (linux 2.6.20) :
  - Hostname solr.master with IP 192.168.1.1
  * 2 slaves (linux 2.6.20) :
  - Hostname solr.slave1 with IP 192.168.1.2
  - Hostname solr.slave2 with IP 192.168.1.3
 
  N.B: sorry if the question was already asked before, but I could't
 find
  anything better than the CollectionDistribution on the Wiki.
 
  Regards
  Y.
 
 
 
 
 --
 Regards,
 
 Cuong Hoang
 
 
 




-- 
Regards,

Cuong Hoang


Re: can solr do it?

2007-09-25 Thread climbingrose
I don't think you can with the current Solr because each instance runs in a
separate web app.

On 9/25/07, James liu [EMAIL PROTECTED] wrote:

 if you use multiple Solr instances with one index, each will cache individually.

 so I wonder whether they can share their cache (they have the same config).

 --
 regards
 jl




-- 
Regards,

Cuong Hoang


Synchronize large number of records with Solr

2007-09-14 Thread climbingrose
Hi all,

I've been struggling to find a good way to synchronize Solr with a large
number of records. We collect our data from a number of sources and each
source produces around 50,000 docs. Each of these documents has a sourceId
field indicating the source of the document. Now assuming we're indexing all
documents from SourceA (sourceId=SourceA), the majority of these docs are
already in Solr and we don't want to update them. However, there might be
some docs in Solr that are not in the batch, and we do want to delete them from
the index. So in summary:

1) If a doc is already in Solr, do nothing
2) If a doc is in the batch but not in Solr, index it
3) If a doc is in Solr but not in the batch, remove it from Solr.

The tricky part is 1) because if not for that requirement, I could just simply
delete all documents with sourceId=SourceA and reindex all documents from
SourceA. Any suggestions?

Thanks.

-- 
Regards,

Cuong Hoang


Re: Synchronize large number of records with Solr

2007-09-14 Thread climbingrose
Hi Erik,

So in your case #1, documents are reindexed with this scheme - so if you
truly need to skip a reindexing for some reason (why, though?) you'll
need to come up with some other mechanism.  [perhaps update could be
enhanced to allow ignoring a duplicate id rather than reindexing?]

It's pretty easy to ignore duplicate ids during indexing but it won't solve
my problem. I think the batch number works well in your case because you
reindex existing documents, which will get the updated batch number. In my
case, I can't update existing documents and therefore, even if I use this
approach, there is no way to know whether a document is to be deleted. I think I
will need to store all ids in the batch in a DocSet and then compare them with
the list of all ids after indexing. That way I can at least get rid of all
expired documents. It's just not as elegant as using a batch identifier.
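
A rough Solrj sketch of that id-diff idea (the field names id/sourceId follow
this thread; rows is set artificially high here just to keep the sketch short,
a real version should page through the results):

import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;

public class SourceSync {
    // Deletes every doc of the given source that is not in the incoming batch.
    public static void removeExpired(SolrServer server, String sourceId,
                                     Set<String> batchIds) throws Exception {
        SolrQuery query = new SolrQuery("sourceId:" + sourceId);
        query.setFields("id");
        query.setRows(Integer.MAX_VALUE);
        for (SolrDocument doc : server.query(query).getResults()) {
            String id = (String) doc.getFieldValue("id");
            if (!batchIds.contains(id)) {
                server.deleteById(id);  // in Solr but not in the batch -> expired
            }
        }
        server.commit();
    }
}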


Re: Searching Versioned Resources

2007-09-12 Thread climbingrose
I think you can use the CollapseFilter to collapse on version field.
However, I think you need to modify the CollapseFilter code to sort by
version and get the latest version returned.

On 9/13/07, Adrian Sutton [EMAIL PROTECTED] wrote:

 Hi all,
 The document's we're indexing are versioned and generally we only
 want search results to return the latest version of a document,
 however there's a couple of scenarios where I'd like to be able to
 include previous versions in the search result.

 It feels like a straight-forward case of a filter, but given that
 each document has independent version numbers it's hard to know what
 to filter on. The only solution I can think of at the moment is to
 index each new version twice - once with the version and once with
 version=latest. We'd then tweak the ID field in such a way that there
 is only one version of each document with version=latest. It's then
 simple to use a filter for version=latest when we search.

 Is there a better way? Is there a way to achieve this without having
 to index the document twice?

 Thanks in advance,

 Adrian Sutton
 http://www.symphonious.net






-- 
Regards,

Cuong Hoang


Re: Embedded about 50% faster for indexing

2007-08-27 Thread climbingrose
Agree. I was actually thinking of developing the embedded version early this
year for one of my projects. I'm sure it will be needed in cases where
running another web server is overkill.

On 8/28/07, Jonathan Woods [EMAIL PROTECTED] wrote:

 I don't think you should apologise for highlighting embedded usage.  For
 circumstances in which you're at liberty to run a Solr instance in the
 same
 JVM as an app which uses it, I find it very strange that you should have
 to
 use anything _other_ than embedded, and jump through all the unnecessary
 hoops (XML conversion, HTTP transport) that this implies.  It's a bit like
 suggesting you should throw away Java method invocations altogether, and
 write everything in XML-RPC.

 Bit of a pet issue of mine!  I'll be creating a JIRA issue on the subject
 soon.

 Jon

  -Original Message-
  From: Sundling, Paul [mailto:[EMAIL PROTECTED]
  Sent: 28 August 2007 03:24
  To: solr-user@lucene.apache.org
  Subject: RE: Embedded about 50% faster for indexing
 
  At this point I think I'm going recommend against embedded,
  regardless of any performance advantage.  The level of
  documentation is just too low, while the XML API is clearly
  documented.  It's clear that XML is preferred.
 
  The embedded example on the wiki is pretty good, but until
  mutliple core support comes out in the next version, you have
  to use multiple SolrCore.  If they are accessed in the same
  webapp, then you can't just set JNDI (since you can only have
  one value).  So you have to use a Config object as alluded to
  in the example.  However, you look at the code and there is
  no javadoc for the constructor.  The constructor args are
  (String name, InputStream is, String prefix).  I think name
  is a unique name for the solr core, but that is a guess.
  Inputstream may be a stream to the solr home, but it could be
  anything.  Prefix may be a URI prefix.  These are all guesses
  without trying to read through the code.
 
  When I look at SolrCore, it looks like it's a singleton, so
  maybe I can't even access more than one SolrCore using
  embedded anyway.  :(  So I apologize for highlighting Embedded.
 
  Anyway it's clear how to do multiple solr cores using XML.
  You just have different post URI for the difference cores.
  You can easily inject that with Spring and externalize the
  config.  Simple and easy.  So I concede XML is the way to go. :)
 
  Paul Sundling
 
  -Original Message-
  From: Mike Klaas [mailto:[EMAIL PROTECTED]
  Sent: Monday, August 27, 2007 5:50 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Embedded about 50% faster for indexing
 
 
  On 27-Aug-07, at 12:44 PM, Sundling, Paul wrote:
 
   Whether embedded solr should give me a performance boost or not, it
   did.
   :)  I'm not surprised, since it skips XML parsing.
  Although you never
   know where cycles are used for sure until you profile.
 
  It certainly is possible that XML parsing dwarfs indexing, but I'd
  expect that only to occur under very light analysis and field
  storage
  workloads.
 
   I tried doing more records per post (200) and it was
  actually slightly
 
   slower and seemed to require more memory.  This makes sense because
   you
   have to take up more memory for the StringBuilder to store the much
   larger XML.  For 10,000 it was much slower.  For that size I would
   need
   to XML streaming or something to make it work.
  
   The solr war was on the same machine, so network overhead was only
   from
   using loopback.
 
  The big question is still your connection handling strategy:
  are you
  using persistent http connections?  Are you threadedly indexing?
 
  cheers,
  -Mike
 
   Paul Sundling
  
   -Original Message-
   From: climbingrose [mailto:[EMAIL PROTECTED]
   Sent: Monday, August 27, 2007 12:22 AM
   To: solr-user@lucene.apache.org
   Subject: Re: Embedded about 50% faster for indexing
  
  
   Haven't tried the embedded server but I think I have to agree with
   Mike.
   We're currently sending 2000 job batches to SOLR server and
  the amount
   of time required to transfer documents over http is insignificant
   compared with the time required to index them. So I do
  think unless
   you
   are sending document one by one, embedded SOLR shouldn't
  give you much
   more performance boost.
  
   On 8/25/07, Mike Klaas [EMAIL PROTECTED] wrote:
  
   On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:
  
   -Original Message-
   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
   Yonik Seeley
   Sent: Friday, August 24, 2007 2:07 PM
   To: solr-user@lucene.apache.org
   Subject: Re: Embedded about 50% faster for indexing
  
   One thing I'd like to avoid is everyone trying to embed just for
   performance gains. If there is really that much
  difference, then we
  
   need a better way for people to get that without
  resorting to Java
   code.
  
   -Yonik
  
  
   Theoretically and practically, embedded solution will be
  faster

Re: Spell Check Handler

2007-08-17 Thread climbingrose
Thanks Karl. I'll check it out!

On 8/18/07, karl wettin [EMAIL PROTECTED] wrote:

 I updated LUCENE-626 last night. It should now run smooth without
 LUCENE-550, but smoother with.

 Perhaps it is something you can use.


 12 aug 2007 kl. 14.24 skrev climbingrose:

  I'm happy to contribute code for the SpellCheckerRequestHandler.
  I'll post
  the code once I strip off stuff related to our product.
 
  On 8/12/07, Pieter Berkel [EMAIL PROTECTED] wrote:
 
  http://issues.apache.org/jira/browse/LUCENE-626On 11/08/07,
  climbingrose
  [EMAIL PROTECTED] wrote:
 
  That's exactly what I did with my custom version of the
  SpellCheckerHandler.
  However, I didn't handle suggestionCount and only returned the one
  corrected
  phrase which contains the best corrected terms. There is an
  issue on
  Lucene issue tracker regarding multi-word spellchecker:
 
  https://issues.apache.org/jira/browse/LUCENE-550?
  page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 
 
 
  I'd be interested to take a look at your modifications to the
  SpellCheckerHandler, how did you handle phrase queries? maybe we
  can open
  a
  JIRA issue to expand the spell checking functionality to perform
  analysis
  on
  multi-word input values.
 
  I did find http://issues.apache.org/jira/browse/LUCENE-626 after
  looking
  at
  LUCENE-550, but since these patches are not yet included in the
  Lucene
  trunk
  yet it might be a little difficult to justify implementing them in
  Solr.
 
 
 
 
  --
  Regards,
 
  Cuong Hoang




-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-11 Thread climbingrose
That's exactly what I did with my custom version of the SpellCheckerHandler.
However, I didn't handle suggestionCount and only returned the one corrected
phrase which contains the best corrected terms. There is an issue on
Lucene issue tracker regarding multi-word spellchecker:
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
.


On 8/11/07, Pieter Berkel [EMAIL PROTECTED] wrote:

 On 11/08/07, climbingrose [EMAIL PROTECTED] wrote:
 
  The spellchecker handler doesn't seem to work with multi-word query. For
  example, when I tried to spellcheck Java developar, it returns nothing
  while if I tried developar, spellchecker correctly returns
 developer.
  I
  followed the setup on the wiki.


 While I suppose the general case for using the spelling checker would be a
 query containing a single misspelled word, it would be quite useful if the
 handler applied the analyzer specified by the termSourceField fieldType to
 the query input and then checked the spelling of each query token. This
 would seem to be the most flexible way of supporting multi-word queries
 (provided the termSourceField didn't use any stemmer filters I suppose).

 Piete




-- 
Regards,

Cuong Hoang


Re: FunctionQuery and boosting documents using date arithmetic

2007-08-11 Thread climbingrose
I'm having trouble with the date boosting function as well. I'm using this
function: F = recip(rord(creationDate),1,1000,1000)^10. However, since I have
around 10,000 documents added in one day, rord(createDate) returns very
different values for the same createDate. For example, the most recently added
document will have rord(createdDate) = 1 while the first document added that day
will have rord(createdDate) = 10,000. When rord(createdDate) > 10,000, the value
of F approaches 0. Therefore, the boost query doesn't make any difference
between the last document added today and a document added 10 days
ago. Now if I replace 1000 in F with a large number, say 10, the boost
function suddenly gives the last few documents an enormous boost and makes the
other query scores irrelevant.
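
For reference, my understanding of FunctionQuery is that recip(x,m,a,b)
evaluates to a/(m*x + b), so with my parameters the value (before the ^10
boost weight) is roughly:

    F = 1000 / (rord(creationDate) + 1000)

which is about 0.999 for the newest document (rord = 1) but only about 0.09
once rord reaches 10,000 -- hence the flattening I'm seeing.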

So in my case (and many others' I believe), the true date value would be
more appropriate. I'm thinking along the same lines of adding a timestamp. It
wouldn't add much overhead this way, would it?

Regards,



On 8/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:


 : Actually, just thinking about this a bit more, perhaps adding a function
 : call such as parseDate() might add too much overhead to the actual
 query,
 : perhaps it would be better to first convert the date to a timestamp at
 index
 : time and store it in a field type slong?  This might be more efficient
 but

 i would agree with you there, this is where a more robust (ie:
 less efficient) DateField-ish class that supports configuration options
 to specify:
   1) the output format
   2) the input format(s)
   3) the indexed format
 ...as SimpleDateFormatter pattern strings would be handy.  The
 ValueSource it uses could return seconds (or some other unit based on
 another config option) since epoch as the intValue.

 it's been discussed before, but there are a lot of tricky issues involved
 which is probably why no one has really tackled it.

 : that still leaves the problem of obtaining the current timestamp to use
 in
 : the boost function.

 it would be pretty easy to write a ValueSource that just knew about now
 as seconds since epoch.

 :  While it seems to work pretty well, I've realised that this may not be
 :  quite as effective as i had hoped given that the calculation is based
 on the
 :  ordinal of the field value rather than the value of the field
 itself.  In
 :  cases where the field type is 'date' and the actual field values are
 not
 :  distributed evenly across all documents in the index, the value
 returned by
 :  rord() is not going to give a true reflection of document age.  For
 example,

 be careful what you wish for.  you are 100% correct that functions using
 the (r)ord value of a DateField aren't a function of true age, but
 depending on how you look at it that may be better than using the real age
 (i think so anyway).  While it sounds appealing to say that docA should
 score half as high as docB if it is twice as old, that typically isn't all
 that important when dealing with recent dates; and when dealing with older
 dates the ordinal value tends to approximate it decently well ... where a
 true measure of age might screw you up is when you have situations where
 few/no new articles get published on weekends (or late at night).  it's
 also very confusing to people when the ordering of documents changes even
 though no new documents have been published -- that can easily happen if
 you are heavily boosting on a true age calculation but will never happen
 when dealing with an ordinal ranking of documents by age.

 (although this could be compensated for by doing all of your true age
 calculations relative to the min age of all articles in your index -- but
 you would still get really weird 'big' shifts in scores as soon as that
 first article gets published on monday morning.)


 -Hoss




-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-11 Thread climbingrose
Yeah. How stable is the patch Karl? Is it possible to use it in a production
environment?

On 8/12/07, karl wettin [EMAIL PROTECTED] wrote:


 11 aug 2007 kl. 10.36 skrev climbingrose:

  There is an issue on
  Lucene issue tracker regarding multi-word spellchecker:
  https://issues.apache.org/jira/browse/LUCENE-550

 I think you mean LUCENE-626 that sort of depends on LUCENE-550.


 --
 karl






-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-10 Thread climbingrose
The spellchecker handler doesn't seem to work with multi-word queries. For
example, when I tried to spellcheck "Java developar", it returns nothing,
while if I try "developar", the spellchecker correctly returns "developer". I
followed the setup on the wiki.

Regards,

Cuong Hoang

On 7/10/07, Charles Hornberger [EMAIL PROTECTED] wrote:

 For what it's worth, I recently did a quick implementation of the
 spellchecker feature, and I simply created another field in my schema
 (Iike 'spell' in Tristan's example below). After feeding content into
 my search index, I used the spell field into add one single-field
 document for every distinct word in my document collection (I'm
 assuming the content folks have run spell-checkers :-)). E.g.:

 docfield name=spellaardvark/field/doc
 docfield name=spellabacus/field/doc
 docfield name=spellabbot/field/doc
 docfield name=spellacacia/field/doc
 etc.

 I also added some extra documents for proper names that appear in my
 documents. For instance, there are a couple fields that have
 comma-separated list of names, so I for each of those -- in addition
 to documents for john, doe, and jane, which were generated by
 the naive word-splitting done in the first pass -- I added documents
 like so:

 docfield name=spelljohn doe/field/doc
 docfield name=spelljane doe/field/doc
 etc.

 You could do the same for other searchable multi-word tokens in your
 input -- song/album/book/movie titles, publisher names, geographic
 names (cities, neighborhoods, etc.), product names, and so on.

 -Charlie

 On 7/9/07, Tristan Vittorio [EMAIL PROTECTED] wrote:
  I think there is some confusion regarding how the spell checker actually
  uses the termSourceField.  It is suggested that you use a simple field
 type
  such a string, however since this field type does not tokenize or
 split
  words, it is only useful in situations where the whole field is
 considered a
  dictionary word:
 
  add
  doc
  field name=titleAccountant/field
  
 http://localhost:8984/solr/select/?q=Accountentqt=spellcheckercmd=rebuildand
 field
  name=titleAuditor/field
  field name=titleSolicitor/field
  /doc
  /add
 
  The follow example case will not work with spell checker since the whole
  field is considered a single word or string:
 
  add
  doc
  field name=titleAccountant reveals that Accounting is boring/field
  /doc
  /add
 
  I might suggest that you create an additional field in your schema that
  takes advantage of the StandardTokenizer and StandardFilter which
 doesn't
  perform a great deal of processing on the field yet should provide
 decent
  results when used with the spell checker:
 
  fieldType name=spell class=solr.TextField
 positionIncrementGap=100
analyzer type=index
  tokenizer class=solr.StandardTokenizerFactory/
  filter class=solr.StopFilterFactory ignoreCase=true words=
  stopwords.txt/
  filter class=solr.StandardFilterFactory/
  filter class=solr.RemoveDuplicatesTokenFilterFactory/
/analyzer
analyzer type=query
  tokenizer class=solr.StandardTokenizerFactory/
  filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
  ignoreCase=true expand=true/
  filter class=solr.StopFilterFactory ignoreCase=true words=
  stopwords.txt/
  filter class=solr.StandardFilterFactory/
  filter class=solr.RemoveDuplicatesTokenFilterFactory/
/analyzer
  /fieldType
 
  If you want this field to be automatically populated with the contents
 of
  the title field when a document is added to the index, simply use a
  copyField:
 
  copyField source=title dest=spell/
 
  Hope this helps, let me know if this is still not clear, I probably will
 add
  it to the wiki page soon.
 
  cheers,
  Tristan
 
 
 
  On 7/9/07, climbingrose [EMAIL PROTECTED] wrote:
  
   Thanks for the quick reply. However, I'm still not able to setup
   spellchecker. Solr does create spell directory under data but doesn't
 seem
   to build the spellchecker index. Here are snippets of my schema.xml:
  
   field name=title type=string indexed=true stored=true/
  
   requestHandler name=spellchecker class=
 solr.SpellCheckerRequestHandler
   
   startup=lazy
   !-- default values for query parameters --
lst name=defaults
  int name=suggestionCount1/int
  float name=accuracy0.5/float
/lst
  
!-- Main init params for handler --
  
!-- The directory where your SpellChecker Index should live.
 --
!-- May be absolute, or relative to the Solr dataDir
 directory.
   --
!-- If this option is not specified, a RAM directory will be
 used
   --
str name=spellcheckerIndexDirspell/str
  
!-- the field in your schema that you want to be able to build
 --
!-- your spell index on. This should be a field that uses a very
 --
!-- simple FieldType without a lot of Analysis (ie: string) --
str name=termSourceFieldtitle/str
  
  /requestHandler
  
   I tried this url:
  
  
 http://localhost:8984/solr/select/?q

Re: Spell Check Handler

2007-08-10 Thread climbingrose
After looking at the SpellChecker code, I realised that it only supports
single-word queries. I made a very naive modification of SpellCheckerHandler to
get multi-word support. Now the other problem that I have is how to have
different fields in the SpellChecker index. For example, since my query has two
parts, description and location, I don't want to build a spellchecker
index which combines both description and location into one
termSourceField. I want to check the description part against the description
field in the spellchecker index and the location part against the location
field in the index. Otherwise I might get irrelevant suggestions for the
location part, since the number of terms in location is generally much smaller
compared with that of description. Any ideas?

Thanks.

On 8/11/07, climbingrose [EMAIL PROTECTED] wrote:

 The spellchecker handler doesn't seem to work with multi-word query. For
 example, when I tried to spellcheck Java developar, it returns nothing
 while if I tried developar, spellchecker correctly returns developer.
 I followed the setup on the wiki.

 Regards,

 Cuong Hoang

 On 7/10/07, Charles Hornberger [EMAIL PROTECTED] wrote:
 
  For what it's worth, I recently did a quick implementation of the
  spellchecker feature, and I simply created another field in my schema
  (Iike 'spell' in Tristan's example below). After feeding content into
  my search index, I used the spell field into add one single-field
  document for every distinct word in my document collection (I'm
  assuming the content folks have run spell-checkers :-)). E.g.:
 
  docfield name=spellaardvark/field/doc
  docfield name=spellabacus/field/doc
  docfield name=spellabbot/field/doc
  docfield name=spellacacia/field/doc
  etc.
 
  I also added some extra documents for proper names that appear in my
  documents. For instance, there are a couple fields that have
  comma-separated list of names, so I for each of those -- in addition
  to documents for john, doe, and jane, which were generated by
  the naive word-splitting done in the first pass -- I added documents
  like so:
 
  docfield name=spelljohn doe/field/doc
  docfield name=spelljane doe/field/doc
  etc.
 
  You could do the same for other searchable multi-word tokens in your
  input -- song/album/book/movie titles, publisher names, geographic
  names (cities, neighborhoods, etc.), product names, and so on.
 
  -Charlie
 
  On 7/9/07, Tristan Vittorio [EMAIL PROTECTED] wrote:
   I think there is some confusion regarding how the spell checker
  actually
   uses the termSourceField.  It is suggested that you use a simple field
  type
   such a string, however since this field type does not tokenize or
  split
   words, it is only useful in situations where the whole field is
  considered a
   dictionary word:
  
   add
   doc
   field name=titleAccountant/field
   http://localhost:8984/solr/select/?q=Accountentqt=spellcheckercmd=rebuildand
  field
   name=titleAuditor/field
   field name=titleSolicitor/field
   /doc
   /add
  
   The follow example case will not work with spell checker since the
  whole
   field is considered a single word or string:
  
   add
   doc
   field name=titleAccountant reveals that Accounting is
  boring/field
   /doc
   /add
  
   I might suggest that you create an additional field in your schema
  that
   takes advantage of the StandardTokenizer and StandardFilter which
  doesn't
   perform a great deal of processing on the field yet should provide
  decent
   results when used with the spell checker:
  
   fieldType name=spell class=solr.TextField
  positionIncrementGap=100
 analyzer type=index
   tokenizer class=solr.StandardTokenizerFactory /
   filter class=solr.StopFilterFactory ignoreCase=true words=
   stopwords.txt/
   filter class=solr.StandardFilterFactory/
   filter class=solr.RemoveDuplicatesTokenFilterFactory/
 /analyzer
 analyzer type=query
   tokenizer class=solr.StandardTokenizerFactory /
   filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
   ignoreCase=true expand=true/
   filter class=solr.StopFilterFactory  ignoreCase=true words=
   stopwords.txt/
   filter class=solr.StandardFilterFactory/
   filter class=solr.RemoveDuplicatesTokenFilterFactory /
 /analyzer
   /fieldType
  
   If you want this field to be automatically populated with the contents
  of
   the title field when a document is added to the index, simply use a
   copyField:
  
   copyField source=title dest=spell/
  
   Hope this helps, let me know if this is still not clear, I probably
  will add
   it to the wiki page soon.
  
   cheers,
   Tristan
  
  
  
   On 7/9/07, climbingrose [EMAIL PROTECTED] wrote:
   
Thanks for the quick reply. However, I'm still not able to setup
spellchecker. Solr does create spell directory under data but
  doesn't seem
to build the spellchecker index. Here are snippets of my schema.xml:
   
field name=title type=string indexed=true stored=true

Re: Spell Check Handler

2007-08-10 Thread climbingrose
OK, I just need to define 2 spellcheckers in solrconfig.xml for my purpose.

On 8/11/07, climbingrose [EMAIL PROTECTED] wrote:

 After looking the SpellChecker code, I realised that it only supports
 single-word. I made a very naive modification of SpellCheckerHandler to get
 multi-word support. Now the other problem that I have is how to have
 different fields in SpellChecker index. For example, since my query has two
 parts: description and location, I don't want to build a spellchecker
 index which combines both description and location into one
 termSourceField. I want to check description part with the description
 field in the spellchecker index and location part with location field in
 the index. Otherwise I might have irrelevant suggestions for the location
 part since the number of terms in location is generally much smaller
 compared with that of description. Any ideas?

 Thanks.

 On 8/11/07, climbingrose [EMAIL PROTECTED] wrote:
 
  The spellchecker handler doesn't seem to work with multi-word query. For
  example, when I tried to spellcheck Java developar, it returns nothing
  while if I tried developar, spellchecker correctly returns
  developer. I followed the setup on the wiki.
 
  Regards,
 
  Cuong Hoang
 
  On 7/10/07, Charles Hornberger  [EMAIL PROTECTED] wrote:
  
   For what it's worth, I recently did a quick implementation of the
   spellchecker feature, and I simply created another field in my schema
   (Iike 'spell' in Tristan's example below). After feeding content into
   my search index, I used the spell field into add one single-field
   document for every distinct word in my document collection (I'm
   assuming the content folks have run spell-checkers :-)). E.g.:
  
   docfield name=spellaardvark/field/doc
   docfield name=spellabacus/field/doc
   docfield name=spellabbot/field/doc
   docfield name=spellacacia/field/doc
   etc.
  
   I also added some extra documents for proper names that appear in my
   documents. For instance, there are a couple fields that have
   comma-separated list of names, so I for each of those -- in addition
   to documents for john, doe, and jane, which were generated by
   the naive word-splitting done in the first pass -- I added documents
   like so:
  
   docfield name=spelljohn doe/field/doc
   docfield name=spelljane doe/field/doc
   etc.
  
   You could do the same for other searchable multi-word tokens in your
   input -- song/album/book/movie titles, publisher names, geographic
   names (cities, neighborhoods, etc.), product names, and so on.
  
   -Charlie
  
   On 7/9/07, Tristan Vittorio [EMAIL PROTECTED] wrote:
I think there is some confusion regarding how the spell checker
   actually
uses the termSourceField.  It is suggested that you use a simple
   field type
such a string, however since this field type does not tokenize or
   split
words, it is only useful in situations where the whole field is
   considered a
dictionary word:
   
add
doc
field name=titleAccountant/field
http://localhost:8984/solr/select/?q=Accountentqt=spellcheckercmd=rebuildand
   field
name=titleAuditor/field
field name=titleSolicitor/field
/doc
/add
   
The follow example case will not work with spell checker since the
   whole
field is considered a single word or string:
   
add
doc
field name=titleAccountant reveals that Accounting is
   boring/field
/doc
/add
   
I might suggest that you create an additional field in your schema
   that
takes advantage of the StandardTokenizer and StandardFilter which
   doesn't
perform a great deal of processing on the field yet should provide
   decent
results when used with the spell checker:
   
fieldType name=spell class=solr.TextField
   positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory /
filter class=solr.StopFilterFactory ignoreCase=true words=
stopwords.txt/
filter class=solr.StandardFilterFactory/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory /
filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
   
ignoreCase=true expand=true/
filter class=solr.StopFilterFactory  ignoreCase=true
   words=
stopwords.txt/
filter class=solr.StandardFilterFactory/
filter class=solr.RemoveDuplicatesTokenFilterFactory /
  /analyzer
/fieldType
   
If you want this field to be automatically populated with the
   contents of
the title field when a document is added to the index, simply use a
copyField:
   
copyField source=title dest=spell/
   
Hope this helps, let me know if this is still not clear, I probably
   will add
it to the wiki page soon.
   
cheers,
Tristan
   
   
   
On 7/9/07, climbingrose [EMAIL PROTECTED]  wrote:

 Thanks for the quick reply

Date rounding up

2007-08-08 Thread climbingrose
Hi all,

I think there might be something wrong with the date time rounding up. I
tried this query: q=*:*fq=listedDate:[NOW/DAY-1DAY TO *] which I think
should return results since yesterday. So if today is 9th of August, it
should return all results from the 8th of August. However, Solr returns also
returns result from the 7th of August. Any idea?

-- 
Regards,

Cuong Hoang


Re: mandatory and optional fields in the dismaxrequesthandler

2007-07-30 Thread climbingrose
I think I have the same question as Arnaud. For example, my dismax query has
qf="title^5 description^2". Now if I search for "Java developer", I want to
make sure that the results have at least "java" or "developer" in the title.
Is this possible with a dismax query?

On 7/30/07, Chris Hostetter [EMAIL PROTECTED] wrote:


 : Is it possible to specify precisely one or more mandatory fields in a
 : DismaxRequestHandler?

 what would the semantics making a field mandatory mean?  considering your
 specific example...

 :  <str name="qf">
 : text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
 :  </str>
 :  <str name="bla">
 :+text +feature name manu
 : </str>
 :
 : where 'text' and 'feature' are mandatory and 'name' and 'manu' are
 : optional fields.

 if text and feature are mandatory but name and manu are not, how are the
 other fields in the qf treated?

 if the q param is: "albino elephant" ... what would it mean that text and
 feature are mandatory?  do both words have to appear in text and in
 feature, or just one in each?




 -Hoss




-- 
Regards,

Cuong Hoang


DisMax query and date boosting

2007-07-19 Thread climbingrose

Hi all,

I'm puzzling over how to boost a date field in a DisMax query. Atm, my qf is
"title^5 summary^1". However, what I really want to do is to allow documents
with the latest listedDate to have a better score. For example, documents with
listedDate:[NOW-1DAY TO *] should get an additional score boost over documents
with listedDate:[* TO NOW-10DAY]. Any idea?

--
Regards,

Cuong Hoang


Re: DisMax query and date boosting

2007-07-19 Thread climbingrose

Thanks for both answers. Which one is better in terms of performance? bq or
bf?

On 7/20/07, Daniel Alheiros [EMAIL PROTECTED] wrote:


Sorry just correcting myself:
<str name="bq">your_date_field:[NOW-24HOURS TO NOW]^10.0</str>

Regards,
Daniel

On 19/7/07 15:25, Daniel Alheiros [EMAIL PROTECTED] wrote:

 I think in this case you can use a bq (Boost Query) so you can apply
this
 boost to the range you want.

 <str name="bq">your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0</str>

 This example will boost your documents with date within the last 24h.

 Regards,
 Daniel

 On 19/7/07 14:45, climbingrose [EMAIL PROTECTED] wrote:

 Hi all,

 I'm puzzling over how to boost a date field in a DisMax query. Atm, my
qf is
 title^5 summary^1. However, what I really want to do is to allow
document
 with latest listedDate to have better score. For example, documents
with
 listedDate:[NOW-1DAY TO *] have additional score over documents with
 listedDate:[* TO NOW-10DAY]. Any idea?


 http://www.bbc.co.uk/
 This e-mail (and any attachments) is confidential and may contain
personal
 views which are not the views of the BBC unless specifically stated.
 If you have received it in error, please delete it from your system.
 Do not use, copy or disclose the information in any way nor act in
reliance on
 it and notify the sender immediately.
 Please note that the BBC monitors e-mails sent or received.
 Further communication will signify your consent to this.



http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in
reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.





--
Regards,

Cuong Hoang


Re: DisMax query and date boosting

2007-07-19 Thread climbingrose

Just tried the bq approach and it works beautifully. Exactly what I was
looking for. Still, I'd like to know which approach is the preferred? Thanks
again guys.

On 7/20/07, climbingrose [EMAIL PROTECTED] wrote:


Thanks for both answers. Which one is better in terms of performance? bq
or bf?

On 7/20/07, Daniel Alheiros  [EMAIL PROTECTED] wrote:

 Sorry just correcting myself:
 str name=bqyour_date_field:[NOW-24HOURS TO NOW]^ 10.0/str

 Regards,
 Daniel

 On 19/7/07 15:25, Daniel Alheiros [EMAIL PROTECTED] wrote:

  I think in this case you can use a bq (Boost Query) so you can apply
 this
  boost to the range you want.
 
  str name=bqyour_date_field:[NOW/DAY-24HOURS TO NOW]^10.0/str
 
  This example will boost your documents with date within the last 24h.
 
  Regards,
  Daniel
 
  On 19/7/07 14:45, climbingrose [EMAIL PROTECTED] wrote:
 
  Hi all,
 
  I'm puzzling over how to boost a date field in a DisMax query. Atm,
 my qf is
  title^5 summary^1. However, what I really want to do is to allow
 document
  with latest listedDate to have better score. For example, documents
 with
  listedDate:[NOW-1DAY TO *] have additional score over documents with
  listedDate:[* TO NOW-10DAY]. Any idea?
 
 




--
Regards,

Cuong Hoang





--
Regards,

Cuong Hoang


Re: DisMax query and date boosting

2007-07-19 Thread climbingrose

Thanks for the answer Chris. The DisMax query handler is just amazing!

On 7/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: Just tried the bq approach and it works beautifully. Exactly what I was
: looking for. Still, I'd like to know which approach is the preferred?
Thanks
: again guys.

i personally recommend the function approach, because it gives you a more
gradual falloff in terms of the scores of documents ... the bq approach
works great for simple boosting where things in the last N days should score
really high, but 1 millisecond after that cutoff the score plummets
immediately.

side note...

:   Sorry just correcting myself:
:   <str name="bq">your_date_field:[NOW-24HOURS TO NOW]^10.0</str>

the first example is perfectly fine, and will be more efficient because it
will take better advantage of the field cache...

:    <str name="bq">your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0</str>

...if you don't round down to the nearest day, then every request will
generate a new query which will get put in the filterCache.  if a day
isn't granular enough for you, you can round to the nearest hour (or even
minute) but i strongly suggest you round to something so you don't wind up
using millisecond precision

<str name="bq">your_date_field:[NOW/HOUR-1DAY TO NOW]^10.0</str>
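
For comparison, the function approach recommended above would go in the
dismax handler's bf parameter instead of bq; a minimal sketch, where the
recip/rord constants are illustrative assumptions rather than tuned values:

<str name="bf">recip(rord(your_date_field),1,1000,1000)</str>

This decays the date boost gradually as documents get older instead of
dropping it to zero at a fixed cutoff.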



-Hoss





--
Regards,

Cuong Hoang


Slow facet with custom Analyser

2007-07-16 Thread climbingrose

Hi all,

My facet browsing performance has been decent on my system until I added my
custom Analyser. Initially, I facetted the title field, which is of the
default string type (no analysers, tokenisers...), and got quick responses
(the first query is just under 1s, subsequent queries are < 0.1s). I created
a custom analyser which is not much different from the DefaultAnalyzer in the
FieldType class. Essentially, this analyser doesn't do any tokenisation; it
only converts the value to lower case and removes spaces, unwanted chars and
words. After I applied the analyser to the title field, facet performance
degraded considerably. Every query is now > 1.2s and the filterCache hit
ratio is extremely small:

lookups : 918485

hits : 23
hitratio : 0.00
inserts : 918487
evictions : 917971
size : 512
cumulative_lookups : 918485
cumulative_hits : 23
cumulative_hitratio : 0.00
cumulative_inserts : 918487
cumulative_evictions : 917971



Any idea? Here is my analyser code:

import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.SolrAnalyzer;

public class FacetTextAnalyser extends SolrAnalyzer {

    final int maxChars;
    final Set<Character> ignoredChars;
    final Set<String> ignoredWords;

    public final static char[] IGNORED_CHARS = {'/', '\\', '\'', '"',
            '#', '&', '!', '?', '*', '<', '>', ','};
    public static final String[] IGNORED_WORDS = {
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
    };

    public FacetTextAnalyser() {
        maxChars = 255;
        ignoredChars = new HashSet<Character>();
        for (int i = 0; i < IGNORED_CHARS.length; i++) {
            ignoredChars.add(IGNORED_CHARS[i]);
        }
        ignoredWords = new HashSet<String>();
        for (int i = 0; i < IGNORED_WORDS.length; i++) {
            ignoredWords.add(IGNORED_WORDS[i]);
        }
    }

    public FacetTextAnalyser(int maxChars, Set<Character> ignoredChars,
            Set<String> ignoredWords) {
        this.maxChars = maxChars;
        this.ignoredChars = ignoredChars;
        this.ignoredWords = ignoredWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new Tokenizer(reader) {
            char[] cbuf = new char[maxChars];

            public Token next() throws IOException {
                int n = input.read(cbuf, 0, maxChars);
                if (n <= 0)
                    return null;
                char[] temp = new char[n];
                int index = 0;
                boolean space = true;
                for (int i = 0; i < n; i++) {
                    char c = cbuf[i];
                    // treat ignored characters as whitespace
                    if (ignoredChars.contains(cbuf[i])) {
                        c = ' ';
                    }
                    if (Character.isWhitespace(c)) {
                        if (space)
                            continue;
                        else {
                            temp[index] = ' ';
                            if (index > 0) {
                                // walk back to the start of the word just finished
                                int j = index - 1;
                                while (temp[j] != ' ' && j > 0) {
                                    j--;
                                }
                                String str = (j == 0) ? new String(temp, 0, index)
                                        : new String(temp, j + 1, index - j - 1);
                                // drop the word if it is in the ignored-words list
                                if (ignoredWords.contains(str))
                                    index = j;
                            }
                            index++;
                            space = true;
                        }
                    } else {
                        temp[index] = Character.toLowerCase(c);
                        index++;
                        space = false;
                    }
                }
                temp[0] = Character.toUpperCase(temp[0]);
                String s = new String(temp, 0, index);
                // the whole cleaned-up value is returned as a single token
                return new Token(s, 0, n);
            }
        };
    }
}



Here is how I declare the analyser:


  <fieldType name="text_em" class="solr.TextField" positionIncrementGap="100">
    <analyzer class="net.jseeker.lucene.FacetTextAnalyser"/>
  </fieldType>




--
Regards,

Cuong Hoang


Re: Slow facet with custom Analyser

2007-07-16 Thread climbingrose

Thanks Yonik. In my case, there is only one title field per document, so is
there a way to force Solr to work the old way? My analyser doesn't break up
the title field into multiple tokens. It only tries to format the field
value (lower-case it, remove unwanted chars and words). Therefore, it's no
different from using the single-valued string type.

I'll try your first recommendation to see how it goes.

Thanks again.

On 7/17/07, Yonik Seeley [EMAIL PROTECTED] wrote:


Since you went from a non multi-valued string type (which Solr knows
has at most one value per document) to a custom analyzer type (which
could produce multiple tokens per document), Solr switched tactics
from using the FieldCache for faceting to using the filterCache.

Right now, you could try to
1) use facet.enum.cache.minDf=1000 (don't use the fieldCache except
for large facets)
2) expand the size of the fieldcache to 100 if you have the memory

Optimizing your index should also speed up faceting (but that is a lot
of facets).
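
As an illustration, the first suggestion is a request parameter and the
second is a solrconfig.xml change; a minimal sketch, where the cache sizes
below are assumptions rather than recommendations:

  facet.enum.cache.minDf=1000

  <filterCache class="solr.LRUCache" size="524288" initialSize="16384" autowarmCount="4096"/>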

-Yonik

On 7/16/07, climbingrose [EMAIL PROTECTED] wrote:
 Hi all,

 My facet browsing performance has been decent on my system until I add
my
 custom Analyser. Initially, I facetted title field which is of default
 string type (no analysers, tokenisers...) and got quick responses (first
 query is just under 1s, subsequent queries are < 0.1s). I created a
custom
 analyser which is not much different from the DefaultAnalyzer in
FieldType
 class. Essentially, this analyzer will not do any tokonisations, but
only
 convert the value into lower case, remove spaces, unwanted chars and
words.
 After I applied the analyser to title field, facet performance
degraded
 considerably. Every query is now > 1.2s and the filterCache hit ratio is
 extremely small:

 lookups : 918485
  hits : 23
  hitratio : 0.00
  inserts : 918487
  evictions : 917971
  size : 512
  cumulative_lookups : 918485
  cumulative_hits : 23
  cumulative_hitratio : 0.00
  cumulative_inserts : 918487
  cumulative_evictions : 917971





--
Regards,

Cuong Hoang


Re: Slow facet with custom Analyser

2007-07-16 Thread climbingrose

I've tried both of your recommendations (facet.enum.cache.minDf=1000 and
optimising the index). The query time is around 0.4-0.5s now, but it's still
slow compared to the old string type. I haven't tried increasing the
filterCache, because a cache that size looks a bit too much for my server
atm. It's quite a pity that we can't force Solr to use the FieldCache. I
think I might pre-process the title field and index it as a string instead
of using the analyser. However, that defeats the purpose of having pluggable
analysers, tokenisers...

On 7/17/07, Yonik Seeley [EMAIL PROTECTED] wrote:


On 7/16/07, climbingrose [EMAIL PROTECTED] wrote:
 Thanks Yonik. In my case, there is only one title field per document
so is
 there a way to force Solr to work the old way? My analyser doesn't break
up
 the title field into multiple tokens. It only tries to format the
field
 value (to lower case, remove unwanted chars and words). Therefore, it's
no
 difference from using string single-valued type.

There is currently no way to force Solr to use the FieldCache method.

Oh, and in
2) expand the size of the fieldcache to 100 if you have the memory
should have been filterCache, not fieldcache.

-Yonik

 I'll try your first recommendation to see how it goes.

faceting typically proceeds much faster on an optimized index too.

-Yonik





--
Regards,

Cuong Hoang


Re: Slow facet with custom Analyser

2007-07-16 Thread climbingrose

Thanks for the suggestion Chris. I modified SimpleFacets to check for
[f.foo.]facet.field.type==(single|multi)
and the performance has been improved significantly.

On 7/17/07, Chris Hostetter [EMAIL PROTECTED] wrote:



:  ...but i don't understand why bother checking isTokenized() ... shouldn't
:  multiValued() be enough?
:  multiValued() be enough?
:
: A field could return false for multiValued() and still have multiple
: tokens per document for that field.

ah .. right ... sorry: multiValued() indicates whether multiple discrete
values can be added to the field (and stored if the field is stored) but
says nothing about what the Analyzer may do with any single value.

perhaps we should really have an [f.foo.]facet.field.type=(single|multi)
param to let clients indicate when they know exactly which method they
want used (getFacetTermEnumCounts vs getFieldCacheCounts) ... if the
property is not set, the default can be determined using the
sf.multiValued() || ft.isTokenized() || ft instanceof BoolField logic.
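
A minimal sketch of that decision, purely illustrative (the real SimpleFacets
code and parameter plumbing differ):

  // true  -> count by walking terms (getFacetTermEnumCounts); safe for multi-token fields
  // false -> count via the FieldCache (getFieldCacheCounts); fast for single-valued fields
  static boolean useTermEnumCounts(String explicitType, boolean multiValued,
                                   boolean tokenized, boolean isBoolField) {
    if ("single".equals(explicitType)) return false;
    if ("multi".equals(explicitType)) return true;
    // default when [f.foo.]facet.field.type is not set
    return multiValued || tokenized || isBoolField;
  }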


-Hoss





--
Regards,

Cuong Hoang


A few questions regarding multi-word synonyms and parameters encoding

2007-07-10 Thread climbingrose

Hi all,

I've been using Solr for the last few projects and the experience has been
great. I'll post the link to the website once it finishes. Just have a few
questions regarding synonyms and parameters encoding:

1) Are multi-word synonyms possible now in Solr? For example, can I have
synonyms like:
I.T. & T, IT & T, Information Technologies, Computer science
I read a message on the mailing list some time ago (I think back in mid 2006)
saying that there is no clean way to implement this. Is it possible now? In
my case, I have two fields, category and location, in which category is of
the default string type and location is of the default text type:
+Category field is used only for faceting by category, therefore no analysis
needs to be done. Can I use the synonyms config above to do a facet query on
the category field so that Solr will combine items having one of these
categories into one facet category? For example, instead of:

I.T. & T (10)
IT & T (20)
Information Technologies (30)
Computer science (40)

can I have something like:

I.T. & T (100)

Or do I have to manually run a filter query for each category, e.g.
category:"I.T. & T", and count the results?

+Location field is used for searching by city, state and post code. Since I
collect the data from different sources, there might be mix & match
information. For example, on one record I might have "Inner Sydney, NSW"
while on another record I might have "Inner Sydney, New South Wales". In
Australia, NSW & New South Wales are used interchangeably, so when users
search for NSW, I want "New South Wales" records to be returned and vice
versa. How could I achieve this? The location field is of the default text
type.
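
(For the NSW / New South Wales case above, one option is an index-time
synonym filter on the location field; a minimal sketch, where the type name
and file name are assumptions:

<fieldType name="text_location" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

with a synonyms.txt entry like:

NSW, New South Wales

With expand="true" both forms get indexed, so a search for either one matches
records containing the other.)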

2) I'm having trouble using facet values in my URL. For example, I have a
title facet field in my query and it returns something like:

Software engineer
C++ Programmer
C Programmer & PHP developer

Now I want to create a link for each of these values so that the user can
filter the results by that title by clicking on the link. For example, if I
click on "Software engineer", the results are narrowed down to just include
records with "Software engineer" in their title. Since the title field can
contain special chars like '+', '&' ..., I really can't find a clean way to
do this. At the moment, I replace all the spaces with '+' and it seems to
work for values like "Software engineer" (converted to "Software+engineer").
However, "C++ Programmer" is converted to "C+++Programmer", and it doesn't
seem to work (returns no results). Any ideas?
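
One way to build such links is to quote the value for the query parser and
then URL-encode the whole filter parameter, instead of hand-replacing spaces
with '+'. A minimal sketch in Java, assuming the facet value is applied back
as an fq filter on the title field; the class name and handler path are made
up for illustration:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FacetLinkBuilder {

    // Quote the facet value for the query parser, then URL-encode the whole fq parameter.
    static String facetLink(String baseUrl, String field, String value)
            throws UnsupportedEncodingException {
        // Inside double quotes, characters like '+' and '&' are not treated as query operators.
        String fq = field + ":\"" + value.replace("\"", "\\\"") + "\"";
        // URL-encoding turns '+' into %2B so it is not read back as a space by the servlet container.
        return baseUrl + "?q=*:*&fq=" + URLEncoder.encode(fq, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(facetLink("/solr/select", "title", "C++ Programmer"));
        // prints: /solr/select?q=*:*&fq=title%3A%22C%2B%2B+Programmer%22
    }
}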

Looking back, this is such a long email. If you reach this point, thanks a
lot for your time!!!

--
Regards,

Cuong Hoang


Re: history

2007-07-08 Thread climbingrose

Coincidentally, I have a very similar use case. Thanks for the advice.

On 7/8/07, Yonik Seeley [EMAIL PROTECTED] wrote:


On 7/7/07, Brian Whitman [EMAIL PROTECTED] wrote:
 I have been trying to plan out a history function for Solr. When I
 update a document with an existing unique key, I would like the older
 version to stay around and get tagged with the date and some metadata
 to indicate it's not live. Any normal search would not touch
 history documents.

Interesting...
One might be able to accomplish this with the update processors that
Ryan  I have been batting around for the last few days, in
conjunction with updateable documents, which is on-deck.

The first idea that comes to mind is that during an update, you could
change the id of the older document to be something like
id_timestamp, and reindex it with the addition of a live:false
field.

For normal queries, use a filter of -live:false.
For all old versions of a document, use a prefix query id:mydocid_*
For all versions of a document, use the query id:mydocid*

So if you can hold off a little bit, you shouldn't need a custom query
handler.  This will be a good use case to ensure that our request
processors and updateable documents are powerful enough.

-Yonik





--
Regards,

Cuong Hoang


Re: Dynamic fields performance question

2007-03-26 Thread climbingrose

Thanks Yonik. I think both of the conditions hold true for our application
;).

On 3/27/07, Yonik Seeley [EMAIL PROTECTED] wrote:


On 3/26/07, climbingrose [EMAIL PROTECTED] wrote:
 I'm developing an application that potentially creates thousands of
dynamic
 fields.  Does anyone know if large number of dynamic fields will degrade
 Solr performance?

Thousands of fields won't be a problem if
- you don't sort on most of them (sorting by a field takes up memory)
- you can omit norms on most of them

Provided the above is true, differences in searching + indexing
performance shouldn't be noticeable.
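
For illustration, omitting norms is a flag on the field or dynamicField
declaration in schema.xml; a sketch, where the name pattern is an assumption:

  <dynamicField name="attr_*" type="string" indexed="true" stored="true" omitNorms="true"/>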

-Yonik





--
Regards,

Cuong Hoang


Dynamic fields performance question

2007-03-25 Thread climbingrose

Hi all,

I'm developing an application that potentially creates thousands of dynamic
fields.  Does anyone know if a large number of dynamic fields will degrade
Solr performance?

Thanks.


--
Regards,

Cuong Hoang


Solr use case

2006-10-11 Thread climbingrose

Hi all,

Is it true that Solr is mainly used for applications that rarely change the
underlying data? As I understand it, if you submit new data or modify
existing data on the Solr server, you have to refresh the cache somehow to
display the updated data. If my application frequently gets new data/updates
from users, should I use Solr? I love faceted browsing and dynamic
properties so much, but I need to justify the choice of Solr. Thanks. By the
way, does anyone have any performance measurements that can be shared (apart
from the ones on the Wiki)? I estimate my application will have about half a
million docs, each of which has around 15 properties; does anyone know what
type of hardware I would need for reasonable performance?

Thanks.

--
Regards,

Cuong Hoang


Multiple schemas

2006-09-26 Thread climbingrose

Hi all,

Am I right that we can only have one schema per solr server? If so, how
would you deal with the issue of submitting completely different data models
(such as clothes and cars)?
Thanks.

--
Regards,

Cuong Hoang


Re: Mobile phone shop + Solr

2006-09-13 Thread climbingrose

I probably need to visualise my models:

MobileInfo (1) --- (1..*) SellingItem

MobileInfo has many fields to describe the characteristics of a mobile phone
model (color, size...). SellingItem is an instance of a MobileInfo that is
currently being sold by a user. So in ERD terms, SellingItem will probably
have a foreign key called MobileInfoId that references the primary key of
MobileInfo. Now obviously, I need to index MobileInfo to support faceted
browsing. How should I index SellingItem? The simplest way is probably to
combine the mobile phone specs in MobileInfo with the fields in SellingItem,
and then index all of them. In this case, if I have 1000 SellingItems
referencing a particular MobileInfo, I have to repeat the fields in
MobileInfo a thousand times.
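
For what it's worth, the flattened approach means one Solr document per
SellingItem that repeats its MobileInfo fields; a sketch of such an add
document, with every field name made up for illustration:

<add>
  <doc>
    <field name="id">sellingItem-1234</field>
    <field name="price">299.00</field>
    <field name="sellerId">user-42</field>
    <field name="mobileInfoId">nokia-n73</field>
    <field name="brand">Nokia</field>
    <field name="color">silver</field>
    <field name="screenSize">2.4</field>
  </doc>
</add>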

On 9/13/06, Chris Hostetter [EMAIL PROTECTED] wrote:



: Because the mobile phone info has many fields (40), I don't want to
: repeatedly submit it to Solr.

i'm not really sure what you mean by "repeatedly submit to Solr" or how it
relates to having more than 40 fields.  40 fields really isn't that many.

To give you a basis of comparison: the last Solr index i built from
scratch had 47 field declarations, and 4 dynamicField declarations
...those 4 dynamic fields result in approximately 1200 'fields' in the
index -- not every document has a value for every field, but the average
is above 200 fields per document.



-Hoss





--
Regards,

Cuong Hoang