data import scheduling
Hi, has anyone gotten Solr to schedule data imports at a certain time interval through configuration alone? I tried setting interval=1, which should mean an import every minute, but I don't see it happening. I'm trying to avoid cron jobs. Thanks, Tri
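For what it's worth, stock Solr (as of 1.4) ships no built-in DataImportHandler scheduler, so an interval property on its own won't trigger anything; the usual alternatives are cron or an in-process timer that hits the DIH endpoint. A minimal in-process sketch, assuming a hypothetical local DIH URL (the task below only prints the request it would make; a real version would open an HTTP connection, e.g. with java.net.HttpURLConnection):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DataImportScheduler {
    // Assumed DIH endpoint; adjust host, port, and core name to your setup.
    static final String DIH_URL = "http://localhost:8983/solr/dataimport";

    static String importCommand(boolean clean) {
        return DIH_URL + "?command=full-import&clean=" + clean;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fire once a minute, which is what interval=1 was meant to do.
        Runnable importTask = () -> System.out.println("GET " + importCommand(false));
        scheduler.scheduleAtFixedRate(importTask, 0, 1, TimeUnit.MINUTES);
        Thread.sleep(100);   // let the first run happen (demo only)
        scheduler.shutdownNow();
    }
}
```

clean=false keeps existing documents in the index; with clean=true each run would wipe the index first.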
Re: solr dynamic core creation
Does anyone have any idea on how to do this? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1881374.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: To cache or to not cache
Jonathan, thanks for your comment. In fact, you are quite right: a lot of people have developed great caching mechanisms. However, the solution I had in mind was something like an HTTP cache, in most cases on the same box. I talked to some experts who told me that Squid would be a relatively large monster, since we only want it for HTTP caching. Do you know of any benchmarks on responses per second when most of the queried data is in the cache? Regards
Re: How to use polish stemmer - Stempel - in schema.xml?
Hi! Sorry for such a break, but I was moving house... anyway:

1. I took the ~/apache-solr/src/java/org/apache/solr/analysis/StandardFilterFactory.java file and modified it (saved as StempelFilterFactory.java) in Vim this way:

    package org.getopt.solr.analysis;

    import org.apache.lucene.analysis.TokenStream;
    import org.getopt.stempel.lucene.StempelFilter;

    public class StempelTokenFilterFactory extends BaseTokenFilterFactory {
        public StempelFilter create(TokenStream input) {
            return new StempelFilter(input);
        }
    }

2. Then I put the file into the extracted stempel-1.0.jar under ./org/getopt/solr/analysis/
3. Then I created a class from it: jar -cf StempelTokenFilterFactory.class StempelFilterFactory.java
4. Then I created a new stempel-1.0.jar archive: jar -cf stempel-1.0.jar -C ./stempel-1.0/ .
5. Then in schema.xml I put:

    <fieldType name="text_pl" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="org.getopt.solr.analysis.StempelTokenFilterFactory"/>
      </analyzer>
    </fieldType>

6. I started the Solr server and received the following error:

    2010-11-11 11:50:56 org.apache.solr.common.SolrException log
    SEVERE: java.lang.ClassFormatError: Incompatible magic value 1347093252 in class file org/getopt/solr/analysis/StempelTokenFilterFactory
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        ...

Question: What is wrong? :) I use jar (fastjar) 0.98 to create the jars. I googled that error, but no answer gave me an idea of what is wrong in my .java file. Please help, as I believe I am close to the end of this subject. Cheers, Jakub Godawa. 2010/11/3 Lance Norskog goks...@gmail.com: Here's the problem: Solr is a little dumb about these Filter classes, and so you have to make a Factory object for the Stempel Filter. There are a lot of other FilterFactory classes.
You would have to just copy one and change the names to Stempel, and it might actually work. This will take some Solr programming - perhaps the author can help you? On Tue, Nov 2, 2010 at 7:08 AM, Jakub Godawa jakub.god...@gmail.com wrote: Sorry, I am not a Java programmer at all. I would appreciate more verbose (or step by step) help. 2010/11/2 Bernd Fehling bernd.fehl...@uni-bielefeld.de: So you call org.getopt.solr.analysis.StempelTokenFilterFactory. In this case I would assume a file StempelTokenFilterFactory.class in your directory org/getopt/solr/analysis/. And a class which extends BaseTokenFilterFactory, right? ... public class StempelTokenFilterFactory extends BaseTokenFilterFactory implements ResourceLoaderAware { ... On 02.11.2010 14:20, Jakub Godawa wrote: This is what stempel-1.0.jar consists of after jar -xf:

    jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R org/
    org/: egothor getopt
    org/egothor: stemmer
    org/egothor/stemmer: Cell.class Diff.class Gener.class MultiTrie2.class Optimizer2.class Reduce.class Row.class TestAll.class TestLoad.class Trie$StrEnum.class Compile.class DiffIt.class Lift.class MultiTrie.class Optimizer.class Reduce$Remap.class Stock.class Test.class Trie.class
    org/getopt: stempel
    org/getopt/stempel: Benchmark.class lucene Stemmer.class
    org/getopt/stempel/lucene: StempelAnalyzer.class StempelFilter.class

    jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R META-INF/
    META-INF/: MANIFEST.MF

    jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R res
    res: tables
    res/tables: readme.txt stemmer_1000.out stemmer_100.out stemmer_2000.out stemmer_200.out stemmer_500.out stemmer_700.out

2010/11/2 Bernd Fehling bernd.fehl...@uni-bielefeld.de: Hi Jakub, if you unzip your stempel-1.0.jar do you have the required directory structure and file in there? org/getopt/stempel/lucene/StempelFilter.class Regards, Bernd On 02.11.2010 13:54, Jakub Godawa wrote: Erick, I've put the jar files like that before.
I also added the <lib> directive and put the file in instanceDir/lib. What is still a problem is that even though the file is loaded:

    2010-11-02 13:20:48 org.apache.solr.core.SolrResourceLoader replaceClassLoader
    INFO: Adding 'file:/home/jgodawa/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar' to classloader

I am not able to use the FilterFactory... maybe I am attempting it in a wrong way? Cheers, Jakub Godawa. 2010/11/2 Erick Erickson erickerick...@gmail.com: The Polish stemmer jar file needs to be findable by Solr; if you copy it to solr_home/lib and restart Solr you should be set. Alternatively, you can add another lib directive to the solrconfig.xml file (there are several examples in that file already). I'm a little confused about not being able to find TokenFilter - is that still a problem? HTH Erick On Tue, Nov 2, 2010 at
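A hint on the ClassFormatError above: a valid class file starts with the magic number 0xCAFEBABE, while the reported value 1347093252 is 0x504B0304 — the ZIP local-file-header signature "PK\3\4". In other words, the file packed as StempelTokenFilterFactory.class is itself a zip/jar archive, which is what step 3's `jar -cf` produces; compiling with javac is what yields bytecode. A small sketch that decodes the reported value:

```java
public class MagicCheck {
    public static void main(String[] args) {
        int magic = 1347093252; // from the ClassFormatError
        System.out.printf("0x%08X%n", magic); // prints 0x504B0304
        // 0x504B0304 is the ZIP signature "PK\3\4":
        byte[] b = { (byte) (magic >>> 24), (byte) (magic >>> 16),
                     (byte) (magic >>> 8), (byte) magic };
        System.out.println((char) b[0] + "" + (char) b[1]); // prints PK
        // A real class file would begin with 0xCAFEBABE instead.
    }
}
```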
Error while indexing files with Solr
Hi, I am trying to index documents (PDF, DOC, XLS, RTF) using the ExtractingRequestHandler. I am following the tutorial at http://wiki.apache.org/solr/ExtractingRequestHandler but when I run the following command

    curl "http://localhost:8983/solr/update/extract?literal.id=mydoc.doc&uprefix=attr_&fmap.content=attr_content" -F myfile=@/home/system/Documents/mydoc.doc

I am getting the following error:

    HTTP ERROR 500: lazy loading error
    org.apache.solr.common.SolrException: lazy loading error
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:249)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
    Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
        at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
        ... 21 more
    Caused by: java.lang.ClassNotFoundException: org.apache.solr.handler.extraction.ExtractingRequestHandler not found in java.net.URLClassLoader{urls=[], parent=contextloa...@null}
        at java.net.URLClassLoader.findClass(libgcj.so.90)
        at java.lang.ClassLoader.loadClass(libgcj.so.90)
        at java.lang.ClassLoader.loadClass(libgcj.so.90)
        at java.lang.Class.forName(libgcj.so.90)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
        ... 24 more

    RequestURI=/solr/update/extract (Powered by Jetty://)

I am running Debian Lenny and java version 1.6.0_22. I am running apache-solr-1.4.1 from the examples directory. Please point me in the right direction and help me solve the problem. -- --- Regards, Kaustuv Royburman Senior Software Developer infoservices.in DLF IT Park, Rajarhat, 1st Floor, Tower - 3 Major Arterial Road, Kolkata - 700156, India
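The root cause in that trace is a plain ClassNotFoundException: the Solr Cell (extraction) contrib jars are not on Solr's classpath. In the stock 1.4.1 layout they live in dist/apache-solr-cell-1.4.1.jar plus contrib/extraction/lib/, and need to be copied into solr_home/lib or referenced via lib directives in solrconfig.xml (paths below assume the standard distribution layout). Separately, libgcj.so.90 in the trace suggests the JVM is GCJ rather than the Sun JDK, which has caused Solr trouble before. A small probe one could run to check whether the handler class is loadable:

```java
public class HandlerCheck {
    /** Returns true if the named class is loadable from the current classpath. */
    static boolean present(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String handler = "org.apache.solr.handler.extraction.ExtractingRequestHandler";
        if (present(handler)) {
            System.out.println("Solr Cell is on the classpath");
        } else {
            // Remediation for the assumed 1.4.1 layout:
            //   cp dist/apache-solr-cell-1.4.1.jar contrib/extraction/lib/*.jar <solr_home>/lib/
            System.out.println("Solr Cell missing - copy the extraction contrib jars into solr_home/lib");
        }
    }
}
```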
index just new articles from rss feeds - Data Import Request Handler
Hello, I'd like to use Solr to index some documents coming from an RSS feed, like the example at [1], but it seems that the configuration used there is just for one-time indexing, fetching all the articles exposed in the website's RSS feed. Is it possible to manage and index just the new articles coming from the RSS source? I found that the delta-import may be useful but, from what I understand, delta-import just updates the index with the contents of documents that have been modified since the last indexing: this is obviously useful, but I'd like to index just the new articles coming from an RSS feed. Is this managed automatically by Solr, or do I have to deal with it separately? Maybe a full import with the clean=false parameter? Are there any solutions you would suggest? Maybe storing the article feeds in a table like [2] and having a module that periodically sends each row to Solr for indexing? Thanks, Matteo [1] http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example [2] http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
IndexTank technology...
Does anyone know what technology they are using: http://www.indextank.com/ Is it Lucene under the hood? Thanks, and apologies for cross-posting. -Glen http://zzzoot.blogspot.com
solr 1.3 how to parse rich documents
Hi, I use Solr 1.3 with the patch for parsing rich documents, and when uploading, for example, a PDF file, the only thing I see in solr.log is the following:

    INFO: [] webapp=/solr path=/update/rich params={id=250&stream.type=pdf&fieldnames=id,name&commit=true&stream.fieldname=body&name=iphone+user+guide+pdf+iphone_user_guide.pdf} status=0 QTime=12656

solrconfig.xml contains the line:

    <requestHandler name="/update/rich" class="solr.RichDocumentRequestHandler" startup="lazy"/>

What else am I missing? Since I am running Solr standalone, I do not need to build it with ant, do I? Regards, Nikola -- Nikola Garafolic SRCE, Sveucilisni racunski centar tel: +385 1 6165 804 email: nikola.garafo...@srce.hr
Re: Adding new field after data is already indexed
@Jerry Li What version of Solr were you using? And was there any data in the new field? I have no problems here with a quick test I ran on trunk... Best Erick On Thu, Nov 11, 2010 at 1:37 AM, Jerry Li | 李宗杰 zongjie...@gmail.com wrote: But if I use this field to do sorting, an error occurs and an ArrayIndexOutOfBounds exception is thrown. On Thursday, November 11, 2010, Robert Petersen rober...@buy.com wrote: 1) Just put the new field in the schema and stop/start Solr. Documents in the index will not have the field until you reindex them, but it won't hurt anything. 2) Just turn off their handlers in solrconfig is all I think that takes. -Original Message- From: gauravshetti [mailto:gaurav.she...@tcs.com] Sent: Monday, November 08, 2010 5:21 AM To: solr-user@lucene.apache.org Subject: Adding new field after data is already indexed Hi, I had a few questions regarding Solr. Say my schema file looks like

    <field name="folder_id" type="long" indexed="true" stored="true"/>
    <field name="indexed" type="boolean" indexed="true" stored="true"/>

and I index data on the basis of these fields. Now, in case I need to add a new field, is there a way I can add the field without corrupting the previous data? Is there any feature which adds a new field with a default value to the existing records? 2) Is there any security mechanism/authorization check to restrict URLs like /admin and /update to only a few users? -- Best Regards. Jerry. Li | 李宗杰
Re: solr dynamic core creation
Hi, nizan. I didn't realize that just replying to a thread from my email client wouldn't get back to you. Here's some info on this thread since your original post: On Nov 10, 2010, at 12:30pm, Bob Sandiford wrote: Why not use replication? Call it inexperience... We're really early into working with and fully understanding Solr and the best way to approach various issues. I did mention that this was a prototype and non-production code, so I'm covered, though :) We'll take a look at the replication feature... Replication doesn't replicate the top-level solr.xml file that defines available cores, so if dynamic cores is a requirement then your custom code isn't wasted :) -- Ken -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, November 10, 2010 3:26 PM To: solr-user@lucene.apache.org Subject: Re: Dynamic creating of cores in solr You could use the actual built-in Solr replication feature to accomplish that same function -- complete re-index to a 'master', and then when finished, trigger replication to the 'slave', with the 'slave' being the live index that actually serves your applications. I am curious if there was any reason you chose to roll your own solution using SolrJ and dynamic creation of cores, instead of simply using the replication feature. Were there any downsides of using the replication feature for this purpose that you ameliorated through your solution? Jonathan Bob Sandiford wrote: We also use SolrJ, and have a dynamically created Core capability - where we don't know in advance what the Cores will be that we require. We almost always do a complete index build, and if there's a previous instance of that index, it needs to be available during a complete index build, so we have two cores per index, and switch them as required at the end of an indexing run.
Here's a summary of how we do it (we're in an early prototype / implementation right now - this isn't production quality code - as you can tell from our voluminous javadocs on the methods...)

1) Identify if the core exists, and if not, create it:

    /**
     * This method instantiates two SolrServer objects, solr and indexCore. It requires that
     * indexName be set before calling.
     */
    private void initSolrServer() throws IOException {
        String baseUrl = "http://localhost:8983/solr/";
        solr = new CommonsHttpSolrServer(baseUrl);
        String indexCoreName = indexName + SolrConstants.SUFFIX_INDEX; // SUFFIX_INDEX = "_INDEX"
        String indexCoreUrl = baseUrl + indexCoreName;
        // Here we create two cores for the indexName, if they don't already exist - the live core used
        // for searching and a second core used for indexing. After indexing, the two will be switched so the
        // just-indexed core will become the live core. The way that core swapping works, the live core will always
        // be named [indexName] and the indexing core will always be named [indexName]_INDEX, but the
        // dataDir of each core will alternate between [indexName]_1 and [indexName]_2.
        createCoreIfNeeded(indexName, indexName + "_1", solr);
        createCoreIfNeeded(indexCoreName, indexName + "_2", solr);
        indexCore = new CommonsHttpSolrServer(indexCoreUrl);
    }

    /**
     * Create a core if it does not already exist. Returns true if a new core was created, false otherwise.
     */
    private boolean createCoreIfNeeded(String coreName, String dataDir, SolrServer server) throws IOException {
        boolean coreExists = true;
        try {
            // SolrJ provides no direct method to check if a core exists, but getStatus will
            // return an empty list for any core that doesn't.
            CoreAdminResponse statusResponse = CoreAdminRequest.getStatus(coreName, server);
            coreExists = statusResponse.getCoreStatus(coreName).size() > 0;
            if (!coreExists) {
                // Create the core
                LOG.info("Creating Solr core: " + coreName);
                CoreAdminRequest.Create create = new CoreAdminRequest.Create();
                create.setCoreName(coreName);
                create.setInstanceDir(".");
                create.setDataDir(dataDir);
                create.process(server);
            }
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
        return !coreExists;
    }

2) Do the index, clearing it first if it's a complete rebuild:

    [snip]
    if (fullIndex) {
        try {
            indexCore.deleteByQuery("*:*");
        } catch (SolrServerException e) {
            e.printStackTrace(); // To change body of catch statement use File | Settings | File Templates.
        }
    }
    [snip]

various logic, then (we submit batches of 100): [snip]
Issue with facet fields
I am facing this weird issue with facet fields. Within the config XML, under

    <requestHandler name="standard" class="solr.SearchHandler">
      <!-- default values for query parameters -->
      <lst name="defaults">

I have defined the fl as

    <str name="fl">file_id folder_id display_name file_name priority_text content_type last_upload upload_by business indexed</str>

But my output XML doesn't contain the elements upload_by and business, though I am able to search by upload_by: and business:. Even when I add fl=* to the URL I do not get these fields in the response. Any idea what I am doing wrong?
Boosting
Hi, I have a question about boosting. I have the following fields in my schema.xml: 1. title 2. description 3. ISBN etc. I want to boost the title field. I tried index-time boosting, but it did not work. I also tried query-time boosting, but with no luck. Can someone help me with how to implement boosting on a specific field like title? Thanks, Solr User
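For query-time boosting, the usual route is the dismax request handler, whose qf parameter weights fields, e.g. qf=title^2.0 description^1.0 (the weights below are illustrative, not from the original post). A sketch that assembles such a request string; note that the spaces inside qf would still need URL-encoding before being sent:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostedQuery {
    /** Builds dismax query parameters from a field -> boost map. */
    static String dismaxParams(String q, Map<String, Double> boosts) {
        StringBuilder qf = new StringBuilder();
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            if (qf.length() > 0) qf.append(' ');
            qf.append(e.getKey()).append('^').append(e.getValue());
        }
        return "defType=dismax&q=" + q + "&qf=" + qf;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0);       // boost title matches hardest
        boosts.put("description", 1.0);
        boosts.put("isbn", 0.5);
        System.out.println(dismaxParams("solr", boosts));
        // prints: defType=dismax&q=solr&qf=title^2.0 description^1.0 isbn^0.5
    }
}
```

Index-time boosts, by contrast, are baked into the documents and require a reindex after every change, which is one reason query-time qf weights are easier to experiment with.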
Re: solr dynamic core creation
Hi, Thanks for the offers, I'll take a deeper look into them. In the offers you showed me, if I understand correctly, the call for creation is done on the client side. I need the mechanism to work on the server side. I know it sounds stupid, but I need the client side not to know which cores exist; on the server side (maybe with a handler?) Solr should detect that the core is not created, and create it if needed. Thanks, nizan
problem with wildcard
Hi All, I'm having some trouble with a query using a wildcard and I was wondering if anyone could tell me why these two similar queries do not return the same number of results. Basically, the query I'm making should return all docs whose title starts with (or contains) the string lowe'. I suspect some analyzer is causing this behaviour and I'd like to know if there is a way to fix this problem.

1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0

    <result name="response" numFound="302" start="0"/>
    <lst name="debug">
      <str name="rawquerystring">*:*</str>
      <str name="querystring">*:*</str>
      <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
      <str name="parsedquery_toString">*:*</str>
      <lst name="explain"/>
      <str name="QParser">LuceneQParser</str>
      <arr name="filter_queries">
        <str>title:( lowe')</str>
      </arr>
      <arr name="parsed_filter_queries">
        <str>title:low</str>
      </arr>

2) select?q=*:*&fq=title:(+lowe'*)&debugQuery=on&rows=0

    <result name="response" numFound="0" start="0"/>
    <lst name="debug">
      <str name="rawquerystring">*:*</str>
      <str name="querystring">*:*</str>
      <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
      <str name="parsedquery_toString">*:*</str>
      <lst name="explain"/>
      <arr name="filter_queries">
        <str>title:( lowe'*)</str>
      </arr>
      <arr name="parsed_filter_queries">
        <str>title:lowe'*</str>
      </arr>
      ...
    </lst>

The title field is defined as:

    <field name="title" type="text" indexed="true" stored="true" required="false"/>

where the text type is:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query
             analyzers to leave a 'gap' for more accurate phrase queries. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>
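The parsed_filter_queries in the debug output point at the cause: the plain query is run through the analysis chain (lowercasing, word-delimiter splitting on the apostrophe, Snowball stemming), ending up as title:low, whereas wildcard queries in Lucene/Solr bypass analysis, so title:lowe'* searches for indexed terms literally beginning with lowe' — terms that never made it into the index in that form. A toy illustration of the mismatch (the "stemming" here is faked for the demo, not the real Snowball algorithm):

```java
public class WildcardMismatch {
    // Stand-in for the index-time chain: the real chain (WordDelimiter +
    // LowerCase + SnowballPorter) turns "Lowe'" into the term "low",
    // as the debug output's parsed_filter_queries shows.
    static String analyze(String raw) {
        String t = raw.toLowerCase().replaceAll("[^a-z]", "");
        return t.endsWith("e") ? t.substring(0, t.length() - 1) : t; // fake stem
    }

    public static void main(String[] args) {
        String indexedTerm = analyze("Lowe'");
        System.out.println(indexedTerm);                     // low
        System.out.println(indexedTerm.startsWith("low"));   // true  -> the 302 hits
        System.out.println(indexedTerm.startsWith("lowe'")); // false -> the 0 hits
    }
}
```

A common workaround is a lighter-analyzed copyField (no stemming, apostrophes stripped consistently) dedicated to prefix/wildcard searches.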
Re: solr dynamic core creation
Hmmm. Maybe you need to define what you mean by 'server' and what you mean by 'client'.
Re: solr dynamic core creation
Hi, Maybe I just don't understand the whole concept and I'm mixing up server and client... Client - the place where I make the HTTP calls (for index, search etc.) - where I use CommonsHttpSolrServer as the Solr server. This machine isn't defined as master or slave; it just uses Solr as a search engine. Server - the HTTP calls I make on the client go to another server, the master Solr server (or one of the slaves), where I have an EmbeddedSolrServer, don't they? thanks, nizan
Re: Crawling with nutch and mapping fields to solr
I'm going down the route of patching Nutch so I can use this ParseMetaTags plugin: https://issues.apache.org/jira/browse/NUTCH-809 Also wondering whether I will be able to use the XMLParser to allow me to parse well-formed XHTML; using XPath would be a bonus: https://issues.apache.org/jira/browse/NUTCH-185 Any thoughts appreciated...
EdgeNGram relevancy
Hi, consider the following fieldtype (used for autocompletion):

    <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
      </analyzer>
    </fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird, though. Example: Query string: Bill Cl Result (in that order): Clyde Phillips, Clay Rogers, Roger Cloud, Bill Clinton. Bill Clinton should have the highest rank in that case. Does anyone have an idea how to configure this fieldtype so that matches in both tokens rank higher than those that match only one token? thanks! -robert
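One common fix is to reward or require matches on all query tokens — e.g. the dismax handler with a high mm (minimum-should-match) value, or a phrase boost (pf) on a less-tokenized copy of the field — so that "Bill Clinton" (two tokens matched) outranks "Clyde Phillips" (one). A toy model of the edge-ngram matching, not Solr's actual scoring, that shows counting matched tokens separates the candidates:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGrams {
    // Mimics EdgeNGramFilterFactory with minGramSize=1, maxGramSize=25.
    static List<String> edgeNGrams(String token) {
        List<String> grams = new ArrayList<>();
        for (int i = 1; i <= Math.min(token.length(), 25); i++) {
            grams.add(token.substring(0, i));
        }
        return grams;
    }

    // Toy relevance signal: how many query tokens prefix-match some word of the name?
    static int matchedTokens(String name, String[] queryTokens) {
        int matched = 0;
        for (String qt : queryTokens) {
            for (String word : name.toLowerCase().split("\\s+")) {
                if (edgeNGrams(word).contains(qt.toLowerCase())) { matched++; break; }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String[] query = { "Bill", "Cl" };
        System.out.println(matchedTokens("Bill Clinton", query));   // 2
        System.out.println(matchedTokens("Clyde Phillips", query)); // 1
    }
}
```

With mm=100% the one-token matches would be filtered out entirely; with a pf boost they would merely rank below the full match.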
Re: solr dynamic core creation
No - in reading what you just wrote, and what you originally wrote, I think the misunderstanding was mine, based on the architecture of my code. In my code, it is our 'server' level that does the SolrJ indexing calls, but you meant 'server' to be the Solr instance, and what you mean by 'client' is what I was thinking of (without thinking) as the 'server'... Sorry about that. Hopefully someone else can chime in on your specific issue...
Re: Any Copy Field Caveats?
I've noticed that using camelCase in field names causes problems. On 11/5/2010 11:02 AM, Will Milspec wrote: Hi all, we're moving from an old Lucene version to Solr and plan to use the Copy Field functionality. Previously we had rolled our own implementation, sticking title, description, etc. in a field called 'content'. We lose some flexibility (i.e. the java layer can no longer control what gets into the new copied field) but gain simplicity. A fair tradeoff IMO. My question: has anyone found any subtle issues or gotchas with copy fields? (From the subject line: caveat - pronounced 'kah-VEY-AT' - is Latin, as in Caveat Emptor... let the buyer beware.) thanks, will
Re: Concatenate multiple tokens into one
Hi Robert, All, I have a similar problem; here is my fieldType: http://paste.pocoo.org/show/289910/ I want to include stopword removal and lowercase the incoming terms. The idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory. If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful. Many thanks Nick On 11 Nov 2010, at 00:23, Robert Gründler wrote: On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote: Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work how you want if you do. In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion. And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place? Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string. I started out with the KeywordTokenizer, which worked well, except for the stopword problem.
For now, i've come up with a quick-and-dirty custom ConcatFilter, which does what i'm after:

public class ConcatFilter extends TokenFilter {
    private TokenStream tstream;

    protected ConcatFilter(TokenStream input) {
        super(input);
        this.tstream = input;
    }

    @Override
    public Token next() throws IOException {
        Token token = new Token();
        StringBuilder builder = new StringBuilder();
        TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
        TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
        boolean incremented = false;
        while (tstream.incrementToken()) {
            if (typeAttribute.type().equals("word")) {
                builder.append(termAttribute.term());
            }
            incremented = true;
        }
        token.setTermBuffer(builder.toString());
        if (incremented) {
            return token;
        }
        return null;
    }
}

I'm not sure if this is a safe way to do this, as i'm not familiar with the whole solr/lucene implementation after all. best -robert Then lowercase, remove whitespace (or not), do whatever else you want to do to your single token to normalize it, and then edgengram it. If you include whitespace in the token, then when making your queries for auto-complete, be sure to use a query parser that doesn't do pre-tokenization; the 'field' query parser should work well for this.
Jonathan From: Robert Gründler [rob...@dubture.com] Sent: Wednesday, November 10, 2010 6:39 PM To: solr-user@lucene.apache.org Subject: Concatenate multiple tokens into one Hi, i've created the following filterchain in a field type, the idea is to use it for autocompletion purposes:

<tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
<filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <!-- throw out stopwords -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/> <!-- throw out everything except a-z -->
<!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/> <!-- create edgeNGram tokens for autocomplete matches -->

With that kind of filterchain, the EdgeNGramFilterFactory will receive multiple tokens on input strings with whitespace in them. This leads to the following results:

Input Query: George Cloo
Matches:
- George Harrison
- John Clooridge
- George Smith
- George Clooney
- etc.

However, only George Clooney should match in the autocompletion use case. Therefore, i'd like to add a filter before the EdgeNGramFilterFactory which concatenates all the tokens generated by the WhitespaceTokenizerFactory. Are there filters which can do such a thing? If not, are there examples of how to implement a custom TokenFilter? thanks! -robert
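For reference, the chain discussed in this thread (lowercase, drop stopwords, strip non-[a-z], concatenate, then edge n-gram) can be sketched outside Solr as plain Java. This is a toy simulation, not Lucene's API; the class name, helper names, and stopword list are made up for illustration:

```java
import java.util.*;
import java.util.regex.Pattern;

// Toy simulation of the autocomplete analysis chain from the thread:
// whitespace-split -> lowercase -> stopword removal -> strip non-[a-z]
// -> concatenate into one token -> edge n-grams.
public class AutocompleteChain {

    // Illustrative stopword list (the real one lives in stopwords.txt).
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "dj", "featuring"));
    static final Pattern NON_AZ = Pattern.compile("[^a-z]");

    // Normalize each whitespace token, drop stopwords, re-join into the
    // single token the EdgeNGram filter should receive.
    static String concatenate(String input) {
        StringBuilder sb = new StringBuilder();
        for (String tok : input.split("\\s+")) {
            String t = NON_AZ.matcher(tok.toLowerCase(Locale.ROOT)).replaceAll("");
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                sb.append(t);
            }
        }
        return sb.toString();
    }

    // Edge n-grams of the concatenated token, as EdgeNGramFilterFactory
    // with minGramSize=1, maxGramSize=25 would emit them.
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        String joined = concatenate("The Beastie Boys");
        System.out.println(joined);                   // beastieboys
        System.out.println(edgeNgrams(joined, 1, 25));
    }
}
```

With the tokens joined first, only prefixes of the whole name ("georgec", "georgecl", ...) are indexed, so "George Cloo" no longer matches "George Harrison".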
Rollback can't be done after committing?
Hi, all. I have a question about Solr and SolrJ's rollback. I try to rollback like below:

try {
    server.addBean(dto);
    server.commit();
} catch (Exception e) {
    if (server != null) { server.rollback(); }
}

I expected that if any Exception is thrown, the rollback runs, so no data would be updated. But once commit has run, rollback is not done. Is rollback only done correctly when commit has not yet run? Is Solr and SolrJ's rollback system not the same as an RDB's rollback?
Re: Rollback can't be done after committing?
What you say is true. Solr is not an RDBMS. Kouta Osabe wrote: [original question quoted in full, snipped]
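The semantics being asked about can be sketched with a toy model (plain Java, not SolrJ; class and method names are invented for illustration): Solr's rollback discards only documents added since the last commit, and cannot undo a commit that has already succeeded.

```java
import java.util.*;

// Toy model (not SolrJ) of Solr's commit/rollback semantics: rollback
// discards only the documents added since the last commit; it cannot
// undo a commit that has already succeeded.
public class CommitModel {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    void add(String doc) { pending.add(doc); }                       // like server.addBean(dto)
    void commit()        { committed.addAll(pending); pending.clear(); }
    void rollback()      { pending.clear(); }                        // like server.rollback()

    int visibleDocs()    { return committed.size(); }

    public static void main(String[] args) {
        CommitModel solr = new CommitModel();
        solr.add("doc1");
        solr.commit();     // doc1 is now permanent
        solr.add("doc2");
        solr.rollback();   // discards doc2 only -- doc1 stays
        System.out.println(solr.visibleDocs());  // 1
    }
}
```

This is exactly the difference from an RDBMS: there is one shared uncommitted buffer rather than per-transaction isolation, and commit is a point of no return.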
using CJKTokenizerFactory for Japanese language
I am exploring support for the Japanese language in solr. Solr seems to provide CJKTokenizerFactory. How useful is this module? Has anyone been using this in production for Japanese? One shortfall it seems to have, from what I have been able to read up on, is that it can generate a lot of false matches, for example matching kyoto when searching for tokyo. I did not see many questions related to this module, so I wonder if people are actively using it. If not, are there any other solutions in the market that are recommended by solr users? Thanks Kumar
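The false-match behaviour comes from overlapping character bigrams, which is the approach CJKTokenizer takes. A minimal sketch (plain Java, not the actual tokenizer code) shows the classic case: the text 東京都 ("Tokyo Metropolis") yields the bigrams 東京 ("Tokyo") and 京都 ("Kyoto"), so a query for Kyoto matches a document about Tokyo:

```java
import java.util.*;

// Sketch of overlapping CJK bigram tokenization: every adjacent pair of
// characters becomes a token, so substrings of longer words produce
// spurious matches (the kyoto/tokyo example from the mail).
public class CjkBigrams {
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("東京都"));  // [東京, 京都]
    }
}
```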
Re: EdgeNGram relevancy
You can add an additional field using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator:

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that begins-with matches come first. --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example:

Query String: Bill Cl
Result (in that order):
- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

Bill Clinton should have the highest rank in that case. Has anyone an idea how to configure this fieldtype so that matches in both tokens rank higher than those that match only one token? thanks! -robert
Re: Issue with facet fields
Are you storing the upload_by and business fields? You will not be able to retrieve a field from your index if it is not stored. Check that you have stored="true" for both of those fields. - Paige On Thu, Nov 11, 2010 at 10:23 AM, gauravshetti gaurav.she...@tcs.com wrote: I am facing this weird issue in facet fields. Within the config xml, under

<requestHandler name="standard" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">

I have defined the fl as:

<str name="fl">file_id folder_id display_name file_name priority_text content_type last_upload upload_by business indexed</str>

But my output xml doesn't contain the elements upload_by and business, although I am able to search by upload_by: and business:. Even when I add fl=* to the url I do not get these fields in the response. Any idea what I am doing wrong? -- View this message in context: http://lucene.472066.n3.nabble.com/Issue-with-facet-fields-tp1883106p1883106.html Sent from the Solr - User mailing list archive at Nabble.com.
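For reference, a minimal schema.xml sketch of the fix being suggested (field names taken from the mail; the type and other attribute values are assumptions):

```xml
<!-- schema.xml: both attributes matter -- indexed="true" makes the
     field searchable, stored="true" makes it returnable via fl -->
<field name="upload_by" type="string" indexed="true" stored="true"/>
<field name="business"  type="string" indexed="true" stored="true"/>
```

A field that is indexed but not stored explains the observed behaviour exactly: you can filter on it, but it never appears in the response, regardless of fl.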
Re: EdgeNGram relevancy
thanks a lot, that setup works pretty well now. the only problem now is that the StopWords do not work that well anymore. I'll provide an example, but first the 2 fieldtypes:

<!-- autocomplete field which finds matches inside strings ("scor" matches "Martin Scorsese") -->
<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<!-- autocomplete field which finds startsWith matches only ("scor" matches only "Scorpio", but not "Martin Scorsese") -->
<fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This setup now makes troubles regarding StopWords, here's an example: Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list.
Query: edgytext:"Mr Scorsese" OR edgytext2:"Mr Scorsese"^2.0

This way, the only result i get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. Any idea why in this case "Martin Scorsese" is not in the result at all? thanks again! -robert On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote: [previous message quoted in full, snipped]
Re: Concatenate multiple tokens into one
I've posted a ConcatFilter in my previous mail which does concatenate tokens. This works fine, but i realized that what i wanted to achieve is implemented easier in another way (by using 2 separate field types). Have a look at a previous mail i wrote to the list and the reply from Ahmet Arslan (topic: EdgeNGram relevancy). best -robert On Nov 11, 2010, at 5:27 PM, Nick Martin wrote: [earlier messages quoted in full, snipped]
Memory used by facet queries
Hello All. My first time post so be kind. We are developing a document store with lots and lots of very small documents (200 million at the moment; the final size will probably be double this, at 400 million documents). This is proof-of-concept development, so we are seeing what a single node can do for us before we consider sharding. We'd rather not shard if we don't have to. I'm using SOLR 4.0 (for the simple facet pivots and groups, which work well). We're into week 4 of our development and have the production servers etc. set up. Everything worked very well until we started to test queries with production volumes of data. I'm running into Java Heap Space exceptions during simple faceting on inverted fields. The fields we are currently faceting on are names - Country / Continent / City names, all stored as a Solr.StringField (there are other fields using tokenization to provide the initial search, but we want to use the simple StringFields to provide faceted navigation). In total we have 10 fields we'd ever want to facet on (8 name fields that are strings and 2 datepart fields (year and yearMonth) that are also strings). This is our first time using SOLR and I didn't realise that we'd need so much heap for facets! Solr is running in a tomcat container and I've currently set tomcat to use a max of JAVA_OPTS="$JAVA_OPTS -server -Xms512m -Xmx3g". I've been reading all I can find online and have seen advice to populate the facet caches as soon as we've started the solr service. However I'd really like to know if there are ways to reduce the memory footprint. We currently have 32g of physical ram. Adding more ram is an option but I'm being asked the (completely reasonable) question -- Why do you need so much? Please help! Charlie. -----Original Message----- From: Robert Gründler [mailto:rob...@dubture.com] Sent: 11 November 2010 18:14 To: solr-user@lucene.apache.org Subject: Re: Concatenate multiple tokens into one [earlier messages quoted in full, snipped]
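As a rough back-of-envelope check of why a 3 GB heap is not enough for Charlie's setup: assume roughly one 4-byte ordinal per document per faceted field (a simplification of what Solr's field cache actually allocates, ignoring term bytes and per-segment overhead, so this is only a lower bound):

```java
// Back-of-envelope heap estimate for simple string-field faceting.
// Assumption (simplified): one 4-byte ord per document per faceted
// field; real Solr data structures differ and add term bytes on top.
public class FacetHeapEstimate {
    static long ordBytes(long numDocs, int numFields) {
        return numDocs * 4L * numFields;
    }

    public static void main(String[] args) {
        long docs = 200_000_000L;          // current corpus size from the mail
        long bytes = ordBytes(docs, 10);   // 10 facet fields
        System.out.println(bytes / (1024 * 1024 * 1024) + " GB");  // prints 7 GB
    }
}
```

Even under this optimistic model, 200M docs times 10 string facet fields needs on the order of 8 GB for the ordinal arrays alone, which is why the 3 GB heap blows up and why per-segment faceting, fewer facet fields, or sharding are the usual answers.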
Re: EdgeNGram relevancy
This setup now makes troubles regarding StopWords, here's an example: Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list. Query: edgytext:"Mr Scorsese" OR edgytext2:"Mr Scorsese"^2.0 This way, the only result i get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. Any idea why in this case "Martin Scorsese" is not in the result at all? Did you run your query without using () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 If no, can you paste the output of debugQuery=on?
Search Result Differences a Puzzle
Hi, I cannot find out how this is occurring: Nolosearch/com/search/apachesolr_search/law You can see that the John Paul Stevens result yields more description in the search result because of the keyword relevancy, whereas the other results just give you a snippet of the title based on keywords found. I am trying to figure out how to get a standard-size search result no matter what the relevancy is. While application of this type of result would be irrelevant to many search engines, it is completely practical in a legal setting, as a keyword is only as good as how it is being referenced in the sentence or paragraph. What a dilemma I have! I have been trying to figure out if it is the actual schema.xml file or the solrconfig.xml file, and for the life of me I can't find it referenced anywhere. I tried changing the fragsize to 200 instead of the default of around 70. Didn't do any damage at re-index. This problem is super critical to my search results. Like I said, as an attorney, the keyword is superfluous until it is attached to a long sentence or two that describes whether the keyword we searched for is relevant, let alone worthy of a click. That is why my titles are set to open in a new window: faster access, and if the result is crud, then just close the window and get back to research. Eric
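If the snippet length is the issue, fragsize is a parameter of Solr's standard highlighter, set per request or in the request handler's defaults in solrconfig.xml. A hedged sketch (the field name is hypothetical, values illustrative):

```xml
<!-- solrconfig.xml, inside the request handler's <lst name="defaults">:
     standard highlighter parameters; "body" is a hypothetical field -->
<str name="hl">true</str>
<str name="hl.fl">body</str>        <!-- field(s) to highlight -->
<str name="hl.fragsize">200</str>   <!-- characters per snippet -->
<str name="hl.snippets">2</str>     <!-- snippets per document -->
```

Note these are query-time highlighter settings, not schema settings, which would explain why no re-index made a difference.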
Retrieving indexed content containing multiple languages
My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out of the box Solr schema. The search results are presented via an AJAX enabled HTML page. When I perform a search the document title (for example) has a mix of english and chinese characters. Everything there is fine - I can see the english and chinese returned from a facet query on title. I can search against the title using english words it contains and I get back an expected result. I asked a chinese friend to perform the same search using chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language, I'll probably need to support more in the future. My thought is that the chinese characters are indexed as their unicode equivalent so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in english. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
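If the CJK route is taken for the title, a minimal schema.xml sketch might look like this (the field and type names here are made up for illustration; the same analyzer must be applied at both index and query time for the Chinese terms to match):

```xml
<!-- schema.xml sketch: a CJK-analyzed copy of the title field -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="title_cjk" type="text_cjk" indexed="true" stored="true"/>
<copyField source="title" dest="title_cjk"/>
```

With a generic out-of-the-box schema, Chinese text is typically indexed as whole undelimited runs, so individual Chinese query words never match; encoding the query correctly is necessary but not sufficient.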
Re: Concatenate multiple tokens into one
Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention. Best Nick On 11 Nov 2010, at 18:13, Robert Gründler wrote: [earlier messages quoted in full, snipped]
Re: Concatenate multiple tokens into one
this is the full source code, but be warned, i'm not a java developer, and i have no background in lucene/solr development:

// ConcatFilter
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ConcatFilter extends TokenFilter {

    protected ConcatFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next() throws IOException {
        Token token = new Token();
        StringBuilder builder = new StringBuilder();
        TermAttribute termAttribute = (TermAttribute) input.getAttribute(TermAttribute.class);
        TypeAttribute typeAttribute = (TypeAttribute) input.getAttribute(TypeAttribute.class);
        boolean hasToken = false;
        while (input.incrementToken()) {
            if (typeAttribute.type().equals("word")) {
                builder.append(termAttribute.term());
                hasToken = true;
            }
        }
        if (hasToken) {
            token.setTermBuffer(builder.toString());
            return token;
        }
        return null;
    }
}

// ConcatFilterFactory
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ConcatFilterFactory extends BaseTokenFilterFactory {
    @Override
    public TokenStream create(TokenStream stream) {
        return new ConcatFilter(stream);
    }
}

and in your schema.xml, you can simply add the filterfactory using this element:

<filter class="com.example.ConcatFilterFactory"/>

Jar files i have included in the buildpath (can be found in the solr download package): apache-solr-core-1.4.1.jar, lucene-analyzers-2.9.3.jar, lucene-core-2.9.3.jar. good luck ;) -robert On Nov 11, 2010, at 8:45 PM, Nick Martin wrote: Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention.
Best Nick On 11 Nov 2010, at 18:13, Robert Gründler wrote: [earlier messages quoted in full, snipped]
Re: EdgeNGram relevancy
On 12 Nov 2010, at 01:46, Ahmet Arslan iori...@yahoo.com wrote: This setup now causes trouble with stopwords; here's an example: Let's say the index contains 2 strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list. Query: edgytext:Mr Scorsese OR edgytext2:"Mr Scorsese"^2.0 This way, the only result I get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. Any idea why "Martin Scorsese" is not in the result at all in this case? Did you run your query without using the () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 If not, can you paste the output of debugQuery=on? This would still not deal with the problem of removing stop words from the indexing and query analysis stages. I really need something that will allow that and give a single token, as in the example below. Best Nick
Re: Retrieving indexed content containing multiple languages
I look forward to the answers to this one. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Tod listac...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, November 11, 2010 11:35:23 AM Subject: Retrieving indexed content containing multiple languages My Solr corpus is currently created by indexing metadata from a relational database, as well as content pointed to by URLs from the database. I'm using a pretty generic out-of-the-box Solr schema. The search results are presented via an AJAX-enabled HTML page. When I perform a search, the document title (for example) has a mix of English and Chinese characters. Everything there is fine - I can see the English and Chinese returned from a facet query on title. I can search against the title using English words it contains and I get back an expected result. I asked a Chinese friend to perform the same search using Chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language; I'll probably need to support more in the future. My thought is that the Chinese characters are indexed as their Unicode equivalents, so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in English. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
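For Chinese titles, a dedicated field type using the CJK tokenizer that ships with Solr is the usual starting point. A minimal sketch; the field and type names (text_cjk, title_cjk) and the copyField wiring are made up for illustration, not from Tod's actual schema:

```xml
<!-- Bigram-style CJK analysis; solr.CJKTokenizerFactory ships with Solr -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="title_cjk" type="text_cjk" indexed="true" stored="true"/>
<copyField source="title" dest="title_cjk"/>
```

Queries would then search both title and title_cjk, with the query string URL-encoded as UTF-8 as Tod suspects.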
Re: EdgeNGram relevancy
Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"? "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results? Thanks. --- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first. --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example: Query String: "Bill Cl" Result (in that order): - Clyde Phillips - Clay Rogers - Roger Cloud - Bill Clinton "Bill Clinton" should have the highest rank in that case.
Has anyone an idea how to to configure this fieldtype to make matches in both tokens rank higher than those who match in either token? thanks! -robert
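The second, strict field suggested in the reply above could look roughly like the following. This is a sketch only: the names edgytext2/title_edgy2 and the copyField wiring are assumptions for illustration, not part of Robert's posted schema:

```xml
<fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- KeywordTokenizer emits the whole input as a single token, so the
         grams are prefixes of the full title, e.g. "bill cl" for "Bill Clinton" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_edgy2" type="edgytext2" indexed="true" stored="false"/>
<copyField source="title" dest="title_edgy2"/>
```

Boosting this field, as in edgytext2:"Bill Cl"^2.0, then ranks whole-prefix matches such as "Bill Clinton" above single-word matches.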
Re: problem with wildcard
I'm having some trouble with a query using some wildcards, and I was wondering if anyone could tell me why these two similar queries do not return the same number of results. Basically, the query I'm making should return all docs whose title starts with (or contains) the string lowe'. I suspect some analyzer is causing this behaviour and I'd like to know if there is a way to fix this problem. 1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0 wildcard queries are not analyzed http://search-lucene.com/m/pnmlH14o6eM1/
Re: EdgeNGram relevancy
according to the fieldtype I posted previously, I think it's because of: 1. WhiteSpaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips" 2. EdgeNGramFilter gets the 2 tokens, and creates an EdgeNGram for each token: "C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ... The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer. This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl". -robert On Nov 11, 2010, at 21:34 , Andy wrote: Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"? "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results? Thanks. --- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first.
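The gram expansion Robert describes can be sketched in a few lines of plain Java. This mimics what EdgeNGramFilterFactory with minGramSize=1/maxGramSize=25 produces per token; it is an illustration, not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of EdgeNGramFilter: each whitespace token is expanded into all
// of its prefixes, so the query token "cl" matches a gram of "clyde" even
// though "bill" contributes no matching gram at all.
public class EdgeNGramDemo {
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "clyde" -> [c, cl, cly, clyd, clyde]: the query token "cl" matches
        System.out.println(edgeNGrams("clyde", 1, 25).contains("cl")); // true
        // "bill" -> [b, bi, bil, bill]: no gram matches "cl"
        System.out.println(edgeNGrams("bill", 1, 25).contains("cl"));  // false
    }
}
```

With the default OR operator, that single matching gram is enough to pull "Clyde Phillips" into the result set.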
--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion): fieldType name=edgytext class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=25 / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / /analyzer /fieldType This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example: Query String: Bill Cl Result (in that order): - Clyde Phillips - Clay Rogers - Roger Cloud - Bill Clinton Bill Clinton should have the highest rank in that case. Has anyone an idea how to to configure this fieldtype to make matches in both tokens rank higher than those who match in either token? thanks! -robert
FAST ESP - Solr migration webinar
We're holding a free webinar on migration from FAST to Solr. Details below. -Yonik http://www.lucidimagination.com = Solr To The Rescue: Successful Migration From FAST ESP to Open Source Search Based on Apache Solr Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT) Hosted by SearchDataManagement.com For anyone concerned about the future of their FAST ESP applications since the purchase of Fast Search and Transfer by Microsoft in 2008, this webinar will provide valuable insights on making the switch to Solr. A three-person roundtable will discuss factors driving the need for FAST ESP alternatives, differences between FAST and Solr, a typical migration project lifecycle methodology, complementary open source tools, best practices, customer examples, and recommended next steps. The speakers for this webinar are: Helge Legernes, Founding Partner & CTO of Findwise; Michael McIntosh, VP Search Solutions for TNR Global; Eric Gaumer, Chief Architect for ESR Technology. For more information and to register, please go to: http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2 =
Re: problem with wildcard
On 2010-11-11, at 3:45 PM, Ahmet Arslan wrote: I'm having some trouble with a query using some wildcards, and I was wondering if anyone could tell me why these two similar queries do not return the same number of results. Basically, the query I'm making should return all docs whose title starts with (or contains) the string lowe'. I suspect some analyzer is causing this behaviour and I'd like to know if there is a way to fix this problem. 1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0 wildcard queries are not analyzed http://search-lucene.com/m/pnmlH14o6eM1/ Yeah, I found out about this a couple of minutes after I posted my problem. If there is no analyzer, then why is Solr not finding any documents when a single quote precedes the wildcard?
facet+shingle in autosuggest
Hi, I am using a facet.prefix search with shingles in my autosuggest:

<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
  </analyzer>
</fieldType>

Now I would like to prevent stop words from appearing in the suggestions:

<lst name="autosuggest_shingle">
  <int name="member states">52</int>
  <int name="member states experiencing">6</int>
  <int name="member states in">6</int>
  <int name="member states the">5</int>
  <int name="member states to">25</int>
  <int name="member states with">7</int>
</lst>

Here I would really like to filter out the last 4 suggestions. Is there a way I can sensibly bring in a stop word filter here? Actually, in theory the stop words could appear as the first or second word as well. So I guess when producing shingles I want to skip any stop word from being part of any shingle. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: problem with wildcard
select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0 wildcard queries are not analyzed http://search-lucene.com/m/pnmlH14o6eM1/ Yeah, I found out about this a couple of minutes after I posted my problem. If there is no analyzer, then why is Solr not finding any documents when a single quote precedes the wildcard? Probably your index analyzer (WordDelimiterFilterFactory) is eating that single quote. You can verify this on the admin/analysis.jsp page. In other words, there is no term beginning with lowe' in your index. You can try searching for just lowe*
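Ahmet's point can be seen with a toy model of what a word-delimiting analyzer does to the apostrophe at index time. This is plain Java, not Lucene code, and the split rule is a deliberate simplification of WordDelimiterFilter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified model: the index analyzer splits on non-alphanumerics, so a
// stored title like "Lowe's" yields the terms "lowe" and "s". Wildcard
// queries are NOT analyzed, so the raw prefix "lowe'" is compared directly
// against those terms and matches nothing.
public class WildcardDemo {
    static List<String> indexTerms(String input) {
        return Arrays.stream(input.toLowerCase().split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    static boolean prefixMatches(List<String> terms, String rawPrefix) {
        return terms.stream().anyMatch(t -> t.startsWith(rawPrefix));
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("Lowe's Home Improvement");
        System.out.println(prefixMatches(terms, "lowe'")); // false: no term keeps the quote
        System.out.println(prefixMatches(terms, "lowe"));  // true
    }
}
```

This is why searching lowe* works while lowe'* returns nothing.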
Re: EdgeNGram relevancy
Ah I see. Thanks for the explanation. Could you set the defaultOperator to AND? That way both "Bill" and "Cl" must match, and that would exclude "Clyde Phillips". --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: Re: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 3:51 PM according to the fieldtype I posted previously, I think it's because of: 1. WhiteSpaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips" 2. EdgeNGramFilter gets the 2 tokens, and creates an EdgeNGram for each token: "C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ... The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer. This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl". -robert On Nov 11, 2010, at 21:34 , Andy wrote: Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"? "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results? Thanks. --- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first.
--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion): fieldType name=edgytext class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=25 / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / /analyzer /fieldType This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example: Query String: Bill Cl Result (in that order): - Clyde Phillips - Clay Rogers - Roger Cloud - Bill Clinton Bill Clinton should have the highest rank in that case. Has anyone an idea how to to configure this fieldtype to make matches in both tokens rank higher than those who match in either token? thanks! -robert
Re: WELCOME to solr-user@lucene.apache.org
There's not much to go on here. Boosting works, and index time as opposed to query time boosting addresses two different needs. Could you add some detail? All you've really said is it didn't work, which doesn't allow a very constructive response. Perhaps you could review: http://wiki.apache.org/solr/HowToContribute Best Erick On Thu, Nov 11, 2010 at 10:32 AM, Solr User solr...@gmail.com wrote: Hi, I have a question about boosting. I have the following fields in my schema.xml: 1. title 2. description 3. ISBN etc I want to boost the field title. I tried index time boosting but it did not work. I also tried Query time boosting but with no luck. Can someone help me on how to implement boosting on a specific field like title? Thanks, Solr User
Re: WELCOME to solr-user@lucene.apache.org
Erick, Thank you so much for the reply, and apologies for not providing all the details. The following are the field definitions in my schema.xml:

<field name="title" type="string" indexed="true" stored="true" omitNorms="false"/>
<field name="author" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="authortype" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="isbn13" type="string" indexed="true" stored="true"/>
<field name="isbn10" type="string" indexed="true" stored="true"/>
<field name="material" type="string" indexed="true" stored="true"/>
<field name="pubdate" type="string" indexed="true" stored="true"/>
<field name="pubyear" type="string" indexed="true" stored="true"/>
<field name="reldate" type="string" indexed="false" stored="true"/>
<field name="format" type="string" indexed="true" stored="true"/>
<field name="pages" type="string" indexed="false" stored="true"/>
<field name="desc" type="string" indexed="true" stored="true"/>
<field name="series" type="string" indexed="true" stored="true"/>
<field name="season" type="string" indexed="true" stored="true"/>
<field name="imprint" type="string" indexed="true" stored="true"/>
<field name="bisacsub" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="bisacstatus" type="string" indexed="false" stored="true"/>
<field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="award" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="age" type="string" indexed="true" stored="true"/>
<field name="reading" type="string" indexed="true" stored="true"/>
<field name="grade" type="string" indexed="true" stored="true"/>
<field name="path" type="string" indexed="false" stored="true"/>
<field name="shortdesc" type="string" indexed="true" stored="true"/>
<field name="subtitle" type="string" indexed="true" stored="true" omitNorms="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="searchFields" type="textSpell" indexed="true" stored="true" multiValued="true" omitNorms="true"/>

Copy Fields:

<copyField source="title" dest="searchFields"/>
<copyField source="author" dest="searchFields"/>
<copyField source="isbn13" dest="searchFields"/>
<copyField source="isbn10" dest="searchFields"/>
<copyField source="format" dest="searchFields"/>
<copyField source="series" dest="searchFields"/>
<copyField source="season" dest="searchFields"/>
<copyField source="imprint" dest="searchFields"/>
<copyField source="bisacsub" dest="searchFields"/>
<copyField source="category" dest="searchFields"/>
<copyField source="award" dest="searchFields"/>
<copyField source="shortdesc" dest="searchFields"/>
<copyField source="desc" dest="searchFields"/>
<copyField source="subtitle" dest="searchFields"/>

<defaultSearchField>searchFields</defaultSearchField>

Before creating the indexes I feed an XML file to the Solr job to create index files. I added a boost attribute to the title field before creating indexes; an example is below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add><doc>
<field name="material">1785440</field>
<field boost="10.0" name="title">Each Little Bird That Sings</field>
<field name="price">16.0</field>
<field name="isbn10">0152051139</field>
<field name="isbn13">9780152051136</field>
<field name="format">Hardcover</field>
<field name="pubdate">2005-03-01</field>
<field name="pubyear">2005</field>
<field name="reldate">2005-02-22</field>
<field name="pages">272</field>
<field name="bisacstatus">Active</field>
<field name="season">Spring 2005</field>
<field name="imprint">Children's</field>
<field name="age">8.0-12.0</field>
<field name="grade">3-6</field>
<field name="author">Marla Frazee</field>
<field name="authortype">Jacket Illustrator</field>
<field name="author">Deborah Wiles</field>
<field name="authortype">Author</field>
<field name="bisacsub">Social Issues/Friendship</field>
<field name="bisacsub">Social Issues/General (see also headings under Family)</field>
<field name="bisacsub">General</field>
<field name="bisacsub">Girls &amp; Women</field>
<field name="category">Fiction/Middle Grade</field>
<field name="category">Fiction/Award Winners</field>
<field name="category">Coming of Age</field>
<field name="category">Social Situations/Death &amp; Dying</field>
<field name="category">Social Situations/Friendship</field>
<field name="path">/assets/product/0152051139.gif</field>
<field name="desc">&lt;div&gt;Ten-year-old Comfort Snowberger has attended 247 funerals. But that's not surprising, considering that her family runs the town funeral home. And even though Great-uncle Edisto keeled over with a heart attack and Great-great-aunt Florentine dropped dead--just like that--six months later, Comfort knows how to deal with loss, or so she thinks. She's more concerned with avoiding her crazy cousin Peach and trying to figure out why her best friend, Declaration, suddenly won't talk to her. Life is full of surprises. And the biggest one of all is learning what it takes to handle them.&lt;br&gt; &lt;br&gt;Deborah Wiles has created a unique, funny, and utterly real cast of characters in this heartfelt, and quintessentially Southern coming-of-age novel. Comfort will charm young readers with her wit, her warmth, and her struggles as she learns about life, loss, and ultimately, triumph.&lt;br&gt;&lt;/div&gt;</field>
<field name="shortdesc">Ten-year-old Comfort Snowberger
Re: WELCOME to solr-user@lucene.apache.org
There are several mistakes in your approach: copyField just copies data. Index time boost is not copied. There is no such boosting syntax. /select?q=Eachtitle^9fl=score You are searching on your default field. This is not your cause of your problem but omitNorms=true disables index time boosts. http://wiki.apache.org/solr/DisMaxQParserPlugin can satisfy your need. --- On Thu, 11/11/10, Solr User solr...@gmail.com wrote: From: Solr User solr...@gmail.com Subject: Re: WELCOME to solr-user@lucene.apache.org To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 11:54 PM Eric, Thank you so much for the reply and apologize for not providing all the details. The following are the field definitons in my schema.xml: field name=title type=string indexed=true stored=true omitNorms=false / field name=author type=string indexed=true stored=true multiValued=true omitNorms=true / field name=authortype type=string indexed=true stored=true multiValued=true omitNorms=true / field name=isbn13 type=string indexed=true stored=true / field name=isbn10 type=string indexed=true stored=true / field name=material type=string indexed=true stored=true / field name=pubdate type=string indexed=true stored=true / field name=pubyear type=string indexed=true stored=true / field name=reldate type=string indexed=false stored=true / field name=format type=string indexed=true stored=true / field name=pages type=string indexed=false stored=true / field name=desc type=string indexed=true stored=true / field name=series type=string indexed=true stored=true / field name=season type=string indexed=true stored=true / field name=imprint type=string indexed=true stored=true / field name=bisacsub type=string indexed=true stored=true multiValued=true omitNorms=true / field name=bisacstatus type=string indexed=false stored=true / field name=category type=string indexed=true stored=true multiValued=true omitNorms=true / field name=award type=string indexed=true stored=true multiValued=true 
omitNorms=true / field name=age type=string indexed=true stored=true / field name=reading type=string indexed=true stored=true / field name=grade type=string indexed=true stored=true / field name=path type=string indexed=false stored=true / field name=shortdesc type=string indexed=true stored=true / field name=subtitle type=string indexed=true stored=true omitNorms=true/ field name=price type=float indexed=true stored=true/ field name=searchFields type=textSpell indexed=true stored=true multiValued=true omitNorms=true/ Copy Fields: copyField source=title dest=searchFields/ copyField source=author dest=searchFields/ copyField source=isbn13 dest=searchFields/ copyField source=isbn10 dest=searchFields/ copyField source=format dest=searchFields/ copyField source=series dest=searchFields/ copyField source=season dest=searchFields/ copyField source=imprint dest=searchFields/ copyField source=bisacsub dest=searchFields/ copyField source=category dest=searchFields/ copyField source=award dest=searchFields/ copyField source=shortdesc dest=searchFields/ copyField source=desc dest=searchFields/ copyField source=subtitle dest=searchFields/ defaultSearchFieldsearchFields/defaultSearchField Before creating the indexes I feed XML file to the Solr job to create index files. 
I added Boost attribute to the title field before creating indexes and an example is below: ?xml version=1.0 encoding=UTF-8 standalone=no?adddocfield name=material1785440/fieldfield boost=10.0 name=titleEach Little Bird That Sings/fieldfield name=price16.0/fieldfield name=isbn100152051139/fieldfield name=isbn139780152051136/fieldfield name=formatHardcover/fieldfield name=pubdate2005-03-01/fieldfield name=pubyear2005/fieldfield name=reldate2005-02-22/fieldfield name=pages272/fieldfield name=bisacstatusActive/fieldfield name=seasonSpring 2005/fieldfield name=imprintChildren's/fieldfield name=age8.0-12.0/fieldfield name=grade3-6/fieldfield name=authorMarla Frazee/fieldfield name=authortypeJacket Illustrator/fieldfield name=authorDeborah Wiles/fieldfield name=authortypeAuthor/fieldfield name=bisacsubSocial Issues/Friendship/fieldfield name=bisacsubSocial Issues/General (see also headings under Family)/fieldfield name=bisacsubGeneral/fieldfield name=bisacsubGirls amp; Women/fieldfield name=categoryFiction/Middle Grade/fieldfield name=categoryFiction/Award Winners/fieldfield name=categoryComing of Age/fieldfield name=categorySocial Situations/Death amp; Dying/fieldfield name=categorySocial Situations/Friendship/fieldfield name=path/assets/product/0152051139.gif/fieldfield name=desclt;divgt;Ten-year-old Comfort Snowberger has attended 247 funerals. But that's not surprising, considering that her family runs the town funeral home. And even though Great-uncle Edisto
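To make the dismax pointer above concrete: with the dismax query parser, the title boost moves into the qf request parameter at query time instead of index-time boost attributes on the documents. A hypothetical request against this schema (request handler defaults and the sample query terms are assumed):

```
/select?defType=dismax&q=each little bird&qf=title^10 searchFields&fl=title,score
```

defType, q, qf, and fl are standard Solr/dismax parameters; the ^10 is a query-time boost on the title field, which also requires norms on that field (omitNorms="false") to influence scoring as expected.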
Re: facet+shingle in autosuggest
I don't know all the implications here, but can't you just insert the StopwordFilterFactory before the ShingleFilterFactory and turn it loose? Best Erick On Thu, Nov 11, 2010 at 4:02 PM, Lukas Kahwe Smith m...@pooteeweet.orgwrote: Hi, I am using a facet.prefix search with shingle's in my autosuggest: fieldType name=shingle class=solr.TextField positionIncrementGap=100 stored=false multiValued=true analyzer tokenizer class=solr.StandardTokenizerFactory / filter class=solr.LowerCaseFilterFactory / filter class=solr.RemoveDuplicatesTokenFilterFactory/ filter class=solr.ShingleFilterFactory maxShingleSize=3 outputUnigrams=true outputUnigramIfNoNgram=false / /analyzer /fieldType Now I would like to prevent stop words to appear in the suggestions: lst name=autosuggest_shingle int name=member states52/int int name=member states experiencing6/int int name=member states in6/int int name=member states the5/int int name=member states to25/int int name=member states with7/int /lst Here I would like to filter out the last 4 suggestions really. Is there a way I can sensibly bring in a stop word filter here? Actually in theory the stop words could appear as the first or second word as well. So I guess when producing shingle's I want to skip any stop word from being part of any shingle. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: using CJKTokenizerFactory for Japanese language
(10/11/12 1:49), Kumar Pandey wrote: I am exploring support for the Japanese language in Solr. Solr seems to provide CJKTokenizerFactory. How useful is this module? Has anyone been using this in production for the Japanese language? CJKTokenizer is used in a lot of places in Japan. One shortfall it seems to have, from what I have been able to read up on, is that it can generate a lot of false matches, for example matching "kyoto" when searching for "tokyo", etc. Yep, it is a well-known problem. I did not see many questions related to this module, so I wonder if people are actively using it. If not, are there any other solutions on the market that are recommended by Solr users? You may want to look at morphological analyzers. There are some of them in Japan. Search for MeCab, Sen, or GoSen on Google. Or in Lucene, there is a patch for a morphological-taste analyzer: https://issues.apache.org/jira/browse/LUCENE-2522 Koji -- http://www.rondhuit.com/en/
Re: facet+shingle in autosuggest
On 11.11.2010, at 17:42, Erick Erickson wrote: I don't know all the implications here, but can't you just insert the StopwordFilterFactory before the ShingleFilterFactory and turn it loose? I haven't tried this, but I would suspect that I would then get in trouble with phrases like "united states of america": it would then generate a shingle "united states america", which in turn wouldn't generate a proper phrase search string. One option of course would be to restrict the shingles to 2 words; then using the stop word filter would work as expected. regards, Lukas Kahwe Smith m...@pooteeweet.org
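Lukas's concern can be illustrated outside Solr. The following is a toy sketch in plain Java (not Lucene's ShingleFilter) of what happens when stop words are removed before shingling; the stopword set here is an assumption for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Toy illustration: removing stop words *before* building 3-word shingles
// produces a phrase like "united states america" that never occurs in the
// original text, so it cannot be used as a phrase query later.
public class ShingleDemo {
    static final Set<String> STOPWORDS = Set.of("of", "the", "in", "to", "with");

    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(Arrays.asList("united", "states", "of", "america"));
        tokens.removeIf(STOPWORDS::contains); // stop word removal first...
        System.out.println(shingles(tokens, 3)); // ...yields [united states america]
    }
}
```

Restricting maxShingleSize to 2, as Lukas suggests, avoids the problem because no shingle can bridge a removed middle word and also contain a third term.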
Re: EdgeNGram relevancy
Did you run your query without using the () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example? thanks! -robert
Re: EdgeNGram relevancy
Without the parens, the edgytext: only applied to "Mr"; the default field still applied to "Scorsese". The double quotes are necessary in the second case (rather than parens) because, on a non-tokenized field, the standard query parser would otherwise pre-tokenize on whitespace before sending the individual whitespace-separated words to match against the index. If the index includes multi-word tokens with internal whitespace, those would never match. With the double quotes, the standard query parser doesn't pre-tokenize like this; it passes the whole phrase to the index intact. Robert Gründler wrote: Did you run your query without using the () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example? thanks! -robert
Re: WELCOME to solr-user@lucene.apache.org
Hi, If you are looking for query time boosting on title field you can do the following: /select?q=title:android^10 Also unless you have a very good reason to use string for date data (in your case pubdate and reldate), you should be using solr.DateField. regards, Ram On Fri, Nov 12, 2010 at 3:41 AM, Ahmet Arslan iori...@yahoo.com wrote: There are several mistakes in your approach: copyField just copies data. Index time boost is not copied. There is no such boosting syntax. /select?q=Eachtitle^9fl=score You are searching on your default field. This is not your cause of your problem but omitNorms=true disables index time boosts. http://wiki.apache.org/solr/DisMaxQParserPlugin can satisfy your need. --- On Thu, 11/11/10, Solr User solr...@gmail.com wrote: From: Solr User solr...@gmail.com Subject: Re: WELCOME to solr-user@lucene.apache.org To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 11:54 PM Eric, Thank you so much for the reply and apologize for not providing all the details. 
The following are the field definitions in my schema.xml:

<field name="title" type="string" indexed="true" stored="true" omitNorms="false" />
<field name="author" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="authortype" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="isbn13" type="string" indexed="true" stored="true" />
<field name="isbn10" type="string" indexed="true" stored="true" />
<field name="material" type="string" indexed="true" stored="true" />
<field name="pubdate" type="string" indexed="true" stored="true" />
<field name="pubyear" type="string" indexed="true" stored="true" />
<field name="reldate" type="string" indexed="false" stored="true" />
<field name="format" type="string" indexed="true" stored="true" />
<field name="pages" type="string" indexed="false" stored="true" />
<field name="desc" type="string" indexed="true" stored="true" />
<field name="series" type="string" indexed="true" stored="true" />
<field name="season" type="string" indexed="true" stored="true" />
<field name="imprint" type="string" indexed="true" stored="true" />
<field name="bisacsub" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="bisacstatus" type="string" indexed="false" stored="true" />
<field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="award" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="age" type="string" indexed="true" stored="true" />
<field name="reading" type="string" indexed="true" stored="true" />
<field name="grade" type="string" indexed="true" stored="true" />
<field name="path" type="string" indexed="false" stored="true" />
<field name="shortdesc" type="string" indexed="true" stored="true" />
<field name="subtitle" type="string" indexed="true" stored="true" omitNorms="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="searchFields" type="textSpell" indexed="true" stored="true" multiValued="true" omitNorms="true"/>

Copy Fields:

<copyField source="title" dest="searchFields"/>
<copyField source="author" dest="searchFields"/>
<copyField source="isbn13" dest="searchFields"/>
<copyField source="isbn10" dest="searchFields"/>
<copyField source="format" dest="searchFields"/>
<copyField source="series" dest="searchFields"/>
<copyField source="season" dest="searchFields"/>
<copyField source="imprint" dest="searchFields"/>
<copyField source="bisacsub" dest="searchFields"/>
<copyField source="category" dest="searchFields"/>
<copyField source="award" dest="searchFields"/>
<copyField source="shortdesc" dest="searchFields"/>
<copyField source="desc" dest="searchFields"/>
<copyField source="subtitle" dest="searchFields"/>

<defaultSearchField>searchFields</defaultSearchField>

Before creating the indexes I feed an XML file to the Solr job to create the index files. I added a boost attribute to the title field before creating the indexes; an example is below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><add><doc>
<field name="material">1785440</field>
<field boost="10.0" name="title">Each Little Bird That Sings</field>
<field name="price">16.0</field>
<field name="isbn10">0152051139</field>
<field name="isbn13">9780152051136</field>
<field name="format">Hardcover</field>
<field name="pubdate">2005-03-01</field>
<field name="pubyear">2005</field>
<field name="reldate">2005-02-22</field>
<field name="pages">272</field>
<field name="bisacstatus">Active</field>
<field name="season">Spring 2005</field>
<field name="imprint">Children's</field>
<field name="age">8.0-12.0</field>
<field name="grade">3-6</field>
<field name="author">Marla Frazee</field>
<field name="authortype">Jacket Illustrator</field>
<field name="author">Deborah Wiles</field>
<field name="authortype">Author</field>
<field name="bisacsub">Social Issues/Friendship</field>
<field name="bisacsub">Social Issues/General (see also headings under Family)</field>
<field name="bisacsub">General</field>
<field name="bisacsub">Girls Women</field>
<field name="category">Fiction/Middle Grade</field>
<field name="category">Fiction/Award Winners</field>
<field name="category">Coming of Age</field>
<field name="category">Social Situations/Death Dying</field>
<field
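Following Ram's suggestion, the two date fields could be switched from string to solr.DateField — a sketch (the "date" fieldType shown is the one declared in the example schema.xml shipped with Solr 1.4; note that DateField values must be full ISO-8601 timestamps, so a value like 2005-03-01 would need to be fed as 2005-03-01T00:00:00Z):

```xml
<!-- fieldType as declared in the example schema.xml shipped with Solr 1.4 -->
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

<!-- the two date fields from the schema above, retyped -->
<field name="pubdate" type="date" indexed="true" stored="true" />
<field name="reldate" type="date" indexed="false" stored="true" />
```

With a real date type, range queries like pubdate:[2005-01-01T00:00:00Z TO 2005-12-31T23:59:59Z] sort and filter chronologically instead of lexically.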
Best practices to rebuild index on live system
Hi again, we're coming closer to the rollout of our newly created solr/lucene based search, and i'm wondering how people handle changes to their schema on live systems. In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 hours for a full dataimport from the relational database. The Index is being updated in realtime, through post insert/update/delete events in our ORM. So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. Does Solr provide any built-in approaches to this problem? best -robert
Re: Best practices to rebuild index on live system
You can do a similar thing to your case #1 with Solr replication, handling a lot of the details for you instead of you manually switching cores and such. Index to a new core, then tell your production solr to be a slave replicating from that master new core. It still may have some of the same downsides as your scenario #1, it's essentially the same thing, but with Solr replication taking care of some of the nuts and bolts for you. I haven't heard of any better solutions. In general, Solr seems not really so great at use cases where the index changes frequently in response to user actions, it doesn't seem to really have been designed that way. You could store all your user-created data in an external store (rdbms or no-sql), as well as indexing it, and then when you rebuild the index you can get it all from there, so you won't lose any. It seems to often work best, getting along with Solr's assumptions, to avoid considering a Solr index ever the canonical storage location of any data -- Solr isn't really designed to be storage, it's designed to be an index. Always have the canonical storage location of any data being some actual store, with Solr just being an index. That approach tends to make it easier to work out things like this, although there can still be some tricks. (Like, after you're done building your new index, but before you replicate it to production, you might have to check the actual canonical store for any data that changed in between the time you started your re-index and now -- and then re-index that. And then any data that changed between the time your second re-index began and... this could go on forever. ) Robert Gründler wrote: Hi again, we're coming closer to the rollout of our newly created solr/lucene based search, and i'm wondering how people handle changes to their schema on live systems. In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 hours for a full dataimport from the relational database.
The Index is being updated in realtime, through post insert/update/delete events in our ORM. So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. Does Solr provide any built-in approaches to this problem? best -robert
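Jonathan's replicate-from-a-build-core idea might look roughly like this in solrconfig.xml (a sketch only: the host name, core name, and poll interval are placeholders; the master section goes on the rebuild core, the slave section on the production core):

```xml
<!-- on the rebuild/master core (e.g. A1) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- publish a new index version after optimize -->
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

<!-- on the production/slave core (e.g. A) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://buildhost:8983/solr/coreA1/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

Once the build finishes, the slave pulls the new index on its next poll, so production never serves a half-built index.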
Re: Best practices to rebuild index on live system
If by corrupt index you mean an index that's just not quite up to date, could you do a delta import? In other words, how do you make your Solr index reflect changes to the DB even without a schema change? Could you extend that method to handle your use case? So the scenario is something like this: record the time, rebuild the index, import all changes since you recorded the original time, then switch cores or replicate. Best Erick 2010/11/11 Robert Gründler rob...@dubture.com Hi again, we're coming closer to the rollout of our newly created solr/lucene based search, and i'm wondering how people handle changes to their schema on live systems. In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 hours for a full dataimport from the relational database. The Index is being updated in realtime, through post insert/update/delete events in our ORM. So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. Does Solr provide any built-in approaches to this problem? best -robert
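The record-the-time-then-import-changes step Erick describes is what DIH's delta-import automates. A sketch of the relevant entity in data-config.xml (the table and column names here are made up for illustration; ${dataimporter.last_index_time} is tracked by DIH itself in dataimport.properties):

```xml
<entity name="book"
        query="SELECT id, title FROM book"
        deltaQuery="SELECT id FROM book
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM book
                          WHERE id = '${dataimporter.delta.id}'">
</entity>
```

After the full rebuild on the new cores, a /dataimport?command=delta-import picks up whatever changed in the database during the 1.5-hour build window.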
Re: Spatial search in Solr 1.5
I just upgraded to a later version of the trunk and noticed my geofilter queries stopped working, apparently because the sfilt function was renamed to geofilt. I realize trunk is not stable, but other than looking at every change, is there an easy way to find changes that are not backward compatible so developers know what they need to update when upgrading? Thanks, Scott On Tue, Oct 12, 2010 at 17:42, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Oct 12, 2010 at 8:07 PM, PeterKerk vettepa...@hotmail.com wrote: Ok, so does this actually say: for now you have to do calculations based on bounding box instead of great circle? I tried to make the documentation a little simpler... there's - geofilt... filters within a radius of d km (i.e. great circle distance) - bbox... filters using a bounding box - geodist... function query that yields the distance (again, great circle distance) If you point out the part of the docs you found confusing, I can try and improve it. Did you try and step through the quick start? Those links actually work! And the fact that on top of the page it says Solr4.0, does that imply I can't use this right now? Or where could I find the latest trunk for this? The wiki says If you haven't already, get a recent nightly build of Solr4.0... and links to the Solr4.0 page, which points to http://wiki.apache.org/solr/FrontPage#solr_development for nightly builds. -Yonik http://www.lucidimagination.com
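For reference, the renamed filter is invoked like this on trunk (a sketch: the field name store and the point/radius values are placeholders, and sfield must be a spatial field type):

```
fq={!geofilt sfield=store pt=45.15,-93.85 d=5}
fq={!bbox sfield=store pt=45.15,-93.85 d=5}
```

Queries still using {!sfilt ...} need only the parser name changed; the parameters are the same.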
Re: index just new articles from rss feeds - Data Import Request Handler
On Thu, Nov 11, 2010 at 8:21 AM, Matteo Moci mox...@gmail.com wrote: Hello, I'd like to use solr to index some documents coming from an rss feed, like the example at [1], but it seems that the configuration used there is just for a one-time indexing, trying to get all the articles exposed in the rss feed of the website. Is it possible to manage and index just the new articles coming from the rss source? Each item in an RSS feed has a publishing date which you can use to ingest only the new articles. I found that maybe the delta-import can be useful but, from what I understand, the delta-import is used to just update the index with contents of documents that have been modified since the last indexing: this is obviously useful, but I'd like to index just the new articles coming from an rss feed. Is it something managed automatically by solr or I have to deal with it in a separate way? Maybe a full import with clean=false parameters? Are there any solutions that you would suggest? Maybe storing the article feeds in a table like [2] and have a module that periodically sends each row to solr for indexing it? The RSS import example is more of a proof-of-concept that it can be done, it may not be the best way to do it though. Storing the article feeds in a table is essential if you have multiple ones. You can use a parent entity for the table and a child entity to make the actual http calls to the RSS. Be sure to use onError=continue so that a bad RSS feed does not stop the whole process. It will probably work fine for a handful of feeds but if you are looking to develop a large feed ingestion system, I'd suggest looking into alternate methods. -- Regards, Shalin Shekhar Mangar.
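The parent/child layout with onError="continue" that Shalin describes could look something like this in data-config.xml (a sketch: the feeds table, its url column, and the JDBC connection details are all hypothetical):

```xml
<dataConfig>
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/feeds" />
  <dataSource name="web" type="HttpDataSource" />
  <document>
    <!-- parent: one row per registered feed -->
    <entity name="feed" dataSource="db" query="SELECT url FROM feeds">
      <!-- child: fetch and parse each feed; onError='continue' skips a bad
           feed instead of aborting the whole import -->
      <entity name="item" dataSource="web"
              processor="XPathEntityProcessor"
              url="${feed.url}"
              forEach="/rss/channel/item"
              onError="continue">
        <field column="title"   xpath="/rss/channel/item/title" />
        <field column="link"    xpath="/rss/channel/item/link" />
        <field column="pubDate" xpath="/rss/channel/item/pubDate" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Filtering on pubDate (to ingest only new articles, as suggested above) would then happen in a transformer or via clean=false on the import command.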
Re: Boosting
On Thu, Nov 11, 2010 at 10:35 AM, Solr User solr...@gmail.com wrote: Hi, I have a question about boosting. I have the following fields in my schema.xml: 1. title 2. description 3. ISBN etc I want to boost the field title. I tried index time boosting but it did not work. I also tried Query time boosting but with no luck. Can someone help me on how to implement boosting on a specific field like title? If you use index time boosting, you have to restart Solr and re-index the documents after making the change to the schema.xml. For debugging problems with query-time boosting, append debugQuery=on as a request parameter to see the parsed query and scoring information. -- Regards, Shalin Shekhar Mangar.
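Putting the two suggestions together — a query-time boost plus debugQuery=on to inspect the scoring — the request URL can be assembled like this (a sketch: the host, core path, and field name are assumptions, not from the original thread):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class BoostQuery {
    // Build a Solr select URL with a query-time boost on the title field
    // and debug output enabled; the boost marker '^' must be URL-encoded.
    static String buildUrl(String host, String term, int boost) {
        String q = URLEncoder.encode("title:" + term + "^" + boost, StandardCharsets.UTF_8);
        return host + "/select?q=" + q + "&debugQuery=on";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr", "android", 10));
        // -> http://localhost:8983/solr/select?q=title%3Aandroid%5E10&debugQuery=on
    }
}
```

The debugQuery section of the response then shows the parsed query and a per-document score explanation, which makes it obvious whether the boost was applied at all.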
Link to download solr4.0 is not working?
Hello, Does anyone know where to download solr4.0 source? I tried downloading from this page: http://wiki.apache.org/solr/FrontPage#solr_development but the link is not working... Best, Deche
importing from java
Hi, I'm restricted to the following in regards to importing. I have access to a list (Iterator) of Java objects I need to import into solr. Can I import the java objects as part of solr's data import interface (whenever an http request is made to solr to do a dataimport, it'll call my java class to get objects)? Before I had direct read only access to the db and specified the column mappings and things were fine with the data import. But now I am restricted to using a .jar file that has an api to get the records in the database and I need to publish these records in the db. I do see solrj, but solrj is separate from the solr webapp. Can I write my own dataimporthandler? Thanks, Tri
Re: Rollback can't be done after committing?
Hi, Kouta: No data store supports rollback AFTER commit; rollback works only BEFORE. On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote: Hi, all I have a question about Solr and SolrJ's rollback. I try to rollback like below try{ server.addBean(dto); server.commit(); }catch(Exception e){ if (server != null) { server.rollback(); } } I expected that if any Exception is thrown, the rollback runs, so no data would be updated. But once committed, rollback does not work. Does rollback only take effect when the commit has not yet run? Is Solr and SolrJ's rollback system not the same as an RDB's rollback?
Re: Rollback can't be done after committing?
In some cases you can rollback to a named checkpoint. I am not too sure but I think I read in the lucene documentation that it supported named checkpointing. On Thu, Nov 11, 2010 at 7:12 PM, gengshaoguang gengshaogu...@ceopen.cnwrote: Hi, Kouta: Any data store does not support rollback AFTER commit, rollback works only BEFORE. On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote: Hi, all I have a question about Solr and SolrJ's rollback. I try to rollback like below try{ server.addBean(dto); server.commit; }catch(Exception e){ if (server != null) { server.rollback();} } I wonder if any Exception thrown, rollback process is run. so all data would not be updated. but once commited, rollback would not be well done. rollback correctly will be done only when commit process will not? Solr and SolrJ's rollback system is not the same as any RDB's rollback?
A Newbie Question
Hi, Pardon me if this sounds very elementary, but I have a very basic question regarding Solr search. I have about 10 storage devices running Solaris with hundreds of thousands of text files (there are other files, as well, but my target is these text files). The directories on the Solaris boxes are exported and are available as NFS mounts. I have installed Solr 1.4 on a Linux box and have tested the installation, using curl to post documents. However, the manual says that curl is not the recommended way of posting documents to Solr. Could someone please tell me what is the preferred approach in such an environment? I am not a programmer and would appreciate some hand-holding here :o) Thanks in advance, Sesh
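If you end up scripting the NFS mounts yourself, the heart of any poster is turning a file's text into Solr's update-XML before sending it to /update. A minimal, Solr-agnostic sketch (the id and text field names are assumptions borrowed from the example schema; the HTTP POST itself is omitted):

```java
public class UpdateXml {
    // Escape the three characters that are unsafe in XML text content;
    // '&' must be replaced first so earlier escapes aren't double-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrap an id and a body into a single-document <add> request.
    static String toAddXml(String id, String body) {
        return "<add><doc>"
             + "<field name=\"id\">" + escape(id) + "</field>"
             + "<field name=\"text\">" + escape(body) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("doc1", "fish & chips"));
    }
}
```

That said, for hundreds of thousands of files a SolrJ client (which builds and streams these documents for you) is the usual recommendation over per-file curl calls.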
Re: importing from java
another question is, can I write my own DataImportHandler class? thanks, Tri From: Tri Nguyen tringuye...@yahoo.com To: solr user solr-user@lucene.apache.org Sent: Thu, November 11, 2010 7:01:25 PM Subject: importing from java Hi, I'm restricted to the following in regards to importing. I have access to a list (Iterator) of Java objects I need to import into solr. Can I import the java objects as part of solr's data import interface (whenever an http request to solr to do a dataimport, it'll call my java class to get objects)? Before I had direct read only access to the db and specified the column mappings and things were fine with the data import. But now I am restricted to using a .jar file that has an api to get the records in the database and I need to publish these records in the db. I do see solrj and but solrj is seaparate from the solr webapp. Can I write my own dataimporthandler? Thanks, Tri
RE: importing from java
http://wiki.apache.org/solr/DIHQuickStart http://wiki.apache.org/solr/DataImportHandlerFaq http://wiki.apache.org/solr/DataImportHandler -Original Message- From: Tri Nguyen [mailto:tringuye...@yahoo.com] Sent: Thursday, November 11, 2010 9:34 PM To: solr-user@lucene.apache.org Subject: Re: importing from java another question is, can I write my own DataImportHandler class? thanks, Tri From: Tri Nguyen tringuye...@yahoo.com To: solr user solr-user@lucene.apache.org Sent: Thu, November 11, 2010 7:01:25 PM Subject: importing from java Hi, I'm restricted to the following in regards to importing. I have access to a list (Iterator) of Java objects I need to import into solr. Can I import the java objects as part of solr's data import interface (whenever an http request to solr to do a dataimport, it'll call my java class to get objects)? Before I had direct read only access to the db and specified the column mappings and things were fine with the data import. But now I am restricted to using a .jar file that has an api to get the records in the database and I need to publish these records in the db. I do see solrj and but solrj is seaparate from the solr webapp. Can I write my own dataimporthandler? Thanks, Tri
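Whichever route Tri takes — SolrJ's add calls or a custom DataImportHandler EntityProcessor built from the wiki pages above — the Iterator of Java objects is usually drained in fixed-size batches rather than one document per request. A Solr-agnostic sketch of that batching step (the batch size is illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class Batches {
    // Split an iterator into consecutive batches of at most `size` elements,
    // so each batch can become one update request to Solr.
    static <T> List<List<T>> toBatches(Iterator<T> it, int size) {
        List<List<T>> out = new ArrayList<>();
        List<T> current = new ArrayList<>();
        while (it.hasNext()) {
            current.add(it.next());
            if (current.size() == size) {
                out.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> b = toBatches(Arrays.asList(1, 2, 3, 4, 5).iterator(), 2);
        System.out.println(b); // [[1, 2], [3, 4], [5]]
    }
}
```

Each batch would then be passed to something like SolrJ's addBeans, with a commit after the final batch.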
Re: Rollback can't be done after committing?
Oh, Pradeep: I don't think Lucene is an advanced storage app that supports rollback to a historical checkpoint (that would be supported only in a distributed system, such as with two-phase commit or transactional web services) yours On Friday, November 12, 2010 11:25:45 am Pradeep Singh wrote: In some cases you can rollback to a named checkpoint. I am not too sure but I think I read in the lucene documentation that it supported named checkpointing. On Thu, Nov 11, 2010 at 7:12 PM, gengshaoguang gengshaogu...@ceopen.cnwrote: Hi, Kouta: Any data store does not support rollback AFTER commit, rollback works only BEFORE. On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote: Hi, all I have a question about Solr and SolrJ's rollback. I try to rollback like below try{ server.addBean(dto); server.commit; }catch(Exception e){ if (server != null) { server.rollback();} } I wonder if any Exception thrown, rollback process is run. so all data would not be updated. but once commited, rollback would not be well done. rollback correctly will be done only when commit process will not? Solr and SolrJ's rollback system is not the same as any RDB's rollback?
Looking for help with Solr implementation
Hi, Not sure if this is the correct place to post but I'm looking for someone to help finish a Solr install on our LAMP based website. This would be a paid project. The programmer that started the project got too busy with his full-time job to finish the project. Solr has been installed and a basic search is working but we need to configure it to work across the site and also set-up faceted search. I tried posting on some popular freelance sites but haven't been able to find anyone with real Solr expertise / experience. If you think you can help me with this project please let me know and I can supply more details. Regards, Abe
Re: Best practices to rebuild index on live system
On 11/11/2010 4:45 PM, Robert Gründler wrote: So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. I can tell you how we handle this. The actual build system is more complicated than I have mentioned here, involving replication and error handling, but this is the basic idea. This isn't the only possible approach, but it does work. I have 6 main static shards and one incremental shard, each on their own machine (Xen VM, actually). Data is distributed by taking the Did value (primary key in the database) and doing a mod 6 on it; the resulting value is the static shard number. The system tracks two values at all times - minDid and maxDid. The static shards have Did values <= minDid. The incremental has Did values > minDid and <= maxDid. Once an hour, I write the current Did value to an RRD. Once a day, I use that RRD to figure out the Did value corresponding to one week ago. All documents with Did > minDid and <= newMinDid are delta-imported into the static indexes and deleted from the incremental index, and minDid is updated. When it comes time to rebuild, I first rebuild the static indexes in a core named build which takes 5-6 hours. When that's done, I rebuild the incremental in its build core, which only takes about 10 minutes. Then on all the machines, I swap the build and live cores. While all the static builds are happening, the incremental continues to get new content, until it too is rebuilt. Shawn
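Shawn's routing rule is easy to sketch: the static shard a document eventually lives in is Did mod 6, and a document is served from the incremental shard while its Did is still above minDid. A minimal illustration (the numbers are made up for the example):

```java
public class ShardRouting {
    static final int STATIC_SHARDS = 6;

    // Static shard a document will eventually live in: Did mod 6.
    static int shardFor(long did) {
        return (int) (did % STATIC_SHARDS);
    }

    // True while the document should still be served from the incremental
    // shard, i.e. its Did lies in the (minDid, maxDid] window.
    static boolean inIncremental(long did, long minDid, long maxDid) {
        return did > minDid && did <= maxDid;
    }

    public static void main(String[] args) {
        System.out.println(shardFor(1785440));                        // 2
        System.out.println(inIncremental(1785440, 1700000, 1800000)); // true
    }
}
```

The daily maintenance job then just moves the window: raise minDid, delta-import the documents that fell out of the window into their static shards, and delete them from the incremental index.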
Re: Link to download solr4.0 is not working?
On 11/11/2010 7:44 PM, Deche Pangestu wrote: Hello, Does anyone know where to download solr4.0 source? I tried downloading from this page: http://wiki.apache.org/solr/FrontPage#solr_development but the link is not working... Your best bet is to use svn. http://lucene.apache.org/solr/version_control.html For Solr 4.0, you need to check out trunk: http://svn.apache.org/repos/asf/lucene/dev/trunk For Solr 3.1, you'd use branch_3x: http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x Shawn