Re: Date Faceting and Double Counting

2009-09-01 Thread gwk

Hi Stephen,

When I added numerical faceting to my checkout of Solr (SOLR-1240), I 
basically copied date faceting and modified it to work with numbers 
instead of dates. With numbers I got a lot of double-counted values as 
well. To fix my problem I added an extra parameter to number faceting 
that lets you specify whether either end of each range should be inclusive or 
exclusive. I have now ported it back to date faceting (disclaimer: 
completely untested); the patch should be attached to this post.


The patch adds the following parameter: facet.date.exclusive
Valid values for the parameter are: start, end, both and neither.

To maintain compatibility with Solr without the patch, the default is 
neither. I hope the meaning of the values is self-explanatory.
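
(For example, with the patch applied, a request along these lines would make the upper bound of
each range exclusive; the field name and dates are just illustrative:)

http://localhost:8983/solr/select?q=*:*&facet=true
    &facet.date=timestamp
    &facet.date.start=2009-01-01T00:00:00Z
    &facet.date.end=2009-01-03T00:00:00Z
    &facet.date.gap=%2B1DAY
    &facet.date.exclusive=end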


Regards,

gwk

Stephen Duncan Jr wrote:

If we do date faceting and start at 2009-01-01T00:00:00Z, end at
2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at
exactly 2009-01-02T00:00:00Z will be included in both the returned counts
(2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z).  At the moment, this is
quite bad for us, as we only index at the day level, so all of our documents
are exactly on the line between each facet range.

Because we know our data is indexed as being exactly at midnight each day, I
think we can simply always start from 1 second prior and get the results we
want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think
this problem would affect everyone, even if usually more subtly (instead of
all documents being counted twice, only a few on the fencepost between
ranges).

Is this a known behavior people are happy with, or should I file an issue
asking for ranges in date-facets to be constructed to subtract one second
from the end of each range (so that the effective range queries for my case
would be: [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] &
[2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])?

Alternatively, is there some other suggested way of using the date faceting
to avoid this problem?

  


Index: src/java/org/apache/solr/request/SimpleFacets.java
===================================================================
--- src/java/org/apache/solr/request/SimpleFacets.java  (revision 809880)
+++ src/java/org/apache/solr/request/SimpleFacets.java  (working copy)
@@ -29,6 +29,7 @@
 import org.apache.solr.common.params.SolrParams;
 import org.apache.solr.common.params.CommonParams;
 import org.apache.solr.common.params.FacetParams.FacetDateOther;
+import org.apache.solr.common.params.FacetParams.FacetDateExclusive;
 import org.apache.solr.common.util.NamedList;
 import org.apache.solr.common.util.SimpleOrderedMap;
 import org.apache.solr.common.util.StrUtils;
@@ -586,6 +587,32 @@
"date facet 'end' comes before 'start': "+endS+" < "+startS);
   }
 
+  boolean startInclusive = true;
+  boolean endInclusive = true;
+  final String[] exclusiveP =
+params.getFieldParams(f,FacetParams.FACET_DATE_EXCLUSIVE);
+  if (null != exclusiveP && 0 < exclusiveP.length) {
+Set<FacetDateExclusive> exclusives
+= EnumSet.noneOf(FacetDateExclusive.class);
+
+for (final String e : exclusiveP) {
+  exclusives.add(FacetDateExclusive.get(e));
+}
+
+if(! exclusives.contains(FacetDateExclusive.NEITHER) ) {
+  boolean both = exclusives.contains(FacetDateExclusive.BOTH);
+  
+  if(both || exclusives.contains(FacetDateExclusive.START)) {
+startInclusive = false;
+  }
+  
+  if(both || exclusives.contains(FacetDateExclusive.END)) {
+endInclusive = false;
+  }
+}
+  }
+  
+  
   final String gap = required.getFieldParam(f,FacetParams.FACET_DATE_GAP);
   final DateMathParser dmp = new DateMathParser(ft.UTC, Locale.US);
   dmp.setNow(NOW);
@@ -610,7 +637,7 @@
   (SolrException.ErrorCode.BAD_REQUEST,
"date facet infinite loop (is gap negative?)");
   }
-  resInner.add(label, rangeCount(sf,low,high,true,true));
+  resInner.add(label, rangeCount(sf,low,high,startInclusive,endInclusive));
   low = high;
 }
   } catch (java.text.ParseException e) {
@@ -639,15 +666,15 @@
 
   if (all || others.contains(FacetDateOther.BEFORE)) {
 resInner.add(FacetDateOther.BEFORE.toString(),
- rangeCount(sf,null,start,false,false));
+ rangeCount(sf,null,start,false,!startInclusive));
   }
   if (all || others.contains(FacetDateOther.AFTER)) {
 resInner.add(FacetDateOther.AFTER.toString(),
- rangeCount(sf,end,null,false,false));
+ rangeCount(sf,end,null,!endInclusive,false));
   }
   if (all || others.contains(FacetDateOther.BETWEEN)) {
 resInner.add(FacetDateOther.BETWEEN.toString(),
- rangeCount(sf

Re: How to set similarity to catch more results ?

2009-09-01 Thread Kaoul
Thank you, all three, for your answers. After more research, I think I need to
use fuzzy search, as I already know about Levenshtein distance and I don't
want to manage a list of synonyms manually. So a manually maintained spell
check isn't for me.
Thanks a lot.
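
(For reference, a fuzzy term query in the standard Lucene/Solr query syntax looks like the request
below; the field name and similarity threshold are just illustrative. The number after the ~ is the
minimum similarity, based on edit distance, that a term must reach in order to match.)

http://localhost:8983/solr/select?q=name:ipod~0.6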

On Tue, Sep 1, 2009 at 1:15 AM, Avlesh Singh wrote:
>>
>> I want it more flexible, as if I make a mistake with letters, results are
>> found like with google.
>>
> You are talking about spelling mistakes?
> http://wiki.apache.org/solr/SpellCheckComponent
>
> Cheers
> Avlesh
>
> On Mon, Aug 31, 2009 at 3:30 PM, Kaoul  wrote:
>
>> Hello,
>>
>> I'm new to Solr and don't find in documentation how-to to set
>> similarity. I want it more flexible, as if I make a mistake with
>> letters, results are found like with google.
>>
>> Thank you in advance.
>>
>


Re: Drill down into hierarchical facet : how to?

2009-09-01 Thread Uri Boness

Hi,

You know the level you're currently in:

America/USA

You have the values for the location facet in the form:

America/USA/NYC/Chelsea ... 3
America/USA/NYC/East Village ... 2
America/USA/San Francisco/Haight-Ashbury ... 5
America/USA/Los Angeles/Hollywood ... 1

Why can't you do the following:

first step: translate each facet value to the next expected level, that is:

America/USA/NYC ... 3
America/USA/NYC ... 2
America/USA/San Francisco ... 5
America/USA/Los Angeles ... 1

(this can easily be done using a regular expression)

second step: aggregate the counts for similar values:

America/USA/NYC ... 5
America/USA/San Francisco ... 5
America/USA/Los Angeles ... 1

Now, use the last "part" of the value path as the display name, and use the 
whole value path as a filter (with a wildcard, of course).
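
(A minimal sketch in Java of those two steps, assuming the facet values arrive as path-to-count
pairs; the class and method names are made up:)

import java.util.LinkedHashMap;
import java.util.Map;

public class FacetRollup {

    // Truncate each facet value to one path segment below the current prefix
    // (e.g. "America/USA") and aggregate the counts of values that collapse
    // to the same truncated path.
    public static Map<String, Integer> rollup(Map<String, Integer> rawFacets, String prefix) {
        int keepSegments = prefix.split("/").length + 1;  // current depth plus one more level
        Map<String, Integer> rolledUp = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : rawFacets.entrySet()) {
            String[] parts = e.getKey().split("/");
            if (!e.getKey().startsWith(prefix + "/") || parts.length < keepSegments) {
                continue;  // value is outside, or not deeper than, the current level
            }
            StringBuilder truncated = new StringBuilder(parts[0]);
            for (int i = 1; i < keepSegments; i++) {
                truncated.append('/').append(parts[i]);
            }
            Integer current = rolledUp.get(truncated.toString());
            rolledUp.put(truncated.toString(), (current == null ? 0 : current) + e.getValue());
        }
        return rolledUp;
    }
}

The last path segment of each resulting key is then the display label, and the full key plus a
trailing wildcard is the filter for the next drill-down, as described above.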


You can also have a look at this issue: 
http://issues.apache.org/jira/browse/SOLR-64


cheers,
Uri

clico wrote:

Hello

I'm looking for a way to do the following.

I have a hierarchical facet,

e.g.: Continent / Country / City / Block


Europe/France/Paris/Saint Michel
America/USA/NYC/Chelsea
etc ...

I have some points of interest tagged at different levels of the same tree,
e.g. some POIs are tagged Saint Michel and others are tagged Paris, etc.

I facet on a field "location". This field is stored like this:
Continent/Country/City/Block



I want to drill down on this facet during the search and show the facets for
the next level below the current one.


Ex:

When I search at the continent level, I want facets for Europe, USA, etc.,
and to show all the results (Europe contains POIs tagged as Europe and POIs
tagged as France, for example).



I know I can make a facet query,
something like Europe/France/*, to search for all POIs in France,
but how can I show the facet level under France (Paris, Lyon, etc.)?


Thank you

  


Error while indexing using SmartChineseAnalyzer

2009-09-01 Thread Jana, Kumar Raja
Hi,

I tried using the patch provided for the SOLR-1336 JIRA issue for
integrating Lucene's SmartChineseAnalyzer with Solr and tried testing it
out, but I get an AbstractMethodError during indexing as well as
searching (stack trace below). There seems to be something wrong during
the tokenization of the content.

 

Can someone please tell me what I am doing wrong here?

 

The Stack Trace

SEVERE: java.lang.AbstractMethodError
        at org.apache.solr.analysis.TokenizerChain.tokenStream(TokenizerChain.java:64)
        at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.tokenStream(IndexSchema.java:360)
        at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:44)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:123)
        at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:745)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2199)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2171)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:218)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

 

Thanks,

Kumar



Re: Error while indexing using SmartChineseAnalyzer

2009-09-01 Thread Shalin Shekhar Mangar
On Tue, Sep 1, 2009 at 4:37 PM, Jana, Kumar Raja  wrote:

> Hi,
>
> I tried using the patch provided for Solr-1336 JIRA issue for
> integrating Lucene's SmartChineseAnalyzer with Solr and tried testing it
> out but I faced the AbstractMethodError during indexing as well as
> Searching (stack trace below).
>

Questions on patches are best asked on the issue. Please post the stack
trace to SOLR-1336.

-- 
Regards,
Shalin Shekhar Mangar.


RE: Error while indexing using SmartChineseAnalyzer

2009-09-01 Thread Jana, Kumar Raja
Thanks for the reply Shalin.
Posted the stack trace on the Jira issue SOLR-1336.

-Kumar

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Tuesday, September 01, 2009 4:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Error while indexing using SmartChineseAnalyzer

On Tue, Sep 1, 2009 at 4:37 PM, Jana, Kumar Raja  wrote:

> Hi,
>
> I tried using the patch provided for Solr-1336 JIRA issue for
> integrating Lucene's SmartChineseAnalyzer with Solr and tried testing it
> out but I faced the AbstractMethodError during indexing as well as
> Searching (stack trace below).
>

Questions on patches are best asked on the issue. Please post the stack
trace to SOLR-1336.

-- 
Regards,
Shalin Shekhar Mangar.


Adding new docs, but duplicating instead of updating

2009-09-01 Thread Christopher Baird
Hi All,

 

I'm running Solr in a multicore setup.  I've set one of the cores to have a
specific field as the unique key (marked as the uniqueKey in the document
and the field is defined as required).  I'm sending an <add> command with
all the docs using a multipart post.  After running the add file, I send
<commit> and then send <optimize>.  This works fine.  When I resend the
file (and commit and optimize), I double my document count and when I do a
query by unique key, I get two documents back.

I've confirmed using the admin UI (schema browser) that my document
count has doubled.  I've also confirmed that the unique key is the one I
specified (again, using the schema browser).  The unique key field is marked as
type textTight.

 

Thanks for any help

 

-Chris



Re: Monitoring split time for fq queries when filter cache is used

2009-09-01 Thread Martijn v Groningen
Hi Rahul,

Yes, your understanding is correct, but it is not possible to
monitor these actions separately with Solr.

Martijn

2009/9/1 Rahul R :
> Hello,
> I am trying to measure the benefit that I am getting out of using the filter
> cache. As I understand, there are two major parts to an fq query. Please
> correct me if I am wrong :
> - doing full index queries of each of the fq params (if filter cache is
> used, this result will be retrieved from the cache)
> - set intersection of above results (Will be done again even with filter
> cache enabled)
>
> Is there any flag/setting that I can enable to monitor how much time the
> above operations take separately i.e. the querying and the set-intersection
> ?
>
> Regards
> Rahul
>



-- 
Met vriendelijke groet,

Martijn van Groningen


RE: Adding new docs, but duplicating instead of updating

2009-09-01 Thread Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
I could be off base here; maybe using textTight as a unique key is a common Solr 
practice I don't know about.  But it would seem to me that using any field type that 
transforms a value (even if it is just whitespace removal) could be 
problematic.  Maybe not the source of your issue here, but I'd be worrying 
about collisions.  For instance, what if you sent "xyz" as a key and "XYZ" as a 
key?  The doc would be overwritten.  You may end up with unexpected results 
when you get the record back.  Maybe with your use case this is OK, but have 
you considered using string instead?

Tim

-Original Message-
From: Christopher Baird [mailto:cba...@cardinalcommerce.com] 
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

 

I'm running Solr in a multicore setup.  I've set one of the cores to have a
specific field as the unique key (marked as the uniqueKey in the document
and the field is defined as required).  I'm sending an <add> command with
all the docs using a multipart post.  After running the add file, I send
<commit> and then send <optimize>.  This works fine.  When I resend the
file (and commit and optimize), I double my document count and when I do a
query by unique key, I get two documents back.

I've confirmed using the admin UI (schema browser) that my document
count has doubled.  I've also confirmed that the unique key is the one I
specified (again, using the schema browser).  The unique key field is marked as
type textTight.

 

Thanks for any help

 

-Chris



Re: Is caching worth it when my whole index is in RAM?

2009-09-01 Thread Michael
Thanks, Avlesh!  I'll try the filter cache.
Anybody familiar enough with the caching implementation to chime in?
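
(For reference, these caches are configured in solrconfig.xml. A minimal sketch, with the sizes
purely illustrative; removing an entry, or shrinking its size, effectively disables that cache:)

  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <documentCache    class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>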

Michael

On Mon, Aug 31, 2009 at 10:02 PM, Avlesh Singh  wrote:

> Good question!
> The application level cache, say filter cache, would still help because it
> not only caches values but also the underlying computation. Even with all
> the data in your RAM you will still end up doing the computations every
> time.
>
> Looking for responses from the more knowledgeable.
>
> Cheers
> Avlesh
>
> On Mon, Aug 31, 2009 at 8:25 PM, Michael  wrote:
>
> > Hi,
> > If I've got my entire 20G 4MM document index in RAM (on a ramdisk), do I
> > have a need for the document cache?  Or should I set it to 0 items,
> because
> > pulling field values from an index in RAM is so fast that the document
> > cache
> > would be a duplication of effort?
> >
> > Are there any other caches that I should turn off if I can get my entire
> > index in RAM?  Filter cache, query results cache, etc?
> >
> > Thanks!
> > Michael
> >
>


solrj - Log4j and slf4j integration - java.lang.IllegalStateException thrown

2009-09-01 Thread Villemos, Gert
We are using solrj 1.3 (with slf4j) in a client also using Aperture
(with log4j 1.2.14). When executing a query I get the error shown below.
The request is never received by the server, i.e. the exception is
thrown before the request is issued.

 

I think I'm running into a compatibility issue between slf4j and log4j,
but don't know how to solve it.

 

Regards,

Gert.

 

 

--- Stack Trace 

 

org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:96)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:109)
        at org.esa.huginn.solr.SolrContainer.getNext(SolrContainer.java:105)
        at org.esa.huginn.commons.container.consolidationstrategies.SynonymReplacementConsolidator.execute(SynonymReplacementConsolidator.java:191)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.initialize(FileSystemCrawlerContext.java:134)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.main(FileSystemCrawlerContext.java:63)
Caused by: java.lang.IllegalStateException: Level number 10 is not recognized.
        at org.slf4j.impl.Log4jLoggerAdapter.log(Log4jLoggerAdapter.java:421)
        at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:106)
        at org.apache.commons.httpclient.HttpConnection.releaseConnection(HttpConnection.java:1178)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.releaseConnection(MultiThreadedHttpConnectionManager.java:1423)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:222)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:335)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)

 

 

 

 






RE: Adding new docs, but duplicating instead of updating

2009-09-01 Thread Christopher Baird
Hi Tim,

I appreciate the suggestions.  I can tell you that the document I ran the
second time was the same document run the first time -- so any questions of
field value shouldn't be a concern.

Thanks
-Chris

-Original Message-
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
[mailto:timothy.j.har...@nasa.gov] 
Sent: Tuesday, September 01, 2009 10:45 AM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

I could be off base here, maybe using textTight as unique key is a common
SOLR practice I don't know.  But, It would seem to me that using any field
type that transforms a value (even if it is just whitespace removal) could
be problematic.   Maybe not the source of your issue here, but I'd be
worrying about collisions.  For instance what if you sent "xyz" as a key and
"XYZ" as a key?  The doc would be overwritten.  You may end up with
unexpected results when you get the record back...  Maybe with your use-case
this is OK but have you considered using string instead?

Tim

-Original Message-
From: Christopher Baird [mailto:cba...@cardinalcommerce.com] 
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

 

I'm running Solr in a multicore setup.  I've set one of the cores to have a
specific field as the unique key (marked as the uniqueKey in the document
and the field is defined as required).  I'm sending an <add> command with
all the docs using a multipart post.  After running the add file, I send
<commit> and then send <optimize>.  This works fine.  When I resend the
file (and commit and optimize), I double my document count and when I do a
query by unique key, I get two documents back.

I've confirmed using the admin UI (schema browser) that my document
count has doubled.  I've also confirmed that the unique key is the one I
specified (again, using the schema browser).  The unique key field is marked as
type textTight.

 

Thanks for any help

 

-Chris





Re: solrj - Log4j and slf4j integration - java.lang.IllegalStateException thrown

2009-09-01 Thread Smiley, David W.
Are you running the latest versions of these logging libraries?  I see nothing 
in the 1.5.8 SLF4J Log4j adapter that would cause this.

~ David Smiley
 Author: http://www.packtpub.com/solr-1-4-enterprise-search-server


On 9/1/09 10:49 AM, "Villemos, Gert"  wrote:

We are using solrj 1.3 (with slf4j) in a client also using Aperture
(with log4j 1.2.14). When executing a query I get the error shown below.
The request is never received by the server, i.e. the exception is
thrown before the request is issued.



I think I'm running into a compatibility issue between slf4j and log4j,
but don't know how to solve it.



Regards,

Gert.





--- Stack Trace 



org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:96)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:109)
        at org.esa.huginn.solr.SolrContainer.getNext(SolrContainer.java:105)
        at org.esa.huginn.commons.container.consolidationstrategies.SynonymReplacementConsolidator.execute(SynonymReplacementConsolidator.java:191)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.initialize(FileSystemCrawlerContext.java:134)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.main(FileSystemCrawlerContext.java:63)
Caused by: java.lang.IllegalStateException: Level number 10 is not recognized.
        at org.slf4j.impl.Log4jLoggerAdapter.log(Log4jLoggerAdapter.java:421)
        at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:106)
        at org.apache.commons.httpclient.HttpConnection.releaseConnection(HttpConnection.java:1178)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.releaseConnection(MultiThreadedHttpConnectionManager.java:1423)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:222)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:335)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)















Re: Adding docs from MySQL and php

2009-09-01 Thread Aakash Dharmadhikari
hi Pablo,

  DataImportHandler might be the best option for you. check this link
http://wiki.apache.org/solr/DataImportHandler

regards,
aakash

On Tue, Sep 1, 2009 at 9:18 PM, Pablo Ferrari wrote:

> Hello all,
>
> I'm new to the list and new to Solr. My name is Pablo, I'm from Spain and
> I'm developing a web site using Solr.
>
> I have Solr with the examples working correctly and now I would like to
> load
> the data from a MySQL database using php.
> Is the best way to do this to write a php script that get the info from the
> MySQL and then generates an XML document to load into Solr? Is there a
> maximum size for this XML document? My MySQL database is quite big...
>
> Any help, book or internet tutorial you know will be really appreciated.
>
> Thank you!
>
> Pablo
>


Why dismax isn't the default with 1.4 and why it doesn't support fuzzy search ?

2009-09-01 Thread Erwin
Hello,

Solr is great software, but I have some questions, such as:

The wiki says "As of Solr 1.3, the DisMaxRequestHandler is simply the
standard request handler with the default query parser set to the
DisMax Query Parser (defType=dismax).". I just made a checkout of svn,
and dismax doesn't seem to be the default, because:
- http://localhost:8983/solr/select/?q=test~0.5
and
- http://localhost:8983/solr/select/?q=test~0.5&qt=dismax
don't show the same results.
Note that I'm new to Solr and I'm using the "example".

So is dismax really the default?

Secondly, I've patched Solr with
http://issues.apache.org/jira/browse/SOLR-629, as I would like to have
fuzzy search with dismax. I built it with "ant example". The behavior is
still the same: no fuzzy search with dismax (using the qt=dismax
parameter in the GET URL).


In advance, thanks a lot.


Re: Adding docs from MySQL and php

2009-09-01 Thread Pablo Ferrari
Thanks Aakash!

I've looked at it and it looks very interesting. The problem is that my
database uses a relational model, so I don't have one table with all the
information, but many tables related to each other by their ids (primary
keys and foreign keys).

I've been thinking about using DataImportHandler in one of these two ways:
- Write a script that creates a table with all the information I need for
searching (not very efficient because of the duplicate data)
- Configure DataImportHandler with some JOIN SQL statements

I'll let you know how I did, thanks again!

Pablo

2009/9/1 Aakash Dharmadhikari 

> hi Pablo,
>
>  DataImportHandler might be the best option for you. check this link
> http://wiki.apache.org/solr/DataImportHandler
>
> regards,
> aakash
>
> On Tue, Sep 1, 2009 at 9:18 PM, Pablo Ferrari  >wrote:
>
> > Hello all,
> >
> > I'm new to the list and new to Solr. My name is Pablo, I'm from Spain and
> > I'm developing a web site using Solr.
> >
> > I have Solr with the examples working correctly and now I would like to
> > load
> > the data from a MySQL database using php.
> > Is the best way to do this to write a php script that get the info from
> the
> > MySQL and then generates an XML document to load into Solr? Is there a
> > maximum size for this XML document? My MySQL database is quite big...
> >
> > Any help, book or internet tutorial you know will be really appreciated.
> >
> > Thank you!
> >
> > Pablo
> >
>


Re: Adding docs from MySQL and php

2009-09-01 Thread Pablo Ferrari
wow, it looks like DIH already works with relational databases... thanks
again!
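
(A minimal data-config.xml sketch of how DIH walks a parent/child join with nested entities; the
table, column and Solr field names here are made up:)

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <entity name="item" query="SELECT id, name FROM item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <!-- child rows are fetched per parent row and indexed into the same document -->
      <entity name="feature"
              query="SELECT description FROM feature WHERE item_id='${item.id}'">
        <field column="description" name="features"/>
      </entity>
    </entity>
  </document>
</dataConfig>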

2009/9/1 Pablo Ferrari 

> Thanks Aakash!
>
> I've looked at it and it looks very interesting, the problem is that my
> database is a relational model, therefore I don't have a table with all the
> information, but many tables related to each other by their ids (primary
> keys and foreign keys).
>
> I've been thinking about using DataImportHandler in any of this two ways:
> - Write a script that creates a table with all the information I need for
> searching (it is not very efficient because of duplicate data)
> - Configure DataImportHandler with some JOIN SQL statement
>
> I'll let you know how I did, thanks again!
>
> Pablo
>
> 2009/9/1 Aakash Dharmadhikari 
>
> hi Pablo,
>>
>>  DataImportHandler might be the best option for you. check this link
>> http://wiki.apache.org/solr/DataImportHandler
>>
>> regards,
>> aakash
>>
>> On Tue, Sep 1, 2009 at 9:18 PM, Pablo Ferrari > >wrote:
>>
>> > Hello all,
>> >
>> > I'm new to the list and new to Solr. My name is Pablo, I'm from Spain
>> and
>> > I'm developing a web site using Solr.
>> >
>> > I have Solr with the examples working correctly and now I would like to
>> > load
>> > the data from a MySQL database using php.
>> > Is the best way to do this to write a php script that get the info from
>> the
>> > MySQL and then generates an XML document to load into Solr? Is there a
>> > maximum size for this XML document? My MySQL database is quite big...
>> >
>> > Any help, book or internet tutorial you know will be really appreciated.
>> >
>> > Thank you!
>> >
>> > Pablo
>> >
>>
>
>


RE: Adding new docs, but duplicating instead of updating

2009-09-01 Thread Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
What is the value of your uniqueKey?

-Original Message-
From: Christopher Baird [mailto:cba...@cardinalcommerce.com] 
Sent: Tuesday, September 01, 2009 8:20 AM
To: solr-user@lucene.apache.org
Subject: RE: Adding new docs, but duplicating instead of updating

Hi Tim,

I appreciate the suggestions.  I can tell you that the document I ran the
second time was the same document run the first time -- so any questions of
field value shouldn't be a concern.

Thanks
-Chris

-Original Message-
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
[mailto:timothy.j.har...@nasa.gov] 
Sent: Tuesday, September 01, 2009 10:45 AM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

I could be off base here, maybe using textTight as unique key is a common
SOLR practice I don't know.  But, It would seem to me that using any field
type that transforms a value (even if it is just whitespace removal) could
be problematic.   Maybe not the source of your issue here, but I'd be
worrying about collisions.  For instance what if you sent "xyz" as a key and
"XYZ" as a key?  The doc would be overwritten.  You may end up with
unexpected results when you get the record back...  Maybe with your use-case
this is OK but have you considered using string instead?

Tim

-Original Message-
From: Christopher Baird [mailto:cba...@cardinalcommerce.com] 
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

 

I'm running Solr in a multicore setup.  I've set one of the cores to have a
specific field as the unique key (marked as the uniqueKey in the document
and the field is defined as required).  I'm sending an <add> command with
all the docs using a multipart post.  After running the add file, I send
<commit> and then send <optimize>.  This works fine.  When I resend the
file (and commit and optimize), I double my document count and when I do a
query by unique key, I get two documents back.

I've confirmed using the admin UI (schema browser) that my document
count has doubled.  I've also confirmed that the unique key is the one I
specified (again, using the schema browser).  The unique key field is marked as
type textTight.

 

Thanks for any help

 

-Chris





RE: Adding new docs, but duplicating instead of updating

2009-09-01 Thread Christopher Baird
Hi Tim,

The value I'm using is a product SKU.  A sample would be:  L49-4251.

Thanks
-Chris
-Original Message-
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
[mailto:timothy.j.har...@nasa.gov] 
Sent: Tuesday, September 01, 2009 12:52 PM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

What is the value of your uniqueKey?

-Original Message-
From: Christopher Baird [mailto:cba...@cardinalcommerce.com] 
Sent: Tuesday, September 01, 2009 8:20 AM
To: solr-user@lucene.apache.org
Subject: RE: Adding new docs, but duplicating instead of updating

Hi Tim,

I appreciate the suggestions.  I can tell you that the document I ran the
second time was the same document run the first time -- so any questions of
field value shouldn't be a concern.

Thanks
-Chris

-Original Message-
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
[mailto:timothy.j.har...@nasa.gov] 
Sent: Tuesday, September 01, 2009 10:45 AM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

I could be off base here, maybe using textTight as unique key is a common
SOLR practice I don't know.  But, It would seem to me that using any field
type that transforms a value (even if it is just whitespace removal) could
be problematic.   Maybe not the source of your issue here, but I'd be
worrying about collisions.  For instance what if you sent "xyz" as a key and
"XYZ" as a key?  The doc would be overwritten.  You may end up with
unexpected results when you get the record back...  Maybe with your use-case
this is OK but have you considered using string instead?

Tim

-Original Message-
From: Christopher Baird [mailto:cba...@cardinalcommerce.com] 
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

 

I'm running Solr in a multicore setup.  I've set one of the cores to have a
specific field as the unique key (marked as the uniqueKey in the document
and the field is defined as required).  I'm sending an <add> command with
all the docs using a multipart post.  After running the add file, I send
<commit> and then send <optimize>.  This works fine.  When I resend the
file (and commit and optimize), I double my document count and when I do a
query by unique key, I get two documents back.

I've confirmed using the admin UI (schema browser) that my document
count has doubled.  I've also confirmed that the unique key is the one I
specified (again, using the schema browser).  The unique key field is marked as
type textTight.

 

Thanks for any help

 

-Chris







SOLR vs SQL

2009-09-01 Thread Fuad Efendi
RE: http://www.mysecondhome.eu

 

 

I am browsing this website again (I have similar challenge at
http://www.casaGURU.com but still prefer database-SQL to search Professional
by service type)

 

I don't think SOLR is applicable in this specific case. I think standard DB
queries with predefined dropdown/radio values perform far faster than
SOLR faceting (you currently have only 9 records) - database queries
have a consistent response time without dependency on dataset size (especially
MySQL MyISAM "SELECT COUNT(*)"); SOLR's response time depends on dataset size.

 

SOLR is applicable if we are using at least full-text search (for instance,
a search for "Jack London" may return a house owned by Jack London in Australia,
a house at Jack Square in London, etc.); if we are interested in
non-tokenized attributes only (putting heavy constraints on possible query
types, without _any_ full-text), use a database.

 

 



RE: SOLR vs SQL

2009-09-01 Thread Fuad Efendi
"No results found for 'surface area 377', displaying all properties."
- why do we need SOLR then...





Re: extended documentation on analyzers

2009-09-01 Thread Chris Hostetter
: is there an online resource or a book that contains a thorough list of
: tokenizers and filters available and their functionality?
: 
: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

...from the intro on that page...

  For a more complete list of what Tokenizers and TokenFilters come out of 
  the box, please consult the [WWW] javadocs for the analysis package. if 
  you have any tips/tricks you'd like to mention about using any of these 
  classes, please add them below.

...with a link to...

http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html


-Hoss



Re: Can solr do the equivalent of "select distinct(field)"?

2009-09-01 Thread Chris Hostetter

: lets say you filter your query on something and want to know how many
: distinct "categories" that your results comprise.
: then you can facet on the category field and count the number of facet
: values that are returned, right?

if you count the number of facet values returned you are getting a "count 
of distinct values"

if you just want the list of distinct values in a field (for your whole 
index) then the TermsComponent is the fastest way.

if you want the list of distinct values across a set of documents, then 
facet on that field when doing your query.

"select distinct category from books where bookInStock='true'" is analgous 
to looking at the facet section of...

   rows=0&q=bookInStock:true&facet=true&facet.field=category
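
(For the whole-index case, a TermsComponent request might look like the following, assuming a
/terms handler is wired up to the TermsComponent in solrconfig.xml as shown on the wiki:)

   http://localhost:8983/solr/terms?terms.fl=category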


-Hoss



Re: Date Faceting and Double Counting

2009-09-01 Thread Chris Hostetter
: Is this a known behavior people are happy with, or should I file an issue
: asking for ranges in date-facets to be constructed to subtract one second
: from the end of each range (so that the effective range queries for my case

It's a known annoyance, but not something that seems to annoy people enough 
that there have been any patches to improve the situation.  Typically the 
off-by-one-second approach you describe works for people.


-Hoss



Re: Date Faceting and Double Counting

2009-09-01 Thread Chris Hostetter

: When I added numerical faceting to my checkout of solr (solr-1240) I basically
: copied date faceting and modified it to work with numbers instead of dates.
: With numbers I got a lot of doulbe-counted values as well. So to fix my
: problem I added an extra parameter to number faceting where you can specify if
: either end of each range should be inclusive or exclusive. I just ported it

gwk:

1) Would you mind opening a Jira issue for your date faceting improvements 
as well? (Email attachments tend to get lost, and there are legal headaches 
with committing them that Jira solves by asking you explicitly whether you 
license them to the ASF.)

2) I haven't looked at your patch, but one of the reasons I never 
implemented an option like this with date faceting is that the query 
parser doesn't have any way of letting you write a query that is inclusive 
on one end and exclusive on the other end -- so you might get accurate 
facet counts for ranges A-B and B-C (inclusive of the lower bound, exclusive of 
the upper), but if you try to filter by one of those ranges, your counts 
will be off.  Did you find a nice solution for this?
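
(To illustrate: with the stock query parser you can write either of the range queries below, but 
not one that mixes an inclusive bound with an exclusive bound -- the field and values are just 
examples:)

   timestamp:[2009-01-01T00:00:00Z TO 2009-01-02T00:00:00Z]    (inclusive at both ends)
   timestamp:{2009-01-01T00:00:00Z TO 2009-01-02T00:00:00Z}    (exclusive at both ends)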




-Hoss



Re: Adding new docs, but duplicating instead of updating

2009-09-01 Thread Chris Hostetter
: specified (again, using schema browser).  The unique key field is marked as
: type textTight.

Your uniqueKey field needs to be something where every doc is only going to 
produce a single token. If you are using textTight, and sending product 
sku type data (as mentioned in another message in this thread), you are 
probably getting multiple tokens.

Use copyField to put this same sku value into a string field.
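
(One way to arrange that in schema.xml, with illustrative field names: the string field holds the 
raw sku and serves as the uniqueKey, while a copy goes into a tokenized field for searching:)

   <field name="sku" type="string" indexed="true" stored="true" required="true"/>
   <field name="sku_text" type="textTight" indexed="true" stored="false"/>
   <copyField source="sku" dest="sku_text"/>
   <uniqueKey>sku</uniqueKey>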


-Hoss



Re: Using Lucene's payload in Solr

2009-09-01 Thread Chris Hostetter
: Is it possible to have the copyField strip off the payload while it is
: copying since doing it in the analysis phrase is too late?  Or should I
: start looking into using UpdateProcessors as Chris had suggested?

"nope" and "yep"

I've had an idea in the back of my mind for a while now about adding more 
options to the fieldTypes to specify how the *stored* values should be 
modified when indexing ... but there's nothing there to do that yet.  You 
have to make the modifications in an UpdateProcessor (or in a response 
writer).

: >> It seems like it might be simpler have two new (generic) UpdateProcessors:
: >> one that can clone fieldA into fieldB, and one that can do regex mutations
: >> on fieldB ... neither needs to know about payloads at all, but the first
: >> can made a copy of "2.0|Solr In Action" and the second can strip off the
: >> "2.0|" from the copy.
: >>
: >> then you can write a new NumericPayloadRegexTokenizer that takes in two
: >> regex expressions -- one that knows how to extract the payload from a
: >> piece of input, and one that specifies the tokenization.
: >>
: >> those three classes seem easier to implemnt, easier to maintain, and more
: >> generally reusable then a custom xml request handler for your updates.
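
(A minimal sketch of what the field-cloning UpdateProcessor described above could look like; the 
class and field names are made up, the regex mutation is folded into the same processor for 
brevity, and the imports follow the Solr 1.4 package layout:)

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CloneAndStripPayloadProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object value = doc.getFieldValue("fieldA");
        if (value != null) {
          // clone fieldA into fieldB, stripping a leading "2.0|" style payload prefix from the copy
          doc.addField("fieldB", value.toString().replaceFirst("^[^|]*\\|", ""));
        }
        super.processAdd(cmd);  // pass the (now augmented) document down the chain
      }
    };
  }
}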


-Hoss



Re: Why dismax isn't the default with 1.4 and why it doesn't support fuzzy search ?

2009-09-01 Thread Chris Hostetter
: The wiki says "As of Solr 1.3, the DisMaxRequestHandler is simply the
: standard request handler with the default query parser set to the
: DisMax Query Parser (defType=dismax).". I just made a checkout of svn
: and dismax doesn't seems to be the default as :

That paragraph doesn't say that dismax is the "default handler" ... it 
says that using qt=dismax is the same as using qt=standard with the "default 
query parser" set to be the DisMaxQueryParser (using defType=dismax).


so doing this replacement on any URL...

qt=dismax   =>  qt=standard&defType=dismax

...should produce identical results.

: Secondly, I've patched solr with
: http://issues.apache.org/jira/browse/SOLR-629 as I would like to have
: fuzzy with dismax. I built it with "ant example". Now, behavior is
: still the same, no fuzzy search with dismax (using the qt=dismax
: parameter in GET URL).

Questions/discussion of uncommitted patches are best done in the Jira issue 
where you found the patch ... that way it helps other people evaluate the 
patch, and the author of the patch is more likely to see your feedback.


-Hoss



Searching for a set of keywords /phrases in a document

2009-09-01 Thread matchan

I have a large document with various sections. Each section has a list of
keywords/phrases of interest. I have a master list of keywords/phrases
stored as a String array. How can I use Solr or Lucene to search each
section document for all keywords and basically tell me which keywords were
found? I can't think of any straightforward way to implement this.

-- 
View this message in context: 
http://www.nabble.com/Searching-for-a-set-of-keywords--phrases-in-a-document-tp25250714p25250714.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Monitoring split time for fq queries when filter cache is used

2009-09-01 Thread Rahul R
Thank you Martijn.

On Tue, Sep 1, 2009 at 8:07 PM, Martijn v Groningen <
martijn.is.h...@gmail.com> wrote:

> Hi Rahul,
>
> Yes you are understanding is correct, but it is not possible to
> monitor these actions separately with Solr.
>
> Martijn
>
> 2009/9/1 Rahul R :
>  > Hello,
> > I am trying to measure the benefit that I am getting out of using the
> filter
> > cache. As I understand, there are two major parts to an fq query. Please
> > correct me if I am wrong :
> > - doing full index queries of each of the fq params (if filter cache is
> > used, this result will be retrieved from the cache)
> > - set intersection of above results (Will be done again even with filter
> > cache enabled)
> >
> > Is there any flag/setting that I can enable to monitor how much time the
> > above operations take separately i.e. the querying and the
> set-intersection
> > ?
> >
> > Regards
> > Rahul
> >
>
>
>
> --
> Met vriendelijke groet,
>
> Martijn van Groningen
>


RE: encoding problem

2009-09-01 Thread Bernadette Houghton
Finally resolved the problem! The solution was three-pronged on my Windows PC:

Added to my.ini under [mysqld]:
default-character-set=utf8
collation_server=utf8_unicode_ci
character_set_server=utf8
skip-character-set-client-handshake

Added to the JAVA_OPTS environment variable:
-Dfile.encoding=UTF-8

Added to the beginning of Tomcat's startup.bat (positioning is important!):
set JAVA_OPTS="-Dfile.encoding=UTF-8"

Thanks to everyone for their much appreciated help!

Bern

-Original Message-
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au] 
Sent: Monday, 31 August 2009 9:18 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: encoding problem

Still having a few issues with encoding, although I've been able to resolve the 
particular issue below by just re-editing the affected record. 

The other encoding issue is with Greek characters. With Solr turned off in our 
user-facing application, Greek characters, e.g. α, ω (small alpha, small omega), 
display correctly. But with Solr turned on, garbage displays instead. If we 
enter the characters as decimal character references (e.g. &#969;), everything 
displays OK with or without Solr. Does this suggest anything to anyone?

TIA
bern