Re: Date Faceting and Double Counting
Hi Stephen,

When I added numerical faceting to my checkout of Solr (SOLR-1240) I basically copied date faceting and modified it to work with numbers instead of dates. With numbers I got a lot of double-counted values as well. So to fix my problem I added an extra parameter to number faceting where you can specify whether either end of each range should be inclusive or exclusive. I just ported it back to date faceting (disclaimer: completely untested) and it should be attached to my post.

The following parameter is added: facet.date.exclusive

Valid values for the parameter are: start, end, both and neither. To maintain compatibility with Solr without the patch, the default is neither. I hope the meanings of the values are self-explanatory.

Regards,

gwk

Stephen Duncan Jr wrote:
If we do date faceting and start at 2009-01-01T00:00:00Z, end at 2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at exactly 2009-01-02T00:00:00Z will be included in both of the returned counts (2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z). At the moment this is quite bad for us, as we only index at the day level, so all of our documents are exactly on the line between facet ranges. Because we know our data is indexed as being exactly at midnight each day, I think we can simply always start from 1 second prior and get the results we want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think this problem would affect everyone, even if usually more subtly (instead of all documents being counted twice, only a few on the fencepost between ranges). Is this a known behavior people are happy with, or should I file an issue asking for ranges in date facets to be constructed to subtract one second from the end of each range (so that the effective range queries for my case would be: [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] & [2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])? Alternatively, is there some other suggested way of using date faceting to avoid this problem?

Index: src/java/org/apache/solr/request/SimpleFacets.java
===================================================================
--- src/java/org/apache/solr/request/SimpleFacets.java (revision 809880)
+++ src/java/org/apache/solr/request/SimpleFacets.java (working copy)
@@ -29,6 +29,7 @@
 import org.apache.solr.common.params.SolrParams;
 import org.apache.solr.common.params.CommonParams;
 import org.apache.solr.common.params.FacetParams.FacetDateOther;
+import org.apache.solr.common.params.FacetParams.FacetDateExclusive;
 import org.apache.solr.common.util.NamedList;
 import org.apache.solr.common.util.SimpleOrderedMap;
 import org.apache.solr.common.util.StrUtils;
@@ -586,6 +587,32 @@
         "date facet 'end' comes before 'start': "+endS+" < "+startS);
     }

+    boolean startInclusive = true;
+    boolean endInclusive = true;
+    final String[] exclusiveP =
+      params.getFieldParams(f,FacetParams.FACET_DATE_EXCLUSIVE);
+    if (null != exclusiveP && 0 < exclusiveP.length) {
+      Set exclusives = EnumSet.noneOf(FacetDateExclusive.class);
+
+      for (final String e : exclusiveP) {
+        exclusives.add(FacetDateExclusive.get(e));
+      }
+
+      if (! exclusives.contains(FacetDateExclusive.NEITHER) ) {
+        boolean both = exclusives.contains(FacetDateExclusive.BOTH);
+
+        if (both || exclusives.contains(FacetDateExclusive.START)) {
+          startInclusive = false;
+        }
+
+        if (both || exclusives.contains(FacetDateExclusive.END)) {
+          endInclusive = false;
+        }
+      }
+    }
+
     final String gap = required.getFieldParam(f,FacetParams.FACET_DATE_GAP);
     final DateMathParser dmp = new DateMathParser(ft.UTC, Locale.US);
     dmp.setNow(NOW);
@@ -610,7 +637,7 @@
             (SolrException.ErrorCode.BAD_REQUEST,
              "date facet infinite loop (is gap negative?)");
         }
-        resInner.add(label, rangeCount(sf,low,high,true,true));
+        resInner.add(label, rangeCount(sf,low,high,startInclusive,endInclusive));
         low = high;
       }
     } catch (java.text.ParseException e) {
@@ -639,15 +666,15 @@

       if (all || others.contains(FacetDateOther.BEFORE)) {
         resInner.add(FacetDateOther.BEFORE.toString(),
-                     rangeCount(sf,null,start,false,false));
+                     rangeCount(sf,null,start,false,!startInclusive));
       }
       if (all || others.contains(FacetDateOther.AFTER)) {
         resInner.add(FacetDateOther.AFTER.toString(),
-                     rangeCount(sf,end,null,false,false));
+                     rangeCount(sf,end,null,!endInclusive,false));
       }
       if (all || others.contains(FacetDateOther.BETWEEN)) {
         resInner.add(FacetDateOther.BETWEEN.toString(),
-                     rangeCount(sf
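For illustration, a date-facet request using the proposed parameter might look like this (it only works with the patch applied, and the field name "timestamp" is just an example):

    http://localhost:8983/solr/select?q=*:*&facet=true&facet.date=timestamp&facet.date.start=2009-01-01T00:00:00Z&facet.date.end=2009-01-03T00:00:00Z&facet.date.gap=%2B1DAY&facet.date.exclusive=end

With facet.date.exclusive=end each range becomes [start TO end), so a document at exactly 2009-01-02T00:00:00Z would be counted only in the 2009-01-02 range instead of twice.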
Re: How to set similarity to catch more results?
Thank you, all three, for the answers. After more research, I think I need to use fuzzy search, as I already know about Levenshtein distance and I don't want to manage a list of synonyms manually. So manually maintained spell checking isn't for me. Thanks a lot.

On Tue, Sep 1, 2009 at 1:15 AM, Avlesh Singh wrote:
>> I want it more flexible, as if I make a mistake with letters, results are
>> found like with Google.
>
> You are talking about spelling mistakes?
> http://wiki.apache.org/solr/SpellCheckComponent
>
> Cheers
> Avlesh
>
> On Mon, Aug 31, 2009 at 3:30 PM, Kaoul wrote:
>> Hello,
>>
>> I'm new to Solr and can't find in the documentation how to set
>> similarity. I want it more flexible, so that if I make a mistake with
>> letters, results are still found, like with Google.
>>
>> Thank you in advance.
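For reference, a fuzzy term query with the standard query parser looks like the following (the field name and threshold here are just examples); the optional number after the tilde is the minimum similarity, 0.0-1.0, based on Levenshtein distance:

    http://localhost:8983/solr/select?q=name:chateau~0.7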
Re: Drill down into hierarchical facet : how to?
Hi,

You know the level you're currently in: America/USA

You have the values for the location facet in the form:

America/USA/NYC/Chelsea .................... 3
America/USA/NYC/East Village ............... 2
America/USA/San Francisco/Haight-Ashbury ... 5
America/USA/Los Angeles/Hollywood .......... 1

Why can't you do the following?

First step: translate each facet value to the next expected level, that is:

America/USA/NYC ............................ 3
America/USA/NYC ............................ 2
America/USA/San Francisco .................. 5
America/USA/Los Angeles .................... 1

(This can easily be done using a regular expression; an example follows below.)

Second step: aggregate the counts for similar values:

America/USA/NYC ............................ 5
America/USA/San Francisco .................. 5
America/USA/Los Angeles .................... 1

Now use the last "part" of the value path as the display name, and use the whole value path as a filter (with a wildcard, of course).

You can also have a look at this issue: http://issues.apache.org/jira/browse/SOLR-64

cheers,
Uri

clico wrote:
Hello

I'm looking for a way to do the following. I have a hierarchical facet, e.g. Continent / Country / City / Block:

Europe/France/Paris/Saint Michel
America/USA/NYC/Chelsea
etc.

I have some points of interest (POI) tagged at different levels of the same tree, e.g. some POIs are tagged Saint Michel and others are tagged Paris.

I facet on a field "location". This is a field stored like this: Continent/Country/City/Block.

I want to drill down on this facet during the search and show the facets for the next level below the current one. For example, when I search at the continent level I want to facet on Europe, USA, etc. and to show all the results (Europe contains POIs tagged as Europe and POIs tagged as France, for example).

I know I can make a facet query, something like Europe/France/*, to search all POIs in France, but how can I show the facet level under France (Paris, Lyon, etc.)?

Thank u
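For the regular-expression step above, one possible pattern (an illustration, not from the original mail): at depth two (America/USA), truncate each value to its first three path components by replacing

    ^((?:[^/]+/){2}[^/]+).*$    with    $1

Values with fewer than three components are left alone, since the pattern simply doesn't match them; adjust the repetition count to whatever level you are currently in.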
Error while indexing using SmartChineseAnalyzer
Hi,

I tried using the patch provided on the SOLR-1336 JIRA issue for integrating Lucene's SmartChineseAnalyzer with Solr and tried testing it out, but I get an AbstractMethodError during indexing as well as searching (stack trace below). There seems to be something wrong during the tokenization of the content. Can someone please tell me what I am doing wrong here?

The stack trace:

SEVERE: java.lang.AbstractMethodError
        at org.apache.solr.analysis.TokenizerChain.tokenStream(TokenizerChain.java:64)
        at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.tokenStream(IndexSchema.java:360)
        at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:44)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:123)
        at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:762)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:745)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2199)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2171)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:218)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

Thanks,
Kumar
Re: Error while indexing using SmartChineseAnalyzer
On Tue, Sep 1, 2009 at 4:37 PM, Jana, Kumar Raja wrote:
> Hi,
>
> I tried using the patch provided on the SOLR-1336 JIRA issue for
> integrating Lucene's SmartChineseAnalyzer with Solr and tried testing it
> out, but I get an AbstractMethodError during indexing as well as
> searching (stack trace below).

Questions on patches are best asked on the issue. Please post the stack trace to SOLR-1336.

--
Regards,
Shalin Shekhar Mangar.
RE: Error while indexing using SmartChineseAnalyzer
Thanks for the reply, Shalin. Posted the stack trace on the JIRA issue SOLR-1336.

-Kumar

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Tuesday, September 01, 2009 4:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Error while indexing using SmartChineseAnalyzer

On Tue, Sep 1, 2009 at 4:37 PM, Jana, Kumar Raja wrote:
> Hi,
>
> I tried using the patch provided on the SOLR-1336 JIRA issue for
> integrating Lucene's SmartChineseAnalyzer with Solr and tried testing it
> out, but I get an AbstractMethodError during indexing as well as
> searching (stack trace below).

Questions on patches are best asked on the issue. Please post the stack trace to SOLR-1336.

--
Regards,
Shalin Shekhar Mangar.
Adding new docs, but duplicating instead of updating
Hi All,

I'm running Solr in a multicore setup. I've set one of the cores to have a specific field as the unique key (marked as the uniqueKey in the schema, and the field is defined as required). I'm sending an <add> command with all the docs using a multipart post. After running the add file, I send <commit/> and then send <optimize/>. This works fine. When I resend the file (and commit and optimize), I double my document count, and when I do a query by unique key, I get two documents back. I've confirmed using the admin UI (schema browser) that my document count has doubled. I've also confirmed that the unique key is the one I specified (again, using the schema browser). The unique key field is marked as type textTight.

Thanks for any help,
-Chris
Re: Monitoring split time for fq queries when filter cache is used
Hi Rahul,

Yes, your understanding is correct, but it is not possible to monitor these actions separately with Solr.

Martijn

2009/9/1 Rahul R :
> Hello,
> I am trying to measure the benefit that I am getting out of using the filter
> cache. As I understand, there are two major parts to an fq query. Please
> correct me if I am wrong:
> - doing full index queries for each of the fq params (if the filter cache is
>   used, this result will be retrieved from the cache)
> - set intersection of the above results (will be done again even with the
>   filter cache enabled)
>
> Is there any flag/setting that I can enable to monitor how much time the
> above operations take separately, i.e. the querying and the set intersection?
>
> Regards
> Rahul

--
Kind regards,

Martijn van Groningen
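As an illustration of those two parts: in a request like

    http://localhost:8983/solr/select?q=solr&fq=category:books&fq=inStock:true

each fq clause gets its own filterCache entry (the cached set of documents matching that filter), while the intersection of those sets with the main query result is still computed on every request. The field names here are just examples.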
RE: Adding new docs, but duplicating instead of updating
I could be off base here; maybe using textTight as a unique key is a common SOLR practice I don't know. But it would seem to me that using any field type that transforms a value (even if it is just whitespace removal) could be problematic. Maybe not the source of your issue here, but I'd be worrying about collisions. For instance, what if you sent "xyz" as a key and "XYZ" as a key? The doc would be overwritten. You may end up with unexpected results when you get the record back... Maybe with your use-case this is OK, but have you considered using string instead?

Tim

-----Original Message-----
From: Christopher Baird [mailto:cba...@cardinalcommerce.com]
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

I'm running Solr in a multicore setup. I've set one of the cores to have a specific field as the unique key (marked as the uniqueKey in the schema, and the field is defined as required). I'm sending an <add> command with all the docs using a multipart post. After running the add file, I send <commit/> and then send <optimize/>. This works fine. When I resend the file (and commit and optimize), I double my document count, and when I do a query by unique key, I get two documents back. I've confirmed using the admin UI (schema browser) that my document count has doubled. I've also confirmed that the unique key is the one I specified (again, using the schema browser). The unique key field is marked as type textTight.

Thanks for any help,
-Chris
Re: Is caching worth it when my whole index is in RAM?
Thanks, Avlesh! I'll try the filter cache. Anybody familiar enough with the caching implementation to chime in? Michael On Mon, Aug 31, 2009 at 10:02 PM, Avlesh Singh wrote: > Good question! > The application level cache, say filter cache, would still help because it > not only caches values but also the underlying computation. Even with all > the data in your RAM you will still end up doing the computations every > time. > > Looking for responses from the more knowledgeable. > > Cheers > Avlesh > > On Mon, Aug 31, 2009 at 8:25 PM, Michael wrote: > > > Hi, > > If I've got my entire 20G 4MM document index in RAM (on a ramdisk), do I > > have a need for the document cache? Or should I set it to 0 items, > because > > pulling field values from an index in RAM is so fast that the document > > cache > > would be a duplication of effort? > > > > Are there any other caches that I should turn off if I can get my entire > > index in RAM? Filter cache, query results cache, etc? > > > > Thanks! > > Michael > > >
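For reference, these caches are configured in solrconfig.xml; a sketch with arbitrary example sizes (not recommendations), where size="0" effectively disables a cache:

    <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
    <!-- the document cache cannot be autowarmed, since internal doc ids change between searchers -->
    <documentCache    class="solr.LRUCache" size="512" initialSize="512"/>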
solrj - Log4j and slf4j integration - java.lang.IllegalStateException thrown
We are using solrj 1.3 (with slf4j) in a client also using Aperture (with log4j 1.2.14). When executing a query I get the error shown below. The request is never received by the server, i.e. the exception is thrown before the request is issued. I think I'm running into a compatibility issue between slf4j and log4j, but don't know how to solve it.

Regards,
Gert.

--- Stack Trace ---

org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:96)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:109)
        at org.esa.huginn.solr.SolrContainer.getNext(SolrContainer.java:105)
        at org.esa.huginn.commons.container.consolidationstrategies.SynonymReplacementConsolidator.execute(SynonymReplacementConsolidator.java:191)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.initialize(FileSystemCrawlerContext.java:134)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.main(FileSystemCrawlerContext.java:63)
Caused by: java.lang.IllegalStateException: Level number 10 is not recognized.
        at org.slf4j.impl.Log4jLoggerAdapter.log(Log4jLoggerAdapter.java:421)
        at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:106)
        at org.apache.commons.httpclient.HttpConnection.releaseConnection(HttpConnection.java:1178)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.releaseConnection(MultiThreadedHttpConnectionManager.java:1423)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:222)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:335)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
RE: Adding new docs, but duplicating instead of updating
Hi Tim,

I appreciate the suggestions. I can tell you that the document I ran the second time was the same document run the first time -- so any questions of field value shouldn't be a concern.

Thanks,
-Chris

-----Original Message-----
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS] [mailto:timothy.j.har...@nasa.gov]
Sent: Tuesday, September 01, 2009 10:45 AM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

I could be off base here; maybe using textTight as a unique key is a common SOLR practice I don't know. But it would seem to me that using any field type that transforms a value (even if it is just whitespace removal) could be problematic. Maybe not the source of your issue here, but I'd be worrying about collisions. For instance, what if you sent "xyz" as a key and "XYZ" as a key? The doc would be overwritten. You may end up with unexpected results when you get the record back... Maybe with your use-case this is OK, but have you considered using string instead?

Tim

-----Original Message-----
From: Christopher Baird [mailto:cba...@cardinalcommerce.com]
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

I'm running Solr in a multicore setup. I've set one of the cores to have a specific field as the unique key (marked as the uniqueKey in the schema, and the field is defined as required). I'm sending an <add> command with all the docs using a multipart post. After running the add file, I send <commit/> and then send <optimize/>. This works fine. When I resend the file (and commit and optimize), I double my document count, and when I do a query by unique key, I get two documents back. I've confirmed using the admin UI (schema browser) that my document count has doubled. I've also confirmed that the unique key is the one I specified (again, using the schema browser). The unique key field is marked as type textTight.

Thanks for any help,
-Chris
Re: solrj - Log4j and slf4j integration - java.lang.IllegalStateException thrown
Are you running the latest versions of these logging libraries? I see nothing in the 1.5.8 SLF4J Log4j adapter that would cause this.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server

On 9/1/09 10:49 AM, "Villemos, Gert" wrote:

We are using solrj 1.3 (with slf4j) in a client also using Aperture (with log4j 1.2.14). When executing a query I get the error shown below. The request is never received by the server, i.e. the exception is thrown before the request is issued. I think I'm running into a compatibility issue between slf4j and log4j, but don't know how to solve it.

Regards,
Gert.

--- Stack Trace ---

org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:96)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:109)
        at org.esa.huginn.solr.SolrContainer.getNext(SolrContainer.java:105)
        at org.esa.huginn.commons.container.consolidationstrategies.SynonymReplacementConsolidator.execute(SynonymReplacementConsolidator.java:191)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.initialize(FileSystemCrawlerContext.java:134)
        at org.esa.huginn.filesystemcrawler.FileSystemCrawlerContext.main(FileSystemCrawlerContext.java:63)
Caused by: java.lang.IllegalStateException: Level number 10 is not recognized.
        at org.slf4j.impl.Log4jLoggerAdapter.log(Log4jLoggerAdapter.java:421)
        at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:106)
        at org.apache.commons.httpclient.HttpConnection.releaseConnection(HttpConnection.java:1178)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.releaseConnection(MultiThreadedHttpConnectionManager.java:1423)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:222)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:335)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
Re: Adding docs from MySQL and php
hi Pablo, DataImportHandler might be the best option for you. check this link http://wiki.apache.org/solr/DataImportHandler regards, aakash On Tue, Sep 1, 2009 at 9:18 PM, Pablo Ferrari wrote: > Hello all, > > I'm new to the list and new to Solr. My name is Pablo, I'm from Spain and > I'm developing a web site using Solr. > > I have Solr with the examples working correctly and now I would like to > load > the data from a MySQL database using php. > Is the best way to do this to write a php script that get the info from the > MySQL and then generates an XML document to load into Solr? Is there a > maximum size for this XML document? My MySQL database is quite big... > > Any help, book or internet tutorial you know will be really appreciated. > > Thank you! > > Pablo >
Why dismax isn't the default with 1.4 and why it doesn't support fuzzy search?
Hello,

Solr is great software, but I have some questions, such as:

The wiki says "As of Solr 1.3, the DisMaxRequestHandler is simply the standard request handler with the default query parser set to the DisMax Query Parser (defType=dismax)." I just made a checkout of svn and dismax doesn't seem to be the default, as:

- http://localhost:8983/solr/select/?q=test~0.5 and
- http://localhost:8983/solr/select/?q=test~0.5&qt=dismax

don't show the same results. Note that I'm new to Solr and I'm using the "example". So is dismax really the default?

Secondly, I've patched Solr with http://issues.apache.org/jira/browse/SOLR-629, as I would like to have fuzzy search with dismax. I built it with "ant example". The behavior is still the same: no fuzzy search with dismax (using the qt=dismax parameter in the GET URL).

In advance, thanks a lot.
Re: Adding docs from MySQL and php
Thanks Aakash!

I've looked at it and it looks very interesting. The problem is that my database is a relational model, so I don't have one table with all the information, but many tables related to each other by their ids (primary keys and foreign keys).

I've been thinking about using DataImportHandler in one of these two ways:
- Write a script that creates a table with all the information I need for searching (not very efficient because of duplicate data)
- Configure DataImportHandler with some JOIN SQL statements (see the sketch below)

I'll let you know how I did. Thanks again!

Pablo

2009/9/1 Aakash Dharmadhikari

> hi Pablo,
>
> DataImportHandler might be the best option for you. check this link
> http://wiki.apache.org/solr/DataImportHandler
>
> regards,
> aakash
>
> On Tue, Sep 1, 2009 at 9:18 PM, Pablo Ferrari wrote:
>
>> Hello all,
>>
>> I'm new to the list and new to Solr. My name is Pablo, I'm from Spain and
>> I'm developing a web site using Solr.
>>
>> I have Solr with the examples working correctly and now I would like to
>> load the data from a MySQL database using php.
>> Is the best way to do this to write a php script that gets the info from
>> MySQL and then generates an XML document to load into Solr? Is there a
>> maximum size for this XML document? My MySQL database is quite big...
>>
>> Any help, book or internet tutorial you know will be really appreciated.
>>
>> Thank you!
>>
>> Pablo
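For the JOIN approach, DataImportHandler's nested entities do this directly; a minimal data-config.xml sketch (the connection details, table and column names are all hypothetical):

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb"
                  user="user" password="pass"/>
      <document>
        <!-- one Solr document per product row -->
        <entity name="product" query="SELECT id, name FROM product">
          <!-- child entity: runs once per product row, adding its categories -->
          <entity name="category"
                  query="SELECT c.label AS category
                         FROM category c JOIN product_category pc ON pc.category_id = c.id
                         WHERE pc.product_id = '${product.id}'"/>
        </entity>
      </document>
    </dataConfig>

Columns map to Solr fields of the same name unless remapped with <field> elements.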
Re: Adding docs from MySQL and php
wow, it looks like DIH already works with relational databases... thanks again! 2009/9/1 Pablo Ferrari > Thanks Aakash! > > I've looked at it and it looks very interesting, the problem is that my > database is a relational model, therefore I don't have a table with all the > information, but many tables related to each other by their ids (primary > keys and foreign keys). > > I've been thinking about using DataImportHandler in any of this two ways: > - Write a script that creates a table with all the information I need for > searching (it is not very efficient because of duplicate data) > - Configure DataImportHandler with some JOIN SQL statement > > I'll let you know how I did, thanks again! > > Pablo > > 2009/9/1 Aakash Dharmadhikari > > hi Pablo, >> >> DataImportHandler might be the best option for you. check this link >> http://wiki.apache.org/solr/DataImportHandler >> >> regards, >> aakash >> >> On Tue, Sep 1, 2009 at 9:18 PM, Pablo Ferrari > >wrote: >> >> > Hello all, >> > >> > I'm new to the list and new to Solr. My name is Pablo, I'm from Spain >> and >> > I'm developing a web site using Solr. >> > >> > I have Solr with the examples working correctly and now I would like to >> > load >> > the data from a MySQL database using php. >> > Is the best way to do this to write a php script that get the info from >> the >> > MySQL and then generates an XML document to load into Solr? Is there a >> > maximum size for this XML document? My MySQL database is quite big... >> > >> > Any help, book or internet tutorial you know will be really appreciated. >> > >> > Thank you! >> > >> > Pablo >> > >> > >
RE: Adding new docs, but duplicating instead of updating
What is the value of your uniqueKey?

-----Original Message-----
From: Christopher Baird [mailto:cba...@cardinalcommerce.com]
Sent: Tuesday, September 01, 2009 8:20 AM
To: solr-user@lucene.apache.org
Subject: RE: Adding new docs, but duplicating instead of updating

Hi Tim,

I appreciate the suggestions. I can tell you that the document I ran the second time was the same document run the first time -- so any questions of field value shouldn't be a concern.

Thanks,
-Chris

-----Original Message-----
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS] [mailto:timothy.j.har...@nasa.gov]
Sent: Tuesday, September 01, 2009 10:45 AM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

I could be off base here; maybe using textTight as a unique key is a common SOLR practice I don't know. But it would seem to me that using any field type that transforms a value (even if it is just whitespace removal) could be problematic. Maybe not the source of your issue here, but I'd be worrying about collisions. For instance, what if you sent "xyz" as a key and "XYZ" as a key? The doc would be overwritten. You may end up with unexpected results when you get the record back... Maybe with your use-case this is OK, but have you considered using string instead?

Tim

-----Original Message-----
From: Christopher Baird [mailto:cba...@cardinalcommerce.com]
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

I'm running Solr in a multicore setup. I've set one of the cores to have a specific field as the unique key (marked as the uniqueKey in the schema, and the field is defined as required). I'm sending an <add> command with all the docs using a multipart post. After running the add file, I send <commit/> and then send <optimize/>. This works fine. When I resend the file (and commit and optimize), I double my document count, and when I do a query by unique key, I get two documents back. I've confirmed using the admin UI (schema browser) that my document count has doubled. I've also confirmed that the unique key is the one I specified (again, using the schema browser). The unique key field is marked as type textTight.

Thanks for any help,
-Chris
RE: Adding new docs, but duplicating instead of updating
Hi Tim,

The value I'm using is a product SKU. A sample would be like: L49-4251.

Thanks,
-Chris

-----Original Message-----
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS] [mailto:timothy.j.har...@nasa.gov]
Sent: Tuesday, September 01, 2009 12:52 PM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

What is the value of your uniqueKey?

-----Original Message-----
From: Christopher Baird [mailto:cba...@cardinalcommerce.com]
Sent: Tuesday, September 01, 2009 8:20 AM
To: solr-user@lucene.apache.org
Subject: RE: Adding new docs, but duplicating instead of updating

Hi Tim,

I appreciate the suggestions. I can tell you that the document I ran the second time was the same document run the first time -- so any questions of field value shouldn't be a concern.

Thanks,
-Chris

-----Original Message-----
From: Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS] [mailto:timothy.j.har...@nasa.gov]
Sent: Tuesday, September 01, 2009 10:45 AM
To: solr-user@lucene.apache.org; cba...@cardinalcommerce.com
Subject: RE: Adding new docs, but duplicating instead of updating

I could be off base here; maybe using textTight as a unique key is a common SOLR practice I don't know. But it would seem to me that using any field type that transforms a value (even if it is just whitespace removal) could be problematic. Maybe not the source of your issue here, but I'd be worrying about collisions. For instance, what if you sent "xyz" as a key and "XYZ" as a key? The doc would be overwritten. You may end up with unexpected results when you get the record back... Maybe with your use-case this is OK, but have you considered using string instead?

Tim

-----Original Message-----
From: Christopher Baird [mailto:cba...@cardinalcommerce.com]
Sent: Tuesday, September 01, 2009 7:30 AM
To: solr-user@lucene.apache.org
Subject: Adding new docs, but duplicating instead of updating

Hi All,

I'm running Solr in a multicore setup. I've set one of the cores to have a specific field as the unique key (marked as the uniqueKey in the schema, and the field is defined as required). I'm sending an <add> command with all the docs using a multipart post. After running the add file, I send <commit/> and then send <optimize/>. This works fine. When I resend the file (and commit and optimize), I double my document count, and when I do a query by unique key, I get two documents back. I've confirmed using the admin UI (schema browser) that my document count has doubled. I've also confirmed that the unique key is the one I specified (again, using the schema browser). The unique key field is marked as type textTight.

Thanks for any help,
-Chris
SOLR vs SQL
RE: http://www.mysecondhome.eu

I am browsing this website again (I have a similar challenge at http://www.casaGURU.com but still prefer database SQL to search Professionals by service type). I don't think SOLR is applicable in this specific case. I think standard DB queries with predefined dropdown/radio values perform far faster than SOLR faceting (you currently have only 9 records) - database queries have consistent response time without dependency on dataset size (especially MySQL MyISAM "SELECT COUNT(*)"); SOLR's response time depends on dataset size.

SOLR is applicable if we are using at least full-text search (for instance, a search for "Jack London" may return a house owned by Jack London in Australia, a house at Jack Square in London, etc.). If we are interested in non-tokenized attributes only (putting heavy constraints on possible query types, without _any_ full-text): database.
RE: SOLR vs SQL
"No results found for 'surface area 377', displaying all properties." - why do we need SOLR then...
Re: extended documentation on analyzers
: is there an online resource or a book that contains a thorough list of
: tokenizers and filters available and their functionality?
:
: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

...from the intro on that page...

    For a more complete list of what Tokenizers and TokenFilters come out
    of the box, please consult the javadocs for the analysis package. if
    you have any tips/tricks you'd like to mention about using any of
    these classes, please add them below.

...with a link to...

http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html

-Hoss
Re: Can solr do the equivalent of "select distinct(field)"?
: lets say you filter your query on something and want to know how many
: distinct "categories" your results comprise.
: then you can facet on the category field and count the number of facet
: values that are returned, right?

If you count the number of facet values returned, you are getting a "count of distinct values".

If you just want the list of distinct values in a field (for your whole index), then the TermsComponent is the fastest way. If you want the list of distinct values across a set of documents, then facet on that field when doing your query.

"select distinct category from books where bookInStock='true'"

...is analogous to looking at the facet section of...

rows=0&q=bookInStock:true&facet=true&facet.field=category

-Hoss
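For the whole-index case, a TermsComponent request might look like this (assuming the /terms handler configured in the Solr 1.4 example solrconfig.xml; the field name is just an example, and terms.limit=-1 asks for all values):

    http://localhost:8983/solr/terms?terms.fl=category&terms.limit=-1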
Re: Date Faceting and Double Counting
: Is this a known behavior people are happy with, or should I file an issue
: asking for ranges in date-facets to be constructed to subtract one second
: from the end of each range (so that the effective range queries for my case

It's a known annoyance, but not something that seems to annoy people enough that there have been any patches to improve the situation. Typically the off-by-1-millisecond approach you describe works for people.

-Hoss
Re: Date Faceting and Double Counting
: When I added numerical faceting to my checkout of Solr (SOLR-1240) I basically
: copied date faceting and modified it to work with numbers instead of dates.
: With numbers I got a lot of double-counted values as well. So to fix my
: problem I added an extra parameter to number faceting where you can specify if
: either end of each range should be inclusive or exclusive. I just ported it

gwk:

1) Would you mind opening a Jira issue for your date faceting improvements as well? (Email attachments tend to get lost, and there are legal headaches with committing them that Jira solves by asking you explicitly if you license them to the ASF.)

2) I haven't looked at your patch, but one of the reasons I never implemented an option like this with date faceting is that the query parser doesn't have any way of letting you write a query that is inclusive on one end and exclusive on the other end -- so you might get accurate facet counts for ranges A-B and B-C (inclusive of the lower, exclusive of the upper), but if you try to filter by one of those ranges, your counts will be off. Did you find a nice solution for this?

-Hoss
Re: Adding new docs, but duplicating instead of updating
: specified (again, using schema browser). The unique key field is marked as
: type textTight.

Your uniqueKey field needs to be something where every doc is only going to produce a single token. If you are using textTight and sending product-SKU type data (as mentioned in another message in this thread), you are probably getting multiple tokens.

Use copyField to put this same sku value into a string field.

-Hoss
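A schema.xml sketch of that advice (field names follow the thread, but the exact layout is an assumption):

    <!-- single-token key: "L49-4251" stays one term, so re-adding it overwrites -->
    <field name="sku" type="string" indexed="true" stored="true" required="true"/>
    <!-- hypothetical copy, kept tokenized for searching -->
    <field name="sku_search" type="textTight" indexed="true" stored="false"/>
    <copyField source="sku" dest="sku_search"/>

    <uniqueKey>sku</uniqueKey>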
Re: Using Lucene's payload in Solr
: Is it possible to have the copyField strip off the payload while it is
: copying, since doing it in the analysis phase is too late? Or should I
: start looking into using UpdateProcessors as Chris had suggested?

"Nope" and "yep".

I've had an idea in the back of my mind for a while now about adding more options to the fieldTypes to specify how the *stored* values should be modified when indexing ... but there's nothing there to do that yet. You have to make the modifications in an UpdateProcessor (or in a response writer).

: >> It seems like it might be simpler to have two new (generic) UpdateProcessors:
: >> one that can clone fieldA into fieldB, and one that can do regex mutations
: >> on fieldB ... neither needs to know about payloads at all, but the first
: >> can make a copy of "2.0|Solr In Action" and the second can strip off the
: >> "2.0|" from the copy.
: >>
: >> then you can write a new NumericPayloadRegexTokenizer that takes in two
: >> regex expressions -- one that knows how to extract the payload from a
: >> piece of input, and one that specifies the tokenization.
: >>
: >> those three classes seem easier to implement, easier to maintain, and more
: >> generally reusable than a custom xml request handler for your updates.

-Hoss
Re: Why dismax isn't the default with 1.4 and why it doesn't support fuzzy search?
: The wiki says "As of Solr 1.3, the DisMaxRequestHandler is simply the
: standard request handler with the default query parser set to the
: DisMax Query Parser (defType=dismax)." I just made a checkout of svn
: and dismax doesn't seem to be the default, as:

That paragraph doesn't say that dismax is the "default handler" ... it says that using qt=dismax is the same as using qt=standard with the query parser set to be the DisMaxQueryParser (using defType=dismax). So doing this replacement on any URL...

qt=dismax => qt=standard&defType=dismax

...should produce identical results.

: Secondly, I've patched Solr with
: http://issues.apache.org/jira/browse/SOLR-629 as I would like to have
: fuzzy with dismax. I built it with "ant example". Now, the behavior is
: still the same, no fuzzy search with dismax (using the qt=dismax
: parameter in the GET URL).

Questions/discussion of uncommitted patches is best done in the Jira issue where you found the patch ... that way it helps other people evaluate the patch, and the author of the patch is more likely to see your feedback.

-Hoss
Searching for a set of keywords/phrases in a document
I have a large document with various sections. Each section has a list of keywords/phrases of interest. I have a master list of keywords/phrases stored as a String array. How can I use Solr or Lucene to search each section document for all keywords and basically tell me which keywords were found? I can't think of any straightforward way to implement this.
Re: Monitoring split time for fq queries when filter cache is used
Thank you Martijn. On Tue, Sep 1, 2009 at 8:07 PM, Martijn v Groningen < martijn.is.h...@gmail.com> wrote: > Hi Rahul, > > Yes you are understanding is correct, but it is not possible to > monitor these actions separately with Solr. > > Martijn > > 2009/9/1 Rahul R : > > Hello, > > I am trying to measure the benefit that I am getting out of using the > filter > > cache. As I understand, there are two major parts to an fq query. Please > > correct me if I am wrong : > > - doing full index queries of each of the fq params (if filter cache is > > used, this result will be retrieved from the cache) > > - set intersection of above results (Will be done again even with filter > > cache enabled) > > > > Is there any flag/setting that I can enable to monitor how much time the > > above operations take separately i.e. the querying and the > set-intersection > > ? > > > > Regards > > Rahul > > > > > > -- > Met vriendelijke groet, > > Martijn van Groningen >
RE: encoding problem
Finally resolved the problem! The solution was 3-pronged on my Windows PC:

1. Added to my.ini under the [mysqld] section:

   default-character-set=utf8
   collation_server=utf8_unicode_ci
   character_set_server=utf8
   skip-character-set-client-handshake

2. Added to the JAVA_OPTS environment variable:

   -Dfile.encoding=UTF-8

3. Added to the beginning of Tomcat's startup.bat (positioning is important!):

   set JAVA_OPTS="-Dfile.encoding=UTF-8"

Thanks to everyone for their much appreciated help!

Bern

-----Original Message-----
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au]
Sent: Monday, 31 August 2009 9:18 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: encoding problem

Still having a few issues with encoding, although I've been able to resolve the particular issue below by just re-editing the affected record.

The other encoding issue is with Greek characters. With Solr turned off in our user-facing application, Greek characters, e.g. α, ω (small alpha, small omega), display correctly. But with Solr turned on, garbage displays instead. If we enter the characters as decimal entities (e.g. &#969;), all displays OK with or without Solr.

Does this suggest anything to anyone?

TIA
bern