Re: use of filter queries in Lucene/Solr Alpha4.0 and Beta4.0
Hoss, I'm so happy you realized the problem because I was quite worried about it!! Let me know if I can provide support with testing it. For the last two days I was busy migrating a bunch of hosts, which should -hopefully- be finished today. Then I will again have the infrastructure for running tests. Günter

On 09/05/2012 11:19 PM, Chris Hostetter wrote: : Subject: Re: use of filter queries in Lucene/Solr Alpha4.0 and Beta4.0 Günter, This is definitely strange. The good news is, I can reproduce your problem. The bad news is, I can reproduce your problem - and I have no idea what's causing it. I've opened SOLR-3793 to try to get to the bottom of this, and included some basic steps to demonstrate the bug using the Solr 4.0-BETA example data, but I'm really not sure what the problem might be... https://issues.apache.org/jira/browse/SOLR-3793 -Hoss

-- Universität Basel, Universitätsbibliothek, Günter Hipler, Projekt SwissBib, Schoenbeinstrasse 18-20, 4056 Basel, Schweiz. Tel.: +41 (0)61 267 31 12, Fax: +41 61 267 31 03, e-mail: guenter.hip...@unibas.ch, URL: www.swissbib.org / http://www.ub.unibas.ch/
RE: Delete all documents in the index
One more thanks for posting this! I struggled with the same issue yesterday and solved it with the _version_ hint from the mailing list. Alex.

-Original Message- From: Mark Mandel [mailto:mark.man...@gmail.com] Sent: Thursday, September 06, 2012 1:53 AM To: solr-user@lucene.apache.org Subject: Re: Delete all documents in the index

Thanks for posting this! I ran into exactly this issue yesterday, and ended up deleting the files to get around it. Mark Sent from my mobile doohickey.

On Sep 6, 2012 4:13 AM, Rohit Harchandani rhar...@gmail.com wrote: Thanks everyone. Adding the _version_ field in the schema worked. Deleting the data directory works for me, but I was not sure why deleting using curl was not working.

On Wed, Sep 5, 2012 at 1:49 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Rohit: If it's easy, the easiest thing to do is to turn off your servlet container, rm -r * inside of the data directory, and then restart the container. Michael Della Bitta, Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017 | www.appinions.com - Where Influence Isn't a Game

On Wed, Sep 5, 2012 at 12:56 PM, Jack Krupansky j...@basetechnology.com wrote: Check to make sure that you are not stumbling into SOLR-3432: deleteByQuery silently ignored if updateLog is enabled, but _version_ field does not exist in schema. See: https://issues.apache.org/jira/browse/SOLR-3432 This could happen if you kept the new 4.0 solrconfig.xml, but copied in your pre-4.0 schema.xml. -- Jack Krupansky

-Original Message- From: Rohit Harchandani Sent: Wednesday, September 05, 2012 12:48 PM To: solr-user@lucene.apache.org Subject: Delete all documents in the index

Hi, I am having difficulty deleting documents from the index using curl. The URLs I tried were:

  curl "http://localhost:9020/solr/core1/update/?stream.body=<delete><query>*:*</query></delete>&commit=true"

  curl "http://localhost:9020/solr/core1/update/?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:[* TO *]</query></delete>'

  curl "http://localhost:9020/solr/core1/update/?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'

I also tried:

  curl "http://localhost:9020/solr/core1/update/?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true"

as suggested on some forums. I get a response with status=0 in all cases, but none of the above seem to work. When I run

  curl "http://localhost:9020/solr/core1/select?q=*:*&rows=0&wt=xml"

I still get a value for numFound. I am currently using the solr 4.0 beta version. Thanks for your help in advance. Regards, Rohit
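For reference, the fix the thread converges on (SOLR-3432) is declaring the update-log version field in schema.xml; in the Solr 4.0 example schema the declaration is a single line:

  <field name="_version_" type="long" indexed="true" stored="true"/>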
Re: AW: AW: auto completion search with solr using NGrams in SOLR
Hi, Thanks, I am getting the results with the URL below:

  suggest/?q=michael b&df=title&defType=lucene&fl=title

But I want the results in the spellcheck section, and I want to search with title or empname or both. Aniljayanti

-- View this message in context: http://lucene.472066.n3.nabble.com/auto-completion-search-with-solr-using-NGrams-in-SOLR-tp3998559p4005812.html
terms component search
Hi, I am trying to implement some auto suggest functionality, and am currently looking at the terms component (Solr 3.6). For example, I can form a query like this:

  http://solrhost/solr/mycore/terms?terms.fl=title_s&terms.sort=index&terms.limit=5&terms.prefix=Hotel+C

which searches in the title_s field for strings starting Hotel C. Results could be:

  Hotel Chicago, 2
  Hotel California, 8
  Hotel Cool, 4

Is it possible to get more info in the results from this component - like return data from other fields? For example, along with the results from the title_s field, the corresponding data from the telephone field. Or, maybe I simply should execute a normal wildcard search. Thanks, Peter
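For anyone driving this from SolrJ instead of raw HTTP, the same terms request might look roughly like this (a sketch against SolrJ 3.6, assuming the /terms handler from the example solrconfig.xml is registered):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.client.solrj.response.TermsResponse;

  public class TermsSuggest {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://solrhost/solr/mycore");
          SolrQuery q = new SolrQuery();
          q.set("terms", "true");            // enable the terms component
          q.set("terms.fl", "title_s");      // field to pull terms from
          q.set("terms.sort", "index");
          q.set("terms.limit", "5");
          q.set("terms.prefix", "Hotel C");
          QueryRequest req = new QueryRequest(q);
          req.setPath("/terms");             // hit the terms handler directly
          QueryResponse rsp = req.process(server);
          for (TermsResponse.Term t : rsp.getTermsResponse().getTerms("title_s")) {
              System.out.println(t.getTerm() + ", " + t.getFrequency());
          }
      }
  }

Note, though, that the terms component only sees raw terms and their counts; it cannot return data from other fields, which is why the reply below points at search-based autocomplete instead.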
Re: Document Processing
If your interest is focusing on the real textual content of a web page, you could try this : JReadability (https://github.com/ifesdjeen/jReadability , Apache 2.0 license), which wraps JSoup (as Lance suggested) and applies a set of predefined rules to scrap crap (nav, headers, footers, ...) off of the content. If you'd rather have the possibility to map portions of a webpage to dedicated solr fields, using JSoup on its own could be a win. Read this : https://norrisshelton.wordpress.com/2011/01/27/jsoup-java-html-parser/ Hope this helps, -- Tanguy 2012/9/6 Lance Norskog goks...@gmail.com There is another way to do this: crawl the mobile site! The Fennec browser from Mozilla talks Android. I often use it to get pagecrap off my screen. - Original Message - | From: Lance Norskog goks...@gmail.com | To: solr-user@lucene.apache.org | Sent: Wednesday, August 29, 2012 7:37:37 PM | Subject: Re: Document Processing | | I've seen the JSoup HTML parser library used for this. It worked | really well. The Boilerpipe library may be what you want. Its | schwerpunkt (*) is to separate boilerplate from wanted text in an | HTML | page. I don't know what fine-grained control it has. | | * raison d'être. There is no English word for this concept. | | On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili | tommaso.teof...@gmail.com wrote: | Hello Michael, | | I can help you with using the UIMA UpdateRequestProcessor [1]; the | current | implementation uses in-memory execution of UIMA pipelines but since | I was | planning to add the support for higher scalability (with UIMA-AS | [2]) that | may help you as well. | | Tommaso | | [1] : | http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java | [2] : http://uima.apache.org/doc-uimaas-what.html | | 2011/12/5 Michael Kelleher mj.kelle...@gmail.com | | Hello Erik, | | I will take a look at both: | | org.apache.solr.update.**processor.**LangDetectLanguageIdentifierUp** | dateProcessor | | and | | org.apache.solr.update.**processor.**TikaLanguageIdentifierUpdatePr** | ocessor | | | and figure out what I need to extend to handle processing in the | way I am | looking for. I am assuming that component configuration is | handled in a | standard way such that I can configure my new UpdateProcessor in | the same | way I would configure any other UpdateProcessor component? | | Thanks for the suggestion. | | | 1 more question: given that I am probably going to convert the | HTML to | XML so I can use XPath expressions to extract my content, do you | think | that this kind of processing will overload Solr? This Solr | instance will | be used solely for indexing, and will only ever have a single | ManifoldCF | crawling job feeding it documents at one time. | | --mike | | | | | -- | Lance Norskog | goks...@gmail.com |
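If you go the JSoup route for field mapping, the skeleton might look roughly like this (a sketch; the URL and CSS selectors are hypothetical and would need to match your markup):

  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;

  public class PageFieldMapper {
      public static void main(String[] args) throws Exception {
          // Fetch and parse the page (placeholder URL)
          Document doc = Jsoup.connect("http://example.com/page.html").get();
          // Map portions of the page to would-be Solr fields
          String title = doc.title();
          String body  = doc.select("div#content").text();     // hypothetical selector
          String phone = doc.select("span.telephone").text();  // hypothetical selector
          System.out.println(title + " | " + body + " | " + phone);
      }
  }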
Re: terms component search
Hi Peter, Yes, if you want to do complex things in suggest mode, you'd better rely on the SearchComponent... For example, this blog post is a good read if you have complex requirements on the searched fields: http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/ (Although your requirements seem to be more related to results extraction than query building.) Kind regards, -- Tanguy

2012/9/6 Peter Kirk p...@alpha-solutions.dk [quoted text trimmed - see above]
Re: solr indexing slows down after few minutes
Commit is not too often; it's a batch of 100 records, and it takes 40 to 60 secs before another commit. No, I am not indexing with multiple threads - it uses a single thread executor. I have seen steady performance for now after increasing the merge factor from 10 to 25. Will have to wait and watch if that reduces the search speed, but so far so good. Thanks, Amit

On Thu, Aug 30, 2012 at 10:53 PM, pravesh [via Lucene] ml-node+s472066n4004421...@n3.nabble.com wrote: Did you check the wiki: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed Do you commit often? Do you index with multiple threads? Also try experimenting with the various MergePolicies available from SOLR 3.4 onwards. Thanx, Pravesh

-- View this message in context: http://lucene.472066.n3.nabble.com/solr-indexing-slows-down-after-few-minutes-tp4004337p4005864.html
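For reference, in a 3.x solrconfig.xml the setting Amit changed lives in the index defaults section; a minimal sketch:

  <indexDefaults>
    <mergeFactor>25</mergeFactor>
  </indexDefaults>

A higher mergeFactor means fewer merges during indexing (faster writes) at the cost of more segments to search across, which is exactly the search-speed trade-off Amit mentions watching for.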
solr 3.6.1 tomcat 7.0 missing core name in path
Hi, I have installed solr 3.6.1 on tomcat 7.0 following the steps here: http://ralf.schaeftlein.de/2012/02/10/installing-solr-3-5-under-tomcat-7/ The solr home page loads fine, but the admin page (http://localhost:8080/solr/admin/) throws the error missing core name in path. I am installing a single core. This is the solr.xml:

  <solr persistent="false">
    <cores adminPath="/admin/cores" defaultCoreName="collection1">
      <core name="collection1" instanceDir="." />
    </cores>
  </solr>

I have double checked a lot of steps by searching on the net, but no luck. If anyone has faced this, please suggest. Thanks, Amit

-- View this message in context: http://lucene.472066.n3.nabble.com/solr-3-6-1-tomcat-7-0-missing-core-name-in-path-tp4005868.html
Facetting inside a custom component
Hello, I'm currently developing a custom component in Solr. This component works fine. The problem I have is that I only have access to the searcher, which gives me the option to fire e.g. BooleanQueries. This searcher gives me a result, which I have to iterate over to calculate information that could be delivered by Solr facets out of the box. The problem I'm facing is that there is no option for faceting. There is an example on the Lucene side here: http://lucene.apache.org/core/4_0_0-BETA/facet/org/apache/lucene/facet/doc-files/userguide.html The problem is that I have no taxonomyDir / I don't know where to get it:

  TaxonomyReader taxo = new DirectoryTaxonomyReader(taxoDir);

Does anybody have an idea how to gather facet information? Thanks in advance, Ralf
Re: Facetting inside a custom component
Hi, I just found a solution, but you have to know what you want to count:

  try {
      final SolrIndexSearcher s = rb.req.getSearcher();
      final SolrQueryParser qp = new SolrQueryParser(rb.req.getSchema(), null);
      final String queryString = "entity_type:RELEASE";
      final Query q = qp.parse(queryString);
      final DocListAndSet results = s.getDocListAndSet(q, (List<Query>) null, (Sort) null,
              rb.req.getStart(), rb.req.getLimit());
      final NamedList counts = new NamedList();
      for (String fc : ImmutableSet.of("entity_type:RELEASE", "entity_type:PRODUCT_NAME")) {
          counts.add(fc, s.numDocs(qp.parse(fc), results.docSet));
      }
      // in counts you have your facet-counts
  } catch (ParseException pe) {
      pe.printStackTrace();
  }

-------- Original Message -------- Date: Thu, 06 Sep 2012 13:58:29 +0200 From: Ralf Heyde ralf.he...@gmx.de To: solr-user@lucene.apache.org Subject: Facetting inside a custom component [quoted text trimmed - see above]
Solr 4.0alpha: edismax complaints on certain characters
Hello, I was under the impression that edismax was supposed to be crash proof and just ignore bad syntax. But I am either misconfiguring it or hit a weird bug. I basically searched for text containing '/' and got this:

  {
    'responseHeader'=>{
      'status'=>400,
      'QTime'=>9,
      'params'=>{
        'qf'=>'TitleEN DescEN',
        'indent'=>'true',
        'wt'=>'ruby',
        'q'=>'foo/bar',
        'defType'=>'edismax'}},
    'error'=>{
      'msg'=>'org.apache.lucene.queryparser.classic.ParseException: Cannot parse \'foo/bar \': Lexical error at line 1, column 9. Encountered: <EOF> after : "/bar "',
      'code'=>400}}

Is that normal? If it is, is there a known list of characters I need to escape, or do I just have to catch the exception and tell the user not to do this again? Regards, Alex.

Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
RE: Solr 4.0alpha: edismax complaints on certain characters
As far as I understand, / is a special character and needs to be escaped. Maybe foo\/bar should work? I found this when I looked at the code of ClientUtils.escapeQueryChars:

  // These characters are part of the query syntax and must be escaped
  if (c == '\\' || c == '+' || c == '-' || c == '!'  || c == '(' || c == ')' || c == ':'
      || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~'
      || c == '*' || c == '?' || c == '|' || c == '&'  || c == ';' || c == '/'
      || Character.isWhitespace(c)) {
    sb.append('\\');

-Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Thursday, September 06, 2012 4:35 PM To: solr-user@lucene.apache.org Subject: Solr 4.0alpha: edismax complaints on certain characters [quoted text trimmed - see above]
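If the query string is built client-side with SolrJ, that escaping is available directly rather than re-implemented by hand (a minimal sketch):

  import org.apache.solr.client.solrj.util.ClientUtils;

  String raw = "foo/bar";
  String escaped = ClientUtils.escapeQueryChars(raw);  // yields foo\/bar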
Re: Solr 4.0alpha: edismax complaints on certain characters
I believe this is caused by the regex support in https://issues.apache.org/jira/browse/LUCENE-2039 It certainly seems wrong to interpret a slash in the middle of the word as the start of a regex, so I've reopened the issue. -Yonik http://lucidworks.com

On Thu, Sep 6, 2012 at 9:34 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: [quoted text trimmed - see above]
AW: Website (crawler for) indexing
Thanks Rafał and Markus for your comments. I think Droids has a serious problem with URL parameters in the current version (0.2.0) from Maven central: https://issues.apache.org/jira/browse/DROIDS-144 I knew about Nutch, but I haven't been able to implement a crawler with it. Have you done that or seen an example application? It's probably easy to call a Nutch jar and make it index a website, and maybe I will have to do that. But as we already have a Java implementation to index other sources, it would be nice if we could integrate the crawling part too. Regards, Alexander

Hello! You can implement your own crawler using Droids (http://incubator.apache.org/droids/) or use Apache Nutch (http://nutch.apache.org/), which is very easy to integrate with Solr and is a very powerful crawler. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

This may be a bit off topic: How do you index an existing website and control the data going into the index? We already have Java code to process the HTML (or XHTML) and turn it into a SolrJ Document (removing tags and other things we do not want in the index). We use SolrJ for indexing. So I guess the question is essentially which Java crawler could be useful. We used to use wget on the command line in our publishing process, but we no longer want to do that. Thanks, Alexander
RE: deletedPkQuery not work in solr 3.3
You have deletedPKQuery, but the correct spelling is deletedPkQuery (lowercase k). Try that and see if it fixes your problem. Also, you can probably simplify this if you do this as command=full-import&clean=false, then use something like this for your query:

  select product_id as '$deleteDocById' from modified_product
  where gmt_create > to_date('${dataimporter.last_index_time}','yyyy-mm-dd hh24:mi:ss')
  and modification = 'deleted'

See http://wiki.apache.org/solr/DataImportHandler#Special_Commands for more info on this technique. Finally, you will want to be aware of https://issues.apache.org/jira/browse/SOLR-2492 , a bug which was fixed in Solr 3.4. DIH doesn't automatically do a commit in some cases if your import only does deletes. You need to issue a commit manually after it completes. James Dyer, E-Commerce Systems, Ingram Content Group, (615) 213-4311

-Original Message- From: jun Wang [mailto:wangjun...@gmail.com] Sent: Wednesday, September 05, 2012 8:33 PM To: solr-user@lucene.apache.org Subject: deletedPkQuery not work in solr 3.3

I have a data-config.xml with 2 entities, like:

  <entity name="full" PK="ID"> ... </entity>

and

  <entity name="delta_build" PK="ID"> ... </entity>

entity delta_build is for delta import; the query is ?command=full-import&entity=delta_build&clean=false and I want to use deletedPkQuery to delete from the index. So I have added these to entity delta_build:

  deltaQuery="select -1 as ID from dual"
  deltaImportQuery="select * from product where a.id='${dataimporter.delta.ID}'"
  deletedPKQuery="select product_id as ID from modified_product where gmt_create &gt; to_date('${dataimporter.last_index_time}','yyyy-mm-dd hh24:mi:ss') and modification = 'deleted'"

deltaQuery and deltaImportQuery are simply there to stop the delta import picking up any records, because delta import is already implemented via full import, and I just want to use delta for deleting from the index. But when I hit the query ?command=delta-import, deltaQuery and deltaImportQuery can be found in the log, but not deletedPKQuery. Is there anything wrong in the config file? -- from Jun Wang
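Put together, James's special-command suggestion might look like this in data-config.xml (a sketch reusing the table and column names from the thread):

  <entity name="delta_build" pk="ID"
          query="select product_id as '$deleteDocById' from modified_product
                 where gmt_create &gt; to_date('${dataimporter.last_index_time}','yyyy-mm-dd hh24:mi:ss')
                 and modification = 'deleted'"/>

invoked as ?command=full-import&entity=delta_build&clean=false, followed by an explicit commit on Solr 3.3 because of the SOLR-2492 behavior mentioned above.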
Re: Solr 4.0alpha: edismax complaints on certain characters
That's what I was thinking, but when I tried foo/bar in Solr 3.6 and 4.0-BETA it was working fine - it split the term and generated the proper query without any error. I think the problem is if you use the default Lucene query parser, not edismax. I removed defType=edismax from my query request and the problem reproduces. My two test queries:

  http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=features&q=foo/bar
  http://localhost:8983/solr/select/?debugQuery=true&df=features&q=foo/bar

The first works; the second fails as reported (in 4.0-BETA, but works in 3.6). -- Jack Krupansky

-Original Message- From: Yonik Seeley Sent: Thursday, September 06, 2012 9:53 AM To: solr-user@lucene.apache.org Subject: Re: Solr 4.0alpha: edismax complaints on certain characters [quoted text trimmed - see above]
Re: Solr 4.0alpha: edismax complaints on certain characters
I am on 4.0 alpha. Maybe it was fixed in beta. But I am most definitely seeing this in edismax. If I get rid of / and use debugQuery, I get:

  'responseHeader'=>{
    'status'=>0,
    'QTime'=>14,
    'params'=>{
      'debugQuery'=>'true',
      'indent'=>'true',
      'q'=>'foobar',
      'qf'=>'TitleEN DescEN',
      'wt'=>'ruby',
      'defType'=>'edismax'}},
  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
  },
  'debug'=>{
    'rawquerystring'=>'foobar',
    'querystring'=>'foobar',
    'parsedquery'=>'(+DisjunctionMaxQuery((DescEN:foobar | TitleEN:foobar)))/no_coord',
    'parsedquery_toString'=>'+(DescEN:foobar | TitleEN:foobar)',
    'explain'=>{},
    'QParser'=>'ExtendedDismaxQParser',

I'll check beta on my machine by tomorrow. Regards, Alex.

Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Sep 6, 2012 at 10:06 AM, Jack Krupansky j...@basetechnology.com wrote: [quoted text trimmed - see above]
Re: AW: Website (crawler for) indexing
Hello! I think that really depends on what you want to achieve and what parts of your current system you would like to reuse. If it is only HTML processing, I would let Nutch and Solr do that. Of course you can extend Nutch (it has a plugin API) and implement the custom logic you need as a Nutch plugin. There is even an example of a Nutch plugin available (http://wiki.apache.org/nutch/WritingPluginExample), but it's for Nutch 1.3. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

[quoted text trimmed - see above]
Re: Solr 4.0alpha: edismax complaints on certain characters
I do in fact see your problem with an earlier 4.0 build, but not with 4.0-BETA. -- Jack Krupansky

-Original Message- From: Alexandre Rafalovitch Sent: Thursday, September 06, 2012 10:13 AM To: solr-user@lucene.apache.org Subject: Re: Solr 4.0alpha: edismax complaints on certain characters [quoted text trimmed - see above]
RE: Website (crawler for) indexing
-Original message- From: Lochschmied, Alexander alexander.lochschm...@vishay.com Sent: Thu 06-Sep-2012 16:04 To: solr-user@lucene.apache.org Subject: AW: Website (crawler for) indexing

Thanks Rafał and Markus for your comments. I think Droids has a serious problem with URL parameters in the current version (0.2.0) from Maven central: https://issues.apache.org/jira/browse/DROIDS-144 I knew about Nutch, but I haven't been able to implement a crawler with it. Have you done that or seen an example application?

We've been using it for some years now for our site search customers and are happy, but it can be quite a beast to begin with. The Nutch tutorial will walk you through the first steps, crawling and indexing to Solr.

It's probably easy to call a Nutch jar and make it index a website and maybe I will have to do that. But as we already have a Java implementation to index other sources, it would be nice if we could integrate the crawling part too.

You can control Nutch from within another application but I'd not recommend it; it's batch based and can take quite some time and resources to run. We usually prefer running a custom shell script controlling the process and call it via cron.

Regards, Alexander

[remainder of quoted thread trimmed - see above]
Re: Solr 4.0alpha: edismax complaints on certain characters
The fix in edismax was made just a few days (6/28) before the formal announcement of 4.0-ALPHA (7/3), but unfortunately the fix came a few days after the cutoff for 4.0-ALPHA (6/25). See: https://issues.apache.org/jira/browse/SOLR-3467 (That issue should probably be annotated to indicate that it affects 4.0-ALPHA.) -- Jack Krupansky

-Original Message- From: Alexandre Rafalovitch Sent: Thursday, September 06, 2012 10:13 AM To: solr-user@lucene.apache.org Subject: Re: Solr 4.0alpha: edismax complaints on certain characters [quoted text trimmed - see above]
Re: Problem with verifying signature ?
: gpg: Signature made 08/06/12 19:52:21 Pacific Daylight Time using RSA key : ID 322D7ECA : gpg: Good signature from Robert Muir (Code Signing Key) rm...@apache.org : *gpg: WARNING: This key is not certified with a trusted signature!* : gpg: There is no indication that the signature belongs to the owner. : Primary key fingerprint: 6661 9BA3 C030 DD55 3625 1303 817A E1DD 322D 7ECA : : Is this acceptable ?

I guess it depends on what you mean by acceptable? I'm not an expert on this, but as I understand it... gpg is telling you that it confirmed the signature matches a known key named Robert Muir (Code Signing Key) which is in your keyring, but that there is no certified level of trust associated with that key. Key trust is a personal thing, specific to you, your keyring, and how you got the keys you put in that ring. If you trust that the KEYS file you downloaded from apache.org is legitimate, and that all the keys in it should be trusted, you can tell gpg that (using the trust interactive command when using --edit-key). Alternatively, you could tell gpg that you have a high level of trust in the key of some other person you have met personally -- i.e.: if you met Uwe at a conference and he physically handed you his key on a USB drive -- and then if Uwe has signed Robert's key with his own (I think he has, not sure off the top of my head), then gpg would extend an implicit transitive trust to Robert's key... http://www.gnupg.org/gph/en/manual.html#AEN335 -Hoss
Re: Solr not allowing persistent HTTP connections
: Some extra information. If I use curl and force it to use HTTP 1.0, it is more : visible that Solr doesn't allow persistent connections:

a) solr has nothing to do with it, it's entirely something under the control of jetty & the client.

b) I think you are introducing confusion by trying to force an HTTP/1.0 connection -- Jetty supports Keep-Alive for HTTP/1.1, but maybe not for HTTP/1.0?

If you use curl to request multiple URLs and just let curl & jetty do their normal behavior (w/o trying to bypass anything or manually add headers) you can see that keep-alive is in fact working...

  $ curl -v --keepalive 'http://localhost:8983/solr/select?q=*:*' 'http://localhost:8983/solr/select?q=foo'
  * About to connect() to localhost port 8983 (#0)
  *   Trying 127.0.0.1... connected
  > GET /solr/select?q=*:* HTTP/1.1
  > User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
  > Host: localhost:8983
  > Accept: */*
  >
  < HTTP/1.1 200 OK
  < Content-Type: application/xml; charset=UTF-8
  < Transfer-Encoding: chunked
  <
  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str></lst></lst><result name="response" numFound="0" start="0"/>
  </response>
  * Connection #0 to host localhost left intact
  * Re-using existing connection! (#0) with host localhost
  * Connected to localhost (127.0.0.1) port 8983 (#0)
  > GET /solr/select?q=foo HTTP/1.1
  > User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
  > Host: localhost:8983
  > Accept: */*
  >
  < HTTP/1.1 200 OK
  < Content-Type: application/xml; charset=UTF-8
  < Transfer-Encoding: chunked
  <
  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int><lst name="params"><str name="q">foo</str></lst></lst><result name="response" numFound="0" start="0"/>
  </response>
  * Connection #0 to host localhost left intact
  * Closing connection #0

-Hoss
Re: Solr not allowing persistent HTTP connections
Thank you. I did the test with curl the same way you did it and it works. I still cannot get ab (apache benchmark) to reuse connections to solr. I'll investigate this further.

  $ ab -c 1 -n 100 -k 'http://localhost:8983/solr/select?q=*:*' | grep Alive
  Keep-Alive requests:    0

-- Aleksey

On 12-09-06 11:07 AM, Chris Hostetter wrote: [quoted text trimmed - see above]
Solr-Export
Hey Guys, I created a program to export Solr index data to XML. The URL is https://github.com/eltu/Solr-Export Please tell me about any problems. *** I have only tested it with Solr 3.6.1. Thanks, Helton
Solr search not working after copying a new field to an existing Indexed Field
I have made a schema change to copy an existing field name (source field) to an existing search field text (destination field). Since I made the schema change, I updated all the documents, thinking the new source field would be clubbed together with the text field. The search for a specific name does not return results. If I delete the document and then add it back, it works just fine. I thought the add command with the default override option would work as a delete and add. Is this the only way to reindex the text field? Is there any other method? I really appreciate your help on this!

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-search-not-working-after-copying-a-new-field-to-an-existing-Indexed-Field-tp4005993.html
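For reference, the copy described here is normally declared in schema.xml like this (a sketch with hypothetical names matching the description):

  <field name="name" type="string" indexed="true" stored="true"/>
  <field name="text" type="text"   indexed="true" stored="false" multiValued="true"/>
  <copyField source="name" dest="text"/>

copyField only takes effect at index time, so documents indexed before the schema change carry no copied values until they are re-sent (or the index is rebuilt).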
NoHttpResponseException: The server failed to respond
We have a distributed solr setup with 8 servers and 8 cores on each server in production. We see this error multiple times in our solr servers. We are using solr 3.6.1. Has anyone seen this error before and have you resolved it?

  2012-09-04 02:16:40,995 [http-nio-8080-exec-7] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:275)
      at com.nextag.search.solr.handler.ProductSearchHandler.handleRequestBody(ProductSearchHandler.java:269)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
      at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
      at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
      at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
      at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
      at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1653)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:722)
  Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.commons.httpclient.NoHttpResponseException: The server solr2-vip.servername.com failed to respond
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
      at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:129)
      at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:103)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      ... 3 more
  Caused by: org.apache.commons.httpclient.NoHttpResponseException: The server solr2-vip.servername.com failed to respond
      at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1976)
      at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
      at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
      at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
      at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
      at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
      at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:419)

Here is the shardHandlerFactory setting in our config:

  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="connTimeout">5000</int>
    <int name="socketTimeout">3</int>
  </shardHandlerFactory>

We checked that Full GC is not running so frequently and the server is not too busy.

-- View this message in context: http://lucene.472066.n3.nabble.com/NoHttpResponseException-The-server-failed-to-respond-tp4006017.html
Re: UnInvertedField limitations
Hi Jack, 24 bits = 16M possibilities, it's clear; just to confirm... the rest is unclear: why can 4 bytes only give 4 million cardinality? I thought it is 4 billion... And, just to confirm: UnInvertedField allows 16M cardinality, correct?

On 12-08-20 6:51 PM, Jack Krupansky j...@basetechnology.com wrote: It appears that there is a hard limit of 24 bits or 16M for the number of bytes to reference the terms in a single field of a single document. It takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes, that would allow 16M/4 or 4 million unique terms - per document. Do you have such large documents? This appears to be a hard limit based on 24 bits in a Java int. You can try facet.method=enum, but that may be too slow. What release of Solr are you running? -- Jack Krupansky

-Original Message- From: Fuad Efendi Sent: Monday, August 20, 2012 4:34 PM To: Solr-User@lucene.apache.org Subject: UnInvertedField limitations

Hi All, I have a problem... (Yonik, please!) help me, what are the term count limits? I possibly have 256,000,000 different terms in a field... or 16,000,000? Thanks!

  2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - : org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field enrich_keywords_string_mv
      at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:179)
      at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668)
      at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
      at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423)
      at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
      at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)

-- Fuad Efendi http://www.tokenizer.ca
Re: UnInvertedField limitations
Hi Lance, the use case is keyword extraction, and it could be 2- and 3-grams (2- and 3-word combinations); so theoretically we can have 10,000^3 = 1,000,000,000,000 3-grams for English alone... Of course my suggestion is to use statistics and to build a dictionary of such 3-word combinations (remove top, remove tail, using frequencies)... and to hard-limit this dictionary to 1,000,000... That was a business requirement which is technically impossible to implement (as realtime query results); we don't even use word stemming etc... -Fuad

On 12-08-20 7:22 PM, Lance Norskog goks...@gmail.com wrote: Is this required by your application? Is there any way to reduce the number of terms? A workaround is to use shards. If your terms follow Zipf's Law each shard will have fewer than the complete number of terms. For N shards, each shard will have ~1/N of the singleton terms. For 2-count terms, 1/N or 2/N will have that term. Now I'm interested but not mathematically capable: what is the general probabilistic formula for splitting Zipf's Law across shards?

On Mon, Aug 20, 2012 at 3:51 PM, Jack Krupansky j...@basetechnology.com wrote: [quoted text trimmed - see above] -- Lance Norskog goks...@gmail.com
Re: UnInvertedField limitations
It's actually limited to 24 bits to point to the term list in a byte[], but there are 256 different arrays, so the maximum capacity is 4B bytes of un-inverted terms; each bucket is limited to 4B/256, though, so the real limit can come in at a little less due to luck. From the comments:

   * There is a single int[maxDoc()] which either contains a pointer into a byte[] for
   * the termNumber lists, or directly contains the termNumber list if it fits in the 4
   * bytes of an integer. If the first byte in the integer is 1, the next 3 bytes
   * are a pointer into a byte[] where the termNumber list starts.
   *
   * There are actually 256 byte arrays, to compensate for the fact that the pointers
   * into the byte arrays are only 3 bytes long. The correct byte array for a document
   * is a function of it's id.

-Yonik http://lucidworks.com

On Thu, Sep 6, 2012 at 6:33 PM, Fuad Efendi f...@efendi.ca wrote: [quoted text trimmed - see above]
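A sketch of the pointer scheme that comment describes (illustration only, not the actual Solr source; in particular, the exact function mapping a document to one of the 256 arrays is an assumption here):

  int value = docs[docId];                      // the int[maxDoc()] entry
  if ((value >>> 24) == 1) {                    // first byte == 1: it's a pointer
      int offset = value & 0x00ffffff;          // remaining 3 bytes: offset into a byte[]
      byte[] arr = byteArrays[docId & 0xff];    // pick one of the 256 arrays from the doc id
      // ... the termNumber list starts at arr[offset]
  } else {
      // the termNumber list is packed directly into the int itself
  }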
Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
Hi, I am using Solr with DIH and started getting errors when the database time/date fields are getting imported into Solr. I have used date as the field type, but when I looked at the docs it looks like the date field does not accept the (Thu, 06 Sep 2012 22:32:33 +0000) or (1346976590) formats. Also, when I used field_type 'text_ar' and indexed a line with Arabic text, Solr is displaying some non-ISO characters. It looks like the text is not being unicoded. Did anyone face a similar issue? The Solr date field type does not support a variety of formats. Is there any workaround to solve these kinds of issues? Many Thanks, -- Kiran Chitturi
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
: I am using Solr with DIH and started getting errors when the database : time/date fields are getting imported into Solr. I have used the date as

what actual error are you getting? If you are pulling dates from a SQL Date field, that the jdbc driver returns as java.util.Date objects, then you shouldn't need to do anything special; they should import just fine with solr's TrieDateField. if you are importing from something where you really need to convert yourself (i.e. XML files, or string columns in a DB), there is the DIH DateFormatTransformer... https://wiki.apache.org/solr/DataImportHandler#DateFormatTransformer

: Also, When i used field_type as 'text_ar' and indexed a line with arabic : text, Solr is displaying some non-ISO characters. It looks like the text is : not being unicoded.

unicode is not a verb, so i'm not sure what you mean by displaying some non-ISO characters and text is not being unicoded ... please provide a specific example of the problem you are facing, including details on what the source data is. -Hoss
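For the two formats in the question, the conversion can be sanity-checked in plain Java; the "EEE, dd MMM yyyy HH:mm:ss Z" pattern is the same kind of pattern the DateFormatTransformer's dateTimeFormat attribute takes (a sketch, using the sample values from this thread):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class TweetDateCheck {
    public static void main(String[] args) throws Exception {
        // the RFC-2822-style string from the question
        SimpleDateFormat rfc = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z", Locale.ENGLISH);
        Date d1 = rfc.parse("Thu, 06 Sep 2012 22:32:33 +0000");

        // a unix timestamp is seconds since the epoch; java.util.Date wants millis
        Date d2 = new Date(1346976590L * 1000L);

        // Solr's canonical form for date fields
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(iso.format(d1)); // 2012-09-06T22:32:33Z
        System.out.println(iso.format(d2)); // 2012-09-07T00:09:50Z
    }
}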
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
http://www.electrictoolbox.com/article/mysql/format-date-time-mysql/ hth -- H On 6 Sep 2012 17:23, kiran chitturi chitturikira...@gmail.com wrote: Hi, I am using Solr with DIH and started getting errors when the database time/date fields are getting imported into Solr. I have used date as the field type, but when I looked at the docs it looks like the date field does not accept the (Thu, 06 Sep 2012 22:32:33 +0000) or (1346976590) formats. Also, when I used field_type 'text_ar' and indexed a line with Arabic text, Solr is displaying some non-ISO characters. It looks like the text is not being unicoded. Did anyone face a similar issue? The Solr date field type does not support a variety of formats. Is there any workaround to solve these kinds of issues? Many Thanks, -- Kiran Chitturi
Re: solr issue with seaching words
: I am facing a strange problem. I am searching for the word jacke but solr also : returns results where my description contains 'RCA-Jack/'. If i search : jacka or jackc or jackd, it works fine and does not return any : result, which is what i am expecting in this case.

you need to tell us what the analyzers in your fieldType are - if i had to guess, i would suspect that you are using a rules-based stemmer that converts jacke to jack, in combination with something that splits on - ... which could be WordDelimiterFilter, or it could be something else. the devil is in the details -Hoss
Re: EdgeNgramTokenFilter and positions
I don't know for sure, but I remember something around this being a problem, yes ... maybe https://issues.apache.org/jira/browse/LUCENE-3907 ? Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm - Original Message - From: Walter Underwood wun...@chegg.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, September 5, 2012 1:51 PM Subject: EdgeNgramTokenFilter and positions In the analysis page, the n-grams produced by EdgeNgramTokenFilter are at sequential positions. This seems wrong, because an n-gram is associated with a source token at a specific position. It also really messes up phrase matches. With the source text fleen, these positions and tokens are generated:
1,fl
2,fle
3,flee
4,fleen
Is this a known bug? Fixed? I'm running 3.3. wunder -- Walter Underwood Search Guy wun...@chegg.com
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
Hi, Thank you for your response. The error I am getting is 'org.apache.solr.common.SolrException: Invalid Date String: '1345743552'. I think it was being saved as a string in the DB, so I will use the DateFormatTransformer. When I index a text field which has Arabic and English like this tweet “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار الكرافته ؟؟” #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا with field_type as 'text_ar' and then try to see the same field again in solr, it is shown as below. RT @AhmedWagih: لو معملناش Øاجة Ù�ÙŠ الزيادة السكانية Ù�ÙŠ مصر، هنتØول لدولة Ù�قيرة كثيÙ�Ø© السكان زي بنجلادش #Egypt #EgyEconomy both of the lines do not mean the same, but I have just placed them here as an example. This is the problem I am facing. Many Thanks, Kiran On Thu, Sep 6, 2012 at 8:29 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I am using Solr with DIH and started getting errors when the database : time/date fields are getting imported into Solr. I have used the date as what actual error are you getting? If you are pulling dates from a SQL Date field, that the jdbc driver returns as java.util.Date objects, then you shouldn't need to do anything special; they should import just fine with solr's TrieDateField. if you are importing from something where you really need to convert yourself (i.e. XML files, or string columns in a DB), there is the DIH DateFormatTransformer... https://wiki.apache.org/solr/DataImportHandler#DateFormatTransformer : Also, When i used field_type as 'text_ar' and indexed a line with arabic : text, Solr is displaying some non-ISO characters. It looks like the text is : not being unicoded. unicode is not a verb, so i'm not sure what you mean by displaying some non-ISO characters and text is not being unicoded ... please provide a specific example of the problem you are facing, including details on what the source data is. -Hoss -- Kiran Chitturi
solrcloud setup using tomcat, single machine
Hey guys! I've been attempting to get solrcloud set up on a ubuntu vm, but I believe I'm stuck. I've got tomcat set up, the solr war file in place, and when I browse to localhost:port/solr, I can see solr. CHECK. I've set the zoo.cfg to use port 5200. I can start it up and see it's running (ls / shows me [zookeeper]). CHECK.

*Issues I'm running into*
1. I'm trying to get it so that the example in solr (example/solr/collection1/conf) will load up, however it doesn't look like it's working (from posts online, it looks like I should see a *Cloud* tab under localhost:port/solr), but it's not appearing.
2. Sometimes it looks like things are still trying to run on port 2181 (the default zookeeper port).
3. Some commands I run look like they're trying to use jetty still, even though I think I have tomcat set up correctly.

I must admit that my background is in C#, so calling java jars passing -D everywhere is a bit new to me. What I'd like to do is start up a solr node using zookeeper through tomcat, but it seems like most guides use jetty and I'm having issues trying to convert to tomcat. I don't know what you might need to know to help me out, so I'm going to give you as much info on my setup as I can. For reference, the folder structure I've adopted (feel free to make recommendations) is as follows:

/usr/solr
/usr/solr/data/conf # conf files
/usr/solr/solr4.0.0-BETA # extraction from the tar.gz
/usr/tomcat
/usr/tomcat/tomcat7.0.30 # where tomcat lives
/usr/tomcat/tomcat7.0.30/data/solr.war # war file from the extracted tar.gz
/usr/tomcat/tomcat7.0.30/conf/Catalina/localhost/solr.xml # contains the following

<Context docBase="/usr/tomcat/tomcat7.0.30/data/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/usr/solr/data/conf" override="true"/>
</Context>

/usr/zookeeper
/usr/zookeeper/zookeeper3.3.6 # zookeeper extraction
/usr/zookeeper/zookeeper3.3.6/data # where the data will be stored
/usr/zookeeper/zookeeper3.3.6/conf/zoo.cfg # contains the following

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/usr/zookeeper/data
# the port at which the clients will connect
clientPort=5200

I've created the file /etc/init.d/tomcat (it contains the following):

# Tomcat auto-start
# description: Auto-starts tomcat
# processname: tomcat
# pidfile: /var/run/tomcat.pid
export JAVA_HOME=/opt/java/64/jre1.7.0_07
case $1 in
start)
  export JAVA_OPTS="$JAVA_OPTS -DnumShards=1 -Dbootstrap_confdir=/usr/solr/example/solr/collection1/conf -DzkHost=localhost:5200 -DhostPort=8080" # might not be useful?
  sh /usr/tomcat/tomcat7.0.30/bin/startup.sh
  ;;
stop)
  sh /usr/tomcat/tomcat7.0.30/bin/shutdown.sh
  ;;
restart)
  sh /usr/tomcat/tomcat7.0.30/bin/shutdown.sh
  sh /usr/tomcat/tomcat7.0.30/bin/startup.sh
  ;;
esac
exit 0

I've been using some of these posts as references throughout the day (I've been at this for several hours):
http://outerthought.org/blog/491-ot.html
http://blog.jesjobom.com/2012/08/configurando-solr-cloud-beta-tomcat-zookeeper-externo/
http://www.slideshare.net/lucenerevolution/how-solrcloud-changes-the-user-experience-in-a-sharded-environment
http://techspry.com/how-to/how-to-install-tomcat-7-and-solr-on-centos-5-5/
http://stackoverflow.com/questions/10026014/apache-solr-configuration-with-tomcat-6-0
...
more, but I don't wanna make this any longer than it needs to be.

*End goal for testing* On a single box (for testing), get this to happen:
1. a single zookeeper instance running on port 5200
2. a single tomcat instance running on port 8080
3. a single solr node running, using configs stored in zookeeper

*Eventual production goal*
1. a 3-piece zookeeper ensemble, running on ports 5200, 5201, 5202
2. one of the following:
a. 4 solr nodes, running replicated (to allow 1 failure)
b. 4 solr nodes, running replicated (to allow up to 2 failures)
*. both choices should allow for querying across 2-3 nodes for higher volume, with potentially several shards per node in case data grows too big for a single box (entire index doesn't fit on 1 node)

I know this is a lot to digest in a single post, but I'm trying to post what I've done, what issues I've run into, and where I'm going with this so that you have enough information to base suggestions/answers on. Thanks! - Jesse -- View this message in context: http://lucene.472066.n3.nabble.com/solrcloud-setup-using-tomcat-single-machine-tp4006041.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: EdgeNgramTokenFilter and positions
Yes, that is exactly the bug. EdgeNgram should work like the synonym filter. wunder On Sep 6, 2012, at 5:51 PM, Otis Gospodnetic wrote: I don't know for sure, but I remember something around this being a problem, yes ... maybe https://issues.apache.org/jira/browse/LUCENE-3907 ? Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm - Original Message - From: Walter Underwood wun...@chegg.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, September 5, 2012 1:51 PM Subject: EdgeNgramTokenFilter and positions In the analysis page, the n-grams produced by EdgeNgramTokenFilter are at sequential positions. This seems wrong, because an n-gram is associated with a source token at a specific position. It also really messes up phrase matches. With the source text fleen, these positions and tokens are generated:
1,fl
2,fle
3,flee
4,fleen
Is this a known bug? Fixed? I'm running 3.3. wunder -- Walter Underwood Search Guy wun...@chegg.com -- Walter Underwood wun...@wunderwood.org
Solr request/response lifecycle and logging full response time
Greetings, I'm looking to add some additional logging to a solr 3.6.0 setup to allow us to determine the actual time spent by Solr responding to a request. We have a custom QueryComponent that sometimes returns 1+ MB of data, and while QTime is always on the order of ~100ms, the response time at the client can be longer than a second (as measured with JMeter running on the same server using localhost). The end goal is to be able to: 1) determine if this large variance in response time is due to Solr, and if so where (to help determine if/how it can be optimized) 2) determine if the large variance is due to how jetty handles connections, buffering, etc... (and if so, if/how we can optimize there) ...or some combination of the two. As it stands now, the second or so between when the actual query finishes (as indicated by QTime), when solr gathers all the data to be returned (as requested by fl), and when the client actually receives the data (even when the client is on localhost) is completely opaque. My main question: - Is there any documentation (a diagram / flowchart would be oh so wonderful) on the lifecycle of a Solr request? So far I've attempted to modify and rebuild solr, adding logging to SolrCore's execute() method (this pretty much mirrors QTime), as well as add timing calculations and logging to various overridden methods in the custom QueryComponent extension, all to no avail so far. What I'm getting at is how to: - start a stopwatch when solr receives the request from the client - stop the stopwatch and log the elapsed time right before solr hands the response body off to Jetty to be delivered back to the client. Thanks, as always! Aaron
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
On 7 September 2012 06:24, kiran chitturi chitturikira...@gmail.com wrote: [...] When i index a text field which has arabic and English like this tweet “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار الكرافته ؟؟” #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا with field_type as 'text_ar' and when i try to see the same field again in solr, it is shown as below. RT @AhmedWagih: لو معملناش Øاجة Ù�ÙŠ الزيادة السكانية Ù�ÙŠ مصر، هنتØول لدولة Ù�قيرة كثيÙ�Ø© السكان زي بنجلادش #Egypt #EgyEconomy both of the lines do not mean the same, but i have just placed them here as an example. This was the problem i am facing. [...] The encoding of your input text is being mangled at some point. Presuming that your original encoding is UTF-8, I would look at how you are indexing into Solr, and the encoding settings on the Java container. Solr itself handles UTF-8 perfectly fine, as do most Java containers if configured properly, so my first suspicion would be the indexing code. As it looks like you are pulling from mysql using DIH, check that the database character set is UTF-8, and that the connection uses UTF-8. Regards, Gora
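A quick way to check the JDBC half of that advice from plain Java (host, database, and credentials are placeholders; the same useUnicode/characterEncoding parameters can be appended to the url of a DIH dataSource):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class MySqlCharsetCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // older Connector/J versions need explicit loading
        // ask Connector/J to talk UTF-8 to the server
        String url = "jdbc:mysql://localhost:3306/tweets"
                   + "?useUnicode=true&characterEncoding=UTF-8";
        Connection conn = DriverManager.getConnection(url, "user", "password");
        // the character_set_* variables should report utf8 across the board
        ResultSet rs = conn.createStatement()
                           .executeQuery("SHOW VARIABLES LIKE 'character_set%'");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " = " + rs.getString(2));
        }
        conn.close();
    }
}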
Re: Solr request/response lifecycle and logging full response time
I'd still love to see a query lifecycle flowchart, but, in case it helps any future users or in case this is still incorrect, here's how I'm tackling this:

1) Override the default json responseWriter with my own in solrconfig.xml:

<queryResponseWriter name="json" class="com.mydomain.solr.component.JSONResponseWriterWithTiming"/>

2) Define JSONResponseWriterWithTiming as just extending JSONResponseWriter and adding in a log statement:

public class JSONResponseWriterWithTiming extends JSONResponseWriter {
    private static final Logger logger = LoggerFactory.getLogger(JSONResponseWriterWithTiming.class);

    @Override
    public void write(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp) throws IOException {
        super.write(writer, req, rsp);
        if (logger.isInfoEnabled()) {
            final long st = req.getStartTime();
            logger.info(String.format("Total solr time for query with QTime: %d is: %d",
                    (int) (rsp.getEndTime() - st), (int) (System.currentTimeMillis() - st)));
        }
    }
}

Please advise if: - Flowcharts for any solr/lucene-related lifecycles exist - There is a better way of doing this Thanks, Aaron On Thu, Sep 6, 2012 at 9:16 PM, Aaron Daubman daub...@gmail.com wrote: Greetings, I'm looking to add some additional logging to a solr 3.6.0 setup to allow us to determine the actual time spent by Solr responding to a request. We have a custom QueryComponent that sometimes returns 1+ MB of data, and while QTime is always on the order of ~100ms, the response time at the client can be longer than a second (as measured with JMeter running on the same server using localhost). The end goal is to be able to: 1) determine if this large variance in response time is due to Solr, and if so where (to help determine if/how it can be optimized) 2) determine if the large variance is due to how jetty handles connections, buffering, etc... (and if so, if/how we can optimize there) ...or some combination of the two. As it stands now, the second or so between when the actual query finishes (as indicated by QTime), when solr gathers all the data to be returned (as requested by fl), and when the client actually receives the data (even when the client is on localhost) is completely opaque. My main question: - Is there any documentation (a diagram / flowchart would be oh so wonderful) on the lifecycle of a Solr request? So far I've attempted to modify and rebuild solr, adding logging to SolrCore's execute() method (this pretty much mirrors QTime), as well as add timing calculations and logging to various overridden methods in the custom QueryComponent extension, all to no avail so far. What I'm getting at is how to: - start a stopwatch when solr receives the request from the client - stop the stopwatch and log the elapsed time right before solr hands the response body off to Jetty to be delivered back to the client. Thanks, as always! Aaron
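One alternative worth sketching, with hypothetical class and message names: a plain servlet filter registered in Solr's web.xml times the complete request, including the time spent writing the response body, without rebuilding Solr or touching each response writer.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class RequestTimingFilter implements Filter {
    public void init(FilterConfig cfg) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse rsp, FilterChain chain)
            throws IOException, ServletException {
        long start = System.currentTimeMillis();
        try {
            chain.doFilter(req, rsp); // Solr handles the request and writes the body
        } finally {
            // elapsed time covers query + response writing, unlike QTime
            // (replace System.out with your logger of choice)
            System.out.println("request took " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}

Note this still only measures up to the point the container accepts the written bytes; socket-level buffering can hide some of the client-perceived latency.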
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
Also, your browser may use a platform default for the encoding instead of UTF-8. Some MacOS and Windows browsers have this problem. Tomcat sometimes needs adjustment to use UTF-8. If you are on tomcat, check this: http://find.searchhub.org/link?url=http://wiki.apache.org/solr/SolrTomcat http://find.searchhub.org/?q=utf-8#%2Fp%3Asolr%2Fs%3Alucid%2Cwiki - Original Message -
| From: Gora Mohanty g...@mimirtech.com
| To: solr-user@lucene.apache.org
| Sent: Thursday, September 6, 2012 7:13:40 PM
| Subject: Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000'
| in Solr 4.0
|
| On 7 September 2012 06:24, kiran chitturi chitturikira...@gmail.com
| wrote:
| [...]
|
| When i index a text field which has arabic and English like this
| tweet
| “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار
| الكرافته ؟؟”
| #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
| with field_type as 'text_ar' and when i try to see the same field
| again in
| solr, it is shown as below.
| RT @AhmedWagih: لو معملناش Øاجة Ù�ÙŠ الزيادة
| السكانية Ù�ÙŠ مصر، هنتØول لدولة Ù�قيرة
| كثيÙ�Ø© السكان زي بنجلادش #Egypt #EgyEconomy
|
| both of the lines do not mean the same, but i have just placed them
| here as
| an example. This was the problem i am facing.
|
| [...]
|
| The encoding of your input text is being mangled at some point.
| Presuming that your original encoding is UTF-8, I would look at
| how you are indexing into Solr, and the encoding settings on the
| Java container. Solr itself handles UTF-8 perfectly fine, as do
| most Java containers if configured properly, so my first suspicion
| would be the indexing code.
|
| As it looks like you are pulling from mysql using DIH, check that
| the database character set is UTF-8, and that the connection uses
| UTF-8.
|
| Regards,
| Gora
|
Re: Doubts in Result Grouping in solr 3.6.1
Grouping isn't defined for tokenized fields, I don't think. See: http://wiki.apache.org/solr/FieldCollapsing where it says for group.field: ..The field must currently be single-valued... Are you sure you don't want faceting? Best Erick On Tue, Sep 4, 2012 at 5:27 AM, mechravi25 mechrav...@yahoo.co.in wrote: Hi, I am currently using solr version 3.6.1 and, for indexing data, I am using the data import handler for 3.5 because of the reason posted in the following forum link: http://lucene.472066.n3.nabble.com/Dataimport-Handler-in-solr-3-6-1-td4001149.html I am trying to achieve result grouping based on a field grpValue which has a value like Name XYZ|Company. There are 359 docs in total that were indexed, and the field grpValue in all 359 docs contains the word Company in its value. I gave the following in my schema.xml for splitting the words while indexing and querying:

<fieldType name="groupField" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+|\|"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_new.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+|\|"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_new.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

I am trying to split the words if I have a single space or a "|" symbol in my data when I use pattern="\s+|\|" in PatternTokenizerFactory. When I used the analyze option in solr, the sample value was split into 3 words, Name, XYZ, Company, in both my index and query analyzers. When I gave the following url:

http://localhost:8080/solr/core1/select/?q=*%3A*&version=2.2&start=0&rows=359&indent=on&group=true&group.field=grpValue&group.limit=0

I noticed that I have a group named Company which has numFound as 73, but the field grpValue has the word Company in its value in all 359 docs. Ideally, I should have got 359 docs as numFound under my group:

<lst name="grouped">
  <lst name="grpValue">
    <int name="matches">359</int>
    <arr name="groups">
      <lst>
        <str name="groupValue">Company</str>
        <result name="doclist" numFound="73" start="0"/>
      </lst>

Please can someone guide me as to why only 73 docs are present in that group instead of 359. I also noticed that when I counted the numFound in all the groups, it totalled up to 359. Please guide me on this; I am not sure what I am missing. Please let me know in case more details are needed. Thanks in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Doubts-in-Result-Grouping-in-solr-3-6-1-tp4005239.html Sent from the Solr - User mailing list archive at Nabble.com.
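If per-value counts over all 359 documents are really what's wanted, Erick's faceting suggestion gives exactly that; a sketch in SolrJ, using the core URL and field name from the question (whether faceting actually fits the requirement is for the asker to decide):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GrpValueFacetCounts {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/core1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);             // only the counts are of interest
        q.setFacet(true);
        q.addFacetField("grpValue");
        QueryResponse rsp = server.query(q);
        for (FacetField.Count c : rsp.getFacetField("grpValue").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }
}

Unlike grouping, which (as the 73 + ... = 359 tally above shows) assigns each document to exactly one bucket, facet counts overlap: every document whose field analyzes to contain Company is counted under Company, which would give the expected 359.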
Re: How to preserve source column names in multivalue catch all field
Try using edismax to distribute the search across the fields rather than using the catch-all field. There's no way that I know of to reconstruct what the source field was. But storing the source fields without indexing them is OK too; it won't affect searching speed noticeably... Best Erick On Tue, Sep 4, 2012 at 11:52 AM, Kiran Jayakumar kiranjuni...@gmail.com wrote: Hi everyone, I have got a multivalued catch-all field which captures all the text fields. What's the best way to preserve the column information also? In the UI, I need to show "field: value"-type output. Right now, I am storing the source fields without indexing. Is there a better way to do it? Thanks
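A minimal SolrJ sketch of the edismax approach Erick suggests; the field names title and body, the boosts, and the URL are illustrative only:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class EdismaxAcrossFields {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("user query here");
        q.set("defType", "edismax");  // extended dismax query parser
        q.set("qf", "title^2 body"); // search these fields instead of a catch-all
        System.out.println(server.query(q).getResults().getNumFound());
    }
}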
Re: Best practices on managing facets with Code and Name
I don't know of any better way to do this. Conflating the fields is not _that_ error prone, although it is annoying, I agree. I think that idea is better than storing them separately. Best Erick On Tue, Sep 4, 2012 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, I have some fields that have codes during search and internal management but also have user-presentable names. Those fields are used for facets, and I have a problem figuring out the best way to index, store, and present them. The best example would be a country name. I store the selected countries in the URL, so I want to say ...countries=ag|bo|cd, but want those names to show up to the user, and be searchable for by SOLR, as Antigua and Barbuda, Bolivia (Plurinational State of), and Democratic Republic of the Congo. The additional challenge is that ideally I want those full names in several languages (localized). Currently I am storing country codes in a facetable field and storing country names in a catch-all names multiValued field. I get the codes as facets with counts and then do lookups to match them to the original names. But that does mean I have to look it up during indexing and then, a second time, during display. It is ok, but I feel there must be a better way. I also tried conflating the fields into one, e.g. ag Antigua and Barbuda, which would give both names when I retrieve facets, but having to remember that the field is a combined one is annoying and error prone. Similarly, I thought of requesting both codes and names as facets, but that probably has the performance impact of double counting. Any ideas or known best practices? This feels like a fairly common scenario. Thank you, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
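For what it's worth, the client-side half of the conflated-value idea is a one-line split; the "code, then a space, then the label" layout is an assumption about how the field would be built, and it only works if the conflated field is indexed as a single untokenized value (e.g. a string field) so the facet value comes back whole:

public class FacetValueSplit {
    public static void main(String[] args) {
        String facetValue = "ag Antigua and Barbuda"; // as returned by faceting
        int sp = facetValue.indexOf(' ');
        String code = facetValue.substring(0, sp);    // "ag" -- goes into the URL
        String label = facetValue.substring(sp + 1);  // shown to the user
        System.out.println(code + " -> " + label);
    }
}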
Re: Sorting on mutivalued fields still impossible?
And you've illustrated my viewpoint, I think, by saying two obvious choices. I may prefer the first, and you may prefer the second. Neither is necessarily more correct IMO; it depends on the problem space. Choosing either one will be unpopular with anyone who likes the other. And I suspect that 99 times out of 100, someone wanting to sort on fields with multiple tokens hasn't thought the problem through carefully. So I favor forcing the person with the use-case where this is actually _desired_ behavior to work to implement it, rather than having to deal with surprising orderings. And duplicate entries in the result set get ugly. Say a user sorts on a field containing 10,000 tokens. Now one doc is repeated 10,000 times in the result set. How many docs are set for numFound? Faceting? Grouping? I think your first option is at least easy to explain, but I don't see it as compelling enough to put the work into it, although I confess I don't know the guts of how much work it would take to find the first (and last, don't forget specifying desc) token for each doc. Anyway, that's my story and I'm sticking to it <G>... Best Erick On Wed, Sep 5, 2012 at 12:54 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Fri, 2012-08-31 at 13:35 +0200, Erick Erickson wrote: Imagine you have two entries, aardvark and emu in your multiValued field. How should that document sort relative to another doc with camel and zebra? Any heuristic you apply will be wrong for someone else.

I see two obvious choices here:

1) Sort by the value that is ordered first by the comparator function.
Doc1: aardvark, (emu)
Doc2: camel, (zebra)
This is what Uwe wants to do, and it is normally done by preprocessing and collapsing to a single value. It could be implemented with an ordered multi-valued field cache by comparing on the first (or last, in the case of reverse sort) entry for each matching document.

2) Make duplicate entries in the result set, one for each value.
Doc1: aardvark, (emu)
Doc2: camel, (zebra)
Doc1: (aardvark), emu
Doc2: (camel), zebra
I have a hard time coming up with a real-world use case for this. It could be implemented by using a multi-valued field cache as above and putting the same document ID into the sliding window sorter once for each field value.

Collapsing this into a single algorithm: step through all IDs; for each ID, give access to the list of field values and provide a callback for adding one or more (value, ID) pairs to the sliding window sorter. Are there some other realistic heuristics that I have missed?
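Toke's choice 1 in miniature, with the thread's own example data hard-coded (plain Java, nothing Lucene-specific): each document's sort key is the value its comparator orders first, the minimum for ascending and the maximum for descending.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstValueSort {
    public static void main(String[] args) {
        final Map<String, List<String>> docs = new HashMap<String, List<String>>();
        docs.put("doc1", Arrays.asList("aardvark", "emu"));
        docs.put("doc2", Arrays.asList("camel", "zebra"));

        List<String> ids = new ArrayList<String>(docs.keySet());
        Collections.sort(ids, new Comparator<String>() {
            public int compare(String a, String b) {
                // ascending: compare on each document's smallest value
                return Collections.min(docs.get(a)).compareTo(Collections.min(docs.get(b)));
            }
        });
        System.out.println(ids); // [doc1, doc2] -- aardvark sorts before camel
    }
}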
Re: SOLR 4.0 / Jetty Security Set Up
Securing Solr pretty much universally requires that you only allow trusted clients to access the machines directly, usually secured with a firewall and allowed IP addresses; the admin handler is the least of your worries. Consider: if you let me ping solr directly, I can do something really annoying like:

http://localhost:8983/solr/update?stream.body=<delete><query>office:Bridgewater</query></delete>

Best Erick On Wed, Sep 5, 2012 at 2:51 AM, Paul Codman snoozes...@gmail.com wrote: First time Solr user and I am loving it! I have a standard Solr 4 setup running under Jetty. The instructions in the Wiki do not seem to apply to Solr 4 (e.g. mortbay references / section to uncomment not present in the xml file / etc) - could someone please advise on the steps required to secure Solr 4, and can someone confirm that security operates in relation to the new Admin interface? Thanks in advance.
Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0
Guenter: Are you using SolrCloud or straight Solr? And were you updating in batches (i.e. updating multiple docs at once from SolrJ by using the server.add(doclist) form)? There was a bug in this process that caused various docs to show up in various shards differently. This has been fixed in 4x, any nightly build should have the fix. I'm absolutely grasping at straws here, but this was a weird case that I happen to know about... Hossman: of course this all goes up in smoke if you can reproduce this with any recent compilation of the code. FWIW Erick On Wed, Sep 5, 2012 at 11:29 PM, guenter.hip...@unibas.ch guenter.hip...@unibas.ch wrote: Hoss, I'm so happy you realized the problem because I was quite worried about it!! Let me know if I can provide support with testing it. The last two days I was busy with migrating a bunch of hosts which should -hopefully- be finished today. Then I have again the infrastructure for running tests Günter On 09/05/2012 11:19 PM, Chris Hostetter wrote: : Subject: Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0 Günter, This is definitely strange The good news is, i can reproduce your problem. The bad news is, i can reproduce your problem - and i have no idea what's causing it. I've opened SOLR-3793 to try to get to the bottom of this, and included some basic steps to demonstrate the bug using the Solr 4.0-BETA example data, but i'm really not sure what the problem might be... https://issues.apache.org/jira/browse/SOLR-3793 -Hoss -- Universität Basel Universitätsbibliothek Günter Hipler Projekt SwissBib Schoenbeinstrasse 18-20 4056 Basel, Schweiz Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103 e-mailguenter.hip...@unibas.ch URL:www.swissbib.org /http://www.ub.unibas.ch/