Re: TIKA OCR not working
Hi everyone,

Does anyone have the answer for this problem :)? I saw the documentation of Tika. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7, but it looks like it does not work. Does anyone know whether Tika OCR works automatically with Solr, or do I have to change some settings?

Trung.

> It's not clear if OCR would happen automatically in Solr Cell, or if
> changes to Solr would be needed.
>
> For Tika OCR info, see:
>
> https://issues.apache.org/jira/browse/TIKA-93
> https://wiki.apache.org/tika/TikaOCR
>
> -- Jack Krupansky
>
> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
> > in use yet.
> >
> > Regards,
> > Alex
> >
> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" wrote:
> >
> > > Hi Trung,
> > >
> > > I didn't know about the OCR capabilities of Tika.
> > > Someone who is familiar with solr-cell can inform us whether this
> > > functionality is added to Solr or not.
> > >
> > > Ahmet
> > >
> > > On Thursday, April 23, 2015 2:06 PM, trung.ht wrote:
> > >
> > > Hi Ahmet,
> > >
> > > I used a png file, not a pdf file. From the documentation, I understand
> > > that Solr will post the file to Tika, and since Tika 1.7, OCR is included.
> > > Is there something I misunderstood?
> > >
> > > Trung.
> > >
> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan wrote:
> > >
> > > > Hi Trung,
> > > >
> > > > solr-cell (Tika) does not do OCR. It cannot extract text from
> > > > image-based pdfs.
> > > >
> > > > Ahmet
> > > >
> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht wrote:
> > > >
> > > > Hi,
> > > >
> > > > I want to use solr to index some scanned documents. After setting up a
> > > > solr document with two fields, "content" and "filename", I tried to
> > > > upload the attached file, but it seems that the content of the file is
> > > > only "\n \n \n". But if I use tesseract from the command line I get
> > > > the result correctly.
> > > >
> > > > The log when solr receives my request:
> > > > ---
> > > > INFO - 2015-04-23 03:49:25.941;
> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> > > > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
> > > > ---
> > > >
> > > > The document when I check on the solr admin page:
> > > > ---
> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
> > > > "createddate": "2015-04-22T15:00:00Z",
> > > > "filename": "trunght\\test\\tesseract_3.png",
> > > > "autocomplete_text": [ "trunght\\test\\tesseract_3.png" ],
> > > > "content": " \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
> > > > \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ",
> > > > "_version_": 1499213034586898400 }
> > > > ---
> > > >
> > > > Since I am a solr newbie I do not know where to look. Can anyone give
> > > > me advice on where to look for errors or settings to make it work?
> > > > Thanks in advance.
> > > >
> > > > Trung.
Using SolrJ to access schema.xml
Hi Everyone,

Per this link https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-ListFieldTypes Solr supports a REST Schema API to modify the schema. I looked at http://lucene.apache.org/solr/4_2_1/solr-solrj/index.html?overview-summary.html in the hope that SolrJ has a Java API to allow schema modification, but I couldn't find any. Is this the case, or did I not look hard enough?

My need is to manage Solr's schema.xml file using a remote API. The REST Schema API gets me there, but I have to write code to work with the request/response XML, which I would much rather avoid if it is already out there.

Thanks

Steve
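In the meantime the REST endpoints can be driven directly from the command line. A minimal sketch, assuming a core named `collection1` on the default port, and a made-up field name for illustration; note the add-field command needs a recent Solr with a managed schema:

```shell
# List the field types currently defined in the schema, as JSON
curl "http://localhost:8983/solr/collection1/schema/fieldtypes?wt=json"

# Add a field via the Schema API (field name/type here are illustrative)
curl -X POST -H 'Content-type:application/json' \
  "http://localhost:8983/solr/collection1/schema" \
  --data-binary '{
    "add-field": {
      "name": "sell_by",
      "type": "tdate",
      "stored": true
    }
  }'
```

Requesting `wt=json` sidesteps the XML handling entirely; any JSON library can then parse the response.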
Re: payload similarity
I put up a complete example not too long ago that may help, see:
http://lucidworks.com/blog/end-to-end-payload-example-in-solr/

Best,
Erick

On Fri, Apr 24, 2015 at 6:33 AM, Dmitry Kan wrote:
> Ahmet, exactly. As I have just illustrated with code, simultaneously with
> your reply. Thanks!
>
> On Fri, Apr 24, 2015 at 4:30 PM, Ahmet Arslan wrote:
>
>> Hi Dmitry,
>>
>> I think, it is activated by PayloadTermQuery.
>>
>> Ahmet
>>
>> On Friday, April 24, 2015 2:51 PM, Dmitry Kan wrote:
>>
>> Hi,
>>
>> Using the approach here
>> http://lucidworks.com/blog/getting-started-with-payloads/ I have
>> implemented my own PayloadSimilarity class. When debugging the code I
>> have noticed that the scorePayload method is never called. What could
>> be wrong?
>>
>> [code]
>>
>> class PayloadSimilarity extends DefaultSimilarity {
>>     @Override
>>     public float scorePayload(int doc, int start, int end, BytesRef payload) {
>>         float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
>>         System.out.println("payloadValue = " + payloadValue);
>>         return payloadValue;
>>     }
>> }
>>
>> [/code]
>>
>> Here is how the similarity is injected during indexing:
>>
>> [code]
>>
>> PayloadEncoder encoder = new FloatEncoder();
>> IndexWriterConfig indexWriterConfig = new
>>     IndexWriterConfig(Version.LUCENE_4_10_4, new PayloadAnalyzer(encoder));
>> payloadSimilarity = new PayloadSimilarity();
>> indexWriterConfig.setSimilarity(payloadSimilarity);
>> IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
>>
>> [/code]
>>
>> and during searching:
>>
>> [code]
>>
>> IndexReader indexReader = DirectoryReader.open(dir);
>> IndexSearcher searcher = new IndexSearcher(indexReader);
>> searcher.setSimilarity(payloadSimilarity);
>>
>> TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
>> termQuery.setBoost(1.1f);
>> TopDocs topDocs = searcher.search(termQuery, 10);
>> printResults(searcher, termQuery, topDocs);
>>
>> [/code]
>>
>> --
>> Dmitry Kan
>> Luke Toolbox: http://github.com/DmitryKey/luke
>> Blog: http://dmitrykan.blogspot.com
>> Twitter: http://twitter.com/dmitrykan
>> SemanticAnalyzer: www.semanticanalyzer.info
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info
Re: require diversity in results?
Often, for small numbers of distinct types, people use grouping and have the app layer mingle them or whatever is pleasing. I think this is different than the post-processing you mention. Grouping (aka "field collapsing") can be expensive if there are a large number of groups, but for small numbers it's not bad.

The general problem is rather "interesting" in that just relaxing the score doesn't really handle massively different kinds of documents. Think of "songs", "albums", and "reviews in Rolling Stone": you'd probably see a bazillion song entries before the first review entry.

Best,
Erick

On Fri, Apr 24, 2015 at 2:36 AM, Paul Libbrecht wrote:
> Hello list,
>
> I'm wondering if there could be extra parameters or query operators
> where I could impose that sorting by relevance should be relaxed so that
> there's a minimum diversity in some fields in the first page of results.
>
> For example, I'd like the search results to contain at least three
> possible types of resources in the first page, fetching things from below
> if needed.
> I know that could be done as a search result post-processor, but I think
> that this is generally a bad idea for performance.
>
> Any other idea?
>
> thanks
>
> Paul
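The grouping approach Erick describes can be tried with the standard result-grouping parameters. A sketch, assuming a field named `type` holds the resource type and the core is `collection1` (both placeholders):

```shell
# Return the top 4 hits per resource type; the app layer then interleaves
# the groups however is pleasing
curl "http://localhost:8983/solr/collection1/select?q=mozart&group=true&group.field=type&group.limit=4&wt=json"
```

`group.limit` controls how many documents come back per group; the groups themselves are ordered by their top-scoring document, so each type is guaranteed to appear even when one type dominates the raw ranking.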
Re: Odp.: solr issue with pdf forms
Steve:

Right, it's not exactly obvious. Bring up the admin UI, something like http://localhost:8983/solr. From there you have to select a core in the 'core selector' drop-down on the left side. If you're using SolrCloud, this will have a rather strange name, but it should be easy to identify what collection it belongs to. At that point you'll see a bunch of new options, among them "schema browser". From there, select your field from the drop-down that will appear, then a button should pop up: "load term info".

NOTE: you can get the same information from the TermsComponent, see: https://cwiki.apache.org/confluence/display/solr/The+Terms+Component. This is a little more flexible because you can, among other things, specify the place to start. In your case you might specify terms.prefix=mein, which will show you the terms that are actually being _searched_ as opposed to being stored. The latter is what you see in the browser when you search for docs and is sometimes misleading, as you're (probably) seeing.

Best,
Erick

On Fri, Apr 24, 2015 at 1:58 AM, wrote:
> Hey Erick,
>
> thanks a lot for your answer. I went to the admin schema browser, but what
> should I see there? Sorry, I'm not familiar with the admin schema browser. :-(
>
> Best
> Steve
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, 23 April 2015 18:00
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> When you say "they're not indexed correctly", what's your evidence? You
> cannot rely on the display in the browser; that's the raw input just as it
> was sent to Solr, _not_ the actual tokens in the index. What do you see
> when you go to the admin schema browser page and load the actual tokens?
>
> Or use the TermsComponent
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> to see the actual terms in the index as opposed to the stored data you see
> in the browser when you look at search results.
>
> If the actual terms don't seem right _in the index_ we need to see your
> analysis chain, i.e. your fieldType definition.
>
> I'm 90% sure you're seeing the stored data and your terms are indexed just
> fine, but I've certainly been wrong before, more times than I want to
> remember.
>
> Best,
> Erick
>
> On Thu, Apr 23, 2015 at 1:18 AM, wrote:
>> Hey Erick,
>>
>> thanks for your answer. They are not indexed correctly. Also, through the
>> solr admin interface I see these typical question marks within a rhombus
>> where a blank space should be.
>> I now figured out the following (not sure if it is relevant at all):
>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>> indexed correctly, no issues
>> - PDF documents (with editable form fields) created with "Adobe
>> InDesign CS5 (7.0.1)" are indexed with the blank space issue
>>
>> Best
>> Steve
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Wednesday, 22 April 2015 17:11
>> To: solr-user@lucene.apache.org
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> Are they not _indexed_ correctly or not being displayed correctly?
>> Take a look at admin UI >> schema browser >> your field and press the
>> "load terms" button. That'll show you what is _in_ the index as opposed
>> to what the raw data looked like.
>>
>> When you return the field in a Solr search, you get a verbatim,
>> un-analyzed copy of your original input. My guess is that your browser
>> isn't using a compatible character encoding for display.
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 22, 2015 at 7:08 AM, wrote:
>>> Thanks for your answer. Maybe my English is not good enough; what are
>>> you trying to say? Sorry, I didn't get the point. :-(
>>>
>>> -----Original Message-----
>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>>> Sent: Wednesday, 22 April 2015 14:01
>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>>> Subject: Odp.: solr issue with pdf forms
>>>
>>> Off the top of my head, I'd follow how the writable PDFs are created
>>> and encoded.
>>>
>>> @LAFK_PL
>>>
>>> Original message
>>> From: steve.sch...@t-systems.com
>>> Sent: Wednesday, 22 April 2015 12:41
>>> To: solr-user@lucene.apache.org
>>> Reply-to: solr-user@lucene.apache.org
>>> Subject: solr issue with pdf forms
>>>
>>> Hi guys,
>>>
>>> hopefully you can help me with my issue. We are using a solr setup and
>>> have the following issue:
>>> - usual pdf files are indexed just fine
>>> - pdf files with writable form-fields look like this:
>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und
>>> v ollständig sind
>>>
>>> Somehow the blank space character is not indexed correctly.
>>>
>>> Is this a known issue? Does anybody have an idea?
>>>
>>> Thanks a lot
>>> Best
>>> Steve
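The TermsComponent request Erick suggests can be issued directly. A sketch, assuming the indexed field is called `content` and the core is `collection1` (both placeholders):

```shell
# Show indexed terms starting with "mein" -- these are the tokens actually
# searched, not the stored (raw) values shown in query results
curl "http://localhost:8983/solr/collection1/terms?terms.fl=content&terms.prefix=mein&wt=json"
```

If the whitespace damage is real, it will show up here in the terms themselves rather than only in the stored field rendered by the browser.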
Re: AW: o.a.s.c.SolrException: missing content stream
: Another question I have though (which fits the subject even better):
: In the log I see many
: org.apache.solr.common.SolrException: missing content stream ...
: What are possible reasons for this?

The possible and likely reasons are that you sent an "update" request w/o any ContentStream (ie: no data to update).

-Hoss
http://www.lucidworks.com/
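Hoss's point can be checked from the command line: the request below sends an actual content stream (the JSON body), while sending the same request with no body at all is the sort of thing that produces "missing content stream". Core name and document are placeholders:

```shell
# An update request WITH a content stream -- the JSON document in the body
curl "http://localhost:8983/solr/collection1/update?commit=true" \
  -H "Content-Type: application/json" \
  --data-binary '[{"id":"doc1"}]'
```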
Re: and stopword in user query is being change to q.op=AND
: I was under understanding that stopwords are filtered even before being
: parsed by search handler, i do have the filter in collection schema to
: filter stopwords and the analysis shows that this stopword is filtered

Generally speaking, your understanding of the order of operations for query parsing (regardless of the parser) and analysis (regardless of the fields/analyzers/filters/etc...) is backwards.

the query parser gets, as its input, the query string (as a *single* string) and the request params. it inspects/parses the string according to its rules & options & syntax, and based on what it finds in that string (and in other request params) it passes some/all of that string to the analyzer for one or more fields, and uses the results of those analyzers as the terms for building up a query structure.

ask yourself: if the raw user query input was first passed to an analyzer (for stop word filtering as you suggest) before being passed to the query parser -- how would solr know what analyzer to use? in many parsers (like lucene and edismax) the fields to use can be specified *inside* the query string itself.

likewise: how would you ensure that syntactically significant string sequences (like "(" and ":" and "AND" etc..) that an analyzer might normally strip out based on the tokenizer/tokenfilters would be preserved, so that the query parser could have them and use them to drive the resulting query structure?

-Hoss
http://www.lucidworks.com/
Re: and stopword in user query is being change to q.op=AND
On 4/24/2015 10:55 AM, Rajesh Hazari wrote:
> I was under understanding that stopwords are filtered even before
> being parsed by search handler, i do have the filter in collection
> schema to filter stopwords and the analysis shows that this stopword
> is filtered
>
> Analysis response : attached is the solr analysis json response.

There is a combination of things happening here. The "lowercaseOperators" parameter for edismax defaults to true ... which means that the "and" in your query is being interpreted as "AND" -- a boolean operator -- it's not making it to analysis.

Because the query now has boolean operators, you are running into this -- a bug that we have had in our tracker for a VERY long time:

https://issues.apache.org/jira/browse/SOLR-2649

Side note: I personally feel that lowercaseOperators should default to false, but I haven't made any effort to get it changed.

Thanks,
Shawn
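The quickest way to test Shawn's diagnosis is to turn the parameter off per-request. A sketch using the query from this thread; the collection name is a placeholder:

```shell
# With lowercaseOperators=false, a lowercase "and" is treated as a plain
# term (and can then be removed by the stopword filter) instead of as the
# boolean AND operator
curl "http://localhost:8983/solr/collection1/select?defType=edismax&q=derek+and+romace&qf=textSpell&lowercaseOperators=false&debugQuery=true&wt=json"
```

`debugQuery=true` includes the parsed query in the response, so the difference between the two interpretations of "and" is directly visible.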
Re: and stopword in user query is being change to q.op=AND
I was under the understanding that stopwords are filtered even before being parsed by the search handler. I do have the filter in the collection schema to filter stopwords, and the analysis shows that this stopword is filtered.

Analysis response : attached is the solr analysis json response.

[image: Inline image 1]

Schema definition :

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />

*shouldn't the final filter query terms be sent to search handler?*

*Thanks,*
*Rajesh**.*

On Thu, Apr 23, 2015 at 2:56 PM, Chris Hostetter wrote:

> : And stopword in user query is being changed to q.op=AND, i am going to
> : look more into this
>
> This is an explicitly documented feature of the edismax parser...
>
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
>
> * treats "and" and "or" as "AND" and "OR" in Lucene syntax mode.
>
> ...
>
> The lowercaseOperators Parameter
>
> A Boolean parameter indicating if lowercase "and" and "or" should be
> treated the same as operators "AND" and "OR".
>
> : i thought of sharing this in the solr community just in case someone
> : has come across this issue.
> : OR
> : I will also be validating my config and schema if i am doing something
> : wrong.
> :
> : solr : 4.9
> : query parser: edismax
> :
> : when i search for "*q=derek and romace*" the final parsed query is
> : *"(+(+DisjunctionMaxQuery((textSpell:derek))
> : +DisjunctionMaxQuery((textSpell:romance/no_coord" *
> : *"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]*
> :
> : when i search for "*q=derek romace*" the final parsed query is
> : *"parsedquery": "(+(DisjunctionMaxQuery((textSpell:derek))
> : DisjunctionMaxQuery((textSpell:romance/no_coord",*
> : *response": {*
> : *"numFound": 1405,*
> : *"start": 0,*
> : *"maxScore": 0.2780709,*
> : *"docs": [.*
> :
> : textSpell field definition :
> :
> : omitNorms="true" multiValued="true" />
> :
> : positionIncrementGap="100">
> :
> : words="stopwords.txt" />
> :
> : protected="protwords.txt"/>
> :
> : words="stopwords.txt" />
> : ignoreCase="true" expand="false" />
> :
> : protected="protwords.txt"/>
> :
> : Let me know if any of you guys need more info.
> :
> : *Thanks,*
> : *Rajesh**.*
>
> -Hoss
> http://www.lucidworks.com/

analysis.json
Description: application/json
Re: Checking of Solr Memory and Disk usage
Meaning this was working fine until Solr 5.0.0? I'm quite new to Solr and I only started to use it when Solr 5.0.0 was released.

Regards,
Edwin

On 24 April 2015 at 18:20, Tom Evans wrote:
> On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo wrote:
> > Hi,
> >
> > So has anyone knows what is the issue with the "Heap Memory Usage"
> > reading showing the value -1. Should I open an issue in Jira?
>
> I have solr 4.8.1 and solr 5.0.0 servers; on the solr 4.8.1 servers
> the core statistics have values for heap memory, on the solr 5.0.0
> ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK
> on both versions.
>
> I don't see this issue in the fixed bugs in 5.1.0, but I only looked
> at the headlines of the tickets.
>
> http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes
>
> Cheers
>
> Tom
RE: Remote connection to Solr
Shawn's explanation fits better with why Websphere and Jetty might behave differently. But something else that might be happening could be the DHCP negotiation causing the IP address to change from one network to another and back.

-----Original Message-----
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Friday, April 24, 2015 9:23 AM
To: solr-user@lucene.apache.org
Subject: Re: Remote connection to Solr

Hi Shawn,

The firewall was the first thing I looked into, and after fiddling with it I still see the issue. But if that were the issue, why doesn't WebSphere run into it when Jetty does?

However, your point about domain / non-domain and private / public networks may provide me with some new areas to look into.

Thanks

Steve

On Fri, Apr 24, 2015 at 10:11 AM, Shawn Heisey wrote:
> On 4/24/2015 8:03 AM, Steven White wrote:
> > This may be a Jetty question but let me start here first.
> >
> > I have Solr running on my laptop and from my desktop I have no issue
> > accessing it. However, if I take my laptop home and connect it to my
> > home network, the next day when I connect the laptop to my office
> > network, I no longer can access Solr from my desktop. A restart of
> > Solr will not do; the only fix is to restart my Windows 8.1 OS (that's
> > what's on my laptop).
> >
> > I have not been able to figure out why this is happening and I'm
> > suspecting it has to do something with Jetty, because I have Solr 3.6
> > running on my laptop in a WebSphere profile and it does not run into
> > this issue.
> >
> > Any ideas what could be causing this? Is this question for the Jetty
> > mailing list?
>
> I'm guessing the Windows firewall is the problem here. I'm betting your
> computer is detecting your home network and the office network as two
> different types (one as domain, the other as private, possibly), and
> that the Windows firewall only allows connections to Jetty when you are
> on one of those types of networks. The websphere install may have added
> explicit firewall exceptions for all network types when it was installed.
>
> Fiddling with the firewall exceptions is probably the way to fix this.
>
> Thanks,
> Shawn
Re: Remote connection to Solr
Hi Shawn,

The firewall was the first thing I looked into, and after fiddling with it I still see the issue. But if that were the issue, why doesn't WebSphere run into it when Jetty does?

However, your point about domain / non-domain and private / public networks may provide me with some new areas to look into.

Thanks

Steve

On Fri, Apr 24, 2015 at 10:11 AM, Shawn Heisey wrote:
> On 4/24/2015 8:03 AM, Steven White wrote:
> > This may be a Jetty question but let me start here first.
> >
> > I have Solr running on my laptop and from my desktop I have no issue
> > accessing it. However, if I take my laptop home and connect it to my
> > home network, the next day when I connect the laptop to my office
> > network, I no longer can access Solr from my desktop. A restart of
> > Solr will not do; the only fix is to restart my Windows 8.1 OS (that's
> > what's on my laptop).
> >
> > I have not been able to figure out why this is happening and I'm
> > suspecting it has to do something with Jetty, because I have Solr 3.6
> > running on my laptop in a WebSphere profile and it does not run into
> > this issue.
> >
> > Any ideas what could be causing this? Is this question for the Jetty
> > mailing list?
>
> I'm guessing the Windows firewall is the problem here. I'm betting your
> computer is detecting your home network and the office network as two
> different types (one as domain, the other as private, possibly), and
> that the Windows firewall only allows connections to Jetty when you are
> on one of those types of networks. The websphere install may have added
> explicit firewall exceptions for all network types when it was installed.
>
> Fiddling with the firewall exceptions is probably the way to fix this.
>
> Thanks,
> Shawn
Re: ArrayIndexOutOfBoundsException in RecordingJSONParser.java
Ticket opened: https://issues.apache.org/jira/i#browse/SOLR-7462 Thanks, Scott On Fri, Apr 24, 2015 at 9:38 AM, Shawn Heisey wrote: > On 4/24/2015 7:16 AM, Scott Dawson wrote: > > Should I create a JIRA ticket? (Am I allowed to?) I can provide more > info > > about my particular usage including a stacktrace if that's helpful. I'm > > using the new custom JSON indexing, which, by the way, is an excellent > > feature and will be of great benefit to my project. Thanks for that. > > Ouch. Thanks for finding the bug! > > Anyone can create an account on the Apache Jira and then create issues. > Please do! The issue for this bug would go in the SOLR project. > > https://issues.apache.org/jira/browse/SOLR > > Thanks, > Shawn > >
Re: Remote connection to Solr
On 4/24/2015 8:03 AM, Steven White wrote:
> This may be a Jetty question but let me start here first.
>
> I have Solr running on my laptop and from my desktop I have no issue
> accessing it. However, if I take my laptop home and connect it to my home
> network, the next day when I connect the laptop to my office network, I no
> longer can access Solr from my desktop. A restart of Solr will not do; the
> only fix is to restart my Windows 8.1 OS (that's what's on my laptop).
>
> I have not been able to figure out why this is happening and I'm suspecting
> it has to do something with Jetty, because I have Solr 3.6 running on my
> laptop in a WebSphere profile and it does not run into this issue.
>
> Any ideas what could be causing this? Is this question for the Jetty
> mailing list?

I'm guessing the Windows firewall is the problem here. I'm betting your computer is detecting your home network and the office network as two different types (one as domain, the other as private, possibly), and that the Windows firewall only allows connections to Jetty when you are on one of those types of networks. The websphere install may have added explicit firewall exceptions for all network types when it was installed.

Fiddling with the firewall exceptions is probably the way to fix this.

Thanks,
Shawn
Remote connection to Solr
Hi Everyone,

This may be a Jetty question but let me start here first.

I have Solr running on my laptop and from my desktop I have no issue accessing it. However, if I take my laptop home and connect it to my home network, the next day when I connect the laptop to my office network, I no longer can access Solr from my desktop. A restart of Solr will not do; the only fix is to restart my Windows 8.1 OS (that's what's on my laptop).

I have not been able to figure out why this is happening and I'm suspecting it has to do something with Jetty, because I have Solr 3.6 running on my laptop in a WebSphere profile and it does not run into this issue.

Any ideas what could be causing this? Is this question for the Jetty mailing list?

Thanks

Steve
Re: ArrayIndexOutOfBoundsException in RecordingJSONParser.java
On 4/24/2015 7:16 AM, Scott Dawson wrote: > Should I create a JIRA ticket? (Am I allowed to?) I can provide more info > about my particular usage including a stacktrace if that's helpful. I'm > using the new custom JSON indexing, which, by the way, is an excellent > feature and will be of great benefit to my project. Thanks for that. Ouch. Thanks for finding the bug! Anyone can create an account on the Apache Jira and then create issues. Please do! The issue for this bug would go in the SOLR project. https://issues.apache.org/jira/browse/SOLR Thanks, Shawn
Re: payload similarity
Ahmet, exactly. As I have just illustrated with code, simultaneously with your reply. Thanks!

On Fri, Apr 24, 2015 at 4:30 PM, Ahmet Arslan wrote:
> Hi Dmitry,
>
> I think, it is activated by PayloadTermQuery.
>
> Ahmet
>
> On Friday, April 24, 2015 2:51 PM, Dmitry Kan wrote:
>
> Hi,
>
> Using the approach here
> http://lucidworks.com/blog/getting-started-with-payloads/ I have
> implemented my own PayloadSimilarity class. When debugging the code I have
> noticed that the scorePayload method is never called. What could be wrong?
>
> [code]
>
> class PayloadSimilarity extends DefaultSimilarity {
>     @Override
>     public float scorePayload(int doc, int start, int end, BytesRef payload) {
>         float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
>         System.out.println("payloadValue = " + payloadValue);
>         return payloadValue;
>     }
> }
>
> [/code]
>
> Here is how the similarity is injected during indexing:
>
> [code]
>
> PayloadEncoder encoder = new FloatEncoder();
> IndexWriterConfig indexWriterConfig = new
>     IndexWriterConfig(Version.LUCENE_4_10_4, new PayloadAnalyzer(encoder));
> payloadSimilarity = new PayloadSimilarity();
> indexWriterConfig.setSimilarity(payloadSimilarity);
> IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
>
> [/code]
>
> and during searching:
>
> [code]
>
> IndexReader indexReader = DirectoryReader.open(dir);
> IndexSearcher searcher = new IndexSearcher(indexReader);
> searcher.setSimilarity(payloadSimilarity);
>
> TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
> termQuery.setBoost(1.1f);
> TopDocs topDocs = searcher.search(termQuery, 10);
> printResults(searcher, termQuery, topDocs);
>
> [/code]
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info

--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
Re: payload similarity
Answering my own question: in order to account for payloads, PayloadTermQuery should be used instead of TermQuery:

PayloadTermQuery payloadTermQuery = new PayloadTermQuery(
        new Term("body", "dogs"), new MaxPayloadFunction());

Then in the query explanation we get:

---
Results for body:dogs of type: org.apache.lucene.search.payloads.PayloadTermQuery
Doc: doc=0 score=3.125 shardIndex=-1
payloadValue = 10.0
Explain: 3.125 = (MATCH) btq, product of:
  0.3125 = weight(body:dogs in 0) [PayloadSimilarity], result of:
    0.3125 = fieldWeight in 0, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = phraseFreq=1.0
      1.0 = idf(docFreq=3, maxDocs=10)
      0.3125 = fieldNorm(doc=0)
  10.0 = MaxPayloadFunction.docScore()

Doc: doc=9 score=3.125 shardIndex=-1
payloadValue = 10.0
Explain: 3.125 = (MATCH) btq, product of:
  0.3125 = weight(body:dogs in 9) [PayloadSimilarity], result of:
    0.3125 = fieldWeight in 9, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = phraseFreq=1.0
      1.0 = idf(docFreq=3, maxDocs=10)
      0.3125 = fieldNorm(doc=9)
  10.0 = MaxPayloadFunction.docScore()

Doc: doc=1 score=0.3125 shardIndex=-1
Explain: 0.3125 = (MATCH) btq, product of:
  0.3125 = weight(body:dogs in 1) [PayloadSimilarity], result of:
    0.3125 = fieldWeight in 1, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = phraseFreq=1.0
      1.0 = idf(docFreq=3, maxDocs=10)
      0.3125 = fieldNorm(doc=1)
  1.0 = MaxPayloadFunction.docScore()
---

On Fri, Apr 24, 2015 at 2:50 PM, Dmitry Kan wrote:
> Hi,
>
> Using the approach here
> http://lucidworks.com/blog/getting-started-with-payloads/ I have
> implemented my own PayloadSimilarity class. When debugging the code I have
> noticed that the scorePayload method is never called. What could be wrong?
>
> [code]
>
> class PayloadSimilarity extends DefaultSimilarity {
>     @Override
>     public float scorePayload(int doc, int start, int end, BytesRef payload) {
>         float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
>         System.out.println("payloadValue = " + payloadValue);
>         return payloadValue;
>     }
> }
>
> [/code]
>
> Here is how the similarity is injected during indexing:
>
> [code]
>
> PayloadEncoder encoder = new FloatEncoder();
> IndexWriterConfig indexWriterConfig = new
>     IndexWriterConfig(Version.LUCENE_4_10_4, new PayloadAnalyzer(encoder));
> payloadSimilarity = new PayloadSimilarity();
> indexWriterConfig.setSimilarity(payloadSimilarity);
> IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
>
> [/code]
>
> and during searching:
>
> [code]
>
> IndexReader indexReader = DirectoryReader.open(dir);
> IndexSearcher searcher = new IndexSearcher(indexReader);
> searcher.setSimilarity(payloadSimilarity);
>
> TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
> termQuery.setBoost(1.1f);
> TopDocs topDocs = searcher.search(termQuery, 10);
> printResults(searcher, termQuery, topDocs);
>
> [/code]
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info

--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
Re: payload similarity
Hi Dmitry,

I think, it is activated by PayloadTermQuery.

Ahmet

On Friday, April 24, 2015 2:51 PM, Dmitry Kan wrote:

Hi,

Using the approach here http://lucidworks.com/blog/getting-started-with-payloads/ I have implemented my own PayloadSimilarity class. When debugging the code I have noticed that the scorePayload method is never called. What could be wrong?

[code]

class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
        System.out.println("payloadValue = " + payloadValue);
        return payloadValue;
    }
}

[/code]

Here is how the similarity is injected during indexing:

[code]

PayloadEncoder encoder = new FloatEncoder();
IndexWriterConfig indexWriterConfig = new
    IndexWriterConfig(Version.LUCENE_4_10_4, new PayloadAnalyzer(encoder));
payloadSimilarity = new PayloadSimilarity();
indexWriterConfig.setSimilarity(payloadSimilarity);
IndexWriter writer = new IndexWriter(dir, indexWriterConfig);

[/code]

and during searching:

[code]

IndexReader indexReader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(indexReader);
searcher.setSimilarity(payloadSimilarity);

TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
termQuery.setBoost(1.1f);
TopDocs topDocs = searcher.search(termQuery, 10);
printResults(searcher, termQuery, topDocs);

[/code]

--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
ArrayIndexOutOfBoundsException in RecordingJSONParser.java
Hello,

I'm running Solr 5.1 and during indexing I get an ArrayIndexOutOfBoundsException at line 61 of org/apache/solr/util/RecordingJSONParser.java. Looking at the code (see below), it seems obvious that the if-statement at line 60 should use a greater-than sign instead of greater-than-or-equals.

  @Override
  public CharArr getStringChars() throws IOException {
    CharArr chars = super.getStringChars();
    recordStr(chars.toString());
    position = getPosition();
    // if reading a String, the getStringChars do not return the closing single quote or double quote
    // so, try to capture that
    if (chars.getArray().length >= chars.getStart() + chars.size()) {   // line 60
      char next = chars.getArray()[chars.getStart() + chars.size()];    // line 61
      if (next == '"' || next == '\'') {
        recordChar(next);
      }
    }
    return chars;
  }

Should I create a JIRA ticket? (Am I allowed to?) I can provide more info about my particular usage including a stacktrace if that's helpful.

I'm using the new custom JSON indexing, which, by the way, is an excellent feature and will be of great benefit to my project. Thanks for that.

Regards,
Scott Dawson
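To make the off-by-one concrete, here is a small standalone sketch (class and method names are mine, not the real parser) that mirrors the bounds check around lines 60-61:

```java
public class BoundsCheckDemo {
    // Mirrors RecordingJSONParser's peek at the character just past the
    // recorded string, i.e. index start + size into the backing array.
    public static Character peekAfter(char[] array, int start, int size) {
        // Line 60 of the real parser uses >=, which still admits
        // start + size == array.length, one past the last valid index.
        if (array.length > start + size) {   // strictly greater is safe
            return array[start + size];
        }
        return null;                         // nothing to peek at
    }

    public static void main(String[] args) {
        char[] buf = {'"', 'h', 'i', '"'};
        System.out.println(peekAfter(buf, 1, 2)); // closing quote is present
        System.out.println(peekAfter(buf, 1, 3)); // null; with >= this would index buf[4] and throw
    }
}
```

With `>=`, the second call would evaluate `buf[4]` on a length-4 array, which is exactly the ArrayIndexOutOfBoundsException reported above.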
Re: SolrCloud to exclude xslt files in conf from zookeeper
On 4/24/2015 4:54 AM, Kumaradas Puthussery Krishnadas wrote:
> I am creating a SolrCloud with 4 solr instances and 5 zookeeper instances. I
> need to make sure that querying is working even when my 3 zookeepers are
> down. But it looks like the queries using json transformation based xslt
> templates which is not available since the zookeeper ensemble is not
> available.

Is it five zookeepers or three? It sounds like you might have five, but three of them are down for some reason.

When you have five zookeepers, you can lose two and maintain quorum. If you lose three, then zookeeper doesn't have enough nodes to work properly, and SolrCloud will also stop normal operation. This is a fundamental property of zookeeper: there must be a majority of nodes operational -- more than half. If you have three zookeepers, you can lose only one and still maintain quorum.

Some information about zookeeper that isn't directly applicable to your situation but may help explain why zookeeper behaves the way it does: exactly half of the total nodes is not enough. If you have four zookeepers, two is not enough for quorum; three of them must be operational and able to communicate with each other. This is to prevent split-brain, where two clusters are formed that cannot communicate with each other but independently believe that they are the functional cluster.

> So is it possible to exclude files (eg: xslt folder) in the conf directory
> from being loaded into Zookeeper rather point it to the filesystem so that
> querying the solrcloud is not broken.

One of the major points of putting the config in zookeeper is to centralize it and have zero reliance on local config files, which may be different on each Solr instance. Consider a cloud with five hundred nodes: a centralized config is the only way to be absolutely certain that every node has the update.
You are welcome to file a feature request in Jira for the capability you want, but you may encounter resistance to actually getting it into Solr. If you lose zookeeper quorum, then SolrCloud has no choice other than stopping normal operation, to protect the integrity of the cloud. Any other action could lead to data loss. Zookeeper is a fundamental part of SolrCloud, so if your Zookeeper ensemble is not healthy, neither is SolrCloud. Thanks, Shawn
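Shawn's majority arithmetic can be written down as a two-line sanity check (plain Java, no Solr or ZooKeeper dependencies; the class name is mine):

```java
public class ZkQuorum {
    // Minimum servers that must be up and able to talk to each other:
    // a strict majority of the ensemble.
    public static int quorum(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // How many servers the ensemble can lose and keep working.
    public static int tolerated(int ensembleSize) {
        return ensembleSize - quorum(ensembleSize);
    }

    public static void main(String[] args) {
        // 5 zookeepers: quorum is 3, so two may fail.
        System.out.println(quorum(5) + " / " + tolerated(5));
        // 4 zookeepers: quorum is 3, so only one may fail --
        // exactly half (two of four) is not enough.
        System.out.println(quorum(4) + " / " + tolerated(4));
    }
}
```

So with five zookeepers and three down, only two remain: below quorum(5) == 3, and SolrCloud stops normal operation as described above.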
AW: o.a.s.c.SolrException: missing content stream
Stupid me (yet again): should have taken a TEXT instead of (only) a STRING field for the content ;)

Another question I have though (which fits the subject even better): in the log I see many

org.apache.solr.common.SolrException: missing content stream
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
        ...
        at org.eclipse.jetty.server.Server.handle(Server.java:368)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)

What are possible reasons for this?

Thx
Clemens

-----Original Message-----
From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
Sent: Friday, April 24, 2015 14:01
To: solr-user@lucene.apache.org
Subject: o.a.s.c.SolrException: missing content stream

Context: Solr/Lucene 5.1, adding documents to a Solr core/index through SolrJ.

I extract PDFs using Tika. The pdf-content is one of the fields of my SolrDocuments that are transmitted to Solr using SolrJ. As not all documents seem to be "coming through" I looked into the Solr logs and see the following exceptions:

org.apache.solr.common.SolrException: Exception writing document id fustusermanuals#4614 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
        ...
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 32, 10, 32, 10, 10, 70, 82, 32, 77, 111, 100, 101, 32, 100, 39, 101, 109, 112, 108, 111, 105, 32, 10, 10, 32, 10, 10, 32, 10]...', original message: bytes can be at most 32766 in length; got 186493
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1349)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
        ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 186493
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingCh
Re: Simple search low speed
Try breaking down the query to see which part of it is slow. If it turns out to be the range query you may want to look into using an frange postfilter.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 24, 2015 at 6:50 AM, Norgorn wrote:
> Thanks for your reply.
>
> Yes, 100% CPU is used by SOLR (100% - I mean 1 core, not all cores), I'm
> totally sure.
>
> I have more than 80 GB RAM on the test machine and about 50 is cached as disk
> cache; SOLR uses about 8, Xmx=40G.
>
> I use G1 GC, but it can't be the problem, because memory usage is much lower
> than the GC start limit (45% of heap).
>
> I think the problem can be in the fully optimized index, and search over one
> big segment is much slower than parallel search over a lot of segments, but it
> sounds weird, so I'm not sure.
> Setups with big indexes which I know are all with optimized indexes.
>
> Index scheme:
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202157.html
> Sent from the Solr - User mailing list archive at Nabble.com.
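As a concrete sketch of Joel's suggestion (untested; the l/u values below are my epoch-millisecond equivalents of the 2015-01-01T00:00:00Z to 2015-04-24T23:59:59Z bounds from the original query), the date restriction could be rewritten so it runs as a postfilter -- `cache=false` plus `cost >= 100` is what moves an frange filter after the main query, and `ms()` yields the date field's value in milliseconds since the epoch:

```
fq={!frange cache=false cost=200 l=1420070400000 u=1429919999000}ms(date)
```

A postfilter is only checked against documents that already match everything else, which can help when the range matches a huge portion of the index.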
o.a.s.c.SolrException: missing content stream
Context: Solr/Lucene 5.1, adding documents to a Solr core/index through SolrJ.

I extract PDFs using Tika. The pdf-content is one of the fields of my SolrDocuments that are transmitted to Solr using SolrJ. As not all documents seem to be "coming through" I looked into the Solr logs and see the following exceptions:

org.apache.solr.common.SolrException: Exception writing document id fustusermanuals#4614 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
        ...
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 32, 10, 32, 10, 10, 70, 82, 32, 77, 111, 100, 101, 32, 100, 39, 101, 109, 112, 108, 111, 105, 32, 10, 10, 32, 10, 10, 32, 10]...', original message: bytes can be at most 32766 in length; got 186493
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1349)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
        ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 186493
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
        ... 47 more

How can I tell Solr/SolrJ to allow more payload? I also see some

org.apache.solr.common.SolrException: Exception writing document id fustusermanuals#3323 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:697)
        ...
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 69, 78, 32, 76, 67, 68, 32, 116, 101, 108, 101, 118, 105, 115, 105, 111, 110, 10, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95]...', original message: bytes can be at most 32766 in length; got 164683
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(Do
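On the "allow more payload" question: the 32766-byte cap is a hard per-term limit in Lucene, not a Solr setting. The clean fix, as noted later in this thread, is a tokenized (TEXT) field type so no single term grows that large; as a stopgap, the value can be shortened client-side before it is sent. A hedged sketch of such a helper (my own code, not part of SolrJ):

```java
import java.nio.charset.StandardCharsets;

public class TermTruncator {
    // Lucene's per-term hard limit (IndexWriter.MAX_TERM_LENGTH, in UTF-8 bytes).
    public static final int MAX_TERM_BYTES = 32766;

    /** Shorten s so its UTF-8 encoding fits in maxBytes without
     *  splitting a multi-byte character. */
    public static String truncateUtf8(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s;
        }
        int end = maxBytes;
        // Back up past UTF-8 continuation bytes (10xxxxxx) so we never
        // cut a code point in half.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String content = "x".repeat(186_493); // same size as the failing term above
        String safe = truncateUtf8(content, MAX_TERM_BYTES);
        System.out.println(safe.getBytes(StandardCharsets.UTF_8).length); // 32766
    }
}
```

The helper would be applied to the string-typed suggest field before building the SolrInputDocument; the tokenized-field fix remains the better long-term answer since truncation silently drops content.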
payload similarity
Hi,

Using the approach here http://lucidworks.com/blog/getting-started-with-payloads/ I have implemented my own PayloadSimilarity class. When debugging the code I have noticed that the scorePayload method is never called. What could be wrong?

[code]
class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
        System.out.println("payloadValue = " + payloadValue);
        return payloadValue;
    }
}
[/code]

Here is how the similarity is injected during indexing:

[code]
PayloadEncoder encoder = new FloatEncoder();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_4, new PayloadAnalyzer(encoder));
payloadSimilarity = new PayloadSimilarity();
indexWriterConfig.setSimilarity(payloadSimilarity);
IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
[/code]

and during searching:

[code]
IndexReader indexReader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(indexReader);
searcher.setSimilarity(payloadSimilarity);

TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
termQuery.setBoost(1.1f);
TopDocs topDocs = searcher.search(termQuery, 10);
printResults(searcher, termQuery, topDocs);
[/code]

--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
SolrCloud to exclude xslt files in conf from zookeeper
I am creating a SolrCloud with 4 solr instances and 5 zookeeper instances. I need to make sure that querying keeps working even when 3 of my zookeepers are down. But it looks like queries that use XSLT-based JSON transformation templates fail, since the templates are not available when the zookeeper ensemble is not available.

So is it possible to exclude files (e.g. the xslt folder) in the conf directory from being loaded into Zookeeper, and instead point to the filesystem, so that querying the SolrCloud is not broken?

Thanks
Kumar
Re: Simple search low speed
Thanks for your reply.

Yes, 100% CPU is used by SOLR (100% - I mean 1 core, not all cores), I'm totally sure.

I have more than 80 GB RAM on the test machine and about 50 is cached as disk cache; SOLR uses about 8, Xmx=40G.

I use G1 GC, but it can't be the problem, because memory usage is much lower than the GC start limit (45% of heap).

I think the problem can be in the fully optimized index, and search over one big segment is much slower than parallel search over a lot of segments, but it sounds weird, so I'm not sure. Setups with big indexes which I know are all with optimized indexes.

Index scheme:

--
View this message in context: http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202157.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Simple search low speed
Java side:
- launch jvisualvm
- see how heap and CPU are occupied

What are your JVM settings (heap) and how much RAM do you have? Is the 100% CPU used only by Solr? That is, are you 100% certain it's Solr that drives the CPU to its limit?

Regards,
LAFK

2015-04-24 12:14 GMT+02:00 Norgorn :
> The number of documents in the collection is about 100m.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202152.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Checking of Solr Memory and Disk usage
On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> So has anyone knows what is the issue with the "Heap Memory Usage" reading
> showing the value -1. Should I open an issue in Jira?

I have solr 4.8.1 and solr 5.0.0 servers. On the solr 4.8.1 servers the core statistics have values for heap memory; on the solr 5.0.0 ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK on both versions.

I don't see this issue in the fixed bugs in 5.1.0, but I only looked at the headlines of the tickets:
http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes

Cheers
Tom
Re: Simple search low speed
The number of documents in the collection is about 100m.

--
View this message in context: http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202152.html
Sent from the Solr - User mailing list archive at Nabble.com.
require diversity in results?
Hello list,

I'm wondering if there could be extra parameters or query operators with which I could impose that sorting by relevance be relaxed so that there's a minimum diversity in some fields on the first page of results. For example, I'd like the search results to contain at least three possible types of resource on the first page, fetching things from further down the ranking if needed.

I know that could be done with a search-result post-processor, but I think that is generally a bad idea for performance. Any other ideas?

thanks

Paul
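One approach worth trying before writing a post-processor (hedged, since I don't know the schema; `resource_type` below is a placeholder field name) is Solr's result grouping, which caps how many documents any one group can contribute:

```
q=...&group=true&group.field=resource_type&group.limit=4&group.main=true
```

With `group.main=true` the response remains a flat document list, but no single resource type contributes more than `group.limit` documents, which forces some diversity into the first page without post-processing the full result set.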
AW: Odp.: solr issue with pdf forms
Hey Erick,

thanks a lot for your answer. I went to the admin schema browser, but what should I see there? Sorry, I'm not familiar with the admin schema browser. :-(

Best
Steve

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, April 23, 2015 18:00
To: solr-user@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

When you say "they're not indexed correctly", what's your evidence? You cannot rely on the display in the browser; that's the raw input just as it was sent to Solr, _not_ the actual tokens in the index. What do you see when you go to the admin schema browser page and load the actual tokens? Or use the TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component) to see the actual terms in the index as opposed to the stored data you see in the browser when you look at search results.

If the actual terms don't seem right _in the index_ we need to see your analysis chain, i.e. your fieldType definition. I'm 90% sure you're seeing the stored data and your terms are indexed just fine, but I've certainly been wrong before, more times than I want to remember.

Best,
Erick

On Thu, Apr 23, 2015 at 1:18 AM, wrote:
> Hey Erick,
>
> thanks for your answer. They are not indexed correctly. Also through the
> solr admin interface I see these typical question marks within a rhombus where
> a blank space should be.
> I now figured out the following (not sure if it is relevant at all):
> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
> indexed correctly, no issues
> - PDF documents (with editable form fields) created with "Adobe
> InDesign CS5 (7.0.1)" are indexed with the blank space issue
>
> Best
> Steve
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, April 22, 2015 17:11
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> Are they not _indexed_ correctly or not being displayed correctly?
> Take a look at admin UI >> schema browser >> your field and press the "load
> terms" button. That'll show you what is _in_ the index as opposed to what the
> raw data looked like.
>
> When you return the field in a Solr search, you get a verbatim, un-analyzed
> copy of your original input. My guess is that your browser isn't using a
> compatible character encoding for display.
>
> Best,
> Erick
>
> On Wed, Apr 22, 2015 at 7:08 AM, wrote:
>> Thanks for your answer. Maybe my English is not good enough; what are you
>> trying to say? Sorry, I didn't get the point. :-(
>>
>> -----Original Message-----
>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>> Sent: Wednesday, April 22, 2015 14:01
>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>> Subject: Odp.: solr issue with pdf forms
>>
>> Off the top of my head, I'd follow how the writable PDFs are created and encoded.
>>
>> @LAFK_PL
>> Original message
>> From: steve.sch...@t-systems.com
>> Sent: Wednesday, April 22, 2015 12:41
>> To: solr-user@lucene.apache.org
>> Reply-to: solr-user@lucene.apache.org
>> Subject: solr issue with pdf forms
>>
>> Hi guys,
>>
>> hopefully you can help me with my issue. We are using a solr setup and have
>> the following issue:
>> - usual pdf files are indexed just fine
>> - pdf files with writable form-fields look like this:
>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und v ollständig sind
>>
>> Somehow the blank space character is not indexed correctly.
>>
>> Is this a known issue? Does anybody have an idea?
>>
>> Thanks a lot
>> Best
>> Steve
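As a concrete starting point for the TermsComponent suggestion above (the URL is a sketch; host, core name, field, and prefix are assumptions about this setup):

```
http://localhost:8983/solr/collection1/terms?terms.fl=content&terms.prefix=best&terms.limit=20
```

This lists up to 20 indexed terms from the `content` field starting with "best", which shows directly whether the space-mangled tokens from the PDF forms actually made it into the index or only into the stored/displayed values.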
Simple search low speed
We have a simple search over a 50 GB index. And it's slow. I can't understand why: the whole index is in RAM (and a lot of free space is available) and the CPU is the bottleneck (100% load).

The query is simple (except tvrh):

q=(text:(word1+word2)++title:(word1+word2))&tv=true&isShard=true&qt=/tvrh&fq=cat:(10+11+12)&fq=field1:(150)&fq=field2:(0)&fq=date:[2015-01-01T00:00:00Z+TO+2015-04-24T23:59:59Z]

text, title - text_general fields
cat, field1, field2 - tint fields
date - a date field (I know, it's deprecated, will be changed soon).

All fields are indexed, some of them are stored. And the search time is 15 seconds (for a warmed searcher; it's not the first query). debug=true shows timings process={time=15382.0,query={time=15282.0}

What can I check?

--
View this message in context: http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Grouping Performance Optimation
If you need only 200 results grouped, you can easily do it with some external code; it will be much faster anyway.

Also, it's widely suggested to use docValues="true" for the fields by which grouping is performed; it really helps (I can only give numbers in terms of RAM usage, but speed increases as well).

--
View this message in context: http://lucene.472066.n3.nabble.com/Grouping-Performance-Optimation-tp4201886p4202133.html
Sent from the Solr - User mailing list archive at Nabble.com.
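For reference, the docValues change is a single attribute on the field definition in schema.xml (a sketch; the field name and type here are examples, and the collection must be reindexed after the change for docValues to take effect):

```xml
<field name="category" type="tint" indexed="true" stored="false" docValues="true"/>
```

With docValues enabled, grouping and faceting on the field read a column-oriented on-disk structure instead of building the uninverted field cache on the heap, which is where the RAM savings come from.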
Re: Checking of Solr Memory and Disk usage
Hi,

So does anyone know what the issue is with the "Heap Memory Usage" reading showing the value -1? Should I open an issue in Jira?

Regards,
Edwin

On 22 April 2015 at 21:23, Zheng Lin Edwin Yeo wrote:
> I see. I'm running SolrCloud with 2 replicas, so I guess mine will
> probably use much more when my system reaches millions of documents.
>
> Regards,
> Edwin
>
> On 22 April 2015 at 20:47, Shawn Heisey wrote:
>> On 4/22/2015 12:11 AM, Zheng Lin Edwin Yeo wrote:
>>> Roughly how many collections and how many records do you have in your Solr?
>>>
>>> I have 8 collections with a total of roughly 227000 records, most of which
>>> are CSV records. One of my collections has 142000 records.
>>
>> The core that shows 82MB for heap usage has 16 million documents and is
>> hit with an average of 1 or 2 queries per second. The entire Solr
>> instance on this machine has about 55 million documents and a 6GB max
>> heap.
>>
>> This is NOT running SolrCloud, though the indexes are distributed.
>> There are 24 cores defined, but during normal operation, only four of
>> them contain documents. All four of those cores show heap memory values
>> less than 100MB, but the overall heap usage on that machine is measured
>> in gigabytes.
>>
>> Thanks,
>> Shawn