Re: CLUSTERSTATE timeout
I'm having the same issue with 4.10.3. I'm performing various tasks against the CLUSTERSTATE API and getting random timeouts throughout the day. -- View this message in context: http://lucene.472066.n3.nabble.com/CLUSTERSTATE-timeout-tp4199367p4199501.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Lazy startup - load-on-startup missing from web.xml?
Hi, it worked! The issue was originally on WAS 7, but had somehow regressed into WebSphere 8.5. Thanks. On Thu, Feb 19, 2015 at 10:13 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Hi! Solr starts up dormant for me until a client wakes it up with a : REST request, or I open the admin UI; only then does the remaining : initialization happen. : Is this a known issue? based on my recollection of the servlet spec, that sounds like a bug/glitch/config option in your Servlet container... Googling "WebSphere init Filters on startup" turns up this IBM bug report with noted fix versions... http://www-01.ibm.com/support/docview.wss?uid=swg1PK86553 : I can't see any load-on-startup in the web.xml in Solr.war. The bulk of Solr exists as a Filter. Filters are not permitted by the servlet spec to specify a load-on-startup value (only Servlets can specify that, and the only Servlets in Solr are for supporting legacy paths -- the load order doesn't matter for them) : Running Solr 4.7.2 on WebSphere 8.5 : : App loading message as the server starts up: : [2/16/15 12:17:19:956 GMT] 0056 ApplicationMg A WSVR0221I: : Application started: solr-4.7.2 : [2/16/15 12:17:20:319 GMT] 0001 WsServerImpl A WSVR0001I: : Server serverSolr open for e-business : The next startup message in the log is on the next day, once I enter : the Solr admin UI: : [2/17/15 10:20:13:827 GMT] 0098 SolrDispatchF I : org.apache.solr.servlet.SolrDispatchFilter init SolrDispatchFilter.init() : ... : -Hoss http://www.lucidworks.com/
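Hoss's point about Filters is visible in the deployment descriptor itself: the servlet spec only defines load-on-startup for `<servlet>` entries. A hedged illustration (the servlet name and class below are hypothetical, not taken from Solr's web.xml; the filter name reflects Solr's stock web.xml but should be verified against your version):

```xml
<!-- Servlets may opt into eager initialization: the container initializes
     them at deploy time, in ascending load-on-startup order. -->
<servlet>
  <servlet-name>exampleServlet</servlet-name>
  <servlet-class>com.example.ExampleServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
</servlet>

<!-- Filters have no <load-on-startup> element in the spec, so *when* a
     Filter such as SolrDispatchFilter gets initialized is left to the
     container -- which is why the WebSphere behavior varies by fix level. -->
<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>
</filter>
```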
Re: Java.net.socketexception: broken pipe Solr 4.10.2
On 4/13/2015 10:11 PM, vsilgalis wrote: just a couple of notes: this is a 2-shard setup with 2 nodes per shard. Currently these are on VMs with 8 cores and 8GB of RAM each (java max heap is ~5588mb but we usually never even get that high), backed by an NFS file store on which we store the indexes (netapp SAN with NFS exports on SAS disk). Broken pipe errors usually indicate that the client gave up waiting for the server and disconnected the TCP connection before the server completed processing and sent a response. This is frequently because of configured timeouts on the client. If reasonable timeouts are being exceeded, it's usually a performance problem. You haven't indicated how much disk space is occupied by the index data on each of these servers. There are also several other things that would be helpful to know. Please read this wiki page, then come back with any questions you might have, and I may also ask a question or two: http://wiki.apache.org/solr/SolrPerformanceProblems My immediate suspects are an OS disk cache that is too small, and/or problems with garbage collection pauses. These are two of the issues discussed on that wiki page. Thanks, Shawn
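Shawn's diagnosis -- the client timing out and closing the connection before the server finishes, leaving the server to fail with a broken pipe when it finally writes -- can be reproduced in miniature with only the JDK. This is an illustrative sketch, not Solr code; the endpoint, delay, and timeout values are invented for the demo:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URL;

public class BrokenPipeDemo {
    public static void main(String[] args) throws Exception {
        // A deliberately slow endpoint: it responds after 2 seconds.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/slow", exchange -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException ignored) { }
            byte[] body = "done".getBytes();
            try {
                exchange.sendResponseHeaders(200, body.length);
                OutputStream out = exchange.getResponseBody();
                // If the client has already given up, this write can fail
                // with a broken-pipe style IOException -- the server-side
                // symptom described above.
                out.write(body);
                out.close();
            } catch (IOException e) {
                System.out.println("server write failed: " + e.getMessage());
            }
        });
        server.start();

        int port = server.getAddress().getPort();
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:" + port + "/slow").openConnection();
        conn.setConnectTimeout(1000);
        conn.setReadTimeout(500); // client gives up before the server answers
        try {
            conn.getInputStream().read();
            System.out.println("unexpected: response arrived in time");
        } catch (SocketTimeoutException e) {
            System.out.println("client timed out");
        }
        server.stop(0);
    }
}
```

The client side always sees a timeout here; whether the server side logs a broken-pipe error depends on OS socket buffering, which is exactly why these errors appear randomly under load.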
Problem related to filter on Zero value for DateField
Dears, Hi, I have a strange problem with Solr 4.10.x. The problem is that when I search on the Solr zero date, 0002-11-30T00:00:00Z, and more than one filter is applied, the results become invalid. For example, consider this scenario: When I search for a document with fq=p_date:0002-11-30T00:00:00Z, Solr returns three different documents, which is right for my collection. All three documents have the same value of 7 for document status. If I search for fq=document_status:7, the same three documents are returned, which is also a correct response. But when I search with fq=document_status:7&fq=p_date:0002-11-30T00:00:00Z, Solr returns nothing (0 documents)! I have no such problem with date values other than the Solr zero date (0002-11-30T00:00:00Z). Please let me know whether this is a Solr bug or I did something wrong? Best regards. -- A.Nazemian
Re: Securing solr index
Hi, I might misunderstand you, but if you are talking about securing the actual files/folders of the index, I do not think this is a Solr/Lucene concern. Use the standard mechanisms of your OS. E.g. on linux/unix use chown, chgrp, chmod, sudo, apparmor etc - e.g. allowing only root to write the folders/files, and sudo the user running Solr/Lucene to operate as root in this area. Even admins should not (normally) operate as root - that way they cannot write the files either. No one knows the root password - except maybe for the super-super-admin, or you split the root password in two and two admins each know a part, so that they both have to agree in order to operate as root. Be creative yourself. Regards, Per Steffensen On 13/04/15 12:13, Suresh Vanasekaran wrote: Hi, We are having the solr index maintained in a central server and multiple users might be able to access the index data. May I know what the best practices are for securing the solr index folder, where ideally only the application user should have access. Even an admin user should not be able to copy the data and use it in another schema. Thanks
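The chown/chmod restriction Per describes can also be expressed from Java's standard NIO API. A minimal sketch, POSIX systems only; the temporary directory below is a stand-in for a real index path:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class LockDownIndexDir {
    public static void main(String[] args) throws IOException {
        // Hypothetical index location; substitute your real data directory.
        Path indexDir = Files.createTempDirectory("solr-index-demo");

        // rwx------ : only the owning user may list, read, or modify.
        Set<PosixFilePermission> ownerOnly =
                PosixFilePermissions.fromString("rwx------");
        Files.setPosixFilePermissions(indexDir, ownerOnly);

        // Read the permissions back to confirm the change took effect.
        System.out.println(PosixFilePermissions.toString(
                Files.getPosixFilePermissions(indexDir)));
    }
}
```

This only covers file-system access on the local box; it does nothing against an admin who can become the owning user, which is why Per's point about restricting root itself still applies.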
facet on external field
Hi, I am using an external file field for the price field since it changes frequently. Can I generate facets using an external field, and if so, how? I understand that faceting requires indexing, and external file fields are not actually indexed. -- Thanks Regards, Jainam Vora
Errors during Indexing in SOLR 4.6
Hi All, we recently migrated from SOLR 3.6 to SOLR 4; while indexing in SOLR 4 we are getting the exception below. Apr 1, 2015 9:22:57 AM org.apache.solr.common.SolrException log SEVERE: null:org.apache.solr.common.SolrException: Exception writing document id 932684555 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) Caused by: java.lang.IllegalArgumentException: first position increment must be > 0 (got 0) for field 'DataEnglish' at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:131) This works perfectly fine in SOLR 3.6. Can someone help debug this? Any fixes/solutions? Thanks in advance. Best Regards, Abhishek
Re: Java.net.socketexception: broken pipe Solr 4.10.2
Right now the index size is about 10GB on each shard (yes, I could use more RAM), but I'm looking more for a step-up than a step-down approach. I will try adding more RAM to these machines as my next step. 1. Zookeeper is external to these boxes in a three-node cluster with more than enough RAM to keep everything off disk. 2. OS disk cache: when I add more RAM I will just add it as RAM for the machine and not to the Java heap, unless that is something you recommend. 3. Java heap looks good so far; GC is minimal as far as I can tell, but I can look into this some more. 4. We do have 2 cores per machine, but the second core is a joke (10MB). note: zkClientTimeout is set to 30 for safety's sake. java settings: -XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:+ParallelRefProcEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=6000 -XX:CMSTriggerPermRatio=80 -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1 -XX:PretenureSizeThreshold=64m -XX:+CMSScavengeBeforeRemark -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxTenuringThreshold=8 -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=4 -XX:NewRatio=3 -XX:-UseSuperWord -Xmx5588m -Xms1596m -- View this message in context: http://lucene.472066.n3.nabble.com/Java-net-socketexception-broken-pipe-Solr-4-10-2-tp4199484p4199561.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing PDF and MS Office files
Hi, Here are the solrconfig.xml and the error log from the Solr logs for your reference. As mentioned earlier, I didn't make any changes to solrconfig.xml, as I am using the out-of-the-box file that came with the default installation. Please let me know your thoughts on why these issues are occurring. Thanks & Regards Vijay

*Vijay Bhoomireddy*, Big Data Architect
1000 Great West Road, Brentford, London, TW8 9DW
T: +44 20 3475 7980 | M: +44 7481 298 360 | W: http://www.whishworks.com

On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xlx, and .xlx) into Solr. I am facing the following issues; please let me know what is going wrong with the indexing process. I am using Solr 4.10.2 with the default example server configuration that comes with the Solr distribution. PDF files - indexing as such works fine, and when I query using *:* in the Solr query console, the metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks, etc.; whatever the PDF file, content is not being displayed. MS Office files - for some Office files, everything works perfectly and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. *Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser* I am using SolrJ to index the documents and below is the code snippet related to indexing.
Please let me know where the issue is occurring.

static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards Vijay

[attached solrconfig.xml, truncated:]
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. Licensed under the Apache License, Version 2.0; you may obtain a copy at http://www.apache.org/licenses/LICENSE-2.0 -->
<!-- For more details about configuration options that may appear in this file, see http://wiki.apache.org/solr/SolrConfigXml. -->
<config>
  <!-- In all configuration below, a prefix of "solr." for class names is an alias that causes Solr to search appropriate packages, including org.apache.solr.(search|update|request|core|analysis). You may also specify a fully qualified Java classname if you have your own custom plugins. -->
  <!-- Controls what version of Lucene various components of Solr adhere to. Generally, you want to use the latest version to get all bug fixes and improvements. It is highly recommended that you fully re-index after changing this setting as it can affect both how text is indexed and queried. -->
  <luceneMatchVersion>4.10.2</luceneMatchVersion>
  <!-- <lib/> directives can be used to instruct Solr to load any Jars identified and use them to resolve any plugins specified in your solrconfig.xml or schema.xml (ie: Analyzers, Request
Re: Indexing PDF and MS Office files
Andrea, Yes, I am using the stock schema.xml that comes with the example server of Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and put into the content field in the index. Please find the log information for the Parsing error below. org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Unknown Source) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 
32 more Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932) at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188) at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 35 more ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at
Re: Indexing PDF and MS Office files
It seems something like https://issues.apache.org/jira/browse/TIKA-1251. I see you're using Solr 4.10.2 which uses Tika 1.5 and that issue seems to be fixed in Tika 1.6. I agree with Erik: you should try with another version of Tika. Best, Andrea On 04/14/2015 06:44 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote: Andrea, Yes, I am using the stock schema.xml that comes with the example server of Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and put into the content field in the index. Please find the log information for the Parsing error below. org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Unknown Source) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 32 more Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932) at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188) at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 35 more ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at
sort by a copy field error
Hello, I have a pretty basic question: how can I sort by a copyField? My schema config is:

<field name="name" type="text_general_edge_ngram" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
<field name="name_sort" type="string" indexed="true" stored="false"/>
<copyField source="name" dest="name_sort"/>

And when I try to sort by name_sort the following error is raised:

error: { "msg": "sort param field can't be found: name_sort", "code": 400 }

Thanks in advance, Pedro Figueiredo
[ANNOUNCE] Apache Solr 5.1.0 released
14 April 2015 - The Lucene PMC is pleased to announce the release of Apache Solr 5.1.0. Solr 5.1.0 is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0 Solr 5.1.0 includes 39 new features, 40 bug fixes, and 36 optimizations / other changes from over 60 unique contributors. For detailed information about what is included in the 5.1.0 release, please see: http://lucene.apache.org/solr/5_1_0/changes/Changes.html Enjoy!
Re: Indexing PDF and MS Office files
looks like this is just a file that Tika can't handle, based on this line: bq: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser You might be able to get some joy from parsing this from Java and see if a more recent Tika would fix it. Here's some sample code: http://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Tue, Apr 14, 2015 at 9:44 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Andrea, Yes, I am using the stock schema.xml that comes with the example server of Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and put into the content field in the index. Please find the log information for the Parsing error below. org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Unknown Source) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5 at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 32 more Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932) at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188) at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
Indexing PDF and MS Office files
Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xlx, and .xlx) into Solr. I am facing the following issues; please let me know what is going wrong with the indexing process. I am using Solr 4.10.2 with the default example server configuration that comes with the Solr distribution. PDF files - indexing as such works fine, and when I query using *:* in the Solr query console, the metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks, etc.; whatever the PDF file, content is not being displayed. MS Office files - for some Office files, everything works perfectly and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. *Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser* I am using SolrJ to index the documents and below is the code snippet related to indexing. Please let me know where the issue is occurring.

static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards Vijay
Re: Indexing PDF and MS Office files
Hi Vijay, Please paste an extract of your schema, where the content field (the field where the PDF text should be) and its type are declared. For the other issue, please paste the whole stack trace, because "org.apache.tika.parser.microsoft.OfficeParser*" alone says nothing; the complete stack trace (or at least another three or four lines) should contain some other detail. Best, Andrea On 04/14/2015 04:57 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xlx, and .xlx) into Solr. I am facing the following issues; please let me know what is going wrong with the indexing process. I am using Solr 4.10.2 with the default example server configuration that comes with the Solr distribution. PDF files - indexing as such works fine, and when I query using *:* in the Solr query console, the metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks, etc.; whatever the PDF file, content is not being displayed. MS Office files - for some Office files, everything works perfectly and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. *Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser* I am using SolrJ to index the documents and below is the code snippet related to indexing. Please let me know where the issue is occurring.
static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);
Thanks & Regards Vijay
Re: Problem related to filter on Zero value for DateField
What does your main query look like? Normally we don't speak of searching with the fq parameter - it filters the results, but the actual searching is done via the main query with the q parameter. -- Jack Krupansky On Tue, Apr 14, 2015 at 4:17 AM, Ali Nazemian alinazem...@gmail.com wrote: Dears, Hi, I have a strange problem with Solr 4.10.x. My problem is that when I search on the Solr zero date, which is 0002-11-30T00:00:00Z, with more than one filter applied, the results become invalid. For example, consider this scenario: When I search for a document with fq=p_date:0002-11-30T00:00:00Z, Solr returns three different documents, which is right for my collection. All three of these documents have the same value of 7 for document status. Now if I search for fq=document_status:7, the same three documents are returned, which is also a correct response. But when I search with fq=document_status:7&fq=p_date:0002-11-30T00:00:00Z, Solr returns nothing! (0 documents) I have no such problem with other date values besides the Solr zero date (0002-11-30T00:00:00Z). Please let me know whether this is a bug in Solr or I did something wrong? Best regards. -- A.Nazemian
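One likely culprit in the non-working example above is the missing & between the two filters: each filter must be sent as its own fq=... parameter, joined with & (and URL-encoded). A minimal sketch of building such a request URL; the class name and base URL are illustrative, not from the thread:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FilterQueryUrl {

    // Build a /select URL where each filter is its own fq parameter.
    static String buildQuery(String baseUrl, String q, String... filters) throws Exception {
        StringBuilder sb = new StringBuilder(baseUrl)
                .append("/select?q=")
                .append(URLEncoder.encode(q, StandardCharsets.UTF_8.name()));
        for (String fq : filters) {
            // Every filter becomes a separate &fq=... pair.
            sb.append("&fq=").append(URLEncoder.encode(fq, StandardCharsets.UTF_8.name()));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Field names and the zero date mirror the example in the thread.
        System.out.println(buildQuery("http://localhost:8983/solr/collection1", "*:*",
                "document_status:7", "p_date:\"0002-11-30T00:00:00Z\""));
    }
}
```

With both filters sent as separate parameters, Solr intersects their results; if the intersection is still empty while each filter matches alone, that would point at an actual indexing/precision issue with the zero date rather than request syntax.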
RE: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1
Elisabeth, Currently ConjunctionSolrSpellChecker only supports adding WordBreakSolrSpellchecker to IndexBased- FileBased- or DirectSolrSpellChecker. In the future, it would be great if it could handle other Spell Checker combinations. For instance, if you had a (e)dismax query that searches multiple fields, to have a separate spellchecker for each of them. But CSSC is not hardened for this more general usage, as hinted in the API doc. The check done to ensure all spellcheckers use the same stringdistance object, I believe, is a safeguard against using this class for functionality it is not able to correctly support. It looks to me that SOLR-6271 was opened to fix the bug in that it is comparing references on the stringdistance. This is not a problem with WBSSC because this one does not support string distance at all. What you're hoping for, however, is that the requirement for the string distances be the same to be removed entirely. You could try modifying the code by removing the check. However beware that you might not get the results you desire! But should this happen, please, go ahead and fix it for your use case and then donate the code. This is something I've personally wanted for a long time. James Dyer Ingram Content Group -Original Message- From: elisabeth benoit [mailto:elisaelisael...@gmail.com] Sent: Tuesday, April 14, 2015 7:37 AM To: solr-user@lucene.apache.org Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1 Hello, I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and FileBasedSpellchecker in same request. I've applied change from patch 135.patch (cf Solr-6271). 
I've tried running the command patch -p1 -i 135.patch --dry-run but it didn't work, maybe because the patch was a fix to Solr 4.9, so I just replaced the line in ConjunctionSolrSpellChecker
else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException("All checkers need to use the same StringDistance.");
}
by
else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException("All checkers need to use the same StringDistance!!! 1: " + checker.getStringDistance() + " 2: " + stringDistance);
}
as it was done in the patch. But still, when I send a spellcheck request, I get the error msg: All checkers need to use the same StringDistance!!! 1: org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db3 2: org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08 From the error message I gather both spellcheckers use the same distanceMeasure, LuceneLevenshteinDistance, but they're not the same instance of LuceneLevenshteinDistance. Is the condition all right? What should be done to fix this properly? Thanks, Elisabeth
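For illustration only (this is not the actual SOLR-6271 patch): the error message shows two LuceneLevenshteinDistance objects that differ only by instance, which suggests equals() is falling back to Object identity in this version. One possible relaxation is to compare the distance implementations by class rather than by instance equality. The helper below is hypothetical:

```java
// Hypothetical helper sketching a relaxed compatibility check for
// ConjunctionSolrSpellChecker: two distance measures are treated as
// interchangeable when they are instances of the same class, instead of
// requiring the exact same (or equal) instance.
public class DistanceCompat {
    static boolean compatible(Object a, Object b) {
        // Same concrete class => same distance behavior, for stateless measures.
        return a != null && b != null && a.getClass() == b.getClass();
    }
}
```

Note this assumes the distance implementations are stateless, which holds for LuceneLevenshteinDistance but would not be safe for every conceivable StringDistance.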
Re: Indexing PDF and MS Office files
Hi, solrconfig.xml (especially if you didn't touch it) should be good. What about the schema? Are you using the one that comes with the download bundle, too? I don't see the stacktrace... did you forget to paste it? Best, Andrea On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote: Hi, Here are the solrconfig.xml and the error log from Solr logs for your reference. As mentioned earlier, I didn't make any changes to solrconfig.xml as I am using the out-of-the-box xml file that came with the default installation. Please let me know your thoughts on why these issues are occurring. Thanks & Regards Vijay *Vijay Bhoomireddy*, Big Data Architect 1000 Great West Road, Brentford, London, TW8 9DW T: +44 20 3475 7980 M: +44 7481 298 360 W: www.whishworks.com On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xlx, and .xlx) files into Solr. I am facing the following issues. Request to please let me know what is going wrong with the indexing process. I am using solr 4.10.2 and using the default example server configuration that comes with Solr distribution. PDF Files - Indexing as such works fine, but when I query using *.* in the Solr Query console, metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks etc. Whatever be the PDF file, content is not being displayed. MS Office files - For some office files, everything works perfect and the extracted content is visible in the query console.
However, for others, I see the below error message during the indexing process. *Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser* I am using SolrJ to index the documents and below is the code snippet related to indexing. Please let me know where the issue is occurring.
static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);
Thanks & Regards Vijay The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail. The views expressed in this communication may not necessarily be the view held by WHISHWORKS.
Re: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1
Thanks for your answer! I didn't realize this was not supposed to be done (conjunction of DirectSolrSpellChecker and FileBasedSpellChecker). I got this idea in the mailing list while searching for a solution to get a list of words to ignore for the DirectSolrSpellChecker. Well well well, I'll try removing the check and see what happens. I'm not a java programmer, but if I can find a simple solution I'll let you know. Thanks again, Elisabeth 2015-04-14 16:29 GMT+02:00 Dyer, James james.d...@ingramcontent.com: Elisabeth, Currently ConjunctionSolrSpellChecker only supports adding WordBreakSolrSpellchecker to IndexBased- FileBased- or DirectSolrSpellChecker. In the future, it would be great if it could handle other Spell Checker combinations. For instance, if you had a (e)dismax query that searches multiple fields, to have a separate spellchecker for each of them. But CSSC is not hardened for this more general usage, as hinted in the API doc. The check done to ensure all spellcheckers use the same stringdistance object, I believe, is a safeguard against using this class for functionality it is not able to correctly support. It looks to me that SOLR-6271 was opened to fix the bug in that it is comparing references on the stringdistance. This is not a problem with WBSSC because this one does not support string distance at all. What you're hoping for, however, is that the requirement for the string distances be the same to be removed entirely. You could try modifying the code by removing the check. However beware that you might not get the results you desire! But should this happen, please, go ahead and fix it for your use case and then donate the code. This is something I've personally wanted for a long time.
James Dyer Ingram Content Group -Original Message- From: elisabeth benoit [mailto:elisaelisael...@gmail.com] Sent: Tuesday, April 14, 2015 7:37 AM To: solr-user@lucene.apache.org Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1 Hello, I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and FileBasedSpellchecker in same request. I've applied change from patch 135.patch (cf Solr-6271). I've tried running the command patch -p1 -i 135.patch --dry-run but it didn't work, maybe because the patch was a fix to Solr 4.9, so I just replaced line in ConjunctionSolrSpellChecker else if (!stringDistance.equals(checker.getStringDistance())) { throw new IllegalArgumentException( All checkers need to use the same StringDistance.); } by else if (!stringDistance.equals(checker.getStringDistance())) { throw new IllegalArgumentException( All checkers need to use the same StringDistance!!! 1: + checker.getStringDistance() + 2: + stringDistance); } as it was done in the patch but still, when I send a spellcheck request, I get the error msg: All checkers need to use the same StringDistance!!! 1:org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db32: org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08 From error message I gather both spellchecker use same distanceMeasure LuceneLevenshteinDistance, but they're not same instance of LuceneLevenshteinDistance. Is the condition all right? What should be done to fix this properly? Thanks, Elisabeth
proper routing (from non-Java client) in solr cloud 5.0.0
Hi all - I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0. Our client is written in Go, for which I am not aware of an existing Solr client library, so we wrote our own. One tricky bit for this was the routing logic; if a document has routing prefix X and belongs to collection Y, we need to know which solr node to connect to. Previously we accomplished this by watching the clusterstate.json file in zookeeper - at startup and whenever it changes, the client parses the file contents to build a routing table. However in 5.0 newly created collections do not show up in clusterstate.json but instead have their own state.json document. Are there any recommendations for how to handle this from the client? The obvious answer is to watch every collection's state.json document, but we run a lot of collections (~1000 currently, and growing) so I'm concerned about keeping that many watches open at the same time (should I be?). How does the SolrJ client handle this? Thanks! - Ian
RE: Securing solr index
That's a good point - if he's talking about securing the Solr filesystem, he can use standard mechanisms. You can also go beyond user/group/other permissions if your filesystem supports it. You can use Posix ACLs on many local linux filesystems. -Original Message- From: Per Steffensen [mailto:st...@designware.dk] Sent: Tuesday, April 14, 2015 8:04 AM To: solr-user@lucene.apache.org Subject: Re: Securing solr index Hi I might misunderstand you, but if you are talking about securing the actual files/folders of the index, I do not think this is a Solr/Lucene concern. Use standard mechanisms of your OS. E.g. on linux/unix use chown, chgrp, chmod, sudo, apparmor etc - e.g. allowing only root to write the folders/files and sudo the user running Solr/Lucene to operate as root in this area. Even admins should not (normally) operate as root - that way they cannot write the files either. No one knows the root-password - except maybe for the super-super-admin, or you split the root-password in two and two admins know a part each, so that they have to both agree in order to operate as root. Be creative yourself. Regards, Per Steffensen On 13/04/15 12:13, Suresh Vanasekaran wrote: Hi, We are having the solr index maintained in a central server and multiple users might be able to access the index data. May I know what are best practice for securing the solr index folder where ideally only application user should be able to access. Even an admin user should not be able to copy the data and use it in another schema. Thanks CAUTION - Disclaimer * This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. 
Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system. ***INFOSYS End of Disclaimer INFOSYS***
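The chmod/chown advice above can also be applied from Java when the application itself manages the index directory, via java.nio's POSIX permission support. A hedged sketch (the directory here is a temp placeholder, not a real Solr data dir; this only works on POSIX filesystems):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

public class LockDownIndexDir {

    // Restrict a directory to its owner (rwx------), the programmatic
    // equivalent of `chmod 700` on POSIX filesystems.
    static void restrictToOwner(Path dir) throws Exception {
        Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwx------"));
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("index-demo");
        restrictToOwner(dir);
        System.out.println(Files.getPosixFilePermissions(dir));
    }
}
```

As Per notes, though, this only keeps other OS users out; it does nothing against an admin with root, and filesystem ACLs or encryption are the next step beyond plain permission bits.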
Re: Indexing PDF and MS Office files
Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content as a bitmap image, so no text is extracted. -- Jack Krupansky On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xlx, and .xlx) files into Solr. I am facing the following issues. Request to please let me know what is going wrong with the indexing process. I am using solr 4.10.2 and using the default example server configuration that comes with Solr distribution. PDF Files - Indexing as such works fine, but when I query using *.* in the Solr Query console, metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks etc. Whatever be the PDF file, content is not being displayed. MS Office files - For some office files, everything works perfect and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. *Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser* I am using SolrJ to index the documents and below is the code snippet related to indexing. Please let me know where the issue is occurring. 
static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);
Thanks & Regards Vijay -- The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail. The views expressed in this communication may not necessarily be the view held by WHISHWORKS.
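The extractOnly check Jack suggests needs only a couple of extra request parameters on /update/extract: with extractOnly=true, Solr returns the text Tika extracted instead of indexing it. A minimal sketch of the parameter set, built as a plain map here rather than a live SolrJ call so it stays self-contained (extractFormat is optional):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractOnlyParams {

    // Parameters for a debugging round-trip against /update/extract.
    static Map<String, String> params() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("extractOnly", "true");   // return extracted content instead of indexing it
        p.put("extractFormat", "text"); // plain text instead of the default XHTML
        return p;
    }
}
```

If the extractOnly response comes back empty for the problem PDFs, the issue is in Tika's extraction (e.g. scanned/bitmap or protected PDFs) rather than in the schema or the fmap.content mapping.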
Re: Indexing PDF and MS Office files
Vijay, You could try different excel files with different formats to rule out the issue is with TIKA version being used. Thanks Murthy On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com wrote: Perhaps the PDF is protected and the content can not be extracted? i have an unverified suspicion that the tika shipped with solr 4.10.2 may not support some/all office 2013 document formats. On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/ Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content as a bitmap image, so no text is extracted. -- Jack Krupansky On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xlx, and .xlx) files into Solr. I am facing the following issues. Request to please let me know what is going wrong with the indexing process. I am using solr 4.10.2 and using the default example server configuration that comes with Solr distribution. PDF Files - Indexing as such works fine, but when I query using *.* in the Solr Query console, metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks etc. Whatever be the PDF file, content is not being displayed. MS Office files - For some office files, everything works perfect and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. 
*Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser* I am using SolrJ to index the documents and below is the code snippet related to indexing. Please let me know where the issue is occurring.
static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);
Thanks & Regards Vijay -- The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail. The views expressed in this communication may not necessarily be the view held by WHISHWORKS. -- Ph: 9845704792
Re: Indexing PDF and MS Office files
Perhaps the PDF is protected and the content can not be extracted? i have an unverified suspicion that the tika shipped with solr 4.10.2 may not support some/all office 2013 document formats. On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content as a bitmap image, so no text is extracted. -- Jack Krupansky On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xlx, and .xlx) files into Solr. I am facing the following issues. Request to please let me know what is going wrong with the indexing process. I am using solr 4.10.2 and using the default example server configuration that comes with Solr distribution. PDF Files - Indexing as such works fine, but when I query using *.* in the Solr Query console, metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks etc. Whatever be the PDF file, content is not being displayed. MS Office files - For some office files, everything works perfect and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. *Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser* I am using SolrJ to index the documents and below is the code snippet related to indexing. Please let me know where the issue is occurring. 
static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);
Thanks & Regards Vijay -- The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail. The views expressed in this communication may not necessarily be the view held by WHISHWORKS.
Re: Java.net.socketexception: broken pipe Solr 4.10.2
We ran into this during our indexing process running on 4.10.3. After increasing zookeeper timeouts, client timeouts, and socket timeouts, and implementing retry logic on our loading process, the thing that worked was to change the Hard Commit timing. We were performing a Hard Commit every 5 minutes, and after a couple hours of loading data some of the shards would start going down because they would timeout with zookeeper and/or close connections. Changing the timeouts just moved the problem later in the ingest process. Through a combination of decreasing the hard commit timing to 15 seconds and migrating to G1 garbage collection, we are able to prevent ingest failures. For us the periodic stop-the-world garbage collections were causing connections to be closed and other nasty things such as zookeeper timeouts that would cause recovery to kick in. (Soft commits are turned off until the full ingest/baseline completes). I believe until a Hard Commit is issued Solr keeps the data in memory, which explains why we were experiencing nasty garbage collects. The other change we made which may have helped is that we ensured the socket timeouts were in sync between the jetty instance running Solr and the SolrJ client loading the data. During some of our batch updates Solr would take a couple minutes to respond back, which I believe in some instances meant the server side of the socket would be closed (maxIdleTime setting in Jetty). Hope this helps, Jaime Spicciati On Tue, Apr 14, 2015 at 9:26 AM, vsilgalis vsilga...@gmail.com wrote: Right now index size is about 10GB on each shard (yes I could use more RAM), but I'm looking more for a step-up rather than step-down approach. I will try adding more RAM to these machines as my next step. 1. Zookeeper is external to these boxes in a three node cluster with more than enough RAM to keep everything off disk. 2. os disk cache, when I add more RAM I will just add it as RAM for the machine and not to the Java Heap unless that is something you recommend. 3.
java heap looks good so far, GC is minimal as far as I can tell but I can look into this some more. 4. we do have 2 cores per machine, but the second core is a joke (10MB) note: zkClientTimeout is set to 30 for safety's sake. java settings: -XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:+ParallelRefProcEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=6000 -XX:CMSTriggerPermRatio=80 -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1 -XX:PretenureSizeThreshold=64m -XX:+CMSScavengeBeforeRemark -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxTenuringThreshold=8 -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=4 -XX:NewRatio=3 -XX:-UseSuperWord -Xmx5588m -Xms1596m -- View this message in context: http://lucene.472066.n3.nabble.com/Java-net-socketexception-broken-pipe-Solr-4-10-2-tp4199484p4199561.html Sent from the Solr - User mailing list archive at Nabble.com.
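For reference, the 15-second hard-commit interval Jaime describes corresponds to the autoCommit block in solrconfig.xml. A sketch with the value from this thread; openSearcher=false flushes the transaction log and segments to disk without opening a new searcher, which is the usual choice while soft commits are off during a bulk load:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- hard commit at most every 15 seconds -->
    <maxTime>15000</maxTime>
    <!-- flush to disk without opening a new searcher -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```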
Re: [ANNOUNCE] Apache Solr 5.1.0 released
Hi Joe, This should help you: http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.upgrading_from_solr_5.0 On Tue, Apr 14, 2015 at 12:47 PM, Joseph Obernberger j...@lovehorsepower.com wrote: Great news! Any tips on how to do an upgrade from 5.0.0 to 5.1.0? Thank you! -Joe On 4/14/2015 2:39 PM, Timothy Potter wrote: I apologize - Yonik prepared these nice release notes for 5.1 and I neglected to include them: Solr 5.1 Release Highlights: * The new Facet Module, including the JSON Facet API. This module is currently marked as experimental to allow for further API feedback and improvements. * A new JSON request API. This feature is currently marked as experimental to allow for further API feedback and improvements. * The ability to upload and download Solr configurations via SolrJ (CloudSolrClient). * First-class support for Real-Time Get in SolrJ. * Spatial 2D heat-map faceting. * EnumField now has docValues support. * API to dynamically add Jars to Solr's classpath for plugins. * Ability to enable/disable individual stats in the StatsComponent. * lucene/solr query syntax to give any query clause a constant score. * Schema API enhancements to remove or replace fields, dynamic fields, field types and copy fields. * When posting XML or JSON to Solr with curl, there is no need to specify the content type. * A list of update processors to be used for an update can be specified dynamically for any given request. * StatsComponent now supports Percentiles. * facet.contains option to limit which constraints are returned. * Streaming Aggregation for SolrCloud. * The admin UI now visualizes Lucene segment information. * Parameter substitution / macro expansion across entire request On Tue, Apr 14, 2015 at 11:42 AM, Timothy Potter thelabd...@gmail.com wrote: 14 April 2015 - The Lucene PMC is pleased to announce the release of Apache Solr 5.1.0. 
Solr 5.1.0 is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0 Solr 5.1.0 includes 39 new features, 40 bug fixes, and 36 optimizations / other changes from over 60 unique contributors. For detailed information about what is included in 5.1.0 release, please see: http://lucene.apache.org/solr/5_1_0/changes/Changes.html Enjoy! -- Anshum Gupta
Re: sort by a copy field error
On 4/14/2015 11:32 AM, Pedro Figueiredo wrote: And when I try to sort by name_sort the following error is raised: error: { msg: sort param field can't be found: name_sort, code: 400 } What was the exact sort parameter you sent to Solr? Did you reload the core or restart Solr and then reindex after you changed your schema? A reindex will be required. http://wiki.apache.org/solr/HowToReindex Thanks, Shawn
Re: sort by a copy field error
Hi Pedro Please post the request that produces that error Andrea On 14 Apr 2015 19:33, Pedro Figueiredo pjlfigueir...@criticalsoftware.com wrote: Hello, I have a pretty basic question: how can I sort by a copyfield? My schema conf is: field name=name type=text_general_edge_ngram indexed=true stored=true omitNorms=true termVectors=true/ field name=name_sort type=string indexed=true stored=false/ copyField source=name dest=name_sort / And when I try to sort by name_sort the following error is raised: error: { msg: sort param field can't be found: name_sort, code: 400 } Thanks in advanced, Pedro Figueiredo
Re: proper routing (from non-Java client) in solr cloud 5.0.0
Hi Ian, As per my understanding, Solrj does not use Zookeeper watches but instead caches the information (along with a TTL). You can find more information here, https://issues.apache.org/jira/browse/SOLR-5473 https://issues.apache.org/jira/browse/SOLR-5474 Regards Hrishikesh On Tue, Apr 14, 2015 at 8:49 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0. Our client is written in Go, for which I am not aware of a client, so we wrote our own. One tricky bit for this was the routing logic; if a document has routing prefix X and belong to collection Y, we need to know which solr node to connect to. Previously we accomplished this by watching the clusterstate.json file in zookeeper - at startup and whenever it changes, the client parses the file contents to build a routing table. However in 5.0 newly create collections do not show up in clusterstate.json but instead have their own state.json document. Are there any recommendations for how to handle this from the client? The obvious answer is to watch every collection's state.json document, but we run a lot of collections (~1000 currently, and growing) so I'm concerned about keeping that many watches open at the same time (should I be?). How does the SolrJ client handle this? Thanks! - Ian
Re: Disable or limit the size of Lucene field cache
Thank you.. This really helps. -- View this message in context: http://lucene.472066.n3.nabble.com/Disable-or-limit-the-size-of-Lucene-field-cache-tp4198798p4199646.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: proper routing (from non-Java client) in solr cloud 5.0.0
Hi Hrishikesh, Thanks for the pointers - I had not looked at SOLR-5474 (https://issues.apache.org/jira/browse/SOLR-5474) previously. Interesting approach... I think we will stick with trying to keep zk watches open from all clients to all collections for now, but if that starts to be a bottleneck it's good to know the route that SolrJ has chosen... cheers, Ian On Tue, Apr 14, 2015 at 3:56 PM, Hrishikesh Gadre gadre.s...@gmail.com wrote: Hi Ian, As per my understanding, Solrj does not use Zookeeper watches but instead caches the information (along with a TTL). You can find more information here, https://issues.apache.org/jira/browse/SOLR-5473 https://issues.apache.org/jira/browse/SOLR-5474 Regards Hrishikesh On Tue, Apr 14, 2015 at 8:49 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0. Our client is written in Go, for which I am not aware of a client, so we wrote our own. One tricky bit for this was the routing logic; if a document has routing prefix X and belong to collection Y, we need to know which solr node to connect to. Previously we accomplished this by watching the clusterstate.json file in zookeeper - at startup and whenever it changes, the client parses the file contents to build a routing table. However in 5.0 newly create collections do not show up in clusterstate.json but instead have their own state.json document. Are there any recommendations for how to handle this from the client? The obvious answer is to watch every collection's state.json document, but we run a lot of collections (~1000 currently, and growing) so I'm concerned about keeping that many watches open at the same time (should I be?). How does the SolrJ client handle this? Thanks! - Ian
Re: [ANNOUNCE] Apache Solr 5.1.0 released
Great news! Any tips on how to do an upgrade from 5.0.0 to 5.1.0? Thank you! -Joe
On 4/14/2015 2:39 PM, Timothy Potter wrote: I apologize - Yonik prepared these nice release notes for 5.1 and I neglected to include them:
Solr 5.1 Release Highlights:
* The new Facet Module, including the JSON Facet API. This module is currently marked as experimental to allow for further API feedback and improvements.
* A new JSON request API. This feature is currently marked as experimental to allow for further API feedback and improvements.
* The ability to upload and download Solr configurations via SolrJ (CloudSolrClient).
* First-class support for Real-Time Get in SolrJ.
* Spatial 2D heat-map faceting.
* EnumField now has docValues support.
* API to dynamically add jars to Solr's classpath for plugins.
* Ability to enable/disable individual stats in the StatsComponent.
* Lucene/Solr query syntax to give any query clause a constant score.
* Schema API enhancements to remove or replace fields, dynamic fields, field types and copy fields.
* When posting XML or JSON to Solr with curl, there is no need to specify the content type.
* A list of update processors to be used for an update can be specified dynamically for any given request.
* StatsComponent now supports percentiles.
* facet.contains option to limit which constraints are returned.
* Streaming Aggregation for SolrCloud.
* The admin UI now visualizes Lucene segment information.
* Parameter substitution / macro expansion across the entire request
On Tue, Apr 14, 2015 at 11:42 AM, Timothy Potter thelabd...@gmail.com wrote: 14 April 2015 - The Lucene PMC is pleased to announce the release of Apache Solr 5.1.0. Solr 5.1.0 is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0 Solr 5.1.0 includes 39 new features, 40 bug fixes, and 36 optimizations / other changes from over 60 unique contributors.
For detailed information about what is included in the 5.1.0 release, please see: http://lucene.apache.org/solr/5_1_0/changes/Changes.html Enjoy!
JSON Facet Analytics API in Solr 5.1
Folks, there's a new JSON Facet API in the just released Solr 5.1 (actually, a new facet module under the covers too). It's marked as experimental so we have time to change the API based on your feedback. So let us know what you like, what you would change, what's missing, or any other ideas you may have! I've just started the documentation for the reference guide (on our confluence wiki), so for now the best doc is on my blog: http://yonik.com/json-facet-api/ http://yonik.com/solr-facet-functions/ http://yonik.com/solr-subfacets/ I'll also be hanging out more on the #solr-dev IRC channel on freenode if you want to hit me up there about any development ideas. -Yonik
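To make Yonik's announcement concrete, here is a sketch of what a JSON Facet API request body looks like, with a terms facet and a nested stats subfacet. The collection and field names ("cat", "price") are illustrative assumptions; the body would be POSTed to a collection's /query endpoint (see the blog posts above for the authoritative syntax).

```python
import json

# Build a JSON Facet API request (new in Solr 5.1): a terms facet over
# a hypothetical "cat" field, with an avg(price) subfacet per bucket.
facet_request = {
    "query": "*:*",
    "facet": {
        "categories": {
            "type": "terms",        # bucket by distinct field values
            "field": "cat",
            "limit": 5,
            "facet": {
                "avg_price": "avg(price)"   # stats function as a subfacet
            }
        }
    }
}
payload = json.dumps(facet_request)
```

From the shell this might be sent as something like `curl http://localhost:8983/solr/mycollection/query -d "$payload"` (host and collection name assumed).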
using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1
Hello, I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and FileBasedSpellChecker in the same request. I've applied the change from 135.patch (cf. SOLR-6271). I tried running the command patch -p1 -i 135.patch --dry-run, but it didn't work, maybe because the patch was a fix for Solr 4.9, so I just replaced this line in ConjunctionSolrSpellChecker: else if (!stringDistance.equals(checker.getStringDistance())) { throw new IllegalArgumentException("All checkers need to use the same StringDistance."); } with: else if (!stringDistance.equals(checker.getStringDistance())) { throw new IllegalArgumentException("All checkers need to use the same StringDistance!!! 1: " + checker.getStringDistance() + " 2: " + stringDistance); } as was done in the patch. But still, when I send a spellcheck request, I get the error message: All checkers need to use the same StringDistance!!! 1: org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db3 2: org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08 From the error message I gather both spellcheckers use the same distanceMeasure, LuceneLevenshteinDistance, but they are not the same instance of LuceneLevenshteinDistance. Is the condition right? What should be done to fix this properly? Thanks, Elisabeth
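The symptom Elisabeth describes - same class, different instances, equals() returning false - is what happens when a class does not override equals(), so comparison falls back to reference identity. A minimal Python analogue of the pitfall (class names here are illustrative, not Lucene's):

```python
# Without an explicit equality method, two equivalent distance objects
# compare by identity - the same default behavior as Java's Object.equals(),
# which is why the two LuceneLevenshteinDistance instances fail the check.
class LevenshteinDistance:
    pass  # no __eq__, so == falls back to object identity

a, b = LevenshteinDistance(), LevenshteinDistance()
same_instance = a == b          # False: distinct instances never compare equal

# One possible fix: compare by concrete type instead of identity.
class ComparableDistance:
    def __eq__(self, other):
        return type(other) is type(self)
    def __hash__(self):
        return hash(type(self))

fixed = ComparableDistance() == ComparableDistance()   # True
```

In the Java code, the analogous options would be comparing the two objects' classes rather than calling equals(), or giving the distance class an equals() override - which is presumably what the proper fix looks like.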