Re: How to index pdf's content with SolrJ?
Still I am not able to index my docs in Solr.
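For context, the usual SolrJ route for PDF content is to stream the file to the ExtractingRequestHandler (Solr Cell). The sketch below is only a rough illustration, assuming a /update/extract handler is enabled in solrconfig.xml and a schema with an "id" uniqueKey, a "text" field, and an attr_* dynamic field; the server URL, file name, and field names are placeholders. On older SolrJ 3.x the client class is CommonsHttpSolrServer and addFile() takes no content type.

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; assumes /update/extract (Solr Cell) is configured in solrconfig.xml.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("sample.pdf"), "application/pdf"); // older SolrJ: addFile(new File("sample.pdf"))
        req.setParam("literal.id", "doc1");       // supply the uniqueKey value explicitly
        req.setParam("fmap.content", "text");     // map the extracted body to the "text" field (assumed name)
        req.setParam("uprefix", "attr_");         // unknown metadata fields go to attr_* (assumed dynamic field)
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // commit after the upload

        server.request(req);
    }
}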
Re: null pointer error with solr deduplication
Hi,

I might be wrong, but it is your responsibility to ensure unique doc IDs across shards. Read this page: http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations, particularly:

- Documents must have a unique key and the unique key must be stored (stored=true in schema.xml).
- *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic.

So Solr behaves as it should :) _unexpectedly_. But I agree that there should be no error, especially not an NPE.

Best Regards,
Alexander Aristov

On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote:

Hello,

I have been trying out deduplication in Solr by following http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the value of the signature created from a few other fields in a document, and the idea works like a charm in a single Solr instance. But when I have multiple cores and try to do a distributed search (http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id), I get the error pasted below. While a normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id field contains duplicate IDs since I'm testing with the same set of documents indexed in both cores (dedupe, dedupe2). Any insights would be highly appreciated.

Thanks

20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
    at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
    at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
    at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
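As a concrete illustration of Alexander's point about keeping the uniqueKey unique across shards, one option is to prefix each document's key with the core it is indexed into while keeping the original id in a separate field for faceting. This is only a sketch under assumptions: the core URLs come from the thread, but the field names ("id", "doc_id") and the sample value are placeholders, and it presumes "id" is the uniqueKey.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ShardUniqueIds {
    public static void main(String[] args) throws Exception {
        // Index the same source document into both cores with distinct unique keys.
        indexTo("http://localhost:8080/solr/dedupe", "dedupe", "12345");
        indexTo("http://localhost:8080/solr/dedupe2", "dedupe2", "12345");
    }

    static void indexTo(String coreUrl, String corePrefix, String sourceId) throws Exception {
        SolrServer server = new HttpSolrServer(coreUrl);
        SolrInputDocument doc = new SolrInputDocument();
        // Prefix the uniqueKey with the core name so the same source document
        // never produces duplicate keys across shards.
        doc.addField("id", corePrefix + "-" + sourceId);
        doc.addField("doc_id", sourceId); // original id kept separately, e.g. for faceting
        server.add(doc);
        server.commit();
    }
}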
Re: Storing the md5 hash of pdf files as a field in the index
Hi Otis, thank you very much for the quick response to my question. I'll have a look at your suggested solution. Do you know if there's any documentation about writing such an Update Request Handler or how to trigger it using the Data Import/Tika combination? Thanks. Joe
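For what it's worth, the component that usually does this kind of per-document work is an UpdateRequestProcessor rather than a request handler; the Deduplication wiki page shows how a processor chain is declared in solrconfig.xml and referenced from the handler that receives the documents (DIH or /update/extract). The sketch below is only an illustration of a custom processor that computes an MD5 over a "content" field and stores it in an "md5" field; both field names are assumptions and must exist in schema.xml. Alternatively, Solr's SignatureUpdateProcessorFactory can be pointed at the bundled MD5Signature class if its built-in behavior is enough.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class Md5UpdateProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new Md5UpdateProcessor(next);
    }

    static class Md5UpdateProcessor extends UpdateRequestProcessor {

        Md5UpdateProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object content = doc.getFieldValue("content"); // field holding the extracted PDF text (assumed name)
            if (content != null) {
                try {
                    MessageDigest md = MessageDigest.getInstance("MD5");
                    byte[] digest = md.digest(content.toString().getBytes("UTF-8"));
                    StringBuilder hex = new StringBuilder();
                    for (byte b : digest) {
                        hex.append(String.format("%02x", b));
                    }
                    doc.setField("md5", hex.toString()); // "md5" must be declared in schema.xml (assumed name)
                } catch (NoSuchAlgorithmException e) {
                    throw new RuntimeException(e);
                }
            }
            super.processAdd(cmd); // continue down the processor chain
        }
    }
}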
RE: Special characters in synonyms.txt on Solr 3.5
Thanks, that worked like a charm. Should've thought about that :)

/ Carl

From: Robert Muir [rcm...@gmail.com]
Sent: 20 April 2012 18:21
To: solr-user@lucene.apache.org
Subject: Re: Special characters in synonyms.txt on Solr 3.5

On Fri, Apr 20, 2012 at 12:10 PM, carl.nordenf...@bwinparty.com wrote:

Directly injecting the letter ö into synonyms like so:

island, ön
island, ön

renders the following exception on startup (both lines render the same error):

java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 3
    at org.apache.solr.analysis.FSTSynonymFilterFactory.inform(FSTSynonymFilterFactory.java:92)
    at org.apache.solr.analysis.SynonymFilterFactory.inform(SynonymFilterFactory.java:50)

The synonyms file needs to be in UTF-8 encoding.

--
lucidimagination.com
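If the synonyms file was saved in a single-byte encoding such as ISO-8859-1 (only a guess about the source encoding; it depends on the editor used), a one-off conversion like the sketch below can rewrite it as UTF-8 before Solr loads it. File names and the source charset are assumptions.

import java.io.*;
import java.nio.charset.Charset;

public class ReencodeSynonyms {
    public static void main(String[] args) throws IOException {
        // Read synonyms.txt assuming it was saved as ISO-8859-1 (Latin-1),
        // then write it back out as UTF-8 so SynonymFilterFactory can parse it.
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("synonyms.txt"), Charset.forName("ISO-8859-1")));
        Writer writer = new OutputStreamWriter(
                new FileOutputStream("synonyms.utf8.txt"), Charset.forName("UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write('\n');
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}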
Re: # open files with SolrCloud
I can reproduce some kind of searcher leak issue here, even w/o SolrCloud, and I've opened https://issues.apache.org/jira/browse/SOLR-3392 -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
Analyzers and ReuseStrategy in Solr 4
Hi,

I developed a custom analyzer. This analyzer needs to be polymorphic according to the first 4 characters of the text to be analyzed. In order to do this I implemented my own ReuseStrategy class (NoReuseStrategy), and in the constructor I call super(new NoReuseStrategy()). At the Lucene level it works fine: the analyzer always rebuilds the TokenStreamComponents. With Solr it doesn't work, because Solr embeds all analyzers in an AnalyzerWrapper, and AnalyzerWrapper overrides my NoReuseStrategy with PerFieldReuseStrategy. Is there any way to define a custom ReuseStrategy class and use it in Solr?

Thank you.
Dominique
Re: Storing the md5 hash of pdf files as a field in the index
The SignatureUpdateProcessor implements a smaller, faster cryptohash. It is used by the de-duplication feature. What's the purpose? Do you need the MD5 algorithm, or is any competent cryptohash good enough?

On Sat, Apr 21, 2012 at 5:55 AM, kuchenbr...@mail.org wrote:

Hi Otis, thank you very much for the quick response to my question. I'll have a look at your suggested solution. Do you know if there's any documentation about writing such an Update Request Handler or how to trigger it using the Data Import/Tika combination? Thanks. Joe

--
Lance Norskog
goks...@gmail.com
Re: Opposite to MoreLikeThis?
Are these documents classified already? It sounds like it would be much faster to suppress documents with the same tags as your target tags.

On Fri, Apr 20, 2012 at 4:16 PM, Darren Govoni dar...@ontrenet.com wrote:

You could run the MLT for the document in question, then gather all the doc IDs in the MLT results and negate them in a subsequent query. Not sure how robust that would be with very large result sets, but it's something to try. Another approach would be to gather the interesting terms from the document in question and then negate those terms in subsequent queries. Perhaps with many negated terms, Solr will rank results based on the most negated terms above the less negated ones, simulating a ranked "less like" effect.

On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote:

Hi all,

Is there a way to implement the opposite of MoreLikeThis (LessLikeThis, I guess :)? The requirement we have is to remove all documents with content like that of a given document ID or a text provided by the end user. In the current index implementation (not using Solr), the user can narrow results by indicating which documents are not relevant to him and then request to remove from the search results any document whose content is like that of the selected documents.

Our index has close to 100 million documents, and they cover multiple topics that are not related to one another. So a search for some broad terms may retrieve documents about engineering, agriculture, communications, etc. As the user is trying to discover the relevant documents, he may select an agriculture-related document to exclude it and the documents like it from the result set; the same with engineering-like content, etc., until most of the documents are about communications. Of course, some exclusions may actually remove relevant content, but those filters can be removed to go back to the previous set of results.

Any ideas from similar implementations or suggestions are welcomed!

Thanks,
Carlos

--
Lance Norskog
goks...@gmail.com
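A rough SolrJ sketch of Darren's first suggestion (run MoreLikeThis for the excluded document, then negate the returned IDs on the next query) might look like the following. The /mlt handler path, field names, URL, and row cap are all assumptions; with an index of roughly 100 million documents the excluded-ID list would have to stay small, and IDs containing special characters would need escaping before being placed in the filter query.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class LessLikeThisSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // 1) Ask the MoreLikeThis handler for documents similar to the one the user excluded.
        SolrQuery mlt = new SolrQuery("id:12345");   // the document marked as "not relevant" (placeholder id)
        mlt.set("qt", "/mlt");                       // assumes an /mlt handler is configured in solrconfig.xml
        mlt.set("mlt.fl", "content");                // similarity field (assumed name)
        mlt.setRows(50);                             // cap how many similar docs get excluded
        SolrDocumentList similar = server.query(mlt).getResults();

        // 2) Build a negative filter from the similar-document IDs.
        StringBuilder fq = new StringBuilder("-id:(12345");
        for (SolrDocument d : similar) {
            fq.append(" OR ").append(d.getFieldValue("id"));
        }
        fq.append(")");

        // 3) Re-run the user's original search with the exclusion filter applied.
        SolrQuery q = new SolrQuery("broad search terms");
        q.addFilterQuery(fq.toString());
        System.out.println(server.query(q).getResults().getNumFound());
    }
}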
Re: # open files with SolrCloud
Yonik,

We have this same issue on our production systems with a Solr 4 trunk build running on CentOS, JDK 6 64-bit. I have reported "java.io.IOException: Map failed" and "Too many open files" issues; it seems there is a searcher leak in Solr that is not closing searchers, so files are being kept open. It would be a great help if we can resolve this issue. I was going to try the latest build, but it seems this issue is not resolved yet:

http://lucene.472066.n3.nabble.com/Large-Index-and-OutOfMemoryError-Map-failed-td3872891.html

On Sat, Apr 21, 2012 at 11:57 AM, Yonik Seeley yo...@lucidimagination.com wrote:

I can reproduce some kind of searcher leak issue here, even w/o SolrCloud, and I've opened https://issues.apache.org/jira/browse/SOLR-3392

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
Re: # open files with SolrCloud
Forgot to mention: we are not using SolrCloud yet, but we do use the Lucene NRT feature. This issue is happening WITHOUT SolrCloud.

On Sat, Apr 21, 2012 at 8:14 PM, Gopal Patwa gopalpa...@gmail.com wrote:

Yonik,

We have this same issue on our production systems with a Solr 4 trunk build running on CentOS, JDK 6 64-bit. I have reported "java.io.IOException: Map failed" and "Too many open files" issues; it seems there is a searcher leak in Solr that is not closing searchers, so files are being kept open. It would be a great help if we can resolve this issue. I was going to try the latest build, but it seems this issue is not resolved yet:

http://lucene.472066.n3.nabble.com/Large-Index-and-OutOfMemoryError-Map-failed-td3872891.html

On Sat, Apr 21, 2012 at 11:57 AM, Yonik Seeley yo...@lucidimagination.com wrote:

I can reproduce some kind of searcher leak issue here, even w/o SolrCloud, and I've opened https://issues.apache.org/jira/browse/SOLR-3392

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10