Re: How to index pdf's content with SolrJ?

2012-04-21 Thread vasuj
Still, I am not able to index my docs in Solr.
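
For reference, a minimal SolrJ sketch of the usual route, posting the PDF to the
ExtractingRequestHandler at /update/extract. The Solr URL, file name, and field
names below are placeholders, the classes are from the Solr 3.x SolrJ API, and
none of this is taken from the thread itself:

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Stream the raw PDF to the extracting handler; Tika parses it server-side.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("sample.pdf"));
        req.setParam("literal.id", "doc-1");    // supplies the uniqueKey value
        req.setParam("fmap.content", "text");   // map the extracted body into the "text" field
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
    }
}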



Re: null pointer error with solr deduplication

2012-04-21 Thread Alexander Aristov
Hi

I might be wrong, but it's your responsibility to ensure doc IDs are unique across
shards.

Read this page:
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

particularly:

   - Documents must have a unique key and the unique key must be stored
     (stored=true in schema.xml).

   - *The unique key field must be unique across all shards.* If docs with
     duplicate unique keys are encountered, Solr will make an attempt to return
     valid results, but the behavior may be non-deterministic.

So Solr behaves as it should :) _unexpectedly_

But I agree in the sense that there should be no error, especially not an
NPE.
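
A minimal SolrJ sketch of one way to satisfy that constraint, namely prefixing
each document id with the core it is indexed into. The core name, the doc_id
field (borrowed from Peter's facet query), and the 3.x SolrJ classes are
assumptions for illustration only:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ShardSafeIndexer {
    public static void main(String[] args) throws Exception {
        // One core per shard; the core name becomes part of the uniqueKey value.
        String coreName = "dedupe";
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr/" + coreName);

        SolrInputDocument doc = new SolrInputDocument();
        // The same logical document indexed into dedupe and dedupe2 now carries
        // different doc_id values, so distributed merging never sees duplicate keys.
        doc.addField("doc_id", coreName + "-42");
        doc.addField("text", "example body");

        server.add(doc);
        server.commit();
    }
}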

Best Regards
Alexander Aristov


On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote:

 Hello,

 I have been trying out deduplication in Solr by following:
 http://wiki.apache.org/solr/Deduplication. I have defined a signature field
 to hold the value of the signature created from a few other fields in a
 document, and the idea seems to work like a charm in a single Solr instance.
 But, when I have multiple cores and try to do a distributed search (

 http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id
 )
 I get the error pasted below. While a normal search (with just q) works fine,
 the facet/stats queries seem to be the culprit. The doc_id field contains
 duplicate ids since I'm testing the same set of documents indexed in both
 cores (dedupe, dedupe2). Any insights would be highly appreciated.

 Thanks



 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.NullPointerException
 at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
 at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
 at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
 at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
 at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
 at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
 at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
 at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



Re: Storing the md5 hash of pdf files as a field in the index

2012-04-21 Thread kuchenbrett
Hi Otis,

 thank you very much for the quick response to my question. I'll have a look at 
your suggested solution. Do you know if there's any documentation about writing 
such an Update Request Handler or how to trigger it using the Data Import/Tika 
combination?

 Thanks.
 Joe


RE: Special characters in synonyms.txt on Solr 3.5

2012-04-21 Thread carl.nordenf...@bwinparty.com
Thanks,

That worked like a charm.
Should've thought about that :)

/ Carl

From: Robert Muir [rcm...@gmail.com]
Sent: 20 April 2012 18:21
To: solr-user@lucene.apache.org
Subject: Re: Special characters in synonyms.txt on Solr 3.5

On Fri, Apr 20, 2012 at 12:10 PM, carl.nordenf...@bwinparty.com
carl.nordenf...@bwinparty.com wrote:
 Directly injecting the letter ö into synonyms like so:
 island, ön
 island, ön

 renders the following exception on startup (both lines render the same
 error):

 java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 3
 at org.apache.solr.analysis.FSTSynonymFilterFactory.inform(FSTSynonymFilterFactory.java:92)
 at org.apache.solr.analysis.SynonymFilterFactory.inform(SynonymFilterFactory.java:50)

The synonyms file needs to be in UTF-8 encoding.
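
A throwaway converter along these lines can re-encode an existing file. It
assumes the current synonyms.txt was saved as ISO-8859-1 (adjust the source
charset to whatever your editor actually used); the file names are placeholders:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class SynonymsToUtf8 {
    public static void main(String[] args) throws IOException {
        // Read with the charset the file was actually written in, then write it
        // back out as UTF-8 so SynonymFilterFactory can parse the ö characters.
        Reader in = new InputStreamReader(new FileInputStream("synonyms.txt"), "ISO-8859-1");
        Writer out = new OutputStreamWriter(new FileOutputStream("synonyms-utf8.txt"), "UTF-8");
        char[] buf = new char[4096];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        in.close();
        out.close();
    }
}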



--
lucidimagination.com


Re: # open files with SolrCloud

2012-04-21 Thread Yonik Seeley
I can reproduce some kind of searcher leak issue here, even w/o
SolrCloud, and I've opened
https://issues.apache.org/jira/browse/SOLR-3392

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


Analyzers and ReuseStrategy in Solr 4

2012-04-21 Thread Dominique Bejean

Hi,

I developed a custom analyzer. This analyzer needs to be polymorphous
according to the first 4 characters of the text to be analyzed. In order
to do this, I implemented my own ReuseStrategy class (NoReuseStrategy), and
in the constructor I call super(new NoReuseStrategy());


At the Lucene level it works fine: the analyzer always rebuilds the
TokenStreamComponents.
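
For context, a rough sketch of that pattern against the 4.x trunk Analyzer API
of the time. The exact ReuseStrategy method signatures shifted between trunk
revisions, so treat the class below as an approximation of the idea, not
Dominique's actual code:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;

public final class PolymorphousAnalyzer extends Analyzer {

    // Never caches components, so createComponents runs for every analysis request.
    // NOTE: the method signatures approximate the 4.0-era ReuseStrategy contract.
    private static final class NoReuseStrategy extends ReuseStrategy {
        @Override
        public TokenStreamComponents getReusableComponents(String fieldName) {
            return null; // always report "nothing cached"
        }
        @Override
        public void setReusableComponents(String fieldName, TokenStreamComponents components) {
            // deliberately store nothing
        }
    }

    public PolymorphousAnalyzer() {
        super(new NoReuseStrategy());
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The real analyzer would peek at the first few characters of the input
        // (for example through a pushback reader) and pick a different chain;
        // a single tokenizer keeps this sketch short.
        Tokenizer source = new KeywordTokenizer(reader);
        return new TokenStreamComponents(source);
    }
}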


With Solr it doesn't work, because Solr embeds all analyzers in an
AnalyzerWrapper, and AnalyzerWrapper overrides my NoReuseStrategy class
with the PerFieldReuseStrategy class.


Is there any way to define a custom ReuseStrategy class and use it in
Solr?


Thank you.

Dominique





Re: Storing the md5 hash of pdf files as a field in the index

2012-04-21 Thread Lance Norskog
The SignatureUpdateProcessor implements a smaller, faster cryptohash.
It is used by the de-duplication feature.

What's the purpose? Do you need the MD5 algorithm, or is any competent
cryptohash good enough?
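
If MD5 specifically is required, one alternative to a server-side processor is
computing the digest on the client and sending it along as an ordinary field.
A rough sketch of just the hashing part; how the value gets attached (for
example as literal.md5=... on an /update/extract request) depends on the import
setup and is an assumption here:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class Md5OfFile {
    // Returns the MD5 of the file at the given path as a lowercase hex string.
    public static String md5Hex(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}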

On Sat, Apr 21, 2012 at 5:55 AM,  kuchenbr...@mail.org wrote:
 Hi Otis,

  thank you very much for the quick response to my question. I'll have a look 
 at your suggested solution. Do you know if there's any documentation about 
 writing such an Update Request Handler or how to trigger it using the Data 
 Import/Tika combination?

  Thanks.
  Joe



-- 
Lance Norskog
goks...@gmail.com


Re: Opposite to MoreLikeThis?

2012-04-21 Thread Lance Norskog
Are these documents classified already? Sounds like it would be much
faster to suppress documents with the same tags as your target tags.

On Fri, Apr 20, 2012 at 4:16 PM, Darren Govoni dar...@ontrenet.com wrote:
 You could run MLT for the document in question, then gather all
 the doc ids in the MLT results and negate those in a subsequent
 query. Not sure how well that would work with very large result sets,
 but it's something to try.

 Another approach would be to gather the interesting terms from the
 document in question and then negate those terms in subsequent queries.
 Perhaps with many negated terms, Solr will rank results matching more of
 the negated terms above those matching fewer, simulating a ranked "less
 like" effect.
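
A rough SolrJ sketch of the first idea: run MLT for the rejected document, then
exclude whatever it returns from the follow-up query. The /mlt handler name,
field names, and row limit are assumptions, and as noted the negated-id list
will not scale to very large result sets:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class LessLikeThis {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // 1) Ask the MoreLikeThisHandler what resembles the document the user rejected.
        SolrQuery mlt = new SolrQuery("id:rejected-doc");
        mlt.setQueryType("/mlt");        // assumes a /mlt handler is configured
        mlt.set("mlt.fl", "text");
        mlt.setRows(100);

        StringBuilder exclude = new StringBuilder();
        for (SolrDocument d : server.query(mlt).getResults()) {
            if (exclude.length() > 0) {
                exclude.append(" OR ");
            }
            exclude.append('"').append(d.getFieldValue("id")).append('"');
        }

        // 2) Re-run the user's query with the look-alikes filtered out.
        SolrQuery q = new SolrQuery("broad terms");
        if (exclude.length() > 0) {
            q.addFilterQuery("-id:(" + exclude + ")");
        }
        System.out.println(server.query(q).getResults().getNumFound());
    }
}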

 On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote:
 Hi all,

 Is there a way to implement the opposite of MoreLikeThis (LessLikeThis, I
 guess :)?  The requirement we have is to remove all documents with content
 like that of a given document id or a text provided by the end-user.  In
 the current index implementation (not using Solr), the user can narrow
 results by indicating which document(s) are not relevant to him and then
 request to remove from the search results any document whose content is
 like that of the selected document(s).

 Our index has close to 100 million documents and they cover multiple topics
 that are not related to one another.  So, a search for some broad terms may
 retrieve documents about engineering, agriculture, communications, etc.  As
 the user is trying to discover the relevant documents, he may select an
 agriculture-related document to exclude it and those documents like it from
 the results set; same w/ engineering-like content, etc. until most of the
 documents are about communications.

 Of course, some exclusions may actually remove relevant content but those
 filters can be removed to go back to the previous set of results.

 Any ideas from similar implementations or suggestions are welcomed!
 Thanks,
 Carlos





-- 
Lance Norskog
goks...@gmail.com


Re: # open files with SolrCloud

2012-04-21 Thread Gopal Patwa
Yonik, we see this same issue in production with a Solr 4 trunk build
running on CentOS, JDK 6 64-bit.

I have reported the "java.io.IOException: Map failed" and "Too many open files"
issues; it seems there is a searcher leak in Solr that is not closing them, so
files are kept open.

It would be a great help if we could resolve this issue. I was going to try the
latest build, but it seems this issue is not resolved yet:

http://lucene.472066.n3.nabble.com/Large-Index-and-OutOfMemoryError-Map-failed-td3872891.html


On Sat, Apr 21, 2012 at 11:57 AM, Yonik Seeley
yo...@lucidimagination.com wrote:

 I can reproduce some kind of searcher leak issue here, even w/o
 SolrCloud, and I've opened
 https://issues.apache.org/jira/browse/SOLR-3392

 -Yonik
 lucenerevolution.com - Lucene/Solr Open Source Search Conference.
 Boston May 7-10



Re: # open files with SolrCloud

2012-04-21 Thread Gopal Patwa
Forgot to mention: we are not using SolrCloud yet, but we do use the Lucene NRT
feature. This issue is happening WITHOUT SolrCloud.

On Sat, Apr 21, 2012 at 8:14 PM, Gopal Patwa gopalpa...@gmail.com wrote:

 Yonik, we see this same issue in production with a Solr 4 trunk build
 running on CentOS, JDK 6 64-bit.

 I have reported the "java.io.IOException: Map failed" and "Too many open
 files" issues; it seems there is a searcher leak in Solr that is not closing
 them, so files are kept open.

 It would be a great help if we could resolve this issue. I was going to try
 the latest build, but it seems this issue is not resolved yet:


 http://lucene.472066.n3.nabble.com/Large-Index-and-OutOfMemoryError-Map-failed-td3872891.html


 On Sat, Apr 21, 2012 at 11:57 AM, Yonik Seeley yo...@lucidimagination.com
  wrote:

 I can reproduce some kind of searcher leak issue here, even w/o
 SolrCloud, and I've opened
 https://issues.apache.org/jira/browse/SOLR-3392

 -Yonik
 lucenerevolution.com - Lucene/Solr Open Source Search Conference.
 Boston May 7-10