Re: improving score of result set

2012-10-29 Thread Alexander Aristov
Interesting but not exactly what I want to get.

If I group items then I will get a small number of docs. I don't want this. I
need all of them.

Best Regards
Alexander Aristov


On 29 October 2012 12:05, yunfei wu yunfei...@gmail.com wrote:

 Besides changing the scoring algorithm, what about Field Collapsing -
 http://wiki.apache.org/solr/FieldCollapsing - to collapse the results from
 the same website URL?

 Yunfei


 On Mon, Oct 29, 2012 at 12:43 AM, Alexander Aristov 
 alexander.aris...@gmail.com wrote:

  Hi everybody,
 
  I have a question about scoring calculation algorithms and approaches.
 
  Let's say I have 10 documents. 8 of them come from one web site (I have
  a field in the schema with the URL) and the other 2 from two other, different
  web sites. So for this example I have 3 web sites.
 
  For some queries those 8 documents have better term matching and they
  appear at the top of the results. That means the 8 docs from one source come
  first and the other two come next and last.
 
  I want to artificially improve the score of those 2 docs and move them
  up. They don't necessarily have to go first, but if they came in the
  middle of the result set it would be perfect.
 
  One of the ideas is to reduce the score of docs in the result set that come
  from one site, so that if the set contains too many docs from one source the
  score of each of those docs is reduced proportionally.
 
  The important thing is that I don't want to reduce doc scores permanently,
  only at query time. Maybe function queries can help me?
 
  How can I do this, or maybe there are other ideas?
 
  Best Regards
  Alexander Aristov
 



Re: improving score of result set

2012-10-29 Thread Alexander Aristov
I think I understand it correctly.

Referring back to my example.

I will get 3 groups:
a large group with 8 documents in it and
two other groups with one document in each.

If I limit a group to 5 docs then the 1st group will have only 5 docs and the
other two will still contain one doc each.

And the order (based on score) won't be different. Each document in the
first group will have a higher score, won't it? Or is the document score in
each group calculated relatively, so that the top docs have similar scores?

So this approach just limits the number of similar documents. Instead I want to
keep all documents in the results but shuffle them appropriately.

Best Regards
Alexander Aristov


On 29 October 2012 15:55, Erick Erickson erickerick...@gmail.com wrote:

 I don't think you're reading the grouping right. When you use grouping,
 you get the top N groups, and within each group you get the top M
 scoring documents. So you can actually get _more_ documents back than in
 the non-grouping case and your app can then intelligently intersperse them
 however you want.
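 For example, a grouped request might be a sketch like this (the field name
 site is an assumption for whatever field holds the source URL):

   q=your query&group=true&group.field=site&group.limit=5&group.ngroups=true

 Here group.limit is the M above (top-scoring docs kept per source) and the
 rows parameter controls N (the number of groups returned).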

 Best
 Erick

 On Mon, Oct 29, 2012 at 5:02 AM, Alexander Aristov
 alexander.aris...@gmail.com wrote:
  Interesting but not exactly what I want to get.
 
  If I group items then I will get small number of docs. I don't want
 this. I
  need all of them.
 
  Best Regards
  Alexander Aristov
 
 
  On 29 October 2012 12:05, yunfei wu yunfei...@gmail.com wrote:
 
  Besides changing the scoring algorithm, what about Field Collapsing -
  http://wiki.apache.org/solr/FieldCollapsing - to collapse the results
 from
  same website url?
 
  Yunfei
 
 
  On Mon, Oct 29, 2012 at 12:43 AM, Alexander Aristov 
  alexander.aris...@gmail.com wrote:
 
   Hi everybody,
  
   I have a question about scoring calculation algorithms and approaches.
  
   Lets say I have 10 documents. 8 of the them come from one web site (I
  have
   a field in schema with URL) and the other 2 from other different web
  sites.
   So for this example I have 3 web sites.
  
   For some queries those 8 documents have better terms matching and they
   appear at the top of results. It makes that 8 docs from one source
 come
   first and the other two come next and the last.
  
   I want to maybe artificially improve score of those 2 docs and put
 them
   atop. I don't want that they necessarily go first but if they come in
 the
   middle of the result set it would be perfect.
  
   One of the ideas is to reduce score for docs in the result set from
 one
   site so that if it contains too many docs from one source total
 scoring
  of
   each those docs would be reduced proportionally.
  
   Important thing is that I don't want to reduce doc score permanently.
  Only
   at query time. Maybe some functional queries can help me?
  
   How can I do this or maybe there are other ideas.
  
   Best Regards
   Alexander Aristov
  
 



Re: improving score of result set

2012-10-29 Thread Alexander Aristov
Perhaps this is an XY problem.

First of all I don't have a site which I want to boost. All docs are equal.

Secondly I will explain what I have. I have 100 docs indexed. I do a query
which returns 10 docs: 8 of them from one site and 2 from other, different
sites. I don't like the order. Technically the scores are good. I
understand why these 8 docs go first - because they have better matching.
But I don't like it. I want articles from the smaller collections to
somehow compete with the other docs. For other queries the situation can change
and another site can produce more results. In that case I would lower that
site.

I've given it more thought and think I can try grouping.

More insights into my problem. These 8 docs have similar text which matches the
query, and that's why they all get a similar and relatively high score. For
example the docs have this text:

1. Red apple felt from tree
2. Blue apple felt from tree
3. Green apple felt from tree
...
8. Orange pineapple felt from tree
9. A boy felt suddenly ill. A tree was green.
10. Two pieces felt apart and never collapsed. Family tree was rich.

I query felt tree. Docs 1-8 are from one site.

I would like to make the score of docs 9 and 10 higher.

Grouping can help, but maybe there are other solutions.

Alexander
On 29.10.2012 22:11, Chris Hostetter hossman_luc...@fucit.org wrote:


 You've mentioned that you want to improve the scores of these documents,
 but you haven't really given any specifics about when/how/why you want to
 improve the score in general -- ie: in this example you have a total of
 10 docs, but how do you distinguish the 2 special docs from the 8 other
 docs?  is it because they are the only two docs with some specific
 field value, or is it just because they are in the smaller of two sets
 of documents if you partition on some field?  if you added 100 more docs
 that were all in the same set as those two, would you want the other 8
 documents to start getting boosted?

 Let's assume that what you are trying to ask is..

   I want to artificially boost the scores of documents when the 'site'
field contains 'cnn.com'

 A simple way to do that is just to add an optional clause to your query
 that matches on site:cnn.com so the scores of those documents will be
 increased, but make the main part of your query required...

q=+(your main query) site:cnn.com

 Or if you use the dismax or edismax parsers there are special params (bq
 and/or boost) that help make this easy to split out...


 https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
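
 For example, with edismax a boost query along these lines keeps the main
 query required and only nudges matching docs up (the field name and boost
 value are made up):

   q=your main query&defType=edismax&qf=title body&bq=site:cnn.com^10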



 FWIW: this smells like an XY problem ... more details about your actual
 situation and end goal would be helpful...

 https://people.apache.org/~hossman/#xyproblem
 XY Problem

 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341



 -Hoss



Re: improving score of result set

2012-10-29 Thread Alexander Aristov
You follow my problem exactly. I want to put the Obama doc from espn at the top
just because it is an exceptional and probably interesting occurrence. And its
score is low because the content is long or there are no matches in the title.
On 29.10.2012 23:18, Chris Hostetter hossman_luc...@fucit.org wrote:


 You haven't really explained things enough for us to help you...

 : First of all I don't have a site which I want to boost. All docs are
 equal.
 :
 : Secondly I will explain what I have. I have 100 docs indexed. I do a
 query
 : which returns 10 found docs. 8 of them from one site and 2 from other
 : different sites. I dont like order. Technically scores are good. I
 : understand why these 8 docs go first - because they havebetter matching.
 : But i dont like it. I want that articles from smaller collections would
 : somehow compete with other docs. For other queries situation can change
 and
 : another site can produce more results. In that case i would  lower that
 : site.

 *why* don't you like that order?  what is it that makes you think that
 order is bad? you say you want the articles from the smaller collection
 to compete with the other docs -- but they already have.  unless part of
 your query included a clause that is biased in favor of one collection
 then all of those documents got a fair score for the query you passed
 in.

 It might help if you gave us a specific, concrete example of some *real*
 queries and the *real* documents they return, and why you don't think those
 scores are fair.

 Because if i'm following your reasoning, and thinking about a situation
 where i might have an index full of webpages, and some of those web pages
 are from cnn.com and some of those pages are from espn.com then a
 query for Obama might match lots of pages from cnn.com, with high
 scores, and there might be *one* match on espn.com with an extremely low
 score, because Obama is mentioned one time in some quote or something in a
 *very* long page ... in what situation would it make any sense to bias the
 score of that one espn.com document to make it score higher than other
 documents from cnn.com that legitimately score better because they mention
 Obama in the title, or many times in the body of the page?


 -Hoss



Re: query syntax to find ??? chars

2012-07-12 Thread Alexander Aristov
Don't know why, but it doesn't work. :(

Best Regards
Alexander Aristov


On 11 July 2012 23:54, Yury Kats yuryk...@yahoo.com wrote:

 On 7/11/2012 2:55 PM, Alexander Aristov wrote:

  content:?? doesn't work :)

 I would try escaping them: content:\?\?\?\?\?\?






Re: null pointer error with solr deduplication

2012-04-21 Thread Alexander Aristov
Hi

I might be wrong, but it's your responsibility to ensure unique doc IDs across
shards.

read this page
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

particularly

   - Documents must have a unique key and the unique key must be stored
     (stored=true in schema.xml).

   - *The unique key field must be unique across all shards.* If docs with
     duplicate unique keys are encountered, Solr will make an attempt to return
     valid results, but the behavior may be non-deterministic.
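
In schema.xml terms that means something like this sketch (the field name id
is just the usual example):

   <field name="id" type="string" indexed="true" stored="true" required="true"/>
   ...
   <uniqueKey>id</uniqueKey>

and the values in that field must not repeat between the dedupe and dedupe2
cores.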

So Solr behaves as it should :) _unexpectedly_

But I agree in the sense that there should be no error, especially not an
NPE.

Best Regards
Alexander Aristov


On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote:

 Hello,

 I have been trying out deduplication in solr by following:
 http://wiki.apache.org/solr/Deduplication. I have defined a signature
 field
 to hold the values of the signature created based on few other fields in a
 document and the idea seems to work like a charm in a single solr instance.
 But, when I have multiple cores and try to do a distributed search (

  http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id
 )
 I get the error pasted below. While normal search (with just q) works fine,
 the facet/stats queries seem to be the culprit. The doc_id contains
 duplicate ids since I'm testing the same set of documents indexed in both
 the cores(dedupe, dedupe2). Any insights would be highly appreciated.

 Thanks



 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.NullPointerException
 at

 org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
 at

 org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
 at

 org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
 at

 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
 at

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at

 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
 at

 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
 at

 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
 at

 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
 at

 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
 at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
 at

 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
 at

 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
 at

 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
 at

 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



Re: Solr is not extracting the CDATA part of xml

2012-04-13 Thread Alexander Aristov
Hi

This is not the Solr format. You must re-format your XML into Solr's XML update
format. You may find examples on the Solr wiki or in the Solr examples dir.
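
For reference, an add document in Solr's XML update format is a sketch like
this (the field names are made up; they must match your schema.xml):

<add>
  <doc>
    <field name="id">msg-002</field>
    <field name="channel">100</field>
    <field name="role">115</field>
  </doc>
</add>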

Best Regards
Alexander Aristov


On 13 April 2012 23:13, srini softtec...@gmail.com wrote:

 Erick,

  Thanks for your reply. When you say Solr does not index arbitrary XML
  documents, then below is the way my XML document looks, which is sitting
  in Oracle. Could you suggest the best way of indexing it? Which method should
  I follow? Should I use XPathEntityProcessor?

  <?xml version="1.0" encoding="UTF-8" ?>
  <message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation"
  jar: id="002" message-type="create">
  <content>
  <dsp:row>
   <dsp:channel>100</dsp:channel>
   <dsp:role>115</dsp:role>
   </dsp:row>

   </body></content></message>

 Thanks in Advance
 Erick Erickson wrote
 
   Solr does not index arbitrary XML content. There is an XML
  form of a solr document that can be sent to Solr, but it is
  a specific form of XML.
 
  An example of the XML you're trying to index and what you mean
  by not working would be helpful.
 
  Best
  Erick
 
  On Fri, Apr 13, 2012 at 11:50 AM, srini lt;softtech88@gt; wrote:
   not sure why the CDATA part did not get interpreted. this is how the xml
   content looks. I added quotes just to present the exact xml content.
 
  body/body
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908341.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-is-not-extracting-the-CDATA-part-of-xml-tp3908317p3908791.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: default operation for a field

2012-04-02 Thread Alexander Aristov
Ok. got it. thanks

Best Regards
Alexander Aristov


On 2 April 2012 16:37, Erick Erickson erickerick...@gmail.com wrote:

 You can't set the default operator for a single field. This implies
 you're using edismax? If that's the case, your app layer can
 massage the query to something like
 term1 term2 term3 field_x:(term1 AND term2 AND term3). In which
 case field_x probably should not be in your qf parameter.
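
 For instance, if the user typed red apple tree into that field, the app
 layer would send something like (field_x as in the example above):

   q=red apple tree field_x:(red AND apple AND tree)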

 Best
 Erick

 On Mon, Apr 2, 2012 at 2:05 AM, Alexander Aristov
 alexander.aris...@gmail.com wrote:
  Hi,
 
   Just curious if it's possible to set the default operator for a field, not
   for the whole application. I have a field and I want it to always use the
   AND operator. Is it feasible?

   Users don't enter any operators for this field. Only one term or several
   separated by spaces. But if the default operator is set to OR then the
   field doesn't work as I expect. I need only AND. Maybe another solution
   is possible?
 
  Best Regards
  Alexander Aristov



Re: SolrCloud war?

2012-02-04 Thread Alexander Aristov
!!! OFF TOPIC, sorry !!!

I can't help but write this. The subject is very intriguing because
of the two meanings of the word war. :)

Best Regards
Alexander Aristov


On 4 February 2012 01:50, Mark Miller markrmil...@gmail.com wrote:


 On Feb 3, 2012, at 1:04 PM, Darren Govoni wrote:

   I deployed each war app into the /solr context. I presume it's needed
   for remote URL addressing.

 Yup - but you can override this by setting the hostContext in solr.xml. It
 defaults to solr as that fits the example jetty distribution.
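
 Roughly like this in solr.xml (the context name myapp is just an example):

   <cores adminPath="/admin/cores" hostContext="myapp" ...>
     ...
   </cores>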

 - Mark Miller
 lucidimagination.com



Re: solr keep old docs

2011-12-29 Thread Alexander Aristov
I have never developed for Solr yet and don't know much about its internals,
but today I tried one approach with the searcher.

In my update processor I get the searcher and search for the ID. It works but I
need to load test it. Will index traversal be faster (less resource
consuming) than search?

Best Regards
Alexander Aristov


On 29 December 2011 17:03, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, we're not communicating G...

 The update processor wouldn't search in the
 classic sense. It would just use lower-level
 index traversal to determine if the doc (identified
 by your unique key) was already in the index
 and skip indexing that document if it was. No real
 *searching* involved (see TermDocs.seek for one
 approach).

 The price would be that you are transmitting the
 document over to the Solr instance and then
 throwing it away.

 Best
 Erick

 On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
 mkhlud...@griddynamics.com wrote:
  Alexander,
 
  I have two ideas how to implement fast dedupe externally, assuming your
 PKs
  don't fit to java.util.*Map:
 
- your crawler can use inprocess RDBMS (Derby, H2) to track dupes;
- if your crawler is stateless - it doesn't track PKs which has been
already crawled, you can retrieve it from Solr via
http://wiki.apache.org/solr/TermsComponent . That's blazingly fast,
 but
it might be a problem with removed documents (I'm not sure). And it's
 also
can lead to OOMException (if you have too much PKs). Let me know if you
need a workaround for one of these problems.
 
  If you choose internal dedupe (UpdateProcessor), pls let me know if
  querying one-by-one will be to slow for your and you'll need to do it
  page-by-page. I did some of such paging, and will do something similar
  soon, so I'm interested in it.
 
  Regards
 
  On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov 
  alexander.aris...@gmail.com wrote:
 
  Unfortunately I have a lot of duplicates  and taking that searching
 might
  suffer I will try with implementing update procesor.
 
  But your idea is interesting and I will consider it, thanks.
 
  Best Regards
  Alexander Aristov
 
 
  On 28 December 2011 19:12, Tanguy Moal tanguy.m...@gmail.com wrote:
 
   Hello Alexander,
  
   I don't know much about your requirements in terms of size and
   performances, but I've had a similar use case and found a pretty
 simple
   workaround.
   If your duplicate rate is not too high, you can have the
   SignatureProcessor to generate fingerprint of documents (you already
 did
   that).
  
   Simply turn off overwritting of duplicates, you can then rely on
 solr's
   grouping / field collapsing to group your search results by
 fingerprints.
   You'll then have one document group per real document. You can use
   group.sort to sort your groups by indexing date ascending, and
   group.limit=1 to keep only the oldest one.
   You can even use group.format = simple to serve results as if no
   collapsing occured, and use group.ngroups (/!\ could be expansive
 /!\) to
   get the real number of deduplicated documents.
  
   Of course the index will be larger, as I said, I made no assumption
   regarding your operating requirements. And search can be a bit slower,
   depending on the average rate of duplicated documents.
   But you've got your issue addressed by configuration tuning only...
   Depending on your project's sizing, it could be time saving.
  
   The advantage is that you have the precious information of what
 content
  is
   duplicated from where :-)
  
   Hope this helps,
  
   --
   Tanguy
  
   Le 28/12/2011 15:45, Alexander Aristov a écrit :
  
Thanks Eric,
  
   it sets me direction. I will be writing new plugin and will get back
 to
   the
   dev forum with results and then we will decide next steps.
  
   Best Regards
   Alexander Aristov
  
  
   On 28 December 2011 18:08, Erick Ericksonerickerickson@gmail.**com
  erickerick...@gmail.com
wrote:
  
Well, the short answer is that nobody else has
   1  had a similar requirement
   AND
   2  not found a suitable work around
   AND
   3  implemented the change and contributed it back.
  
   So, if you'd like to volunteerG.
  
   Seriously. If you think this would be valuable and are
   willing to work on it, hop on over to the dev list and
   discuss it, open a JIRA and make it work. I'd start
   by opening a discussion on the dev list before
   opening a JIRA, just to get a sense of where the
   snags would be to changing the Solr code, but that's
   optional.
  
   That said, writing your own update request handler
   that detected this case isn't very difficult,
   extend UpdateRequestProcessorFactory/**UpdateRequestProcessor
   and use it as a plugin.
  
   Best
   Erick
  
   On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
   alexander.aris...@gmail.com  wrote:
  
   the problem with dedupe (SignatureUpdateProcessor ) is that it
  REPLACES
  
   old
  
   docs. I have tried it already.
  
   Best Regards

Re: solr keep old docs

2011-12-29 Thread Alexander Aristov
Well, the first results are ready. I have implemented a custom update
processor following your suggestion, using a low-level index reader and
TermDocs.
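
For anyone curious, a minimal sketch of such a processor looks roughly like
this (class names are mine, the uniqueKey field is assumed to be "id", the
lookup uses SolrIndexSearcher.getFirstMatch rather than raw TermDocs, and
package names may differ slightly between Solr versions):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingDocumentsProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new SkipExistingDocumentsProcessor(req, next);
  }

  static class SkipExistingDocumentsProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    SkipExistingDocumentsProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      Object id = doc.getFieldValue("id"); // the uniqueKey field
      SolrIndexSearcher searcher = req.getSearcher();
      // getFirstMatch returns -1 when no document with this term is in the index
      if (id != null && searcher.getFirstMatch(new Term("id", id.toString())) != -1) {
        return; // already indexed once, keep the old doc and skip this one
      }
      super.processAdd(cmd); // otherwise pass the new doc down the chain as usual
    }
  }
}

The factory then gets wired into an updateRequestProcessorChain in
solrconfig.xml, just like the dedupe example on the wiki.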

I launched scripts which add about 10 000 docs. Indexing took about 1 minute
including the commit, which is quite good for me. I don't have larger datasets,
so I won't be able to check under heavier conditions.

If someone is interested I can send over my jar file with my update
processor.

As I said I am ready to contribute it to solr but will get back to it in
the New Year after 10 Jan.

thanks everybody.

Best Regards
Alexander Aristov


On 29 December 2011 18:12, Erick Erickson erickerick...@gmail.com wrote:

 I'd guess it would be much faster, assuming that
 the search savings wouldn't be swamped by the
 additional transmission time over the wire and
 parsing the request (although SolrJ uses a binary
  format, so parsing the request probably isn't all
 that expensive).

 You could even do a hybrid approach. Pack up all
 of the IDs you are about to update, send them to
 your special *request* handler and have your
 request handler respond with the documents that
 were already in the index...

 Hmmm, scratch all that. Start with just stringing
 together a long set of uniqueKeys and just
 search for them. Something like
  q=id:(1 2 47 09873)&fl=id
 The response should be a minimal set of data
 returned (just the ID). Then you can remove
 each document ID returned from your
 next update. No custom Solr components
 required.

  Solr defaults to a maxBooleanClauses count
  of 1024, so your packets should have fewer IDs
  than this, or you should bump that config setting.

 This should pretty much do what I was thinking
 with custom code without having to write
 anything..
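
 In SolrJ that check could be a sketch like this (the server URL and the ID
 batch are made up; CommonsHttpSolrServer is the 3.x client class):

 import java.util.HashSet;
 import java.util.Set;

 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.SolrServer;
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
 import org.apache.solr.client.solrj.response.QueryResponse;
 import org.apache.solr.common.SolrDocument;

 public class ExistingIdCheck {
   public static void main(String[] args) throws Exception {
     SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

     // ask Solr which of the IDs in this batch it already has
     SolrQuery q = new SolrQuery("id:(1 2 47 09873)"); // one batch of uniqueKeys
     q.setFields("id");
     q.setRows(1024); // large enough to return every matching ID in the batch
     QueryResponse rsp = server.query(q);

     Set<String> existing = new HashSet<String>();
     for (SolrDocument d : rsp.getResults()) {
       existing.add(d.getFieldValue("id").toString());
     }
     // drop every SolrInputDocument whose id is in 'existing', then add the rest
     System.out.println("already indexed: " + existing);
   }
 }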

 Best
 Erick

 On Thu, Dec 29, 2011 at 8:15 AM, Alexander Aristov
 alexander.aris...@gmail.com wrote:
  I have never developed for solr yet and don't know much internals but
 Today
  I have tried one approach with searcher.
 
  In my update processor I get searcher and search for ID. It works but I
  need to load test it. Will index traversal be faster (less resource
  consuming) than search?
 
  Best Regards
  Alexander Aristov
 
 
  On 29 December 2011 17:03, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Hmmm, we're not communicating G...
 
  The update processor wouldn't search in the
  classic sense. It would just use lower-level
  index traversal to determine if the doc (identified
  by your unique key) was already in the index
  and skip indexing that document if it was. No real
  *searching* involved (see TermDocs.seek for one
  approach).
 
  The price would be that you are transmitting the
  document over to the Solr instance and then
  throwing it away.
 
  Best
  Erick
 
  On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
  mkhlud...@griddynamics.com wrote:
   Alexander,
  
   I have two ideas how to implement fast dedupe externally, assuming
 your
  PKs
   don't fit to java.util.*Map:
  
 - your crawler can use inprocess RDBMS (Derby, H2) to track dupes;
 - if your crawler is stateless - it doesn't track PKs which has been
 already crawled, you can retrieve it from Solr via
 http://wiki.apache.org/solr/TermsComponent . That's blazingly fast,
  but
 it might be a problem with removed documents (I'm not sure). And
 it's
  also
 can lead to OOMException (if you have too much PKs). Let me know if
 you
 need a workaround for one of these problems.
  
   If you choose internal dedupe (UpdateProcessor), pls let me know if
   querying one-by-one will be to slow for your and you'll need to do it
   page-by-page. I did some of such paging, and will do something similar
   soon, so I'm interested in it.
  
   Regards
  
   On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov 
   alexander.aris...@gmail.com wrote:
  
   Unfortunately I have a lot of duplicates  and taking that searching
  might
   suffer I will try with implementing update procesor.
  
   But your idea is interesting and I will consider it, thanks.
  
   Best Regards
   Alexander Aristov
  
  
   On 28 December 2011 19:12, Tanguy Moal tanguy.m...@gmail.com
 wrote:
  
Hello Alexander,
   
I don't know much about your requirements in terms of size and
performances, but I've had a similar use case and found a pretty
  simple
workaround.
If your duplicate rate is not too high, you can have the
SignatureProcessor to generate fingerprint of documents (you
 already
  did
that).
   
Simply turn off overwritting of duplicates, you can then rely on
  solr's
grouping / field collapsing to group your search results by
  fingerprints.
You'll then have one document group per real document. You can
 use
group.sort to sort your groups by indexing date ascending, and
group.limit=1 to keep only the oldest one.
You can even use group.format = simple to serve results as if no
collapsing occured, and use group.ngroups (/!\ could be expansive
  /!\) to
get the real number

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
The problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old
docs. I have tried it already.

Best Regards
Alexander Aristov


On 28 December 2011 13:04, Lance Norskog goks...@gmail.com wrote:

 The SignatureUpdateProcessor is for exactly this problem:


 http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication

 On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
 alexander.aris...@gmail.com wrote:
  I get docs from external sources and the only place I keep them is solr
  index. I have no a database or other means to track indexed docs (my
  personal oppinion is that it might be a huge headache).
 
  Some docs might change slightly in there original sources but I don't
 need
  that changes. In fact I need original data only.
 
  So I have no other ways but to either check if a document is already in
  index before I put it to solrj array (read - query solr) or develop my
 own
  update chain processor and implement ID check there and skip such docs.
 
  Maybe it's wrong place to aguee and probably it's been discussed before
 but
  I wonder why simple the overwrite parameter doesn't work here.
 
  My oppinion it perfectly suits here. In combination with unique ID it can
  cover all possible variants.
 
  cases:
 
  1. overwrite=true and uniquID exists then newer doc should overwrite the
  old one.
 
  2. overwrite=false and uniqueID exists then newer doc must be skipped
 since
  old exists.
 
  3. uniqueID doesn't exist then newer doc just gets added regardless if
 old
  exists or not.
 
 
  Best Regards
  Alexander Aristov
 
 
  On 27 December 2011 22:53, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Mikhail is right as far as I know, the assumption built into Solr is
 that
  duplicate IDs (when uniqueKey is defined) should trigger the old
  document to be replaced.
 
  what is your system-of-record? By that I mean what does your SolrJ
  program do to send data to Solr? Is there any way you could just
  *not* send documents that are already in the Solr index based on,
  for instance, any timestamp associated with your system-of-record
  and the last time you did an incremental index?
 
  Best
  Erick
 
  On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
  alexander.aris...@gmail.com wrote:
   Hi
  
   I am not using database. All needed data is in solr index that's why I
  want
   to skip excessive checks.
  
   I will check DIH but not sure if it helps.
  
   I am fluent with Java and it's not a problem for me to write a class
 or
  so
   but I want to check first  maybe there are any ways (workarounds) to
 make
   it working without codding, just by playing around with configuration
 and
   params. I don't want to go away from default solr implementation.
  
   Best Regards
   Alexander Aristov
  
  
   On 27 December 2011 09:33, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:
  
   On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov 
   alexander.aris...@gmail.com wrote:
  
Hi people,
   
I urgently need your help!
   
I have solr 3.3 configured and running. I do uncremental indexing 4
   times a
day using bulk updates. Some documents are identical to some extent
  and I
wish to skip them, not to index.
But here is the problem as I could not find a way to tell solr
 ignore
  new
duplicate docs and keep old indexed docs. I don't care that it's
 new.
   Just
determine by ID that such document is in the index already and
 that's
  it.
   
I use solrj for indexing. I have tried setting overwrite=false and
  dedupe
apprache but nothing helped me. I either have that a newer doc
  overwrites
old one or I get duplicate.
   
I think it's a very simple and basic feature and it must exist.
 What
  did
   I
make wrong or didn't do?
   
  
   I guess, because  the mainstream approach is delta-import , when you
  have
   updated timestamps in your DB and last-import timestamp stored
   somewhere. You can check how it works in DIH.
  
  
   
Tried google but I couldn't find a solution there althoght many
 people
encounted such problem.
   
   
   it's definitely can be done by overriding
   o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
  suggest
   to start from implementing your own
   http://wiki.apache.org/solr/UpdateRequestProcessor - search for PK,
  bypass
   chain call if it's found. Then if you meet performance issues on
  querying
   your PKs one by one, (but only after that) you can batch your
 searches,
   there are couple of optimization techniques for huge disjunction
 queries
   like PK:(2 OR 4 OR 5 OR 6).
  
  
I start considering that I must query index to check if a doc to be
  added
is in the index already and do not add it to array but I have so
 many
   docs
that I am affraid it's not a good solution.
   
Best Regards
Alexander Aristov
   
  
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Lucid Certified
   Apache Lucene/Solr

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Thanks Erick,

that sets my direction. I will be writing a new plugin and will get back to the
dev forum with results, and then we will decide next steps.

Best Regards
Alexander Aristov


On 28 December 2011 18:08, Erick Erickson erickerick...@gmail.com wrote:

 Well, the short answer is that nobody else has
 1 had a similar requirement
 AND
 2 not found a suitable work around
 AND
 3 implemented the change and contributed it back.

 So, if you'd like to volunteer G.

 Seriously. If you think this would be valuable and are
 willing to work on it, hop on over to the dev list and
 discuss it, open a JIRA and make it work. I'd start
 by opening a discussion on the dev list before
 opening a JIRA, just to get a sense of where the
 snags would be to changing the Solr code, but that's
 optional.

 That said, writing your own update request handler
 that detected this case isn't very difficult,
 extend UpdateRequestProcessorFactory/UpdateRequestProcessor
 and use it as a plugin.

 Best
 Erick

 On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
 alexander.aris...@gmail.com wrote:
  the problem with dedupe (SignatureUpdateProcessor ) is that it REPLACES
 old
  docs. I have tried it already.
 
  Best Regards
  Alexander Aristov
 
 
  On 28 December 2011 13:04, Lance Norskog goks...@gmail.com wrote:
 
  The SignatureUpdateProcessor is for exactly this problem:
 
 
 
 http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
 
  On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
  alexander.aris...@gmail.com wrote:
   I get docs from external sources and the only place I keep them is
 solr
   index. I have no a database or other means to track indexed docs (my
   personal oppinion is that it might be a huge headache).
  
   Some docs might change slightly in there original sources but I don't
  need
   that changes. In fact I need original data only.
  
   So I have no other ways but to either check if a document is already
 in
   index before I put it to solrj array (read - query solr) or develop my
  own
   update chain processor and implement ID check there and skip such
 docs.
  
   Maybe it's wrong place to aguee and probably it's been discussed
 before
  but
   I wonder why simple the overwrite parameter doesn't work here.
  
   My oppinion it perfectly suits here. In combination with unique ID it
 can
   cover all possible variants.
  
   cases:
  
   1. overwrite=true and uniquID exists then newer doc should overwrite
 the
   old one.
  
   2. overwrite=false and uniqueID exists then newer doc must be skipped
  since
   old exists.
  
   3. uniqueID doesn't exist then newer doc just gets added regardless if
  old
   exists or not.
  
  
   Best Regards
   Alexander Aristov
  
  
   On 27 December 2011 22:53, Erick Erickson erickerick...@gmail.com
  wrote:
  
   Mikhail is right as far as I know, the assumption built into Solr is
  that
   duplicate IDs (when uniqueKey is defined) should trigger the old
   document to be replaced.
  
   what is your system-of-record? By that I mean what does your SolrJ
   program do to send data to Solr? Is there any way you could just
   *not* send documents that are already in the Solr index based on,
   for instance, any timestamp associated with your system-of-record
   and the last time you did an incremental index?
  
   Best
   Erick
  
   On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
   alexander.aris...@gmail.com wrote:
Hi
   
I am not using database. All needed data is in solr index that's
 why I
   want
to skip excessive checks.
   
I will check DIH but not sure if it helps.
   
I am fluent with Java and it's not a problem for me to write a
 class
  or
   so
but I want to check first  maybe there are any ways (workarounds)
 to
  make
it working without codding, just by playing around with
 configuration
  and
params. I don't want to go away from default solr implementation.
   
Best Regards
Alexander Aristov
   
   
On 27 December 2011 09:33, Mikhail Khludnev 
  mkhlud...@griddynamics.com
   wrote:
   
On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov 
alexander.aris...@gmail.com wrote:
   
 Hi people,

 I urgently need your help!

 I have solr 3.3 configured and running. I do uncremental
 indexing 4
times a
 day using bulk updates. Some documents are identical to some
 extent
   and I
 wish to skip them, not to index.
 But here is the problem as I could not find a way to tell solr
  ignore
   new
 duplicate docs and keep old indexed docs. I don't care that it's
  new.
Just
 determine by ID that such document is in the index already and
  that's
   it.

 I use solrj for indexing. I have tried setting overwrite=false
 and
   dedupe
 apprache but nothing helped me. I either have that a newer doc
   overwrites
 old one or I get duplicate.

 I think it's a very simple and basic feature and it must exist

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Unfortunately I have a lot of duplicates, and given that searching might
suffer, I will try implementing an update processor.

But your idea is interesting and I will consider it, thanks.

Best Regards
Alexander Aristov


On 28 December 2011 19:12, Tanguy Moal tanguy.m...@gmail.com wrote:

 Hello Alexander,

 I don't know much about your requirements in terms of size and
 performance, but I've had a similar use case and found a pretty simple
 workaround.
 If your duplicate rate is not too high, you can have the
 SignatureProcessor generate a fingerprint of each document (you already did
 that).
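
 In solrconfig.xml that is roughly the example chain from the Deduplication
 wiki page, just with overwriteDupes switched off (signatureField and the
 fields list are whatever the fingerprint is computed from):

   <updateRequestProcessorChain name="dedupe">
     <processor class="solr.processor.SignatureUpdateProcessorFactory">
       <bool name="enabled">true</bool>
       <str name="signatureField">signature</str>
       <bool name="overwriteDupes">false</bool>
       <str name="fields">name,features,cat</str>
       <str name="signatureClass">solr.processor.Lookup3Signature</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>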

 Simply turn off overwriting of duplicates; you can then rely on Solr's
 grouping / field collapsing to group your search results by fingerprints.
 You'll then have one document group per real document. You can use
 group.sort to sort your groups by indexing date ascending, and
 group.limit=1 to keep only the oldest one.
 You can even use group.format=simple to serve results as if no
 collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to
 get the real number of deduplicated documents.
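
 A request sketch putting those parameters together (the field names
 signature and indexed_at are assumptions):

   q=your query&group=true&group.field=signature&group.sort=indexed_at asc
     &group.limit=1&group.format=simple&group.ngroups=true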

 Of course the index will be larger, as I said, I made no assumption
 regarding your operating requirements. And search can be a bit slower,
 depending on the average rate of duplicated documents.
 But you've got your issue addressed by configuration tuning only...
 Depending on your project's sizing, it could be time saving.

 The advantage is that you have the precious information of what content is
 duplicated from where :-)

 Hope this helps,

 --
 Tanguy

 Le 28/12/2011 15:45, Alexander Aristov a écrit :

  Thanks Eric,

 it sets me direction. I will be writing new plugin and will get back to
 the
 dev forum with results and then we will decide next steps.

 Best Regards
 Alexander Aristov


 On 28 December 2011 18:08, Erick 
 Ericksonerickerickson@gmail.**comerickerick...@gmail.com
  wrote:

  Well, the short answer is that nobody else has
 1  had a similar requirement
 AND
 2  not found a suitable work around
 AND
 3  implemented the change and contributed it back.

 So, if you'd like to volunteerG.

 Seriously. If you think this would be valuable and are
 willing to work on it, hop on over to the dev list and
 discuss it, open a JIRA and make it work. I'd start
 by opening a discussion on the dev list before
 opening a JIRA, just to get a sense of where the
 snags would be to changing the Solr code, but that's
 optional.

 That said, writing your own update request handler
 that detected this case isn't very difficult,
 extend UpdateRequestProcessorFactory/**UpdateRequestProcessor
 and use it as a plugin.

 Best
 Erick

 On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
 alexander.aris...@gmail.com  wrote:

 the problem with dedupe (SignatureUpdateProcessor ) is that it REPLACES

 old

 docs. I have tried it already.

 Best Regards
 Alexander Aristov


 On 28 December 2011 13:04, Lance Norskoggoks...@gmail.com  wrote:

  The SignatureUpdateProcessor is for exactly this problem:



  http://www.lucidimagination.**com/search/link?url=http://**
 wiki.apache.org/solr/**Deduplicationhttp://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication

 On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
 alexander.aris...@gmail.com  wrote:

 I get docs from external sources and the only place I keep them is

 solr

 index. I have no a database or other means to track indexed docs (my
 personal oppinion is that it might be a huge headache).

 Some docs might change slightly in there original sources but I don't

 need

 that changes. In fact I need original data only.

 So I have no other ways but to either check if a document is already

 in

 index before I put it to solrj array (read - query solr) or develop my

 own

 update chain processor and implement ID check there and skip such

 docs.

 Maybe it's wrong place to aguee and probably it's been discussed

 before

 but

 I wonder why simple the overwrite parameter doesn't work here.

 My oppinion it perfectly suits here. In combination with unique ID it

 can

 cover all possible variants.

 cases:

 1. overwrite=true and uniquID exists then newer doc should overwrite

 the

 old one.

 2. overwrite=false and uniqueID exists then newer doc must be skipped

 since

 old exists.

 3. uniqueID doesn't exist then newer doc just gets added regardless if

 old

 exists or not.


 Best Regards
 Alexander Aristov


 On 27 December 2011 22:53, Erick 
 Ericksonerickerickson@gmail.**comerickerick...@gmail.com
 

 wrote:

 Mikhail is right as far as I know, the assumption built into Solr is

 that

 duplicate IDs (whenuniqueKey  is defined) should trigger the old
 document to be replaced.

 what is your system-of-record? By that I mean what does your SolrJ
 program do to send data to Solr? Is there any way you could just
 *not* send documents that are already in the Solr index based on,
 for instance, any timestamp associated with your

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Yes, I have been warned that querying the index each time before adding a doc
might be resource consuming. Will check it.

As for the overwrite parameter, I think the name is not the best then.
People outside the business, like me, misuse it and assume what I wrote.
Overwrite should mean what it says.

But I understand what it actually does, and so my way forward is to write a
custom update processor plugin.

Best Regards
Alexander Aristov


On 28 December 2011 22:16, Chris Hostetter hossman_luc...@fucit.org wrote:


 : That said, writing your own update request handler
 : that detected this case isn't very difficult,
 : extend UpdateRequestProcessorFactory/UpdateRequestProcessor
 : and use it as a plugin.

 i can't find the thread at the moment, but the general issue that has
 caused people headaches with this type of approach in the past has been
 that the performance of doing a query on every update (to see if the doc
 is already in the index) can slow things down quite a bit -- in your
 usecase it may not be a significant bottleneck, but that's the general
 issue that has come up in the past.

 If you look at systems (like nutch) that do large scale crawling, they
 treat the crawl phase independent from the indexing phase precisely for
 reasons like this -- so the crawler can dedup the documents (by unique
 URL) and eliminate duplication before ever even adding them to the index.

 :   I wonder why simple the overwrite parameter doesn't work here.
...
 :   2. overwrite=false and uniqueID exists then newer doc must be
 skipped
 :  since
 :   old exists.

 that is not what overwrite=false does (or was ever designed to do).
 overwrite=false is a way to tell Solr that you are already certain that
 the documents being added do not exist in the index, therefore Solr can
 save time by not attempting to overwrite an existing document.  It is
 intended for situations where you are bulk loading documents, ie: doing an
 initial build of an index from a system of record (ie: a single pass over
 a database that uses the same unique key) or importing documents from a
 new system of record with a completely different id space.



 -Hoss



Re: solr keep old docs

2011-12-27 Thread Alexander Aristov
Hi

I am not using a database. All needed data is in the Solr index, that's why I
want to skip excessive checks.

I will check DIH but I'm not sure if it helps.

I am fluent with Java and it's not a problem for me to write a class or so,
but I want to check first whether there are any ways (workarounds) to make
it work without coding, just by playing around with configuration and
params. I don't want to go away from the default Solr implementation.

Best Regards
Alexander Aristov


On 27 December 2011 09:33, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

 On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov 
 alexander.aris...@gmail.com wrote:

  Hi people,
 
  I urgently need your help!
 
  I have solr 3.3 configured and running. I do uncremental indexing 4
 times a
  day using bulk updates. Some documents are identical to some extent and I
  wish to skip them, not to index.
  But here is the problem as I could not find a way to tell solr ignore new
  duplicate docs and keep old indexed docs. I don't care that it's new.
 Just
  determine by ID that such document is in the index already and that's it.
 
  I use solrj for indexing. I have tried setting overwrite=false and dedupe
  apprache but nothing helped me. I either have that a newer doc overwrites
  old one or I get duplicate.
 
  I think it's a very simple and basic feature and it must exist. What did
 I
  make wrong or didn't do?
 

 I guess, because the mainstream approach is delta-import, where you have
 updated timestamps in your DB and the last-import timestamp stored
 somewhere. You can check how it works in DIH.


 
  Tried google but I couldn't find a solution there althoght many people
  encounted such problem.
 
 
 It definitely can be done by overriding
 o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest
 starting by implementing your own
 http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK, bypass
 the chain call if it's found. Then if you hit performance issues querying
 your PKs one by one (but only after that), you can batch your searches;
 there are a couple of optimization techniques for huge disjunction queries
 like PK:(2 OR 4 OR 5 OR 6).


  I start considering that I must query index to check if a doc to be added
  is in the index already and do not add it to array but I have so many
 docs
  that I am affraid it's not a good solution.
 
  Best Regards
  Alexander Aristov
 



 --
 Sincerely yours
 Mikhail Khludnev
 Lucid Certified
 Apache Lucene/Solr Developer
 Grid Dynamics



Re: solr keep old docs

2011-12-27 Thread Alexander Aristov
I get docs from external sources and the only place I keep them is the Solr
index. I have no database or other means to track indexed docs (my
personal opinion is that it would be a huge headache).

Some docs might change slightly in their original sources, but I don't need
those changes. In fact I need the original data only.

So I have no other way but to either check if a document is already in the
index before I put it into the SolrJ array (read: query Solr), or develop my own
update chain processor, implement an ID check there and skip such docs.

Maybe this is the wrong place to argue, and probably it's been discussed
before, but I wonder why simply the overwrite parameter doesn't work here.

In my opinion it perfectly suits here. In combination with a unique ID it can
cover all possible variants.

Cases:

1. overwrite=true and the uniqueID exists: the newer doc should overwrite the
old one.

2. overwrite=false and the uniqueID exists: the newer doc must be skipped since
the old one exists.

3. the uniqueID doesn't exist: the newer doc just gets added regardless of
whether an old one exists or not.


Best Regards
Alexander Aristov


On 27 December 2011 22:53, Erick Erickson erickerick...@gmail.com wrote:

 Mikhail is right as far as I know, the assumption built into Solr is that
 duplicate IDs (when uniqueKey is defined) should trigger the old
 document to be replaced.

 what is your system-of-record? By that I mean what does your SolrJ
 program do to send data to Solr? Is there any way you could just
 *not* send documents that are already in the Solr index based on,
 for instance, any timestamp associated with your system-of-record
 and the last time you did an incremental index?

 Best
 Erick

 On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
 alexander.aris...@gmail.com wrote:
  Hi
 
  I am not using database. All needed data is in solr index that's why I
 want
  to skip excessive checks.
 
  I will check DIH but not sure if it helps.
 
  I am fluent with Java and it's not a problem for me to write a class or
 so
  but I want to check first  maybe there are any ways (workarounds) to make
  it working without codding, just by playing around with configuration and
  params. I don't want to go away from default solr implementation.
 
  Best Regards
  Alexander Aristov
 
 
  On 27 December 2011 09:33, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:
 
  On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov 
  alexander.aris...@gmail.com wrote:
 
   Hi people,
  
   I urgently need your help!
  
   I have solr 3.3 configured and running. I do uncremental indexing 4
  times a
   day using bulk updates. Some documents are identical to some extent
 and I
   wish to skip them, not to index.
   But here is the problem as I could not find a way to tell solr ignore
 new
   duplicate docs and keep old indexed docs. I don't care that it's new.
  Just
   determine by ID that such document is in the index already and that's
 it.
  
   I use solrj for indexing. I have tried setting overwrite=false and
 dedupe
   apprache but nothing helped me. I either have that a newer doc
 overwrites
   old one or I get duplicate.
  
   I think it's a very simple and basic feature and it must exist. What
 did
  I
   make wrong or didn't do?
  
 
  I guess, because  the mainstream approach is delta-import , when you
 have
  updated timestamps in your DB and last-import timestamp stored
  somewhere. You can check how it works in DIH.
 
 
  
   Tried google but I couldn't find a solution there althoght many people
   encounted such problem.
  
  
  it's definitely can be done by overriding
  o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
 suggest
  to start from implementing your own
  http://wiki.apache.org/solr/UpdateRequestProcessor - search for PK,
 bypass
  chain call if it's found. Then if you meet performance issues on
 querying
  your PKs one by one, (but only after that) you can batch your searches,
  there are couple of optimization techniques for huge disjunction queries
  like PK:(2 OR 4 OR 5 OR 6).
 
 
   I start considering that I must query index to check if a doc to be
 added
   is in the index already and do not add it to array but I have so many
  docs
   that I am affraid it's not a good solution.
  
   Best Regards
   Alexander Aristov
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Lucid Certified
  Apache Lucene/Solr Developer
  Grid Dynamics
 



solr keep old docs

2011-12-26 Thread Alexander Aristov
Hi people,

I urgently need your help!

I have Solr 3.3 configured and running. I do incremental indexing 4 times a
day using bulk updates. Some documents are identical to some extent and I
wish to skip them, not index them.
But here is the problem: I could not find a way to tell Solr to ignore new
duplicate docs and keep the old indexed docs. I don't care that the incoming
doc is newer. Just determine by ID that such a document is in the index already
and that's it.

I use SolrJ for indexing. I have tried setting overwrite=false and the dedupe
approach but nothing helped me. Either a newer doc overwrites the
old one or I get a duplicate.

I think it's a very simple and basic feature and it must exist. What did I
do wrong, or not do?

I tried Google but couldn't find a solution there, although many people have
encountered this problem.

I am starting to think that I must query the index to check whether a doc to be
added is already there and not add it to the array, but I have so many docs
that I am afraid it's not a good solution.

Best Regards
Alexander Aristov


solr ignore duplicate documents

2011-12-13 Thread Alexander Aristov
People,

I am asking for your help with solr.

When a document is sent to Solr and such a document already exists in its
index (by its ID), the new doc replaces the old one.

But I don't want to automatically replace documents. Just ignore it and
proceed to the next. How can I configure Solr to do so?

Of course I can query Solr to check whether it already has the document, but
that's bad for me since I do bulk updates and this would complicate the process
and increase the number of requests.

So are there any ways to configure Solr to ignore duplicates? Just ignore them.
I don't need any specific responses or actions.

Best Regards
Alexander Aristov


Re: solr upgrade question

2011-03-31 Thread Alexander Aristov
I didn't get any responses.

But I tried Luke 1.0.1 and it did the magic. I ran an optimize and after
that Solr came up.

Best Regards
Alexander Aristov


On 30 March 2011 15:47, Alexander Aristov alexander.aris...@gmail.com wrote:

 People

 Is were way to upgrade existsing index from solr 1.4 to solr 4(trunk). When
 I configured solr 4 and launched it complained about incorrect lucence file
 version (3 instead of old 2)

 Are there any procedures to convert index?


 Best Regards
 Alexander Aristov



solr upgrade question

2011-03-30 Thread Alexander Aristov
People

Is there a way to upgrade an existing index from Solr 1.4 to Solr 4 (trunk)?
When I configured Solr 4 and launched it, it complained about an incorrect
Lucene file version (3 instead of the old 2).

Are there any procedures to convert the index?


Best Regards
Alexander Aristov