Re: Custom Solr FunctionQuery Error

2011-12-28 Thread Parvin Gasimzade
Thank you for your answers.

I have a Map and want to boost the score of those documents
during search time.

In my example I get that map inside the ValueSource and boost the matched
documents' scores.

In the query, if {!graph} is added then it should return the boosted results,
otherwise it should return the regular list.

For this design, which classes should I extend? Or how can I solve this
issue?

Regards,
Parvin

On Thu, Dec 29, 2011 at 1:07 AM, Yonik Seeley wrote:

> On Wed, Dec 28, 2011 at 2:16 AM, Parvin Gasimzade
>  wrote:
> > I have created custom Solr FunctionQuery in Solr 3.4.
> > I extended ValueSourceParser, ValueSource, Query and QParserPlugin
> classes.
>
> Note that you only need a QParserPlugin implementation for top level
> query types, not function queries.
> With just a ValueSourceParser and ValueSource implementation, you can
> use the custom function as a function query.
>
> Example:
> q={!func}graph("myparameter")
>
> -Yonik
> http://www.lucidimagination.com
>
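
For reference, such a function is registered in solrconfig.xml roughly like
this (using the parser class name Parvin mentions elsewhere in the thread;
only a sketch, details depend on the actual implementation):

  <valueSourceParser name="graph"
                     class="org.gasimzade.solr.GraphValueSourceParser"/>

After that it can be called as q={!func}graph(...) without any QParserPlugin.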


Re: best practice to introducing singletons inside of Solr (IoC)

2011-12-28 Thread Mikhail Khludnev
Erick,

OK. Let me try the plain Java one first. Possibly I'll need tighter
integration, like injecting a core into the singleton, etc., but I don't know
yet.

Thanks for your efforts.

On Wed, Dec 28, 2011 at 5:48 PM, Erick Erickson wrote:

> I must be missing something here. Why would this be any different from
> any other singleton? I just did a little experiment where I implemented
> the classic singleton pattern in a RequestHandler and accessed it
> from a Filter (both plugins) with no problem at all, just the usual
> blah var = MySingleton.getInstance();
> var.whatever
>
> There was no need to get Solr cores involved at all. Of course this
> was just a simple experiment, YMMV..
>
> Best
> Erick
>
> On Tue, Dec 27, 2011 at 11:52 PM, Mikhail Khludnev
>  wrote:
> > Colleagues,
> >
> > Don't hesitate to emit your opinion. Please!
> >
> > Regards
> >
> > On Wed, Dec 21, 2011 at 11:06 PM, Mikhail Khludnev <
> > mkhlud...@griddynamics.com> wrote:
> >
> >> Hello,
> >>
> >> I need to introduce several singletons inside of Solr and make them
> >> available for my own SearchHandlers, Components, and even QParsers, etc.
> >>
> >> Right now I use some kind of fake SolrRequestHandler which loads on init()
> >> and is available everywhere through
> >> solrCore.getRequestHandler("wellknownName"). Then I downcast it everywhere
> >> and access the required methods. The same is possible with fake
> >> SearchComponent.
> >> Particularly my singletons are some additional fields schema (pretty
> >> sophisticated), and kind of request/response encoding facility.
> >> The typical Java hammer for such pins is Spring, but I've found it puzzling
> >> to use
> >>
> http://static.springframework.org/spring/docs/3.0.x/javadoc-api/org/springframework/web/context/support/WebApplicationContextUtils.html
> >>
> >> What's the best way to do that?
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> Lucid Certified
> >> Apache Lucene/Solr Developer
> >> Grid Dynamics
> >>
> >> 
> >>  
> >>
> >>
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Lucid Certified
> > Apache Lucene/Solr Developer
> > Grid Dynamics
> >
> > 
> >  
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics
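
A minimal sketch of the plain-Java singleton Erick describes above, using the
names from his pseudo-code (MySingleton / getInstance / whatever are
illustrative only):

  // Classic lazily-initialized, thread-safe singleton; any Solr plugin loaded
  // in the same webapp classloader can call MySingleton.getInstance().
  public final class MySingleton {
      private static volatile MySingleton instance;

      private MySingleton() { }

      public static MySingleton getInstance() {
          if (instance == null) {
              synchronized (MySingleton.class) {
                  if (instance == null) {
                      instance = new MySingleton();
                  }
              }
          }
          return instance;
      }

      public void whatever() {
          // shared state or services live here
      }
  }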


 


Re: Grouping results after Sorting or vice-versa

2011-12-28 Thread Vijayaragavan
Hi Juan, 

I'm using Solr 3.1
The type of the date field is long.
Let's say the documents indexed in the Solr server are:

doc:
  id:       1326c5cc09bbc99a_1
  threadid: 1326c5cc09bbc99a
  date:     1316078009000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1321dff33cecd5f4_1
  threadid: 1321dff33cecd5f4
  date:     1314956314000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1321dff33cecd5f4_2
  threadid: 1321dff33cecd5f4
  date:     131771922
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       133b70d0d0e32f12_1
  threadid: 133b70d0d0e32f12
  date:     1321626044000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

The results I'm getting for
http://localhost:8080/solr/core1/select/?qt=nutch&q=*:*&fq=userid:333&group=true&group.field=threadid&group.sort=date%20desc&sort=date%20desc
are:

doc:
  id:       133b70d0d0e32f12_1
  threadid: 133b70d0d0e32f12
  date:     1321626044000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1321dff33cecd5f4_2
  threadid: 1321dff33cecd5f4
  date:     131771922
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1326c5cc09bbc99a_1
  threadid: 1326c5cc09bbc99a
  date:     1316078009000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1321dff33cecd5f4_1
  threadid: 1321dff33cecd5f4
  date:     1314956314000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

But the results I should get are:

doc:
  id:       133b70d0d0e32f12_1
  threadid: 133b70d0d0e32f12
  date:     1321626044000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1321dff33cecd5f4_2
  threadid: 1321dff33cecd5f4
  date:     131771922
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1321dff33cecd5f4_1
  threadid: 1321dff33cecd5f4
  date:     1314956314000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

doc:
  id:       1326c5cc09bbc99a_1
  threadid: 1326c5cc09bbc99a
  date:     1316078009000
  <.. Some Other fields here ..>
  Some subject here...
  Some message here...

Is it possible to get such results? If yes, how?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grouping-results-after-Sorting-or-vice-versa-tp3615957p3618172.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr keep old docs

2011-12-28 Thread Mikhail Khludnev
Alexander,

I have two ideas for how to implement fast dedupe externally, assuming your PKs
don't fit into a java.util.*Map:

   - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
   - if your crawler is stateless - i.e. it doesn't track PKs which have been
   already crawled - you can retrieve them from Solr via
   http://wiki.apache.org/solr/TermsComponent (see the sketch of such a
   request below). That's blazingly fast, but there might be a problem with
   removed documents (I'm not sure), and it can also lead to an OOM exception
   (if you have too many PKs). Let me know if you need a workaround for one of
   these problems.
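
   For example, with the stock /terms handler from the example solrconfig.xml
   and a uniqueKey field called "id" (both assumptions, adjust to your setup),
   a request like this returns every indexed key:

     http://localhost:8983/solr/terms?terms.fl=id&terms.limit=-1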

If you choose internal dedupe (UpdateProcessor), please let me know if
querying one-by-one turns out to be too slow for you and you need to do it
page-by-page. I have done some paging like that, and will do something similar
soon, so I'm interested in it.

Regards

On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <
alexander.aris...@gmail.com> wrote:

> Unfortunately I have a lot of duplicates, and given that searching might
> suffer, I will try implementing an update processor.
>
> But your idea is interesting and I will consider it, thanks.
>
> Best Regards
> Alexander Aristov
>
>
> On 28 December 2011 19:12, Tanguy Moal  wrote:
>
> > Hello Alexander,
> >
> > I don't know much about your requirements in terms of size and
> > performances, but I've had a similar use case and found a pretty simple
> > workaround.
> > If your duplicate rate is not too high, you can have the
> > SignatureProcessor to generate fingerprint of documents (you already did
> > that).
> >
> > Simply turn off overwriting of duplicates; you can then rely on solr's
> > grouping / field collapsing to group your search results by fingerprints.
> > You'll then have one document group per "real" document. You can use
> > group.sort to sort your groups by indexing date ascending, and
> > group.limit=1 to keep only the oldest one.
> > You can even use group.format = simple to serve results as if no
> > collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to
> > get the real number of deduplicated documents.
> >
> > Of course the index will be larger, as I said, I made no assumption
> > regarding your operating requirements. And search can be a bit slower,
> > depending on the average rate of duplicated documents.
> > But you've got your issue addressed by configuration tuning only...
> > Depending on your project's sizing, it could be time saving.
> >
> > The advantage is that you have the precious information of what content
> is
> > duplicated from where :-)
> >
> > Hope this helps,
> >
> > --
> > Tanguy
> >
> > On 28/12/2011 15:45, Alexander Aristov wrote:
> >
> >  Thanks Eric,
> >>
> >> it sets me direction. I will be writing new plugin and will get back to
> >> the
> >> dev forum with results and then we will decide next steps.
> >>
> >> Best Regards
> >> Alexander Aristov
> >>
> >>
> >> On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>  Well, the short answer is that nobody else has
> >>> 1>  had a similar requirement
> >>> AND
> >>> 2>  not found a suitable work around
> >>> AND
> >>> 3>  implemented the change and contributed it back.
> >>>
> >>> So, if you'd like to volunteer.
> >>>
> >>> Seriously. If you think this would be valuable and are
> >>> willing to work on it, hop on over to the dev list and
> >>> discuss it, open a JIRA and make it work. I'd start
> >>> by opening a discussion on the dev list before
> >>> opening a JIRA, just to get a sense of where the
> >>> snags would be to changing the Solr code, but that's
> >>> optional.
> >>>
> >>> That said, writing your own update request handler
> >>> that detected this case isn't very difficult,
> >>> extend UpdateRequestProcessorFactory/**UpdateRequestProcessor
> >>> and use it as a plugin.
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
> >>>   wrote:
> >>>
>  the problem with dedupe (SignatureUpdateProcessor ) is that it
> REPLACES
> 
> >>> old
> >>>
>  docs. I have tried it already.
> 
>  Best Regards
>  Alexander Aristov
> 
> 
>  On 28 December 2011 13:04, Lance Norskog  wrote:
> 
>   The SignatureUpdateProcessor is for exactly this problem:
> >
> >
> >
> >  http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >>>
>  On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >   wrote:
> >
> >> I get docs from external sources and the only place I keep them is
> >>
> > solr
> >>>
>  index. I have no a database or other means to track indexed docs (my
> >> personal oppinion is that it might be a huge headache).
> >>
> >> Some docs might change slightly in there original sources but I
> don't
> >>
> > need
> >
> >> that changes. In fact I need original data only.

Re: High response time after being idle

2011-12-28 Thread Odey
It seems like my operating system was causing me trouble in some way. I
couldn't find what was triggering the issue, but after migrating the whole
project from WAMP to LAMP it has been resolved and everything is running
smoothly again.

Thank you very much for your help!

Regards,

--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-response-time-after-being-idle-tp3616599p3618096.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Yes, I have been warned that querying the index each time before adding a doc
might be resource consuming. I will check it.

As for the overwrite parameter, I think the name is not the best then.
People outside the "business", like me, misuse it and assume what I wrote;
overwrite should mean what it says.

But I now understand what it actually does, and so my way forward is to write
a custom update processor plugin.

Best Regards
Alexander Aristov


On 28 December 2011 22:16, Chris Hostetter  wrote:

>
> : That said, writing your own update request handler
> : that detected this case isn't very difficult,
> : extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> : and use it as a plugin.
>
> i can't find the thread at the moment, but the general issue that has
> caused people headaches with this type of approach in the past has been
> that the performance of doing a query on every update (to see if the doc
> is already in the index) can slow things down quite a bit -- in your
> use case it may not be a significant bottleneck, but that's the general
> issue that has come up in the past.
>
> If you look at systems (like nutch) that do large scale crawling, they
> treat the crawl phase as independent from the indexing phase precisely for
> reasons like this -- so the crawler can dedup the documents (by unique
> URL) and eliminate duplication before ever even adding them to the index.
>
> : >> > I wonder why simple the overwrite parameter doesn't work here.
>...
> : >> > 2. overwrite=false and uniqueID exists then newer doc must be
> skipped
> : >> since
> : >> > old exists.
>
> that is not what overwrite=false does (or was ever designed to do).
> overwrite=false is a way to tell Solr that you are already certain that
> the documents being added do not exist in the index, therefore Solr can
> save time by not attempting to overwrite an existing document.  It is
> intended for situations where you are bulk loading documents, ie: doing an
> initial build of an index from a system of record (ie: a single pass over
> a database that uses the same unique key) or importing documents from a
> new system of record with a completely different id space.
>
>
>
> -Hoss
>


Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Unfortunately I have a lot of duplicates, and given that searching might
suffer, I will try implementing an update processor.

But your idea is interesting and I will consider it, thanks.

Best Regards
Alexander Aristov


On 28 December 2011 19:12, Tanguy Moal  wrote:

> Hello Alexander,
>
> I don't know much about your requirements in terms of size and
> performances, but I've had a similar use case and found a pretty simple
> workaround.
> If your duplicate rate is not too high, you can have the
> SignatureProcessor to generate fingerprint of documents (you already did
> that).
>
> Simply turn off overwriting of duplicates; you can then rely on solr's
> grouping / field collapsing to group your search results by fingerprints.
> You'll then have one document group per "real" document. You can use
> group.sort to sort your groups by indexing date ascending, and
> group.limit=1 to keep only the oldest one.
> You can even use group.format = simple to serve results as if no
> collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to
> get the real number of deduplicated documents.
>
> Of course the index will be larger, as I said, I made no assumption
> regarding your operating requirements. And search can be a bit slower,
> depending on the average rate of duplicated documents.
> But you've got your issue addressed by configuration tuning only...
> Depending on your project's sizing, it could be time saving.
>
> The advantage is that you have the precious information of what content is
> duplicated from where :-)
>
> Hope this helps,
>
> --
> Tanguy
>
> On 28/12/2011 15:45, Alexander Aristov wrote:
>
>  Thanks Eric,
>>
>> it sets me direction. I will be writing new plugin and will get back to
>> the
>> dev forum with results and then we will decide next steps.
>>
>> Best Regards
>> Alexander Aristov
>>
>>
>> On 28 December 2011 18:08, Erick Erickson wrote:
>>
>>  Well, the short answer is that nobody else has
>>> 1>  had a similar requirement
>>> AND
>>> 2>  not found a suitable work around
>>> AND
>>> 3>  implemented the change and contributed it back.
>>>
>>> So, if you'd like to volunteer.
>>>
>>> Seriously. If you think this would be valuable and are
>>> willing to work on it, hop on over to the dev list and
>>> discuss it, open a JIRA and make it work. I'd start
>>> by opening a discussion on the dev list before
>>> opening a JIRA, just to get a sense of where the
>>> snags would be to changing the Solr code, but that's
>>> optional.
>>>
>>> That said, writing your own update request handler
>>> that detected this case isn't very difficult,
>>> extend UpdateRequestProcessorFactory/**UpdateRequestProcessor
>>> and use it as a plugin.
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
>>>   wrote:
>>>
 the problem with dedupe (SignatureUpdateProcessor ) is that it REPLACES

>>> old
>>>
 docs. I have tried it already.

 Best Regards
 Alexander Aristov


 On 28 December 2011 13:04, Lance Norskog  wrote:

  The SignatureUpdateProcessor is for exactly this problem:
>
>
>
>  http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
>>>
 On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
>   wrote:
>
>> I get docs from external sources and the only place I keep them is
>>
> solr
>>>
 index. I have no a database or other means to track indexed docs (my
>> personal oppinion is that it might be a huge headache).
>>
>> Some docs might change slightly in there original sources but I don't
>>
> need
>
>> that changes. In fact I need original data only.
>>
>> So I have no other ways but to either check if a document is already
>>
> in
>>>
 index before I put it to solrj array (read - query solr) or develop my
>>
> own
>
>> update chain processor and implement ID check there and skip such
>>
> docs.
>>>
 Maybe it's wrong place to aguee and probably it's been discussed
>>
> before
>>>
 but
>
>> I wonder why simple the overwrite parameter doesn't work here.
>>
>> My oppinion it perfectly suits here. In combination with unique ID it
>>
> can
>>>
 cover all possible variants.
>>
>> cases:
>>
>> 1. overwrite=true and uniquID exists then newer doc should overwrite
>>
> the
>>>
 old one.
>>
>> 2. overwrite=false and uniqueID exists then newer doc must be skipped
>>
> since
>
>> old exists.
>>
>> 3. uniqueID doesn't exist then newer doc just gets added regardless if
>>
> old
>
>> exists or not.
>>
>>
>> Best Regards
>> Alexander Aristov
>>
>>
>> On 27 December 2011 22:53, Erick Erickson wrote:
>

Re: Solr Distributed Search vs Hadoop

2011-12-28 Thread Ted Dunning
This copying is a bit overstated here because of the way that small
segments are merged into larger segments.  Those larger segments are then
copied much less often than the smaller ones.

While you can wind up with lots of copying in certain extreme cases, it is
quite rare.  In particular, if you have one of the following cases, you
won't see very many copies for any particular document:

- you don't delete files one at a time (i.e. indexing only without updates
or deletion)

or

- most documents that are going to be deleted are deleted as young documents

or

- the probability that any particular document will be deleted in a fixed
period of time decreases exponentially with the age of the documents

Any of these characteristics or many others will prevent a file from being
copied very many times because as the document ages, it keeps company with
similarly aged documents which are accordingly unlikely to have enough
compatriots deleted to make their segment have a small number of live
documents in it.  Put another way, the intervals between merges that a
particular document undergoes will become longer and longer as it ages and
thus the total number of copies it can undergo cannot grow very fast.

On Wed, Dec 28, 2011 at 7:53 PM, Lance Norskog  wrote:

> ...
> One problem with indexing is that Solr continually copies data into
> "segments" (index parts) while you index. So, each 5MB PDF might get
> copied 50 times during a full index job. If you can strip the index
> down to what you really want to search on, terabytes become gigabytes.
> Solr seems to handle 100g-200g fine on modern hardware.
>
>


Re: Custom Shingle Factory Filter Requirement

2011-12-28 Thread Vannia Rajan
On Tue, Dec 27, 2011 at 1:10 PM, Ahmet Arslan  wrote:

>
> To achieve this behavior, you can use StandardTokenizerFactory and
> EdgeNGramFilterFactory and LowerCaseFilterFactory at index time.
>
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
>

Thanks, but I ended up implementing a custom Transformer & used it as a
DataImport plugin (i used RegexTransformer's source-code as a reference).

This also helped me to merge another fields value to the current field in
the way i need.

-- 
Thanks,
Vanniarajan
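
For reference, the index-time analysis chain Ahmet suggested would look roughly
like this in schema.xml (fieldType name and gram sizes are arbitrary choices,
not from Vanniarajan's setup):

  <fieldType name="text_edge" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>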


Re: Solr Distributed Search vs Hadoop

2011-12-28 Thread Lance Norskog
Here is an example of schema design: a PDF file of 5MB might have
maybe 50k of actual text. The Solr ExtractingRequestHandler will find
that text and only index that. If you set the field to stored=true,
the 5mb will be saved. If saved=false, the PDF is not saved. Instead,
you would store a link to it.

One problem with indexing is that Solr continually copies data into
"segments" (index parts) while you index. So, each 5MB PDF might get
copied 50 times during a full index job. If you can strip the index
down to what you really want to search on, terabytes become gigabytes.
Solr seems to handle 100g-200g fine on modern hardware.

Lance
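
To make the trade-off concrete, a schema.xml sketch along these lines (field
and type names are illustrative, not from Lance's setup) indexes the extracted
text without storing it, and stores only a pointer back to the original file:

  <field name="content"  type="text_general" indexed="true"  stored="false"/>
  <field name="file_url" type="string"       indexed="false" stored="true"/>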

On Fri, Dec 23, 2011 at 1:54 AM, Nick Vincent  wrote:
> For data of this size you may want to look at something like Apache
> Cassandra, which is made specifically to handle data at this kind of
> scale across many machines.
>
> You can still use Hadoop to analyse and transform the data in a
> performant manner, however it's probably best to do some research on
> this on the relevant technical forums for those technologies.
>
> Nick



-- 
Lance Norskog
goks...@gmail.com


Re: Migration from Solr 1.4 to Solr 3.5

2011-12-28 Thread Lance Norskog
Yes, the 3.5 Solr is opening and reading the Solr 1.4 index. When you
do a commit, it will rewrite the index in 3.5 format.

Doing a complete copy of the configs from 1.4 to 3.5 is easy, but
there are a lot of new features and changed defaults in the
solrconfig.xml file. These make indexing faster, introduce better
memory management and a lot more. For your production upgrade you
should translate your local changes into a fresh 3.5 instance.

Lance
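
For what it's worth, such a commit can be issued over HTTP against the upgraded
core, e.g. (core name taken from Bhavnik's example below; otherwise just a
sketch):

  curl 'http://myserver:8080/solr/Solr_3.5_Instance/update' \
       -H 'Content-type: text/xml; charset=utf-8' --data-binary '<commit/>'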

On Wed, Dec 28, 2011 at 5:23 AM, Bhavnik Gajjar  wrote:
> Thanks community! That helps!
>
> To check practically, I have now set up Solr 3.5 in a test environment. A few
> observations on that:
>
>
>   1. I simply copy-pasted one of the Solr 1.4 instances onto the Solr 3.5 setup
>   (after correcting the schema.xml and solrconfig.xml files based on what is
>   suited for 3.5). If I do a query like
>   
> http://myserver:8080/solr/Solr_3.5_Instance/select?q=test&shards=myserver:8080/solr/Solr_3.5_Instance,
> myserver:8080/solr/Solr_1.4_Instance,
>   then it works OK! So now I'm wondering: the index format has changed after
>   Solr 1.4, and hence I was expecting the above search to fail. Am I correct?
>   2. Continuing the above point, I guess if I need to use a new feature which
>   didn't exist in Solr 1.4, but exists in Solr 3.5, then this hybrid (1.4 and
>   3.5 Solr instances) setup won't work. Any thoughts?
>   3. I heard that the first commit would convert the Solr 1.4 index to the new
>   format in the Solr 3.5 setup. Is that so?
>   4. Are there any migration tools (or any other means?) available that
>   would convert old indexes (1.4) to the new format (3.5)?
>
>
> Kind regards,
>
> Bhavnik
>
>
>  Original Message 
>
> To supplement the responses you have already gotten: All servers involved
> in a distributed query, including the one that is accessed and all the
> shards that are accessed from it, must run the same Javabin version.  Solr
> 1.4.1 and earlier use javabin version 1 and everything newer uses javabin
> version 2.  What you are proposing above will not work.
>
> Hopefully you have two complete sets of servers, for redundancy.  It would
> be a good idea to upgrade one server set, then upgrade the other.
> SOLR-2204 is in the works to make it possible to have these versions work
> together.  I don't think it's been committed yet.
>
> Thanks,
> Shawn
>
>>
>>
>>  Subject: Re: Migration from Solr 1.4 to Solr 3.5
>>  Date: Fri, 23 Dec 2011 10:58:43 -0800
>>  From: Siva Kommuri
>>  Reply-To: solr-user@lucene.apache.org
>>  To: solr-user@lucene.apache.org
>>  CC: solr-user@lucene.apache.org
>> 
>>
>> One migration strategy is to fall back to XML parser from the javabin 
>> parser, upgrade Solrj jars to 3.4, turn off replication, upgrade master, 
>> upgrade each of the slaves while turning on replication. Once all slaves 
>> have been upgraded/replication turned on - switch back to javabin parser.
>>
>> Best wishes,
>> Siva on 3GS
>>
>> On Dec 23, 2011, at 7:52, Erick Erickson  
>>  wrote:
>>
>> > Have you looked at CHANGES.txt in ? It has upgrade
>> > instructions for every release. Note that in general, newer Solr will 
>> > *read*
>> > an older index (one major revision back. i.e. 3.x should read 1.x, but 4.x
>> > will not read 1.x. Note also that there was no 2.x solr).
>> >
>> > The cautions in the upgrade notes are really about making sure that an
>> > index *produced* with 3.x is not *read* by 1.4, i.e. don't upgrade the
>> > master before the slave.
>> >
>> > I *think* that as long as you upgrade *all* slaves before upgrading the
>> > master, you'll be fine. And I also believe that you can upgrade only some
>> > of the slaves. Each of the slaves, even if only some of them are
>> > upgraded, are reading a 1.4 index even after replications.
>> >
>> > But I'd test first. And if you can re-index, that would actually be the 
>> > best
>> > solution. However, as above you can't reindex until *all* the slaves
>> > are upgraded.
>> >
>> > Best
>> > Erick
>> >
>> > On Fri, Dec 23, 2011 at 7:41 AM, Bhavnik Gajjar  
>> >  wrote:
>> >> Greetings,
>> >>
>> >> We are planning to migrate from Solr 1.4 to Solr 3.5 (or, even new Solr
>> >> version than 3.5, when available) in coming days. There are few questions
>> >> about this migration.
>> >>
>> >>
>> >> • I heard, index format is changed in this migration. So, does this 
>> >> require
>> >> me to reindex millions of data?
>> >>
>> >> • Are there any migration tool (or any other means?) available that would
>> >> convert old indexes (1.4) to new format (3.5)?
>> >>
>> >> • Consider this case.
>> >> http://myserver:8080/solr/mainindex/select/?q=solr&start=0&rows=10&shards=myserver:8080/solr/index1,myserver:8080/solr/mainindex,remoteserver:8080/solr/remotedata.
>> >> In this example, consider that 'myserver' has been upgraded with Solr 3.5,
>> >> but 'remoteserver' is still using Solr 1.4. The question is, would data
>> >> from remoteserver's Solr instance come/parsed fine or, would it cause
>> >> issues? If it results into issue

Re: Sort facets by defined custom Collator

2011-12-28 Thread Chris Hostetter

: Subject: Sort facets by defined custom Collator

deja-vu...

http://www.lucidimagination.com/search/p:solr/s:email/l:user/sort:date?q=%22Facet+Ordering%22

-Hoss


Re: Facet Ordering

2011-12-28 Thread Chris Hostetter

: I've seen in the solr faceting overview that it is possible to sort
: either by count or lexicographically, but is there a way to sort so
: the lowest counts come back first?

Peter Sturge looked into this a while back and provided a patch, but there 
were some issues with it that never got resolved (in particular, it didn't 
work for several of the faceting code paths).  If you are interested in 
helping to add this functionality, resurrecting that patch might be a good 
place to start...

https://issues.apache.org/jira/browse/SOLR-1672

-Hoss


Re: Facet Ordering

2011-12-28 Thread Jamie Johnson
I have a database where a user is searching for documents, and the
things which I'm faceting on are tags.  Tags boil down to things of
interest, perhaps names, places, etc.  The user in our case has asked
for the ability to change the ordering so they can easily find things
that appear very infrequently, perhaps there is some nugget in that
data which they'd not have seen before.

The real problem I have with the use case though is there is no real
way to know if the nugget is in the first 50 results, the last 50
results or somewhere in between.

On Wed, Dec 28, 2011 at 8:52 PM, Koji Sekiguchi  wrote:
> (11/12/29 5:50), Jamie Johnson wrote:
>>
>> I've seen in the solr faceting overview that it is possible to sort
>> either by count or lexicographically, but is there a way to sort so
>> the lowest counts come back first?
>>
>
> As far as I know, no. What is your use case?
>
> koji
> --
> http://www.rondhuit.com/en/


Re: Facet Ordering

2011-12-28 Thread Koji Sekiguchi

(11/12/29 5:50), Jamie Johnson wrote:

I've seen in the solr faceting overview that it is possible to sort
either by count or lexicographically, but is there a way to sort so
the lowest counts come back first?



As far as I know, no. What is your use case?

koji
--
http://www.rondhuit.com/en/


Re: High response time after being idle

2011-12-28 Thread Chris Hostetter

: Is it possible that the system is running out of RAM, and swapping,
: or is aggressively swapping for some reason?

it doesn't have to be the solr /tomcat process memory getting swapped out 
-- but that's certainly possible -- it could also be that the filesystem 
cache is expunging the disk pages used to hold the index as other 
processes need ram.  So even if your system isn't using any swap, you 
might still see this type of behavior if you are running other processes 
on the same machine and don't have enough free ram available for the 
filesystem cache to keep the index in memory.

check your io stats as well as your swap info.
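
On Linux, that typically means something like (generic commands, not specific
to this setup):

  free -m        # RAM and swap usage
  swapon -s      # what is swapped where
  vmstat 5       # si/so columns show swap-in/swap-out activity
  iostat -x 5    # per-device I/O load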


-Hoss


Re: High response time after being idle

2011-12-28 Thread Otis Gospodnetic
Right, I think that's what's happening here.
Google "swapiness" if you are on Linux.

Alternatively, one could add something to prevent the OS from swapping out 
Solr's process.  Here is how ElasticSearch does it, for example: 
https://github.com/elasticsearch/elasticsearch/issues/464

Otis 


Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html
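
The knob Otis refers to is the kernel's vm.swappiness setting; on Linux it can
be inspected and lowered like this (the value is illustrative):

  cat /proc/sys/vm/swappiness
  sudo sysctl vm.swappiness=10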



>
> From: Erick Erickson 
>To: solr-user@lucene.apache.org 
>Sent: Wednesday, December 28, 2011 5:07 PM
>Subject: Re: High response time after being idle
> 
>What else, if anything, do you have running on the server?
>Because it's possible that pages are being swapped out
>for other processes to use.
>
>Solr itself shouldn't, as far as I know, time out anything so I
>expect you're running into issues with the op system.
>
>Best
>Erick
>
>On Wed, Dec 28, 2011 at 10:25 AM, Gora Mohanty  wrote:
>> On Wed, Dec 28, 2011 at 8:52 PM, Odey  wrote:
>>> Hello,
>>>
>>> I'm running Solr 3.5 on a XAMPP/Tomcat environment. It's working pretty good
>>> for just one exception: when Solr remains idle without handling any requests
>>> for about 5-10 mins the first request sent again will be delayed for a few
>>> seconds. Subsequent requests are lightning-fast as usual. So it seems to me
>>> like something important has been shut down while staying idle and has to be
>>> reloaded.
>> [...]
>>
>> Is it possible that the system is running out of RAM, and swapping,
>> or is aggressively swapping for some reason?
>>
>> Regards,
>> Gora
>
>
>

Re: Custom Solr FunctionQuery Error

2011-12-28 Thread Yonik Seeley
On Wed, Dec 28, 2011 at 2:16 AM, Parvin Gasimzade
 wrote:
> I have created custom Solr FunctionQuery in Solr 3.4.
> I extended ValueSourceParser, ValueSource, Query and QParserPlugin classes.

Note that you only need a QParserPlugin implementation for top level
query types, not function queries.
With just a ValueSourceParser and ValueSource implementation, you can
use the custom function as a function query.

Example:
q={!func}graph("myparameter")

-Yonik
http://www.lucidimagination.com


Re: Custom Solr FunctionQuery Error

2011-12-28 Thread Juan Grande
Hi Parvin,

You must also add the query parser definition to solrconfig.xml, for
example:

<queryParser name="graph" class="org.gasimzade.solr.GraphQParserPlugin"/>

*Juan*



On Wed, Dec 28, 2011 at 4:16 AM, Parvin Gasimzade <
parvin.gasimz...@gmail.com> wrote:

> Hi all,
>
> I have created custom Solr FunctionQuery in Solr 3.4.
> I extended ValueSourceParser, ValueSource, Query and QParserPlugin classes.
>
> I set the name parameter as "graph" inside GraphQParserPlugin class.
>
> But when try to search i got an error. Search queries are
>
> http://localhost:8080/solr/select/?q={!graph}test
> http://localhost:8080/solr/select/?q=test&defType=graph
>
> I also added <valueSourceParser name="graph"
> class="org.gasimzade.solr.GraphValueSourceParser"/> into
> solrconfig.xml but I got the same error...
>
> Error message is :
>
> Dec 27, 2011 7:05:20 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Unknown query type 'graph'
>  at org.apache.solr.core.SolrCore.getQueryPlugin(SolrCore.java:1517)
> at org.apache.solr.search.QParser.getParser(QParser.java:316)
>  at
>
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:80)
> at
>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
>  at
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
>  at
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>  at
>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> at
>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>  at
>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>  at
>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> at
>
> org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
>  at
>
> org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76)
> at
>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>  at
>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> at
>
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>  at
>
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:185)
> at
>
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
>  at
>
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:151)
> at
>
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
>  at
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:929)
> at
>
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>  at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
> at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:269)
>  at
>
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
> at
>
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:300)
>  at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>  at java.lang.Thread.run(Thread.java:679)
>
> Thank you for your help.
>
> Best Regards,
> Parvin
>


Re: High response time after being idle

2011-12-28 Thread Erick Erickson
What else, if anything, do you have running on the server?
Because it's possible that pages are being swapped out
for other processes to use.

Solr itself shouldn't, as far as I know, time out anything so I
expect you're running into issues with the op system.

Best
Erick

On Wed, Dec 28, 2011 at 10:25 AM, Gora Mohanty  wrote:
> On Wed, Dec 28, 2011 at 8:52 PM, Odey  wrote:
>> Hello,
>>
>> I'm running Solr 3.5 on a XAMPP/Tomcat environment. It's working pretty good
>> for just one exception: when Solr remains idle without handling any requests
>> for about 5-10 mins the first request sent again will be delayed for a few
>> seconds. Subsequent requests are lightning-fast as usual. So it seems to me
>> like something important has been shut down while staying idle and has to be
>> reloaded.
> [...]
>
> Is it possible that the system is running out of RAM, and swapping,
> or is aggressively swapping for some reason?
>
> Regards,
> Gora


Re: Grouping results after Sorting or vice-versa

2011-12-28 Thread Juan Grande
Hi,

I don't have an answer, but maybe I can help you if you provide more
information, for example:

- Which Solr version are you running?
- Which is the type of the date field?
- The output you are getting
- The output you expect
- Any other information that you consider relevant.

Thanks,

*Juan*



On Wed, Dec 28, 2011 at 5:12 AM, vijayrs  wrote:

> The issue i'm facing is... I didn't get the expected results when i combine
> "group" param and "sort" param.
>
> The query is...
>
>
> http://localhost:8080/solr/core1/select/?qt=nutch&q=*:*&fq=userid:333&group=true&group.field=threadid&group.sort=date%20desc&sort=date%20desc
>
> where "threadid" is a hexadecimal string which is common for more than 1
> message, and "date"  is in unix timestamp format.
>
> The results should be sorted based on "date" and also grouped by
> "threadid"... how it can be done?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Grouping-results-after-Sorting-or-vice-versa-tp3615957p3615957.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Facet Ordering

2011-12-28 Thread Jamie Johnson
I've seen in the solr faceting overview that it is possible to sort
either by count or lexicographically, but is there a way to sort so
the lowest counts come back first?
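
For context, the two orderings that do exist are selected with facet.sort
(the field name "tags" is just an example):

  facet=true&facet.field=tags&facet.sort=count    (highest counts first, the default)
  facet=true&facet.field=tags&facet.sort=index    (lexicographic / index order)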


Re: edismax doesn't obey 'pf' parameter

2011-12-28 Thread Chris Hostetter

: Of course. What I meant to say was there is
: always exactly one token in a non-tokenized
: field and it's offset is always exactly 0. There
: will never be tokens at position 1.
: 
: So asking to match phrases, which is based on
: term positions is basically a no-op.

That's not always true.

consider a situation where you have a multivalued "author_exact" field 
containing the author's full name as a literal string -- either using 
StrField or TextField w/keywordTokenizer; and it's copyFielded from an 
"author" field which is similar but tokenized.

So if a document contains the following two values in the author field...
"David Smiley"
"Eric Pugh"

then that document should be matched by all three of these queries...

defType=edismax&q=David&qf=author&pf=author_exact
defType=edismax&q=David+Pugh&qf=author&pf=author_exact
defType=edismax&q=David+Smiley&qf=author&pf=author_exact

...but it should score *really* high for that last query because it not 
only matches on the author field, but it also gets an exact match on the 
entire query string as an implicit phrase in the author_exact field.

Dismax does behave this way, as you can see using the 3.5 example configs 
& data (note that "cat" is a StrField)...

http://localhost:8983/solr/select/?debugQuery=true&defType=dismax&qf=name^5+features^3&pf=features^2+cat^4&q=hard+drive

  +((DisjunctionMaxQuery((features:hard^3.0 | name:hard^5.0)) 
 DisjunctionMaxQuery((features:drive^3.0 | name:drive^5.0))
)~2) 
   DisjunctionMaxQuery((features:"hard drive"^2.0 | cat:hard drive^4.0))


But for some reason EDismax doesn't behave similarly...

http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=name^5+features^3&pf=features^2+cat^4&q=hard+drive

  +((DisjunctionMaxQuery((features:hard^3.0 | name:hard^5.0)) 
 DisjunctionMaxQuery((features:drive^3.0 | name:drive^5.0))
)~2) 
   DisjunctionMaxQuery((features:"hard drive"^2.0))

...that definitely seems like a bug to me.  but it's not entirely clear 
why it's happening (the pf related code in edismax is kind of hairy)

https://issues.apache.org/jira/browse/SOLR-2988

-Hoss


Re: FTP mount crash when crawling with solrj

2011-12-28 Thread Chris Hostetter

: I have a lot of files in my FTP account, and I use curlftpfs to mount
: them to a folder and then start indexing them with the solrj API, but after a
: few minutes something strange happens and the mounted folder is not accessible
: and crashes; also I cannot unmount it and the message "device is in use" appears.
: My solrj code is OK and I tested it with my local files and the result is
: great, but indexing the mounted folder is my terrible problem. I mention that I
: use curlftpfs with CentOS, Fedora and Ubuntu, but the result of
: crashing is the same. How can I fix this problem? Is the problem with my
: code? Has somebody ever faced this problem when indexing a mounted folder?

No one can guess if something is wrong with your code (or what the problem 
might be) since you haven't posted any code.  

Since the problem seems to be entirely related to crawling this special 
mount, independent of submitting the docs to Solr, i would suggest that 
you tweak your code to just do the crawling, and write log files about 
what it finds (w/o ever sending the data to Solr), and see if that still 
has the problem -- if it does, then it sounds like your problem is 
definitely related to your own code, and has nothing to do with Solr.

off the cuff, i would guess that maybe you aren't closing all the 
files/filehandles that you open.

-Hoss


Re: Solr-3.5.0/Nutch-1.4 - SolrDeleteDuplicates fails

2011-12-28 Thread Chris Hostetter

: Exception in thread "main" java.io.IOException: Job failed!
: 
: at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
: 
: at
: 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
: 
: at
: 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
: 
: at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)


Since SolrDeleteDuplicates is a nutch class (note the package: 
org.apache.nutch) you'll need to ask the folks on the nutch list about 
this error.

-Hoss


Re: XPathEntityProcessor and ExtractingRequestHandler

2011-12-28 Thread Chris Hostetter

: Can I use a XPathEntityProcessor in conjunction with an
: ExtractingRequestHandler?  Also, the scripting language that
: XPathEntityProcessor uses/supports, is that just ECMA/JavaScript?
: 
: Or is XPathEntityProcessor only supported for use in conjuntion with the
: DataImportHandler?

The Entity processors are a specifc feature of DIH, but 
ExtractingRequestHandler does have some options for pulling out specific 
pieces of the DOM produced by Tika.

-Hoss


Re: LineEntityProcessor

2011-12-28 Thread Chris Hostetter

You really haven't posted enough details for people to guess as to what 
your problem might be (in particular: the actual examples of your configs, 
and any log messages during the import)

please consult this wiki page and then post a followup with more 
details...

https://wiki.apache.org/solr/UsingMailingLists


: Hello everybody,
: 
: I'm trying to use LineEntityProcessor of DIH but somehow without success.
: 
: I've create data-lep-config.xml, added request handler in solrconfig.xml.
: 
: During full-import I get a response saying that x rows were fetched, 0 docs
: added/updated.
: 
: I also defined a very basic regex for the RegexTransformer. So, what's wrong? Why
: could the fetched rows not be added to the index?
: 
: Thanks in advance,
: 
: Oleg
: 

-Hoss


Re: solr keep old docs

2011-12-28 Thread Chris Hostetter

: That said, writing your own update request handler
: that detected this case isn't very difficult,
: extend UpdateRequestProcessorFactory/UpdateRequestProcessor
: and use it as a plugin.

i can't find the thread at the moment, but the general issue that has 
caused people headaches with this type of approach in the past has been 
that the performance of doing a query on every update (to see if the doc 
is already in the index) can slow things down quite a bit -- in your 
use case it may not be a significant bottleneck, but that's the general 
issue that has come up in the past.

If you look at systems (like nutch) that do large scale crawling, they 
treat the crawl phase as independent from the indexing phase precisely for 
reasons like this -- so the crawler can dedup the documents (by unique 
URL) and eliminate duplication before ever even adding them to the index.

: >> > I wonder why simple the overwrite parameter doesn't work here.
...
: >> > 2. overwrite=false and uniqueID exists then newer doc must be skipped
: >> since
: >> > old exists.

that is not what overwrite=false does (or was ever designed to do).  
overwrite=false is a way to tell Solr that you are already certain that 
the documents being added do not exist in the index, therefore Solr can 
save time by not attempting to overwrite an existing document.  It is 
intended for situations where you are bulk loading documents, ie: doing an 
initial build of an index from a system of record (ie: a single pass over 
a database that uses the same unique key) or importing documents from a 
new system of record with a completely different id space.



-Hoss


Re: High response time after being idle

2011-12-28 Thread Gora Mohanty
On Wed, Dec 28, 2011 at 8:52 PM, Odey  wrote:
> Hello,
>
> I'm running Solr 3.5 on a XAMPP/Tomcat environment. It's working pretty good
> for just one exception: when Solr remains idle without handling any requests
> for about 5-10 mins the first request sent again will be delayed for a few
> seconds. Subsequent requests are lightning-fast as usual. So it seems to me
> like something important has been shut down while staying idle and has to be
> reloaded.
[...]

Is it possible that the system is running out of RAM, and swapping,
or is aggressively swapping for some reason?

Regards,
Gora


High response time after being idle

2011-12-28 Thread Odey
Hello,

I'm running Solr 3.5 on a XAMPP/Tomcat environment. It's working pretty good
for just one exception: when Solr remains idle without handling any requests
for about 5-10 mins the first request sent again will be delayed for a few
seconds. Subsequent requests are lightning-fast as usual. So it seems to me
like something important has been shut down while staying idle and has to be
reloaded.

I'm pretty confused why this is happening, since even the first request after
restarting the server is a hundredfold faster than those ominous delayed
requests after being idle. 

I checked my logfiles - the query time itself is not affected and I couldn't
find any malfunctions or error messages.

Are there Solr-specific settings which could cause me trouble?

Any help would be appreciated. 

regards,

--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-response-time-after-being-idle-tp3616599p3616599.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr keep old docs

2011-12-28 Thread Tanguy Moal

Hello Alexander,

I don't know much about your requirements in terms of size and 
performances, but I've had a similar use case and found a pretty simple 
workaround.
If your duplicate rate is not too high, you can have the 
SignatureProcessor generate fingerprints of documents (you already did 
that).


Simply turn off overwriting of duplicates; you can then rely on solr's 
grouping / field collapsing to group your search results by 
fingerprints. You'll then have one document group per "real" document. 
You can use group.sort to sort your groups by indexing date ascending, 
and group.limit=1 to keep only the oldest one.
You can even use group.format = simple to serve results as if no 
collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) 
to get the real number of deduplicated documents.


Of course the index will be larger, as I said, I made no assumption 
regarding your operating requirements. And search can be a bit slower, 
depending on the average rate of duplicated documents.
But you've got your issue addressed by configuration tuning only... 
Depending on your project's sizing, it could be time saving.


The advantage is that you have the precious information of what content 
is duplicated from where :-)


Hope this helps,

--
Tanguy
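
A sketch of the kind of request Tanguy describes, assuming the fingerprint goes
into a field called "signature" and the indexing date into a field called
"timestamp" (both field names are assumptions):

  q=*:*&group=true&group.field=signature
       &group.sort=timestamp+asc&group.limit=1
       &group.format=simple&group.ngroups=true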

On 28/12/2011 15:45, Alexander Aristov wrote:

Thanks Eric,

it sets me direction. I will be writing new plugin and will get back to the
dev forum with results and then we will decide next steps.

Best Regards
Alexander Aristov


On 28 December 2011 18:08, Erick Erickson  wrote:


Well, the short answer is that nobody else has
1>  had a similar requirement
AND
2>  not found a suitable work around
AND
3>  implemented the change and contributed it back.

So, if you'd like to volunteer.

Seriously. If you think this would be valuable and are
willing to work on it, hop on over to the dev list and
discuss it, open a JIRA and make it work. I'd start
by opening a discussion on the dev list before
opening a JIRA, just to get a sense of where the
snags would be to changing the Solr code, but that's
optional.

That said, writing your own update request handler
that detected this case isn't very difficult,
extend UpdateRequestProcessorFactory/UpdateRequestProcessor
and use it as a plugin.

Best
Erick

On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
  wrote:

the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old
docs. I have tried it already.

Best Regards
Alexander Aristov


On 28 December 2011 13:04, Lance Norskog  wrote:


The SignatureUpdateProcessor is for exactly this problem:




http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication

On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
  wrote:

I get docs from external sources and the only place I keep them is the solr
index. I have no database or other means to track indexed docs (my
personal opinion is that it might be a huge headache).

Some docs might change slightly in their original sources but I don't need
those changes. In fact I need original data only.

So I have no other way but to either check if a document is already in the
index before I put it into the solrj array (read - query solr) or develop my
own update chain processor and implement an ID check there and skip such docs.

Maybe it's the wrong place to argue, and probably it's been discussed before,
but I wonder why simply the overwrite parameter doesn't work here.

My opinion is that it perfectly suits here. In combination with a unique ID it
can cover all possible variants.

cases:

1. overwrite=true and uniqueID exists: then the newer doc should overwrite the
old one.

2. overwrite=false and uniqueID exists: then the newer doc must be skipped
since the old one exists.

3. uniqueID doesn't exist: then the newer doc just gets added regardless of
whether an old one exists or not.


Best Regards
Alexander Aristov


On 27 December 2011 22:53, Erick Erickson wrote:

Mikhail is right as far as I know, the assumption built into Solr is that
duplicate IDs (when <uniqueKey> is defined) should trigger the old
document to be replaced.

what is your system-of-record? By that I mean what does your SolrJ
program do to send data to Solr? Is there any way you could just
*not* send documents that are already in the Solr index based on,
for instance, any timestamp associated with your system-of-record
and the last time you did an incremental index?

Best
Erick

On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
  wrote:

Hi

I am not using a database. All needed data is in the solr index; that's
why I want to skip excessive checks.

I will check DIH but I'm not sure if it helps.

I am fluent with Java and it's not a problem for me to write a class or so,
but I want to check first whether there are any ways (workarounds) to make
it work without coding, just by playing around with configuration and
params. I don't want to go away from the default solr implementation.

Best Regards
Alexander Aristov


On 27 December 2011 09:33, Mikhail Kh

Re: Poor performance on distributed search

2011-12-28 Thread Yonik Seeley
On Wed, Dec 28, 2011 at 5:47 AM, ku3ia  wrote:
> So, based on p.2) and on my previous research, I conclude that the more
> documents I want to retrieve, the slower the search is, and the main problem
> is the loop in the writeDocs method. Am I right? Can you advise something in
> this situation?

For the first phase in a distributed search, Solr must return the top
N ids (in your case 200).  It currently does this by loading stored
fields, which is slow.  A better approach is to store the "id" field
as a column stride field.

https://issues.apache.org/jira/browse/SOLR-2753

-Yonik
http://www.lucidimagination.com


Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Thanks Erick,

It gives me a direction. I will be writing a new plugin and will get back to the
dev forum with results, and then we will decide on next steps.

Best Regards
Alexander Aristov


On 28 December 2011 18:08, Erick Erickson  wrote:

> Well, the short answer is that nobody else has
> 1> had a similar requirement
> AND
> 2> not found a suitable work around
> AND
> 3> implemented the change and contributed it back.
>
> So, if you'd like to volunteer .
>
> Seriously. If you think this would be valuable and are
> willing to work on it, hop on over to the dev list and
> discuss it, open a JIRA and make it work. I'd start
> by opening a discussion on the dev list before
> opening a JIRA, just to get a sense of where the
> snags would be to changing the Solr code, but that's
> optional.
>
> That said, writing your own update request handler
> that detected this case isn't very difficult,
> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
> and use it as a plugin.
>
> Best
> Erick
>
> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
>  wrote:
> > the problem with dedupe (SignatureUpdateProcessor ) is that it REPLACES
> old
> > docs. I have tried it already.
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 28 December 2011 13:04, Lance Norskog  wrote:
> >
> >> The SignatureUpdateProcessor is for exactly this problem:
> >>
> >>
> >>
> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
> >>
> >> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
> >>  wrote:
> >> > I get docs from external sources and the only place I keep them is
> solr
> >> > index. I have no a database or other means to track indexed docs (my
> >> > personal oppinion is that it might be a huge headache).
> >> >
> >> > Some docs might change slightly in there original sources but I don't
> >> need
> >> > that changes. In fact I need original data only.
> >> >
> >> > So I have no other ways but to either check if a document is already
> in
> >> > index before I put it to solrj array (read - query solr) or develop my
> >> own
> >> > update chain processor and implement ID check there and skip such
> docs.
> >> >
> >> > Maybe it's wrong place to aguee and probably it's been discussed
> before
> >> but
> >> > I wonder why simple the overwrite parameter doesn't work here.
> >> >
> >> > My oppinion it perfectly suits here. In combination with unique ID it
> can
> >> > cover all possible variants.
> >> >
> >> > cases:
> >> >
> >> > 1. overwrite=true and uniquID exists then newer doc should overwrite
> the
> >> > old one.
> >> >
> >> > 2. overwrite=false and uniqueID exists then newer doc must be skipped
> >> since
> >> > old exists.
> >> >
> >> > 3. uniqueID doesn't exist then newer doc just gets added regardless if
> >> old
> >> > exists or not.
> >> >
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >> >
> >> > On 27 December 2011 22:53, Erick Erickson 
> >> wrote:
> >> >
> >> >> Mikhail is right as far as I know, the assumption built into Solr is
> >> that
> >> >> duplicate IDs (when  is defined) should trigger the old
> >> >> document to be replaced.
> >> >>
> >> >> what is your system-of-record? By that I mean what does your SolrJ
> >> >> program do to send data to Solr? Is there any way you could just
> >> >> *not* send documents that are already in the Solr index based on,
> >> >> for instance, any timestamp associated with your system-of-record
> >> >> and the last time you did an incremental index?
> >> >>
> >> >> Best
> >> >> Erick
> >> >>
> >> >> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >> >>  wrote:
> >> >> > Hi
> >> >> >
> >> >> > I am not using database. All needed data is in solr index that's
> why I
> >> >> want
> >> >> > to skip excessive checks.
> >> >> >
> >> >> > I will check DIH but not sure if it helps.
> >> >> >
> >> >> > I am fluent with Java and it's not a problem for me to write a
> class
> >> or
> >> >> so
> >> >> > but I want to check first  maybe there are any ways (workarounds)
> to
> >> make
> >> >> > it working without codding, just by playing around with
> configuration
> >> and
> >> >> > params. I don't want to go away from default solr implementation.
> >> >> >
> >> >> > Best Regards
> >> >> > Alexander Aristov
> >> >> >
> >> >> >
> >> >> > On 27 December 2011 09:33, Mikhail Khludnev <
> >> mkhlud...@griddynamics.com
> >> >> >wrote:
> >> >> >
> >> >> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >> >> >> alexander.aris...@gmail.com> wrote:
> >> >> >>
> >> >> >> > Hi people,
> >> >> >> >
> >> >> >> > I urgently need your help!
> >> >> >> >
> >> >> >> > I have solr 3.3 configured and running. I do uncremental
> indexing 4
> >> >> >> times a
> >> >> >> > day using bulk updates. Some documents are identical to some
> extent
> >> >> and I
> >> >> >> > wish to skip them, not to index.
> >> >> >> > But here is the problem as I could not find a way to tell solr
> >> ignore
> >> >> new
> >> >> >> > duplicate docs and keep old in

Re: Problems while searching in default field

2011-12-28 Thread Erick Erickson
Right, you were misled by the discussion for that patch:
the option you specified was NOT how the patch was
eventually implemented. Try reading this page instead:
http://wiki.apache.org/solr/MultitermQueryAnalysis

The short form is that with 3.6 (i.e. 3.x at this point) you
may not have to do anything at all, but read the writeups
for some background.

Actually, you don't have to build anything (sorry for the confusion!).
Just pull down the latest successful nightly build from here:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/

Best
Erick

On Wed, Dec 28, 2011 at 8:00 AM, mechravi25  wrote:
> Hi,
>
> Thanks a lot guys. I tried the following options
>
> 1.) Downloaded the  solr 3.5.0 version and updated the schema.xml file with
> the sample fields i have. I then tried to set the property
> "ignoreCaseForWildcards=true" for a field type as mentioned in the url given
> for the patch-2438, but got the error "invalid
> arguments:{ignoreCaseForWildcards=true}" while starting the server.
> I did not index it in the new solr version. I placed the index files created
> in the previos version of solr(1.4) and then tried to do the wild card
> search alone.
> Please let me know if I am missing something.
>
> 2.) I used the subversion feature in eclipse to check out the code from SVN.
> I then tried to do an ANT build in eclipse by running the build.xml present
> in the solr directory. The build was successfull(I never got any error
> message) but no WAR was generated. I looked in other folders in the project
> and noticed that there were more than one build.xml files in that project.
>
> Please let me know which file to run for generating the war and in which
> file i need to change the target location of the WAR file to be generated.
>
> Thanks.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problems-while-searching-in-default-field-tp3601047p3616376.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr keep old docs

2011-12-28 Thread Erick Erickson
Well, the short answer is that nobody else has
1> had a similar requirement
AND
2> not found a suitable workaround
AND
3> implemented the change and contributed it back.

So, if you'd like to volunteer...

Seriously. If you think this would be valuable and are
willing to work on it, hop on over to the dev list and
discuss it, open a JIRA and make it work. I'd start
by opening a discussion on the dev list before
opening a JIRA, just to get a sense of where the
snags would be to changing the Solr code, but that's
optional.

That said, writing your own update request processor
that detects this case isn't very difficult:
extend UpdateRequestProcessorFactory/UpdateRequestProcessor
and use it as a plugin.
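
A rough sketch of what such a processor could look like against the 3.x API
(the class name and the getFirstMatch-based existence check are illustrative
assumptions, not code from this thread):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // assumes a <uniqueKey> is defined in schema.xml
        String keyField = req.getSchema().getUniqueKeyField().getName();
        Object id = doc.getFieldValue(keyField);
        // Drop the add silently when a document with the same unique key is
        // already visible to the current searcher (i.e. already committed).
        if (id != null
            && req.getSearcher().getFirstMatch(new Term(keyField, id.toString())) != -1) {
          return;
        }
        super.processAdd(cmd); // otherwise continue down the normal chain
      }
    };
  }
}

The factory would be registered in an updateRequestProcessorChain in
solrconfig.xml, in front of RunUpdateProcessorFactory.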

Best
Erick

On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
 wrote:
> the problem with dedupe (SignatureUpdateProcessor ) is that it REPLACES old
> docs. I have tried it already.
>
> Best Regards
> Alexander Aristov
>
>
> On 28 December 2011 13:04, Lance Norskog  wrote:
>
>> The SignatureUpdateProcessor is for exactly this problem:
>>
>>
>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
>>
>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
>>  wrote:
>> > I get docs from external sources and the only place I keep them is solr
>> > index. I have no a database or other means to track indexed docs (my
>> > personal oppinion is that it might be a huge headache).
>> >
>> > Some docs might change slightly in there original sources but I don't
>> need
>> > that changes. In fact I need original data only.
>> >
>> > So I have no other ways but to either check if a document is already in
>> > index before I put it to solrj array (read - query solr) or develop my
>> own
>> > update chain processor and implement ID check there and skip such docs.
>> >
>> > Maybe it's wrong place to aguee and probably it's been discussed before
>> but
>> > I wonder why simple the overwrite parameter doesn't work here.
>> >
>> > My oppinion it perfectly suits here. In combination with unique ID it can
>> > cover all possible variants.
>> >
>> > cases:
>> >
>> > 1. overwrite=true and uniquID exists then newer doc should overwrite the
>> > old one.
>> >
>> > 2. overwrite=false and uniqueID exists then newer doc must be skipped
>> since
>> > old exists.
>> >
>> > 3. uniqueID doesn't exist then newer doc just gets added regardless if
>> old
>> > exists or not.
>> >
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>> >
>> > On 27 December 2011 22:53, Erick Erickson 
>> wrote:
>> >
>> >> Mikhail is right as far as I know, the assumption built into Solr is
>> that
>> >> duplicate IDs (when  is defined) should trigger the old
>> >> document to be replaced.
>> >>
>> >> what is your system-of-record? By that I mean what does your SolrJ
>> >> program do to send data to Solr? Is there any way you could just
>> >> *not* send documents that are already in the Solr index based on,
>> >> for instance, any timestamp associated with your system-of-record
>> >> and the last time you did an incremental index?
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>> >>  wrote:
>> >> > Hi
>> >> >
>> >> > I am not using database. All needed data is in solr index that's why I
>> >> want
>> >> > to skip excessive checks.
>> >> >
>> >> > I will check DIH but not sure if it helps.
>> >> >
>> >> > I am fluent with Java and it's not a problem for me to write a class
>> or
>> >> so
>> >> > but I want to check first  maybe there are any ways (workarounds) to
>> make
>> >> > it working without codding, just by playing around with configuration
>> and
>> >> > params. I don't want to go away from default solr implementation.
>> >> >
>> >> > Best Regards
>> >> > Alexander Aristov
>> >> >
>> >> >
>> >> > On 27 December 2011 09:33, Mikhail Khludnev <
>> mkhlud...@griddynamics.com
>> >> >wrote:
>> >> >
>> >> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
>> >> >> alexander.aris...@gmail.com> wrote:
>> >> >>
>> >> >> > Hi people,
>> >> >> >
>> >> >> > I urgently need your help!
>> >> >> >
>> >> >> > I have solr 3.3 configured and running. I do uncremental indexing 4
>> >> >> times a
>> >> >> > day using bulk updates. Some documents are identical to some extent
>> >> and I
>> >> >> > wish to skip them, not to index.
>> >> >> > But here is the problem as I could not find a way to tell solr
>> ignore
>> >> new
>> >> >> > duplicate docs and keep old indexed docs. I don't care that it's
>> new.
>> >> >> Just
>> >> >> > determine by ID that such document is in the index already and
>> that's
>> >> it.
>> >> >> >
>> >> >> > I use solrj for indexing. I have tried setting overwrite=false and
>> >> dedupe
>> >> >> > apprache but nothing helped me. I either have that a newer doc
>> >> overwrites
>> >> >> > old one or I get duplicate.
>> >> >> >
>> >> >> > I think it's a very simple and basic feature and it must exist.
>> What
>> >> did
>> >> >> I
>> >> >> > make wrong or didn't do?
>> >>

Re: How can I check if a more complex query condition matched?

2011-12-28 Thread Erick Erickson
There's no easy/efficient way that I know of to do this. Perhaps a better
question is what value this adds for your app and whether there is
a better way to convey this information. For instance, would
highlighting convey "enough" information to your user?

You're right that you don't want to enable debug in production; it
can take quite a long time (see the timings when you DO enable
debug).

Could you consider some kind of one-off when a user really, really
needs this info where you *do* add debugQuery=on but restrict
the query to a single document they're interested in?
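
For example, something like this (assuming the uniqueKey field is named "id";
THE_DOC_ID is a placeholder):

.../select?q=<the original query>&fq=id:THE_DOC_ID&rows=1&debugQuery=on

The fq narrows the result to the one document in question, so the per-document
explain output is only produced for that document.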

Best
Erick

On Wed, Dec 28, 2011 at 5:30 AM, Max  wrote:
> Thanks for your reply, I thought about using the debug mode, too, but
> the information is not easy to parse and doesnt contain everything I
> want. Furthermore I dont want to enable debug mode in production.
>
> Is there anything else I could try?
>
> On Tue, Dec 27, 2011 at 12:48 PM, Ahmet Arslan  wrote:
>>> I have a more complex query condition
>>> like this:
>>>
>>> (city:15 AND country:60)^4 OR city:15^2 OR country:60^2
>>>
>>> What I want to achive with this query is basically if a
>>> document has
>>> city = 15 AND country = 60 it is more important then
>>> another document
>>> which only has city = 15 OR country = 60
>>>
>>> Furhtermore I want to show in my results view why a certain
>>> document
>>> matched, something like "matched city and country" or
>>> "matched city
>>> only" or "matched country only".
>>>
>>> This is a bit of an simplified example, but the question
>>> remains: how
>>> can solr tell me which of the conditions in the query
>>> matched? If I
>>> match against a simple field only, I can get away with
>>> highlight
>>> fields, but conditions spanning multiple fields seem much
>>> more tricky.
>>
>> Looks like you can extract these info from output of &debugQuery=on.
>> http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
>>


Re: best practice to introducing singletons inside of Solr (IoC)

2011-12-28 Thread Erick Erickson
I must be missing something here. Why would this be any different from
any other singleton? I just did a little experiment where I implemented
the classic singleton pattern in a RequestHandler and accessed
from a Filter (both plugins) with no problem at all, just the usual
blah var = MySingleton.getInstance();
var.whatever
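
For reference, a minimal sketch of that pattern (the names are made up, not
taken from the actual experiment):

public class MySingleton {
  private static final MySingleton INSTANCE = new MySingleton();

  private MySingleton() {
    // load whatever shared resources the plugins need
  }

  public static MySingleton getInstance() {
    return INSTANCE;
  }

  public String whatever() {
    return "state shared by every plugin loaded from the same classloader";
  }
}

Any RequestHandler, SearchComponent or Filter loaded by the same classloader
can then call MySingleton.getInstance().whatever().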

There was no need to get Solr cores involved at all. Of course this
was just a simple experiment, YMMV...

Best
Erick

On Tue, Dec 27, 2011 at 11:52 PM, Mikhail Khludnev
 wrote:
> Colleagues,
>
> Don't hesitate to emit your opinion. Please!
>
> Regards
>
> On Wed, Dec 21, 2011 at 11:06 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
>> Hello,
>>
>> I need to introduce several singletons inside of Solr and make them
>> available for my own SearchHandlers, Components, and even QParsers, etc.
>>
>> Right now I use some kind of fake SolrRequestHandler which loads on init()
>> and available everywhere through
>> solrCore.getRequestHandler("wellknownName"). Then I downcast it everywhere
>> and access the required methods. The same is possible with fake
>> SearchComponent.
>> Particularly my singletons are some additional fields schema (pretty
>> sophisticated), and kind of request/response encoding facility.
>> The typical Java hammer for such pins is Spring, but I've found puzzling
>> to use
>> http://static.springframework.org/spring/docs/3.0.x/javadoc-api/org/springframework/web/context/support/WebApplicationContextUtils.html
>>
>> What's the best way to do that?
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Lucid Certified
>> Apache Lucene/Solr Developer
>> Grid Dynamics
>>
>> 
>>  
>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
> 
>  


Re: Indexing problem

2011-12-28 Thread Martin Koch
Could it be a commit you're needing?

curl 'localhost:8983/solr/update?commit=true'

/Martin

On Wed, Dec 28, 2011 at 11:47 AM, mumairshamsi wrote:

> http://lucene.472066.n3.nabble.com/file/n3616191/02.xml 02.xml
>
> i am trying to index this file for this i am using this command
>
> java -jar post.jar *.xml
>
> commands run fine but when i search not result is displaying
>
> I think it is encoding problem can any one help ??
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-problem-tp3616191p3616191.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Indexing problem

2011-12-28 Thread Ahmet Arslan
> http://lucene.472066.n3.nabble.com/file/n3616191/02.xml
> 02.xml 
> 
> i am trying to index this file for this i am using this
> command 
> 
> java -jar post.jar *.xml
> 
> commands run fine but when i search not result is
> displaying 
> 
> I think it is encoding problem can any one help ?? 

post.jar cannot index arbitrary XML files. It accepts XML in a particular
format (<add><doc>...</doc></add>).

http://wiki.apache.org/solr/UpdateXmlMessages#add.2BAC8-replace_documents
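
A minimal document in that format would look roughly like this (the field
names must match your schema.xml):

<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title">Some title text</field>
  </doc>
</add>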


Re: Migration from Solr 1.4 to Solr 3.5

2011-12-28 Thread Bhavnik Gajjar
Thanks community! That helps!

To check practically, I have now set up Solr 3.5 in a test environment. A few
observations on that:


   1. I simply copy-pasted one of the Solr 1.4 instances onto the Solr 3.5 setup
   (after correcting the schema.xml and solrconfig.xml files based on what is
   suited for 3.5). If I do a query like
   
http://myserver:8080/solr/Solr_3.5_Instance/select?q=test&shards=myserver:8080/solr/Solr_3.5_Instance,
myserver:8080/solr/Solr_1.4_Instance,
   then it works OK! So, now, wondering, index format has been changed after
   Solr 1.4, and hence, I was expecting above search to fail. Am I correct?
   2. Continuing above point, I guess, if I need to use new feature which
   didn't exist in Solr 1.4, but exists in Solr 3.5, then this hybrid (1.4 and
   3.5 solr instances) setup won't work. Any thoughts?
   3. I got wind that, first commit would convert Solr 1.4 index to new
   format in Solr 3.5 setup. Is it so?
   4. Are there any migration tool (or any other means?) available that
   would convert old indexes (1.4) to new format (3.5)?


Kind regards,

Bhavnik


 Original Message 

To supplement the responses you have already gotten: All servers involved
in a distributed query, including the one that is accessed and all the
shards that are accessed from it, must run the same Javabin version.  Solr
1.4.1 and earlier use javabin version 1 and everything newer uses javabin
version 2.  What you are proposing above will not work.

Hopefully you have two complete sets of servers, for redundancy.  It would
be a good idea to upgrade one server set, then upgrade the other.
SOLR-2204 is in the works to make it possible to have these versions work
together.  I don't think it's been committed yet.

Thanks,
Shawn

>
>
>  Subject: Re: Migration from Solr 1.4 to Solr 3.5
>  Date: Fri, 23 Dec 2011 10:58:43 -0800
>  From: Siva Kommuri
>  Reply-To: solr-user@lucene.apache.org
>  To: solr-user@lucene.apache.org
>  CC: solr-user@lucene.apache.org
> 
>
> One migration strategy is to fall back to XML parser from the javabin parser, 
> upgrade Solrj jars to 3.4, turn off replication, upgrade master, upgrade each 
> of the slaves while turning on replication. Once all slaves have been 
> upgraded/replication turned on - switch back to javabin parser.
>
> Best wishes,
> Siva on 3GS
>
> On Dec 23, 2011, at 7:52, Erick Erickson  
>  wrote:
>
> > Have you looked at CHANGES.txt in ? It has upgrade
> > instructions for every release. Note that in general, newer Solr will *read*
> > an older index (one major revision back. i.e. 3.x should read 1.x, but 4.x
> > will not read 1.x. Note also that there was no 2.x solr).
> >
> > The cautions in the upgrade notes are really about making sure that an
> > index *produced* with 3.x is not *read* by 1.4, i.e. don't upgrade the
> > master before the slave.
> >
> > I *think* that as long as you upgrade *all* slaves before upgrading the
> > master, you'll be fine. And I also believe that you can upgrade only some
> > of the slaves. Each of the slaves, even if only some of them are
> > upgraded, are reading a 1.4 index even after replications.
> >
> > But I'd test first. And if you can re-index, that would actually be the best
> > solution. However, as above you can't reindex until *all* the slaves
> > are upgraded.
> >
> > Best
> > Erick
> >
> > On Fri, Dec 23, 2011 at 7:41 AM, Bhavnik Gajjar  
> >  wrote:
> >> Greetings,
> >>
> >> We are planning to migrate from Solr 1.4 to Solr 3.5 (or, even new Solr
> >> version than 3.5, when available) in coming days. There are few questions
> >> about this migration.
> >>
> >>
> >> • I heard, index format is changed in this migration. So, does this require
> >> me to reindex millions of data?
> >>
> >> • Are there any migration tool (or any other means?) available that would
> >> convert old indexes (1.4) to new format (3.5)?
> >>
> >> • Consider this case.
> >> http://myserver:8080/solr/mainindex/select/?q=solr&start=0&rows=10&shards=myserver:8080/solr/index1,myserver:8080/solr/mainindex,remoteserver:8080/solr/remotedata.
> >> In this example, consider that 'myserver' has been upgraded with Solr 3.5,
> >> but 'remoteserver' is still using Solr 1.4. The question is, would data
> >> from remoteserver's Solr instance come/parsed fine or, would it cause
> >> issues? If it results into issues, then of what type? how to resolve them?
> >> Please suggest.
> >>
> >> • We are using various features of Solr like, searching, faceting,
> >> spellcheck and highlighting. Will migrating from 1.4 to 3.5 cause any break
> >> in functionality? is there anything changed in response XML format of here
> >> mentioned features?
> >>
> >>  Thanks in advance,
> >>
> >> Bhavnik
> >> **
>
>


Re: Problems while searching in default field

2011-12-28 Thread mechravi25
Hi,

Thanks a lot guys. I tried the following options

1.) Downloaded the Solr 3.5.0 version and updated the schema.xml file with
the sample fields I have. I then tried to set the property
"ignoreCaseForWildcards=true" for a field type as mentioned in the URL given
for the patch-2438, but got the error "invalid
arguments:{ignoreCaseForWildcards=true}" while starting the server.
I did not index it in the new Solr version. I placed the index files created
in the previous version of Solr (1.4) and then tried to do the wildcard
search alone.
Please let me know if I am missing something.

2.) I used the Subversion feature in Eclipse to check out the code from SVN.
I then tried to do an Ant build in Eclipse by running the build.xml present
in the solr directory. The build was successful (I never got any error
message) but no WAR was generated. I looked in other folders in the project
and noticed that there was more than one build.xml file in that project.

Please let me know which file to run for generating the WAR, and in which
file I need to change the target location of the WAR file to be generated.

Thanks.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problems-while-searching-in-default-field-tp3601047p3616376.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing problem

2011-12-28 Thread mumairshamsi
http://lucene.472066.n3.nabble.com/file/n3616191/02.xml 02.xml 

I am trying to index this file; for this I am using this command:

java -jar post.jar *.xml

The command runs fine, but when I search, no result is displayed.

I think it is an encoding problem. Can anyone help?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-problem-tp3616191p3616191.html
Sent from the Solr - User mailing list archive at Nabble.com.


Grouping results after Sorting or vice-versa

2011-12-28 Thread vijayrs
The issue I'm facing is that I don't get the expected results when I combine
the "group" param and the "sort" param.

The query is...

http://localhost:8080/solr/core1/select/?qt=nutch&q=*:*&fq=userid:333&group=true&group.field=threadid&group.sort=date%20desc&sort=date%20desc

where "threadid" is a hexadecimal string which is common for more than 1
message, and "date"  is in unix timestamp format.

The results should be sorted based on "date" and also grouped by
"threadid"... how can it be done?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grouping-results-after-Sorting-or-vice-versa-tp3615957p3615957.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
the problem with dedupe (SignatureUpdateProcessor ) is that it REPLACES old
docs. I have tried it already.

Best Regards
Alexander Aristov


On 28 December 2011 13:04, Lance Norskog  wrote:

> The SignatureUpdateProcessor is for exactly this problem:
>
>
> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
>
> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
>  wrote:
> > I get docs from external sources and the only place I keep them is solr
> > index. I have no a database or other means to track indexed docs (my
> > personal oppinion is that it might be a huge headache).
> >
> > Some docs might change slightly in there original sources but I don't
> need
> > that changes. In fact I need original data only.
> >
> > So I have no other ways but to either check if a document is already in
> > index before I put it to solrj array (read - query solr) or develop my
> own
> > update chain processor and implement ID check there and skip such docs.
> >
> > Maybe it's wrong place to aguee and probably it's been discussed before
> but
> > I wonder why simple the overwrite parameter doesn't work here.
> >
> > My oppinion it perfectly suits here. In combination with unique ID it can
> > cover all possible variants.
> >
> > cases:
> >
> > 1. overwrite=true and uniquID exists then newer doc should overwrite the
> > old one.
> >
> > 2. overwrite=false and uniqueID exists then newer doc must be skipped
> since
> > old exists.
> >
> > 3. uniqueID doesn't exist then newer doc just gets added regardless if
> old
> > exists or not.
> >
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 27 December 2011 22:53, Erick Erickson 
> wrote:
> >
> >> Mikhail is right as far as I know, the assumption built into Solr is
> that
> >> duplicate IDs (when  is defined) should trigger the old
> >> document to be replaced.
> >>
> >> what is your system-of-record? By that I mean what does your SolrJ
> >> program do to send data to Solr? Is there any way you could just
> >> *not* send documents that are already in the Solr index based on,
> >> for instance, any timestamp associated with your system-of-record
> >> and the last time you did an incremental index?
> >>
> >> Best
> >> Erick
> >>
> >> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
> >>  wrote:
> >> > Hi
> >> >
> >> > I am not using database. All needed data is in solr index that's why I
> >> want
> >> > to skip excessive checks.
> >> >
> >> > I will check DIH but not sure if it helps.
> >> >
> >> > I am fluent with Java and it's not a problem for me to write a class
> or
> >> so
> >> > but I want to check first  maybe there are any ways (workarounds) to
> make
> >> > it working without codding, just by playing around with configuration
> and
> >> > params. I don't want to go away from default solr implementation.
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >> >
> >> > On 27 December 2011 09:33, Mikhail Khludnev <
> mkhlud...@griddynamics.com
> >> >wrote:
> >> >
> >> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
> >> >> alexander.aris...@gmail.com> wrote:
> >> >>
> >> >> > Hi people,
> >> >> >
> >> >> > I urgently need your help!
> >> >> >
> >> >> > I have solr 3.3 configured and running. I do uncremental indexing 4
> >> >> times a
> >> >> > day using bulk updates. Some documents are identical to some extent
> >> and I
> >> >> > wish to skip them, not to index.
> >> >> > But here is the problem as I could not find a way to tell solr
> ignore
> >> new
> >> >> > duplicate docs and keep old indexed docs. I don't care that it's
> new.
> >> >> Just
> >> >> > determine by ID that such document is in the index already and
> that's
> >> it.
> >> >> >
> >> >> > I use solrj for indexing. I have tried setting overwrite=false and
> >> dedupe
> >> >> > apprache but nothing helped me. I either have that a newer doc
> >> overwrites
> >> >> > old one or I get duplicate.
> >> >> >
> >> >> > I think it's a very simple and basic feature and it must exist.
> What
> >> did
> >> >> I
> >> >> > make wrong or didn't do?
> >> >> >
> >> >>
> >> >> I guess, because  the mainstream approach is delta-import , when you
> >> have
> >> >> "updated" timestamps in your DB and "last-import" timestamp stored
> >> >> somewhere. You can check how it works in DIH.
> >> >>
> >> >>
> >> >> >
> >> >> > Tried google but I couldn't find a solution there althoght many
> people
> >> >> > encounted such problem.
> >> >> >
> >> >> >
> >> >> it's definitely can be done by overriding
> >> >> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
> >> suggest
> >> >> to start from implementing your own
> >> >> http://wiki.apache.org/solr/UpdateRequestProcessor - search for PK,
> >> bypass
> >> >> chain call if it's found. Then if you meet performance issues on
> >> querying
> >> >> your PKs one by one, (but only after that) you can batch your
> searches,
> >> >> there are couple of optimization techniques for huge disjunction
> queries
> >> >> 

Re: hl.boundaryScanner and hl.bs.chars

2011-12-28 Thread meghana
Thanks iorixxx and Koji for your reply.

So can I fulfill my requirement by using hl.regex.pattern and setting
hl.fragmenter=regex?
I was looking at these parameters on the wiki, and I am thinking of using them
to show my highlighted text in my desired format.

my string is like below 
1s: This is very nice day. 3s: Christmas is about to come 4s: and christmas
preparation is just on 

Now if I search for "christmas", I want my fragments in the format below:


3s: Christmas is about to come 


4s: and christmas preparation is just on 


Can I fulfill this using hl.regex.pattern, or by any other way?
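
For reference, a sketch of the regex fragmenter parameters in question (the
field name and the pattern are only guesses that would need testing, and
special characters such as + must be URL-encoded in a real request):

hl=true
hl.fl=yourTextField
hl.fragmenter=regex
hl.fragsize=70
hl.regex.slop=0.5
hl.regex.pattern=\d+s:\D*

The pattern assumes the sentence text itself contains no digits, so each
fragment would run from one "Ns:" marker up to the next.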
Thanks
Meghana

--
View this message in context: 
http://lucene.472066.n3.nabble.com/hl-boundaryScanner-and-hl-bs-chars-tp3615838p3616218.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Poor performance on distributed search

2011-12-28 Thread ku3ia
Hi all.
Due to my code review, I discovered the following things:
1) as I wrote before, it seems there is a low disk read speed;
2) in ~/solr-3.5/solr/core/src/java/org/apache/solr/response/XMLWriter.java
and similar writer classes there is a writeDocList => writeDocs method, which
contains a loop over all docs;
3) as Michael Ryan wrote, this method uses SolrIndexSearcher.doc(int i,
Set<String> fields), which, as I understand it, returns data from the cache or
from the index;
4) I found a patch SOLR-1961
(https://issues.apache.org/jira/browse/SOLR-1961), but, seems it uses
RAMDirectoryFactory, so I can't apply it;

So, based on point 2) and on my previous research, I conclude that the more
documents I want to retrieve, the slower the search is, and the main problem
is the loop in the writeDocs method. Am I right? Can you advise something for
this situation?

Thanks.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3616192.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can I check if a more complex query condition matched?

2011-12-28 Thread Max
Thanks for your reply. I thought about using the debug mode, too, but
the information is not easy to parse and doesn't contain everything I
want. Furthermore, I don't want to enable debug mode in production.

Is there anything else I could try?

On Tue, Dec 27, 2011 at 12:48 PM, Ahmet Arslan  wrote:
>> I have a more complex query condition
>> like this:
>>
>> (city:15 AND country:60)^4 OR city:15^2 OR country:60^2
>>
>> What I want to achive with this query is basically if a
>> document has
>> city = 15 AND country = 60 it is more important then
>> another document
>> which only has city = 15 OR country = 60
>>
>> Furhtermore I want to show in my results view why a certain
>> document
>> matched, something like "matched city and country" or
>> "matched city
>> only" or "matched country only".
>>
>> This is a bit of an simplified example, but the question
>> remains: how
>> can solr tell me which of the conditions in the query
>> matched? If I
>> match against a simple field only, I can get away with
>> highlight
>> fields, but conditions spanning multiple fields seem much
>> more tricky.
>
> Looks like you can extract these info from output of &debugQuery=on.
> http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
>


Was:Re: hl.boundaryScanner and hl.bs.chars [off topic]

2011-12-28 Thread Tanguy Moal

Dear list,
I'd like to bounce on that issue...

IMHO, configuration parsing could be a little bit stricter... At least,
what counts as a "severe" configuration error could be user-defined.


Let me give some examples that are common errors and that don't trigger 
the "abortOnConfigurationError" behaviour, while it's set to true.


* In schema.xml, one can set the attribute multivalued="true" or 
multiValude="true", and that won't trigger any startup error.
* In solrconfig.xml, it's even possible to declare configuration objects 
(such as fragListBuilder nodes) although solr 1.4 doesn't know anything 
about such a thing.


I've experienced both worlds in my short life : strict configuration 
parsers which get really painful to maintain when configuration becomes 
complex, and loose parsers which are so nice with configuration errors 
that sometimes a simple typo gets hard to spot (even if 
the multivalued="true" error is usually easy to find, as soon as one 
adds several values to a non multi-valued field :), the highlighting 
issue requires people to pay more attention to the "/!\ SolrX.Y" 
mentions on the wiki... BTW, http://wiki.apache.org/solr/Solr3.5's 
content seems outdated, I think it was released a month ago or so... 
Kudos! ;-) )


The point here is that, from my point of view, the 
abortOnConfigurationError flag is actually of little help when playing 
around with the configuration (at least without the ability to define 
what a severe configuration error is)


Thank you for your attention!

--
Tanguy

Le 28/12/2011 09:43, Koji Sekiguchi a écrit :

(11/12/28 17:08), Ahmet Arslan wrote:

FastVectorHighlighter requires Solr3.1

http://wiki.apache.org/solr/HighlightingParameters#hl.useFastVectorHighlighter 





Right. In addition, baoundaryScanner requires 3.5.

koji




Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-28 Thread Lance Norskog
You would have to implement this yourself in your indexing code. Solr
has an analysis plugin which does the analysis for your text and then
returns the result, but does not query or index. You can use this to
calculate the fuzzy hash, then search against the index.

You might be able to code this in an UpdateRequestProcessor.
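
For reference, the stock example solrconfig registers such a handler at
/analysis/field; a request like the following (field type name assumed)
returns the analyzed tokens without indexing anything, and those tokens could
then be hashed and searched for:

http://localhost:8983/solr/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=the+document+body+to+hash&wt=json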

On Tue, Dec 27, 2011 at 9:45 PM, vibhoreng04  wrote:
> Hi Shashi,
>
> That's correct  !But I need something for index time comparision.Can cosine
> compare from the already indexed documents and compare the incrementally
> indexed files ?
>
>
>
> Regards,
>
>
> Vibhor
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-run-the-solr-dedup-for-the-document-which-match-80-or-match-almost-tp3614239p3615787.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: solr keep old docs

2011-12-28 Thread Lance Norskog
The SignatureUpdateProcessor is for exactly this problem:

http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
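
The wiki's example configuration looks roughly like this (it assumes a stored
"signature" field exists in the schema, and the list of fields to hash would
need adapting):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The chain is then selected on the update handler, e.g. via an update.chain
(older releases: update.processor) default parameter.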

On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
 wrote:
> I get docs from external sources and the only place I keep them is solr
> index. I have no a database or other means to track indexed docs (my
> personal oppinion is that it might be a huge headache).
>
> Some docs might change slightly in there original sources but I don't need
> that changes. In fact I need original data only.
>
> So I have no other ways but to either check if a document is already in
> index before I put it to solrj array (read - query solr) or develop my own
> update chain processor and implement ID check there and skip such docs.
>
> Maybe it's wrong place to aguee and probably it's been discussed before but
> I wonder why simple the overwrite parameter doesn't work here.
>
> My oppinion it perfectly suits here. In combination with unique ID it can
> cover all possible variants.
>
> cases:
>
> 1. overwrite=true and uniquID exists then newer doc should overwrite the
> old one.
>
> 2. overwrite=false and uniqueID exists then newer doc must be skipped since
> old exists.
>
> 3. uniqueID doesn't exist then newer doc just gets added regardless if old
> exists or not.
>
>
> Best Regards
> Alexander Aristov
>
>
> On 27 December 2011 22:53, Erick Erickson  wrote:
>
>> Mikhail is right as far as I know, the assumption built into Solr is that
>> duplicate IDs (when  is defined) should trigger the old
>> document to be replaced.
>>
>> what is your system-of-record? By that I mean what does your SolrJ
>> program do to send data to Solr? Is there any way you could just
>> *not* send documents that are already in the Solr index based on,
>> for instance, any timestamp associated with your system-of-record
>> and the last time you did an incremental index?
>>
>> Best
>> Erick
>>
>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>>  wrote:
>> > Hi
>> >
>> > I am not using database. All needed data is in solr index that's why I
>> want
>> > to skip excessive checks.
>> >
>> > I will check DIH but not sure if it helps.
>> >
>> > I am fluent with Java and it's not a problem for me to write a class or
>> so
>> > but I want to check first  maybe there are any ways (workarounds) to make
>> > it working without codding, just by playing around with configuration and
>> > params. I don't want to go away from default solr implementation.
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>> >
>> > On 27 December 2011 09:33, Mikhail Khludnev > >wrote:
>> >
>> >> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov <
>> >> alexander.aris...@gmail.com> wrote:
>> >>
>> >> > Hi people,
>> >> >
>> >> > I urgently need your help!
>> >> >
>> >> > I have solr 3.3 configured and running. I do uncremental indexing 4
>> >> times a
>> >> > day using bulk updates. Some documents are identical to some extent
>> and I
>> >> > wish to skip them, not to index.
>> >> > But here is the problem as I could not find a way to tell solr ignore
>> new
>> >> > duplicate docs and keep old indexed docs. I don't care that it's new.
>> >> Just
>> >> > determine by ID that such document is in the index already and that's
>> it.
>> >> >
>> >> > I use solrj for indexing. I have tried setting overwrite=false and
>> dedupe
>> >> > apprache but nothing helped me. I either have that a newer doc
>> overwrites
>> >> > old one or I get duplicate.
>> >> >
>> >> > I think it's a very simple and basic feature and it must exist. What
>> did
>> >> I
>> >> > make wrong or didn't do?
>> >> >
>> >>
>> >> I guess, because  the mainstream approach is delta-import , when you
>> have
>> >> "updated" timestamps in your DB and "last-import" timestamp stored
>> >> somewhere. You can check how it works in DIH.
>> >>
>> >>
>> >> >
>> >> > Tried google but I couldn't find a solution there althoght many people
>> >> > encounted such problem.
>> >> >
>> >> >
>> >> it's definitely can be done by overriding
>> >> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I
>> suggest
>> >> to start from implementing your own
>> >> http://wiki.apache.org/solr/UpdateRequestProcessor - search for PK,
>> bypass
>> >> chain call if it's found. Then if you meet performance issues on
>> querying
>> >> your PKs one by one, (but only after that) you can batch your searches,
>> >> there are couple of optimization techniques for huge disjunction queries
>> >> like PK:(2 OR 4 OR 5 OR 6).
>> >>
>> >>
>> >> > I start considering that I must query index to check if a doc to be
>> added
>> >> > is in the index already and do not add it to array but I have so many
>> >> docs
>> >> > that I am affraid it's not a good solution.
>> >> >
>> >> > Best Regards
>> >> > Alexander Aristov
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Sincerely yours
>> >> Mikhail Khludnev
>> >> Lucid Certified
>> >> Apache Lucene/Solr Developer
>> >> Grid Dynamics
>> >>
>>



-- 
Lance Norskog
goks...@g

Re: hl.boundaryScanner and hl.bs.chars

2011-12-28 Thread Koji Sekiguchi

(11/12/28 17:08), Ahmet Arslan wrote:

FastVectorHighlighter requires Solr3.1

http://wiki.apache.org/solr/HighlightingParameters#hl.useFastVectorHighlighter



Right. In addition, boundaryScanner requires 3.5.

koji
--
http://www.rondhuit.com/en/


Re: hl.boundaryScanner and hl.bs.chars

2011-12-28 Thread Ahmet Arslan

> I tried by adding BoundaryScanner in my
> solrconfig.xml  and set
> hl.useFastVectorHighlighter=true, termVectors=on,
> termPositions=on and
> termOffsets=on. in my query. then also i didn't get any
> effect on my
> highlighting. 

> do i missing anything , or doing anything wrong?? 
> i like to make a note that i am using solr version 1.4 

FastVectorHighlighter requires Solr3.1

http://wiki.apache.org/solr/HighlightingParameters#hl.useFastVectorHighlighter


Re: hl.boundaryScanner and hl.bs.chars

2011-12-28 Thread meghana
Hi Koji,
Thanks for the reply.

I tried adding a BoundaryScanner in my solrconfig.xml and set
hl.useFastVectorHighlighter=true, termVectors=on, termPositions=on and
termOffsets=on in my query. Even then I didn't see any effect on my
highlighting.

my solr config setting is as below

<boundaryScanner name="simple" default="true" class="solr.highlight.SimpleBoundaryScanner">
  <lst name="defaults">
    <str name="hl.bs.maxScan">10</str>
    <str name="hl.bs.chars">s:</str>
  </lst>
</boundaryScanner>

Am I missing anything, or doing anything wrong?
I'd like to note that I am using Solr version 1.4.

Thanks 
Meghana

--
View this message in context: 
http://lucene.472066.n3.nabble.com/hl-boundaryScanner-and-hl-bs-chars-tp3615838p3615940.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - Mutivalue field search on different elements

2011-12-28 Thread meghana
Hi Koji,
Thanks for the reply.

I tried adding a BoundaryScanner in my solrconfig.xml and set
hl.useFastVectorHighlighter=true, termVectors=on, termPositions=on and
termOffsets=on in my query. Even then I didn't see any effect on my
highlighting.

my solr config setting is as below

<boundaryScanner name="simple" default="true" class="solr.highlight.SimpleBoundaryScanner">
  <lst name="defaults">
    <str name="hl.bs.maxScan">10</str>
    <str name="hl.bs.chars">s:</str>
  </lst>
</boundaryScanner>

Am I missing anything, or doing anything wrong?
I'd like to note that I am using Solr version 1.4.

Thanks
Meghana

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Mutivalue-field-search-on-different-elements-tp3604213p3615937.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - Mutivalue field search on different elements

2011-12-28 Thread Ahmet Arslan
> i can't delete 1s ,2s ...etc from my
> field value , i have to keep text in
> this format... so i'll apply slop in my search to do my
> needed search done.

It is OK if you can't delete 1s, 2s, etc. from the field value. We can eat up
those special markups in the analysis chain. PatternReplaceCharFilterFactory
or PatternReplaceFilterFactory should do the trick.

http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceFilterFactory
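
A sketch of how the char filter variant could look in schema.xml (the field
type name, tokenizer choice and exact pattern are assumptions to adapt):

<fieldType name="text_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- eat up the "1s:", "2s:", ... markers before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\d+s:" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>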