Re: [poll] virtualization platform for SOLR

2015-09-30 Thread Bernd Fehling
Hi Shawn,

unfortunately we have to run VMs, otherwise we would waste hardware.
I thought other Solr users were in the same situation, but it seems that
other users have tons of hardware available and we are the only ones
having to use VMs.
Right, bare metal is always better than any VM.
As you mentioned we have the indexer (master) on one physical machine
and two searchers (slaves) on other physical machines, all together with
other little VMs which are not I/O and CPU heavy.

Regards
Bernd

Am 30.09.2015 um 18:48 schrieb Shawn Heisey:
> On 9/30/2015 3:12 AM, Bernd Fehling wrote:
>> while setting up some new servers (virtual machines) using XEN I was
>> thinking about an alternative like KVM. My last tests with KVM were
>> a while ago and XEN performed much better in the area of I/O and
>> CPU usage.
>> This led me to the idea to start a poll about virtualization platforms and
>> your experiences.
> 
> I once had a virtualized Solr install with Xen where each VM housed one
> Solr instance with one core.  The index was distributed, so it required
> several VMs for one copy of the index.
> 
> I eliminated the virtualization, used the same hardware as bare metal
> with Linux, still one Solr instance installed on the machine, but with
> multiple Solr cores.  Performance is much better now.
> 
> General advice:  Don't run virtual machines.
> 
> If a virtual environment is the only significant hardware you have
> access to and it's used for more than Solr, then you might need to.  If
> you do run virtual, then minimize the number of VMs, don't put multiple
> replicas of the same index data on the same physical VM host, give each
> Solr VM lots of memory, and don't oversubscribe the memory/cpu on the
> physical VM host.
> 
> Thanks,
> Shawn
> 


Join with faceting and filtering

2015-09-30 Thread Troy Edwards
I am working with the following indices

*Item*

ItemId - string
Description - text (query on this)
Categories - Multivalued text (query on this)
Sellers - Multivalued text (query on this)
SellersString - Multivalued string (Need to facet and filter on this)

*ContractItem*

ContractItemId - string
ItemId - string
ContractCode - string (facet and filter on this)
Priority -  integer (order by priority descending)
Active - boolean (filter on this)

Say someone is searching for colgate

I am doing two queries:

First query: q = {!join from=ItemId to=ItemId
fromIndex=Item}(Description:colgate OR Categories:colgate OR
Sellers:colgate)&facet.field=ContractCode

From the first query I get all the ItemIds and do a second query on the Item
index using q=ItemId:(Id1 Id2 Id3) and generate a facet on SellersString
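
Concretely, the two requests look roughly like this (the host, collection
names, and the facet=true flag are assumptions):

  http://host:8983/solr/ContractItem/select?q={!join from=ItemId to=ItemId
      fromIndex=Item}(Description:colgate OR Categories:colgate OR
      Sellers:colgate)&facet=true&facet.field=ContractCode

  http://host:8983/solr/Item/select?q=ItemId:(Id1 Id2 Id3)
      &facet=true&facet.field=SellersString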

I have to do some custom coding to retain Priority (so that I can sort on
it)

Following are the issues I am running into:

1) Since there are a lot of Items and ContractItems, the number of Ids
becomes large and I had to increase maxBooleanClauses (possible performance
degradation?)

2) Since I have to return a lot of items from the first query, the data size
becomes very large (again a performance concern)

3) When a filter is applied on the second query, I have to adjust the facet
results of the first query

4) Overall this seems complex

Is it possible to do just one query and apply filters (if any) and get
results along with facets?

Any suggestions on simplifying this and improving performance?

Thanks in advance


Re: solrcloud not displaying store fields

2015-09-30 Thread Chris Hostetter

the results you've posted make no sense to me unless the documents are out 
of sync between multiple replicas of whatever shard is hosting the doc 
that you get in the first result -- both of your queries, even though you 
are sending them to a specific replica, are general requests so they are 
being distributed to each shard and then aggregated.  the fact that you 
see these different results when adding to the fl may just be coincidence?

Try adding "distrib=false" to both queries, and issuing them to every 
replica of every shard in your index -- if you get inconsistent results 
for the same query between two replicas of the same shard that would seem 
to confirm my suspicion.  if you get inconsistent responses for the two 
different queries from a single replica then something else very weird is 
happening.
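
For example, using the URL pattern from the quoted message below, the check
would look like this, repeated against every replica:

  http://solr:port/solr/cloud1_shard1_replica1/select?q=UCID:ddfdf4&fl=UCID&wt=json&distrib=false
  http://solr:port/solr/cloud1_shard1_replica1/select?q=UCID:ddfdf4&fl=UCID,TI&wt=json&distrib=false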

full details about your configs & schema would also be helpful to perhaps 
identify more esoteric possible causes (ie: any custom plugins? you aren't 
using TextField for your uniqueKey field are you? etc...)



: Date: Wed, 23 Sep 2015 11:17:02 +0530
: From: Roshan Agarwal 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: solrcloud not displaying store fields
: 
: Further in this please see an example below:
: 
: 
: http://solr:port
: 
/solr/cloud1_shard1_replica1/select?q=UCID%3Addfdf4&fl=UCID&wt=json&indent=true
: 
: {
:   "responseHeader":{
: "status":0,
: "QTime":1157,
: "params":{
:   "q":"UCID:ddfdf4",
:   "indent":"true",
:   "fl":"UCID",
:   "wt":"json"}},
:   "response":{"numFound":1,"start":0,"maxScore":13.710121,"docs":[
:   {
: "UCID":"ddfdf4"}]
:   },
: 
: 
: But when we add TI in fl there is no num found
: 
: 
http://solr:port/solr/cloud1_shard1_replica1/select?q=UCID%3Addfdf4&fl=UCID,TI&wt=json&indent=true
: 
: {
:   "responseHeader":{
: "status":0,
: "QTime":1727,
: "params":{
:   "q":"UCID:ddfdf4",
: 
:   "indent":"true",
:   "fl":"UCID,TI",
:   "wt":"json"}},
:   "response":{"numFound":0,"start":0,"maxScore":13.710121,"docs":[]
:   },
:   "highlighting":{},
:   "spellcheck":{
: "suggestions":[]}}
: 
: Can any one explain this behaviour of solr
: 
: 
: Roshan
: 
: 
: 
: 
: On Wed, Sep 23, 2015 at 11:06 AM, Roshan Agarwal 
: wrote:
: 
: > I am getting an issue with solrcloud the stored field is not reflecting in
: > search where as we are able to get result
: >
: 
: 
: 
: -- 
: 
: Roshan Agarwal
: Director sales
: Siddhast Ip innovation (P) ltd
: 907 chandra vihar colony
: Jhansi-284002
: M:+919871549769
: M:+917376314900
: 

-Hoss
http://www.lucidworks.com/


Re: Passing Basic Auth info to HttpSolrClient

2015-09-30 Thread Ishan Chattopadhyaya
In the latest Solr release, you can use the basic auth plugins for
authentication instead of doing something at the Jetty level.
https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+Plugin
Right at the end, there's a note on how to use SolrJ with this.

Also, there is https://issues.apache.org/jira/browse/SOLR-8053, which
is due in the Solr 5.4 release.

On Wed, Sep 30, 2015 at 7:28 PM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> HttpSolrClient can accept the Apache Commons HttpClient in its constructor:
>
> https://lucene.apache.org/solr/5_3_1/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrClient.html
>
> You can use the HttpClientBuilder (
> http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/client/HttpClientBuilder.html),
> to build an HttpClient that does Basic Authentication, and then pass client
> to the SolrHttpClient constructor.
>
> A search on "HttpClientBuilder Basic Authentication" returned many hits,
> but here's one of them:
>
> http://www.baeldung.com/httpclient-4-basic-authentication
>
> Hope this helps,
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
>
>
> -Original Message-
> From: Steven White [mailto:swhite4...@gmail.com]
> Sent: Tuesday, September 29, 2015 8:13 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Passing Basic Auth info to HttpSolrClient
>
> Hi,
>
> Re-posting to see if anyone can help.  If my question is not clear, let me
> know.
>
> Thanks!
>
> Steve
>
> On Mon, Sep 28, 2015 at 5:15 PM, Steven White 
> wrote:
>
> > Hi,
> >
> > I'm using HttpSolrClient to connect to Solr.  Everything works until
> > when I enabled basic authentication in Jetty.  My question is, how do
> > I pass to SolrJ the basic auth info. so that I don't get a 401 error?
> >
> > Thanks in advance
> >
> > Steve
> >
>


Re: Find records with no values in solr.LatLongType fied type

2015-09-30 Thread Ishan Chattopadhyaya
There's also a function, exists(), which might work here, and result in a
neater query.
e.g. something like: q=*:* -exists(usrlatlong_0_coordinate)
Haven't tried it, though.
https://cwiki.apache.org/confluence/display/solr/Function+Queries#FunctionQueries-AvailableFunctions

On Wed, Sep 30, 2015 at 8:17 PM, Kamal Kishore Aggarwal <
kkroyal@gmail.com> wrote:

> Thanks Erick..it worked..
>
> On Wed, Sep 16, 2015 at 9:21 PM, Erick Erickson 
> wrote:
>
> > Top level queries need a *:* in front, something like
> > q=*:* -usrlatlong_0_coordinate:[* TO *]
> >
> > I just took a quick check and just using usrlatlong:[* TO *]
> > encounters a parse error.
> >
> > P.S. It would help if you told us what you _did_ receive
> > when you tried your options. Parse errors? All docs?
> >
> > Best,
> > Erick
> >
> > On Mon, Sep 14, 2015 at 10:58 PM, Kamal Kishore Aggarwal
> >  wrote:
> > > Hi,
> > >
> > > I am working on solr 4.8,1. I am trying to find the docs where
> > latlongtype
> > > have null values.
> > >
> > > I have tried using these, but not getting the results :
> > >
> > > 1) http://localhost:8984/solr/IM-Search/select?q.alt=-usrlatlong:[' '
> > TO *]
> > >
> > > 2) http://localhost:8984/solr/IM-Search/select?q.alt=-usrlatlong:[* TO
> > *]
> > >
> > > Here's the configurations :
> > >>  > >> subFieldSuffix="_coordinate"/>
> > >>  stored="true"
> > >> required="false" multiValued="false" />
> > >
> > >
> > > Please help.
> >
>


Re: How can I get a monotonically increasing field value for docs?

2015-09-30 Thread Chris Hostetter

: Small potato: I assume cursor mark breaks when the number of shards changes
: while keeping the original values doesn't, since the relative position is
: encoded per shard...But that's an edge case.

I don't understand your question ... the encoded cursorMark values don't 
know/care anything about shards.  It only encodes 
information about the *relative* position where you left off according to 
the specified sort -- that position is relative to the abstract orderings 
of all possible values, not relative to any particular shard(s)

in your use case it would function *exactly* the same as keeping track of 
the exact timestamp and uniqueKey of the last doc you received, and 
passing that cursorMark value back on the next query would be exactly the 
same as specifying a "fq=timestamp:{X TO *] OR (timestamp:X AND id:[Y TO 
*])" on the next request, except that under the covers the way a 
cursorMark is passed down to the IndexSearcher as a "searchAfter" 
structure should be more efficient than using an fq.

adding shards, removing shards, adding documents, removing documents ... 
cursorMark doesn't care ... what you get back is any doc that, at the 
moment you sent that cursorMark value, has sort values which would place 
that doc *after* the last doc you received with the previous request when 
you got that value as the nextCursorMark.

changing the value of a sort field in a document in the middle of 
iteration might affect if it is ever seen, or if it's seen more than once 
(see the previously mentioned URL for detailed examples), but splitting 
shards or whatnot is not going to change the results of iterating a cursor 
in any way.
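
As a concrete SolrJ sketch (the sort fields are taken from the discussion
above; the collection URL is a placeholder):

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.params.CursorMarkParams;

  SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
  SolrQuery q = new SolrQuery("*:*");
  q.setRows(500);
  q.setSort(SolrQuery.SortClause.asc("timestamp"));
  q.addSort(SolrQuery.SortClause.asc("id"));   // uniqueKey breaks ties
  String cursorMark = CursorMarkParams.CURSOR_MARK_START;   // "*"
  while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(q);
      // ... process rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      if (cursorMark.equals(next)) break;   // unchanged mark: iteration done
      cursorMark = next;
  }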


-Hoss
http://www.lucidworks.com/


Way to determine (via analyzer) what fields/types will be created for a given field name?

2015-09-30 Thread Bill Dueber
Let’s say I have





[I started thinking this sort of thing through a while back]

If I index a field named lastname_st, I end up with:

   - field lastname_t of type text
   - field lastname of type string

*​​*
*Is there any way for me to query Solr to find out what fields and
fieldtypes it's going to produce, in the way the analysis handlers can
show me transformations and so on?*

—
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [poll] virtualization platform for SOLR

2015-09-30 Thread Shawn Heisey
On 9/30/2015 3:12 AM, Bernd Fehling wrote:
> while setting up some new servers (virtual machines) using XEN I was
> thinking about an alternative like KVM. My last tests with KVM were
> a while ago and XEN performed much better in the area of I/O and
> CPU usage.
> This led me to the idea to start a poll about virtualization platforms and
> your experiences.

I once had a virtualized Solr install with Xen where each VM housed one
Solr instance with one core.  The index was distributed, so it required
several VMs for one copy of the index.

I eliminated the virtualization, used the same hardware as bare metal
with Linux, still one Solr instance installed on the machine, but with
multiple Solr cores.  Performance is much better now.

General advice:  Don't run virtual machines.

If a virtual environment is the only significant hardware you have
access to and it's used for more than Solr, then you might need to.  If
you do run virtual, then minimize the number of VMs, don't put multiple
replicas of the same index data on the same physical VM host, give each
Solr VM lots of memory, and don't oversubscribe the memory/cpu on the
physical VM host.

Thanks,
Shawn



Re: Advice for configuring solr 3.5.1 on Cent OS

2015-09-30 Thread Shawn Heisey
On 9/30/2015 4:34 AM, Porky Pig wrote:
> Hello.
> 
> I managed to compile Solr 3.5.1 from source with the ant compiler.
> 
> I am able to start solr but not much else.
> It appears that it can't find its Java libraries. Also, the solr-webapp
> subpath doesn't contain anything while other similar paths do. I'm
> attaching two log files which I believe are related to the server
> startup and my attempt to connect to it.
> 
> The server can be started as
> 
> solr start
> 
> and then at a connection attempt it replies with 'Service unavailable'.
> Trying the test configuration:
> 
> solr start -e cloud
> 
> results in a missing library error.

Since there was never a 3.5.1 version, and 3.x did not include a start
script, I assume you must mean 5.3.1.

The "ant" packaged with RHEL/CentOS is broken and cannot successfully
compile Solr.  You must download and install ant from ant.apache.org.
The official release is made with ant 1.8.x, but the latest 1.9 version
should work too.

What command did you use to compile solr?  The command you need is "ant
server" while sitting in the solr directory.

Thanks,
Shawn



Re: Keyword match distance rule issue

2015-09-30 Thread Alessandro Benedetti
Hi, Solr does not support more than 2 as an edit distance!
You need to customise this at the code level if you want to.

If in the index we have:

bridwater

Bridgewater (3)
Bridffwater (3)

This is really weird, but please, can you tell me what exactly you have
indexed for that field? Can you check the analysis tool and show me the
tokens produced for that field, at indexing and query time?
The analysis tool is reachable from the core in the admin UI and is really
useful in this kind of situation.

Cheers



2015-09-30 14:32 GMT+01:00 anil.vadhavane :

> Hi Benedetti,
>
> Yes, at first it looks like a user error and I am surprised as well with
> the case.
>
> We tested this on two different systems. We tried it with lower-case input
> but it is not matching. We are using the standard title column to store the
> data. We even tried with 3, 4 and 5 edit distance but this particular
> query is not matching.
>
> I wonder if anyone has really tried this on their own system to confirm
> whether that is the case for others as well.
>
> Just to clarify -
>
> We want to match "emma bridwater radios", stored in the title column, with
> the search query "Bridgewater~2" (you can use 3 edit distance if you want).
> We observed that Solr is not matching it. However, if we try
> "Bridffwater~2", Solr matches it.
>
> It might be a silly mistake on our side but we are not able to find the
> solution at present.
>
>
> Thanks
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Keyword-match-distance-rule-issue-tp4231624p4232040.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: entity processing order during updates

2015-09-30 Thread Roxana Danger
Of course, thank you!
Hopefully it will be clearer now. I have:
- in db-config, roughly (two top-level entities; queries elided):

    <document>
      <entity name="E1" query="...">
        ...
      </entity>
      <entity name="E2" query="..."/>
    </document>

- in config, roughly (an update chain with the two custom processors):

    <updateRequestProcessorChain name="mychain">
      <processor class="myClass1"/>
      <processor class="myClass2"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>
I need the following order to be executed:
   - import data from DB for E1
   - import data from DB for E2
   - execute myClass1 for all the docs
   - execute myClass2 for all the docs

Sometimes it seems to be loading data for E2 before importing all data for
E1.
Also, when the process for importing the data for E2 begins, have the
analyzers for the fields associated with E1 already been executed?

Thank you very much again,
Roxana


On 30 September 2015 at 15:59, Alexandre Rafalovitch 
wrote:

> Hmm. It seems I misread " the second processor needs to be executed
> after complete the first
> one." In fact, I am still unsure what that is supposed to mean.
>
> Could you give a more concrete example of the sequence with say 2
> items of each time and what you see vs. what you expect to see.
>
> And I assume for DIH, you have two top level entity definitions next
> to each other. Not nested entities, no update clauses (just full
> import), etc.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 30 September 2015 at 10:53, Roxana Danger
>  wrote:
> > Do you mean creating 2 instances and then generating a third one (or
> > updating one of them) for merging their data?
> > Is it not guaranteed that the entities in the DIH are imported in the
> order
> > described in the db-config file?
> > Thank you very much,
> > Roxana
> >
> >
> >
> > On 30 September 2015 at 14:48, Alexandre Rafalovitch  >
> > wrote:
> >
> >> Have you tried just having two separate endpoints each with its own
> >> definition of DIH and URP? Then, you just hit those end-points one at
> >> a time in whatever order you need.
> >>
> >> Seems easier than a custom switching logic.
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 28 September 2015 at 11:23, Roxana Danger
> >>  wrote:
> >> > Hello,
> >> >  I am importing in solr 2 entities coming from 2 different tables,
> >> and
> >> > I have defined an update request processor chain with two custom
> >> processor
> >> > factories:
> >> >  - the first processor factory needs to be executed first for one
> >> type
> >> > of entities and then for the other (I differentiate the "entity type"
> >> with
> >> > a field called table). In the import data config file I keep the
> order on
> >> > which the entities should need to be processed.
> >> >   - the second processor needs to be executed after complete the
> >> first
> >> > one.
> >> >  When executed the updates having only the first processor, the
> >> updates
> >> > work all fine. However, when I added the second processor, it seems
> that
> >> > the first update processor is not getting the entities in the order I
> was
> >> > expected.
> >> >  Has anyone had this problem before? Could anyone help me to
> >> configure
> >> > this?
> >> >  Thank you very much in advance,
> >> >  Roxana
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > 
> >>
> >
> >
> >
> > --
> > Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street,
> London,
> > WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk] 
> The
> > UK's #1 job site.  [image: Follow us on Twitter]
> > 
> >  [image:
> > Like us on Facebook] 
> >  It's time to Love Mondays
> »
> > 
>



-- 
Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk]  The
UK's #1 job site.  [image: Follow us on Twitter]

 [image:
Like us on Facebook] 
 It's time to Love Mondays »



Re: entity processing order during updates

2015-09-30 Thread Alexandre Rafalovitch
Hmm. It seems I misread " the second processor needs to be executed
after complete the first
one." In fact, I am still unsure what that is supposed to mean.

Could you give a more concrete example of the sequence with say 2
items of each time and what you see vs. what you expect to see.

And I assume for DIH, you have two top level entity definitions next
to each other. Not nested entities, no update clauses (just full
import), etc.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 30 September 2015 at 10:53, Roxana Danger
 wrote:
> Do you mean creating 2 instances and then generating a third one (or
> updating one of them) for merging their data?
> Is it not guaranteed that the entities in the DIH are imported in the order
> described in the db-config file?
> Thank you very much,
> Roxana
>
>
>
> On 30 September 2015 at 14:48, Alexandre Rafalovitch 
> wrote:
>
>> Have you tried just having two separate endpoints each with its own
>> definition of DIH and URP? Then, you just hit those end-points one at
>> a time in whatever order you need.
>>
>> Seems easier than a custom switching logic.
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 28 September 2015 at 11:23, Roxana Danger
>>  wrote:
>> > Hello,
>> >  I am importing in solr 2 entities coming from 2 different tables,
>> and
>> > I have defined an update request processor chain with two custom
>> processor
>> > factories:
>> >  - the first processor factory needs to be executed first for one
>> type
>> > of entities and then for the other (I differentiate the "entity type"
>> with
>> > a field called table). In the import data config file I keep the order on
>> > which the entities should need to be processed.
>> >   - the second processor needs to be executed after complete the
>> first
>> > one.
>> >  When executed the updates having only the first processor, the
>> updates
>> > work all fine. However, when I added the second processor, it seems that
>> > the first update processor is not getting the entities in the order I was
>> > expected.
>> >  Has anyone had this problem before? Could anyone help me to
>> configure
>> > this?
>> >  Thank you very much in advance,
>> >  Roxana
>> >
>> >
>> >
>> >
>> >
>> >
>> > 
>>
>
>
>
> --
> Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
> WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk]  The
> UK's #1 job site.  [image: Follow us on Twitter]
> 
>  [image:
> Like us on Facebook] 
>  It's time to Love Mondays »
> 


Re: entity processing order during updates

2015-09-30 Thread Roxana Danger
Do you mean creating 2 instances and then generating a third one (or
updating one of them) for merging their data?
Is it not guaranteed that the entities in the DIH are imported in the order
described in the db-config file?
Thank you very much,
Roxana



On 30 September 2015 at 14:48, Alexandre Rafalovitch 
wrote:

> Have you tried just having two separate endpoints each with its own
> definition of DIH and URP? Then, you just hit those end-points one at
> a time in whatever order you need.
>
> Seems easier than a custom switching logic.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 28 September 2015 at 11:23, Roxana Danger
>  wrote:
> > Hello,
> >  I am importing in solr 2 entities coming from 2 different tables,
> and
> > I have defined an update request processor chain with two custom
> processor
> > factories:
> >  - the first processor factory needs to be executed first for one
> type
> > of entities and then for the other (I differentiate the "entity type"
> with
> > a field called table). In the import data config file I keep the order on
> > which the entities should need to be processed.
> >   - the second processor needs to be executed after complete the
> first
> > one.
> >  When executed the updates having only the first processor, the
> updates
> > work all fine. However, when I added the second processor, it seems that
> > the first update processor is not getting the entities in the order I was
> > expected.
> >  Has anyone had this problem before? Could anyone help me to
> configure
> > this?
> >  Thank you very much in advance,
> >  Roxana
> >
> >
> >
> >
> >
> >
> > 
>



-- 
Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk]  The
UK's #1 job site.  [image: Follow us on Twitter]

 [image:
Like us on Facebook] 
 It's time to Love Mondays »



Re: Find records with no values in solr.LatLongType fied type

2015-09-30 Thread Kamal Kishore Aggarwal
Thanks Erick..it worked..

On Wed, Sep 16, 2015 at 9:21 PM, Erick Erickson 
wrote:

> Top level queries need a *:* in front, something like
> q=*:* -usrlatlong_0_coordinate:[* TO *]
>
> I just took a quick check and just using usrlatlong:[* TO *]
> encounters a parse error.
>
> P.S. It would help if you told us what you _did_ receive
> when you tried your options. Parse errors? All docs?
>
> Best,
> Erick
>
> On Mon, Sep 14, 2015 at 10:58 PM, Kamal Kishore Aggarwal
>  wrote:
> > Hi,
> >
> > I am working on solr 4.8,1. I am trying to find the docs where
> latlongtype
> > have null values.
> >
> > I have tried using these, but not getting the results :
> >
> > 1) http://localhost:8984/solr/IM-Search/select?q.alt=-usrlatlong:[' '
> TO *]
> >
> > 2) http://localhost:8984/solr/IM-Search/select?q.alt=-usrlatlong:[* TO
> *]
> >
> > Here's the configurations :
> >>  >> subFieldSuffix="_coordinate"/>
> >>  >> required="false" multiValued="false" />
> >
> >
> > Please help.
>


Re: Cloud Deployment Strategy... In the Cloud

2015-09-30 Thread Steve Davids
Our project built a custom "admin" webapp that we use for various O&M
activities, so I went ahead and added the ability to upload a Zip
distribution which then uses SolrJ to forward the extracted contents to ZK.
This package is built and uploaded via a Gradle build task, which makes life
easy on us by allowing us to jam stuff into ZK which is sitting in a
private network (local VPC) without necessarily needing to be on a ZK
machine.

We then moved on to creating collections (trivial) and adding/removing
replicas. As for adding replicas, I am rather confused as to why I would
need to specify a specific shard for replica placement; before, when I
threw down a core.properties file, the machine would automatically come up
and figure out which shard it should join based on reasonable assumptions -
why wouldn't the same logic apply here? I then saw that a Rule-based
Replica Placement feature was added, which I thought would be reasonable,
but after looking at the tests it appears to still require a shard
parameter for adding a replica, which seems to defeat the entire purpose.

So after getting bummed out about that, I took a look at the delete replica
request. Since we are having machines come/go we need to start dropping
them, and I found that deleting a replica requires a collection, shard, and
replica name. If I have only the name of the machine, it appears the only
way to figure out what to remove is by walking the clusterstate tree for
all collections and determining which replicas are candidates for removal,
which seems unnecessarily complicated.
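
For reference, the two Collections API calls under discussion look roughly
like this (host, collection, and core names are placeholders):

  http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1
  http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3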

Hopefully I don't come off as complaining, but rather looking at it from a
client perspective, the Collections API doesn't seem simple to use and
really the only reason I am messing around with it now is because there are
repeated threats to make "zk as truth" the default in the 5.x branch at
some point in the future. I would personally advocate that something like
the autoManageReplicas  be
introduced to make life much simpler on clients as this appears to be the
thing I am trying to implement externally.

If anyone has happened to build a system to orchestrate Solr for cloud
infrastructure and has some pointers, it would be greatly appreciated.

Thanks,

-Steve

On Thu, Sep 24, 2015 at 10:15 AM, Dan Davis  wrote:

> ant is very good at this sort of thing, and easier for Java devs to learn
> than Make.  Python has a module called fabric that is also very fine, but
> for my dev. ops. it is another thing to learn.
> I tend to divide things into three categories:
>
>  - Things that have to do with system setup, and need to be run as root.
> For this I write a bash script (I should learn puppet, but...)
>  - Things that have to do with one time installation as a solr admin user
> with /bin/bash, including upconfig.   For this I use an ant build.
>  - Normal operational procedures.   For this, I typically use Solr admin or
> scripts, but I wish I had time to create a good webapp (or money to
> purchase Fusion).
>
>
> On Thu, Sep 24, 2015 at 12:39 AM, Erick Erickson 
> wrote:
>
> > bq: What tools do you use for the "auto setup"? How do you get your
> config
> > automatically uploaded to zk?
> >
> > Both uploading the config to ZK and creating collections are one-time
> > operations, usually done manually. Currently uploading the config set is
> > accomplished with zkCli (yes, it's a little clumsy). There's a JIRA to
> put
> > this into solr/bin as a command though. They'd be easy enough to script
> in
> > any given situation though with a shell script or wizard
> >
> > Best,
> > Erick
> >
> > On Wed, Sep 23, 2015 at 7:33 PM, Steve Davids  wrote:
> >
> > > What tools do you use for the "auto setup"? How do you get your config
> > > automatically uploaded to zk?
> > >
> > > On Tue, Sep 22, 2015 at 2:35 PM, Gili Nachum 
> > wrote:
> > >
> > > > Our auto setup sequence is:
> > > > 1.deploy 3 zk nodes
> > > > 2. Deploy solr nodes and start them connecting to zk.
> > > > 3. Upload collection config to zk.
> > > > 4. Call create collection rest api.
> > > > 5. Done. SolrCloud ready to work.
> > > >
> > > > Don't yet have automation for replacing or adding a node.
> > > > On Sep 22, 2015 18:27, "Steve Davids"  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am trying to come up with a repeatable process for deploying a
> Solr
> > > > Cloud
> > > > > cluster from scratch along with the appropriate security groups,
> auto
> > > > > scaling groups, and custom Solr plugin code. I saw that LucidWorks
> > > > created
> > > > > a Solr Scale Toolkit but that seems to be more of a one-shot deal
> > than
> > > > > really setting up your environment for the long-haul. Here is where
> we
> > > are
> > > > > at right now:
> > > > >
> > > > >1. ZooKeeper ensemble is easily brought up via a Cloud Formation
> > > > Script
> 

Re: Regression tests and evaluate quality of results

2015-09-30 Thread Doug Turnbull
Sounds exactly like our tool Quepid http://quepid.com :) which is our test
driven search toolbox.

Whether or not Quepid is the right fit for your application, we advocate
for a style of work called Test-Driven Relevancy.
http://opensourceconnections.com/blog/2013/10/14/what-is-test-driven-search-relevancy/

I've used both functional tests through Quepid and for clients that are
very particular or have the right analytics data judgement-list based
tests. Judgement lists are basically expert-graded ratings of results. The
decision when to use which really depends on the other side of the table.
Who is defining search correctness? How is it defined? Some organizations
have a lot of infrastructure to do this, and can give you judgement data to
tell you what the right answer is. Other organizations have much softer
requirements and are not as experienced with search. Sometimes ad-hoc,
assertion based testing is right for them.

Really the biggest question is how you define search correctness. Search is
an unique a piece of the user experience as any other part of the
application. You likely don't have the right expertise to define what
relevant means for your application. So a big part of our view of
test-driven relevancy is collaborating with domain/content/user experts
that can help define the right experience.

Shoot me an email if you want to chat at some point,
Cheers,
-Doug

On Wed, Sep 30, 2015 at 9:58 AM, marotosg  wrote:

> Hi,
>
> I have some doubts about how to define a process to evaluate the quality of
> search results.
> I have a solr collection with 4M documents with information about people. I
> search across several fields like  first name ,second name, email,
> address,
> phone etc.
>
> There is plenty of logic in the query. Some fields rank higher, exact match
> ranks higher than trailing search, etc.
>
> I was thinking of an approach where I can create some automated tests,
> based on some searches, that check whether the results are good enough and
> whether the ones which come first are actually better than the others.
> I would like to be able to identify how new functionality affects previous
> results.
>
> For this I thought of creating two types of tests.
> a) Functional tests which will test if the functionality is correct. They
> will be based on a subset of records which are static. In this case, I know
> what order results should come back from the query.
>
> b) Based on full data. I would like to run queries and see if the results
> are good enough. That's the part I am not sure if makes sense or how to do
> it.
>
> I am not sure if that's correct or if there is any standard to follow.
> Any help would be much appreciated.
>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Regression-tests-and-evaluate-quality-of-results-tp4232047.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Regression tests and evaluate quality of results

2015-09-30 Thread Toke Eskildsen
On Wed, 2015-09-30 at 06:58 -0700, marotosg wrote:
> b) Based on full data. I would like to run queries and see if the results
> are good enough. That's the part I am not sure if makes sense or how to do
> it.

Seems like an exact match for http://quepid.com/
(I am not affiliated)

- Toke Eskildsen, State and University Library, Denmark




Re: Regression tests and evaluate quality of results

2015-09-30 Thread Ahmet Arslan
Hi,

Testing quality requires "right answers" (query relevance judgments), which is 
expensive to create.
Once you have qrels, you can evaluate effectiveness of your system with metrics 
(MAP, ERR@20, NDCG@20, etc)
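
For illustration, a minimal sketch (assuming binary relevance judgments) of
average precision for one ranked result list; MAP is the mean of this value
over all test queries:

  // Average precision for a single query: relevantAtRank[i] says whether
  // the document returned at rank i+1 was judged relevant.
  static double averagePrecision(boolean[] relevantAtRank) {
      int totalRelevant = 0;
      for (boolean r : relevantAtRank) if (r) totalRelevant++;
      if (totalRelevant == 0) return 0.0;
      double hits = 0, sum = 0;
      for (int i = 0; i < relevantAtRank.length; i++) {
          if (relevantAtRank[i]) {
              hits++;
              sum += hits / (i + 1);  // precision at this cutoff
          }
      }
      return sum / totalRelevant;
  }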

Here is a presentation you might find relevant.
http://opensourceconnections.com/blog/2014/07/24/using-quepid-to-improve-relevancy-of-advance-auto-intranet-search/

Ahmet


On Wednesday, September 30, 2015 4:58 PM, marotosg  wrote:
Hi,

I have some doubts about how to define a process to evaluate the quality of
search results.
I have a solr collection with 4M documents with information about people. I
search across several fields like  first name ,second name, email,  address,
phone etc. 

There is plenty of logic in the query. Some fields rank higher, exact match
ranks higher than trailing search, etc.

I was thinking of an approach where I can create some automated tests,
based on some searches, that check whether the results are good enough and
whether the ones which come first are actually better than the others.
I would like to be able to identify how new functionality affects previous
results.

For this I thought of creating two types of tests.
a) Functional tests which will test if the functionality is correct. They
will be based on a subset of records which are static. In this case, I know
what order results should come back from the query.

b) Based on full data. I would like to run queries and see if the results
are good enough. That's the part I am not sure if makes sense or how to do
it.

I am not sure if that's correct or if there is any standard to follow.
Any help would be much appreciated.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Regression-tests-and-evaluate-quality-of-results-tp4232047.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-30 Thread Zheng Lin Edwin Yeo
Hi Charlie,

Yes sure, I'm now finalising my testing with all the different tokenizers,
and trying to understand how each of the tokenizers actually works.
Hopefully I will be able to share something useful about my experience once
I'm done with it.
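
In case it is useful to others, a minimal fieldType sketch for trying
HMMChineseTokenizer (the type name is illustrative, and it needs the
lucene-analyzers-smartcn jar from the analysis-extras contrib):

  <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>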

Regards,
Edwin


On 30 September 2015 at 17:25, Charlie Hull  wrote:

> On 30/09/2015 10:13, Zheng Lin Edwin Yeo wrote:
>
>> Hi Charlie,
>>
>
> Hi Edwin,
>
>>
>> Thanks for your reply. Seems like quite a number of the chinese tokenizers
>> are not really compatible with the newer versions of Solr
>>
>> I'm also looking at HMMChineseTokenizer and JiebaTokenizer to see if they
>> are suitable to be used for Solr 5.x too.
>>
>
> I think there is a general lack of knowledge (at least in the
> non-Chinese-speaking community) about the best way to analyze Chinese
> content with Lucene/Solr - so if you can write up your experiences that
> would be great!
>
> Cheers
>
> Charlie
>
>
>
>> Regards,
>> Edwin
>>
>>
>> On 30 September 2015 at 16:20, Charlie Hull  wrote:
>>
>> On 30/09/2015 04:09, Zheng Lin Edwin Yeo wrote:
>>>
>>> Hi Charlie,


>>> Hi,
>>>
>>>
 I've checked that Paoding's code is written for Solr 3 and Solr 4
 versions.
 It is not written for Solr 5, thus I was unable to use it in my Solr 5.x
 version.


>>> I'm pretty sure we had to recompile it for v4.6 as well... it has been a
>>> little painful.
>>>
>>>
 Have you tried to use HMMChineseTokenizer and JiebaTokenizer as well?


>>> I don't think so.
>>>
>>>
>>> Charlie
>>>
>>>
 Regards,
 Edwin


 On 25 September 2015 at 18:46, Charlie Hull  wrote:

 On 25/09/2015 11:43, Zheng Lin Edwin Yeo wrote:

>
> Hi Charlie,
>
>>
>> Thanks for your comment. I faced the compatibility issues with Paoding
>> when
>> I tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code
>> was
>> optimised for Solr 3.6.
>>
>> Which version of Solr are you using when you tried on the Paoding?
>>
>>
>> Solr v4.6 I believe.
>
> Charlie
>
>
> Regards,
>
>> Edwin
>>
>>
>> On 25 September 2015 at 16:43, Charlie Hull 
>> wrote:
>>
>> On 23/09/2015 16:23, Alexandre Rafalovitch wrote:
>>
>>
>>> You may find the following articles interesting:
>>>
>>>



 http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
 ( a whole epic journey)
 https://dzone.com/articles/indexing-chinese-solr


 The latter article is great and we drew on it when helping a recent

>>> client
>>> with Chinese indexing. However, if you do use Paoding bear in mind
>>> that
>>> it
>>> has few if any tests and all the comments are in Chinese. We found a
>>> problem with it recently (it breaks the Lucene highlighters) and have
>>> submitted a patch:
>>> http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1
>>>
>>> Cheers
>>>
>>> Charlie
>>>
>>>
>>> Regards,
>>>
>>>Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo <
 edwinye...@gmail.com>
 wrote:

 Hi,


> Would like to check, will StandardTokenizerFactory works well for
> indexing
> both English and Chinese (Bilingual) documents, or do we need
> tokenizers
> that are customised for chinese (Eg: HMMChineseTokenizerFactory)?
>
>
> Regards,
> Edwin
>
>
>
> --

>>> Charlie Hull
>>> Flax - Open Source Enterprise Search
>>>
>>> tel/fax: +44 (0)8700 118334
>>> mobile:  +44 (0)7767 825828
>>> web: www.flax.co.uk
>>>
>>>
>>>
>>>
>> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>
>
>

>>> --
>>> Charlie Hull
>>> Flax - Open Source Enterprise Search
>>>
>>> tel/fax: +44 (0)8700 118334
>>> mobile:  +44 (0)7767 825828
>>> web: www.flax.co.uk
>>>
>>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


Re: Keyword match distance rule issue

2015-09-30 Thread anil.vadhavane
Hi Jack,

Thanks for a quick reply.

I understood your point regarding the edit distances related restriction in
Solr. Yes, the query string does not contain actual quotes. The query should
match with 2 edit distance. As I mentioned, if we try "Bridffwater~2", Solr
matching it.

We haven't noticed the Exception. We are using Solarium (php) client to
query Solr. We have also tried direct query to Solr using web browser.

Can you please check this case on your system and let us know if it matches?
if it is, we can go ahead and do further analysis to solve it. Please tell
us your Solr version and operating system if it matches.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Keyword-match-distance-rule-issue-tp4231624p4232055.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Keyword match distance rule issue

2015-09-30 Thread Jack Krupansky
This feature is known as fuzzy query, not keyword match.

Unfortunately, the maximum edit distance is limited to 2; 3 or more are not
supported. Lucene itself still has the old "slow" fuzzy query that supports
larger edit distances, but Solr has no syntax for selecting it.

Actually, this limit of 2 is strict and enforced in Solr 4.x and 5.x and an
exception will be thrown. So, are you really not seeing an exception when
you use an edit distance greater than 2?

Also, please confirm that your query string does not contain actual quotes.
If it did, the fuzzy syntax would simply be analyzed as if it were simple
text.
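
To illustrate the distinction (field name assumed):

  q=title:Bridgewater~2     fuzzy query with a maximum edit distance of 2
  q=title:"Bridgewater~2"   quoted, so analyzed as literal text, no fuzziness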


-- Jack Krupansky

On Wed, Sep 30, 2015 at 9:32 AM, anil.vadhavane 
wrote:

> Hi Benedetti,
>
> Yes, at first it looks like a user error and I am surprised as well with
> the case.
>
> We tested this on two different systems. We tried it with lower-case input
> but it is not matching. We are using the standard title column to store the
> data. We even tried with 3, 4 and 5 edit distance but this particular
> query is not matching.
>
> I wonder if anyone has really tried this on their own system to confirm
> whether that is the case for others as well.
>
> Just to clarify -
>
> We want to match "emma bridwater radios", stored in the title column, with
> the search query "Bridgewater~2" (you can use 3 edit distance if you want).
> We observed that Solr is not matching it. However, if we try
> "Bridffwater~2", Solr matches it.
>
> It might be a silly mistake on our side but we are not able to find the
> solution at present.
>
>
> Thanks
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Keyword-match-distance-rule-issue-tp4231624p4232040.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


MongoDB to Solr connector - anyone done it?

2015-09-30 Thread Gili Nachum
Hi,

Looking to learn from the experience of others: what works best?

Looking for a production-grade solution to efficiently push data from a
multi-sharded Mongo to a multi-sharded Solr, both in a continuous manner
and in a one-off fashion.
Not having to write any code would be a nice bonus.

What I found so far:
Option 1 - Mongo pushing to Solr with the Mongo connector, see: Blog1
Option 2 - Solr pulling from Mongo with solr-mongodb-dih (a Data Import
Handler extension).


RE: Passing Basic Auth info to HttpSolrClient

2015-09-30 Thread Davis, Daniel (NIH/NLM) [C]
HttpSolrClient can accept the Apache Commons HttpClient in its constructor:

https://lucene.apache.org/solr/5_3_1/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrClient.html

You can use the HttpClientBuilder 
(http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/client/HttpClientBuilder.html),
  to build an HttpClient that does Basic Authentication, and then pass client 
to the SolrHttpClient constructor.

A search on "HttpClientBuilder Basic Authentication" returned many hits, but 
here's one of them:

http://www.baeldung.com/httpclient-4-basic-authentication
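
A minimal sketch of the above (the URL and credentials are placeholders):

  import org.apache.http.auth.AuthScope;
  import org.apache.http.auth.UsernamePasswordCredentials;
  import org.apache.http.impl.client.BasicCredentialsProvider;
  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClientBuilder;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;

  // HttpClient that answers Jetty's 401 challenge with basic auth
  BasicCredentialsProvider creds = new BasicCredentialsProvider();
  creds.setCredentials(AuthScope.ANY,
      new UsernamePasswordCredentials("solr", "secret"));
  CloseableHttpClient http = HttpClientBuilder.create()
      .setDefaultCredentialsProvider(creds)
      .build();

  // hand the pre-configured client to SolrJ
  HttpSolrClient client =
      new HttpSolrClient("http://localhost:8983/solr/collection1", http);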

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH


-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Tuesday, September 29, 2015 8:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Passing Basic Auth info to HttpSolrClient

Hi,

Re-posting to see if anyone can help.  If my question is not clear, let me know.

Thanks!

Steve

On Mon, Sep 28, 2015 at 5:15 PM, Steven White  wrote:

> Hi,
>
> I'm using HttpSolrClient to connect to Solr.  Everything works until 
> when I enabled basic authentication in Jetty.  My question is, how do 
> I pass to SolrJ the basic auth info. so that I don't get a 401 error?
>
> Thanks in advance
>
> Steve
>


Regression tests and evaluate quality of results

2015-09-30 Thread marotosg
Hi,

I have some doubts about how to define a process to evaluate the quality of
search results.
I have a solr collection with 4M documents with information about people. I
search across several fields like  first name ,second name, email,  address,
phone etc. 

There is plenty of logic in the query. Some fields rank higher, exact match
ranks higher than trailing search, etc.

I was thinking of an approach where I can create some automated tests,
based on some searches, that check whether the results are good enough and
whether the ones which come first are actually better than the others.
I would like to be able to identify how new functionality affects previous
results.

For this I thought of creating two types of tests.
a) Functional tests which will test if the functionality is correct. They
will be based on a subset of records which are static. In this case, I know
what order results should come back from the query.
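
As a sketch, such a test could look like this with SolrJ and JUnit (the
collection name, query, and expected id are placeholders):

  import static org.junit.Assert.assertEquals;

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrDocumentList;
  import org.junit.Test;

  public class RelevancyTest {
      @Test
      public void exactNameMatchRanksFirst() throws Exception {
          HttpSolrClient solr =
              new HttpSolrClient("http://localhost:8983/solr/people-test");
          // the static fixture core makes the expected ordering deterministic
          SolrDocumentList docs =
              solr.query(new SolrQuery("\"John Smith\"")).getResults();
          assertEquals("person-42", docs.get(0).getFieldValue("id"));
      }
  }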

b) Based on full data. I would like to run queries and see if the results
are good enough. That's the part I am not sure if makes sense or how to do
it.

I am not sure if that's correct or if there is any standard to follow.
Any help would be much appreciated.

 
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Regression-tests-and-evaluate-quality-of-results-tp4232047.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.8 - Updating zkhost list in solr.xml without requiring a restart

2015-09-30 Thread pramodEbay

> The idea is that your list of zookeeper hostnames is a virtual one, not 
> a real one. 

Thanks for the suggestion. Looks like I am not alone in thinking along the
same lines. I am planning on doing that and was not sure if anyone else
tried this approach and validated that it worked. 
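
For what it's worth, the virtual list could be as simple as stable DNS
aliases (names illustrative), so a ZooKeeper node can be replaced by
repointing an alias instead of editing solr.xml:

  zkHost=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr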



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-8-Updating-zkhost-list-in-solr-xml-without-requiring-a-restart-tp4231979p4232045.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: entity processing order during updates

2015-09-30 Thread Alexandre Rafalovitch
Have you tried just having two separate endpoints each with its own
definition of DIH and URP? Then, you just hit those end-points one at
a time in whatever order you need.

Seems easier than a custom switching logic.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 28 September 2015 at 11:23, Roxana Danger
 wrote:
> Hello,
>  I am importing in solr 2 entities coming from 2 different tables, and
> I have defined an update request processor chain with two custom processor
> factories:
>  - the first processor factory needs to be executed first for one type
> of entities and then for the other (I differentiate the "entity type" with
> a field called table). In the import data config file I keep the order on
> which the entities should need to be processed.
>   - the second processor needs to be executed after complete the first
> one.
>  When executed the updates having only the first processor, the updates
> work all fine. However, when I added the second processor, it seems that
> the first update processor is not getting the entities in the order I was
> expected.
>  Has anyone had this problem before? Could anyone help me to configure
> this?
>  Thank you very much in advance,
>  Roxana
>
>
>
>
>
>
> 


Re: Keyword match distance rule issue

2015-09-30 Thread anil.vadhavane
Hi Benedetti,

Yes, at first it looks like a user error and I am surprised as well with the
case.

We tested this on two different systems. We tried it with lower-case input
but it is not matching. We are using the standard title column to store the
data. We even tried with 3, 4 and 5 edit distance but this particular query
is not matching.

I wonder if anyone has really tried this on their own system to confirm
whether that is the case for others as well.

Just to clarify -

We want to match "emma bridwater radios", stored in the title column, with the
search query "Bridgewater~2" (you can use 3 edit distance if you want). We
observed that Solr is not matching it. However, if we try "Bridffwater~2",
Solr matches it.

It might be a silly mistake on our side but we are not able to find the
solution at present.


Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Keyword-match-distance-rule-issue-tp4231624p4232040.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What kind of nutch documents does Solr index?

2015-09-30 Thread NutchDev
What Nutch does is, after fetching documents from the server, they are
passed to the parser; the parser detects the document type and does the
parsing accordingly.

One possibility could be that the parser failed to parse some documents,
and that's why you are getting a count mismatch.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-kind-of-nutch-documents-does-Solr-index-tp4231646p4232034.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What kind of nutch documents does Solr index?

2015-09-30 Thread Daniel Holmes
Thank you Upayavira for your answer. In the case I described, maxDoc is 19263.
As I checked in Nutch, the default indexing filter is the basic indexing
filter, and it also has a property to delete gone and permanently
redirected pages, whose value was false for me.
I think the problem still remains on the Solr side.


On Mon, Sep 28, 2015 at 3:03 PM, Upayavira  wrote:

> I suspect you may be better off asking this on the Nutch user list. The
> decisions you are describing will be within the Nutch codebase, not
> Solr. Someone here may know (hopefully) but you may get more support
> over on the Nutch list.
>
> One suggestion -start with a clean, empty index. Run a crawl. Look at
> the maxDocs vs numDocs (visible via the admin UI for your
> core/collection). If maxDocs>numDocs, it means that some docs have been
> overwritten - i.e. the ID field that Nutch is using is not unique.
>
> Upayavira
>
> On Mon, Sep 28, 2015, at 10:19 AM, Daniel Holmes wrote:
> > Hi,
> > I am using apache Nutch 1.7 to crawl and apache Solr 4.7.2 for indexing.
> > In
> > my tests there is a gap between number of fetched results of Nutch and
> > number of indexed documents in Solr. For example one of the crawls is
> > fetched 23343 pages and 1146 images successfully while in the Solr 19250
> > docs is indexed and 500 of them is image urls.
> >
> > My question is that what kind of pages are indexed in Solr and why?
> > Does Solr index pages with other statuses or not?
> > What kind of images does Solr index?
> >
> > Thanks.
>


Re: MoreLikeThisHandler with mltipli input documents

2015-09-30 Thread Alessandro Benedetti
This query time is still suspicious...
Have you tried playing with the MLT params?
Min term frequency? Min doc frequency?
You can reduce the terms to query:

mlt.qf: Query fields and their boosts using the same format as that used by
the DisMaxRequestHandler. These fields must also be specified in mlt.fl.

mlt.minwl: Sets the minimum word length below which words will be ignored.

mlt.mintf: Specifies the Minimum Term Frequency, the frequency below which
terms will be ignored in the source document.

mlt.mindf: Specifies the Minimum Document Frequency, the frequency at which
words will be ignored which do not occur in at least this many documents.

mlt.maxwl: Sets the maximum word length above which words will be ignored.

mlt.maxqt: Sets the maximum number of query terms that will be included in
any generated query.

mlt.maxntp: Sets the maximum number of tokens to parse in each example
document field that is not stored with TermVector support.

mlt.maxdf: Specifies the Maximum Document Frequency, the frequency at which
words will be ignored which occur in more than this many documents.

mlt.fl: Specifies the fields to use for similarity. If possible, these
should have stored termVectors.

mlt.boost: Specifies if the query will be boosted by the interesting term
relevance. It can be either "true" or "false".
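
For example, a request along these lines (the handler path and field name
are assumptions) caps the generated query at 25 terms:

  http://host:8983/solr/books/mlt?q=id:1234&mlt.fl=description&mlt.mintf=2&mlt.mindf=5&mlt.maxqt=25&rows=5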



2015-09-30 9:40 GMT+01:00 Szűcs Roland :

> Hello Upayavira,
>
> We use the ajax call and it can work when it takes only some seconds (even
> the 7 sec can be acceptable in this case) as the customers first focus on
> the product page and if they are not satisfied with the e-book they will
> need the offer. I am just started to scare what will happen if we move to
> the market of English ebooks with 1 million titles. I will try the
> clustering as well, or using the termvector component we can implmenet our
> own more like this calculation as we realized that sometimes less than 25
> interesting terms are enough to make good recommendation and it can make
> the calculation faster. If you see my previous email with the intresting
> terms it shows clearly that half of the terms would be enough or even less.
> What a pity that there is no such a parameter for the more like this
> handler: mlt.interestingtermcount which would be set 25 as a default but we
> could modify it in the solrconfig to make the calculation less resource
> intensive.
>
> Thank you Upayavira and Alessandro the lots of help and effort you made. I
> see the options much clearer now.
>
> Cheers,
> Roland
>
> 2015-09-30 10:23 GMT+02:00 Upayavira :
>
> > Could you do the MLT as a separate (AJAX) request? They appear a little
> > afterwards, whilst the user is already reading the page?
> >
> > Or, you could do offline clustering, in which case, overnight, you
> > compare every document with every other, using a (likely non-solr)
> > clustering algorithm, and store those in a separate core. Then you can
> > request those immediately after your search query. Or reindex your
> > content with that data stored alongside.
> >
> > Upayavira
> >
> > On Wed, Sep 30, 2015, at 09:16 AM, Alessandro Benedetti wrote:
> > > I am still missing why you quote the number of the documents...
> > > If you have 5600 polish books, but you use the MLT only when you land
> in
> > > the page of a specific book ...
> > > I think I still miss the point!
> > > MLT on 1 Polish book takes 7 secs?
> > >
> > >
> > > 2015-09-30 9:10 GMT+01:00 Szűcs Roland :
> > >
> > > > Hi Alessandro,
> > > >
> > > > You are right. I forget to mention one important factor. For 3000
> > hungarian
> > > > e-books the approach you mentioned is absolutely fine as the response
> > time
> > > > is some 0.7 sec. But when I use the same mlt for 5600 polish e-books
> > the
> > > > response time is 7 sec which is definitely not acceptable for the
> > users.
> > > >
> > > > Regards,
> > > > Roland
> > > >
> > > > 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <
> > > > benedetti.ale...@gmail.com>
> > > > :
> > > >
> > > > > Hi Roland,
> > > > > you said "The main goal is that when a customer is on the pruduct
> > page ".
> > > > > But if you are in a  product page, I guess you have the product Id.
> > > > > If you have the product id , you can simply execute the MLT request
> > with
> > > > > the single Doc Id in input.
> > > > >
> > > > > Why do you need to calculate beforehand?
> > > > >
> > > > > Cheers
> > > > >
> > > > > 2015-09-29 15:44 GMT+01:00 Szűcs Roland <
> szucs.rol...@bookandwalk.hu
> > >:
> > > > >
> > > > > > Hello Upayavira,
> > > > > >
> > > > > > The main goal is that when a customer is on the product page of
> > > > > > an e-book and somehow does not like it, I want to immediately
> > > > > > offer her/him alternative e-books on the same topic. If I expect
> > > > > > the customer to click on a button like "similar e-books" I lose
> > > > > > half of them, as they are lazy to click anywhere. So I would like
> > > > > > to present the alternatives on the product pages without clicking.

Advice for configuring Solr 5.3.1 on CentOS

2015-09-30 Thread Porky Pig
Hello.

I managed to compile Solr 5.3.1 from source with Ant.

I am able to start Solr but not much else.
It appears that it can't find its Java libraries. Also, the solr-webapp
subdirectory doesn't contain anything, while other similar paths do. I'm
attaching two log files which I believe relate to the server startup and
my attempt to connect to it.

The server can be started with

solr start

but any connection attempt then gets a 'Service unavailable' reply.
Trying the example configuration:

solr start -e cloud

results in a missing-library error.

I found the class files in a subfolder, so I believe they were compiled
correctly. I'd like to know how I can get Solr to load the proper paths.

Setting the Java CLASSPATH environment variable doesn't seem to help.

Thanks in advance.
2015-09-29 13:57:38.964 INFO  (main) [   ] o.e.j.u.log Logging initialized @613ms
2015-09-29 13:57:39.185 INFO  (main) [   ] o.e.j.s.Server jetty-9.2.11.v20150529
2015-09-29 13:57:39.206 WARN  (main) [   ] o.e.j.s.h.RequestLogHandler !RequestLog
2015-09-29 13:57:39.209 INFO  (main) [   ] o.e.j.d.p.ScanningAppProvider Deployment monitor [file:/usr/local/src/solr-5.3.1/solr/server/contexts/] at interval 0
2015-09-29 13:57:39.358 WARN  (main) [   ] o.e.j.w.WebInfConfiguration Web application not found /usr/local/src/solr-5.3.1/solr/server/solr-webapp/webapp
2015-09-29 13:57:39.358 WARN  (main) [   ] o.e.j.w.WebAppContext Failed startup of context o.e.j.w.WebAppContext@1aab8583{/solr,null,null}{/usr/local/src/solr-5.3.1/solr/server/solr-webapp/webapp}
java.io.FileNotFoundException: /usr/local/src/solr-5.3.1/solr/server/solr-webapp/webapp
	at org.eclipse.jetty.webapp.WebInfConfiguration.unpack(WebInfConfiguration.java:493)
	at org.eclipse.jetty.webapp.WebInfConfiguration.preConfigure(WebInfConfiguration.java:72)
	at org.eclipse.jetty.webapp.WebAppContext.preConfigure(WebAppContext.java:468)
	at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:504)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:41)
	at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:186)
	at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:498)
	at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:146)
	at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:180)
	at org.eclipse.jetty.deploy.providers.WebAppProvider.fileAdded(WebAppProvider.java:461)
	at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:64)
	at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:609)
	at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:528)
	at org.eclipse.jetty.util.Scanner.scan(Scanner.java:391)
	at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:313)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:150)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:560)
	at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:235)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
	at org.eclipse.jetty.server.Server.start(Server.java:387)
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114)
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
	at org.eclipse.jetty.server.Server.doStart(Server.java:354)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1255)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1174)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.eclipse.jetty.start.Main.invokeMain(Main.java:321)
	at org.eclipse.jetty.start.Main.start(Main.java:817)
	at org.eclipse.jetty.start.Main.main(Main.java:112)
2015-09-29 13:57:39.373 INFO  (main) [   ] o.e.j.s.ServerConnector Started ServerConnector@4b34b33e{HTTP/1.1}{0.0.0.0:8983}
2015-09-29 13:57:39.373 INFO  (main) [   ] o.e.j.s.Server Started @1024ms
2015-09-29 14:03:40.000 INFO  (ShutdownMonitor) [   ] o.e.j.s.ServerConnector Stopped ServerConnector@4b34b33e{HTTP/1.1}{0.0.0.0:8983}
2015-09-29 14:03:40.001 INF
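For what it's worth, the "Web application not found ... solr-webapp/webapp"
error in the log above usually means the server directory was never populated
from the source build. A possible recovery sketch, assuming a plain source
checkout (the Ant target name is from memory, so verify it against your
checkout's build.xml before relying on it):

    cd /usr/local/src/solr-5.3.1/solr
    ant server      # should build the webapp and unpack it under server/solr-webapp
    bin/solr start  # the Jetty context should then deploy normally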

Re: Can StandardTokenizerFactory work well for Chinese and English (Bilingual)?

2015-09-30 Thread Charlie Hull

On 30/09/2015 10:13, Zheng Lin Edwin Yeo wrote:

Hi Charlie,


Hi Edwin,


Thanks for your reply. Seems like quite a number of the Chinese tokenizers
are not really compatible with the newer versions of Solr.

I'm also looking at HMMChineseTokenizer and JiebaTokenizer to see if they
are suitable to be used for Solr 5.x too.


I think there is a general lack of knowledge (at least in the 
non-Chinese-speaking community) about the best way to analyze Chinese 
content with Lucene/Solr - so if you can write up your experiences that 
would be great!


Cheers

Charlie



Regards,
Edwin


On 30 September 2015 at 16:20, Charlie Hull  wrote:


On 30/09/2015 04:09, Zheng Lin Edwin Yeo wrote:


Hi Charlie,



Hi,



I've checked that Paoding's code is written for Solr 3 and Solr 4
versions.
It is not written for Solr 5, thus I was unable to use it in my Solr 5.x
version.



I'm pretty sure we had to recompile it for v4.6 as well... it has been a
little painful.



Have you tried to use HMMChineseTokenizer and JiebaTokenizer as well?



I don't think so.


Charlie



Regards,
Edwin


On 25 September 2015 at 18:46, Charlie Hull  wrote:

On 25/09/2015 11:43, Zheng Lin Edwin Yeo wrote:


Hi Charlie,


Thanks for your comment. I faced the compatibility issues with Paoding
when
I tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code
was
optimised for Solr 3.6.

Which version of Solr are you using when you tried on the Paoding?



Solr v4.6 I believe.

Charlie


Regards,

Edwin


On 25 September 2015 at 16:43, Charlie Hull  wrote:

On 23/09/2015 16:23, Alexandre Rafalovitch wrote:



You may find the following articles interesting:





http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
( a whole epic journey)
https://dzone.com/articles/indexing-chinese-solr


The latter article is great and we drew on it when helping a recent

client
with Chinese indexing. However, if you do use Paoding bear in mind that
it
has few if any tests and all the comments are in Chinese. We found a
problem with it recently (it breaks the Lucene highlighters) and have
submitted a patch:
http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1

Cheers

Charlie


Regards,


   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo <
edwinye...@gmail.com>
wrote:

Hi,



Would like to check: will StandardTokenizerFactory work well for indexing
both English and Chinese (bilingual) documents, or do we need tokenizers
that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?


Regards,
Edwin




--

Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Can StandardTokenizerFactory work well for Chinese and English (Bilingual)?

2015-09-30 Thread Zheng Lin Edwin Yeo
Hi Charlie,

Thanks for your reply. Seems like quite a number of the Chinese tokenizers
are not really compatible with the newer versions of Solr.

I'm also looking at HMMChineseTokenizer and JiebaTokenizer to see if they
are suitable to be used for Solr 5.x too.
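
A minimal field type sketch for that pipeline — untested, adapted from the
reference guide example, and assuming the lucene-analyzers-smartcn jar is on
the classpath — would be something like:

    <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- HMM-based word segmentation from the smartcn module -->
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory"
                words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>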

Regards,
Edwin


On 30 September 2015 at 16:20, Charlie Hull  wrote:

> On 30/09/2015 04:09, Zheng Lin Edwin Yeo wrote:
>
>> Hi Charlie,
>>
>
> Hi,
>
>>
>> I've checked that Paoding's code is written for Solr 3 and Solr 4
>> versions.
>> It is not written for Solr 5, thus I was unable to use it in my Solr 5.x
>> version.
>>
>
> I'm pretty sure we had to recompile it for v4.6 as well... it has been a
> little painful.
>
>>
>> Have you tried to use HMMChineseTokenizer and JiebaTokenizer as well?
>>
>
> I don't think so.
>
>
> Charlie
>
>>
>> Regards,
>> Edwin
>>
>>
>> On 25 September 2015 at 18:46, Charlie Hull  wrote:
>>
>> On 25/09/2015 11:43, Zheng Lin Edwin Yeo wrote:
>>>
>>> Hi Charlie,

 Thanks for your comment. I faced the compatibility issues with Paoding
 when
 I tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code
 was
 optimised for Solr 3.6.

 Which version of Solr are you using when you tried on the Paoding?


>>> Solr v4.6 I believe.
>>>
>>> Charlie
>>>
>>>
>>> Regards,
 Edwin


 On 25 September 2015 at 16:43, Charlie Hull  wrote:

 On 23/09/2015 16:23, Alexandre Rafalovitch wrote:

>
> You may find the following articles interesting:
>
>>
>>
>>
>> http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
>> ( a whole epic journey)
>> https://dzone.com/articles/indexing-chinese-solr
>>
>>
>> The latter article is great and we drew on it when helping a recent
> client
> with Chinese indexing. However, if you do use Paoding bear in mind that
> it
> has few if any tests and all the comments are in Chinese. We found a
> problem with it recently (it breaks the Lucene highlighters) and have
> submitted a patch:
> http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1
>
> Cheers
>
> Charlie
>
>
> Regards,
>
>>   Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo <
>> edwinye...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>>>
>>> Would like to check: will StandardTokenizerFactory work well for
>>> indexing both English and Chinese (bilingual) documents, or do we need
>>> tokenizers that are customised for Chinese (e.g.
>>> HMMChineseTokenizerFactory)?
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>>
>> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>
>
>

>>> --
>>> Charlie Hull
>>> Flax - Open Source Enterprise Search
>>>
>>> tel/fax: +44 (0)8700 118334
>>> mobile:  +44 (0)7767 825828
>>> web: www.flax.co.uk
>>>
>>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


[poll] virtualization platform for SOLR

2015-09-30 Thread Bernd Fehling
Dear solr users,

While setting up some new servers (virtual machines) using XEN, I was
thinking about an alternative like KVM. My last tests with KVM were a while
ago, and XEN performed much better in the area of I/O and CPU usage.
This led me to the idea of starting a poll about virtualization platforms
and your experiences.

Here are my questions (and my own answers):

- what kind of virtualization platform are you using?
XEN

- what made your decision to use that platform and not any other?
XEN performed better than KVM.
Only tested XEN and KVM, nothing else.

- any problems seen so far like:
- - hanging server without hardware problems?
Never seen.

- - I/O bottleneck?
Not with XEN.

- - which OS?
SUSE Linux Enterprise Server 11 x86_64bit

- - JAVA problems?
Not with JAVA 6 and 7.
No experience yet with JAVA 8.


Regards
Bernd


Re: MoreLikeThisHandler with multiple input documents

2015-09-30 Thread Szűcs Roland
Hello Upayavira,

We use the AJAX call and it can work when it takes only a few seconds (even
the 7 sec can be acceptable in this case), as the customers first focus on
the product page, and only if they are not satisfied with the e-book will
they need the offer. I have just started to worry about what will happen if
we move to the market of English e-books with 1 million titles. I will try
the clustering as well; alternatively, using the termvector component we can
implement our own more-like-this calculation, since we realized that
sometimes fewer than 25 interesting terms are enough to make a good
recommendation, which can make the calculation faster. My previous email
with the interesting terms shows clearly that half of the terms would be
enough, or even fewer. What a pity that there is no such parameter for the
MoreLikeThis handler (say, mlt.interestingtermcount) that would default to
25 but could be modified in solrconfig to make the calculation less
resource intensive.
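
As an aside, if I am not mistaken, mlt.maxqt may already come close to this:
it caps the number of query terms MLT builds its query from, and I believe it
defaults to 25. Something along these lines could be worth testing (field and
id are from my Polish test core):

http://localhost:8983/solr/bandwpl/mlt?q=id:10812&mlt.fl=content&mlt.maxqt=12&mlt.interestingTerms=details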

Thank you, Upayavira and Alessandro, for all the help and effort. I now see
the options much more clearly.

Cheers,
Roland

2015-09-30 10:23 GMT+02:00 Upayavira :

> Could you do the MLT as a separate (AJAX) request? They appear a little
> afterwards, whilst the user is already reading the page?
>
> Or, you could do offline clustering, in which case, overnight, you
> compare every document with every other, using a (likely non-solr)
> clustering algorithm, and store those in a separate core. Then you can
> request those immediately after your search query. Or reindex your
> content with that data stored alongside.
>
> Upayavira
>
> On Wed, Sep 30, 2015, at 09:16 AM, Alessandro Benedetti wrote:
> > I am still missing why you quote the number of documents...
> > If you have 5600 Polish books, but you use the MLT only when you land on
> > the page of a specific book...
> > I think I still miss the point!
> > MLT on one Polish book takes 7 secs?
> >
> >
> > 2015-09-30 9:10 GMT+01:00 Szűcs Roland :
> >
> > > Hi Alessandro,
> > >
> > > You are right. I forgot to mention one important factor. For 3000
> > > Hungarian e-books the approach you mentioned is absolutely fine, as the
> > > response time is some 0.7 sec. But when I use the same MLT for 5600
> > > Polish e-books the response time is 7 sec, which is definitely not
> > > acceptable for the users.
> > >
> > > Regards,
> > > Roland
> > >
> > > 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <
> > > benedetti.ale...@gmail.com>
> > > :
> > >
> > > > Hi Roland,
> > > > you said "The main goal is that when a customer is on the product
> > > > page".
> > > > But if you are on a product page, I guess you have the product Id.
> > > > If you have the product id, you can simply execute the MLT request
> with
> > > > the single Doc Id in input.
> > > >
> > > > Why do you need to calculate beforehand?
> > > >
> > > > Cheers
> > > >
> > > > 2015-09-29 15:44 GMT+01:00 Szűcs Roland  >:
> > > >
> > > > > Hello Upayavira,
> > > > >
> > > > > The main goal is that when a customer is on the product page on an
> > > e-book
> > > > > and he does not like it somehow I want to immediately offer her/him
> > > > > alternative e-books in the same topic. If I expect from the
> customer to
> > > > > click on a button like "similar e-books" I lose half of them as
> they
> > > are
> > > > > lazy to click anywhere. So I would like to present on the product
> pages
> > > > the
> > > > > alternatives of the e-books  without clicking.
> > > > >
> > > > > I assumed the best idea was to calculate the similar e-books for all
> the
> > > > other
> > > > > (n*(n-1) similarity calculation) and present only the top 5. I
> planned
> > > to
> > > > > do it when our server is not busy. At this point I found the
> > > description
> > > > of
> > > > > mlt as a search component which seemed to be a good candidate as it
> > > > > calculates the similar documents to all the result set of the
> query. So
> > > > if
> > > > > I say q=*:* and mlt component is enabled I get similar document
> for my
> > > > > entire document set. The only problem was with this approach that
> mlt
> > > > > search component does not give back the interesting terms for my
> tag
> > > > cloud
> > > > > calculation.
> > > > >
> > > > > That's why I tried to mix the flexibility of the mlt component
> (multiple
> > > > docs
> > > > > as an input accepted) with the robustness of MoreLikeThisHandler
> > > (having
> > > > > interesting terms).
> > > > >
> > > > > If there is no solution, I will use the mlt component and solve
> > > > > the tag cloud calculation another way. By the way, if I am not
> > > > > mistaken, the 5.3.1 version takes the union of the feature set of
> > > > > the mlt component and handler.
> > > > >
> > > > > Best Regards,
> > > > > Roland
> > > > >
> > > > >
> > > > >
> > > > > 2015-09-29 14:38 GMT+02:00 Upayavira :
> > > > >
> > > > > Let's take a step back. So, you have 3000 or so docs, and you
> > > > > want to know which documents are similar to these.

real tf-idf numbers for all the terms

2015-09-30 Thread Roland Szűcs
Hi all,

Is there any out-of-the-box way to get the tf-idf values for all the terms?
The TermVectorComponent is not good for this: even if I set tv.tf_idf, tv.tf
and tv.df to true and get the tf and df values back, the tf-idf it reports
is a pure division (tf/df). Where is the log transformation of the inverse
document frequency? Can function queries help somehow? My goal is to use a
document id as input and get back the terms with the real tf-idf
calculation.
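
A minimal client-side sketch of the missing log transformation, using the
classic textbook weighting (which is not necessarily the exact formula
Lucene's Similarity applies):

    import math

    def tf_idf(tf, df, num_docs):
        """Classic tf-idf: raw term frequency scaled by log inverse document
        frequency. tf and df come straight from tv.tf / tv.df; num_docs is
        the total document count (e.g. numFound for q=*:*)."""
        return tf * math.log(num_docs / (1 + df))  # +1 avoids division by zero

    # e.g. a term occurring 5 times in a doc and in 100 of 5600 documents
    print(tf_idf(5, 100, 5600))  # ~20.1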

Regards,

-- 
Roland Szűcs
Connect with me on LinkedIn
CEO
Phone: +36 1 210 81 13
Bookandwalk.hu



Re: MoreLikeThisHandler with mltipli input documents

2015-09-30 Thread Upayavira
Could you do the MLT as a separate (AJAX) request? They appear a little
afterwards, whilst the user is already reading the page?

Or, you could do offline clustering, in which case, overnight, you
compare every document with every other, using a (likely non-solr)
clustering algorithm, and store those in a separate core. Then you can
request those immediately after your search query. Or reindex your
content with that data stored alongside.
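
A rough sketch of that offline route, assuming the book texts have already
been exported from Solr and scikit-learn is available (ids and texts below
are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # (book_id, full_text) pairs previously exported from Solr
    docs = [("10812", "family saga set in madras"),
            ("11335", "a mother and her family in india"),
            ("14984", "life and fate of a hindu woman")]
    ids, texts = zip(*docs)

    # Vectorize once, then compare every document with every other (n x n)
    matrix = TfidfVectorizer(max_features=20000).fit_transform(texts)
    sims = cosine_similarity(matrix)

    # Keep the top 5 neighbours per book, skipping the book itself at rank 0
    top5 = {ids[i]: [ids[j] for j in sims[i].argsort()[::-1][1:6]]
            for i in range(len(ids))}
    # top5 can then be indexed into a separate core or stored on each doc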

Upayavira

On Wed, Sep 30, 2015, at 09:16 AM, Alessandro Benedetti wrote:
> I am still missing why you quote the number of documents...
> If you have 5600 Polish books, but you use the MLT only when you land on
> the page of a specific book...
> I think I still miss the point!
> MLT on one Polish book takes 7 secs?
> 
> 
> 2015-09-30 9:10 GMT+01:00 Szűcs Roland :
> 
> > Hi Alessandro,
> >
> > You are right. I forgot to mention one important factor. For 3000
> > Hungarian e-books the approach you mentioned is absolutely fine, as the
> > response time is some 0.7 sec. But when I use the same MLT for 5600
> > Polish e-books the response time is 7 sec, which is definitely not
> > acceptable for the users.
> >
> > Regards,
> > Roland
> >
> > 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <
> > benedetti.ale...@gmail.com>
> > :
> >
> > > Hi Roland,
> > > you said "The main goal is that when a customer is on the product page".
> > > But if you are on a product page, I guess you have the product Id.
> > > If you have the product id, you can simply execute the MLT request with
> > > the single Doc Id in input.
> > >
> > > Why do you need to calculate beforehand?
> > >
> > > Cheers
> > >
> > > 2015-09-29 15:44 GMT+01:00 Szűcs Roland :
> > >
> > > > Hello Upayavira,
> > > >
> > > > The main goal is that when a customer is on the product page on an
> > > > e-book
> > > > and he does not like it somehow I want to immediately offer her/him
> > > > alternative e-books in the same topic. If I expect from the customer to
> > > > click on a button like "similar e-books" I lose half of them as they
> > are
> > > > lazy to click anywhere. So I would like to present on the product pages
> > > the
> > > > alternatives of the e-books  without clicking.
> > > >
> > > > I assumed the best idea was to calculate the similar e-books for all
> > > > the other
> > > > (n*(n-1) similarity calculation) and present only the top 5. I planned
> > to
> > > > do it when our server is not busy. At this point I found the
> > description
> > > of
> > > > mlt as a search component which seemed to be a good candidate as it
> > > > calculates the similar documents to all the result set of the query. So
> > > if
> > > > I say q=*:* and mlt component is enabled I get similar document for my
> > > > entire document set. The only problem was with this approach that mlt
> > > > search component does not give back the interesting terms for my tag
> > > cloud
> > > > calculation.
> > > >
> > > > That's why I tried to mix the flexibility of the mlt component (multiple
> > > docs
> > > > as an input accepted) with the robustness of MoreLikeThisHandler
> > (having
> > > > interesting terms).
> > > >
> > > > If there is no solution, I will use the mlt component and solve the tag
> > > > cloud calculation another way. By the way, if I am not mistaken, the 5.3.1
> > > > version takes the union of the feature set of the mlt component, and
> > > > handler
> > > >
> > > > Best Regards,
> > > > Roland
> > > >
> > > >
> > > >
> > > > 2015-09-29 14:38 GMT+02:00 Upayavira :
> > > >
> > > > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > > > know which documents are similar to these.
> > > > >
> > > > > Why do you want to know this? What feature do you need to build that
> > > > > will use that information? Knowing this may help us to arrive at the
> > > > > right technology for you.
> > > > >
> > > > > For example, you might want to investigate offline clustering
> > > algorithms
> > > > > (e.g. [1], which might be a bit dense to follow). A good book on
> > > machine
> > > > > learning if you are okay with Python is "Programming Collective
> > > > > Intelligence" as it explains the usual algorithms with simple for
> > loops
> > > > > making it very clear.
> > > > >
> > > > > Or, you could do searches, and then cluster the results at search
> > time
> > > > > (so if you search for 100 docs, it will identify clusters within
> > those
> > > > > 100 matching documents). That might get you there. See [2]
> > > > >
> > > > > So, if you let us know what the end-goal is, perhaps we can suggest
> > an
> > > > > alternative approach, rather than burying ourselves neck-deep in MLT
> > > > > problems.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> > http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > > > [2]
> > https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > > > >
> > > > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:

Re: MoreLikeThisHandler with multiple input documents

2015-09-30 Thread Szűcs Roland
Hi Alessandro,

Exactly. The response time varies, but let's take another concrete example.
This is my call: http://localhost:8983/solr/bandwpl/mlt?q=id:10812&fl=id

This is my result:

{
  "responseHeader":{
"status":0,
"QTime":6232},
  "response":{"numFound":4564,"start":0,"docs":[
  {
"id":"11335"},
  {
"id":"14984"},
  {
"id":"13948"},
  {
"id":"11105"},
  {
"id":"12122"},
  {
"id":"12315"},
  {
"id":"19145"},
  {
"id":"11843"},
  {
"id":"11640"},
  {
"id":"19053"}]
  },
  "interestingTerms":[
"content:hinduski",1.0,
"content:hindus",1.0174515,
"content:głowa",1.0453196,
"content:życie",1.0666888,
"content:czas",1.0824177,
"content:kobieta",1.0927386,
"content:indie",1.119314,
"content:quentin",1.1349105,
"content:madras",1.239089,
"content:musieć",1.2626213,
"content:matka",1.2966589,
"content:chcieć",1.299024,
"content:domu",1.3370595,
"content:stać",1.4053295,
"content:sari",1.4284334,
"content:ojciec",1.4596463,
"content:lindsay",1.5857035,
"content:wiedzieć",1.6952671,
"content:powiedzieć",1.8430523,
"content:baba",1.8915937,
"content:mieć",2.1113522,
"content:Nata",2.4373012,
"content:Gopal",2.518996,
"content:david",3.0211911,
"content:Trixie",7.082156]}


Cheers,

Roland


2015-09-30 10:16 GMT+02:00 Alessandro Benedetti 
:

> I am still missing why you quote the number of documents...
> If you have 5600 Polish books, but you use the MLT only when you land on
> the page of a specific book...
> I think I still miss the point!
> MLT on one Polish book takes 7 secs?
>
>
> 2015-09-30 9:10 GMT+01:00 Szűcs Roland :
>
> > Hi Alessandro,
> >
> > You are right. I forgot to mention one important factor. For 3000
> > Hungarian e-books the approach you mentioned is absolutely fine, as the
> > response time is some 0.7 sec. But when I use the same MLT for 5600
> > Polish e-books the response time is 7 sec, which is definitely not
> > acceptable for the users.
> >
> > Regards,
> > Roland
> >
> > 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <
> > benedetti.ale...@gmail.com>
> > :
> >
> > > Hi Roland,
> > > you said "The main goal is that when a customer is on the product
> > > page".
> > > But if you are on a product page, I guess you have the product Id.
> > > If you have the product id, you can simply execute the MLT request
> with
> > > the single Doc Id in input.
> > >
> > > Why do you need to calculate beforehand?
> > >
> > > Cheers
> > >
> > > 2015-09-29 15:44 GMT+01:00 Szűcs Roland :
> > >
> > > > Hello Upayavira,
> > > >
> > > > The main goal is that when a customer is on the pruduct page on an
> > e-book
> > > > and he does not like it somehow I want to immediately offer her/him
> > > > alternative e-books in the same topic. If I expect from the customer
> to
> > > > click on a button like "similar e-books" I lose half of them as they
> > are
> > > > lazy to click anywhere. So I would like to present on the product
> pages
> > > the
> > > > alternatives of the e-books  without clicking.
> > > >
> > > > I assumed the best idea to claculate the similar e-books for all the
> > > other
> > > > (n*(n-1) similarity calculation) and present only the top 5. I
> planned
> > to
> > > > do it when our server is not busy. In this point I found the
> > description
> > > of
> > > > mlt as a search component which seemed to be a good candidate as it
> > > > calculates the similar documents to all the result set of the query.
> So
> > > if
> > > > I say q=*:* and mlt component is enabled I get similar document for
> my
> > > > entire document set. The only problem was with this approach that mlt
> > > > search component does not give back the interesting terms for my tag
> > > cloud
> > > > calculation.
> > > >
> > > > That's why I tried to mix the flexibility of mlt compoonent (multiple
> > > docs
> > > > as an input accepted) with the robustness of MoreLikeThisHandler
> > (having
> > > > interesting terms).
> > > >
> > > > If there is no solution, I will use the mlt component and solve the
> tag
> > > > cloud calculation other way. By the way if I am not mistaken, the
> 5.3.1
> > > > version takes the union of the feature set of the mlt component, and
> > > > handler
> > > >
> > > > Best Regards,
> > > > Roland
> > > >
> > > >
> > > >
> > > > 2015-09-29 14:38 GMT+02:00 Upayavira :
> > > >
> > > > > Let's take a step back. So, you have 3000 or so docs, and you want
> to
> > > > > know which documents are similar to these.
> > > > >
> > > > > Why do you want to know this? What feature do you need to build
> that
> > > > > will use that information? Knowing this may help us to arrive at
> the
> > > > > right technology for you.
> > > > >
> > > > > For example, you might want to investigate offline clustering
> > > > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > > > book on machine learning, if you are okay with Python, is
> > > > > "Programming Collective Intelligence".

Re: Can StandardTokenizerFactory work well for Chinese and English (Bilingual)?

2015-09-30 Thread Charlie Hull

On 30/09/2015 04:09, Zheng Lin Edwin Yeo wrote:

Hi Charlie,


Hi,


I've checked that Paoding's code is written for Solr 3 and Solr 4 versions.
It is not written for Solr 5, thus I was unable to use it in my Solr 5.x
version.


I'm pretty sure we had to recompile it for v4.6 as well... it has been a
little painful.


Have you tried to use HMMChineseTokenizer and JiebaTokenizer as well?


I don't think so.

Charlie


Regards,
Edwin


On 25 September 2015 at 18:46, Charlie Hull  wrote:


On 25/09/2015 11:43, Zheng Lin Edwin Yeo wrote:


Hi Charlie,

Thanks for your comment. I faced the compatibility issues with Paoding
when
I tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code was
optimised for Solr 3.6.

Which version of Solr are you using when you tried on the Paoding?



Solr v4.6 I believe.

Charlie



Regards,
Edwin


On 25 September 2015 at 16:43, Charlie Hull  wrote:

On 23/09/2015 16:23, Alexandre Rafalovitch wrote:


You may find the following articles interesting:



http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
( a whole epic journey)
https://dzone.com/articles/indexing-chinese-solr



The latter article is great and we drew on it when helping a recent
client
with Chinese indexing. However, if you do use Paoding bear in mind that
it
has few if any tests and all the comments are in Chinese. We found a
problem with it recently (it breaks the Lucene highlighters) and have
submitted a patch:
http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1

Cheers

Charlie


Regards,

  Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo <
edwinye...@gmail.com>
wrote:

Hi,


Would like to check: will StandardTokenizerFactory work well for indexing
both English and Chinese (bilingual) documents, or do we need tokenizers
that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?


Regards,
Edwin





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: MoreLikeThisHandler with multiple input documents

2015-09-30 Thread Alessandro Benedetti
I am still missing why you quote the number of documents...
If you have 5600 Polish books, but you use the MLT only when you land on
the page of a specific book...
I think I still miss the point!
MLT on one Polish book takes 7 secs?


2015-09-30 9:10 GMT+01:00 Szűcs Roland :

> Hi Alessandro,
>
> You are right. I forgot to mention one important factor. For 3000
> Hungarian e-books the approach you mentioned is absolutely fine, as the
> response time is some 0.7 sec. But when I use the same MLT for 5600
> Polish e-books the response time is 7 sec, which is definitely not
> acceptable for the users.
>
> Regards,
> Roland
>
> 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <
> benedetti.ale...@gmail.com>
> :
>
> > Hi Roland,
> > you said "The main goal is that when a customer is on the product page".
> > But if you are on a product page, I guess you have the product Id.
> > If you have the product id, you can simply execute the MLT request with
> > the single Doc Id in input.
> >
> > Why do you need to calculate beforehand?
> >
> > Cheers
> >
> > 2015-09-29 15:44 GMT+01:00 Szűcs Roland :
> >
> > > Hello Upayavira,
> > >
> > > The main goal is that when a customer is on the product page on an
> > > e-book
> > > and he does not like it somehow I want to immediately offer her/him
> > > alternative e-books in the same topic. If I expect from the customer to
> > > click on a button like "similar e-books" I lose half of them as they
> are
> > > lazy to click anywhere. So I would like to present on the product pages
> > the
> > > alternatives of the e-books  without clicking.
> > >
> > > I assumed the best idea was to calculate the similar e-books for all
> > > the other
> > > (n*(n-1) similarity calculation) and present only the top 5. I planned
> to
> > > do it when our server is not busy. At this point I found the
> description
> > of
> > > mlt as a search component which seemed to be a good candidate as it
> > > calculates the similar documents to all the result set of the query. So
> > if
> > > I say q=*:* and mlt component is enabled I get similar document for my
> > > entire document set. The only problem was with this approach that mlt
> > > search component does not give back the interesting terms for my tag
> > cloud
> > > calculation.
> > >
> > > That's why I tried to mix the flexibility of the mlt component (multiple
> > docs
> > > as an input accepted) with the robustness of MoreLikeThisHandler
> (having
> > > interesting terms).
> > >
> > > If there is no solution, I will use the mlt component and solve the tag
> > > cloud calculation another way. By the way, if I am not mistaken, the 5.3.1
> > > version takes the union of the feature set of the mlt component, and
> > > handler
> > >
> > > Best Regards,
> > > Roland
> > >
> > >
> > >
> > > 2015-09-29 14:38 GMT+02:00 Upayavira :
> > >
> > > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > > know which documents are similar to these.
> > > >
> > > > Why do you want to know this? What feature do you need to build that
> > > > will use that information? Knowing this may help us to arrive at the
> > > > right technology for you.
> > > >
> > > > For example, you might want to investigate offline clustering
> > algorithms
> > > > (e.g. [1], which might be a bit dense to follow). A good book on
> > machine
> > > > learning if you are okay with Python is "Programming Collective
> > > > Intelligence" as it explains the usual algorithms with simple for
> loops
> > > > making it very clear.
> > > >
> > > > Or, you could do searches, and then cluster the results at search
> time
> > > > (so if you search for 100 docs, it will identify clusters within
> those
> > > > 100 matching documents). That might get you there. See [2]
> > > >
> > > > So, if you let us know what the end-goal is, perhaps we can suggest
> an
> > > > alternative approach, rather than burying ourselves neck-deep in MLT
> > > > problems.
> > > >
> > > > Upayavira
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > > [2]
> https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > > >
> > > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > > Hello Upayavira,
> > > > >
> > > > > Thanks for dealing with my issue. I have already applied
> > > > > termVectors=true to all fields involved in the more-like-this
> > > > > calculation. I have just 3 000 documents, each of them represented
> > > > > by a relatively big term vector with more than 20 000 unique terms.
> > > > > If I run the MoreLikeThis handler for a Solr doc it takes close to
> > > > > 1 sec to get back the first 10 similar documents. After this I have
> > > > > to pass the doc ids to my other application, which finds the cover
> > > > > of the e-book and other metadata and puts it on the web. The
> > > > > end-to-end process takes too much time from the customer
> > > > > perspective; that is why I tried to find a solution for offline
> > > > > more-like-this calculation.

Re: MoreLikeThisHandler with multiple input documents

2015-09-30 Thread Szűcs Roland
Hi Alessandro,

You are right. I forgot to mention one important factor. For 3000 Hungarian
e-books the approach you mentioned is absolutely fine, as the response time
is some 0.7 sec. But when I use the same MLT for 5600 Polish e-books the
response time is 7 sec, which is definitely not acceptable for the users.

Regards,
Roland

2015-09-29 17:19 GMT+02:00 Alessandro Benedetti 
:

> Hi Roland,
> you said "The main goal is that when a customer is on the product page".
> But if you are on a product page, I guess you have the product Id.
> If you have the product id, you can simply execute the MLT request with
> the single Doc Id in input.
>
> Why do you need to calculate beforehand?
>
> Cheers
>
> 2015-09-29 15:44 GMT+01:00 Szűcs Roland :
>
> > Hello Upayavira,
> >
> > The main goal is that when a customer is on the product page on an e-book
> > and he does not like it somehow I want to immediately offer her/him
> > alternative e-books in the same topic. If I expect from the customer to
> > click on a button like "similar e-books" I lose half of them as they are
> > lazy to click anywhere. So I would like to present on the product pages
> the
> > alternatives of the e-books  without clicking.
> >
> > I assumed the best idea was to calculate the similar e-books for all the
> other
> > (n*(n-1) similarity calculation) and present only the top 5. I planned to
> > do it when our server is not busy. At this point I found the description
> of
> > mlt as a search component which seemed to be a good candidate as it
> > calculates the similar documents to all the result set of the query. So
> if
> > I say q=*:* and mlt component is enabled I get similar document for my
> > entire document set. The only problem was with this approach that mlt
> > search component does not give back the interesting terms for my tag
> cloud
> > calculation.
> >
> > That's why I tried to mix the flexibility of the mlt component (multiple
> docs
> > as an input accepted) with the robustness of MoreLikeThisHandler (having
> > interesting terms).
> >
> > If there is no solution, I will use the mlt component and solve the tag
> > cloud calculation another way. By the way, if I am not mistaken, the 5.3.1
> > version takes the union of the feature set of the mlt component, and
> > handler
> >
> > Best Regards,
> > Roland
> >
> >
> >
> > 2015-09-29 14:38 GMT+02:00 Upayavira :
> >
> > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > know which documents are similar to these.
> > >
> > > Why do you want to know this? What feature do you need to build that
> > > will use that information? Knowing this may help us to arrive at the
> > > right technology for you.
> > >
> > > For example, you might want to investigate offline clustering
> algorithms
> > > (e.g. [1], which might be a bit dense to follow). A good book on
> machine
> > > learning if you are okay with Python is "Programming Collective
> > > Intelligence" as it explains the usual algorithms with simple for loops
> > > making it very clear.
> > >
> > > Or, you could do searches, and then cluster the results at search time
> > > (so if you search for 100 docs, it will identify clusters within those
> > > 100 matching documents). That might get you there. See [2]
> > >
> > > So, if you let us know what the end-goal is, perhaps we can suggest an
> > > alternative approach, rather than burying ourselves neck-deep in MLT
> > > problems.
> > >
> > > Upayavira
> > >
> > > [1]
> > >
> > >
> >
> http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > >
> > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > Hello Upayavira,
> > > >
> > > > Thanks for dealing with my issue. I have already applied
> > > > termVectors=true to all fields involved in the more-like-this
> > > > calculation. I have just 3 000 documents, each of them represented by
> > > > a relatively big term vector with more than 20 000 unique terms. If I
> > > > run the MoreLikeThis handler for a Solr doc it takes close to 1 sec
> > > > to get back the first 10 similar documents. After this I have to pass
> > > > the doc ids to my other application, which finds the cover of the
> > > > e-book and other metadata and puts it on the web. The end-to-end
> > > > process takes too much time from the customer perspective; that is
> > > > why I tried to find a solution for offline more-like-this
> > > > calculation. But if my app has to call the MoreLikeThisHandler for
> > > > each doc, it adds overhead to the offline calculation.
> > > >
> > > > Best Regards,
> > > > Roland
> > > >
> > > > 2015-09-29 13:01 GMT+02:00 Upayavira :
> > > >
> > > > > If MoreLikeThis is slow for large documents that are indexed, have
> > you
> > > > > enabled term vectors on the similarity fields?
> > > > >
> > > > > Basically, what more like this does is this:
> > > > >
> > > > > * decide o

Re: Solr 4.8 - Updating zkhost list in solr.xml without requiring a restart

2015-09-30 Thread Upayavira
Why don't you create DNS names, or such, so that you can replace a
zookeeper instance at the same hostname:port rather than having to edit
solr.xml across your whole Solr farm?

The idea is that your list of zookeeper hostnames is a virtual one, not
a real one.
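
For example, solr.xml on every node could carry only stable aliases, so that
replacing a ZooKeeper machine becomes a DNS change rather than a config edit
(hostnames below are illustrative, and other solrcloud settings are omitted):

    <solr>
      <solrcloud>
        <!-- repoint zk1/zk2/zk3 in DNS instead of editing this file -->
        <str name="zkHost">zk1.internal:2181,zk2.internal:2181,zk3.internal:2181</str>
      </solrcloud>
    </solr>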

Upayavira

On Wed, Sep 30, 2015, at 04:40 AM, pramodmm wrote:
> 
> > Before we even think about upgrading the zookeeper functionality in
> > Solr, we must wait for the official 3.5 release from the zookeeper
> > project.  Alpha (or Beta) software will not be included in Solr unless
> > it is the only way to fix a very serious bug.  This is a new feature,
> > not a bug.
> 
> In the meantime, please help me validate that what we are doing is right.
> Currently, our zookeeper instances are running on VMware machines; when
> one of them dies and we get a new machine as a replacement, we install
> zookeeper and make it part of the ensemble. Then we manually go to every
> individual Solr instance in the SolrCloud cluster, edit its solr.xml,
> remove the entry for the dead machine from zkHost and replace it with the
> new hostname, thus keeping the list up-to-date. Then we restart the Solr
> box.
> 
> Are these the right steps ?
> 
> Thanks,
> Pramod
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-4-8-Updating-zkhost-list-in-solr-xml-without-requiring-a-restart-tp4231979p4231994.html
> Sent from the Solr - User mailing list archive at Nabble.com.