Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Yes, we do edismax per field boosting, with explicit boosting of the title 
field. So it sure makes length normalization less relevant. But not 
*completely* irrelevant, which is why I still want to have it as part of the 
scoring, just with much less impact than it currently has.
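
(For illustration, the query looks something like the sketch below; the field 
names and weights are made up, not our actual config:

  q=john&defType=edismax&qf=title^4 summary^2 body&tie=0.1

With weights like these, the explicit per-field boost dominates the score, and 
the fieldNorm only fine-tunes the order within a field.)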

/Jimi

From: Jack Krupansky 
Sent: Thursday, April 21, 2016 4:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Or should this be rated higher about NY, since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.


-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood 
wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner <
> http://dvd.netflix.com/Search?v1=blade+runner>
>
> At Netflix (when I was there), those were shown in popularity order with a
> boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 20, 2016, at 5:28 PM, Jack Krupansky 
> wrote:
> >
> > Maybe it's a cultural difference, but I can't imagine why on a query for
> > "John", any of those titles would be treated as anything other than
> equals
> > - namely, that they are all about John. Maybe the issue is that this
> seems
> > like a contrived example, and I'm asking for a realistic example. Or,
> maybe
> > you have some rule of relevance that you haven't yet shared - and I mean
> > rule that a user would comprehend and consider valuable, not simply a
> > mechanical rule.
> >
> >
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 8:10 PM, 
> > wrote:
> >
> >> Ok sure, I can try and give some examples :)
> >>
> >> Let's say that we have the following documents:
> >>
> >> Id: 1
> >> Title: John Doe
> >>
> >> Id: 2
> >> Title: John Doe Jr.
> >>
> >> Id: 3
> >> Title: John Lennon: The Life
> >>
> >> Id: 4
> >> Title: John Thompson's Modern Course for the Piano: First Grade Book
> >>
> >> Id: 5
> >> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> >> Youngest Member of Jackson's Staff from John Brown's Raid to the
> Hanging of
> >> Mrs. Surratt
> >>
> >>
> >> And in general, when a search word matches the title, I would like to
> have
> >> the length of the title field influence the score, so that matching
> >> documents with a shorter title get a higher score than documents with a
> longer
> >> title, all else being equal.
> >>
> >> So, when a user searches for "John", I would like the results to be
> pretty
> >> much in the order presented above. Though, it is not crucial that for
> >> example document 1 comes before document 2. But I would surely want
> >> documents 1-3 to come before documents 4 and 5.
> >>
> >> In my mind, the fieldNorm is a perfect solution for this. At least in
> >> theory. In practice, the encoding of the fieldNorm seems to make this
> >> function much less useful for this use case. Unless I have missed
> something.
> >>
> >> Is there another way to achieve something like this? Note that I don't
> want
> >> a general boost on documents with short titles, I only want to boost
> them
> >> if the title field actually matched the query.
> >>
> >> /Jimi
> >>
> >> 
> >> From: Jack Krupansky 
> >> Sent: Thursday, April 21, 2016 1:28 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Is it possible to configure a minimum field length for the
> >> fieldNorm value?
> >>
> >> I'm not sure I fully follow what distinction you're trying to focus on.
> I
> >> mean, traditionally length normalization has simply tried to
> distinguish a
> >> title field (rarely more than a dozen words) from a full body of text,
> or
> >> maybe an abstract, not things like exactly how many words were in a
> title.
> >> Or, as another example, a short newswire article of a few paragraphs
> vs. a
> >> feature-length article, paper, or even book. IOW, traditionally it was
> more
> >> of a boolean than a broad range of values. Sure, yes, you absolutely can
> >> define a custom similarity with a custom norm that supports a wide
> range of
> >> lengths, but you'll have to decide what you really want  to achieve to
> tune
> >> it.
> >>
> >> Maybe you could give a couple examples of field values that you feel
> should

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Yes, the example was contrived. Partly because our documents are mostly in 
Swedish, but mostly because I thought that the example should be simple 
enough that it focused on the thing discussed (even though I simplified it to 
such a degree that I left out the current main problem with the fieldNorm, the 
fact that the values are too coarse when encoded). And we do have titles with 
lengths varying from 2 words to about 30 words.

For me it makes perfect sense to have the shorter titles come up first in this 
example. It is basically the tf–idf principle. It is more likely that the 
document titled "John Doe" focuses on "John" than it is for the document titled 
"I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest 
Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. 
Surratt".

Now, having said that, I never said that the title length should have a *big* 
impact on the score. In fact, this is the main problem I'm trying to solve. I 
want the impact to be very, very small. Basically I want this factor to only 
*nudge* the document score. I want it to work in such a way that, if one 
first considered the score without this factor, only when two documents 
have scores quite close to each other should this factor have any real effect 
on the resulting order in the search results. That could be achieved if the 
fieldNorm only changed, for example, from 0.79 to 0.74, like the resulting 
values from SweetSpotSimilarity for two example documents I tested. But when 
these values are encoded and decoded, they become 0.75 and 0.625, causing 
a much bigger impact on the final score.
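
To show where the coarseness comes from, here is a tiny sketch (my assumption: 
Lucene 4.x/5.x, where the default similarity stores each norm in a single byte 
via SmallFloat.floatToByte315; the two inputs are just the values from my test):

  import org.apache.lucene.util.SmallFloat;

  public class NormPrecision {
      public static void main(String[] args) {
          // One byte = 3 mantissa bits + 5 exponent bits, so nearby norms
          // collapse onto coarse steps when decoded at search time.
          for (float norm : new float[] {0.79f, 0.74f}) {
              byte encoded = SmallFloat.floatToByte315(norm);
              float decoded = SmallFloat.byte315ToFloat(encoded);
              System.out.println(norm + " is stored and decoded as " + decoded);
          }
      }
  }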

/Jimi

From: Jack Krupansky 
Sent: Thursday, April 21, 2016 2:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
you have some rule of relevance that you haven't yet shared - and I mean
rule that a user would comprehend and consider valuable, not simply a
mechanical rule.



-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:10 PM, 
wrote:

> Ok sure, I can try and give some examples :)
>
> Let's say that we have the following documents:
>
> Id: 1
> Title: John Doe
>
> Id: 2
> Title: John Doe Jr.
>
> Id: 3
> Title: John Lennon: The Life
>
> Id: 4
> Title: John Thompson's Modern Course for the Piano: First Grade Book
>
> Id: 5
> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
> Mrs. Surratt
>
>
> And in general, when a search word matches the title, I would like to have
> the length of the title field influence the score, so that matching
> documents with a shorter title get a higher score than documents with a longer
> title, all else being equal.
>
> So, when a user searches for "John", I would like the results to be pretty
> much in the order presented above. Though, it is not crucial that for
> example document 1 comes before document 2. But I would surely want
> documents 1-3 to come before documents 4 and 5.
>
> In my mind, the fieldNorm is a perfect solution for this. At least in
> theory. In practice, the encoding of the fieldNorm seems to make this
> function much less useful for this use case. Unless I have missed something.
>
> Is there another way to achieve something like this? Note that I don't want
> a general boost on documents with short titles, I only want to boost them
> if the title field actually matched the query.
>
> /Jimi
>
> 
> From: Jack Krupansky 
> Sent: Thursday, April 21, 2016 1:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> I'm not sure I fully follow what distinction you're trying to focus on. I
> mean, traditionally length normalization has simply tried to distinguish a
> title field (rarely more than a dozen words) from a full body of text, or
> maybe an abstract, not things like exactly how many words were in a title.
> Or, as another example, a short newswire article of a few paragraphs vs. a
> feature-length article, paper, or even book. IOW, traditionally it was more
> of a boolean than a broad range of values. Sure, yes, you absolutely can
> define a custom similarity with a custom norm that supports a wide range of
> lengths, but you'll have to decide what you really want  to achieve to tune
> it.
>
> Maybe you could give a couple examples of field values that you feel should
> be scored differently based on 

pivoting with json facet api

2016-04-20 Thread Yangrui Guo
Hi

I am trying to facet results on my nested documents. The Solr documentation
does not say much about how to pivot with the JSON API on nested documents.
Could someone show me some examples? Thanks very much.
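
For reference, this is roughly the kind of nesting I mean, as a sketch with 
made-up collection and field names (a nested terms facet being, as far as I 
understand, the JSON Facet analogue of a pivot):

  curl http://localhost:8983/solr/mycoll/query -d 'q=*:*&rows=0&json.facet={
    genres: {
      type: terms,
      field: genre_s,
      facet: {
        authors: { type: terms, field: author_s }
      }
    }
  }'

What I can't figure out is how to make this work across parent and child 
documents.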

Yangrui


Re: Traversal of documents through network

2016-04-20 Thread vidya
OK, I understand that. So you would say documents traverse through the network.
If I specify some 100 docs to be displayed on my first page, will it affect
performance? While the docs get transferred, will there be any high-volume
traffic that affects the performance of the application?


And what's the time Solr takes to index 1GB of data, in general?
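
For example, a first page like this is what I mean (field names made up):

  q=*:*&start=0&rows=100&fl=id,title

I assume only the stored fields listed in fl are serialized and sent over the 
network.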


Thanks





Re: Overall large size in Solr across collections

2016-04-20 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Yes, I'm using the Extracting Request Handler.

The 0.7GB/hr is the indexing rate measured by the size of the original
documents that get ingested into Solr. This means that for every hour,
only 0.7GB of my documents gets ingested into Solr. It will require 10
hours just to index documents that total 7GB in size.

Regards,
Edwin


On 21 April 2016 at 11:40, Shawn Heisey  wrote:

> On 4/20/2016 8:10 PM, Zheng Lin Edwin Yeo wrote:
> > I'm currently running 4 threads concurrently to run the indexing, which
> > means I run the script in command prompt in 4 different command windows.
> > The ID has been configured in such a way that it will not overwrite each
> > other during the indexing. Is that considered multi-threading?
> >
> > The rates are all below 0.2GB/hr for each individual thread, and the overall
> > rate is just 0.7GB/hr.
>
> Was I right to think you're using the Extracting Request Handler?
>
> If you have enough CPU resources on the Solr server, you could start
> even more copies of the program -- effectively, more threads.
>
> What are you measuring at 0.7GB/hr?  The size of the rich text documents
> you are ingesting?  The size of the text extracted from the documents?
> The size of the index directory in Solr?
>
> Using the dataimport handler importing from MySQL, I can simultaneously
> build six separate 60GB indexes in about 18 hours, on two servers.  Each
> of those indexes has more than 50 million documents.  These are not rich
> text documents, though.  DIH is single-threaded, so each of those
> indexes is only being built with one thread.  Saying the important thing
> again:  These are NOT rich text documents.
>
> If you're using ERH, which runs Tika, I can tell you that Tika is quite
> the resource hog.  It is likely chewing up CPU and memory resources at
> an incredible rate, slowing down your Solr server.  You would probably
> see better performance than ERH if you incorporate Tika and SolrJ into a
> client indexing program that runs on a different machine than Solr.
>
> Thanks,
> Shawn
>
>


complete cluster shutdown

2016-04-20 Thread Zap Org
I have 5 ZooKeeper and 2 Solr machines, and after a month or two the whole
cluster shuts down; I don't know why. The logs I get in ZooKeeper are attached
below; otherwise I don't get any errors. All this runs on Linux VMs.

2016-03-11 16:50:18,159 [myid:5] - WARN  [SyncThread:5:FileTxnLog@334] -
fsync-ing the write ahead log in SyncThread:5 took 7268ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-03-11 16:50:18,161 [myid:5] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2185:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x4535f00ee370001, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2016-03-11 16:50:18,163 [myid:5] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2185:NIOServerCnxn@1007] - Closed socket connection for
client /localhost which had sessionid 0x4535f00ee370001
2016-03-11 16:50:18,166 [myid:5] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2185:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x2535ef744dd0005, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
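
(For what it's worth, a common first step for fsync warnings like the one 
above is to give the ZooKeeper write-ahead log its own fast disk via 
dataLogDir in zoo.cfg; the paths below are examples, not this cluster's 
actual config:

  dataDir=/var/lib/zookeeper/data
  # transaction log on a dedicated disk, so slow fsyncs don't stall the quorum
  dataLogDir=/var/lib/zookeeper/datalog
)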


Re: Overall large size in Solr across collections

2016-04-20 Thread Shawn Heisey
On 4/20/2016 8:10 PM, Zheng Lin Edwin Yeo wrote:
> I'm currently running 4 threads concurrently to run the indexing, which
> means I run the script in command prompt in 4 different command windows.
> The ID has been configured in such a way that it will not overwrite each
> other during the indexing. Is that considered multi-threading?
>
> The rates are all below 0.2GB/hr for each individual thread, and the overall
> rate is just 0.7GB/hr.

Was I right to think you're using the Extracting Request Handler?

If you have enough CPU resources on the Solr server, you could start
even more copies of the program -- effectively, more threads.

What are you measuring at 0.7GB/hr?  The size of the rich text documents
you are ingesting?  The size of the text extracted from the documents? 
The size of the index directory in Solr?

Using the dataimport handler importing from MySQL, I can simultaneously
build six separate 60GB indexes in about 18 hours, on two servers.  Each
of those indexes has more than 50 million documents.  These are not rich
text documents, though.  DIH is single-threaded, so each of those
indexes is only being built with one thread.  Saying the important thing
again:  These are NOT rich text documents.

If you're using ERH, which runs Tika, I can tell you that Tika is quite
the resource hog.  It is likely chewing up CPU and memory resources at
an incredible rate, slowing down your Solr server.  You would probably
see better performance than ERH if you incorporate Tika and SolrJ into a
client indexing program that runs on a different machine than Solr.
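
As a rough, untested sketch (assuming SolrJ 5.x and Tika on the classpath; 
the URL, field names, and error handling are simplified), such a client 
program might look like this:

  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class ClientSideIndexer {
      public static void main(String[] args) throws Exception {
          // Several internal threads queue updates; Tika runs here, not
          // inside Solr, so a Tika crash can't destabilize the server.
          try (ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient(
                  "http://localhost:8983/solr/collection1", 10, 4)) {
              for (String path : args) {
                  BodyContentHandler text = new BodyContentHandler(-1); // no size limit
                  Metadata meta = new Metadata();
                  try (InputStream in = Files.newInputStream(Paths.get(path))) {
                      new AutoDetectParser().parse(in, text, meta); // client-side extraction
                  }
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", path);
                  doc.addField("title", meta.get("title"));
                  doc.addField("content", text.toString());
                  solr.add(doc);
              }
              solr.commit();
          }
      }
  }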

Thanks,
Shawn



Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
Or should this be rated higher about NY, since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.


-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood 
wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner <
> http://dvd.netflix.com/Search?v1=blade+runner>
>
> At Netflix (when I was there), those were shown in popularity order with a
> boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 20, 2016, at 5:28 PM, Jack Krupansky 
> wrote:
> >
> > Maybe it's a cultural difference, but I can't imagine why on a query for
> > "John", any of those titles would be treated as anything other than
> equals
> > - namely, that they are all about John. Maybe the issue is that this
> seems
> > like a contrived example, and I'm asking for a realistic example. Or,
> maybe
> > you have some rule of relevance that you haven't yet shared - and I mean
> > rule that a user would comprehend and consider valuable, not simply a
> > mechanical rule.
> >
> >
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 8:10 PM, 
> > wrote:
> >
> >> Ok sure, I can try and give some examples :)
> >>
> >> Let's say that we have the following documents:
> >>
> >> Id: 1
> >> Title: John Doe
> >>
> >> Id: 2
> >> Title: John Doe Jr.
> >>
> >> Id: 3
> >> Title: John Lennon: The Life
> >>
> >> Id: 4
> >> Title: John Thompson's Modern Course for the Piano: First Grade Book
> >>
> >> Id: 5
> >> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> >> Youngest Member of Jackson's Staff from John Brown's Raid to the
> Hanging of
> >> Mrs. Surratt
> >>
> >>
> >> And in general, when a search word matches the title, I would like to
> have
> >> the length of the title field influence the score, so that matching
> >> documents with a shorter title get a higher score than documents with a
> longer
> >> title, all else being equal.
> >>
> >> So, when a user searches for "John", I would like the results to be
> pretty
> >> much in the order presented above. Though, it is not crucial that for
> >> example document 1 comes before document 2. But I would surely want
> >> documents 1-3 to come before documents 4 and 5.
> >>
> >> In my mind, the fieldNorm is a perfect solution for this. At least in
> >> theory. In practice, the encoding of the fieldNorm seems to make this
> >> function much less useful for this use case. Unless I have missed
> something.
> >>
> >> Is there another way to achieve something like this? Note that I don't
> want
> >> a general boost on documents with short titles, I only want to boost
> them
> >> if the title field actually matched the query.
> >>
> >> /Jimi
> >>
> >> 
> >> From: Jack Krupansky 
> >> Sent: Thursday, April 21, 2016 1:28 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Is it possible to configure a minimum field length for the
> >> fieldNorm value?
> >>
> >> I'm not sure I fully follow what distinction you're trying to focus on.
> I
> >> mean, traditionally length normalization has simply tried to
> distinguish a
> >> title field (rarely more than a dozen words) from a full body of text,
> or
> >> maybe an abstract, not things like exactly how many words were in a
> title.
> >> Or, as another example, a short newswire article of a few paragraphs
> vs. a
> >> feature-length article, paper, or even book. IOW, traditionally it was
> more
> >> of a boolean than a broad range of values. Sure, yes, you absolutely can
> >> define a custom similarity with a custom norm that supports a wide
> range of
> >> lengths, but you'll have to decide what you really want  to achieve to
> tune
> >> it.
> >>
> >> Maybe you could give a couple examples of field values that you feel
> should
> >> be scored differently based on length.
> >>
> >> -- Jack Krupansky
> >>
> >> On Wed, Apr 20, 2016 at 7:17 PM, 
> >> wrote:
> >>
> >>> I am talking about the title field. And for the title field, a
> sweetspot
> >>> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> >>> value that differentiates between for example 2, 3, 4 and 5 terms in
> the
> >>> title, but only very little.
> >>>
> >>> The 20% number I got by simply calculating the difference in the title
> >>> 

Re: Overall large size in Solr across collections

2016-04-20 Thread Zheng Lin Edwin Yeo
Hi Shawn,

I'm currently running 4 threads concurrently to run the indexing, which
means I run the script in command prompt in 4 different command windows.
The ID has been configured in such a way that it will not overwrite each
other during the indexing. Is that considered multi-threading?

The rates are all below 0.2GB/hr for each individual thread, and the overall
rate is just 0.7GB/hr.

Regards,
Edwin


On 20 April 2016 at 21:43, Shawn Heisey  wrote:

> On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> > Thanks for the information Shawn.
> >
> > I believe it could be due to the types of files being indexed.
> > Currently, I'm indexing EML files which are in HTML format, and they
> > are richer in content (with inline images and full text), while
> > previously the EML files were in Plain Text format, with the images as
> > attachments.
> >
> > Will this be the cause of the slow indexing speed which I'm facing now?
> It
> > is more than 3 times slower than what I had previously.
>
> I assume that you are using the Extracting Request Handler for this.  I
> know almost nothing about Tika, but I would imagine that extracting data
> from rich text documents is not a fast process, and that plain text
> documents would be a lot faster.  I could be wrong -- I've never used
> the ERH myself.
>
> If you want a setup like this to go faster, you probably need to make
> your indexing process multi-threaded.  Ideally, such an application
> would be written in Java and would incorporate Tika into the client-side
> code.  Tika can be very unstable, so running it inside Solr (the
> Extracting Request Handler) can make Solr itself unstable.
>
> Thanks,
> Shawn
>
>


Re: Storing different collection on different hard disk

2016-04-20 Thread Zheng Lin Edwin Yeo
Thanks for your reply.

I have managed to solve the problem. The reason is that we have to use "/"
instead of "\", even on Windows, and to include the data folder in the path as
well.

This is the working one:
dataDir=D:/collection1/data
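
For anyone else hitting this, the complete core.properties looks roughly like 
the sketch below (the core name is just an example):

  name=collection1
  dataDir=D:/collection1/data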

Regards,
Edwin


On 20 April 2016 at 21:39, Bram Van Dam  wrote:

> Have you considered simply mounting different disks under different
> paths? It looks like you're using Windows, so I'm not sure if that's
> possible, but it seems like a relatively basic task, so who knows.
>
> You could mount Disk 1 as /path/to/collection1 and Disk 2 as
> /path/to/collection2. That way you won't need to change your Solr
> configuration at all.
>
>  - Bram
>
> On 20/04/16 06:04, Zheng Lin Edwin Yeo wrote:
> > Thanks for your info.
> >
> > I tried to set, but Solr is not able to find the indexes, and I get the
> > following error:
> >
> >- *collection1:*
> >
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> >java.io.IOException: The filename, directory name, or volume label
> syntax
> >is incorrect
> >
> >
> > Is this the correct way to set in core.properties file?
> > dataDir="D:\collection1"
> >
> > Also, do we need to set the dataDir in solrconfig.xml as well?
> >
> > Regards,
> > Edwin
> >
> >
> > On 19 April 2016 at 19:36, Alexandre Rafalovitch 
> wrote:
> >
> >> Have you tried setting dataDir parameter in the core.properties file?
> >>
> https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Newsletter and resources for Solr beginners and intermediates:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 19 April 2016 at 20:43, Zheng Lin Edwin Yeo 
> >> wrote:
> >>> Hi,
> >>>
> >>> I would like to find out is it possible to store the indexes file of
> >>> different collections in different hard disk?
> >>> Like for example, I want to store the indexes of collection1 in Hard
> Disk
> >>> 1, and the indexes of collection2 in Hard Disk 2.
> >>>
> >>> I am using Solr 5.4.0
> >>>
> >>> Regards,
> >>> Edwin
> >>
> >
>
>


Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Walter Underwood
Sure, here are some real world examples from my time at Netflix.

Is this movie twice as much about “new york”?

* New York, New York

Which one of these is the best match for “blade runner”:

* Blade Runner: The Final Cut
* Blade Runner: Theatrical & Director’s Cut
* Blade Runner: Workprint

http://dvd.netflix.com/Search?v1=blade+runner 


At Netflix (when I was there), those were shown in popularity order with a 
boost function.
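
Roughly like the sketch below, though the field name and function are made up 
here, not the actual Netflix configuration:

  q=blade runner&defType=edismax&qf=title&boost=log(sum(popularity,1))

The log() damps the popularity signal, so it mostly reorders near-ties instead 
of drowning out the text score.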

And for stemming, should the movie “Saw” match “see”? Maybe not.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 20, 2016, at 5:28 PM, Jack Krupansky  wrote:
> 
> Maybe it's a cultural difference, but I can't imagine why on a query for
> "John", any of those titles would be treated as anything other than equals
> - namely, that they are all about John. Maybe the issue is that this seems
> like a contrived example, and I'm asking for a realistic example. Or, maybe
> you have some rule of relevance that you haven't yet shared - and I mean
> rule that a user would comprehend and consider valuable, not simply a
> mechanical rule.
> 
> 
> 
> -- Jack Krupansky
> 
> On Wed, Apr 20, 2016 at 8:10 PM, 
> wrote:
> 
>> Ok sure, I can try and give some examples :)
>> 
>> Let's say that we have the following documents:
>> 
>> Id: 1
>> Title: John Doe
>> 
>> Id: 2
>> Title: John Doe Jr.
>> 
>> Id: 3
>> Title: John Lennon: The Life
>> 
>> Id: 4
>> Title: John Thompson's Modern Course for the Piano: First Grade Book
>> 
>> Id: 5
>> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
>> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
>> Mrs. Surratt
>> 
>> 
>> And in general, when a search word matches the title, I would like to have
>> the length of the title field influence the score, so that matching
>> documents with a shorter title get a higher score than documents with a longer
>> title, all else being equal.
>> 
>> So, when a user searches for "John", I would like the results to be pretty
>> much in the order presented above. Though, it is not crucial that for
>> example document 1 comes before document 2. But I would surely want
>> documents 1-3 to come before documents 4 and 5.
>> 
>> In my mind, the fieldNorm is a perfect solution for this. At least in
>> theory. In practice, the encoding of the fieldNorm seems to make this
>> function much less useful for this use case. Unless I have missed something.
>> 
>> Is there another way to achieve something like this? Note that I don't want
>> a general boost on documents with short titles, I only want to boost them
>> if the title field actually matched the query.
>> 
>> /Jimi
>> 
>> 
>> From: Jack Krupansky 
>> Sent: Thursday, April 21, 2016 1:28 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Is it possible to configure a minimum field length for the
>> fieldNorm value?
>> 
>> I'm not sure I fully follow what distinction you're trying to focus on. I
>> mean, traditionally length normalization has simply tried to distinguish a
>> title field (rarely more than a dozen words) from a full body of text, or
>> maybe an abstract, not things like exactly how many words were in a title.
>> Or, as another example, a short newswire article of a few paragraphs vs. a
>> feature-length article, paper, or even book. IOW, traditionally it was more
>> of a boolean than a broad range of values. Sure, yes, you absolutely can
>> define a custom similarity with a custom norm that supports a wide range of
>> lengths, but you'll have to decide what you really want  to achieve to tune
>> it.
>> 
>> Maybe you could give a couple examples of field values that you feel should
>> be scored differently based on length.
>> 
>> -- Jack Krupansky
>> 
>> On Wed, Apr 20, 2016 at 7:17 PM, 
>> wrote:
>> 
>>> I am talking about the title field. And for the title field, a sweetspot
>>> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
>>> value that differentiates between for example 2, 3, 4 and 5 terms in the
>>> title, but only very little.
>>> 
>>> The 20% number I got by simply calculating the difference in the title
>>> fieldNorm of two documents, where one title was one word longer than the
>>> other title. And one fieldNorm value was 20% larger than the other as a
>>> result of that. And since we use multiplicative scoring calculation, a
>> 20%
>>> increase in the fieldNorm results in a 20% increase in the final score.
>>> 
>>> I'm not talking about "scores as percentages". I'm simply noting that
>> this
>>> minor change in the text data (adding or removing one single word) causes
>>> the score to change by almost 20%. I noted this when I renamed a
>>> document, removing a word from the title, and that single change caused
>> the
>>> 

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
you have some rule of relevance that you haven't yet shared - and I mean
rule that a user would comprehend and consider valuable, not simply a
mechanical rule.



-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:10 PM, 
wrote:

> Ok sure, I can try and give some examples :)
>
> Let's say that we have the following documents:
>
> Id: 1
> Title: John Doe
>
> Id: 2
> Title: John Doe Jr.
>
> Id: 3
> Title: John Lennon: The Life
>
> Id: 4
> Title: John Thompson's Modern Course for the Piano: First Grade Book
>
> Id: 5
> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
> Mrs. Surratt
>
>
> And in general, when a search word matches the title, I would like to have
> the length of the title field influence the score, so that matching
> documents with a shorter title get a higher score than documents with a longer
> title, all else being equal.
>
> So, when a user searches for "John", I would like the results to be pretty
> much in the order presented above. Though, it is not crucial that for
> example document 1 comes before document 2. But I would surely want
> documents 1-3 to come before documents 4 and 5.
>
> In my mind, the fieldNorm is a perfect solution for this. At least in
> theory. In practice, the encoding of the fieldNorm seems to make this
> function much less useful for this use case. Unless I have missed something.
>
> Is there another way to achieve something like this? Note that I don't want
> a general boost on documents with short titles, I only want to boost them
> if the title field actually matched the query.
>
> /Jimi
>
> 
> From: Jack Krupansky 
> Sent: Thursday, April 21, 2016 1:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> I'm not sure I fully follow what distinction you're trying to focus on. I
> mean, traditionally length normalization has simply tried to distinguish a
> title field (rarely more than a dozen words) from a full body of text, or
> maybe an abstract, not things like exactly how many words were in a title.
> Or, as another example, a short newswire article of a few paragraphs vs. a
> feature-length article, paper, or even book. IOW, traditionally it was more
> of a boolean than a broad range of values. Sure, yes, you absolutely can
> define a custom similarity with a custom norm that supports a wide range of
> lengths, but you'll have to decide what you really want  to achieve to tune
> it.
>
> Maybe you could give a couple examples of field values that you feel should
> be scored differently based on length.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 7:17 PM, 
> wrote:
>
> > I am talking about the title field. And for the title field, a sweetspot
> > interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> > value that differentiates between for example 2, 3, 4 and 5 terms in the
> > title, but only very little.
> >
> > The 20% number I got by simply calculating the difference in the title
> > fieldNorm of two documents, where one title was one word longer than the
> > other title. And one fieldNorm value was 20% larger than the other as a
> > result of that. And since we use multiplicative scoring calculation, a
> 20%
> > increase in the fieldNorm results in a 20% increase in the final score.
> >
> > I'm not talking about "scores as percentages". I'm simply noting that
> this
> > minor change in the text data (adding or removing one single word) causes
> > the score to change by almost 20%. I noted this when I renamed a
> > document, removing a word from the title, and that single change caused
> the
> > document to move up several positions in the result list. We don't want
> > such minor modifications to have such a big impact on the resulting score.
> >
> > I'm not sure I can agree with you that "the effect of document length
> > normalization factor is minimal". Then why does it inpact our result in
> > such a big way? And as I said, we don't want to disable it completely, we
> > just want it to have a much lesser effect, even on really short texts.
> >
> > /Jimi
> >
> > 
> > From: Ahmet Arslan 
> > Sent: Thursday, April 21, 2016 12:10 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Is it possible to configure a minimum field length for the
> > fieldNorm value?
> >
> > Hi Jimi,
> >
> > Please define a meaningful document-length range like min=1 max=50.
> > By 

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Ahmet Arslan
Hi Jimi,

The fieldNorm encode/decode step causes some precision loss. 
This may be a problem when dealing with very short documents.
You can find many discussions on this topic.

ahmet



On Thursday, April 21, 2016 3:10 AM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest 
Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like to have the 
length of the title field influence the score, so that matching documents with 
a shorter title get a higher score than documents with a longer title, all else 
being equal.

So, when a user searches for "John", I would like the results to be pretty much 
in the order presented above. Though, it is not crucial that for example 
document 1 comes before document 2. But I would surely want documents 1-3 to 
come before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory. 
In practice, the encoding of the fieldNorm seems to make this function much 
less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a 
general boost on documents with short titles, I only want to boost them if the 
title field actually matched the query.

/Jimi



From: Jack Krupansky 
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want  to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, 
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. And one fieldNorm value was 20% larger than the other as a
> result of that. And since we use multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a
> document, removing a word from the title, and that single change caused the
> document to move up several positions in the result list. We don't want
> such minor modifications to have such a big impact on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it inpact our result in
> such a big way? And as I said, we don't want to disable it completely, we
> just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> 
> From: Ahmet Arslan 
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range like min=1 max=50.
> By the way you need to reindex every time you change something.
>
> Regarding 20% score change, I am not sure how you calculated that number
> and I assume it is correct.
> What really matters is the relative order of documents. It doesn't mean
> anything that the addition of a word decreases the initial score by x%. Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says 

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest 
Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like to have the 
length of the title field influence the score, so that matching documents with 
a shorter title get a higher score than documents with a longer title, all else 
being equal.

So, when a user searches for "John", I would like the results to be pretty much 
in the order presented above. Though, it is not crucial that for example 
document 1 comes before document 2. But I would surely want documents 1-3 to 
come before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory. 
In practice, the encoding of the fieldNorm seems to make this function much 
less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a 
general boost on documents with short titles, I only want to boost them if the 
title field actually matched the query.

/Jimi


From: Jack Krupansky 
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want  to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, 
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. And one fieldNorm value was 20% larger than the other as a
> result of that. And since we use multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a
> document, removing a word from the title, and that single change caused the
> document to move up several positions in the result list. We don't want
> such minor modifications to have such a big impact on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it inpact our result in
> such a big way? And as I said, we don't want to disable it completely, we
> just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> 
> From: Ahmet Arslan 
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range like min=1 max=50.
> By the way you need to reindex every time you change something.
>
> Regarding 20% score change, I am not sure how you calculated that number
> and I assume it is correct.
> What really matters is the relative order of documents. It doesn't mean
> anything that the addition of a word decreases the initial score by x%. Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that addition of a
> non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short documents
> too much. But folks blend score with other structural fields (popularity),
> even completely bypass relevancy score and order by price, production date
> etc. I mean 

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want  to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, 
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. And one fieldNorm value was 20% larger than the other as a
> result of that. And since we use multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a
> document, removing a word from the title, and that single change caused the
> document to move up several positions in the result list. We don't want
> such minor modifications to have such a big impact on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it inpact our result in
> such a big way? And as I said, we don't want to disable it completely, we
> just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> 
> From: Ahmet Arslan 
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range like min=1 max=50.
> By the way you need to reindex every time you change something.
>
> Regarding 20% score change, I am not sure how you calculated that number
> and I assume it is correct.
> What really matters is the relative order of documents. It doesn't mean
> anything that the addition of a word decreases the initial score by x%. Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that addition of a
> non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short documents
> too much. But folks blend score with other structural fields (popularity),
> even completely bypass relevancy score and order by price, production date
> etc. I mean there are many use cases where the effect of the document length
> normalization factor is minimal.
>
> Lucene/Solr is highly pluggable, very easy to customize.
>
> Ahmet
>
>
> On Wednesday, April 20, 2016 11:05 PM, "
> jimi.hulleg...@svensktnaringsliv.se" 
> wrote:
> Hi Ahmet,
>
> SweetSpotSimilarity seems quite nice. Some simple testing by throwing some
> different values at the class gives quite good results. Setting ln_min=1,
> ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or
> less what I want. At least for the title field. I'm not sure what the
> actual effect of those settings would be on longer text fields, so maybe I
> will use the SweetSpotSimilarity only for the title field to start with.
>
> Of course I understand that there are many things that can be considered
> domain specific requirements, like if to favor/punish short/medium/long
> texts, and how. I was just wondering how many actual use cases there are
> where one wants a ~20% difference in score between two documents, where
> the only difference is that one of the documents has one extra word in one
> field. (And now I'm talking about an extra word that doesn't affect
> anything else except the fieldNorm value). I for one find it hard to find
> such a use case, and would consider it a very special use case, and would
> consider a more lenient calculation a better fit for most use cases (and
> therefore most domains). :)
>
> /Jimi
>
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
> Sent: Wednesday, April 20, 2016 8:14 PM
> To: 

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
I am talking about the title field. And for the title field, a sweetspot 
interval of 1 to 50 makes very little sense. I want to have a fieldNorm value 
that differentiates between for example 2, 3, 4 and 5 terms in the title, but 
only very little.

The 20% number I got by simply calculating the difference in the title 
fieldNorm of two documents, where one title was one word longer than the other 
title. And one fieldNorm value was 20% larger than the other as a result of 
that. And since we use multiplicative scoring calculation, a 20% increase in 
the fieldNorm results in a 20% increase in the final score.

I'm not talking about "scores as percentages". I'm simply noting that this 
minor change in the text data (adding or removing one single word) causes the 
score to change by almost 20%. I noted this when I renamed a document, 
removing a word from the title, and that single change caused the document to 
move up several positions in the result list. We don't want such minor 
modifications to have such a big impact on the resulting score.

I'm not sure I can agree with you that "the effect of document length 
normalization factor is minimal". Then why does it inpact our result in such a 
big way? And as I said, we don't want to disable it completely, we just want it 
to have a much lesser effect, even on really short texts.

/Jimi


From: Ahmet Arslan 
Sent: Thursday, April 21, 2016 12:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

Please define a meaningful document-length range like min=1 max=50.
By the way you need to reindex every time you change something.

Regarding 20% score change, I am not sure how you calculated that number and I 
assume it is correct.
What really matters is the relative order of documents. It doesn't mean 
anything that the addition of a word decreases the initial score by x%. Please see:
https://wiki.apache.org/lucene-java/ScoresAsPercentages

There is an information retrieval heuristic which says that addition of a 
non-query term should decrease the score.

Lucene's default document length normalization may favor short documents too 
much. But folks blend score with other structural fields (popularity), even 
completely bypass relevancy score and order by price, production date etc. I 
mean there are many use cases where the effect of the document length normalization 
factor is minimal.

Lucene/Solr is highly pluggable, very easy to customize.

Ahmet


On Wednesday, April 20, 2016 11:05 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.

Of course I understand that there are many things that can be considered domain 
specific requirements, like if to favor/punish short/medium/long texts, and 
how. I was just wondering how many actual use cases there are where one wants 
a ~20% difference in score between two documents, where the only difference is 
that one of the documents has one extra word in one field. (And now I'm talking 
about an extra word that doesn't affect anything else except the fieldNorm 
value). I for one find it hard to find such a use case, and would consider it a 
very special use case, and would consider a more lenient calculation a better 
fit for most use cases (and therefore most domains). :)

/Jimi


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range, so that all 
documents in that range will get the same fieldNorm value.
In your case, you can say that from 1 word up to 100 words, do not employ 
document length punishment. If a document is longer than 100, do some punishment.

By the way, favoring/punishing short, middle, or long documents is a 
domain-specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just 
one extra term in the field, is not really reasonable if you ask me. But you 
seem to say that this is expected, reasonable, and wanted behavior for most use 
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom 

Re: set session variable in mysql importHandler

2016-04-20 Thread Alexandre Rafalovitch
The driver documentation talks about "sessionVariables" that might be
possible to pass through the connection URL:
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html

Alternatively, there might be a way to configure driver via JNDI and
set some variables that way.

I haven't tested either though.
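
If the URL route works, a DIH data-config.xml could look roughly like this 
(again untested; the database, credentials, and query are made up):

  <dataConfig>
    <!-- sessionVariables is a Connector/J URL property -->
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/mydb?sessionVariables=group_concat_max_len=1000000"
                user="solr" password="secret"/>
    <document>
      <entity name="item"
              query="SELECT id, GROUP_CONCAT(tag) AS tags FROM item_tags GROUP BY id"/>
    </document>
  </dataConfig>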

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 20 April 2016 at 23:49, Shawn Heisey  wrote:
> On 4/20/2016 6:01 AM, Zaccheo Bagnati wrote:
>> I configured an ImportHandler on a MySQL table using jdbc driver. I'm
>> wondering if is possible to set a session variable in the mysql connection
>> before executing queries. e. g. "SET SESSION group_concat_max_len =
>> 100;"
>
> Normally the MySQL JDBC driver will not allow you to send more than one
> SQL statement in a single request -- this is to prevent SQL injection
> attacks.
>
> I think MySQL probably has a JDBC parameter that would allow multiple
> statements per request, but a better option might be to put all the
> statements you need in a stored procedure and call the procedure from
> the import handler.  You'll need to consult MySQL support resources for
> help with how to do this.
>
> Thanks,
> Shawn
>


Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-20 Thread Li Ding
Hi All,

We are using SolrCloud 4.6.1.  We have observed the following behavior
recently.  A Solr node in a SolrCloud cluster is up, but some of the cores
on the node are marked as down in ZooKeeper.  If the cores are parts of a
multi-sharded collection with one replica, the queries to that collection
will fail.  However, when this happens, if we issue queries to the core
directly, it returns 200 and correct info.  But once Solr gets into this
state, the core stays marked down forever unless we restart Solr.

Has anyone seen this behavior before?  Is there any way for the core to get
out of this state on its own?

Thanks,

Li


Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Ahmet Arslan
Hi Jimi,

Please define a meaningful document-length range like min=1 max=50.
By the way you need to reindex every time you change something.

Regarding the 20% score change, I am not sure how you calculated that number,
but I assume it is correct.
What really matters is the relative order of documents. It doesn't mean
anything that the addition of a word decreases the initial score by x%. Please see:
https://wiki.apache.org/lucene-java/ScoresAsPercentages

There is an information retrieval heuristic which says that the addition of a
non-query term should decrease the score.

Lucene's default document length normalization may favor short documents too
much. But folks blend the score with other structural fields (popularity), or
even completely bypass the relevancy score and order by price, production date,
etc. I mean that there are many use cases in which the effect of the document
length normalization factor is minimal.

Lucene/Solr is highly pluggable and very easy to customize.

Ahmet


On Wednesday, April 20, 2016 11:05 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.

Of course I understand that there are many things that can be considered domain
specific requirements, like whether to favor/punish short/medium/long texts, and
how. I was just wondering how many actual use cases there are where one wants
a ~20% difference in score between two documents, where the only difference is
that one of the documents has one extra word in one field. (And now I'm talking
about an extra word that doesn't affect anything else except the fieldNorm
value). I for one find it hard to come up with such a use case; I would consider
it a very special one, and a more lenient calculation a better fit for most use
cases (and therefore most domains). :)

/Jimi


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range so that all
documents in that range get the same fieldNorm value.
In your case, you can say that documents from 1 word up to 100 words get no
length punishment; only documents longer than 100 words are punished.

By the way, favoring/punishing short, middle, or long documents is a
domain-specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just
one extra term in the field is not really reasonable if you ask me. But you
seem to say that this is expected, reasonable and wanted behavior for most use
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi



-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation 
> is quite good. But when the text is short I think that the effect 

Re: Questions about tie parameter for dismax/edismax

2016-04-20 Thread Ahmet Arslan
Hi Jimi,

Contributions to the documentation are very important.
It would be great if you could prepare a good text explaining things with common
sense that is easy to understand. Please include your documentation proposal as a
comment on the Confluence wiki [1].

[1] https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide


Committers will notice it and the community will appreciate it!

Thanks,
Ahmet





On Wednesday, April 20, 2016 11:21 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Thanks Ahmet! The second I read that part about the "albino elephant" query I 
remembered that I had read that before, but just forgotten about it. That 
explanation is really good, and really should be part of the regular 
documentation if you ask me. :)

/Jimi


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Wednesday, April 20, 2016 8:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about tie parameter for dismax/edismax

Hi Jimi,

Field-based scoring, where you query multiple fields (title, body, keywords,
etc.) with multiple query terms, is an unsolved problem.

(E)dismax is a heuristic approach to attacking the problem.

Please see the javadoc of DisjunctionMaxQuery:
https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html

Some folks try to obtain optimum edismax parameters from training data;
others employ learning-to-rank techniques ...

Ahmet


On Wednesday, April 20, 2016 6:18 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Hi,

I have been looking a bit at the tie parameter, and I think I understand how it 
works, but I still have a few questions about it.

1. It is not documented anywhere (as far as I have seen) what the default value 
is. Some testing indicates that the default value is 0, and it makes perfect 
sense. But shouldn't that fact be documented?

2. There is very little information about how to think when choosing a tie
value. Are there really no general recommendations based on different use
cases? Or is it simply a matter of "try different values and see what happens"?

3. Some recommendations I have seen mention that a really low value is the best
option. But can someone explain why? I understand that one moves further away
from the dismax "philosophy" the higher the tie value one uses. But I care only
about the quality of the score calculation. Can someone explain why the score
has a higher quality with a lower tie?

4. Regarding the dismax "philosophy". On the dismax wiki page it says:

"Max means that if your word 'foo' matches both title and body, the max score 
of these two (probably title match) is added to the score, not the sum of the 
two as a simple OR query would do. This gives more control over your ranking."

But it doesn't explain *why* this gives "more control over your ranking". Can 
someone explain the logic behind that statement? I'm not claiming that it is 
incorrect, I just want to understand it. :)

Regards
/Jimi 


Remedial Map-Reduce logic

2016-04-20 Thread Davis, Daniel (NIH/NLM) [C]
Well, it's been a long time since I took any data structures and algorithms 
course (2000, basically), and after the recent Solr 6 feature chat, I was very 
curious whether there was real computational goodness behind the move towards a 
JDBC interface based on Streaming Expressions.   This led me to stuff I should 
have read a long, long time ago:

http://dl.acm.org/citation.cfm?doid=1629175.1629197

So, yes, Solr providing Streaming Expressions and JDBC is a powerful good thing.

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH



RE: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Hang on... It didn't work out as I wanted. The problem seems to be in the
encoding of the fieldNorm value. The encoding is so coarse that two values
that were quite close to each other originally can end up quite far apart
after encoding and decoding.

For example, when testing this with two documents, the calculated fieldNorm
value for the title field is 0.7905694 and 0.745356 respectively, i.e. the
difference is only about 0.05. But the encoded values become 122 and 121
respectively, and when these values are decoded, they become 0.75 and 0.625.
The difference now is 0.125. That is quite a big step, if you ask me. In fact,
it is so big that it more or less makes this whole thing with
SweetSpotSimilarity useless for me.

Am I missing something here? Is it really so that one can have a really great
similarity implementation that spits out great values, only to have them
butchered because of the way Lucene stores the data? Can I do something to
remedy this?
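For reference, the coarseness can be reproduced directly with the
3-mantissa-bit encoder that TFIDFSimilarity uses under the hood (a sketch
against Lucene's SmallFloat; this assumes the default encodeNormValue
implementation):

import org.apache.lucene.util.SmallFloat;

public class NormEncodingDemo {
    public static void main(String[] args) {
        // The two computed title-field norms from the example above.
        float[] norms = {0.7905694f, 0.745356f};
        for (float norm : norms) {
            byte encoded = SmallFloat.floatToByte315(norm);     // what gets stored
            float decoded = SmallFloat.byte315ToFloat(encoded); // what scoring sees
            System.out.println(norm + " -> byte " + encoded + " -> " + decoded);
        }
        // Prints: 0.7905694 -> byte 122 -> 0.75
        //         0.745356 -> byte 121 -> 0.625
    }
}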

/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Wednesday, April 20, 2016 10:05 PM
To: solr-user@lucene.apache.org
Subject: RE: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.

Of course I understand that there are many things that can be considered domain
specific requirements, like whether to favor/punish short/medium/long texts, and
how. I was just wondering how many actual use cases there are where one wants
a ~20% difference in score between two documents, where the only difference is
that one of the documents has one extra word in one field. (And now I'm talking
about an extra word that doesn't affect anything else except the fieldNorm
value). I for one find it hard to come up with such a use case; I would consider
it a very special one, and a more lenient calculation a better fit for most use
cases (and therefore most domains). :)

/Jimi

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range so that all
documents in that range get the same fieldNorm value.
In your case, you can say that documents from 1 word up to 100 words get no
length punishment; only documents longer than 100 words are punished.

By the way, favoring/punishing short, middle, or long documents is a
domain-specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just
one extra term in the field is not really reasonable if you ask me. But you
seem to say that this is expected, reasonable and wanted behavior for most use
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi



-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky


RE: Questions about tie parameter for dismax/edismax

2016-04-20 Thread jimi.hullegard
Thanks Ahmet! The second I read that part about the "albino elephant" query I 
remembered that I had read that before, but just forgotten about it. That 
explanation is really good, and really should be part of the regular 
documentation if you ask me. :)

/Jimi

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Wednesday, April 20, 2016 8:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about tie parameter for dismax/edismax

Hi Jimi,

Field-based scoring, where you query multiple fields (title, body, keywords,
etc.) with multiple query terms, is an unsolved problem.

(E)dismax is a heuristic approach to attacking the problem.

Please see the javadoc of DisjunctionMaxQuery:
https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html

Some folks try to obtain optimum edismax parameters from training data;
others employ learning-to-rank techniques ...

Ahmet


On Wednesday, April 20, 2016 6:18 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Hi,

I have been looking a bit at the tie parameter, and I think I understand how it 
works, but I still have a few questions about it.

1. It is not documented anywhere (as far as I have seen) what the default value 
is. Some testing indicates that the default value is 0, and it makes perfect 
sense. But shouldn't that fact be documented?

2. There is very little information about how to think when choosing a tie
value. Are there really no general recommendations based on different use
cases? Or is it simply a matter of "try different values and see what happens"?

3. Some recommendations I have seen mention that a really low value is the best
option. But can someone explain why? I understand that one moves further away
from the dismax "philosophy" the higher the tie value one uses. But I care only
about the quality of the score calculation. Can someone explain why the score
has a higher quality with a lower tie?

4. Regarding the dismax "philosophy". On the dismax wiki page it says:

"Max means that if your word 'foo' matches both title and body, the max score 
of these two (probably title match) is added to the score, not the sum of the 
two as a simple OR query would do. This gives more control over your ranking."

But it doesn't explain *why* this gives "more control over your ranking". Can 
someone explain the logic behind that statement? I'm not claiming that it is 
incorrect, I just want to understand it. :)

Regards
/Jimi 


RE: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.

Of course I understand that there are many things that can be considered domain
specific requirements, like whether to favor/punish short/medium/long texts, and
how. I was just wondering how many actual use cases there are where one wants
a ~20% difference in score between two documents, where the only difference is
that one of the documents has one extra word in one field. (And now I'm talking
about an extra word that doesn't affect anything else except the fieldNorm
value). I for one find it hard to come up with such a use case; I would consider
it a very special one, and a more lenient calculation a better fit for most use
cases (and therefore most domains). :)

/Jimi

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range so that all
documents in that range get the same fieldNorm value.
In your case, you can say that documents from 1 word up to 100 words get no
length punishment; only documents longer than 100 words are punished.

By the way, favoring/punishing short, middle, or long documents is a
domain-specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just
one extra term in the field is not really reasonable if you ask me. But you
seem to say that this is expected, reasonable and wanted behavior for most use
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi



-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation 
> is quite good. But when the text is short I think that the effect is too big.
>
> I.e. with two documents that have a short text in the same field, just a
> few extra characters in one of the documents lowers the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has 
> fieldNorm 0.4375, and in document 2 the text is 37 characters long and 
> has fieldNorm 0.375. That means that the first document gets almost a 
> 20% higher score simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a 
> lower character limit, meaning that all fields with a length below 
> this limit gets the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for 
> that field, but I would prefer to still have it, just limit its effect 
> on short texts.
>
> Regards
> /Jimi
>
>
>


Re: Indexing 700 docs per second

2016-04-20 Thread Mark Robinson
Thank you all for your very valuable suggestions.
I will try out the options shared once our set up is ready and probably get
back on my experience once it is done.

Thanks!
Mark.

On Wed, Apr 20, 2016 at 9:54 AM, Bram Van Dam  wrote:

> > I have a requirement to index (mainly update) 700 docs per second.
> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc sized around 260
> > bytes (6 fields, out of which only 2 will be updated at the above
> > rate). This collection has around 122 million docs and that count is
> pretty
> > much a constant.
>
> We've found that average index size per document is a good predictor of
> performance. For instance, I've got a 150GB index lying around,
> containing 400M documents. That's roughly 400 bytes per document in
> index size. This was indexed @ 4500 documents/second.
>
> If the average index size per document doubles, the throughput will go
> down by about a third. Your mileage may vary.
>
> But yeah, I would say that 700 docs/second on your machine won't be much of a
> problem. Especially considering your index will likely fit in memory.
>
>  - Bram
>
>
>


Re: how to restrict phrase to appear in same child document

2016-04-20 Thread Yangrui Guo
Hi, thanks for answering. My problem is that users do not distinguish which
field a color belongs to in the query. For example, with "which black driver
has a white mercedes", it is difficult to tell which color belongs
to which field, because there can be thousands of car brands and
professions. Is there any way to achieve the feature I stated
before?

On Wednesday, April 20, 2016, Alisa Z.  wrote:

>  Yangrui,
>
> First, have you indexed your documents with proper nested document
> structure [
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments]?
> From the piece of data you showed, it seems that you just put it in as
> it is and it all got flattened.
>
> Then, you'll probably want to introduce distinguishing
> "type"/"category"/"path" fields into your data, so it would look like this:
>
> {
> type:top
> id:
> {
> type:car_color
> car:
> color:
> }
> {
>   type:driver_color
> driver:
> color:
> }
> }
>
>
> >Wed, 20 Apr 2016 -3:28:33 -0400 from Yangrui Guo  >:
> >
> >hello
> >
> >I have a nested document type in my index. Here's the structure of my
> >document:
> >
> >{
> >id:
> >{
> >car:
> >color:
> >}
> >{
> >driver:
> >color:
> >}
> >}
> >
> >However, when I use the query q={!parent
> >which="content_type:parent"}+(black AND driver)={!parent
> >which="content_type:parent"}+(white AND mercedes), the result also
> >contained white driver with black mercedes. I know I can put fields before
> >terms but it is not always easy to do this. Users might just enter one
> >string. How can I modify my query to require that the terms between two
> >parentheses must appear in the same child document, or boost those meet
> the
> >criteria? Thanks
>
>


Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)

2016-04-20 Thread Alisa Z .
 Hi all, 

I have been stretching some of Solr's capabilities for nested document handling
and I've run into the following issue...

Let's say I have the following structure:

{
"blog-posts":{  //level 1
    "leaf-fields":[
    "date",
    "author"],
    "title":{   //level 2
    "leaf-fields":[ "text"],
    "keywords":{    //level 3
    "leaf-fields":[
    "text",
    "type"]
    }
    },
    "body":{    //level 2
    "leaf-fields":[ "text"],
    "keywords":{    //level 3
    "leaf-fields":[
    "text",
    "type"]
    }
    },
    "comments":{    //level 2
    "leaf-fields":[
    "date",
    "author",
    "text",
    "sentiment"
    ],
    "keywords":{    //level 3
    "leaf-fields":[
    "text",
    "type"]
    },
    "replies":{ //level 3
    "leaf-fields":[
    "date",
    "author",
    "text",
    "sentiment"],
    "keywords":{    //level 4
    "leaf-fields":[
    "text",
    "type"]
                }
            }
        }
    }
}
And I want to know the distribution of all readers' keywords (levels 3 and 4) 
by comments (level 2).  
In JSON Facet API I tried this: 

curl http://localhost:8983/solr/my_index/query -d 
'q=path:2.blog-posts.comments&rows=0&
json.facet={
  filter_by_child_type :{
    type:query,
    q:"path:*comments*keywords",
    domain: { blockChildren : "path:2.blog-posts.comments" },
    facet:{
      top_keywords : {
        type: terms,
        field: text,
        sort: "counts_by_comments desc",
        facet: {
          counts_by_comments: "unique(_root_)"    // I suspect it should be a
different field, not _root_, but what would it be for an intermediate document?
        }
      }
    }
  }
}'

This gives me the wrong results: it aggregates by posts, not by comments (it's
a toy data set, so I know that the correct answer for "Solr" is 3 when faceted
by comments)

{
"response":{"numFound":3,"start":0,"docs":[]
  },
  "facets":{
    "count":3,
    "filter_by_child_type":{
  "count":9,
  "top_keywords":{
    "buckets":[{
    "val":"Elasticsearch",
    "count":2,
    "counts_by_comments":2},
  {
    "val":"Solr",
    "count":5,
    "counts_by_comments":2},   //here the count by 
"comments" should be 3 
  {
    "val":"Solr 5.5",
    "count":1,
    "counts_by_comments":1},
  {
    "val":"feature",
    "count":1,
    "counts_by_comments":1}]


Am I writing the query wrong? 


By the way, Block Join Faceting works fine for this: 
bjqfacet?q={!parent%20which=path:2.blog-posts.comments}path:*.comments*keywords&rows=0&facet=true&child.facet.field=text&wt=json&indent=true

{
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
  "text":[
    "Elasticsearch",2,
    "Solr",3,  //correct result 
    "Solr 5.5",1,
    "feature",1]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

But we've already discussed that it returns too much stuff: no way to put
limits or order by counts :(  That's why I want to see whether it's possible to
do this with the JSON Facet API directly.

Thank you in advance!

-- 
Alisa Zhila

Re: Questions about tie parameter for dismax/edismax

2016-04-20 Thread Ahmet Arslan
Hi Jimi,

Field-based scoring, where you query multiple fields (title, body, keywords,
etc.) with multiple query terms, is an unsolved problem.

(E)dismax is a heuristic approach to attacking the problem.

Please see the javadoc of DisjunctionMaxQuery:
https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html
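To make the max-plus-tie combination concrete, here is a tiny sketch (the
per-field scores are made-up numbers, not real Solr output):

public class TieDemo {
    public static void main(String[] args) {
        // Hypothetical scores of one query term in two fields.
        float title = 2.0f, body = 0.5f;
        for (float tie : new float[] {0f, 0.1f, 1f}) {
            // DisMax combines per-field scores as:
            //   max(scores) + tie * (sum of the other scores)
            float score = Math.max(title, body) + tie * Math.min(title, body);
            System.out.println("tie=" + tie + " -> score " + score);
        }
    }
}

With tie=0 you get a pure max (2.0); with tie=1 you get a plain sum (2.5), i.e.
an ordinary OR query; small values in between mostly keep the max behavior
while still breaking ties between otherwise equal documents.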

Some folks try to obtain optimum edismax parameters from training data;
others employ learning-to-rank techniques ...

Ahmet


On Wednesday, April 20, 2016 6:18 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Hi,

I have been looking a bit at the tie parameter, and I think I understand how it 
works, but I still have a few questions about it.

1. It is not documented anywhere (as far as I have seen) what the default value 
is. Some testing indicates that the default value is 0, and it makes perfect 
sense. But shouldn't that fact be documented?

2. There is very little information about how to think when choosing a tie
value. Are there really no general recommendations based on different use
cases? Or is it simply a matter of "try different values and see what happens"?

3. Some recommendations I have seen mention that a really low value is the best
option. But can someone explain why? I understand that one moves further away
from the dismax "philosophy" the higher the tie value one uses. But I care only
about the quality of the score calculation. Can someone explain why the score
has a higher quality with a lower tie?

4. Regarding the dismax "philosophy". On the dismax wiki page it says:

"Max means that if your word 'foo' matches both title and body, the max score 
of these two (probably title match) is added to the score, not the sum of the 
two as a simple OR query would do. This gives more control over your ranking."

But it doesn't explain *why* this gives "more control over your ranking". Can 
someone explain the logic behind that statement? I'm not claiming that it is 
incorrect, I just want to understand it. :)

Regards
/Jimi 


Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Ahmet Arslan
Hi Jimi,

SweetSpotSimilarity allows you to define a document length range so that all
documents in that range get the same fieldNorm value.
In your case, you can say that documents from 1 word up to 100 words get no
length punishment; only documents longer than 100 words are punished.
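A minimal sketch of what that looks like against the Lucene class directly
(untested; the plateau values are just the ones suggested above):

import org.apache.lucene.misc.SweetSpotSimilarity;

public class SweetSpotSketch {
    public static void main(String[] args) {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();
        // min=1, max=100: fields with 1..100 terms all get the same lengthNorm;
        // steepness controls how quickly punishment kicks in past the plateau.
        sim.setLengthNormFactors(1, 100, 0.5f, true);
        System.out.println("norm @ 50 terms:  " + sim.computeLengthNorm(50));
        System.out.println("norm @ 500 terms: " + sim.computeLengthNorm(500));
    }
}

The same knobs are exposed in the schema through Solr's
SweetSpotSimilarityFactory.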

By the way, favoring/punishing short, middle, or long documents is a
domain-specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just
one extra term in the field is not really reasonable if you ask me. But you
seem to say that this is expected, reasonable and wanted behavior for most use
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi



-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation 
> is quite good. But when the text is short I think that the effect is too big.
>
> I.e. with two documents that have a short text in the same field, just a
> few extra characters in one of the documents lowers the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has 
> fieldNorm 0.4375, and in document 2 the text is 37 characters long and 
> has fieldNorm 0.375. That means that the first document gets almost a 
> 20% higher score simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a 
> lower character limit, meaning that all fields with a length below 
> this limit gets the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for 
> that field, but I would prefer to still have it, just limit its effect 
> on short texts.
>
> Regards
> /Jimi
>
>
>


Re: how to restrict phrase to appear in same child document

2016-04-20 Thread Alisa Z .
 Yangrui, 

First, have you indexed your documents with proper nested document structure 
[https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments]?
From the piece of data you showed, it seems that you just put it in as it
is and it all got flattened.

Then, you'll probably want to introduce distinguishing
"type"/"category"/"path" fields into your data, so it would look like this: 

{
type:top
id:
{
type:car_color
car:
color:
}
{
  type:driver_color
driver:
color:
}
}


>Wed, 20 Apr 2016 -3:28:33 -0400 from Yangrui Guo :
>
>hello
>
>I have a nested document type in my index. Here's the structure of my
>document:
>
>{
>id:
>{
>car:
>color:
>}
>{
>driver:
>color:
>}
>}
>
>However, when I use the query q={!parent
>which="content_type:parent"}+(black AND driver)={!parent
>which="content_type:parent"}+(white AND mercedes), the result also
>contained white driver with black mercedes. I know I can put fields before
>terms but it is not always easy to do this. Users might just enter one
>string. How can I modify my query to require that the terms between two
>parentheses must appear in the same child document, or boost those meet the
>criteria? Thanks



Re: Traversal of documents through network

2016-04-20 Thread Alisa Z .
 Vidya, 

No, not all of those 500 result docs will be brought to your client (browser,
etc.). Only as many documents as fit on the first "search result page" will be
brought.

There is a notion of "pagination" in Solr (as well as in most search engines).
The counts of occurrences might be approximate, and in any case you will be
shown only as many documents as specified by your "search result page" size. By
default, the page size is set to 10 documents, so although you might see
something like "response":{"numFound":27,"start":0,"docs":[...]}, only the top
10 documents will be returned.

In Solr, "page" size  is controlled with "start" and "row" parameters ( see 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results), so if 
you want less results to be brought at a time, you can specify your query like 
this: 
q="word"=5  - that will show you only top 5 results and only they will 
"traverse the network" (or being brought from the Solr server to your browser 
or other client).

If you want to look at another page, you specify 
q="word"=5=5 - this is the 2nd page  of the results 


Hope it helps.

--Alisa 


>Wednesday, 20 April 2016, 10:01 -04:00 from vidya :
>
>Hi
>
>When I query a word in Solr, let's say the keyword is found in
>500 documents. Will all those documents traverse the network?
>Or how does it happen?
>
>Please help me on this.
>
>
>
>--
>View this message in context:  
>http://lucene.472066.n3.nabble.com/Traversal-of-documents-through-network-tp4271555.html
>Sent from the Solr - User mailing list archive at Nabble.com.



RE: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
OK. Well, still, the fact that the score increases almost 20% because of just
one extra term in the field is not really reasonable if you ask me. But you
seem to say that this is expected, reasonable and wanted behavior for most use
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi

 
-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation 
> is quite good. But when the text is short I think that the effect is too big.
>
> I.e. with two documents that have a short text in the same field, just a
> few extra characters in one of the documents lowers the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has 
> fieldNorm 0.4375, and in document 2 the text is 37 characters long and 
> has fieldNorm 0.375. That means that the first document gets almost a 
> 20% higher score simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a 
> lower character limit, meaning that all fields with a length below 
> this limit gets the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for 
> that field, but I would prefer to still have it, just limit its effect 
> on short texts.
>
> Regards
> /Jimi
>
>
>


Questions about tie parameter for dismax/edismax

2016-04-20 Thread jimi.hullegard
Hi,

I have been looking a bit at the tie parameter, and I think I understand how it 
works, but I still have a few questions about it.

1. It is not documented anywhere (as far as I have seen) what the default value 
is. Some testing indicates that the default value is 0, and it makes perfect 
sense. But shouldn't that fact be documented?

2. There is very little information about how to think when choosing a tie 
value. Is there really no general recommendations based on some different use 
cases? Or is it simple a matter of "try different values and see what happens"?

3. Some recommendations I have seen mention a really low value is the best 
option. But can someone explain why? I understand that one moves further away 
from the dismax "philosophy" the higher the tie value one uses. But I care only 
about the quality of the score calculation. Can someone explain why the score 
has a higher quality with a lower tie?

4. Regarding the dismax "philosophy". On the dismax wiki page it says:

"Max means that if your word 'foo' matches both title and body, the max score 
of these two (probably title match) is added to the score, not the sum of the 
two as a simple OR query would do. This gives more control over your ranking."

But it doesn't explain *why* this gives "more control over your ranking". Can 
someone explain the logic behind that statement? I'm not claiming that it is 
incorrect, I just want to understand it. :)

Regards
/Jimi


Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
FWIW, length for normalization is measured in terms (tokens), not
characters.

With TF-IDF similarity (the default before 6.0), the normalization is based
on the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that
calculation.
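As a sketch of what such an override might look like (untested, against the
Lucene 5.x API linked above; the floor of 5 terms is purely hypothetical):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.ClassicSimilarity;

public class FlooredLengthSimilarity extends ClassicSimilarity {
    private static final int MIN_TERMS = 5; // hypothetical floor

    @Override
    public float lengthNorm(FieldInvertState state) {
        int numTerms = discountOverlaps
                ? state.getLength() - state.getNumOverlap()
                : state.getLength();
        // Every field shorter than MIN_TERMS gets the same norm, so very
        // short fields are no longer punished relative to each other.
        return state.getBoost()
                * (float) (1.0 / Math.sqrt(Math.max(numTerms, MIN_TERMS)));
    }
}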

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation is
> quite good. But when the text is short I think that the effect is too big.
>
> I.e. with two documents that have a short text in the same field, just a few
> extra characters in one of the documents lowers the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has fieldNorm
> 0.4375, and in document 2 the text is 37 characters long and has fieldNorm
> 0.375. That means that the first document gets almost a 20% higher score
> simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a lower
> character limit, meaning that all fields with a length below this limit
> gets the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for that
> field, but I would prefer to still have it, just limit its effect on short
> texts.
>
> Regards
> /Jimi
>
>
>


Re: Live Podcast on Solr 6 with Yonik and Erik Hatcher (Today, 2pm ET)

2016-04-20 Thread Doug Turnbull
Thanks to those that watched live. If you missed it, here's the audio
recording if you'd like to listen in

http://opensourceconnections.com/blog/2016/04/19/solr-6-release/

Best
-Doug

On Tue, Apr 19, 2016 at 12:32 PM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Doh! Thanks Yonik. Yes that's right. Thought I had double checked
>
> On Tue, Apr 19, 2016 at 12:24 PM Yonik Seeley  wrote:
>
>> Hey Doug,
>> Not sure if the URL matters, but I thougt it was this one:
>>
>>
>> https://blab.im/matthew-l-overstreet-solr-6-is-available-find-out-about-what-s-new
>>
>> -Yonik
>>
>>
>> On Tue, Apr 19, 2016 at 10:37 AM, Doug Turnbull
>>  wrote:
>> > Hey Solristas:
>> >
>> > We do a regular podcast called Search Disco
>> > . Today we'll be discussing
>> the
>> > recent release of Solr 6 with Solr creator, Yonik Seeley and Solr
>> committer
>> > Erik Hatcher.
>> >
>> > *Subscribe to participate live*
>> > <
>> https://blab.im/matthew-l-overstreet-full-text-search-and-recommendation-engines
>> >.
>> > (*2PM ET today (19-APR)*)
>> >
>> > We use the blab  conversation platform, which will let
>> you
>> > chat with us, Yonik, and Erik. So bring your tough Solr questions!
>> >
>> > Look forward to seeing you there. And if you're interested in past
>> > episodes, check them out .
>> >
>> > -Doug Turnbull
>> > http://opensourceconnections.com
>>
>


Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Hi,

In general I think that the fieldNorm factor in the score calculation is quite 
good. But when the text is short I think that the effect is too big.

I.e. with two documents that have a short text in the same field, just a few
extra characters in one of the documents lowers the fieldNorm factor too much. In
one test the text in document 1 is 30 characters long and has fieldNorm 0.4375, 
and in document 2 the text is 37 characters long and has fieldNorm 0.375. That 
means that the first document gets almost a 20% higher score simply because of 
the 7 character difference.

What are my options if I want to change this behavior? Can I set a lower 
character limit, meaning that all fields with a length below this limit gets 
the same fieldNorm value?

I know I can force fieldNorm to be 1 by setting omitNorms="true" for that 
field, but I would prefer to still have it, just limit its effect on short 
texts.

Regards
/Jimi




Traversal of documents through network

2016-04-20 Thread vidya
Hi

When I query a word in Solr, let's say the keyword is found in
500 documents. Will all those documents traverse the network?
Or how does it happen?

Please help me on this.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Traversal-of-documents-through-network-tp4271555.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing 700 docs per second

2016-04-20 Thread Bram Van Dam
> I have a requirement to index (mainly update) 700 docs per second.
> Suppose I have a 128GB RAM, 32 CPU machine, with each doc sized around 260
> bytes (6 fields, out of which only 2 will be updated at the above
> rate). This collection has around 122 million docs and that count is pretty
> much a constant.

We've found that average index size per document is a good predictor of
performance. For instance, I've got a 150GB index lying around,
containing 400M documents. That's roughly 400 bytes per document in
index size. This was indexed @ 4500 documents/second.

If the average index size per document doubles, the throughput will go
down by about a third. Your mileage may vary.

But yeah, I would say that 700 docs/second on your machine won't be much of a
problem. Especially considering your index will likely fit in memory.

 - Bram




Re: Overall large size in Solr across collections

2016-04-20 Thread Shawn Heisey
On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for the information Shawn.
>
> I believe it could be due to the types of files being indexed.
> Currently, I'm indexing EML files which are in HTML format, and they
> are richer in content (with inline images and full text), while
> previously the EML files were in plain text format, with the images as
> attachments.
>
> Will this be the cause of the slow indexing speed which I'm facing now? It
> is more than 3 times slower than what I had previously.

I assume that you are using the Extracting Request Handler for this.  I
know almost nothing about Tika, but I would imagine that extracting data
from rich text documents is not a fast process, and that plain text
documents would be a lot faster.  I could be wrong -- I've never used
the ERH myself.

If you want a setup like this to go faster, you probably need to make
your indexing process multi-threaded.  Ideally, such an application
would be written in Java and would incorporate Tika into the client-side
code.  Tika can be very unstable, so running it inside Solr (the
Extracting Request Handler) can make Solr itself unstable.
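A rough sketch of that shape of solution (client-side Tika feeding SolrJ;
the URL, file and field names are placeholders):

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class ClientSideTika {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        try (HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            // Extraction happens in the client, so a Tika crash cannot take
            // Solr down; run several of these loops in parallel for speed.
            File f = new File("mail.eml");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", f.getName());
            doc.addField("content", tika.parseToString(f));
            solr.add(doc);
            solr.commit();
        }
    }
}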

Thanks,
Shawn



Re: set session variable in mysql importHandler

2016-04-20 Thread Shawn Heisey
On 4/20/2016 6:01 AM, Zaccheo Bagnati wrote:
> I configured an ImportHandler on a MySQL table using jdbc driver. I'm
> wondering if is possible to set a session variable in the mysql connection
> before executing queries. e. g. "SET SESSION group_concat_max_len =
> 100;"

Normally the MySQL JDBC driver will not allow you to send more than one
SQL statement in a single request -- this is to prevent SQL injection
attacks.

I think MySQL probably has a JDBC parameter that would allow multiple
statements per request, but a better option might be to put all the
statements you need in a stored procedure and call the procedure from
the import handler.  You'll need to consult MySQL support resources for
help with how to do this.

Thanks,
Shawn



Re: Storing different collection on different hard disk

2016-04-20 Thread Bram Van Dam
Have you considered simply mounting different disks under different
paths? It looks like you're using Windows, so I'm not sure if that's
possible, but it seems like a relatively basic task, so who knows.

You could mount Disk 1 as /path/to/collection1 and Disk 2 as
/path/to/collection2. That way you won't need to change your Solr
configuration at all.
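(As an aside on the dataDir attempt quoted below: core.properties is a plain
Java properties file, so the value presumably takes no quotes, and backslashes
in Windows paths would need escaping; forward slashes are safer. A sketch,
untested:)

# core.properties for collection1 (sketch; the path is a placeholder)
name=collection1
dataDir=D:/collection1/data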

 - Bram

On 20/04/16 06:04, Zheng Lin Edwin Yeo wrote:
> Thanks for your info.
> 
> I tried to set it, but Solr is not able to find the indexes, and I get the
> following error:
> 
>- *collection1:*
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>java.io.IOException: The filename, directory name, or volume label syntax
>is incorrect
> 
> 
> Is this the correct way to set it in the core.properties file?
> dataDir="D:\collection1"
> 
> Also, do we need to set the dataDir in solrconfig.xml as well?
> 
> Regards,
> Edwin
> 
> 
> On 19 April 2016 at 19:36, Alexandre Rafalovitch  wrote:
> 
>> Have you tried setting dataDir parameter in the core.properties file?
>> https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
>>
>> Regards,
>>Alex.
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>> On 19 April 2016 at 20:43, Zheng Lin Edwin Yeo 
>> wrote:
>>> Hi,
>>>
>>> I would like to find out is it possible to store the indexes file of
>>> different collections in different hard disk?
>>> Like for example, I want to store the indexes of collection1 in Hard Disk
>>> 1, and the indexes of collection2 in Hard Disk 2.
>>>
>>> I am using Solr 5.4.0
>>>
>>> Regards,
>>> Edwin
>>
> 



Solr documents into application cache

2016-04-20 Thread Anil
Hi,

I would like to load Solr documents (based on certain criteria) into an
application cache (Hazelcast).

Is there any better way to do it than firing paginated queries? Thanks.
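(One established alternative to plain start/rows paging for bulk reads is
cursorMark, available since Solr 4.7. A SolrJ sketch; URL, collection and
field names are placeholders:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorExport {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(500);
            q.setSort("id", SolrQuery.ORDER.asc); // cursors need a uniqueKey sort
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // put doc into the Hazelcast map here
                }
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) break; // cursor stopped moving: done
                cursor = next;
            }
        }
    }
}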

Regards,
Anil


set session variable in mysql importHandler

2016-04-20 Thread Zaccheo Bagnati
Hi all,
I configured an ImportHandler on a MySQL table using jdbc driver. I'm
wondering if is possible to set a session variable in the mysql connection
before executing queries. e. g. "SET SESSION group_concat_max_len =
100;"
Thanks
Bye
Zaccheo


Re: Facet heatmaps: cluster coordinates based on average position of docs

2016-04-20 Thread Anton K.
Thanks for your answer, David, and have a good vacation.

It seems a more detailed heatmap is not a good solution in my case, because I
need to display a cluster icon with the number of items inside the cluster. So
if I get a very large number of cells on the map, some of the icons will overlap.

I am also thinking about the stats component for the facet.heatmap feature.
Maybe we could use the stats component to add average positions of documents in
each cell?
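(One workaround along the lines David suggests: request a finer facet.heatmap
grid and place each display cluster's icon at the count-weighted centroid of
its underlying cells. A sketch with made-up numbers; counts_ints2D is the grid
of counts Solr returns:)

public class CentroidSketch {
    public static void main(String[] args) {
        // Hypothetical 2x2 slice of a counts_ints2D grid plus the
        // lat/lon centers of those cells.
        int[][] counts = {{5, 0}, {2, 3}};
        double[][] lat = {{10.5, 10.5}, {9.5, 9.5}};
        double[][] lon = {{20.5, 21.5}, {20.5, 21.5}};
        double total = 0, latSum = 0, lonSum = 0;
        for (int r = 0; r < counts.length; r++) {
            for (int c = 0; c < counts[r].length; c++) {
                total += counts[r][c];
                latSum += counts[r][c] * lat[r][c];
                lonSum += counts[r][c] * lon[r][c];
            }
        }
        // Icon position = weighted average of the cell centers.
        System.out.println("centroid: " + latSum / total + ", " + lonSum / total);
    }
}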

2016-04-20 4:28 GMT+03:00 David Smiley :

> Hi Anton,
>
> Perhaps you should request a more detailed / high-res heatmap, and then
> work with that, perhaps using some clustering technique?  I confess I don't
> work on the UI end of things these days.
>
> p.s. I'm on vacation this week; so I don't respond quickly
>
> ~ David
>
> On Thu, Apr 7, 2016 at 3:43 PM Anton K.  wrote:
>
> > I am working with the new Solr feature: facet heatmaps. It works great: I
> > create clusters on my map with counts. When a user clicks on a cluster I zoom
> in
> > to that area and I might show him more clusters or documents (based on the
> current
> > zoom level).
> >
> > But all my cluster icons (I use a round one, see screenshot below) are placed
> > straight in the center of the clusters' rectangles:
> >
> > https://dl.dropboxusercontent.com/u/1999619/images/map_grid3.png
> >
> > Some clusters can be in the sea and so on. Also it doesn't feel natural in my
> case
> > to have icons placed orderly on the world map.
> >
> > I want to place cluster icons at average coords based on the coordinates of
> > all my docs inside the cluster. Is there any way to achieve this? I am trying
> > to use the stats component for the facet heatmap but it isn't implemented yet.
> >
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>