Re: Indexing books, chapters and pages

2016-03-01 Thread Alexandre Rafalovitch
Here is an - untested - possible approach. I might be missing
something by combining these things in too many layers, but:

1) Have chapters as parent documents and pages as children within that.
Block index them together.
2) On pages, include page text (probably not stored) as one field.
Also include a second field that has the last paragraph of that page as
well as the first paragraph of the next page. This gives you phrase
matches across page boundaries. Also include pageId, etc.
3) On chapters, include book id as a string field.
4) Use block join query to search against pages, but return (parent)
chapters 
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
5) Use grouping or collapsing+expanding by book id to group chapters
within a book: https://cwiki.apache.org/confluence/display/solr/Result+Grouping
or https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
6) Use [child] DocumentTransformer to get pages back with childFilter
to re-limit them by your query:
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory

The main question is whether 6) will be able to piggyback on the
output of 5). And, of course, there is the question of performance...
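For concreteness, here is a completely untested sketch of what 4), 5) and 6)
might look like in a single request (the field names book_id, is_chapter and
page_text are made up for illustration):

curl "http://localhost:8983/solr/books/select" \
  --data-urlencode 'q={!parent which="is_chapter:true"}page_text:broken' \
  --data-urlencode 'fq={!collapse field=book_id}' \
  --data-urlencode 'expand=true' \
  --data-urlencode 'fl=id,chapter_title,[child parentFilter="is_chapter:true" childFilter="page_text:broken"]'

The q parameter searches pages but returns chapters, the collapse filter keeps
one chapter per book (expand=true brings back the other chapters of that book),
and the [child] transformer re-attaches only the matching pages to each
returned chapter.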

I would love to know if this works, even partially. Either on the
mailing list or directly.

Regards,
   Alex.


Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 2 March 2016 at 00:50, Zaccheo Bagnati  wrote:
> Thank you, Jack for your answer.
> There are 2 reasons:
> 1. the requirement is to show in the result list both books and chapters
> grouped, so I would have to execute the query grouping by book, retrieve
> first, let's say, 10 books (sorted by relevance) and then for each book
> repeat the query grouping by chapter (always ordering by relevance) in
> order to obtain what we need (unfortunately it is not up to me defining the
> requirements... but it however makes sense). Unless there exists some SOLR
> feature to do this in only one call (and that would be great!).
> 2. searching on pages will not match phrases that span across 2 pages
> (e.g. if the last word of page 1 is "broken" and the first word of page 2 is
> "sentence", searching for "broken sentence" will not match)
> However, if we do not find a better solution, I think your proposal is
> not so bad... I hope that reason #2 is negligible and that #1
> performs fast enough even though we are multiplying queries.
>
> On Tue, Mar 1, 2016 at 14:28, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
>> Any reason not to use the simplest structure - each page is one Solr
>> document with a book field, a chapter field, and a page text field? You can
>> then use grouping to group results by book (title text) or even chapter
>> (title text and/or number). Maybe initially group by book and then if the
>> user selects a book group you can re-query with the specific book and then
>> group by chapter.
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
>> wrote:
>>
>> > Original data is quite well structured: it comes in XML with chapters and
>> > tags to mark the original page breaks on the paper version. In this way
>> we
>> > have the possibility to restructure it almost as we want before creating
>> > SOLR index.
>> >
>> > On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
>> > jack.krupan...@gmail.com> wrote:
>> >
>> > > To start, what is the form of your input data - is it already divided
>> > into
>> > > chapters and pages? Or... are you starting with raw PDF files?
>> > >
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
>> > > wrote:
>> > >
>> > > > Hi all,
>> > > > I'm searching for ideas on how to define schema and how to perform
>> > > queries
>> > > > in this use case: we have to index books, each book is split into
>> > > chapters
>> > > > and chapters are split into pages (pages represent original page
>> > cutting
>> > > in
>> > > > printed version). We should show the result grouped by books and
>> > chapters
>> > > > (for the same book) and pages (for the same chapter). As far as I
>> know,
>> > > we
>> > > > have 2 options:
>> > > >
>> > > > 1. index pages as SOLR documents. In this way we could theoretically
>> > > > retrieve chapters (and books?)  using grouping but
>> > > > a. we will miss matches across two contiguous pages (page cutting
>> > is
>> > > > only due to typographical needs so concepts could be split... as in
>> > > printed
>> > > > books)
>> > > > b. I don't know if it is possible in SOLR to group results on two
>> > > > different levels (books and chapters)
>> > > >
>> > > > 2. index chapters as SOLR documents. In this case we will have the
>> > right
>> > > > matches but how to obtain the matching pages? (we need pages because
>> > the
>> > > > client can only display pages)
>> > > >
>

Re: Pull request protocol question

2016-03-01 Thread Jan Høydahl
Hi,

Yes, the GitHub repo changed when we switched from svn to git, and you did the 
right thing. Please see 
http://lucene.apache.org/solr/news.html#8-february-2016-apache-lucenesolr-development-moves-to-git

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 1 Mar 2016, at 18:42, Demian Katz  wrote:
> 
> Hello,
> 
> A few weeks ago, I submitted a pull request to Solr in association with a 
> JIRA ticket, and it was eventually merged.
> 
> More recently, I had an almost-trivial change I wanted to share, but on 
> GitHub, my Solr fork appeared to have changed upstreams. Was the whole Solr 
> repo moved and regenerated or something?
> 
> In any case, I ended up submitting my proposal using a new fork of 
> apache/lucene-solr. It's visible here:
> 
> https://github.com/apache/lucene-solr/pull/13
> 
> However, due to the weirdness of the switching upstreams, I thought I'd 
> better check in here and make sure I put this in the right place!
> 
> thanks,
> Demian



Re: both way synonyms with ManagedSynonymFilterFactory

2016-03-01 Thread Jan Høydahl
Thanks for reporting!

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 1 Mar 2016, at 13:31, Bjørn Hjelle  wrote:
> 
> Thanks a lot for following up on this and creating the patch!
> 
> On Thu, Feb 25, 2016 at 2:49 PM, Jan Høydahl  wrote:
> 
>> Created https://issues.apache.org/jira/browse/SOLR-8737 to handle this
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> On 22 Feb 2016, at 11:21, Jan Høydahl  wrote:
>>> 
>>> Hi
>>> 
>>> Did you get any Further with this?
>>> I reproduced your situation with Solr 5.5.
>>> 
>>> Think the issue here is that when the SynonymFilter is created based on
>> the managed map, option “expand” is always set to “false”, while the
>> default for file-based synonym dictionary is “true”.
>>> 
>>> So with expand=false, what happens is that the input word (e.g. “mb”) is
>> *replaced* with the synonym “megabytes”. Confusingly enough, when synonyms
>> are applied both on index and query side, your document will contain
>> “megabytes” instead of “mb”, but when you query for “mb”, the same happens
>> on query side, so you will actually match :-)
>>> 
>>> I think what we need is to switch default to expand=true, and make it
>> configurable also in the managed factory.
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
 On 11 Feb 2016, at 10:16, Bjørn Hjelle  wrote:
 
 Hi,
 
 one-way managed synonyms seems to work fine, but I cannot make both-way
 synonyms work.
 
 Steps to reproduce with Solr 5.4.1:
 
 1. create a core:
 $ bin/solr create_core -c test -d server/solr/configsets/basic_configs
 
 2. edit schema.xml so fieldType text_general looks like this:
 
  >>> positionIncrementGap="100">

  
  >>> />
  

  
 
 3. reload the core:
 
 $ curl -X GET "
 http://localhost:8983/solr/admin/cores?action=RELOAD&core=test";
 
 4. add synonyms, one one-way synonym, one two-way, reload the core
>> again:
 
 $ curl -X PUT -H 'Content-type:application/json' --data-binary
 '{"mad":["angry","upset"]}' "
 http://localhost:8983/solr/test/schema/analysis/synonyms/english";
 $ curl -X PUT -H 'Content-type:application/json' --data-binary
 '["mb","megabytes"]' "
 http://localhost:8983/solr/test/schema/analysis/synonyms/english";
 $ curl -X GET "
 http://localhost:8983/solr/admin/cores?action=RELOAD&core=test";
 
 5. list the synonyms:
 {
 "responseHeader":{
  "status":0,
  "QTime":0},
 "synonymMappings":{
  "initArgs":{"ignoreCase":false},
  "initializedOn":"2016-02-11T09:00:50.354Z",
  "managedMap":{
"mad":["angry",
  "upset"],
"mb":["megabytes"],
"megabytes":["mb"]}}}
 
 
 6. add two documents:
 
 $ bin/post -c test -type 'application/json' -d '[{"id" : "1", "title_t"
>> :
 "10 megabytes makes me mad" },{"id" : "2", "title_t" : "100 mb should be
 sufficient" }]'
 $ bin/post -c test -type 'application/json' -d '[{"id" : "2", "title_t"
>> :
 "100 mb should be sufficient" }]'
 
 7. search for the documents:
 
 - all these return the first document, so one-way synonyms work:
 $ curl -X GET "
 http://localhost:8983/solr/test/select?q=title_t:angry&indent=true";
 $ curl -X GET "
 http://localhost:8983/solr/test/select?q=title_t:upset&indent=true";
 $ curl -X GET "
 http://localhost:8983/solr/test/select?q=title_t:mad&indent=true";
 
 - this only returns the document with "mb":
 
 $ curl -X GET "
 http://localhost:8983/solr/test/select?q=title_t:mb&indent=true";
 
 - this only returns the document with "megabytes"
 
 $ curl -X GET "
 http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true";
 
 
 Any input on how to make this work would be appreciated.
 
 Thanks,
 Bjørn
>>> 
>> 
>> 



Re: ExtendedDisMax configuration nowhere to be found

2016-03-01 Thread Jan Høydahl
We have a huge backlog of stale wiki.apache.org pages which should really just 
point to the refGuide.
I replaced the eDisMax and DisMax pages with a simple link to the ref guide, 
since they do not provide any added value.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 1 Mar 2016, at 01:10, Alexandre Rafalovitch  wrote:
> 
> On 29 February 2016 at 09:40,   wrote:
>> I have no problem with automatic. It is "automagical" stuff that I find a 
>> bit hard to like, i.e. things that are automatic but don't explain how and 
>> why they are automatic. But Disney Land and Disney World are actually really 
>> good examples of places where the magic stuff is suitable, i.e. in theme parks, 
>> designed mostly for kids. In the grown-up world of IT, most people prefer 
>> logical and documented stuff, not things that "just work" without 
>> explaining why. No offence :)
> 
> I agree. Especially after 3 years of technical support for a large
> commercial product, I understand the price of 'automagical'. Solr does
> have a bit of that. And latest 5.x Solr is even more automagical, so
> when things work - it is fabulous. When they do not - it is a bit
> mysterious.
> 
> My solution was to document what I learned and create the resource
> site for others, which has been quite popular (
> http://www.solr-start.com ).
> 
> I also wrote a book specifically for beginners bringing together
> different parts of documentation to explain the automagical parts.
> https://www.packtpub.com/big-data-and-business-intelligence/instant-apache-solr-indexing-data-how
> . It covered the latest (at the time) Solr 4.3. I no longer recommend
> it to anybody on Solr 5, but you may still find it useful for Solr
> 4.6. Unfortunately, all my discount codes are no longer valid :-(
> 
> I am also working on some additional material both for beginners and
> advanced users that will be announced on my Solr Start mailing list as
> well as writing individual pieces on my blog (e.g.
> http://blog.outerthoughts.com/2015/11/learning-solr-comprehensively/
> ).
> 
> In reality, the automagical stuff is explained. The problem is that it is
> explained on the Wiki vs. the Reference Guide vs. individual blogs vs. Solr
> Revolution videos vs. ... The discovery of information is a
> significant problem for Solr, just like it is for any open source
> project.
> 
> Regards,
>   Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/



Re: SolrCloud - Strategy for recovering cluster states

2016-03-01 Thread Jeff Wartes

I’ve been running SolrCloud clusters in various versions for a few years here, 
and I can only think of two or three cases where the ZK-stored cluster state was 
broken in a way that required me to manually intervene by hand-editing the contents 
of ZK. I think I’ve seen Solr fixes go by for those cases, too. I’ve never 
completely wiped ZK. (Although granted, my ZK cluster has been pretty stable, 
and my collection count is smaller than yours)

My philosophy is that ZK is the source of cluster configuration, not the 
collection of core.properties files on the nodes. 
Currently, cluster state is shared between ZK and core directories. I’d prefer, 
and I think Solr development is going this way, (SOLR-7269) that all cluster 
state exist and be managed via ZK, and all state be removed from the local disk 
of the cluster nodes. The fact that a node uses local disk based configuration 
to figure out what collections/replicas it has is something that should be 
fixed, in my opinion.

If you’re frequently getting into bad states due to ZK issues, I’d suggest you 
file bugs against Solr for the fact that you got into the state, and then fix 
your ZK cluster.

Failing that, can you just periodically back up your ZK data and restore it if 
something breaks? I wrote a little tool to watch clusterstate.json and write 
every version to a local git repo a few years ago. I was mostly interested 
because I wanted to see changes that happened pretty fast, but it could also 
serve as a backup approach. Here’s a link, although I clearly haven’t touched 
it lately. Feel free to ask if you have issues: 
https://github.com/randomstatistic/git_zk_monitor
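If you go the backup route, a cron job around the zkcli.sh that ships with Solr
might already be enough. A rough sketch (the paths, the ZK address and the
config name are assumptions):

#!/bin/sh
ZKHOST=localhost:2181
BACKUP_DIR=/var/backups/zk/$(date +%Y%m%d%H%M)
mkdir -p "$BACKUP_DIR"
# Snapshot the cluster state...
/opt/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost $ZKHOST \
  -cmd getfile /clusterstate.json "$BACKUP_DIR/clusterstate.json"
# ...and the uploaded config set, so both can be re-uploaded if ZK is ever wiped.
/opt/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost $ZKHOST \
  -cmd downconfig -confdir "$BACKUP_DIR/conf" -confname myconfig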




On 3/1/16, 12:09 PM, "danny teichthal"  wrote:

>Hi,
>Just summarizing my questions in case the long mail is a little intimidating:
>1. Is there a best practice or automated tool for overcoming problems in
>cluster state coming from ZooKeeper disconnections?
>2. Creating a collection via the core admin is discouraged; is that also true for
>core.properties discovery?
>
>I would like to be able to specify collection.configName in
>core.properties and, when starting the server, have the collection created
>and linked to the specified config name.
>
>
>
>On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal 
>wrote:
>
>> Hi,
>>
>>
>> I would like to describe a process we use for overcoming problems in
>> cluster state when we have networking issues. I would appreciate it if anyone
>> can point out the flaws in this solution and what the best
>> practice is for recovery in case of network problems involving ZooKeeper.
>> I'm working with Solr Cloud with version 5.2.1
>> ~100 collections in a cluster of 6 machines.
>>
>> This is the short procedure:
>> 1. Bring all the cluster down.
>> 2. Clear all data from zookeeper.
>> 3. Upload configuration.
>> 4. Restart the cluster.
>>
>> We rely on the fact that a collection is created on core discovery
>> process, if it does not exist. It gives us much flexibility.
>> When the cluster comes up, it reads from core.properties and creates the
>> collections if needed.
>> Since we have only one configuration, the collections are automatically
>> linked to it and the cores inherit it from the collection.
>> This is a very robust procedure, that helped us overcome many problems
>> until we stabilized our cluster which is now pretty stable.
>> I know that the leader might change in such case and may lose updates, but
>> it is ok.
>>
>>
>> The problem is that today I want to add a new config set.
>> When I add it and clear zookeeper, the cores cannot be created because
>> there are 2 configurations. This breaks my recovery procedure.
>>
>> I thought about a few options:
>> 1. Put the config Name in core.properties - this doesn't work. (It is
>> supported in CoreAdminHandler, but  is discouraged according to
>> documentation)
>> 2. Change recovery procedure to not delete all data from zookeeper, but
>> only relevant parts.
>> 3. Change recovery procedure to delete all, but recreate and link
>> configurations for all collections before startup.
>>
>> Option #1 is my favorite, because it is very simple, it is currently not
>> supported, but from looking on code it looked like it is not complex to
>> implement.
>>
>>
>>
>> My questions are:
>> 1. Is there something wrong in the recovery procedure that I described ?
>> 2. What is the best way to fix problems in cluster state, except from
>> editing clusterstate.json manually? Is there an automated tool for that? We
>> have about 100 collections in a cluster, so editing is not really a
>> solution.
>> 3. Is creating a collection via core.properties also discouraged?
>>
>>
>>
>> Would very much appreciate any answers/thoughts on that.
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>>


Re: SolrCloud - Strategy for recovering cluster states

2016-03-01 Thread danny teichthal
Hi,
Just summarizing my questions in case the long mail is a little intimidating:
1. Is there a best practice or automated tool for overcoming problems in
cluster state coming from ZooKeeper disconnections?
2. Creating a collection via the core admin is discouraged; is that also true for
core.properties discovery?

I would like to be able to specify collection.configName in
core.properties and, when starting the server, have the collection created
and linked to the specified config name.
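To make option #1 concrete, this is roughly the core.properties I would like
core discovery to honor (names are made up; the collection.configName line is
exactly the part that is currently only accepted through CoreAdminHandler):

name=mycollection_shard1_replica1
collection=mycollection
shard=shard1
collection.configName=myconf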



On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal 
wrote:

> Hi,
>
>
> I would like to describe a process we use for overcoming problems in
> cluster state when we have networking issues. I would appreciate it if anyone
> can point out the flaws in this solution and what the best
> practice is for recovery in case of network problems involving ZooKeeper.
> I'm working with Solr Cloud with version 5.2.1
> ~100 collections in a cluster of 6 machines.
>
> This is the short procedure:
> 1. Bring all the cluster down.
> 2. Clear all data from zookeeper.
> 3. Upload configuration.
> 4. Restart the cluster.
>
> We rely on the fact that a collection is created on core discovery
> process, if it does not exist. It gives us much flexibility.
> When the cluster comes up, it reads from core.properties and creates the
> collections if needed.
> Since we have only one configuration, the collections are automatically
> linked to it and the cores inherit it from the collection.
> This is a very robust procedure, that helped us overcome many problems
> until we stabilized our cluster which is now pretty stable.
> I know that the leader might change in such case and may lose updates, but
> it is ok.
>
>
> The problem is that today I want to add a new config set.
> When I add it and clear zookeeper, the cores cannot be created because
> there are 2 configurations. This breaks my recovery procedure.
>
> I thought about a few options:
> 1. Put the config Name in core.properties - this doesn't work. (It is
> supported in CoreAdminHandler, but  is discouraged according to
> documentation)
> 2. Change recovery procedure to not delete all data from zookeeper, but
> only relevant parts.
> 3. Change recovery procedure to delete all, but recreate and link
> configurations for all collections before startup.
>
> Option #1 is my favorite, because it is very simple, it is currently not
> supported, but from looking on code it looked like it is not complex to
> implement.
>
>
>
> My questions are:
> 1. Is there something wrong in the recovery procedure that I described ?
> 2. What is the best way to fix problems in cluster state, except from
> editing clusterstate.json manually? Is there an automated tool for that? We
> have about 100 collections in a cluster, so editing is not really a
> solution.
> 3. Is creating a collection via core.properties also discouraged?
>
>
>
> Would very much appreciate any answers/thoughts on that.
>
>
> Thanks,
>
>
>
>
>
>


Re: understand scoring

2016-03-01 Thread shamik
Doug, do we have a date for the hard copy launch?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/understand-scoring-tp4260837p4260860.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: understand scoring

2016-03-01 Thread Doug Turnbull
Supposedly late April or early May, but don't hold me to it until I see the copy
edits :) Of course, it looks like you can now read at least the full ebook in
MEAP form.

-Doug

On Tue, Mar 1, 2016 at 2:57 PM, shamik  wrote:

> Doug, do we have a date for the hard copy launch?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/understand-scoring-tp4260837p4260860.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Solr sort and facet of nested doc fields

2016-03-01 Thread Jhon Smith
I am looking for a Solr solution for this model: Product (common fields) -> SKU 
(color, size) and STORE (store_name) <-(price)-> SKU.
The listing contains only products, but the other facets (store names, colors) and 
sorting (by min price) should still work.

I can have 3 types of docs: products, SKUs and a store-SKU-price relation, or at 
least the last 2 (product fields in SKU docs with redundancy). Or we can make them 
nested.
If I place them as nested docs then the problems are:
1. I cannot figure out how to sort parent docs using a function of a nested doc 
field (I want SKUs to be sorted by the min price among all nested docs of this 
SKU), and then further sort documents by a function over these sorted SKUs.
2. I cannot get facets from the lowest inner level (store_names) at the highest 
parent level (documents). I.e. how can I use facets retrieved at the SKU level 
(with child.facet.field) at the document level?

It seems that the simpler approach is to keep product fields in SKU documents. Then 
we get only two levels: SKU and store_names/price. And SKUs can then be grouped 
to get documents and working pagination.
But there are 2 problems again:
1. I still cannot figure out how to sort SKUs by the min price of their nested docs.
2. When grouping SKU docs with group.facet=true, facet counts of SKU doc fields 
are grouped (so they relate to groups), while facet counts of fields from nested 
docs relate to SKU docs, not groups: i.e. facets from child.facet.field are not 
grouped even when SKU field facets are grouped.
Any help?


Pull request protocol question

2016-03-01 Thread Demian Katz
Hello,

A few weeks ago, I submitted a pull request to Solr in association with a JIRA 
ticket, and it was eventually merged.

More recently, I had an almost-trivial change I wanted to share, but on GitHub, 
my Solr fork appeared to have changed upstreams. Was the whole Solr repo moved 
and regenerated or something?

In any case, I ended up submitting my proposal using a new fork of 
apache/lucene-solr. It's visible here:

https://github.com/apache/lucene-solr/pull/13

However, due to the weirdness of the switching upstreams, I thought I'd better 
check in here and make sure I put this in the right place!

thanks,
Demian


Re: understand scoring

2016-03-01 Thread Doug Turnbull
Your screenshot doesn't seem to carry over. We don't have permission to
access files in your personal Gmail.

But I might suggest pasting your Solr URL into Splainer: http://splainer.io.
It's a tool we use to explain Solr results.

I might further suggest this handy book :-p
https://www.manning.com/books/relevant-search Message me directly and I'll
happily share a discount code.

Best
-Doug

On Tue, Mar 1, 2016 at 12:12 PM, michael solomon 
wrote:

> Hi all,
> I'm struggling to understand Solr scoring and can't understand why I get
> those results:
> [image: Inline image 1]
> (If don't see pic:
> https://mail.google.com/mail/u/0/?ui=2&ik=f570232aa3&view=fimg&th=153332681af9c93f&attid=0.1&disp=emb&realattid=ii_153332681af9c93f&attbid=ANGjdJ-af_Q3b_h02w_TyMUCG5JHSl75pLKOLJC0nXIOzp9ypz6FOG2fbk7RvkGM-dkb2MLguNgAjFMbigbW_VqO4Z-YpMxBGWUc7-T3q25XnFyeijoNzY_Fi6gRzhs&sz=s0-l75&ats=1456852075298&rm=153332681af9c93f&zw
>
> I expected that the order would be 1,3,2 (because 1 is the shortest field [4
> words], and 3 before 2 because of the distance between the words...)
> Thank you,
> Michael
>



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


understand scoring

2016-03-01 Thread michael solomon
Hi all,
I'm struggling to understand Solr scoring and can't understand why I get
those results:
[image: Inline image 1]
(If don't see pic:
https://mail.google.com/mail/u/0/?ui=2&ik=f570232aa3&view=fimg&th=153332681af9c93f&attid=0.1&disp=emb&realattid=ii_153332681af9c93f&attbid=ANGjdJ-af_Q3b_h02w_TyMUCG5JHSl75pLKOLJC0nXIOzp9ypz6FOG2fbk7RvkGM-dkb2MLguNgAjFMbigbW_VqO4Z-YpMxBGWUc7-T3q25XnFyeijoNzY_Fi6gRzhs&sz=s0-l75&ats=1456852075298&rm=153332681af9c93f&zw

I expected that the order would be 1,3,2 (because 1 is the shortest field [4
words], and 3 before 2 because of the distance between the words...)
Thank you,
Michael


Re: SolrJ 5.5 won't work with any of my servers

2016-03-01 Thread Shawn Heisey
On 3/1/2016 9:30 AM, Shai Erera wrote:
> Ah ok, in my case even 5.4.1 didn't work with binary request writer, so
> probably we don't face the same issue.

If I set the writer to binary on 5.4.1, it fails too.

My intent when I wrote the program was to use the binary writer, but
apparently I didn't actually implement that.  5.4.1 works with the
default (xml) writer.  I have not tried 5.5 with the xml writer, but I
bet that would work.

I just added a null check to the part of my code that does doc.addField,
skipping the add if the object is null.  This appears to have fixed the
problem when using SolrJ 5.5.  This is also a good idea for my program
in general, so I'm not really unhappy about needing it.
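The change was essentially this (a simplified sketch, names are made up):

import org.apache.solr.common.SolrInputDocument;

// Only add the field when the value coming from the database is non-null, so
// the javabin request writer never sees a null object as a field value.
static void addIfNotNull(SolrInputDocument doc, String fieldName, Object value) {
    if (value != null) {
        doc.addField(fieldName, value);
    }
}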

So, I believe what happens with the binary writer is that when the field
contains a null object, the writer is adding the literal string "NULL"
(in uppercase) as the field value.  The XML writer apparently handles
this situation by not including the field.

Thanks,
Shawn



Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
The chapter seems like the optimal unit for initial searches - just combine
the page text with a line break between them or index as a multivalued
field and set the position increment gap to be 1 so that phrases work.
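For the multivalued route, the schema.xml side would look roughly like this
(untested; field and type names are made up, and it is worth checking on the
analysis screen whether a gap of 0 or 1 gives exact phrase matches across page
boundaries):

<fieldType name="text_pages" class="solr.TextField" positionIncrementGap="1">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- One value per page of the chapter. -->
<field name="chapter_text" type="text_pages" indexed="true" stored="true" multiValued="true"/>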

You could have a separate collection for pages, with each page as a Solr
document, but include the last line of text from the previous page and the
first line of text from the next page so that phrases will match across
page boundaries. Unfortunately, that may also result in false hits if the
full phrase is found on the two adopted lines. That would require some
special filtering to eliminate those false positives.

There is also the question of maximum phrase size - most phrases tend to be
reasonably short, but sometimes people may want to search for an entire
paragraph (e.g., a quote) that may span multiple lines on two adjacent
pages.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 11:30 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi,
> From the top of my head - probably does not solve problem completely, but
> may trigger brainstorming: Index chapters and include page break tokens.
> Use highlighting to return matches and make sure fragment size is large
> enough to get page break token. In such scenario you should use slop for
> phrase searches...
>
> More I write it, less I like it, but will not delete...
>
> Regards,
> Emir
>
>
> On 01.03.2016 12:56, Zaccheo Bagnati wrote:
>
>> Hi all,
>> I'm searching for ideas on how to define schema and how to perform queries
>> in this use case: we have to index books, each book is split into chapters
>> and chapters are split into pages (pages represent original page cutting
>> in
>> printed version). We should show the result grouped by books and chapters
>> (for the same book) and pages (for the same chapter). As far as I know, we
>> have 2 options:
>>
>> 1. index pages as SOLR documents. In this way we could theoretically
>> retrieve chapters (and books?)  using grouping but
>>  a. we will miss matches across two contiguous pages (page cutting is
>> only due to typographical needs so concepts could be split... as in
>> printed
>> books)
>>  b. I don't know if it is possible in SOLR to group results on two
>> different levels (books and chapters)
>>
>> 2. index chapters as SOLR documents. In this case we will have the right
>> matches but how to obtain the matching pages? (we need pages because the
>> client can only display pages)
>>
>> we have been struggling on this problem for a lot of time and we're  not
>> able to find a suitable solution so I'm looking if someone has ideas or
>> has
>> already solved a similar issue.
>> Thanks
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: SolrJ 5.5 won't work with any of my servers

2016-03-01 Thread Shai Erera
Ah ok, in my case even 5.4.1 didn't work with binary request writer, so
probably we don't face the same issue.

Shai

On Tue, Mar 1, 2016, 17:07 Shawn Heisey  wrote:

> On 2/29/2016 9:14 PM, Shai Erera wrote:
> > Shawn, not sure if it's the same case as yours, but I've hit NPEs
> upgrading
> > to 5.5 too. In my case though, SolrJ talks to a proxy servlets before the
> > request gets routed to Solr, and that servlet didn't handle binary
> content
> > stream well.
> >
> > I had to add another resource method to the servlet which handled
> > "application/javabin" and "application/octet-stream" and received the body
> > as an InputStream.
>
> I wish this was an NPE.  It would be easier to track down.
>
> There is no proxy.  Although I do have a load balancer (haproxy) in
> place for queries, this program doesn't use it.
>
> It looks to me like SolrJ 5.5 behaves in a way that changes what gets
> sent when talking to an existing Solr install using the default Jetty
> config.  This seems like a bug to me.
>
> I need to investigate what happens in my code when MySQL returns NULL
> values.  I suspect that the object assigned to the SolrInputDocument
> field is null.  Whatever it is that happens, SolrJ 5.4.1 works correctly
> and 5.5.0 doesn't.
>
> Thanks,
> Shawn
>
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Emir Arnautovic

Hi,
Off the top of my head - it probably does not solve the problem completely, 
but may trigger brainstorming: index chapters and include page break 
tokens. Use highlighting to return matches and make sure the fragment size 
is large enough to include a page break token. In such a scenario you should use 
slop for phrase searches...


More I write it, less I like it, but will not delete...
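Something like this, roughly (untested; the field name, the marker token and
the numbers are made up). Each chapter is indexed with a marker such as
PAGEBREAK_13 injected at every page boundary, then:

curl "http://localhost:8983/solr/books/select" \
  --data-urlencode 'q=chapter_text:"broken sentence"~2' \
  --data-urlencode 'hl=true' \
  --data-urlencode 'hl.fl=chapter_text' \
  --data-urlencode 'hl.fragsize=300'

The slop absorbs a marker token sitting between the two words, and the
PAGEBREAK_NN token that appears inside the returned highlight fragment tells
you which page the match falls on.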

Regards,
Emir

On 01.03.2016 12:56, Zaccheo Bagnati wrote:

Hi all,
I'm searching for ideas on how to define schema and how to perform queries
in this use case: we have to index books, each book is split into chapters
and chapters are split into pages (pages represent original page cutting in
printed version). We should show the result grouped by books and chapters
(for the same book) and pages (for the same chapter). As far as I know, we
have 2 options:

1. index pages as SOLR documents. In this way we could theoretically
retrieve chapters (and books?)  using grouping but
 a. we will miss matches across two contiguous pages (page cutting is
only due to typographical needs so concepts could be split... as in printed
books)
 b. I don't know if it is possible in SOLR to group results on two
different levels (books and chapters)

2. index chapters as SOLR documents. In this case we will have the right
matches but how to obtain the matching pages? (we need pages because the
client can only display pages)

we have been struggling on this problem for a lot of time and we're  not
able to find a suitable solution so I'm looking if someone has ideas or has
already solved a similar issue.
Thanks



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Indexing books, chapters and pages

2016-03-01 Thread Walter Underwood
You could index both pages and chapters, with a type field.

You could index by chapter with the page number as a payload for each token.
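For the payload variant, the analysis chain could look roughly like this (a
sketch; each token would be fed in as word|pagenumber at index time, and
reading the payload back out at query time needs custom code):

<fieldType name="text_paged" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "broken|17" becomes the token "broken" with page 17 as an integer payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="integer"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>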

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati  wrote:
> 
> Thank you, Jack for your answer.
> There are 2 reasons:
> 1. the requirement is to show in the result list both books and chapters
> grouped, so I would have to execute the query grouping by book, retrieve
> first, let's say, 10 books (sorted by relevance) and then for each book
> repeat the query grouping by chapter (always ordering by relevance) in
> order to obtain what we need (unfortunately it is not up to me defining the
> requirements... but it however makes sense). Unless there exists some SOLR
> feature to do this in only one call (and that would be great!).
> 2. searching on pages will not match phrases that span across 2 pages
> (e.g. if the last word of page 1 is "broken" and the first word of page 2 is
> "sentence", searching for "broken sentence" will not match)
> However, if we do not find a better solution, I think your proposal is
> not so bad... I hope that reason #2 is negligible and that #1
> performs fast enough even though we are multiplying queries.
> 
> On Tue, Mar 1, 2016 at 14:28, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
> 
>> Any reason not to use the simplest structure - each page is one Solr
>> document with a book field, a chapter field, and a page text field? You can
>> then use grouping to group results by book (title text) or even chapter
>> (title text and/or number). Maybe initially group by book and then if the
>> user selects a book group you can re-query with the specific book and then
>> group by chapter.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
>> wrote:
>> 
>>> Original data is quite well structured: it comes in XML with chapters and
>>> tags to mark the original page breaks on the paper version. In this way
>> we
>>> have the possibility to restructure it almost as we want before creating
>>> SOLR index.
>>> 
>>> On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
>>> jack.krupan...@gmail.com> wrote:
>>> 
 To start, what is the form of your input data - is it already divided
>>> into
 chapters and pages? Or... are you starting with raw PDF files?
 
 
 -- Jack Krupansky
 
 On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
 wrote:
 
> Hi all,
> I'm searching for ideas on how to define schema and how to perform
 queries
> in this use case: we have to index books, each book is split into
 chapters
> and chapters are split into pages (pages represent original page
>>> cutting
 in
> printed version). We should show the result grouped by books and
>>> chapters
> (for the same book) and pages (for the same chapter). As far as I
>> know,
 we
> have 2 options:
> 
> 1. index pages as SOLR documents. In this way we could theoretically
> retrieve chapters (and books?)  using grouping but
>a. we will miss matches across two contiguous pages (page cutting
>>> is
> only due to typographical needs so concepts could be split... as in
 printed
> books)
>b. I don't know if it is possible in SOLR to group results on two
> different levels (books and chapters)
> 
> 2. index chapters as SOLR documents. In this case we will have the
>>> right
> matches but how to obtain the matching pages? (we need pages because
>>> the
> client can only display pages)
> 
> we have been struggling on this problem for a lot of time and we're
>>> not
> able to find a suitable solution so I'm looking if someone has ideas
>> or
 has
> already solved a similar issue.
> Thanks
> 
 
>>> 
>> 



RE: Solr regex documenation

2016-03-01 Thread Markus Jelsma
Just keep in mind that the regex operates on tokenized and filtered tokens if you 
use solr.TextField, but on the verbatim input in the case of a string field.
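For example (core and field names are made up; note the -g so curl does not try
to expand the [a-z] brackets itself, and the + encoded as %2B):

$ curl -g "http://localhost:8983/solr/mycore/select?q=title_t:/net[a-z]%2B/&indent=true"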
Markus 
 
-Original message-
> From:Anil 
> Sent: Tuesday 1st March 2016 16:28
> To: solr-user@lucene.apache.org
> Subject: Re: Solr regex documenation
> 
> Regex is working, Markus. I need to investigate this particular pattern.
> Thanks for your responses.
> 
> On 29 February 2016 at 19:16, Markus Jelsma 
> wrote:
> 
> > Hmm, if you have some stemming algorithm on that field, [a-z]+works is
> > never going to work but [a-z]+work should. If the field contains Juniper
> > Networks, [a-z]+works is not going to be found due to -s being stripped.
> > But if the field is not tokenized net[a-z]+ is also not going to find
> > anything. You can always test with q=field:/.*/ to prove regex works.
> >
> > -Original message-
> > > From:Anil 
> > > Sent: Monday 29th February 2016 14:23
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Solr regex documenation
> > >
> > > yes. when i search on juniper networks without regex, i can see the
> > results.
> > >
> > > But when I search on net[a-z]+ , i could not see juniper networks. i have
> > > looked all the documents in the results, could not find it.
> > >
> > > Thank you,
> > > Anil
> > >
> > > On 29 February 2016 at 18:42, Markus Jelsma 
> > > wrote:
> > >
> > > > Hmm, is the field indexed? A field:/[a-z]%2Bwork/ works fine overhere.
> > > > Markus
> > > >
> > > > -Original message-
> > > > > From:Anil 
> > > > > Sent: Monday 29th February 2016 13:24
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Solr regex documenation
> > > > >
> > > > > Yes Markus.
> > > > >
> > > > > On 29 February 2016 at 15:54, Markus Jelsma <
> > markus.jel...@openindex.io>
> > > > > wrote:
> > > > >
> > > > > > Hi - do you enclose the regex in slashes? Do you url encode the +
> > sign?
> > > > > > Markus
> > > > > >
> > > > > >
> > > > > >
> > > > > > -Original message-
> > > > > > > From:Anil 
> > > > > > > Sent: Monday 29th February 2016 7:45
> > > > > > > To: solr-user@lucene.apache.org
> > > > > > > Subject: Re: Solr regex documenation
> > > > > > >
> > > > > > > HI ,
> > > > > > >
> > > > > > > i am using [a-z]+works. i could not see networks in the solr
> > results.
> > > > > > >
> > > > > > > is it regex working properly in solr ? Please clarify.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Anil
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 27 February 2016 at 20:52, Anil  wrote:
> > > > > > >
> > > > > > > > Thanks Jack.
> > > > > > > >
> > > > > > > > On 27 February 2016 at 20:41, Jack Krupansky <
> > > > jack.krupan...@gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> See:
> > > > > > > >>
> > > > > > > >>
> > > > > >
> > > >
> > https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/RegexpQuery.html
> > > > > > > >>
> > > > > > > >>
> > > > > >
> > > >
> > https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/util/automaton/RegExp.html
> > > > > > > >>
> > > > > > > >> I vaguely recall a Jira about regex not working at all in
> > Solr. I
> > > > > > don't
> > > > > > > >> recall reading about a resolution.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> -- Jack Krupansky
> > > > > > > >>
> > > > > > > >> On Sat, Feb 27, 2016 at 7:05 AM, Anil 
> > wrote:
> > > > > > > >>
> > > > > > > >> > Hi,
> > > > > > > >> >
> > > > > > > >> > Can some one point me to the solr regex documentation ?
> > > > > > > >> >
> > > > > > > >> > i read it supports all java regex features.  i tried ^ and
> > $ ,
> > > > > > seems it
> > > > > > > >> is
> > > > > > > >> > not working.
> > > > > > > >> >
> > > > > > > >> > Thanks,
> > > > > > > >> > Anil
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 


Re: Solr regex documenation

2016-03-01 Thread Anil
Regex is working, Markus. I need to investigate this particular pattern.
Thanks for your responses.

On 29 February 2016 at 19:16, Markus Jelsma 
wrote:

> Hmm, if you have some stemming algorithm on that field, [a-z]+works is
> never going to work but [a-z]+work should. If the field contains Juniper
> Networks, [a-z]+works is not going to be found due to -s being stripped.
> But if the field is not tokenized net[a-z]+ is also not going to find
> anything. You can always test with q=field:/.*/ to prove regex works.
>
> -Original message-
> > From:Anil 
> > Sent: Monday 29th February 2016 14:23
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr regex documenation
> >
> > yes. when i search on juniper networks without regex, i can see the
> results.
> >
> > But when I search on net[a-z]+ , i could not see juniper networks. i have
> > looked all the documents in the results, could not find it.
> >
> > Thank you,
> > Anil
> >
> > On 29 February 2016 at 18:42, Markus Jelsma 
> > wrote:
> >
> > > Hmm, is the field indexed? A field:/[a-z]%2Bwork/ works fine overhere.
> > > Markus
> > >
> > > -Original message-
> > > > From:Anil 
> > > > Sent: Monday 29th February 2016 13:24
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Solr regex documenation
> > > >
> > > > Yes Markus.
> > > >
> > > > On 29 February 2016 at 15:54, Markus Jelsma <
> markus.jel...@openindex.io>
> > > > wrote:
> > > >
> > > > > Hi - do you enclose the regex in slashes? Do you url encode the +
> sign?
> > > > > Markus
> > > > >
> > > > >
> > > > >
> > > > > -Original message-
> > > > > > From:Anil 
> > > > > > Sent: Monday 29th February 2016 7:45
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Re: Solr regex documenation
> > > > > >
> > > > > > HI ,
> > > > > >
> > > > > > i am using [a-z]+works. i could not see networks in the solr
> results.
> > > > > >
> > > > > > is it regex working properly in solr ? Please clarify.
> > > > > >
> > > > > > Regards,
> > > > > > Anil
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 27 February 2016 at 20:52, Anil  wrote:
> > > > > >
> > > > > > > Thanks Jack.
> > > > > > >
> > > > > > > On 27 February 2016 at 20:41, Jack Krupansky <
> > > jack.krupan...@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> See:
> > > > > > >>
> > > > > > >>
> > > > >
> > >
> https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/RegexpQuery.html
> > > > > > >>
> > > > > > >>
> > > > >
> > >
> https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/util/automaton/RegExp.html
> > > > > > >>
> > > > > > >> I vaguely recall a Jira about regex not working at all in
> Solr. I
> > > > > don't
> > > > > > >> recall reading about a resolution.
> > > > > > >>
> > > > > > >>
> > > > > > >> -- Jack Krupansky
> > > > > > >>
> > > > > > >> On Sat, Feb 27, 2016 at 7:05 AM, Anil 
> wrote:
> > > > > > >>
> > > > > > >> > Hi,
> > > > > > >> >
> > > > > > >> > Can some one point me to the solr regex documentation ?
> > > > > > >> >
> > > > > > >> > i read it supports all java regex features.  i tried ^ and
> $ ,
> > > > > seems it
> > > > > > >> is
> > > > > > >> > not working.
> > > > > > >> >
> > > > > > >> > Thanks,
> > > > > > >> > Anil
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: SolrJ 5.5 won't work with any of my servers

2016-03-01 Thread Shawn Heisey
On 2/29/2016 9:14 PM, Shai Erera wrote:
> Shawn, not sure if it's the same case as yours, but I've hit NPEs upgrading
> to 5.5 too. In my case though, SolrJ talks to a proxy servlets before the
> request gets routed to Solr, and that servlet didn't handle binary content
> stream well.
>
> I had to add another resource method to the servlet which handled
> "application/javabin" and "application/octet-stream" and received the body
> as an InputStream.

I wish this was an NPE.  It would be easier to track down.

There is no proxy.  Although I do have a load balancer (haproxy) in
place for queries, this program doesn't use it.

It looks to me like SolrJ 5.5 behaves in a way that changes what gets
sent when talking to an existing Solr install using the default Jetty
config.  This seems like a bug to me.

I need to investigate what happens in my code when MySQL returns NULL
values.  I suspect that the object assigned to the SolrInputDocument
field is null.  Whatever it is that happens, SolrJ 5.4.1 works correctly
and 5.5.0 doesn't.

Thanks,
Shawn



Fwd: Standard highlighting doesn't work for Block Join

2016-03-01 Thread michael solomon
Hi,
I have Solr 5.4.1 and I'm trying to use the Block Join Query Parser to search
in children and return the parent.
I want to apply highlighting on the children but it returns empty.
My q parameter: "q={!parent which="is_parent:true"} normal_text:(account)"
highlight parameters:
"hl=true&hl.fl=normal_text&hl.simple.pre=&hl.simple.post="

and return:

> "highlighting": { "chikora.com": {} }
>

("chikora.com" is the id of the parent document)
It looks like this was already solved here:
https://issues.apache.org/jira/browse/LUCENE-5929
but I don't understand how to use it.

Thanks,
Michael
P.S: sorry about my English.. working on it :)


Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
Thank you, Jack for your answer.
There are 2 reasons:
1. the requirement is to show in the result list both books and chapters
grouped, so I would have to execute the query grouping by book, retrieve
first, let's say, 10 books (sorted by relevance) and then for each book
repeat the query grouping by chapter (always ordering by relevance) in
order to obtain what we need (unfortunately it is not up to me defining the
requirements... but it however makes sense). Unless there exists some SOLR
feature to do this in only one call (and that would be great!).
2. searching on pages will not match phrases that span across 2 pages
(e.g. if the last word of page 1 is "broken" and the first word of page 2 is
"sentence", searching for "broken sentence" will not match)
However, if we do not find a better solution, I think your proposal is
not so bad... I hope that reason #2 is negligible and that #1
performs fast enough even though we are multiplying queries.

On Tue, Mar 1, 2016 at 14:28, Jack Krupansky <
jack.krupan...@gmail.com> wrote:

> Any reason not to use the simplest structure - each page is one Solr
> document with a book field, a chapter field, and a page text field? You can
> then use grouping to group results by book (title text) or even chapter
> (title text and/or number). Maybe initially group by book and then if the
> user selects a book group you can re-query with the specific book and then
> group by chapter.
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
> wrote:
>
> > Original data is quite well structured: it comes in XML with chapters and
> > tags to mark the original page breaks on the paper version. In this way
> we
> > have the possibility to restructure it almost as we want before creating
> > SOLR index.
> >
> > On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
> > jack.krupan...@gmail.com> wrote:
> >
> > > To start, what is the form of your input data - is it already divided
> > into
> > > chapters and pages? Or... are you starting with raw PDF files?
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
> > > wrote:
> > >
> > > > Hi all,
> > > > I'm searching for ideas on how to define schema and how to perform
> > > queries
> > > > in this use case: we have to index books, each book is split into
> > > chapters
> > > > and chapters are split into pages (pages represent original page
> > cutting
> > > in
> > > > printed version). We should show the result grouped by books and
> > chapters
> > > > (for the same book) and pages (for the same chapter). As far as I
> know,
> > > we
> > > > have 2 options:
> > > >
> > > > 1. index pages as SOLR documents. In this way we could theoretically
> > > > retrieve chapters (and books?)  using grouping but
> > > > a. we will miss matches across two contiguous pages (page cutting
> > is
> > > > only due to typographical needs so concepts could be split... as in
> > > printed
> > > > books)
> > > > b. I don't know if it is possible in SOLR to group results on two
> > > > different levels (books and chapters)
> > > >
> > > > 2. index chapters as SOLR documents. In this case we will have the
> > right
> > > > matches but how to obtain the matching pages? (we need pages because
> > the
> > > > client can only display pages)
> > > >
> > > > we have been struggling on this problem for a lot of time and we're
> > not
> > > > able to find a suitable solution so I'm looking if someone has ideas
> or
> > > has
> > > > already solved a similar issue.
> > > > Thanks
> > > >
> > >
> >
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
Any reason not to use the simplest structure - each page is one Solr
document with a book field, a chapter field, and a page text field? You can
then use grouping to group results by book (title text) or even chapter
(title text and/or number). Maybe initially group by book and then if the
user selects a book group you can re-query with the specific book and then
group by chapter.
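A sketch of what the grouped query could look like (field names are made up):

curl "http://localhost:8983/solr/books/select" \
  --data-urlencode 'q=page_text:"broken sentence"' \
  --data-urlencode 'group=true' \
  --data-urlencode 'group.field=book_id' \
  --data-urlencode 'group.limit=5' \
  --data-urlencode 'fl=book_title,chapter_title,page_number,score'

Each group is then one book with its top matching pages, and re-issuing the
same query with an fq on the chosen book and group.field set to the chapter
field drills into chapters.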


-- Jack Krupansky

On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati  wrote:

> Original data is quite well structured: it comes in XML with chapters and
> tags to mark the original page breaks on the paper version. In this way we
> have the possibility to restructure it almost as we want before creating
> SOLR index.
>
> On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
> > To start, what is the form of your input data - is it already divided
> into
> > chapters and pages? Or... are you starting with raw PDF files?
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
> > wrote:
> >
> > > Hi all,
> > > I'm searching for ideas on how to define schema and how to perform
> > queries
> > > in this use case: we have to index books, each book is split into
> > chapters
> > > and chapters are split into pages (pages represent original page
> cutting
> > in
> > > printed version). We should show the result grouped by books and
> chapters
> > > (for the same book) and pages (for the same chapter). As far as I know,
> > we
> > > have 2 options:
> > >
> > > 1. index pages as SOLR documents. In this way we could theoretically
> > > retrieve chapters (and books?)  using grouping but
> > > a. we will miss matches across two contiguous pages (page cutting
> is
> > > only due to typographical needs so concepts could be split... as in
> > printed
> > > books)
> > > b. I don't know if it is possible in SOLR to group results on two
> > > different levels (books and chapters)
> > >
> > > 2. index chapters as SOLR documents. In this case we will have the
> right
> > > matches but how to obtain the matching pages? (we need pages because
> the
> > > client can only display pages)
> > >
> > > we have been struggling on this problem for a lot of time and we're
> not
> > > able to find a suitable solution so I'm looking if someone has ideas or
> > has
> > > already solved a similar issue.
> > > Thanks
> > >
> >
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
Original data is quite well structured: it comes in XML with chapters and
tags to mark the original page breaks on the paper version. In this way we
have the possibility to restructure it almost as we want before creating
SOLR index.

On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
jack.krupan...@gmail.com> wrote:

> To start, what is the form of your input data - is it already divided into
> chapters and pages? Or... are you starting with raw PDF files?
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
> wrote:
>
> > Hi all,
> > I'm searching for ideas on how to define schema and how to perform
> queries
> > in this use case: we have to index books, each book is split into
> chapters
> > and chapters are split into pages (pages represent original page cutting
> in
> > printed version). We should show the result grouped by books and chapters
> > (for the same book) and pages (for the same chapter). As far as I know,
> we
> > have 2 options:
> >
> > 1. index pages as SOLR documents. In this way we could theoretically
> > retrieve chapters (and books?)  using grouping but
> > a. we will miss matches across two contiguous pages (page cutting is
> > only due to typographical needs so concepts could be split... as in
> printed
> > books)
> > b. I don't know if it is possible in SOLR to group results on two
> > different levels (books and chapters)
> >
> > 2. index chapters as SOLR documents. In this case we will have the right
> > matches but how to obtain the matching pages? (we need pages because the
> > client can only display pages)
> >
> > we have been struggling on this problem for a lot of time and we're  not
> > able to find a suitable solution so I'm looking if someone has ideas or
> has
> > already solved a similar issue.
> > Thanks
> >
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
To start, what is the form of your input data - is it already divided into
chapters and pages? Or... are you starting with raw PDF files?


-- Jack Krupansky

On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati  wrote:

> Hi all,
> I'm searching for ideas on how to define schema and how to perform queries
> in this use case: we have to index books, each book is split into chapters
> and chapters are split into pages (pages represent original page cutting in
> printed version). We should show the result grouped by books and chapters
> (for the same book) and pages (for the same chapter). As far as I know, we
> have 2 options:
>
> 1. index pages as SOLR documents. In this way we could theoretically
> retrieve chapters (and books?)  using grouping but
> a. we will miss matches across two contiguous pages (page cutting is
> only due to typographical needs so concepts could be split... as in printed
> books)
> b. I don't know if it is possible in SOLR to group results on two
> different levels (books and chapters)
>
> 2. index chapters as SOLR documents. In this case we will have the right
> matches but how to obtain the matching pages? (we need pages because the
> client can only display pages)
>
> we have been struggling on this problem for a lot of time and we're  not
> able to find a suitable solution so I'm looking if someone has ideas or has
> already solved a similar issue.
> Thanks
>


Re: both way synonyms with ManagedSynonymFilterFactory

2016-03-01 Thread Bjørn Hjelle
Thanks a lot for following up on this and creating the patch!

On Thu, Feb 25, 2016 at 2:49 PM, Jan Høydahl  wrote:

> Created https://issues.apache.org/jira/browse/SOLR-8737 to handle this
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 22 Feb 2016, at 11:21, Jan Høydahl  wrote:
> >
> > Hi
> >
> > Did you get any Further with this?
> > I reproduced your situation with Solr 5.5.
> >
> > Think the issue here is that when the SynonymFilter is created based on
> the managed map, option “expand” is always set to “false”, while the
> default for file-based synonym dictionary is “true”.
> >
> > So with expand=false, what happens is that the input word (e.g. “mb”) is
> *replaced* with the synonym “megabytes”. Confusingly enough, when synonyms
> are applied both on index and query side, your document will contain
> “megabytes” instead of “mb”, but when you query for “mb”, the same happens
> on query side, so you will actually match :-)
> >
> > I think what we need is to switch default to expand=true, and make it
> configurable also in the managed factory.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> >> On 11 Feb 2016, at 10:16, Bjørn Hjelle  wrote:
> >>
> >> Hi,
> >>
> >> one-way managed synonyms seems to work fine, but I cannot make both-way
> >> synonyms work.
> >>
> >> Steps to reproduce with Solr 5.4.1:
> >>
> >> 1. create a core:
> >> $ bin/solr create_core -c test -d server/solr/configsets/basic_configs
> >>
> >> 2. edit schema.xml so fieldType text_general looks like this:
> >>
> >> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> >>   <analyzer>
> >>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>     <filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
> >>   </analyzer>
> >> </fieldType>
> >>
> >> 3. reload the core:
> >>
> >> $ curl -X GET "
> >> http://localhost:8983/solr/admin/cores?action=RELOAD&core=test";
> >>
> >> 4. add synonyms, one one-way synonym, one two-way, reload the core
> again:
> >>
> >> $ curl -X PUT -H 'Content-type:application/json' --data-binary
> >> '{"mad":["angry","upset"]}' "
> >> http://localhost:8983/solr/test/schema/analysis/synonyms/english";
> >> $ curl -X PUT -H 'Content-type:application/json' --data-binary
> >> '["mb","megabytes"]' "
> >> http://localhost:8983/solr/test/schema/analysis/synonyms/english";
> >> $ curl -X GET "
> >> http://localhost:8983/solr/admin/cores?action=RELOAD&core=test";
> >>
> >> 5. list the synonyms:
> >> {
> >> "responseHeader":{
> >>   "status":0,
> >>   "QTime":0},
> >> "synonymMappings":{
> >>   "initArgs":{"ignoreCase":false},
> >>   "initializedOn":"2016-02-11T09:00:50.354Z",
> >>   "managedMap":{
> >> "mad":["angry",
> >>   "upset"],
> >> "mb":["megabytes"],
> >> "megabytes":["mb"]}}}
> >>
> >>
> >> 6. add two documents:
> >>
> >> $ bin/post -c test -type 'application/json' -d '[{"id" : "1", "title_t"
> :
> >> "10 megabytes makes me mad" },{"id" : "2", "title_t" : "100 mb should be
> >> sufficient" }]'
> >> $ bin/post -c test -type 'application/json' -d '[{"id" : "2", "title_t"
> :
> >> "100 mb should be sufficient" }]'
> >>
> >> 7. search for the documents:
> >>
> >> - all these return the first document, so one-way synonyms work:
> >> $ curl -X GET "
> >> http://localhost:8983/solr/test/select?q=title_t:angry&indent=true";
> >> $ curl -X GET "
> >> http://localhost:8983/solr/test/select?q=title_t:upset&indent=true";
> >> $ curl -X GET "
> >> http://localhost:8983/solr/test/select?q=title_t:mad&indent=true";
> >>
> >> - this only returns the document with "mb":
> >>
> >> $ curl -X GET "
> >> http://localhost:8983/solr/test/select?q=title_t:mb&indent=true";
> >>
> >> - this only returns the document with "megabytes"
> >>
> >> $ curl -X GET "
> >> http://localhost:8983/solr/test/select?q=title_t:megabytes&indent=true";
> >>
> >>
> >> Any input on how to make this work would be appreciated.
> >>
> >> Thanks,
> >> Bjørn
> >
>
>
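
A quick way to check whether the managed synonyms are being replaced or
expanded is the field analysis API. This is only a sketch against the "test"
core and the "text_general" field type from the steps above, using the stock
analysis parameters:

$ curl -X GET "http://localhost:8983/solr/test/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=100%20mb&indent=true"

If the token output for "mb" contains only "megabytes", the term was replaced
(the expand=false behaviour described above); if it contains both "mb" and
"megabytes", it was expanded.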


behavior of ScriptTransformer in DIH has changed

2016-03-01 Thread Bernd Fehling
Just in case someone uses ScriptTransformer in DIH extensively
and is thinking about going from Java 7 to Java 8: some behavior
has changed due to the switch from Mozilla Rhino (Java 7) to
Oracle Nashorn (Java 8).

It took me a while to figure out why my DIH crashed after changing to Java 8.

A good help is "jrunscript", which comes with both Java 7 and Java 8,
and also "jjs", which only comes with Java 8.

Regards
Bernd


Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
That's fine. But how could I, for example, obtain a list of the pages
containing a match?

On Tue, 1 Mar 2016 at 13:01, Binoy Dalal  wrote:

> Here's one idea.
> Index each chapter as a parent document and then have individual pages to
> be the child documents.
> That way for a match in any chapter, you also get the individual pages as
> documents for presentation.
>
> On Tue, 1 Mar 2016, 17:26 Zaccheo Bagnati,  wrote:
>
> > Hi all,
> > I'm searching for ideas on how to define schema and how to perform
> queries
> > in this use case: we have to index books, each book is split into
> chapters
> > and chapters are split into pages (pages represent original page cutting
> in
> > printed version). We should show the result grouped by books and chapters
> > (for the same book) and pages (for the same chapter). As far as I know,
> we
> > have 2 options:
> >
> > 1. index pages as SOLR documents. In this way we could theoretically
> > retrieve chapters (and books?)  using grouping but
> > a. we will miss matches across two contiguous pages (page cutting is
> > only due to typographical needs so concepts could be split... as in
> printed
> > books)
> > b. I don't know if it is possible in SOLR to group results on two
> > different levels (books and chapters)
> >
> > 2. index chapters as SOLR documents. In this case we will have the right
> > matches but how to obtain the matching pages? (we need pages because the
> > client can only display pages)
> >
> > we have been struggling on this problem for a lot of time and we're  not
> > able to find a suitable solution so I'm looking if someone has ideas or
> has
> > already solved a similar issue.
> > Thanks
> >
> --
> Regards,
> Binoy Dalal
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Binoy Dalal
Here's one idea.
Index each chapter as a parent document and then have individual pages to
be the child documents.
That way for a match in any chapter, you also get the individual pages as
documents for presentation.
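
For what it's worth, a rough and untested sketch of that layout, with
anonymous child documents and the block join parent query parser (the "books"
core and all field names here are made up, assuming the usual *_s/*_t/*_i
dynamic fields):

$ curl -X POST -H 'Content-type:application/json' \
  "http://localhost:8983/solr/books/update?commit=true" --data-binary '[
  {"id": "b1_c1", "type_s": "chapter", "book_s": "b1", "title_t": "Chapter 1",
   "_childDocuments_": [
     {"id": "b1_c1_p1", "type_s": "page", "page_i": 1, "text_t": "... ends with a broken"},
     {"id": "b1_c1_p2", "type_s": "page", "page_i": 2, "text_t": "sentence that continues ..."}]}]'

$ curl "http://localhost:8983/solr/books/select" \
  --data-urlencode 'q={!parent which="type_s:chapter"}text_t:sentence' \
  --data-urlencode 'fl=id,title_t,[child parentFilter=type_s:chapter childFilter=text_t:sentence]'

The query matches on the page children but returns the chapter documents, and
the [child] transformer attaches the pages that actually matched, which is
roughly what is needed for presentation.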

On Tue, 1 Mar 2016, 17:26 Zaccheo Bagnati,  wrote:

> Hi all,
> I'm searching for ideas on how to define schema and how to perform queries
> in this use case: we have to index books, each book is split into chapters
> and chapters are split into pages (pages represent original page cutting in
> printed version). We should show the result grouped by books and chapters
> (for the same book) and pages (for the same chapter). As far as I know, we
> have 2 options:
>
> 1. index pages as SOLR documents. In this way we could theoretically
> retrieve chapters (and books?)  using grouping but
> a. we will miss matches across two contiguous pages (page cutting is
> only due to typographical needs so concepts could be split... as in printed
> books)
> b. I don't know if it is possible in SOLR to group results on two
> different levels (books and chapters)
>
> 2. index chapters as SOLR documents. In this case we will have the right
> matches but how to obtain the matching pages? (we need pages because the
> client can only display pages)
>
> we have been struggling on this problem for a lot of time and we're  not
> able to find a suitable solution so I'm looking if someone has ideas or has
> already solved a similar issue.
> Thanks
>
-- 
Regards,
Binoy Dalal


Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
Hi all,
I'm looking for ideas on how to define the schema and how to perform queries
in this use case: we have to index books, where each book is split into chapters
and chapters are split into pages (pages reflect the page breaks of the printed
version). We should show the results grouped by book, by chapter (within the
same book) and by page (within the same chapter). As far as I know, we have 2
options:

1. index pages as SOLR documents. In this way we could theoretically
retrieve chapters (and books?) using grouping, but
a. we will miss matches that span two contiguous pages (page breaks exist only
for typographical reasons, so a concept can be split across pages, as in
printed books)
b. I don't know whether it is possible in SOLR to group results on two
different levels (books and chapters); a grouping sketch follows below

2. index chapters as SOLR documents. In this case we will have the right
matches, but how do we obtain the matching pages? (we need pages because the
client can only display pages)

We have been struggling with this problem for a long time and have not been
able to find a suitable solution, so I'm asking whether someone has ideas or has
already solved a similar issue.
Thanks
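
(On point 1b: as far as I remember, Solr grouping cannot nest one grouping
level inside another, but the group.field parameter can be repeated, which at
least returns a grouping by book and a grouping by chapter side by side in a
single response. A rough, untested sketch with made-up field names:)

$ curl "http://localhost:8983/solr/books/select" \
  --data-urlencode 'q=text_t:"broken sentence"' \
  --data-urlencode 'group=true' \
  --data-urlencode 'group.field=book_s' \
  --data-urlencode 'group.field=chapter_s' \
  --data-urlencode 'group.limit=5'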


[ISSUE] backup on a recovering index should fail

2016-03-01 Thread Gerald Reinhart


Hi,

   In short: backup on a recovering index should fail.

   We are using the backup command "http:// ...
/replication?command=backup&location=/tmp" against one server of the
cluster.
   Most of the time there is no issue with this command.

   But in some cases the server can be in recovery mode. In that case the
command performs a backup of an index that is not complete and still returns
HTTP 200. We end up with a partial index backup! As a workaround we will run
the backup against the leader of the cloud: the leader is never in recovery
mode.
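
   Concretely, the workaround could look something like this (only a sketch;
collection, core and host names are made up): check the replica states via the
Collections API and point the backup at the leader.

$ curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json"
(pick the replica reported with "state":"active" and "leader":"true", then)
$ curl "http://leaderhost:8983/solr/mycollection_shard1_replica1/replication?command=backup&location=/tmp"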

   In our opinion, the backup command on a recovering index should
return HTTP 503 Service Unavailable (and not HTTP 200 OK).

   Shall we open an issue, or is this the expected behaviour?

   Thanks,


Gérald and Elodie

