Re: Solr Configuration (Caching & RAM) for performance Tuning

2016-03-02 Thread Erick Erickson
Committing after every doc is an anti-pattern. All the in-memory structures
are being thrown away after each update/insert.

Why do you think you need to do this? The usual pattern is to just let your
autocommit parameters in solrconfig.xml do this for you.

Ditto with specifying commitWithin on each add unless you have a pretty
unusual installation.

I'd recommend setting your commit intervals to be as long as you can stand
and not do anything from the client, including specifying commitWithin.
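
For illustration, a minimal SolrJ sketch of that pattern -- batch the adds, send no
commit and no commitWithin from the client, and let the autoCommit/autoSoftCommit
settings in solrconfig.xml control visibility. Collection name, fields and batch
size below are just placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("mycollection");
    try {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 10000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title_t", "document " + i);
        batch.add(doc);
        // Send a batch every 500 docs; note: no commit() and no commitWithin --
        // visibility is governed by autoCommit/autoSoftCommit in solrconfig.xml.
        if (batch.size() == 500) {
          client.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);
      }
    } finally {
      client.close();
    }
  }
}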

Best
Erick
On Mar 3, 2016 8:47 PM, "Binoy Dalal"  wrote:

> Can you share the cache stats from the admin panel?
> Also how much load are you talking about here? (Queries/second)
> How many documents do you have?
> Are you fetching any large stored fields?
>
> On Thu, 3 Mar 2016, 12:31 Maulin Rathod,  wrote:
>
> > Adding extra information.
> >
> > Our index size is around 120 GB (2 shard + 2 replica).
> > We have 400 GB RAM on our windows server.  Solr is assigned 50 GB RAM.
> So
> > there is huge amount of free RAM (>300 GB) is available for OS.
> >
> > We have very simple query which returns only 5 solr documents. Under load
> > condition it takes 100 ms to  2000 ms.
> >
> >
> > -Original Message-
> > From: Maulin Rathod
> > Sent: 03 March 2016 12:24
> > To: solr-user@lucene.apache.org
> > Subject: RE: Solr Configuration (Caching & RAM) for performance Tuning
> >
> > we do soft commit when we insert/update document.
> >
> > //Insert Document
> >
> > UpdateResponse resp = cloudServer.add(doc, 1000); if (resp.getStatus() ==
> > 0) {
> > success = true;
> > }
> >
> > //Update Document
> >
> > UpdateRequest req = new UpdateRequest(); req.setCommitWithin(1000);
> > req.add(docs); UpdateResponse resp = req.process(cloudServer); if
> > (resp.getStatus() == 0) {
> > success = true;
> > }
> >
> > Here is commit settings in solrconfig.xml.
> >
> > <autoCommit>
> >   <maxTime>60</maxTime>
> >   <maxDocs>2</maxDocs>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> > </autoSoftCommit>
> >
> >
> >
> >
> > -Original Message-
> > From: Binoy Dalal [mailto:binoydala...@gmail.com]
> > Sent: 03 March 2016 11:57
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr Configuration (Caching & RAM) for performance Tuning
> >
> > 1) Experiment with the autowarming settings in solrconfig.xml. Since in
> > your case, you're indexing so frequently consider setting the count to a
> > low number, so that not a lot of time is spent warming the caches.
> > Alternatively if you're not very big on initial query response times
> being
> > small, you could turn off auto warming all together.
> > Also how are your commit settings configured?
> > Do you do a hard commit every 10 seconds or do you have soft committing
> > enabled?
> >
> > 2) 50Gb memory is way to high to assign to just solr and it is
> unnecessary.
> > Solr loads your index into the OS cache. The index is not held in the JVM
> > heap.
> > So it is essential that your OS have enough free memory available to load
> > the entire index.
> > Since you're only seeing about a 2gb use of your JVM memory, set your
> heap
> > size to something around 4gbs.
> >
> > Also, how big is your index?
> >
> > On Thu, 3 Mar 2016, 11:39 Maulin Rathod,  wrote:
> >
> > > Hi,
> > >
> > > We are using Solr 5.2 (on windows 2012 server/jdk 1.8) for document
> > > content indexing/querying. We found that querying slows down
> > > intermittently under load condition.
> > >
> > > In our analysis we found two issues.
> > >
> > > 1) Solr is not effectively using caching.
> > >
> > > Whenever new document indexed, it opens new searcher and hence cache
> > > will become invalid (as cache was associated with old Index Searcher).
> > > In our scenario, new documents are indexed very frequently (at least
> > > 10 document are indexed per minute). So effectively cache will not be
> > > useful as it will open new searcher frequently to make new documents
> > available for searching.
> > > How can improve caching usage?
> > >
> > >
> > > 2) RAM is not utilized
> > >
> > > We observed that Solr is using only 1-2 GB of heap even though we have
> > > assign 50 GB. Seems like it is not loading index into RAM which leads
> > > to high IO. Is it possible to configure Solr to fully load indexes in
> > memory?
> > > Don't find any documentation about this. How can we increase RAM usage
> > > to improve Solr performance?
> > >
> > >
> > > Regards,
> > >
> > > Maulin
> > >
> > > [CC Award Winners 2015]
> > >
> > > --
> > Regards,
> > Binoy Dalal
> >
> --
> Regards,
> Binoy Dalal
>


Re: Currency field doubts

2016-03-02 Thread Jan Høydahl
Hi,

In SolrCloud you would want to upload your new currency.xml to ZK and then call 
the collections API for a reload.
Alternatively you could write your own ExchangeRate Provider for Google 
implementing the interface ExchangeRateProvider.
The downside here is that each Solr node then will fetch the exchange rates 
from Google individually and redundantly.
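
If it helps, a rough SolrJ sketch of the upload-and-reload option (zkHost, paths and
names are placeholders, and the exact classes should be checked against your Solr
version):

import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.common.cloud.SolrZkClient;
import org.apache.solr.common.cloud.ZkConfigManager;

public class PushCurrencyXml {
  public static void main(String[] args) throws Exception {
    String zkHost = "zk1:2181,zk2:2181,zk3:2181"; // placeholder
    // Upload the conf dir containing the fresh currency.xml over the existing config set.
    SolrZkClient zkClient = new SolrZkClient(zkHost, 30000);
    try {
      new ZkConfigManager(zkClient).uploadConfigDir(Paths.get("/path/to/conf"), "myconfig");
    } finally {
      zkClient.close();
    }
    // Reload the collection so every replica starts using the new exchange rates.
    CloudSolrClient solr = new CloudSolrClient(zkHost);
    try {
      CollectionAdminRequest.Reload reload = new CollectionAdminRequest.Reload();
      reload.setCollectionName("mycollection");
      reload.process(solr);
    } finally {
      solr.close();
    }
  }
}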

Wrt returning several currencies in fl, try &fl=INR:currency(mrp,INR)
USD:currency(mrp,USD) EUR:currency(mrp,EUR)
It will return one new field per alias (the name before the colon is the alias).
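
For example, from SolrJ that request could look roughly like this (core and field
names are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class CurrencyAliases {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/products");
    try {
      SolrQuery q = new SolrQuery("*:*");
      // One currency() alias per target currency; each comes back as its own pseudo-field.
      q.setFields("id",
          "INR:currency(mrp,INR)",
          "USD:currency(mrp,USD)",
          "EUR:currency(mrp,EUR)");
      for (SolrDocument doc : solr.query(q).getResults()) {
        System.out.println(doc.getFieldValue("id")
            + " INR=" + doc.getFieldValue("INR")
            + " USD=" + doc.getFieldValue("USD")
            + " EUR=" + doc.getFieldValue("EUR"));
      }
    } finally {
      solr.close();
    }
  }
}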

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 3. mar. 2016 kl. 06.03 skrev Pranaya Behera :
> 
> Hi,
> For currency, as suggested in the wiki and guide, the field type is 
> currency and the defaults would take usd and will take the exchange rates 
> from the currency.xml file located in the conf dir. We have script that talks 
> to google apis for the current currency exchange and symlinked to the conf 
> dir for the xml file. In solr cloud mode if any config changes this need to 
> be uploaded to zookeeper and then a reload required for the collection to 
> know that there are changes in the config. As the wiki says " Replication is 
> supported, given that you explicitly configure replication forcurrency.xml. 
> Upon the arrival of a new version of currency.xml, Solr slaves will do a core 
> reload and begin using the new exchange rates. SeeSolrReplication 
> for
>  more.SolrCloud is also supported 
> since we use ResourceLoader to load the file." But when I tried to do so it 
> didnt neither uploaded the configsets to zookeeper nor reload the collection. 
> How to go about this without manual zookeeper upload and reload of collection.
> 
> And now lets say the currency is being stored as USD and some in INR. While 
> querying we can provide in the fl param as currency(fieldname, 
> CURRENCY_CODES) e.g. currency(mrp, INR), currency(mrp, USD) and it will give 
> the result with respect to currency.xml file. Is it possible to return 
> calculated mrp in two different currency e.g. if the mrp would return more 
> than just one currency. currency(mrp, INR, USD, EUR) as I try this I get an 
> error. Is it possible to do so, and how ?
> 
> -- 
> Thanks & Regards
> Pranaya Behera
> 
> 



Re: Solr Configuration (Caching & RAM) for performance Tuning

2016-03-02 Thread Binoy Dalal
Can you share the cache stats from the admin panel?
Also how much load are you talking about here? (Queries/second)
How many documents do you have?
Are you fetching any large stored fields?

On Thu, 3 Mar 2016, 12:31 Maulin Rathod,  wrote:

> Adding extra information.
>
> Our index size is around 120 GB (2 shard + 2 replica).
> We have 400 GB RAM on our windows server.  Solr is assigned 50 GB RAM.  So
> there is huge amount of free RAM (>300 GB) is available for OS.
>
> We have very simple query which returns only 5 solr documents. Under load
> condition it takes 100 ms to  2000 ms.
>
>
> -Original Message-
> From: Maulin Rathod
> Sent: 03 March 2016 12:24
> To: solr-user@lucene.apache.org
> Subject: RE: Solr Configuration (Caching & RAM) for performance Tuning
>
> we do soft commit when we insert/update document.
>
> //Insert Document
>
> UpdateResponse resp = cloudServer.add(doc, 1000); if (resp.getStatus() ==
> 0) {
> success = true;
> }
>
> //Update Document
>
> UpdateRequest req = new UpdateRequest(); req.setCommitWithin(1000);
> req.add(docs); UpdateResponse resp = req.process(cloudServer); if
> (resp.getStatus() == 0) {
> success = true;
> }
>
> Here is commit settings in solrconfig.xml.
>
> <autoCommit>
>   <maxTime>60</maxTime>
>   <maxDocs>2</maxDocs>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> </autoSoftCommit>
>
>
>
>
> -Original Message-
> From: Binoy Dalal [mailto:binoydala...@gmail.com]
> Sent: 03 March 2016 11:57
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Configuration (Caching & RAM) for performance Tuning
>
> 1) Experiment with the autowarming settings in solrconfig.xml. Since in
> your case, you're indexing so frequently consider setting the count to a
> low number, so that not a lot of time is spent warming the caches.
> Alternatively if you're not very big on initial query response times being
> small, you could turn off auto warming all together.
> Also how are your commit settings configured?
> Do you do a hard commit every 10 seconds or do you have soft committing
> enabled?
>
> 2) 50Gb memory is way to high to assign to just solr and it is unnecessary.
> Solr loads your index into the OS cache. The index is not held in the JVM
> heap.
> So it is essential that your OS have enough free memory available to load
> the entire index.
> Since you're only seeing about a 2gb use of your JVM memory, set your heap
> size to something around 4gbs.
>
> Also, how big is your index?
>
> On Thu, 3 Mar 2016, 11:39 Maulin Rathod,  wrote:
>
> > Hi,
> >
> > We are using Solr 5.2 (on windows 2012 server/jdk 1.8) for document
> > content indexing/querying. We found that querying slows down
> > intermittently under load condition.
> >
> > In our analysis we found two issues.
> >
> > 1) Solr is not effectively using caching.
> >
> > Whenever new document indexed, it opens new searcher and hence cache
> > will become invalid (as cache was associated with old Index Searcher).
> > In our scenario, new documents are indexed very frequently (at least
> > 10 document are indexed per minute). So effectively cache will not be
> > useful as it will open new searcher frequently to make new documents
> available for searching.
> > How can improve caching usage?
> >
> >
> > 2) RAM is not utilized
> >
> > We observed that Solr is using only 1-2 GB of heap even though we have
> > assign 50 GB. Seems like it is not loading index into RAM which leads
> > to high IO. Is it possible to configure Solr to fully load indexes in
> memory?
> > Don't find any documentation about this. How can we increase RAM usage
> > to improve Solr performance?
> >
> >
> > Regards,
> >
> > Maulin
> >
> > [CC Award Winners 2015]
> >
> > --
> Regards,
> Binoy Dalal
>
-- 
Regards,
Binoy Dalal


Re: BlockJoinQuery parser and ArrayIndexOutOfBoundException

2016-03-02 Thread Mikhail Khludnev
On Thu, Mar 3, 2016 at 7:18 AM, Sathyakumar Seshachalam <
sathyakumar_seshacha...@trimble.com> wrote:

> In my case, yes there are standalone docs (without any parents) and then
> there is blocks with parents and its children in the same index.
>

As far as I know you can't mix them. Can you try to add some fake child if
a parent has none?
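
Something along these lines with SolrJ, as a minimal sketch (field names and the
placeholder child are made up):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexBlockWithPlaceholderChild {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    try {
      SolrInputDocument parent = new SolrInputDocument();
      parent.addField("id", "parent-1");
      parent.addField("is_parent", true);

      // This parent has no real children, so attach a fake child; that way every
      // parent is indexed as a block and standalone docs and blocks are not mixed.
      SolrInputDocument fakeChild = new SolrInputDocument();
      fakeChild.addField("id", "parent-1-placeholder");
      fakeChild.addField("is_placeholder", true);
      parent.addChildDocument(fakeChild);

      solr.add(parent);
      solr.commit();
    } finally {
      solr.close();
    }
  }
}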


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





RE: Solr Configuration (Caching & RAM) for performance Tuning

2016-03-02 Thread Maulin Rathod
Adding extra information.

Our index size is around 120 GB (2 shards + 2 replicas).
We have 400 GB RAM on our Windows server. Solr is assigned 50 GB RAM, so
there is a huge amount of free RAM (>300 GB) available for the OS.

We have a very simple query which returns only 5 Solr documents. Under load
it takes 100 ms to 2000 ms.


-Original Message-
From: Maulin Rathod 
Sent: 03 March 2016 12:24
To: solr-user@lucene.apache.org
Subject: RE: Solr Configuration (Caching & RAM) for performance Tuning

we do soft commit when we insert/update document.

//Insert Document

UpdateResponse resp = cloudServer.add(doc, 1000); if (resp.getStatus() == 0) {
success = true;
}

//Update Document

UpdateRequest req = new UpdateRequest(); req.setCommitWithin(1000); 
req.add(docs); UpdateResponse resp = req.process(cloudServer); if 
(resp.getStatus() == 0) {
success = true;
}

Here are the commit settings in solrconfig.xml:

<autoCommit>
  <maxTime>60</maxTime>
  <maxDocs>2</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>




-Original Message-
From: Binoy Dalal [mailto:binoydala...@gmail.com]
Sent: 03 March 2016 11:57
To: solr-user@lucene.apache.org
Subject: Re: Solr Configuration (Caching & RAM) for performance Tuning

1) Experiment with the autowarming settings in solrconfig.xml. Since in your 
case, you're indexing so frequently consider setting the count to a low number, 
so that not a lot of time is spent warming the caches.
Alternatively if you're not very big on initial query response times being 
small, you could turn off auto warming all together.
Also how are your commit settings configured?
Do you do a hard commit every 10 seconds or do you have soft committing enabled?

2) 50Gb memory is way to high to assign to just solr and it is unnecessary.
Solr loads your index into the OS cache. The index is not held in the JVM heap.
So it is essential that your OS have enough free memory available to load the 
entire index.
Since you're only seeing about a 2gb use of your JVM memory, set your heap size 
to something around 4gbs.

Also, how big is your index?

On Thu, 3 Mar 2016, 11:39 Maulin Rathod,  wrote:

> Hi,
>
> We are using Solr 5.2 (on windows 2012 server/jdk 1.8) for document 
> content indexing/querying. We found that querying slows down 
> intermittently under load condition.
>
> In our analysis we found two issues.
>
> 1) Solr is not effectively using caching.
>
> Whenever new document indexed, it opens new searcher and hence cache 
> will become invalid (as cache was associated with old Index Searcher).
> In our scenario, new documents are indexed very frequently (at least
> 10 document are indexed per minute). So effectively cache will not be 
> useful as it will open new searcher frequently to make new documents 
> available for searching.
> How can improve caching usage?
>
>
> 2) RAM is not utilized
>
> We observed that Solr is using only 1-2 GB of heap even though we have 
> assign 50 GB. Seems like it is not loading index into RAM which leads 
> to high IO. Is it possible to configure Solr to fully load indexes in memory?
> Don't find any documentation about this. How can we increase RAM usage 
> to improve Solr performance?
>
>
> Regards,
>
> Maulin
>
> [CC Award Winners 2015]
>
> --
Regards,
Binoy Dalal


RE: Solr Configuration (Caching & RAM) for performance Tuning

2016-03-02 Thread Maulin Rathod
We do a soft commit when we insert/update a document.

//Insert Document

UpdateResponse resp = cloudServer.add(doc, 1000);
if (resp.getStatus() == 0)
{
success = true;
}

//Update Document

UpdateRequest req = new UpdateRequest();
req.setCommitWithin(1000);
req.add(docs);
UpdateResponse resp = req.process(cloudServer);
if (resp.getStatus() == 0)
{
success = true;
}

Here are the commit settings in solrconfig.xml:

<autoCommit>
  <maxTime>60</maxTime>
  <maxDocs>2</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>




-Original Message-
From: Binoy Dalal [mailto:binoydala...@gmail.com] 
Sent: 03 March 2016 11:57
To: solr-user@lucene.apache.org
Subject: Re: Solr Configuration (Caching & RAM) for performance Tuning

1) Experiment with the autowarming settings in solrconfig.xml. Since in your 
case, you're indexing so frequently consider setting the count to a low number, 
so that not a lot of time is spent warming the caches.
Alternatively if you're not very big on initial query response times being 
small, you could turn off auto warming all together.
Also how are your commit settings configured?
Do you do a hard commit every 10 seconds or do you have soft committing enabled?

2) 50Gb memory is way to high to assign to just solr and it is unnecessary.
Solr loads your index into the OS cache. The index is not held in the JVM heap.
So it is essential that your OS have enough free memory available to load the 
entire index.
Since you're only seeing about a 2gb use of your JVM memory, set your heap size 
to something around 4gbs.

Also, how big is your index?

On Thu, 3 Mar 2016, 11:39 Maulin Rathod,  wrote:

> Hi,
>
> We are using Solr 5.2 (on windows 2012 server/jdk 1.8) for document 
> content indexing/querying. We found that querying slows down 
> intermittently under load condition.
>
> In our analysis we found two issues.
>
> 1) Solr is not effectively using caching.
>
> Whenever new document indexed, it opens new searcher and hence cache 
> will become invalid (as cache was associated with old Index Searcher). 
> In our scenario, new documents are indexed very frequently (at least 
> 10 document are indexed per minute). So effectively cache will not be 
> useful as it will open new searcher frequently to make new documents 
> available for searching.
> How can improve caching usage?
>
>
> 2) RAM is not utilized
>
> We observed that Solr is using only 1-2 GB of heap even though we have 
> assign 50 GB. Seems like it is not loading index into RAM which leads 
> to high IO. Is it possible to configure Solr to fully load indexes in memory?
> Don't find any documentation about this. How can we increase RAM usage 
> to improve Solr performance?
>
>
> Regards,
>
> Maulin
>
> [CC Award Winners 2015]
>
> --
Regards,
Binoy Dalal


Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread danny teichthal
According to what you describe, I really don't see the need of core
discovery in Solr Cloud. It will only be used to eagerly load a core on
startup.
If I understand correctly, when ZK = truth, this eager loading can/should
be done by consulting zookeeper instead of local disk.
I agree that it is really confusing.
The best strategy that I can see so far is to stop relying on core.properties and
keep it all in zookeeper.


On Wed, Mar 2, 2016 at 7:54 PM, Jeff Wartes  wrote:

> Well, with the understanding that someone who isn’t involved in the
> process is describing something that isn’t built yet...
>
> I could imagine changes like:
>  - Core discovery ignores cores that aren’t present in the ZK cluster state
>  - New cores are automatically created to bring a node in line with ZK
> cluster state (addreplica, essentially)
>
> So if the clusterstate said “node XYZ has a replica of shard3 of
> collection1 and that’s all”, and you downed node XYZ and deleted the data
> directory, it’d get restored when you started the node again. And if you
> copied the core directory for shard1 of collection2 in there and restarted
> the node, it’d get ignored because the clusterstate says node XYZ doesn’t
> have that.
>
> More importantly, if you completely destroyed a node and rebuilt it from
> an image, (AWS?) that image wouldn't need any special core directories
> specific to that node. As long as the node name was the same, Solr would
> handle bringing that node back to where it was in the cluster.
>
> Back to opinions, I think mixing the cluster definition between local disk
> on the nodes and ZK clusterstate is just confusing. It should really be one
> or the other. Specifically, I think it should be local disk for
> non-SolrCloud, and ZK for SolrCloud.
>
>
>
>
>
> On 3/2/16, 12:13 AM, "danny teichthal"  wrote:
>
> >Thanks Jeff,
> >I understand your philosophy and it sounds correct.
> >Since we had many problems with zookeeper when switching to Solr Cloud, we
> >couldn't use it as the source of knowledge and had to rely on a more stable
> >source.
> >The issue is that when we get such a zookeeper event, it brings our
> >system down, and in this case, clearing the core.properties was a life
> >saver.
> >We've managed to make it pretty stable now, but we will always need a
> >"dooms day" weapon.
> >
> >I looked into the related JIRA and it confused me a little, and raised a
> >few other questions:
> >1. What exactly defines zookeeper as a truth?
> >2. What is the role of core.properties if the state is only in zookeeper?
> >
> >
> >
> >Your tool is very interesting, I just thought about writing such a tool
> >myself.
> >From the sources I understand that you represent each node as a path in
> the
> >git repository.
> >So, I guess that for restore purposes I will have to do
> >the opposite direction and create a node for every path entry.
> >
> >
> >
> >
> >On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes 
> wrote:
> >
> >>
> >> I’ve been running SolrCloud clusters in various versions for a few years
> >> here, and I can only think of two or three cases that the ZK-stored
> cluster
> >> state was broken in a way that I had to manually intervene by
> hand-editing
> >> the contents of ZK. I think I’ve seen Solr fixes go by for those cases,
> >> too. I’ve never completely wiped ZK. (Although granted, my ZK cluster
> has
> >> been pretty stable, and my collection count is smaller than yours)
> >>
> >> My philosophy is that ZK is the source of cluster configuration, not the
> >> collection of core.properties files on the nodes.
> >> Currently, cluster state is shared between ZK and core directories. I’d
> >> prefer, and I think Solr development is going this way, (SOLR-7269) that
> >> all cluster state exist and be managed via ZK, and all state be removed
> >> from the local disk of the cluster nodes. The fact that a node uses
> local
> >> disk based configuration to figure out what collections/replicas it has
> is
> >> something that should be fixed, in my opinion.
> >>
> >> If you’re frequently getting into bad states due to ZK issues, I’d
> suggest
> >> you file bugs against Solr for the fact that you got into the state, and
> >> then fix your ZK cluster.
> >>
> >> Failing that, can you just periodically back up your ZK data and restore
> >> it if something breaks? I wrote a little tool to watch clusterstate.json
> >> and write every version to a local git repo a few years ago. I was
> mostly
> >> interested because I wanted to see changes that happened pretty fast,
> but
> >> it could also serve as a backup approach. Here’s a link, although I
> clearly
> >> haven’t touched it lately. Feel free to ask if you have issues:
> >> https://github.com/randomstatistic/git_zk_monitor
> >>
> >>
> >>
> >>
> >> On 3/1/16, 12:09 PM, "danny teichthal"  wrote:
> >>
> >> >Hi,
> >> >Just summarizing my questions if the long mail is a little
> intimidating:
> >> >1. Is there a best practice/automated tool for overcoming problems in
> >> >clus

Re: Solr Configuration (Caching & RAM) for performance Tuning

2016-03-02 Thread Binoy Dalal
1) Experiment with the autowarming settings in solrconfig.xml. Since in
your case, you're indexing so frequently consider setting the count to a
low number, so that not a lot of time is spent warming the caches.
Alternatively, if you're not very big on initial query response times being
small, you could turn off autowarming altogether.
Also how are your commit settings configured?
Do you do a hard commit every 10 seconds or do you have soft committing
enabled?

2) 50 GB of memory is way too high to assign to just Solr, and it is unnecessary.
Solr relies on the OS cache for the index; the index is not held in the JVM
heap.
So it is essential that your OS have enough free memory available to load
the entire index.
Since you're only seeing about 2 GB of JVM memory in use, set your heap
size to something around 4 GB.

Also, how big is your index?

On Thu, 3 Mar 2016, 11:39 Maulin Rathod,  wrote:

> Hi,
>
> We are using Solr 5.2 (on windows 2012 server/jdk 1.8) for document
> content indexing/querying. We found that querying slows down intermittently
> under load condition.
>
> In our analysis we found two issues.
>
> 1) Solr is not effectively using caching.
>
> Whenever new document indexed, it opens new searcher and hence cache will
> become invalid (as cache was associated with old Index Searcher). In our
> scenario, new documents are indexed very frequently (at least 10 document
> are indexed per minute). So effectively cache will not be useful as it will
> open new searcher frequently to make new documents available for searching.
> How can improve caching usage?
>
>
> 2) RAM is not utilized
>
> We observed that Solr is using only 1-2 GB of heap even though we have
> assign 50 GB. Seems like it is not loading index into RAM which leads to
> high IO. Is it possible to configure Solr to fully load indexes in memory?
> Don't find any documentation about this. How can we increase RAM usage to
> improve Solr performance?
>
>
> Regards,
>
> Maulin
>
> [CC Award Winners 2015]
>
> --
Regards,
Binoy Dalal


Solr Configuration (Caching & RAM) for performance Tuning

2016-03-02 Thread Maulin Rathod
Hi,

We are using Solr 5.2 (on windows 2012 server/jdk 1.8) for document content 
indexing/querying. We found that querying slows down intermittently under load 
condition.

In our analysis we found two issues.

1) Solr is not effectively using caching.

Whenever a new document is indexed, Solr opens a new searcher and hence the cache
becomes invalid (as the cache was associated with the old IndexSearcher). In our
scenario, new documents are indexed very frequently (at least 10 documents are
indexed per minute). So effectively the cache will not be useful, as Solr will open
a new searcher frequently to make new documents available for searching. How can
we improve cache usage?


2) RAM is not utilized

We observed that Solr is using only 1-2 GB of heap even though we have assigned
50 GB. It seems like it is not loading the index into RAM, which leads to high IO.
Is it possible to configure Solr to fully load indexes in memory? We don't find any
documentation about this. How can we increase RAM usage to improve Solr
performance?


Regards,

Maulin

[CC Award Winners 2015]



XX:ParGCCardsPerStrideChunk

2016-03-02 Thread William Bell
Has anyone tried -XX:ParGCCardsPerStrideChunk with Solr?

There have been reports of improved GC times.

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Currency field doubts

2016-03-02 Thread Pranaya Behera

Hi,
 For currency, as suggested in the wiki and guide, the field type
is currency; by default it uses USD and takes the exchange rates from the
currency.xml file located in the conf dir. We have a script that talks to
Google APIs for the current exchange rates, and the XML file is symlinked
into the conf dir. In SolrCloud mode, if any config changes, it needs to be
uploaded to ZooKeeper and then a reload is required for the collection to
know that there are changes in the config. As the wiki says: "Replication is
supported, given that you explicitly configure replication for currency.xml.
Upon the arrival of a new version of currency.xml, Solr slaves will do a core
reload and begin using the new exchange rates. See SolrReplication for more.
SolrCloud is also supported since we use ResourceLoader to load the file."
But when I tried to do so, it neither uploaded the configset to ZooKeeper nor
reloaded the collection. How do I go about this without a manual ZooKeeper
upload and collection reload?


And now let's say the currency is being stored as USD for some documents and
INR for others. While querying we can provide currency(fieldname,
CURRENCY_CODE) in the fl param, e.g. currency(mrp, INR) or currency(mrp, USD),
and it will give the result with respect to the currency.xml file. Is it
possible to return the calculated mrp in more than one currency at once? When
I try currency(mrp, INR, USD, EUR) I get an error. Is it possible to do so,
and how?


--
Thanks & Regards
Pranaya Behera




Re: BlockJoinQuery parser and ArrayIndexOutOfBoundException

2016-03-02 Thread Sathyakumar Seshachalam
Hi, 
I will try that approach: deleting and force merging before adding the
blocks.
In my case, yes, there are standalone docs (without any parents) and then
there are blocks with parents and their children in the same index.
Note however that the docs in the blocks are unique in that there is just
one copy of the children. (Not sure if that was clear enough.)

 

On 02/03/16, 9:40 PM, "Mikhail Khludnev" 
wrote:

>Hello,
>
>It's really hard to find exact case, why it happens. There is a bruteforce
>approach, sweep all deleted documents ie forcemerge until there is no
>deleted docs.
>Can it happen that standalone docs and parent blocks are mixed in the
>index?
>
>On Wed, Mar 2, 2016 at 2:04 PM, Sathyakumar Seshachalam <
>sathyakumar_seshacha...@trimble.com> wrote:
>
>> Am running in to this issue :
>> https://issues.apache.org/jira/browse/SOLR-7606. But am not following
>>all
>> of the description there in that ticket.
>>
>> But what I am not able to understand is when does a parent/child
>> orthogonality is broken. And what does a child document without a parent
>> mean ?
>>
>> I have a set of documents that have been added to solr (via an import
>>from
>> DB), And then in another process I fetch or recreate SolrInputDocument
>>from
>> DB those documents for which the relation need to be in place. And
>>before
>> adding them to Solr, I make sure all the parent and child documents are
>> deleted from Solr and then I add this block of documents in to solr.
>>
>> And now when I query I get an ArrayIndexOutOfBoundException exactly as
>> specified in that JIRA issue.
>> Any insights on how and what should be done will be greatly appreciated.
>>
>>
>>
>>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
>
>



SolrEntityProcessor works with Solr Cloud

2016-03-02 Thread Neeraj Bhatt
Hello All

I am trying to import data from one SolrCloud cluster into another using
SolrEntityProcessor. My schema got changed and I need to reindex.

1. Does SolrEntityProcessor work with SolrCloud to get data from Solr Cloud?

It looks like it will not work, as the SolrEntityProcessor code creates an
instance of SolrClient and not of CloudSolrClient:

 solrClient = new HttpSolrClient(url.toExternalForm(), client);

2. Also, how will I pass my external ZooKeeper ensemble in the url?


Thanks
Neeraj


Re: facet on two multi-valued fields

2016-03-02 Thread Jan Høydahl
It makes no sense to facet on a “text_general” analyzed field. Can you give a
concrete example with a few dummy docs and show some queries (do you query the
tagDescription field?) and the wanted facet output?

There may be several ways to solve the task, depending on the exact use case. 
One solution could be to use child documents.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 2. mar. 2016 kl. 17.30 skrev Andreas Hubold :
> 
> Hi,
> 
> my schema looks like this
> 
>
> <field name="tagIds" ... multiValued="true"/>
> <field name="tagDescription" type="text_general" ... stored="false" multiValued="true"/>
>
> 
> I'd like to get the tagIds of documents with a certain tagDescription (and 
> text). However tagIds contains multiple ids in the same order as 
> tagDescription and simple faceting would return all. Is there a way to just 
> get the IDs of the tags with a matching description?
> 
> Or would you recommend some other schema?
> 
> Thanks,
> Andreas
> 
> 



Re: Solr (v5.3.1) doesn't delete orphaned child documents

2016-03-02 Thread Mikhail Khludnev
when it indexes a document block it has to assign not just the unique key but
also a "_root_" field, and deleteById() is unaware of it.
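
So one workaround is to delete by query instead of by id -- a rough SolrJ sketch,
with made-up core name and ids:

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeleteOrphanedChildren {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    try {
      // deleteById("a") only looks at the uniqueKey, so the child keeps showing up.
      // Deleting by query works; remove the whole block that belonged to parent P:
      solr.deleteByQuery("_root_:P");
      // or target one orphaned child explicitly:
      solr.deleteByQuery("id:a");
      solr.commit();
    } finally {
      solr.close();
    }
  }
}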

On Wed, Mar 2, 2016 at 8:16 PM, Naeem Tahir 
wrote:

> Hi,
>
>  I noticed some strange behavior when deleting orphaned child
> documents in Solr 5.3.1. I am indexing nested documents in parent/child
> hierarchy. When I delete a child document whose parent is already deleted
> previously, child document still shows up in search. I am using
> deleteById() that always returns with a success code. Here is an
> illustration:
>
> A parent P has n (=3) children, say a, b, and c.
>
> (P)
>  |-(a)
>  |-(b)
>  |-(c)
>
> i) Index all four documents with P as parent and a,b,c as children of P.
> ii) Search returns 4 documents (P, a, b, c).
> iii) Delete P.
> iv) Search returns 3 documents (a, b, c)
> v) Now delete 'a'
> vi) Search still returns 3 documents including 'a'. Same behavior when you
> delete 'b' and 'c' as well.
>
> Can someone comment if this is the expected behavior?
>
>Thanks & regards,
> Naeem
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread Jeff Wartes
Well, with the understanding that someone who isn’t involved in the process is 
describing something that isn’t built yet...

I could imagine changes like:
 - Core discovery ignores cores that aren’t present in the ZK cluster state
 - New cores are automatically created to bring a node in line with ZK cluster 
state (addreplica, essentially) 
 
So if the clusterstate said “node XYZ has a replica of shard3 of collection1 
and that’s all”, and you downed node XYZ and deleted the data directory, it’d 
get restored when you started the node again. And if you copied the core 
directory for shard1 of collection2 in there and restarted the node, it’d get 
ignored because the clusterstate says node XYZ doesn’t have that.

More importantly, if you completely destroyed a node and rebuilt it from an 
image, (AWS?) that image wouldn't need any special core directories specific to 
that node. As long as the node name was the same, Solr would handle bringing 
that node back to where it was in the cluster.

Back to opinions, I think mixing the cluster definition between local disk on 
the nodes and ZK clusterstate is just confusing. It should really be one or the 
other. Specifically, I think it should be local disk for non-SolrCloud, and ZK 
for SolrCloud.





On 3/2/16, 12:13 AM, "danny teichthal"  wrote:

>Thanks Jeff,
>I understand your philosophy and it sounds correct.
>Since we had many problems with zookeeper when switching to Solr Cloud, we
>couldn't use it as the source of knowledge and had to rely on a more stable
>source.
>The issue is that when we get such a zookeeper event, it brings our
>system down, and in this case, clearing the core.properties was a life
>saver.
>We've managed to make it pretty stable now, but we will always need a
>"dooms day" weapon.
>
>I looked into the related JIRA and it confused me a little, and raised a
>few other questions:
>1. What exactly defines zookeeper as a truth?
>2. What is the role of core.properties if the state is only in zookeeper?
>
>
>
>Your tool is very interesting, I just thought about writing such a tool
>myself.
>From the sources I understand that you represent each node as a path in the
>git repository.
>So, I guess that for restore purposes I will have to do
>the opposite direction and create a node for every path entry.
>
>
>
>
>On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes  wrote:
>
>>
>> I’ve been running SolrCloud clusters in various versions for a few years
>> here, and I can only think of two or three cases that the ZK-stored cluster
>> state was broken in a way that I had to manually intervene by hand-editing
>> the contents of ZK. I think I’ve seen Solr fixes go by for those cases,
>> too. I’ve never completely wiped ZK. (Although granted, my ZK cluster has
>> been pretty stable, and my collection count is smaller than yours)
>>
>> My philosophy is that ZK is the source of cluster configuration, not the
>> collection of core.properties files on the nodes.
>> Currently, cluster state is shared between ZK and core directories. I’d
>> prefer, and I think Solr development is going this way, (SOLR-7269) that
>> all cluster state exist and be managed via ZK, and all state be removed
>> from the local disk of the cluster nodes. The fact that a node uses local
>> disk based configuration to figure out what collections/replicas it has is
>> something that should be fixed, in my opinion.
>>
>> If you’re frequently getting into bad states due to ZK issues, I’d suggest
>> you file bugs against Solr for the fact that you got into the state, and
>> then fix your ZK cluster.
>>
>> Failing that, can you just periodically back up your ZK data and restore
>> it if something breaks? I wrote a little tool to watch clusterstate.json
>> and write every version to a local git repo a few years ago. I was mostly
>> interested because I wanted to see changes that happened pretty fast, but
>> it could also serve as a backup approach. Here’s a link, although I clearly
>> haven’t touched it lately. Feel free to ask if you have issues:
>> https://github.com/randomstatistic/git_zk_monitor
>>
>>
>>
>>
>> On 3/1/16, 12:09 PM, "danny teichthal"  wrote:
>>
>> >Hi,
>> >Just summarizing my questions if the long mail is a little intimidating:
>> >1. Is there a best practice/automated tool for overcoming problems in
>> >cluster state coming from zookeeper disconnections?
>> >2. Creating a collection via core admin is discouraged, is it true also
>> for
>> >core.properties discovery?
>> >
>> >I would like to be able to specify collection.configName in the
>> >core.properties and when starting server, the collection will be created
>> >and linked to the config name specified.
>> >
>> >
>> >
>> >On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal 
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >>
>> >> I would like to describe a process we use for overcoming problems in
>> >> cluster state when we have networking issues. Would appreciate if anyone
>> >> can answer about what are the flaws on this solu

Solr (v5.3.1) doesn't delete orphaned child documents

2016-03-02 Thread Naeem Tahir
Hi,    

 I noticed some strange behavior when deleting orphaned child documents in 
Solr 5.3.1. I am indexing nested documents in parent/child hierarchy. When I 
delete a child document whose parent is already deleted previously, child 
document still shows up in search. I am using deleteById() that always returns 
with a success code. Here is an illustration:

A parent P has n (=3) children, say a, b, and c.

    (P)
         |-(a)
         |-(b)
         |-(c)
         
i) Index all four documents with P as parent and a,b,c as children of P.
ii) Search returns 4 documents (P, a, b, c).
iii) Delete P.
iv) Search returns 3 documents (a, b, c)
v) Now delete 'a'
vi) Search still returns 3 documents including 'a'. Same behavior when you 
delete 'b' and 'c' as well.

    Can someone comment if this is the expected behavior?

   Thanks & regards,
    Naeem


Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Shawn Heisey
On 3/2/2016 9:55 AM, G, Rajesh wrote:
> Thanks for your email Koji. Can you please explain what is the role of 
> tokenizer and filter so I can understand why I should not have two tokenizer 
> in index and I should have at least one tokenizer in query?

You can't have two tokenizers.  It's not allowed.

The only notable difference between a Tokenizer and a Filter is that a
Tokenizer operates on an input that's a single string, turning it into a
token stream, and a Filter uses a token stream for both input and
output.  A CharFilter uses a single string as both input and output.

An analysis chain in the Solr schema (whether it's index or query) is
composed of zero or more CharFilter entries, exactly one Tokenizer
entry, and zero or more Filter entries.  Alternately, you can specify an
Analyzer class, which is a lot like a Tokenizer.  An Analyzer is
effectively the same thing as a tokenizer combined with filters.

CharFilters run before the Tokenizer, and Filters run after the
Tokenizer.  CharFilters, Tokenizers, Filters, and Analyzers are Lucene
concepts.
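
If it helps to see that pipeline outside the schema, Lucene's CustomAnalyzer
builder (in the analyzers-common module of the 5.x line) wires the same stages
together programmatically. The chain below is only an illustration, not the schema
under discussion:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisChainDemo {
  public static void main(String[] args) throws Exception {
    // zero or more CharFilters -> exactly one Tokenizer -> zero or more Filters
    CustomAnalyzer analyzer = CustomAnalyzer.builder()
        .addCharFilter("htmlstrip")
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .build();

    TokenStream ts = analyzer.tokenStream("body", "Hello <b>Solr</b> Users");
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());   // hello, solr, users
    }
    ts.end();
    ts.close();
    analyzer.close();
  }
}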

> My understanding is tokenizer is used to say how the content should be 
> indexed physically in file system. Filters are used to query result

The format of the index on disk is not controlled by the tokenizer, or
anything else in the analysis chain.  It is controlled by the Lucene
codec.  Only a very small part of the codec is configurable in Solr, but
normally this does not need configuring.  The codec defaults are
appropriate for the majority of use cases.

Thanks,
Shawn



RE: FW: Difference Between Tokenizer and filter

2016-03-02 Thread G, Rajesh
Thanks for your email Koji. Can you please explain the roles of the tokenizer
and filter so I can understand why I should not have two tokenizers in the index
analyzer and why I should have at least one tokenizer in the query analyzer?

My understanding is that the tokenizer is used to say how the content should be
indexed physically in the file system, and filters are used to query results.





-Original Message-
From: Koji Sekiguchi [mailto:koji.sekigu...@rondhuit.com]
Sent: Wednesday, March 2, 2016 8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: FW: Difference Between Tokenizer and filter

Hi,

<analyzer> ... </analyzer> must have one and only one <tokenizer> and it can
have zero or more <filter>s. From the point of view of the rules, your
<analyzer type="index"> ... </analyzer> is not correct because it has more than
one <tokenizer>, and <analyzer type="query"> ... </analyzer> is not correct as
well because it has no <tokenizer>.

Koji

On 2016/03/02 20:25, G, Rajesh wrote:
> Hi Team,
>
> Can you please clarify the below. My understanding is tokenizer is used to
> say how the content should be indexed physically in the file system. Filters are
> used to query results. The below lines are from my setup. But I have seen examples
> that include filters inside <analyzer type="index"> and a tokenizer in
> <analyzer type="query">, and that confused me.
>
> <fieldType ... positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>   </analyzer>
>   <analyzer type="query">
>     <filter ... minGramSize="2" maxGramSize="2"/>
>   </analyzer>
> </fieldType>
>
> My goal is to use Solr and find the best match among the technology
> names e.g Actual tech name
>
> 1.   Microsoft Visual Studio
>
> 2.   Microsoft Internet Explorer
>
> 3.   Microsoft Visio
>
> When user types Microsoft Visal Studio user should get Microsoft
> Visual Studio. Basically misspelled and jumble words should match
> closest tech name
>
>
>
>
>
>
>



facet on two multi-valued fields

2016-03-02 Thread Andreas Hubold

Hi,

my schema looks like this


<field name="tagIds" ... multiValued="true"/>
<field name="tagDescription" type="text_general" ... stored="false" multiValued="true"/>



I'd like to get the tagIds of documents with a certain tagDescription 
(and text). However tagIds contains multiple ids in the same order as 
tagDescription and simple faceting would return all. Is there a way to 
just get the IDs of the tags with a matching description?


Or would you recommend some other schema?

Thanks,
Andreas




Re: BlockJoinQuery parser and ArrayIndexOutOfBoundException

2016-03-02 Thread Mikhail Khludnev
Hello,

It's really hard to find the exact case why it happens. There is a brute-force
approach: sweep away all deleted documents, i.e. forcemerge until there are no
deleted docs.
Can it happen that standalone docs and parent blocks are mixed in the index?

On Wed, Mar 2, 2016 at 2:04 PM, Sathyakumar Seshachalam <
sathyakumar_seshacha...@trimble.com> wrote:

> Am running in to this issue :
> https://issues.apache.org/jira/browse/SOLR-7606. But am not following all
> of the description there in that ticket.
>
> But what I am not able to understand is when does a parent/child
> orthogonality is broken. And what does a child document without a parent
> mean ?
>
> I have a set of documents that have been added to solr (via an import from
> DB), And then in another process I fetch or recreate SolrInputDocument from
> DB those documents for which the relation need to be in place. And before
> adding them to Solr, I make sure all the parent and child documents are
> deleted from Solr and then I add this block of documents in to solr.
>
> And now when I query I get an ArrayIndexOutOfBoundException exactly as
> specified in that JIRA issue.
> Any insights on how and what should be done will be greatly appreciated.
>
>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Non-contiguous terms in SuggestComponent

2016-03-02 Thread Alfonso Muñoz-Pomer Fuentes

Hi Edwin.

That was what I suspected, but I wanted to confirm. If we go down this 
route I’ll do some testing and post the results.


We’re using 5.1 in production, but I’m testing with 5.4.1.

The index has 40,891,287 documents and is 3.01 GB, so it’s not big at all.

Many thanks,
Alfonso

On 01/03/2016 06:25, Zheng Lin Edwin Yeo wrote:

 From what I have experienced, the performance using edismax will be slower.
It may not be that significant if your index size is small, but it will get
more significant as your index size grows.

By the way, which version of Solr are you using?

Regards,
Edwin


On 29 February 2016 at 21:33, Alfonso Muñoz-Pomer Fuentes 
wrote:


Hi all.

I’ve been reading through the Suggester component in Solr at
https://cwiki.apache.org/confluence/display/solr/Suggester.

I have a couple of questions regarding it which I haven’t been able to
find the answer for in that page or anywhere else.

Is there a way to get suggestions from non-contiguous terms using a
SuggestComponent? E.g. let’s say we have a document which contains “The
quick brown fox” in a text field, can it be configured so that a user can
obtain that suggestion by typing “quick fox”?

I know I can get this sort of results using edismax queries, so maybe I
can set a request handler to do suggestions in this way instead than with
SuggestComponent. What are the downsides performance-wise?

Thank you in advance.

--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer





--
Alfonso Muñoz-Pomer Fuentes
Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer


Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Koji Sekiguchi

Hi,

<analyzer> ... </analyzer> must have one and only one <tokenizer> and
it can have zero or more <filter>s. From the point of view of the
rules, your <analyzer type="index"> ... </analyzer> is not correct
because it has more than one <tokenizer>, and <analyzer type="query">
... </analyzer> is not correct as well because it has no <tokenizer>.

Koji

On 2016/03/02 20:25, G, Rajesh wrote:

Hi Team,

Can you please clarify the below. My understanding is tokenizer is used to say how the
content should be indexed physically in the file system. Filters are used to query results. The
below lines are from my setup. But I have seen examples that include filters inside <analyzer type="index"> and a tokenizer in <analyzer type="query">, and that confused me.

<fieldType ... positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
  <analyzer type="query">
    <filter ... minGramSize="2" maxGramSize="2"/>
  </analyzer>
</fieldType>

My goal is to use Solr and find the best match among the technology names e.g
Actual tech name

1.   Microsoft Visual Studio

2.   Microsoft Internet Explorer

3.   Microsoft Visio

When a user types "Microsoft Visal Studio" the user should get Microsoft Visual Studio.
Basically, misspelled and jumbled words should match the closest tech name.











Query solrcloud questions

2016-03-02 Thread michael solomon
Hi,
I Installed 3 instances of SolrCloud 5.4.1.
I'm building a little search engine for websites and I'm storing their info as
nested documents (one document for the website's general information, and its
children are the pages inside the website).
So when I'm querying this collection I'm using a BlockJoin parser({!parent
which="is_parent:true" score="max"}).
I have several questions:

   1. How to use highlight with BlockJoin ? when I tried to use it my
   result's highlight returned empty.
   2. How to return the relevant child? I.e. the child because of which this
   parent was returned as a result (the child with the highest score?).
   3. Am I boosting right?:

>{!parent which="is_parent:true" score="max"}
>(
>normal_text:("clients home"~1000)
>h_titles:("clients home"~1000)^3
>title:("clients home"~1000)^5
>depth:0^1.1
>)


Thank you,
Michael


Re: Commit after every document - alternate approach

2016-03-02 Thread Varun Thacker
Hi Sangeetha,

Well, I don't think you need to commit after every document add.

You can rely on Solr's transaction log feature. If you are using SolrCloud,
it's mandatory to have a transaction log, so every document gets written
to the tlog. Now say a node crashes: even if documents were not committed,
since they are present in the tlog, Solr will replay them on startup.

Also if you are using SolrCloud and have multiple replicas , you should use
the min_rf feature to make sure that N replicas acknowledge the write
before you get back success -
https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
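
A rough SolrJ sketch of the min_rf idea (collection and field names are
placeholders, and getMinAchievedReplicationFactor is the CloudSolrClient helper I
believe reads the achieved factor back -- check it against your SolrJ version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfExample {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("mycollection");
    try {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");

      UpdateRequest req = new UpdateRequest();
      req.setParam("min_rf", "2");   // ask Solr to report whether 2 replicas saw the update
      req.add(doc);
      UpdateResponse resp = req.process(client);

      // Achieved replication factor, as reported back by the cluster.
      int achievedRf = client.getMinAchievedReplicationFactor("mycollection", resp.getResponse());
      if (achievedRf < 2) {
        // fewer replicas than requested acknowledged the write -- log it or retry
        System.err.println("update only reached rf=" + achievedRf);
      }
    } finally {
      client.close();
    }
  }
}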

On Wed, Mar 2, 2016 at 3:41 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Sangeetha,
> What is sure is that it is not going to work - with 200-300K doc/hour,
> there will be >50 commits/second, meaning there are <20ms time for
> doc+commit.
> You can do is let Solr handle commits and maybe use real time get to
> verify doc is in Solr or do some periodic sanity checks.
> Are you doing document updates, so that in-order updates in Solr are the reason
> why you commit each doc before moving to the next doc?
>
> Regards,
> Emir
>
>
> On 02.03.2016 09:06, sangeetha.subraman...@gtnexus.com wrote:
>
>> Hi All,
>>
>> I am trying to understand on how we can have commit issued to solr while
>> indexing documents. Around 200K to 300K document/per hour with an avg size
>> of 10 KB size each will be getting into SOLR . JAVA code fetches the
>> document from MQ and streamlines it to SOLR. The problem is the client code
>> issues hard-commit after each document which is sent to SOLR for indexing
>> and it waits for the response from SOLR to get assurance whether the
>> document got indexed successfully. Only if it gets a OK status from SOLR
>> the document is cleared out from SOLR.
>>
>> As far as I understand doing a commit after each document is an expensive
>> operation. But we need to make sure that all the documents which are put
>> into MQ gets indexed in SOLR. Is there any other way of getting this done ?
>> Please let me know.
>> If we do a batch indexing, is there any chances we can identify if some
>> documents is missed from indexing ?
>>
>> Thanks
>> Sangeetha
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


-- 


Regards,
Varun Thacker


BlockJoinQuery parser and ArrayIndexOutOfBoundException

2016-03-02 Thread Sathyakumar Seshachalam
Am running in to this issue : https://issues.apache.org/jira/browse/SOLR-7606. 
But am not following all of the description there in that ticket.

But what I am not able to understand is when parent/child orthogonality is
broken. And what does a child document without a parent mean?

I have a set of documents that have been added to Solr (via an import from the DB).
Then, in another process, I fetch or recreate from the DB the SolrInputDocuments for
those documents for which the relation needs to be in place. And before adding them to
Solr, I make sure all the parent and child documents are deleted from Solr, and
then I add this block of documents into Solr.

And now when I query I get an ArrayIndexOutOfBoundException exactly as 
specified in that JIRA issue.
Any insights on how and what should be done will be greatly appreciated.





RE: outlook email file pst extraction problem

2016-03-02 Thread Allison, Timothy B.
This is probably more of a Tika question now...

It sounds like Tika is not extracting dates from the .eml files that you are 
generating?  To confirm, you are able to extract dates with libpst...it is just 
that Tika is not able to process the dates that you are sending it in your .eml 
files?

If you are able to share an .eml file (either via personal email or open a 
ticket on Tika's jira if you think this is a bug in Tika), I can take a look.

-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] 
Sent: Monday, February 29, 2016 7:17 PM
To: solr-user@lucene.apache.org
Subject: Re: outlook email file pst extraction problem

Thanks Timothy for your prompt help.

 I tried first option. I am able to extract .eml ( MIME format) files from PST 
file using libpst library.
 I am not able extract .msg ( outlook emails) files using libpst library. I am 
able to feed .eml files into SOLR.
 I can see some of tags are missing in the extraction of .eml files in SOLR. 
Specially date tags are missing in the .eml file tags comparative with .msg 
file generated tags. How to generate date tags with .eml files.
My SOLR program stopped working due lack of date tags and same program worked 
file  with .msg files. Any suggestion to generate date tags with .eml  files?  
Is it good idea to look JPST or aspose ( both are 3rd party libraries to 
extract .msg files from PST file) for case?

Advanced Thanks.

--sreenivasa kallu

On Thu, Feb 11, 2016 at 11:55 AM, Allison, Timothy B. 
wrote:

> Should have looked at how we handle psts before earlier responsesorry.
>
> What you're seeing is Tika's default treatment of embedded documents, 
> it concatenates them all into one string.  It'll do the same thing for 
> zip files and other container files.  The default Tika format is 
> xhtml, and we include tags that show you where the attachments are.  
> If the tags are stripped, then you only get a big blob of text, which 
> is often all that's necessary for search.
>
> Before SOLR-7189, you wouldn't have gotten any content, so that's 
> progress...right?
>
> Some options for now:
> 1) use java-libpst as a preprocessing step to extract contents from 
> your psts before you ingest them in Solr (feel free to borrow code 
> from our OutlookPSTParser).
> 2) use tika from the commandline with the -J -t options to get a Json 
> representation of the overall file, which includes a list of maps, 
> where each map represents a single embedded file.  Again, if you have 
> any questions on this, head over to u...@tika.apache.org
>
> I think what you want is something along the lines of SOLR-7229, which 
> would treat each embedded document as its own document.  That issue is 
> not resolved, and there's currently no way of doing this within DIH 
> that I'm aware of.
>
> If others on this list have an interest in SOLR-7229, let me know, and 
> I'll try to find some time.  I'd need feedback on some design decisions.
>
>
>
>
>
> -Original Message-
> From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com]
> Sent: Thursday, February 11, 2016 1:43 PM
> To: solr-user@lucene.apache.org
> Subject: outlook email file pst extraction problem
>
> Hi ,
>I am currently indexing individual outlook messages and 
> searching is working fine.
> I have created solr core using following command.
>  ./solr create -c sreenimsg1 -d data_driven_schema_configs
>
> I am using following command to index individual messages.
> curl  "
>
> http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&up
> refix=attr_&fmap.content=attr_content&commit=true
> "
> -F "myfile=@/home/ec2-user/msg9.msg"
>
> This setup is working fine.
>
> But the new requirement is to extract messages from an Outlook PST file.
> I tried the following command to extract messages from the Outlook PST file.
>
> curl  "
>
> http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&u
> prefix=attr_&fmap.content=attr_content&commit=true
> "
> -F "myfile=@/home/ec2-user/sateamc_0006.pst"
>
> This command extracts only high-level tags and extracts all messages into 
> one message. I am not getting all the tags that I get when extracting 
> individual messages. Is the above command correct? Is the problem that it is 
> not using recursion?
>  How do I add recursion to the above command? Is it a Tika library problem?
>
> Please help to solve above problem.
>
> Advanced Thanks.
>
> --sreenivasa kallu
>


Re: [ISSUE] After restoring data to a Solrcloud instance

2016-03-02 Thread Varun Thacker
Could you post the full output of the CheckIndex command on the restored
snapshot? Also what happens if you delete the snapshot indexes and attempt
to restore again? Does it get corrupted again or is it a one off scenario?

On Wed, Mar 2, 2016 at 3:44 PM, Janit Anjaria (Tech-IT) <
anjaria.ja...@flipkart.com> wrote:

> Hi,
>
> Varun, we actually ran the test for our restored data snapshot and it
> threw an error saying "Broken segment".
>
> How is it possible that the same test gives success on the snapshot, but
> not on the restored snapshot? Can you please throw some light on this, so
> we can proceed and fix this issue.
>
> Regards,
> Janit
>
> On Tue, Mar 1, 2016 at 12:05 PM, Varun Thacker  > wrote:
>
>> Hi Janit,
>>
>> Please ask these questions on the solr-user mailing list. There is no point
>> in double posting to the dev list as well.
>>
>> Looking at the stacktrace it looks like the index files are corrupted.
>> On the backed up index can you run the CheckIndex
>> command to see what it says. The command would look something like this
>> - :
>> java -cp SOLR_HOME/example/webapps/WEB-INF/lib/lucene-core-[version].jar
>> -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
>> /path/to/backup/data/index
>>
>> If the CheckIndex command says it's corrupted then there is something
>> wrong with the backup. If CheckIndex
>>  says the index is fine then there might be something wrong with the
>> restore process, in which case we'll need to dig deeper.
>>
>> On Tue, Mar 1, 2016 at 11:21 AM, Janit Anjaria (Tech-IT) <
>> anjaria.ja...@flipkart.com> wrote:
>>
>>> Hi,
>>>
>>> Please find below the console logs for the above issue we have been
>>> facing:
>>>
>>> org.apache.solr.common.SolrException: Error opening new searcher
>>> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:824)
>>> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:665)
>>> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:746)
>>> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:466)
>>> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:457)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>> at 
>>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1667)
>>> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1778)
>>> at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:920)
>>> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
>>> ... 9 more
>>> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer 
>>> mismatch (file truncated?): actual footer=438025 vs expected 
>>> footer=-1071082520 
>>> (resource=MMapIndexInput(path="/var/solr/data/fk-ekl-mmi-mapsearch_shard1_replica1/data/restore.snapshot.20160204065051901/_1tdyo.cfs"))
>>> at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:416)
>>> at 
>>> org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:401)
>>> at 
>>> org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.<init>(Lucene50CompoundReader.java:79)
>>> at 
>>> org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:71)
>>> at 
>>> org.apache.lucene.index.IndexWriter.readFieldInfos(IndexWriter.java:1016)
>>> at 
>>> org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:1033)
>>> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
>>> at 
>>> org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:79)
>>> at 
>>> org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:66)
>>> at 
>>> org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:235)
>>> at 
>>> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:109)
>>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1636)
>>> ... 12 more
>>>
>>> 3/1/2016, 11:19:44 AM ERROR null CoreContainer Error waiting for
>>> SolrCore to be created
>>>
>>> java.util.concurrent.ExecutionException: 
>>> org.apache.solr.common.SolrException: Unable to create core 
>>> [fk-ekl-mmi-mapsearch_shard1_replica1]
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:188)
>>> at org.apache.solr.core.CoreContainer$2.run(CoreContainer.java:495)
>>> at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>> at 
>

FW: Difference Between Tokenizer and filter

2016-03-02 Thread G, Rajesh
Hi Team,

Can you please clarify the below? My understanding is that the tokenizer is used to say 
how the content should be indexed physically in the file system, and that filters are 
used to query results. The lines below are from my setup, but I have seen examples that 
include filters inside the index analyzer and a tokenizer in the query analyzer, and 
that confused me.

[field type / analyzer configuration XML stripped from the archived message]

My goal is to use Solr to find the best match among technology names, e.g. these 
actual tech names:

1.   Microsoft Visual Studio

2.   Microsoft Internet Explorer

3.   Microsoft Visio

When a user types "Microsoft Visal Studio", the user should get "Microsoft Visual Studio". 
Basically, misspelled and jumbled words should match the closest tech name.









Re: [ISSUE] After restoring data to a Solrcloud instance

2016-03-02 Thread Janit Anjaria (Tech-IT)
Hi,

Varun, we actually ran the test for our restored data snapshot and it threw
an error saying "Broken segment".

How is it possible that the same test gives success on the snapshot, but
not on the restored snapshot? Can you please throw some light on this, so
we can proceed and fix this issue.

Regards,
Janit

On Tue, Mar 1, 2016 at 12:05 PM, Varun Thacker 
wrote:

> Hi Janit,
>
> Please ask these questions on the solr-user mailing list. There is no point
> in double posting to the dev list as well.
>
> Looking at the stacktrace it looks like the index files are corrupted.  On
> the backed up index can you run the CheckIndex
> command to see what it says. The command would look something like this
> - :
> java -cp SOLR_HOME/example/webapps/WEB-INF/lib/lucene-core-[version].jar
> -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
> /path/to/backup/data/index
>
> If the CheckIndex command says it's corrupted then there is something wrong
> with the backup. If CheckIndex says the index is fine then there might be
> something wrong with the restore process, in which case we'll need to dig deeper.
>
> On Tue, Mar 1, 2016 at 11:21 AM, Janit Anjaria (Tech-IT) <
> anjaria.ja...@flipkart.com> wrote:
>
>> Hi,
>>
>> Please find below the console logs for the above issue we have been
>> facing:
>>
>> org.apache.solr.common.SolrException: Error opening new searcher
>>  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:824)
>>  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:665)
>>  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:746)
>>  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:466)
>>  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:457)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>  at 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:744)
>> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>>  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1667)
>>  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1778)
>>  at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:920)
>>  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
>>  ... 9 more
>> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer 
>> mismatch (file truncated?): actual footer=438025 vs expected 
>> footer=-1071082520 
>> (resource=MMapIndexInput(path="/var/solr/data/fk-ekl-mmi-mapsearch_shard1_replica1/data/restore.snapshot.20160204065051901/_1tdyo.cfs"))
>>  at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:416)
>>  at 
>> org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:401)
>>  at 
>> org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.<init>(Lucene50CompoundReader.java:79)
>>  at 
>> org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:71)
>>  at 
>> org.apache.lucene.index.IndexWriter.readFieldInfos(IndexWriter.java:1016)
>>  at 
>> org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:1033)
>>  at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
>>  at 
>> org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:79)
>>  at 
>> org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:66)
>>  at 
>> org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:235)
>>  at 
>> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:109)
>>  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1636)
>>  ... 12 more
>>
>> 3/1/2016, 11:19:44 AM ERROR null CoreContainer Error waiting for
>> SolrCore to be created
>>
>> java.util.concurrent.ExecutionException: 
>> org.apache.solr.common.SolrException: Unable to create core 
>> [fk-ekl-mmi-mapsearch_shard1_replica1]
>>  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>  at java.util.concurrent.FutureTask.get(FutureTask.java:188)
>>  at org.apache.solr.core.CoreContainer$2.run(CoreContainer.java:495)
>>  at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>  at 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:744)
>> Caused by: org.apache.solr.common.SolrException: Unab

Re: understand scoring

2016-03-02 Thread michael solomon
Hi Emir,
This morning I deleted those documents and have now added them again to re-run the
query... and now it behaves the way I expect (0_0), and I can't reproduce the
problem... this is weird.. :\

On Wed, Mar 2, 2016 at 11:38 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Michael,
> Can you please run query with debug and share title field configuration.
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> On 02.03.2016 09:14, michael solomon wrote:
>
>> Thank you, @Doug Turnbull. I tried http://splainer.io but it doesn't work for my
>> query (there is no explain for the docs...).
>> Here is the picture again...
>>
>> https://drive.google.com/file/d/0B-7dnH4rlntJc2ZWdmxMS3RDMGc/view?usp=sharing
>>
>> On Tue, Mar 1, 2016 at 10:06 PM, Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
>> Supposedly Late April, early May. But don't hold me to it until I see copy
>>> edits :) Of course looks like now you can read at least the full ebook in
>>> MEAP form.
>>>
>>> -Doug
>>>
>>> On Tue, Mar 1, 2016 at 2:57 PM, shamik  wrote:
>>>
>>> Doug, do we've a date for the hard copy launch?



 --
 View this message in context:


>>> http://lucene.472066.n3.nabble.com/understand-scoring-tp4260837p4260860.html
>>>
 Sent from the Solr - User mailing list archive at Nabble.com.


>>>
>>> --
>>> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
>>> , LLC | 240.476.9983
>>> Author: Relevant Search 
>>>
>>>


Re: Commit after every document - alternate approach

2016-03-02 Thread Emir Arnautovic

Hi Sangeetha,
What is sure is that this is not going to work - at 200-300K docs/hour there 
will be >50 commits/second, meaning there is less than 20ms for each 
doc+commit.
What you can do is let Solr handle commits and maybe use real-time get to 
verify that a doc is in Solr, or do some periodic sanity checks.
Are you doing document updates, so that keeping updates in order in Solr is 
the reason why you commit each doc before moving on to the next one?
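
A minimal SolrJ sketch of that suggestion (collection name, field names, the 
messageId/messageBody variables and the acknowledgeMessage() call are assumptions 
standing in for the MQ side; exception handling omitted):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("documents");   // assumed collection name

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", messageId);
    doc.addField("content", messageBody);
    UpdateResponse rsp = client.add(doc);       // no explicit commit; autoCommit /
                                                // autoSoftCommit in solrconfig.xml
                                                // handle durability and visibility
    if (rsp.getStatus() == 0 && client.getById(messageId) != null) {
        // real-time get reads from the update log, so the document is found
        // even before the next commit - only now remove it from the MQ
        acknowledgeMessage(messageId);
    }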


Regards,
Emir

On 02.03.2016 09:06, sangeetha.subraman...@gtnexus.com wrote:

Hi All,

I am trying to understand how we can have commits issued to Solr while indexing 
documents. Around 200K to 300K documents per hour, with an average size of 10 KB 
each, will be going into SOLR. Java code fetches each document from MQ and streams 
it to SOLR. The problem is that the client code issues a hard commit after each 
document sent to SOLR for indexing and waits for the response from SOLR for 
assurance that the document was indexed successfully. Only if it gets an OK status 
from SOLR is the document cleared out of the MQ.

As far as I understand, doing a commit after each document is an expensive 
operation. But we need to make sure that all the documents which are put into the 
MQ get indexed in SOLR. Is there any other way of getting this done? Please 
let me know.
If we do batch indexing, is there any chance we can identify whether some 
documents were missed from indexing?

Thanks
Sangeetha



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: understand scoring

2016-03-02 Thread Emir Arnautovic

Hi Michael,
Can you please run query with debug and share title field configuration.
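"Run with debug" here just means adding debugQuery=true to the request, e.g. (the 
collection name and query are placeholders):

    curl "http://localhost:8983/solr/<your_collection>/select?q=title:something&debugQuery=true&wt=json"

The "explain" section of the debug output shows how each document's score was computed.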

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 02.03.2016 09:14, michael solomon wrote:

Thank you, @Doug Turnbull. I tried http://splainer.io but it doesn't work for my
query (there is no explain for the docs...).
Here is the picture again...
https://drive.google.com/file/d/0B-7dnH4rlntJc2ZWdmxMS3RDMGc/view?usp=sharing

On Tue, Mar 1, 2016 at 10:06 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:


Supposedly Late April, early May. But don't hold me to it until I see copy
edits :) Of course looks like now you can read at least the full ebook in
MEAP form.

-Doug

On Tue, Mar 1, 2016 at 2:57 PM, shamik  wrote:


Doug, do we've a date for the hard copy launch?



--
View this message in context:


http://lucene.472066.n3.nabble.com/understand-scoring-tp4260837p4260860.html

Sent from the Solr - User mailing list archive at Nabble.com.




--
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 



Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Emir Arnautovic

Hi Rajesh,
The processing flow is the same for both indexing and querying; what is compared 
at the end are the resulting tokens. In general the flow is: text -> char filter 
-> filtered text -> tokenizer -> tokens -> filter1 -> tokens ... -> 
filterN -> tokens.


You can read more about analysis chain in Solr wiki: 
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters
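
To make that concrete, here is an illustrative field type (not Rajesh's stripped 
config) where both the index and query analyzers contain a tokenizer plus filters:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

Both chains have exactly one tokenizer and any number of filters; they differ only in 
which components they use, so seeing filters in an index analyzer and a tokenizer in a 
query analyzer is perfectly normal.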


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 02.03.2016 10:00, G, Rajesh wrote:

Hi Team,

Can you please clarify the below? My understanding is that the tokenizer is used to say how the 
content should be indexed physically in the file system, and that filters are used to query 
results. The lines below are from my setup, but I have seen examples that include filters inside 
the index analyzer and a tokenizer in the query analyzer, and that confused me.

[field type / analyzer configuration XML stripped from the archived message]

My goal is to use Solr to find the best match among technology names, e.g. these 
actual tech names:

1.   Microsoft Visual Studio

2.   Microsoft Internet Explorer

3.   Microsoft Visio

When a user types "Microsoft Visal Studio", the user should get "Microsoft Visual Studio". 
Basically, misspelled and jumbled words should match the closest tech name.









FW: Difference Between Tokenizer and filter

2016-03-02 Thread G, Rajesh
Hi Team,

Can you please clarify the below? My understanding is that the tokenizer is used to say 
how the content should be indexed physically in the file system, and that filters are 
used to query results. The lines below are from my setup, but I have seen examples that 
include filters inside the index analyzer and a tokenizer in the query analyzer, and 
that confused me.

[field type / analyzer configuration XML stripped from the archived message]

My goal is to use Solr to find the best match among technology names, e.g. these 
actual tech names:

1.   Microsoft Visual Studio

2.   Microsoft Internet Explorer

3.   Microsoft Visio

When a user types "Microsoft Visal Studio", the user should get "Microsoft Visual Studio". 
Basically, misspelled and jumbled words should match the closest tech name.









Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
If any of you cares about your Stack Overflow reputation and has time for it,
I also opened a question there:
http://stackoverflow.com/questions/35722672/solr-schema-to-model-books-chapters-and-pages.
Thanks again to everybody

Il giorno mer 2 mar 2016 alle ore 09:42 Zaccheo Bagnati 
ha scritto:

> Thanks Alexandre,
> your solution seems very good: I'll surely try it and let you know. I like
> the Idea of mixing blockjoins and grouping!
>
>
> Il giorno mer 2 mar 2016 alle ore 04:46 Alexandre Rafalovitch <
> arafa...@gmail.com> ha scritto:
>
>> Here is an - untested - possible approach. I might be missing
>> something by combining these things in too many layers, but.
>>
>> 1) Have chapter as parent documents and pages as children within that.
>> Block index them together.
>> 2) On pages, include page text (probably not stored) as one field.
>> Also include a second field that has last paragraph of that page as
>> well as first paragraph of the next page. This gives you phrase
>> matches across boundaries. Also include pageId, etc.
>> 3) On chapters, include book id as a string field.
>> 4) Use block join query to search against pages, but return (parent)
>> chapters
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
>> 5) Use grouping or collapsing+expanding by book id to group chapters
>> within a book:
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>> or
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>> 6) Use [child] DocumentTransformer to get pages back with childFilter
>> to re-limit them by your query:
>>
>> https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
>>
>> The main question is whether 6) will be able to piggyback on the
>> output of 5).. And, of course, the performance...
>>
>> I would love to know if this works, even partially. Either on the
>> mailing list or directly.
>>
>> Regards,
>>Alex.
>>
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>> On 2 March 2016 at 00:50, Zaccheo Bagnati  wrote:
>> > Thank you, Jack for your answer.
>> > There are 2 reasons:
>> > 1. the requirement is to show in the result list both books and chapters
>> > grouped, so I would have to execute the query grouping by book, retrieve
>> > first, let's say, 10 books (sorted by relevance) and then for each book
>> > repeat the query grouping by chapter (always ordering by relevance) in
>> > order to obtain what we need (unfortunately it is not up to me defining
>> the
>> > requirements... but it however make sense). Unless there exist some SOLR
>> > feature to do this in only one call (and that would be great!).
>> > 2. searching on pages will not match phrases that spans across 2 pages
>> > (e.g. if last word of page 1 is "broken" and first word of page 2 is
>> > "sentence" searching for "broken sentence" will not match)
>> > However if we will not find a better solution I think that your
>> proposal is
>> > not so bad... I hope that reason #2 could be negligible and that #1
>> > performs quite fast though we are multiplying queries.
>> >
>> > Il giorno mar 1 mar 2016 alle ore 14:28 Jack Krupansky <
>> > jack.krupan...@gmail.com> ha scritto:
>> >
>> >> Any reason not to use the simplest structure - each page is one Solr
>> >> document with a book field, a chapter field, and a page text field?
>> You can
>> >> then use grouping to group results by book (title text) or even chapter
>> >> (title text and/or number). Maybe initially group by book and then if
>> the
>> >> user selects a book group you can re-query with the specific book and
>> then
>> >> group by chapter.
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
>> >> wrote:
>> >>
>> >> > Original data is quite well structured: it comes in XML with
>> chapters and
>> >> > tags to mark the original page breaks on the paper version. In this
>> way
>> >> we
>> >> > have the possibility to restructure it almost as we want before
>> creating
>> >> > SOLR index.
>> >> >
>> >> > Il giorno mar 1 mar 2016 alle ore 14:04 Jack Krupansky <
>> >> > jack.krupan...@gmail.com> ha scritto:
>> >> >
>> >> > > To start, what is the form of your input data - is it already
>> divided
>> >> > into
>> >> > > chapters and pages? Or... are you starting with raw PDF files?
>> >> > >
>> >> > >
>> >> > > -- Jack Krupansky
>> >> > >
>> >> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <
>> zacch...@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > > I'm searching for ideas on how to define schema and how to
>> perform
>> >> > > queries
>> >> > > > in this use case: we have to index books, each book is split into
>> >> > > chapters
>> >> > > > and chapters are split into pages (pages represent original page
>> >> > cutting
>> >> > > in
>> >> > > > printed v

Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Alexandre,
your solution seems very good: I'll surely try it and let you know. I like
the idea of mixing block joins and grouping!

Il giorno mer 2 mar 2016 alle ore 04:46 Alexandre Rafalovitch <
arafa...@gmail.com> ha scritto:

> Here is an - untested - possible approach. I might be missing
> something by combining these things in too many layers, but.
>
> 1) Have chapter as parent documents and pages as children within that.
> Block index them together.
> 2) On pages, include page text (probably not stored) as one field.
> Also include a second field that has last paragraph of that page as
> well as first paragraph of the next page. This gives you phrase
> matches across boundaries. Also include pageId, etc.
> 3) On chapters, include book id as a string field.
> 4) Use block join query to search against pages, but return (parent)
> chapters
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
> 5) Use grouping or collapsing+expanding by book id to group chapters
> within a book:
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> or
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> 6) Use [child] DocumentTransformer to get pages back with childFilter
> to re-limit them by your query:
>
> https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
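
Pulled together, a request for steps 4-6 might look roughly like this (field names 
such as doc_type, book_id, chapter_title and page_text are assumptions, not from this 
thread):

    q={!parent which="doc_type:chapter"}page_text:sentence
    fq={!collapse field=book_id}
    fl=id,chapter_title,[child parentFilter=doc_type:chapter childFilter=page_text:sentence limit=5]

Whether the [child] transformer combines cleanly with collapse/grouping is exactly the 
open question raised in the next paragraph.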
>
> The main question is whether 6) will be able to piggyback on the
> output of 5).. And, of course, the performance...
>
> I would love to know if this works, even partially. Either on the
> mailing list or directly.
>
> Regards,
>Alex.
>
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 2 March 2016 at 00:50, Zaccheo Bagnati  wrote:
> > Thank you, Jack for your answer.
> > There are 2 reasons:
> > 1. the requirement is to show in the result list both books and chapters
> > grouped, so I would have to execute the query grouping by book, retrieve
> > first, let's say, 10 books (sorted by relevance) and then for each book
> > repeat the query grouping by chapter (always ordering by relevance) in
> > order to obtain what we need (unfortunately it is not up to me defining
> the
> > requirements... but it however make sense). Unless there exist some SOLR
> > feature to do this in only one call (and that would be great!).
> > 2. searching on pages will not match phrases that spans across 2 pages
> > (e.g. if last word of page 1 is "broken" and first word of page 2 is
> > "sentence" searching for "broken sentence" will not match)
> > However if we will not find a better solution I think that your proposal
> is
> > not so bad... I hope that reason #2 could be negligible and that #1
> > performs quite fast though we are multiplying queries.
> >
> > Il giorno mar 1 mar 2016 alle ore 14:28 Jack Krupansky <
> > jack.krupan...@gmail.com> ha scritto:
> >
> >> Any reason not to use the simplest structure - each page is one Solr
> >> document with a book field, a chapter field, and a page text field? You
> can
> >> then use grouping to group results by book (title text) or even chapter
> >> (title text and/or number). Maybe initially group by book and then if
> the
> >> user selects a book group you can re-query with the specific book and
> then
> >> group by chapter.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
> >> wrote:
> >>
> >> > Original data is quite well structured: it comes in XML with chapters
> and
> >> > tags to mark the original page breaks on the paper version. In this
> way
> >> we
> >> > have the possibility to restructure it almost as we want before
> creating
> >> > SOLR index.
> >> >
> >> > Il giorno mar 1 mar 2016 alle ore 14:04 Jack Krupansky <
> >> > jack.krupan...@gmail.com> ha scritto:
> >> >
> >> > > To start, what is the form of your input data - is it already
> divided
> >> > into
> >> > > chapters and pages? Or... are you starting with raw PDF files?
> >> > >
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati  >
> >> > > wrote:
> >> > >
> >> > > > Hi all,
> >> > > > I'm searching for ideas on how to define schema and how to perform
> >> > > queries
> >> > > > in this use case: we have to index books, each book is split into
> >> > > chapters
> >> > > > and chapters are split into pages (pages represent original page
> >> > cutting
> >> > > in
> >> > > > printed version). We should show the result grouped by books and
> >> > chapters
> >> > > > (for the same book) and pages (for the same chapter). As far as I
> >> know,
> >> > > we
> >> > > > have 2 options:
> >> > > >
> >> > > > 1. index pages as SOLR documents. In this way we could
> theoretically
> >> > > > retrieve chapters (and books?)  using grouping but
> >> > > > a. we will miss matches across two contiguous pages (page
> cutting
> >> > is

Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Jack,
the chapter is definitely the optimal unit to search in, and your solution
seems quite a good approach. The downside is that, depending on how much text
we choose to share between two adjacent pages, we will experience some errors.
For example, it will always be possible to find a matching chapter but no
matching page (because the searched terms are too far apart). Let's see if
this could be tolerable.

Il giorno mar 1 mar 2016 alle ore 17:44 Jack Krupansky <
jack.krupan...@gmail.com> ha scritto:

> The chapter seems like the optimal unit for initial searches - just combine
> the page text with a line break between them or index as a multivalued
> field and set the position increment gap to be 1 so that phrases work.
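
For the multivalued variant, a sketch of the schema side (names are illustrative) 
would be:

    <fieldType name="text_pages" class="solr.TextField" positionIncrementGap="1">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="chapter_text" type="text_pages" indexed="true" stored="true" multiValued="true"/>

Each page is then added as one value of chapter_text on the chapter document; the 
small position increment gap keeps the last token of one page and the first token of 
the next close together, so phrases can still match across the page boundary (with a 
gap of 1, a small phrase slop may be needed).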
>
> You could have a separate collection for pages, with each page as a Solr
> document, but include the last line of text from the previous page and the
> first line of text from the next page so that phrases will match across
> page boundaries. Unfortunately, that may also result in false hits if the
> full phrase is found on the two adopted lines. That would require some
> special filtering to eliminate those false positives.
>
> There is also the question of maximum phrase size - most phrases tend to be
> reasonably short, but sometimes people may want to search for an entire
> paragraph (e.g., a quote) that may span multiple lines on two adjacent
> pages.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 11:30 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
> > Hi,
> > From the top of my head - probably does not solve problem completely, but
> > may trigger brainstorming: Index chapters and include page break tokens.
> > Use highlighting to return matches and make sure fragment size is large
> > enough to get page break token. In such scenario you should use slop for
> > phrase searches...
> >
> > More I write it, less I like it, but will not delete...
> >
> > Regards,
> > Emir
> >
> >
> > On 01.03.2016 12:56, Zaccheo Bagnati wrote:
> >
> >> Hi all,
> >> I'm searching for ideas on how to define schema and how to perform
> queries
> >> in this use case: we have to index books, each book is split into
> chapters
> >> and chapters are split into pages (pages represent original page cutting
> >> in
> >> printed version). We should show the result grouped by books and
> chapters
> >> (for the same book) and pages (for the same chapter). As far as I know,
> we
> >> have 2 options:
> >>
> >> 1. index pages as SOLR documents. In this way we could theoretically
> >> retrieve chapters (and books?)  using grouping but
> >>  a. we will miss matches across two contiguous pages (page cutting
> is
> >> only due to typographical needs so concepts could be split... as in
> >> printed
> >> books)
> >>  b. I don't know if it is possible in SOLR to group results on two
> >> different levels (books and chapters)
> >>
> >> 2. index chapters as SOLR documents. In this case we will have the right
> >> matches but how to obtain the matching pages? (we need pages because the
> >> client can only display pages)
> >>
> >> we have been struggling on this problem for a lot of time and we're  not
> >> able to find a suitable solution so I'm looking if someone has ideas or
> >> has
> >> already solved a similar issue.
> >> Thanks
> >>
> >>
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
>


Standard highlighting doesn't work for Block Join

2016-03-02 Thread michael solomon
IT WAS MY FIRST POST ON THE MAILING LIST AND I'M NOT SURE IF YOU GOT IT, SO I'M
SENDING IT AGAIN

Hi,
I have Solr 5.4.1 and I'm trying to use the Block Join Query Parser to search
in children and return the parent.
I want to apply highlighting on the children but it returns empty.
My q parameter: "q={!parent which="is_parent:true"} normal_text:(account)"
highlight parameters:
"hl=true&hl.fl=normal_text&hl.simple.pre=&hl.simple.post="

and return:

> "highlighting": { "chikora.com": {} }
>

("chikora.com" is the id of the parent document)
It looks like this was already solved here:
https://issues.apache.org/jira/browse/LUCENE-5929
but I don't understand how to use it.

Thanks,
Michael
P.S: sorry about my English.. working on it :)


Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Emir,
a similar solution had already come to my mind too: searching on chapters,
highlighting the result and retrieving the matching pages by parsing the
highlighted result... surely not a very efficient approach, but it could work...
However, I think I'll try different approaches before this one.

Il giorno mar 1 mar 2016 alle ore 17:30 Emir Arnautovic <
emir.arnauto...@sematext.com> ha scritto:

> Hi,
>  From the top of my head - probably does not solve problem completely,
> but may trigger brainstorming: Index chapters and include page break
> tokens. Use highlighting to return matches and make sure fragment size
> is large enough to get page break token. In such scenario you should use
> slop for phrase searches...
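
In request-parameter terms that idea is roughly the following (field name, slop and 
fragment size are only illustrative):

    q=chapter_text:"broken sentence"~2
    hl=true
    hl.fl=chapter_text
    hl.fragsize=500

The client would then look for the page-break token inside the returned fragment to 
work out which page the match falls on.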
>
> More I write it, less I like it, but will not delete...
>
> Regards,
> Emir
>
> On 01.03.2016 12:56, Zaccheo Bagnati wrote:
> > Hi all,
> > I'm searching for ideas on how to define schema and how to perform
> queries
> > in this use case: we have to index books, each book is split into
> chapters
> > and chapters are split into pages (pages represent original page cutting
> in
> > printed version). We should show the result grouped by books and chapters
> > (for the same book) and pages (for the same chapter). As far as I know,
> we
> > have 2 options:
> >
> > 1. index pages as SOLR documents. In this way we could theoretically
> > retrieve chapters (and books?)  using grouping but
> >  a. we will miss matches across two contiguous pages (page cutting is
> > only due to typographical needs so concepts could be split... as in
> printed
> > books)
> >  b. I don't know if it is possible in SOLR to group results on two
> > different levels (books and chapters)
> >
> > 2. index chapters as SOLR documents. In this case we will have the right
> > matches but how to obtain the matching pages? (we need pages because the
> > client can only display pages)
> >
> > we have been struggling on this problem for a lot of time and we're  not
> > able to find a suitable solution so I'm looking if someone has ideas or
> has
> > already solved a similar issue.
> > Thanks
> >
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: understand scoring

2016-03-02 Thread michael solomon
Thank you, @Doug Turnbull. I tried http://splainer.io but it doesn't work for my
query (there is no explain for the docs...).
Here is the picture again...
https://drive.google.com/file/d/0B-7dnH4rlntJc2ZWdmxMS3RDMGc/view?usp=sharing

On Tue, Mar 1, 2016 at 10:06 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Supposedly Late April, early May. But don't hold me to it until I see copy
> edits :) Of course looks like now you can read at least the full ebook in
> MEAP form.
>
> -Doug
>
> On Tue, Mar 1, 2016 at 2:57 PM, shamik  wrote:
>
> > Doug, do we've a date for the hard copy launch?
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/understand-scoring-tp4260837p4260860.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
>


Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread danny teichthal
Thanks Jeff,
I understand your philosophy and it sounds correct.
Since we had many problems with ZooKeeper when switching to SolrCloud, we
couldn't treat it as the source of truth and had to rely on a more stable
source.
The issue is that when we got such a ZooKeeper event it brought our system
down, and in that case clearing the core.properties files was a life saver.
We've managed to make it pretty stable now, but we will always need a
"doomsday" weapon.

I looked into the related JIRA and it confused me a little, and raised a
few other questions:
1. What exactly defines ZooKeeper as the source of truth?
2. What is the role of core.properties if the state is only in zookeeper?



Your tool is very interesting; I had just been thinking about writing such a
tool myself.
From the sources I understand that you represent each node as a path in the
git repository.
So I guess that for restore purposes I will have to go in the opposite
direction and create a node for every path entry.




On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes  wrote:

>
> I’ve been running SolrCloud clusters in various versions for a few years
> here, and I can only think of two or three cases that the ZK-stored cluster
> state was broken in a way that I had to manually intervene by hand-editing
> the contents of ZK. I think I’ve seen Solr fixes go by for those cases,
> too. I’ve never completely wiped ZK. (Although granted, my ZK cluster has
> been pretty stable, and my collection count is smaller than yours)
>
> My philosophy is that ZK is the source of cluster configuration, not the
> collection of core.properties files on the nodes.
> Currently, cluster state is shared between ZK and core directories. I’d
> prefer, and I think Solr development is going this way, (SOLR-7269) that
> all cluster state exist and be managed via ZK, and all state be removed
> from the local disk of the cluster nodes. The fact that a node uses local
> disk based configuration to figure out what collections/replicas it has is
> something that should be fixed, in my opinion.
>
> If you’re frequently getting into bad states due to ZK issues, I’d suggest
> you file bugs against Solr for the fact that you got into the state, and
> then fix your ZK cluster.
>
> Failing that, can you just periodically back up your ZK data and restore
> it if something breaks? I wrote a little tool to watch clusterstate.json
> and write every version to a local git repo a few years ago. I was mostly
> interested because I wanted to see changes that happened pretty fast, but
> it could also serve as a backup approach. Here’s a link, although I clearly
> haven’t touched it lately. Feel free to ask if you have issues:
> https://github.com/randomstatistic/git_zk_monitor
>
>
>
>
> On 3/1/16, 12:09 PM, "danny teichthal"  wrote:
>
> >Hi,
> >Just summarizing my questions if the long mail is a little intimidating:
> >1. Is there a best practice/automated tool for overcoming problems in
> >cluster state coming from zookeeper disconnections?
> >2. Creating a collection via core admin is discouraged, is it true also
> for
> >core.properties discovery?
> >
> >I would like to be able to specify collection.configName in the
> >core.properties and when starting server, the collection will be created
> >and linked to the config name specified.
> >
> >
> >
> >On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal 
> >wrote:
> >
> >> Hi,
> >>
> >>
> >> I would like to describe a process we use for overcoming problems in
> >> cluster state when we have networking issues. Would appreciate if anyone
> >> can answer about what are the flaws on this solution and what is the
> best
> >> practice for recovery in case of network problems involving zookeeper.
> >> I'm working with Solr Cloud with version 5.2.1
> >> ~100 collections in a cluster of 6 machines.
> >>
> >> This is the short procedure:
> >> 1. Bring all the cluster down.
> >> 2. Clear all data from zookeeper.
> >> 3. Upload configuration.
> >> 4. Restart the cluster.
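
For step 3, the upload is typically done with the zkcli script shipped with Solr 
(hosts, paths and the config name below are illustrative):

    server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
      -cmd upconfig -confdir /path/to/configset/conf -confname myconf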
> >>
> >> We rely on the fact that a collection is created on core discovery
> >> process, if it does not exist. It gives us much flexibility.
> >> When the cluster comes up, it reads from core.properties and creates the
> >> collections if needed.
> >> Since we have only one configuration, the collections are automatically
> >> linked to it and the cores inherit it from the collection.
> >> This is a very robust procedure, that helped us overcome many problems
> >> until we stabilized our cluster which is now pretty stable.
> >> I know that the leader might change in such case and may lose updates,
> but
> >> it is ok.
> >>
> >>
> >> The problem is that today I want to add a new config set.
> >> When I add it and clear zookeeper, the cores cannot be created because
> >> there are 2 configurations. This breaks my recovery procedure.
> >>
> >> I thought about a few options:
> >> 1. Put the config Name in core.properties - this doesn't work. (It is
> >> supported in CoreAdminHandler, but  is discouraged accord

Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Walter,
the payload idea is something that I've never heard of... it seems interesting
but quite complex to implement. I think we'll have to write a custom filter
to add the page numbers, and it's not clear to me how to retrieve payloads in
the query result. However, I'll try to go more in depth on this.
Any further details on how to use payloads?
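
For reference, the indexing side can be sketched without a fully custom filter by 
emitting each token as a term|pagenumber pair and letting the delimited-payload filter 
attach the number as a payload (field name and delimiter are only an illustration; 
reading the payloads back at query time still needs custom query/scoring code, which 
is the hard part):

    <fieldType name="text_payload" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="integer"/>
      </analyzer>
    </fieldType>

With this, text prepared as "broken|12 sentence|13" indexes the terms "broken" and 
"sentence" with page numbers 12 and 13 stored as payloads.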

Il giorno mar 1 mar 2016 alle ore 17:05 Walter Underwood <
wun...@wunderwood.org> ha scritto:

> You could index both pages and chapters, with a type field.
>
> You could index by chapter with the page number as a payload for each
> token.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati  wrote:
> >
> > Thank you, Jack for your answer.
> > There are 2 reasons:
> > 1. the requirement is to show in the result list both books and chapters
> > grouped, so I would have to execute the query grouping by book, retrieve
> > first, let's say, 10 books (sorted by relevance) and then for each book
> > repeat the query grouping by chapter (always ordering by relevance) in
> > order to obtain what we need (unfortunately it is not up to me defining
> the
> > requirements... but it however make sense). Unless there exist some SOLR
> > feature to do this in only one call (and that would be great!).
> > 2. searching on pages will not match phrases that spans across 2 pages
> > (e.g. if last word of page 1 is "broken" and first word of page 2 is
> > "sentence" searching for "broken sentence" will not match)
> > However if we will not find a better solution I think that your proposal
> is
> > not so bad... I hope that reason #2 could be negligible and that #1
> > performs quite fast though we are multiplying queries.
> >
> > Il giorno mar 1 mar 2016 alle ore 14:28 Jack Krupansky <
> > jack.krupan...@gmail.com> ha scritto:
> >
> >> Any reason not to use the simplest structure - each page is one Solr
> >> document with a book field, a chapter field, and a page text field? You
> can
> >> then use grouping to group results by book (title text) or even chapter
> >> (title text and/or number). Maybe initially group by book and then if
> the
> >> user selects a book group you can re-query with the specific book and
> then
> >> group by chapter.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
> >> wrote:
> >>
> >>> Original data is quite well structured: it comes in XML with chapters
> and
> >>> tags to mark the original page breaks on the paper version. In this way
> >> we
> >>> have the possibility to restructure it almost as we want before
> creating
> >>> SOLR index.
> >>>
> >>> Il giorno mar 1 mar 2016 alle ore 14:04 Jack Krupansky <
> >>> jack.krupan...@gmail.com> ha scritto:
> >>>
>  To start, what is the form of your input data - is it already divided
> >>> into
>  chapters and pages? Or... are you starting with raw PDF files?
> 
> 
>  -- Jack Krupansky
> 
>  On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
>  wrote:
> 
> > Hi all,
> > I'm searching for ideas on how to define schema and how to perform
>  queries
> > in this use case: we have to index books, each book is split into
>  chapters
> > and chapters are split into pages (pages represent original page
> >>> cutting
>  in
> > printed version). We should show the result grouped by books and
> >>> chapters
> > (for the same book) and pages (for the same chapter). As far as I
> >> know,
>  we
> > have 2 options:
> >
> > 1. index pages as SOLR documents. In this way we could theoretically
> > retrieve chapters (and books?)  using grouping but
> >a. we will miss matches across two contiguous pages (page cutting
> >>> is
> > only due to typographical needs so concepts could be split... as in
>  printed
> > books)
> >b. I don't know if it is possible in SOLR to group results on two
> > different levels (books and chapters)
> >
> > 2. index chapters as SOLR documents. In this case we will have the
> >>> right
> > matches but how to obtain the matching pages? (we need pages because
> >>> the
> > client can only display pages)
> >
> > we have been struggling on this problem for a lot of time and we're
> >>> not
> > able to find a suitable solution so I'm looking if someone has ideas
> >> or
>  has
> > already solved a similar issue.
> > Thanks
> >
> 
> >>>
> >>
>
>


Commit after every document - alternate approach

2016-03-02 Thread sangeetha.subraman...@gtnexus.com
Hi All,

I am trying to understand how we can have commits issued to Solr while indexing 
documents. Around 200K to 300K documents per hour, with an average size of 10 KB 
each, will be going into SOLR. Java code fetches each document from MQ and streams 
it to SOLR. The problem is that the client code issues a hard commit after each 
document sent to SOLR for indexing and waits for the response from SOLR for 
assurance that the document was indexed successfully. Only if it gets an OK status 
from SOLR is the document cleared out of the MQ.

As far as I understand, doing a commit after each document is an expensive 
operation. But we need to make sure that all the documents which are put into the 
MQ get indexed in SOLR. Is there any other way of getting this done? Please 
let me know.
If we do batch indexing, is there any chance we can identify whether some 
documents were missed from indexing?

Thanks
Sangeetha