RE: CDCR - how to deal with the transaction log files

2017-07-28 Thread Xie, Sean
You don't need to start CDCR on the target cluster. The other steps are exactly what I 
did. After disabling the buffer on both target and source, the tlog files are purged 
according to the specs.


-- Thank you
Sean

RE: CDCR - how to deal with the transaction log files

2017-07-28 Thread Patrick Hoeffel
Amrit,

Problem solved! My biggest mistake was in my SOURCE-side configuration. The 
zkHost field needed the entire zkHost string, including the CHROOT indicator. I 
suppose that should have been obvious to me, but the examples only showed the 
IP address of the target ZK, and I made a poor assumption.

The corrected SOURCE-side configuration:

  <lst name="replica">
    <str name="zkHost">10.161.0.7:2181,10.161.0.6:2181,10.161.0.5:2181/chroot/solr</str>
    <str name="source">ks_v1</str>
    <str name="target">ks_v1</str>
  </lst>

The original, broken configuration:

  <lst name="replica">
    <str name="zkHost">10.161.0.7:2181</str>   <=== Problem was here.
    <str name="source">ks_v1</str>
    <str name="target">ks_v1</str>
  </lst>

After that, I just made sure I did this:
1. Stop all Solr nodes at both SOURCE and TARGET.
2. $ rm -rf $SOLR_HOME/server/solr/collection_name/data/tlog/*.*
3. On the TARGET:
a. $ collection/cdcr?action=DISABLEBUFFER
b. $ collection/cdcr?action=START

4. On the SOURCE:
a. $ collection/cdcr?action=DISABLEBUFFER
b. $ collection/cdcr?action=START
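
(Spelled out as full requests, those calls would look roughly like the following; the
host names and collection name here are placeholders for whatever your cluster uses:)

   # On the TARGET cluster:
   curl "http://target-host:8983/solr/collection_name/cdcr?action=DISABLEBUFFER"
   curl "http://target-host:8983/solr/collection_name/cdcr?action=START"

   # On the SOURCE cluster:
   curl "http://source-host:8983/solr/collection_name/cdcr?action=DISABLEBUFFER"
   curl "http://source-host:8983/solr/collection_name/cdcr?action=START"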

At this point any existing data in the SOURCE collection started flowing into 
the TARGET collection, and it has remained congruent ever since.

Thanks,



Patrick Hoeffel

Senior Software Engineer
(Direct)  719-452-7371
(Mobile) 719-210-3706
patrick.hoef...@polarisalpha.com
PolarisAlpha.com 


-Original Message-
From: Amrit Sarkar [mailto:sarkaramr...@gmail.com] 
Sent: Friday, July 21, 2017 7:21 AM
To: solr-user@lucene.apache.org
Cc: jmy...@wayfair.com
Subject: Re: CDCR - how to deal with the transaction log files

Patrick,

Yes! You created the default UpdateLog, which got written to disk, and then you 
changed it to CdcrUpdateLog in the configs. I see no reason it would create a 
proper COLLECTIONCHECKPOINT on the target tlog.

One thing you can try before creating / starting from scratch is restarting the 
source cluster nodes; the shard leaders will try to create the same 
COLLECTIONCHECKPOINT, which may or may not be successful.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Jul 21, 2017 at 11:09 AM, Patrick Hoeffel < 
patrick.hoef...@polarisalpha.com> wrote:

> I'm working on my first setup of CDCR, and I'm seeing the same "The 
> log reader for target collection {collection name} is not initialised" 
> as you saw.
>
> It looks like you're creating collections on a regular basis, but for 
> me, I create it one time and never again. I've been creating the 
> collection first from defaults and then applying the CDCR-aware 
> solrconfig changes afterward. It sounds like maybe I need to create 
> the configset in ZK first, then create the collections, first on the 
> Target and then on the Source, and I should be good?
>
> Thanks,
>
> Patrick Hoeffel
> Senior Software Engineer
> (Direct)  719-452-7371
> (Mobile) 719-210-3706
> patrick.hoef...@polarisalpha.com
> PolarisAlpha.com
>
>
> -Original Message-
> From: jmyatt [mailto:jmy...@wayfair.com]
> Sent: Wednesday, July 12, 2017 4:49 PM
> To: solr-user@lucene.apache.org
> Subject: Re: CDCR - how to deal with the transaction log files
>
> glad to hear you found your solution!  I have been combing over this 
> post and others on this discussion board many times and have tried so 
> many tweaks to configuration, order of steps, etc, all with absolutely 
> no success in getting the Source cluster tlogs to delete.  So 
> incredibly frustrating.  If anyone has other pearls of wisdom I'd love some 
> advice.
> Quick hits on what I've tried:
>
> - solrconfig exactly like Sean's (target and source respectively) 
> except no autoSoftCommit
> - I am also calling cdcr?action=DISABLEBUFFER (on source as well as on
> target) explicitly before starting since the config setting of 
> defaultState=disabled doesn't seem to work
> - when I create the collection on source first, I get the warning "The 
> log reader for target collection {collection name} is not 
> initialised".  When I reverse the order (create the collection on 
> target first), no such warning
> - tlogs replicate as expected, hard commits on both target and source 
> cause tlogs to rollover, etc - all of that works as expected
> - action=QUEUES on source reflects the queueSize accurately.  Also
> *always* shows updateLogSynchronizer state as "stopped"
> - action=LASTPROCESSEDVERSION on both source and target always seems 
> correct (I don't see the -1 that Sean mentioned).
> - I'm creating new collections every time and running full data 
> imports that take 5-10 minutes. Again, all data replication, log 
> rollover, and autocommit activity seems to work as expected, and logs 
> on target are deleted.  It's just those pesky source tlogs I can't get to 
> delete.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/CDCR-how-to-deal-with-the-transaction-log-
> files-tp4345062p4345715.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Unable to create core [collection] Caused by: null

2017-07-28 Thread Lucas Pelegrino
Thanks Shawn and everyone!

Solved.

2017-07-27 18:29 GMT-04:00 Shawn Heisey :

> On 7/25/2017 5:21 PM, Lucas Pelegrino wrote:
> > Trying to make solr work here, but I'm getting this error from this
> command:
> >
> > $ ./solr create -c products -d /Users/lucaswxp/reduza-solr/
> products/conf/
> >
> > Error CREATEing SolrCore 'products': Unable to create core [products]
> > Caused by: null
> >
> > I'm posting my solrconf.xml, schema.xml and data-config.xml here:
> > https://pastebin.com/fnYK9pSJ
> >
> > The debug from log solr: https://pastebin.com/kVLMvBwZ
>
> Problems with my email client.  Meant to send this to the list, but only
> sent it to Lucas.  Resending to the list.
>
> In the exception you got, I see that the final "Caused by" section
> starts with this:
>
> Caused by: java.lang.NullPointerException
> at
> org.apache.solr.response.SchemaXmlWriter.writeResponse(
> SchemaXmlWriter.java:85)
>
> Line 85 of that source code file is this:
>
>writeAttr(IndexSchema.NAME,
> schemaProperties.get(IndexSchema.NAME).toString());
>
> As Rick noted, the schema should be named "managed-schema" rather than
> "schema.xml", but Solr *should* see the schema.xml, copy it to
> managed-schema, rename it to something else, and continue loading.  I'm
> proceeding based on the idea that Solr *has* found your file.
>
> Looking at your schema, you have not given the schema a name, and that's
> what is causing the problem.  Therefore when the code mentioned above
> tried to get the name of the schema, it got a null.  Java can't perform
> an operation (in this case, toString() is the one being called) on a
> null pointer.  Here is the relevant line from one of the example schemas
> included with Solr 6.6 showing how to give the schema a name:
>
> <schema name="example" version="1.6">
>
> The error could be more descriptive, but a better option is to work
> without creating an error at all.  I opened an issue to deal with the
> problem, and I'm expecting it to be fixed in the next release of Solr:
>
> https://issues.apache.org/jira/browse/SOLR-11153
>
> Thanks,
> Shawn
>
>


multiple cores or single combined core

2017-07-28 Thread Steve Pruitt
This question has been asked before.  I found a few postings to Solr user and a 
couple on Google-in-the-large.  
But I am still not sure which is best.

My project currently has two distinct datasets (documents) with no shared 
fields.
 
But at times, we need to query across both of them.
 
So we are trying to decide between a single index that holds both as a combined 
document, or separate indexes, one for each document type, where we perform 
queries on both and combine the results when we need a query across both.
 
From what I have read, the general consensus seems to be separate indexes.
 
A few general conclusions of the posts:
- If the load (query demand) per dataset is different, then multiple indexes can 
be managed differently.
- Multiple indexes provide more flexibility.  One index can be taken down without 
affecting the other one.
- Combining datasets makes for a larger index, which can greatly lengthen 
re-indexing time.
- The smaller the index, the faster the search.
 
We are also interested in how a single index with divergent document types 
might complicate ranking and relevancy.
 
Also, in the future we will need to dynamically add new fields via the managed 
schema, and it's possible there will be new document types we need to index.
 
Any opinions or thoughts along this question are appreciated.
 
Thanks in advance.

-Steve Pruitt

 


Issues querying date fields (date fields stored as text)

2017-07-28 Thread MKrishna
Hi all,

I'm using Lucene Solr (Lucidworks) as the search index, and it's a
schemaless search engine to index Outlook files.

I have been using it since version 2.0.* and have an issue with querying
dates.

The older versions mapped fields (date_created, mail from_date, etc.) to the date
datatype, making date range queries easier.

After upgrading to version 3.0.*, all the date fields are mapped to the
text_general data type, and all my date queries are failing.

Please help! Any helpful resources are greatly appreciated, as I'm a
newbie.

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issues-querying-date-fields-date-fields-stored-as-text-tp4348063.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Disadvantages of having many cores

2017-07-28 Thread David Hastings
You're better off just using one core.  Perhaps think about pre-processing
the logs to "summarize" them into fewer "documents".
I do this, and in my situation I summarize things like user-hits-item, for
example: I find all the times a certain user had hits on a certain item
in one day and put that into one document.  I have about 4-5 years of HTTP
logs and it sits at around 265 million documents and 17 GB, so it's hardly an
issue for performance.
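
As a rough sketch of that kind of summarization (the core name and field names
below are made up for illustration, not from an actual setup), each user/item/day
roll-up becomes a single Solr document:

   curl -X POST -H 'Content-Type: application/json' \
     'http://localhost:8983/solr/httplogs/update?commit=true' \
     --data-binary '[{
       "id": "user42_item123_2017-07-28",
       "user_s": "user42",
       "item_s": "item123",
       "day_dt": "2017-07-28T00:00:00Z",
       "hit_count_i": 37
     }]'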

On Fri, Jul 28, 2017 at 10:04 AM, Chellasamy G 
wrote:

> Hi,
>
>
>
> I am working on a log management tool and considering to use solr to
> index/search the logs.
>
> I have few doubts about how to organize or create the cores.
>
>
>
> The tool  should process 200 million events per day with each event
> containing 40 to 50 fields. Currently I have planned to create a core per
> day pushing all the data to the day's core. This may lead to the creation
> of many cores. Is this a good design? If not please suggest a good
> design.(Also, if multiple cores are used, will it slowdown the solr
> process' uptime)
>
>
>
>
>
> Thanks,
>
> Satyan
>
>
>
>


Disadvantages of having many cores

2017-07-28 Thread Chellasamy G
Hi,



I am working on a log management tool and am considering using Solr to 
index/search the logs.

I have a few questions about how to organize or create the cores.



The tool should process 200 million events per day, with each event containing 
40 to 50 fields. Currently I have planned to create a core per day, pushing all 
the data to that day's core. This may lead to the creation of many cores. Is 
this a good design? If not, please suggest a good design. (Also, if multiple 
cores are used, will it slow down the Solr process' uptime?)





Thanks,

Satyan





Re: logging support in Lucene code

2017-07-28 Thread Shawn Heisey
On 7/27/2017 10:57 AM, Nawab Zada Asad Iqbal wrote:
> I see a lot of discussion on this topic from almost 10 years ago: e.g.,
> https://issues.apache.org/jira/browse/LUCENE-1482
>
> For 4.5, I relied on 'System.out.println' for writing information for
> debugging in production.
>
> In 6.6, I notice that some classes in Lucene are instantiating a Logger,
> should I use Logger instead? I tried to log with it, but I don't see any
> output in logs.

You're asking about this on a Solr list, not a Lucene list.  I am not
subscribed to the main Lucene user list, so I do not know if you have
also sent this question to that list.

Solr uses slf4j for logging.  Many of its dependencies have chosen other
logging frameworks.

https://www.slf4j.org/

With slf4j, you can utilize just about any supported logging
implementation to do the actual end logging.  The end implementation
chosen by the Solr project for version 4.3 and later is log4j 1.x.

It is my understanding that Lucene's core module has zero dependencies
-- it's pure Java.  That would include any external logging
implementation.  I do not know if the core module even uses
java.util.logging ... a quick grep for "Logger" suggests that there are
no loggers in use in the core module at all, but it's possible that I
have not scanned for the correct text.  I did notice that
TestIndexWriter uses a PrintStream for logging, and Shalin's reply has
reminded me about the infoStream feature.

Looking at the source code, it does appear that some of the other Lucene
modules do use a logger. Some of them appear to use the logger built
into java, others seem to use one of the third-party implementations
like slf4j.  Some of the dependent jars pulled in for non-core Lucene
modules depend on various logging implementations.

Logging frameworks can be the center of a religious flamewar.  Opinions
run strong.  IMHO, if you are writing your own code, the best option is
slf4j, bound to whatever end logging implementation you are most
comfortable using.  You can install slf4j jars to intercept logging sent
to the other common logging implementations and direct those through
slf4j so they end up in the same place as everything else.
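
A minimal sketch of what that looks like in your own code (plain slf4j API,
nothing Lucene- or Solr-specific; the class and messages are just examples):

   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;

   public class MyIndexingTask {
       // slf4j is only a facade; where this output actually goes depends on
       // which binding (log4j, logback, etc.) is on the classpath at runtime.
       private static final Logger log = LoggerFactory.getLogger(MyIndexingTask.class);

       public void run() {
           log.info("starting indexing task");
           try {
               // ... do the actual work ...
           } catch (Exception e) {
               log.error("indexing task failed", e);
           }
       }
   }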

Note if you want to use log4j2 as your end logging destination with
slf4j: log4j2 comes with jars implementing the slf4j classes, so you're
probably going to want to use those.

Thanks,
Shawn



Re: search engine - Precision, recall

2017-07-28 Thread Shawn Heisey
On 7/27/2017 7:20 AM, Itay K wrote:
> I'm trying to measure Precision and recall for a search engine which is
> crawling data sources of an organization.
>
> Are there any best practices regarding these indexes and specific
> industries (e.g. for financial organizations, the recommended percentage
> for precision and recall is ~60%).
>
> Is there any best practice in general for the recommended percentage?
>
> I read an article from 2005 regarding measured precision and recall for web
> search engines but unfortunately my use case isn't a web application and I
> believe that since then a lot has changed.

I don't believe you can assign concrete numbers to these aspects of a
search engine, at least not in a way that has meaning after the query,
the index, or the user's expectation changes.

Recall is all about numbers, but precision is a completely subjective
measurement that is going to vary from person to person.  Results that
are highly relevant for one user might be completely irrelevant for
another, even when both users enter the exact same search terms.

Also, the search terms that one user enters are likely to be different
from the search terms that another user enters, even if they are looking
for exactly the same thing.  I cannot think of a way of calculating
percentages for precision and recall that would give meaningful numbers
when very different searches and expectations must be examined.  A
recall count for one search will have little meaning when compared to
the recall count for a different search ... and as already mentioned,
precision is COMPLETELY subjective.

IMHO, tuning precision and recall is not about getting some calculated
numbers as high as you can.  In order to tune successfully, you have to
know what people are searching for, what they expect to find, and come
up with a configuration that will produce the best balance between
precision and recall when applied to the combination of the data in the
index and what's actually being searched.  Frequently the tuning process
involves educating the users, in addition to (or sometimes instead of)
changing the search engine configuration.  Six months after tuning the
search, as the index and your users change, you may need completely
different settings to get good results.

Changes that affect precision and recall are usually a tradeoff between
those two factors.  Improving one of them will often make the other
worse.  They must be approached with a goal of bringing them into
balance for the searches done by a majority of the system users.

Thanks,
Shawn



Re: High CPU utilization on Upgrading to Solr Version 6.3

2017-07-28 Thread Shawn Heisey
On 7/27/2017 1:30 AM, Atita Arora wrote:
> What OS is Solr running on?  I'm only asking because some additional
> information I'm after has different gathering methods depending on OS.
> Other questions:
>
> OpenJDK 64-Bit Server VM (25.141-b16) for linux-amd64 JRE
> (1.8.0_141-b16), built on Jul 20 2017 21:47:59 by "mockbuild" with gcc
> 4.4.7 20120313 (Red Hat 4.4.7-18)
> Memory: 4k page, physical 264477520k(92198808k free), swap 0k(0k free)

Linux is the easiest to get good information from.  Run the "top"
program in a commandline session.  Press shift-M to sort by memory size,
and grab a screenshot.  Share that screenshot with a file sharing site
and give us the URL.

> Is there only one Solr process per machine, or more than one?
> On average yes, one Solr process per machine; however, we do have a
> machine (where this log was taken) that has two Solr processes
> (master and slave).

Running a master and a slave on one machine does nothing for
redundancy.  They need to be on separate machines for that to really
help.  As for multiple processes per machine, you can have many indexes
in one Solr instance -- you don't need more than one in most cases.

> How many total documents are managed by one machine?
> About 220945 per machine (and double for this machine, as it has an
> instance of the master as well as another slave).
>
> How big is all the index data managed by one machine?
> The index is about 4G.

If less than a quarter of a million documents results in a 4GB index,
those documents must be ENORMOUS, or else there is something strange
going on.

> What is the max heap on each Solr process?
> Max heap is 25G for each Solr process (Xms 25g, Xmx 25g).
>
> The reason for choosing RAMDirectory was that it was used in a
> similar manner while the production Solr was on version 4.3.2 -- no
> particular reason, we just replicated how it was working and never
> thought it might cause trouble.

Set up the slaves just like the masters, with
NRTCachingDirectoryFactory.  For a couple hundred thousand docs, you
probably only need a 2GB heap, possibly even less.
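
For reference, the stock directoryFactory definition that ships in the default
solrconfig.xml is what the slaves would use instead of a RAMDirectory-based one:

   <directoryFactory name="DirectoryFactory"
                     class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>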

> I had included a pastebin of GC snapshot (the complete log was too big
> to be included in the pastebin , so pasted a sampler)

I asked for the full log because that's what I need to look deeper.  A
sampler won't be enough.  There are file sharing websites for sharing
larger content, and if you compress the file before uploading it, you
should be able to achieve a fairly impressive compression ratio. 
Dropbox is generally a good choice for sharing fairly large content. 
Dropbox also works for image data, like the "top" screenshot I asked for
above.

> Another thing: as we observed the CPU cycles yesterday under high load,
> we saw that the Highlighter component was taking the longest. Is there
> anything in particular we forgot to include, such that highlighting
> doesn't give a performance hit?
> Attached is the snapshot taken from jvisualvm.

Attachments rarely make it through the mailing list.  Yours didn't, so I
cannot see that snapshot.

I do not know anything about highlighting, so I cannot comment on how
much CPU it takes.  I've never used the feature.

My best idea about why your CPU is so high is problems with garbage
collection.  To look into that, I need to have the full GC log.  The
rest of the information I've asked for will help focus my efforts.

Thanks,
Shawn



How to Join more than two(2) cores in solr query

2017-07-28 Thread aniljayanti
Hi,

I am using Solr 6.2.1 in my project. I want to join more than 2 cores as per
the requirement.

core1 : empid
core2 : empid,sid,pid
core3 : sid,pid

I want to join core2 and core3 on [core2.sid = core3.sid and core2.pid =
core3.pid], and the resultant records of core2.empid should be joined with
core1.empid.

Below is my Solr query to join core2.sid = core3.sid, but I do not know how to
write the query to join core1.empid = core2.empid and core2.pid = core3.pid.
 
http://localhost:8983/solr/core2/select?q={!join from=sid to=sid
fromIndex=core3 v='*:*'}

Currently I am not able to write a Solr query for the above requirement.

Could you please help me out with this?

Thanks in advance,

AnilJayanti



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Join-more-than-two-2-cores-in-solr-query-tp4348067.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: logging support in Lucene code

2017-07-28 Thread Shalin Shekhar Mangar
Lucene does not use a logging framework. But if you are using Solr, then you
can route the infoStream logging to Solr's log files by setting an option
in solrconfig.xml. See
http://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#IndexConfiginSolrConfig-OtherIndexingSettings
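
The setting in question is the infoStream flag inside indexConfig; a minimal
snippet would be:

   <indexConfig>
     <!-- route Lucene's IndexWriter infoStream output into Solr's log -->
     <infoStream>true</infoStream>
   </indexConfig>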

On Fri, Jul 28, 2017 at 11:13 AM, Nawab Zada Asad Iqbal 
wrote:

> Any doughnut for me ?
>
>
> Regards
> Nawab
>
> On Thu, Jul 27, 2017 at 9:57 AM Nawab Zada Asad Iqbal 
> wrote:
>
> > Hi,
> >
> > I see a lot of discussion on this topic from almost 10 years ago: e.g.,
> > https://issues.apache.org/jira/browse/LUCENE-1482
> >
> > For 4.5, I relied on 'System.out.println' for writing information for
> > debugging in production.
> >
> > In 6.6, I notice that some classes in Lucene are instantiating a Logger,
> > should I use Logger instead? I tried to log with it, but I don't see any
> > output in logs.
> >
> >
> > Regards
> > Nawab
> >
>



-- 
Regards,
Shalin Shekhar Mangar.