Re: Facet Double Counting

2015-01-25 Thread harish singh
Still the same.

Could the reason be that, if there are duplicate logs/documents, the
facet query counts them, but Solr eliminates the duplicates when I run
the search query?


On Sat, Jan 24, 2015 at 11:47 PM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:



 Hi Harish,

 What happens when you purge deleted terms with
 'solr/core/update?commit=true&expungeDeletes=true'

 ahmet



 On Sunday, January 25, 2015 1:59 AM, harish singh 
 harish.sing...@gmail.com wrote:
 Hi,

 I am noticing a strange behavior with solr facet searching:

 This is my facet query:


 params: {
   facet: true,
   sort: startTimeISO desc,
   debugQuery: true,
   facet.mincount: 1,
   facet.sort: count,
   start: 0,
   q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)),
   facet.limit: 100,
   facet.field: loginUserName,
   wt: json,
   fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
   rows: 0
 }


 The result I am getting is:

 facet_counts: {
   facet_queries: { },
   facet_fields: {
     loginUserName: [
       harry, 36,
       larry, 10,
       Carey
     ]
   },
   facet_dates: { },
   facet_ranges: { }
 }



 As you can see, the result shows that the facet count for loginUserName=harry is 36.

 So when I do a Solr search for logs, I should get 36 logs.
 But I am getting 18.
 This is happening for all the searches now.


 For some reason, I see double counting.

 Either faceting is double counting, or search is half-counting?


 This is my Solr Search Query:



 params: {
   sort: startTimeISO desc,
   debugQuery: true,
   start: 0,
   q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)) AND (loginUserName:(harry)),
   wt: json,
   fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
   rows: 200
 }



 This query gives only 18 logs, but the facet query gave 36.


 Is there something incorrect in either (or both) of my queries?
 I am trying to debug, but I think I am missing something silly.



RE: Facet Double Counting

2015-01-25 Thread Toke Eskildsen
harish singh [harish.sing...@gmail.com] wrote:
 As you see, the result is showing Facet-Count for loginUserName= harry is
 36.
 
 So when I do a Solr Search for logs, I should get 36 logs.
 But I am getting 18.
 This is happening for all the searches now.

If you have recently added or changed uniqueKey and if your index has multiple 
documents with the same key, that would explain the behaviour you describe. If 
that is so, I recommend you delete the index and rebuild it from scratch.

- Toke Eskildsen


Re: Facet Double Counting

2015-01-25 Thread Ahmet Arslan
Weird; optimize or expungeDeletes=true should do the trick.
Can you try to optimize this time?
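For reference, something along these lines covers both suggestions (host and core name are placeholders for your setup):

curl 'http://localhost:8983/solr/core1/update?commit=true&expungeDeletes=true'
curl 'http://localhost:8983/solr/core1/update?optimize=true'

expungeDeletes merges away segments that contain deleted documents; optimize rewrites the whole index, which is heavier but guarantees every deletion is purged.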

On Sunday, January 25, 2015 11:08 AM, harish singh harish.sing...@gmail.com 
wrote:
Still the same.

Could the reason be that, if there are duplicate logs/documents, the
facet query counts them, but Solr eliminates the duplicates when I run
the search query?



On Sat, Jan 24, 2015 at 11:47 PM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:



 Hi Harish,

 What happens when you purge deleted terms with
  'solr/core/update?commit=true&expungeDeletes=true'

 ahmet



 On Sunday, January 25, 2015 1:59 AM, harish singh 
 harish.sing...@gmail.com wrote:
 Hi,

 I am noticing a strange behavior with solr facet searching:

 This is my facet query:


 params: {
   facet: true,
   sort: startTimeISO desc,
   debugQuery: true,
   facet.mincount: 1,
   facet.sort: count,
   start: 0,
   q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)),
   facet.limit: 100,
   facet.field: loginUserName,
   wt: json,
   fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
   rows: 0
 }


 The result I am getting is:

 facet_counts: {
   facet_queries: { },
   facet_fields: {
     loginUserName: [
       harry, 36,
       larry, 10,
       Carey
     ]
   },
   facet_dates: { },
   facet_ranges: { }
 }



 As you can see, the result shows that the facet count for loginUserName=harry is 36.

 So when I do a Solr search for logs, I should get 36 logs.
 But I am getting 18.
 This is happening for all the searches now.


 For some reason, I see double counting.

 Either faceting is double counting, or search is half-counting?


 This is my Solr Search Query:



 params: {
   sort: startTimeISO desc,
   debugQuery: true,
   start: 0,
   q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)) AND (loginUserName:(harry)),
   wt: json,
   fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
   rows: 200
 }



 This query gives only 18 logs, but the facet query gave 36.


 Is there something incorrect in either (or both) of my queries?
 I am trying to debug, but I think I am missing something silly.



Re: Facet Double Counting

2015-01-25 Thread harish singh
Oh yes!! :)
I tried faceting on the UUID field.
All the UUIDs have count = 2, which probably explains why I am getting
double counting in the facet result.

So does this mean that when I do a facet query on facet.field=loginUserName,
Solr does not look at the UUID?
And the unique field (UUID in this case) is considered only for search
queries?
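In case it helps someone else, a query along these lines surfaces the duplicates (host, core name, and the uuid field name will differ per setup):

curl 'http://localhost:8983/solr/core1/select?q=*:*&rows=0&wt=json&facet=true&facet.field=uuid&facet.mincount=2'

Any value listed under facet_fields/uuid is a uniqueKey that is attached to more than one document.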

On Sun, Jan 25, 2015 at 3:15 AM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 harish singh [harish.sing...@gmail.com] wrote:
  As you see, the result is showing Facet-Count for loginUserName= harry
 is
  36.
 
  So when I do a Solr Search for logs, I should get 36 logs.
  But I am getting 18.
  This is happening for all the searches now.

 If you have recently added or changed uniqueKey and if your index has
 multiple documents with the same key, that would explain the behaviour you
 describe. If that is so, I recommend you delete the index and rebuild it
 from scratch.

 - Toke Eskildsen



RE: Facet Double Counting

2015-01-25 Thread Toke Eskildsen
harish singh [harish.sing...@gmail.com] wrote:
 I tried the Faceting on the UUID field.

Nice debug trick. I'll remember that for next time.

 So does this mean, when I do a facet query on facet.field= loginUserName,
 Solr does not look at the UUID?

Yes. For faceting, Solr only uses the internal docIDs and the facet field data.

 And the unique field (UUID in this case) is considered only while Search
 Queries?

For a distributed setup, the documents are resolved from the shards using 
uniqueKey.

I did not think this was the case for a non-distributed setup - for such a setup,
I thought that the documents were resolved using internal docIDs. If your index 
is single-shard, then I was wrong.

- Toke Eskildsen


Re: solr replication vs. rsync

2015-01-25 Thread Shawn Heisey
On 1/24/2015 10:56 PM, Dan Davis wrote:
 When I polled the various projects already using Solr at my organization, I
 was greatly surprised that none of them were using Solr replication,
 because they had talked about replicating the data.
 
 But we are not Pinterest, and do not expect to be taking in changes one
 post at a time (at least the engineers don't - just wait until it's used for
 a CRUD app that wants full-text search on a description field!). Still,
 rsync can be very, very fast with the right options (-W for gigabit
 ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over
 GigE previously.
 
 Does anyone have any numbers for how fast Solr replication goes, and what
 to do to tune it?
 
 I'm not enthusiastic about giving up recently tested cluster stability for a
 home-grown mess, but I am interested in numbers that are out there.

Numbers are included on the Solr replication wiki page, both in graph
and numeric form.  Gathering these numbers must have been pretty easy --
before the HTTP replication made it into Solr, Solr used to contain an
rsync-based implementation.

http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config.  There's
not a lot to tune.

I run a redundant non-SolrCloud index myself through a different method
-- my indexing program indexes each index copy completely independently.
 There is no replication.  This separation allows me to upgrade any
component, or change any part of solrconfig or schema, on either copy of
the index without affecting the other copy at all.  With replication, if
something is changed on the master or the slave, you might find that the
slave no longer works, because it will be handling an index created by
different software or a different config.

Thanks,
Shawn



Sorting on a computed value

2015-01-25 Thread tedsolr
I'll bet some super user has figured this out. How can I perform a sort on a
single computed field? I have a QParserPlugin that is collapsing docs based
on data from multiple fields. I am summing the values from one numerical
field 'X'. I was going to use a DocTransformer to inject that summed value
into the search results as a new field. But I have now realized that I have
to be able to sort on this summed field.

Without retrieving all results (which could be 1M+) in my app and sorting
manually, is there any way to sort on my computed field within Solr?
(using Solr 4.9)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-on-a-computed-value-tp4181875.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: replicas goes in recovery mode right after update

2015-01-25 Thread Erick Erickson
Shawn directed you over here to the user list, but I see this note on
SOLR-7030:
All our searchers have 12 GB of RAM available and have quad core Intel(R)
Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
jboss and solr in it . All 12 GB is available as heap for the java
process...

So you have 12G physical memory and have allocated 12G to the Java process?
This is an anti-pattern. If that's the case, your operating system is being
starved for memory, probably hitting a state where it spends all of its time
in stop-the-world garbage collection; eventually it doesn't respond to
Zookeeper's ping, so Zookeeper thinks the node is down and puts it into
recovery, where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated,
but here's a blog on what they do; you should pick the configuration that
supports your use case (i.e. how much latency can you stand between
indexing and being able to search?).

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
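For illustration only, a solrconfig.xml shape in the spirit of that post (the intervals are placeholders to tune for your latency needs, not values taken from the blog):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>

With openSearcher=false the hard commit just flushes segments and truncates the transaction log; how quickly new documents become searchable is governed by the soft commit interval.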

Here's one very good reason you shouldn't starve your op system by
allocating all the physical memory to the JVM:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html


But your biggest problem is that you have far too much of your physical
memory allocated to the JVM. This will cause you endless problems; you just
need more physical memory on those boxes. It's _possible_ you could get by
with less memory for the JVM - counterintuitive as it seems, try 8G or
maybe even 6G. At some point you'll hit OOM errors, but that'll give you a
lower limit on what the JVM needs.

Unless I've mis-interpreted what you've written, though, I doubt you'll get
stable with that much memory allocated
to the JVM.

Best,
Erick



On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

 We have a cluster of solr cloud server with 10 shards and 4 replicas in
 each shard in our stress environment. In our prod environment we will have
 10 shards and 15 replicas in each shard. Our current commit settings are as
 follows

 <autoSoftCommit>
   <maxDocs>50</maxDocs>
   <maxTime>18</maxTime>
 </autoSoftCommit>
 <autoCommit>
   <maxDocs>200</maxDocs>
   <maxTime>18</maxTime>
   <openSearcher>false</openSearcher>
 </autoCommit>


 We indexed roughly 90 Million docs. We have two different ways to index
 documents a) Full indexing. It takes 4 hours to index 90 Million docs and
 the rate of docs coming to the searcher is around 6000 per second b)
 Incremental indexing. It takes an hour to indexed delta changes. Roughly
 there are 3 million changes and rate of docs coming to the searchers is
 2500
 per second

 We have two collections search1 and search2. When we do full indexing , we
 do it in search2 collection while search1 is serving live traffic. After it
 finishes we swap the collection using aliases so that the search2
 collection serves live traffic while search1 becomes available for next
 full indexing run. When we do incremental indexing we do it in the search1
 collection which is serving live traffic.

 All our searchers have 12 GB of RAM available and have quad core Intel(R)
 Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
 jboss and solr in it . All 12 GB is available as heap for the java
 process.  We have observed that the heap memory of the java process average
 around 8 - 10 GB. All searchers have final index size of 9 GB. So in total
 there are 9X10 (shards) =  90GB worth of index files.

  We have observed the following issue when we trigger indexing . In about
 10 minutes after we trigger indexing on 14 parallel hosts, the replicas
 goes in to recovery mode. This happens to all the shards . In about 20
 minutes more and more replicas start going into recovery mode. After about
 half an hour all replicas except the leader are in recovery mode. We cannot
 throttle the indexing load as that will increase our overall indexing time.
 So to overcome this issue, we remove all the replicas before we trigger the
 indexing and then add them back after the indexing finishes.

 We observe the same behavior of replicas going into recovery when we do
 incremental indexing. We cannot remove replicas during our incremental
 indexing because it is also serving live traffic. We tried to throttle our
 indexing speed , however the cluster still goes into recovery .

 If we leave the cluster as it , when the indexing finishes , it eventually
 recovers after a while. As it is serving live traffic we cannot have these
 replicas go into recovery mode because it degrades the search performance
 also , our tests have shown.

 We have tried different commit settings like below

 a) No auto soft commit, no auto hard commit and a commit triggered at the
 end of indexing b) No auto soft commit, yes auto hard commit and a commit
 in the end of indexing
 c) Yes auto 

Re: Unexplained leader initiated recovery after updates - SolrCmdDistributor no longer retries on RemoteSolrException

2015-01-25 Thread sekhrivijay
Hi Lindsey,
Were you ever able to figure out the reason for this behavior?
We are experiencing the same issue with SolrCloud version 4.10.

http://lucene.472066.n3.nabble.com/jira-Commented-SOLR-7030-replicas-goes-in-recovery-mode-right-after-update-td4181881.html
https://issues.apache.org/jira/browse/SOLR-7030


We even tried to remove the replicas to get around this issue. However, we
cannot do that for the collection that is serving our live traffic. Any
suggestions?
Vijay



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Re-Unexplained-leader-initiated-recovery-after-updates-SolrCmdDistributor-no-longer-retries-on-Remotn-tp4179309p4181882.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr replication vs. rsync

2015-01-25 Thread Erick Erickson
bq:  I thought SolrCloud replicas were replication, and you imply parallel
indexing

Absolutely! You couldn't get near-real-time indexing if you relied on
replication a la 3.x. And you also couldn't guarantee consistency.

Say you have 1 shard, a leader and a follower (i.e. 2 replicas). Now you
throw a doc to be indexed. The sequence is:
1. leader gets the doc
2. leader forwards the doc to the follower
3. leader and follower both add the doc to their local index (and tlog)
4. follower acks back to the leader
5. leader acks back to the client

So yes, the raw document is forwarded to all replicas before the leader
responds
to the client, the docs all get written to the tlogs, etc. That's the only
way to guarantee
that if the leader goes down, the follower can take over without losing
documents.

Best,
Erick

On Sun, Jan 25, 2015 at 6:15 PM, Dan Davis dansm...@gmail.com wrote:

 @Erick,

 Problem space is not constant indexing.   I thought SolrCloud replicas were
 replication, and you imply parallel indexing.  Good to know.

 On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com
 wrote:

  @Shawn: Cool table, thanks!
 
  @Dan:
  Just to throw a different spin on it, if you migrate to SolrCloud, then
  this question becomes moot as the raw documents are sent to each of the
  replicas so you very rarely have to copy the full index. Kind of a
 tradeoff
  between constant load because you're sending the raw documents around
  whenever you index and peak usage when the index replicates.
 
  There are a bunch of other reasons to go to SolrCloud, but you know your
  problem space best.
 
  FWIW,
  Erick
 
  On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org
  javascript:; wrote:
 
   On 1/24/2015 10:56 PM, Dan Davis wrote:
When I polled the various projects already using Solr at my
   organization, I
was greatly surprised that none of them were using Solr replication,
because they had talked about replicating the data.
   
But we are not Pinterest, and do not expect to be taking in changes
 one
post at a time (at least the engineers don't - just wait until its
 used
   for
a Crud app that wants full-text search on a description field!).
   Still,
rsync can be very, very fast with the right options (-W for gigabit
ethernet, and maybe -S for sparse files).   I've clocked it at 48
 MB/s
   over
GigE previously.
   
Does anyone have any numbers for how fast Solr replication goes, and
  what
to do to tune it?
   
I'm not enthusiastic to give-up recently tested cluster stability
 for a
home grown mess, but I am interested in numbers that are out there.
  
   Numbers are included on the Solr replication wiki page, both in graph
   and numeric form.  Gathering these numbers must have been pretty easy
 --
   before the HTTP replication made it into Solr, Solr used to contain an
   rsync-based implementation.
  
   http://wiki.apache.org/solr/SolrReplication#Performance_numbers
  
   Other data on that wiki page discusses the replication config.  There's
   not a lot to tune.
  
   I run a redundant non-SolrCloud index myself through a different method
   -- my indexing program indexes each index copy completely
 independently.
There is no replication.  This separation allows me to upgrade any
   component, or change any part of solrconfig or schema, on either copy
 of
   the index without affecting the other copy at all.  With replication,
 if
   something is changed on the master or the slave, you might find that
 the
   slave no longer works, because it will be handling an index created by
   different software or a different config.
  
   Thanks,
   Shawn
  
  
 



Re: replicas goes in recovery mode right after update

2015-01-25 Thread Erick Erickson
Ah, OK. Whew! Because I was wondering how you were running at _all_ if all
the memory was allocated to the JVM ;).

What is your Zookeeper timeout? The original default was 15 seconds, and this
has caused problems like this. Here's the scenario: you send a bunch of docs
at the server, and eventually you hit a stop-the-world GC that takes longer
than the Zookeeper timeout. So ZK thinks the node is down and initiates
recovery. Eventually, you hit this on all the replicas.

Sometimes I've seen situations where the answer is giving a bit more memory
to the JVM, say 2-4G in your case. The theory here (and this is a shot in
the dark) is that your peak JVM requirements are close to your 12G, so the
garbage collector spends enormous amounts of time collecting a small bit of
memory, runs for some fraction of a second, and does it again. Adding more
to the JVM's memory allows the parallel collections to work without so many
stop-the-world GC pauses.

So what I'd do is turn on GC logging (probably on the replicas) and look for
very long GC pauses. Mark Miller put together a blog here:
https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

See the Getting a View into Garbage Collection section. The smoking gun
is full GC pauses that are longer than the ZK timeout.
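If it helps, these are the stock HotSpot options for that (the log path is a placeholder); add them to the JVM running Solr and scan the resulting log for long pauses:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/solr_gc.log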

90M docs in 4 hours across 10 shards is only 625/sec or so per shard. I've
seen sustained indexing rates significantly above this; YMMV of course, a
lot depends on the size of the docs.

What version of Solr BTW? And when you say you fire a bunch of indexers, I'm
assuming these are SolrJ clients that use CloudSolrServer?
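By that I mean an indexer shaped roughly like this (ZooKeeper ensemble and collection name are placeholders; an untested sketch against the 4.x SolrJ API):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexerSketch {
  public static void main(String[] args) throws Exception {
    // CloudSolrServer reads cluster state from ZooKeeper and routes updates to the shard leaders.
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("search1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    server.add(doc);   // no explicit commit; rely on autoCommit/autoSoftCommit for visibility
    server.shutdown();
  }
}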

Best,
Erick


On Sun, Jan 25, 2015 at 4:10 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

 Thank you for the reply Eric.
 I am sorry I had wrong information posted. I posted our DEV env
 configuration by mistake.
 After double checking our stress and Prod Beta env where we have found the
 original issue, I found all the searchers have around 50 GB of RAM
 available and two instances of JVM running (2 different ports). Both
 instances have 12 GB allocated. The rest 26 GB is available for the OS. 1st
  instance on a host has search1 collection (live collection) and the 2nd
 instance on the same host  has search2 collection (for full indexing ).

 There is plenty room for OS related tasks. Our issue is not in anyway
 related to OS starving as shown from our dashboards.
 We have been through

 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 a lot of times but  we have two modes of operation
 a)  1st collection (Live traffic) - heavy searches and medium indexing
 b)  2nd collection (Not serving traffic) - very heavy indexing, no searches

 When our indexing finishes we swap the alias for these collection . So
 essentially we need to have a configuration that can support both the use
 cases together. We have tried a lot of different configuration options and
 none of them seems to work. My suspicion is that solr cloud is unable to
 keep up with the updates at the rate we are sending while it is trying to
 be consistent with all the replicas.


 On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Shawn directed you over here to the user list, but I see this note on
  SOLR-7030:
  All our searchers have 12 GB of RAM available and have quad core
 Intel(R)
  Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
  jboss and solr in it . All 12 GB is available as heap for the java
  process...
 
  So you have 12G physical memory and have allocated 12G to the Java
 process?
  This is an anti-pattern. If that's
  the case, your operating system is being starved for memory, probably
  hitting a state where it spends all of its
  time in stop-the-world garbage collection, eventually it doesn't respond
 to
  Zookeeper's ping so Zookeeper
  thinks the node is down and puts it into recovery. Where it spends a lot
 of
  time doing... essentially nothing.
 
  About the hard and soft commits: I suspect these are entirely unrelated,
  but here's a blog on what they do, you
  should pick the configuration that supports your use case (i.e. how much
  latency can you stand between indexing
  and being able to search?).
 
 
 
 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 
  Here's one very good reason you shouldn't starve your op system by
  allocating all the physical memory to the JVM:
  http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
 
 
  But your biggest problem is that you have far too much of your physical
  memory allocated to the JVM. This
  will cause you endless problems, you just need more physical memory on
  those boxes. It's _possible_ you could
  get by with less memory for the JVM, counterintuitive as it seems try 8G
 or
  maybe even 6G. At some point
  you'll hit OOM errors, but that'll give you a lower limit on what the JVM
  needs.
 
  Unless I've mis-interpreted 

Re: Indexed epoch time in Solr

2015-01-25 Thread Jorge Luis Betancourt González
Perhaps you could use a DocTransformer to convert the unix time field into any
representation you want? You'll need to write a custom DocTransformer, but this
is not a complex task.
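A rough, untested sketch of the idea (the class name, field names, and the exact 4.x DocTransformer signatures are assumptions to verify against your Solr version):

import java.util.Date;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.response.transform.DocTransformer;

// Adds a derived field to each returned document: the unix-epoch value wrapped in a
// java.util.Date, which Solr's response writers render as ISO-8601.
public class EpochToIsoTransformer extends DocTransformer {
  private final String sourceField;   // e.g. "created_epoch" (assumed name)
  private final String targetField;   // e.g. "created_iso" (assumed name)

  public EpochToIsoTransformer(String sourceField, String targetField) {
    this.sourceField = sourceField;
    this.targetField = targetField;
  }

  @Override
  public String getName() {
    return targetField;
  }

  @Override
  public void transform(SolrDocument doc, int docid) {
    Object v = doc.getFieldValue(sourceField);
    if (v instanceof Number) {
      doc.setField(targetField, new Date(((Number) v).longValue() * 1000L));
    }
  }
}

You would still register a small TransformerFactory in solrconfig.xml so the transformer can be requested through the fl parameter.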

Regards,

- Original Message -
From: Ahmed Adel ahmed.a...@badrit.com
To: solr-user@lucene.apache.org
Sent: Monday, January 26, 2015 12:35:54 AM
Subject: Indexed epoch time in Solr

Hi All,

Is there a way to convert a unix time field that is already indexed to
ISO-8601 format in the query response? If this is not possible at the query
level, what is the best way to copy this field to a new Solr standard date
field?

Thanks,

-- 
Ahmed Adel
http://s.wisestamp.com/links?url=http%3A%2F%2Fwww.linkedin.com%2Fin%2F





Re: solr replication vs. rsync

2015-01-25 Thread Erick Erickson
@Shawn: Cool table, thanks!

@Dan:
Just to throw a different spin on it, if you migrate to SolrCloud, then
this question becomes moot as the raw documents are sent to each of the
replicas so you very rarely have to copy the full index. Kind of a tradeoff
between constant load because you're sending the raw documents around
whenever you index and peak usage when the index replicates.

There are a bunch of other reasons to go to SolrCloud, but you know your
problem space best.

FWIW,
Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 1/24/2015 10:56 PM, Dan Davis wrote:
  When I polled the various projects already using Solr at my
 organization, I
  was greatly surprised that none of them were using Solr replication,
  because they had talked about replicating the data.
 
  But we are not Pinterest, and do not expect to be taking in changes one
  post at a time (at least the engineers don't - just wait until its used
 for
  a Crud app that wants full-text search on a description field!).
 Still,
  rsync can be very, very fast with the right options (-W for gigabit
  ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s
 over
  GigE previously.
 
  Does anyone have any numbers for how fast Solr replication goes, and what
  to do to tune it?
 
  I'm not enthusiastic to give-up recently tested cluster stability for a
  home grown mess, but I am interested in numbers that are out there.

 Numbers are included on the Solr replication wiki page, both in graph
 and numeric form.  Gathering these numbers must have been pretty easy --
 before the HTTP replication made it into Solr, Solr used to contain an
 rsync-based implementation.

 http://wiki.apache.org/solr/SolrReplication#Performance_numbers

 Other data on that wiki page discusses the replication config.  There's
 not a lot to tune.

 I run a redundant non-SolrCloud index myself through a different method
 -- my indexing program indexes each index copy completely independently.
  There is no replication.  This separation allows me to upgrade any
 component, or change any part of solrconfig or schema, on either copy of
 the index without affecting the other copy at all.  With replication, if
 something is changed on the master or the slave, you might find that the
 slave no longer works, because it will be handling an index created by
 different software or a different config.

 Thanks,
 Shawn




Indexed epoch time in Solr

2015-01-25 Thread Ahmed Adel
Hi All,

Is there a way to convert a unix time field that is already indexed to
ISO-8601 format in the query response? If this is not possible at the query
level, what is the best way to copy this field to a new Solr standard date
field?

Thanks,

-- 
Ahmed Adel
http://s.wisestamp.com/links?url=http%3A%2F%2Fwww.linkedin.com%2Fin%2F


Re: [MASSMAIL]Weighting of prominent text in HTML

2015-01-25 Thread Jorge Luis Betancourt González
Hi Dan:

Agreed, this question is more Nutch related than Solr ;)

Nutch doesn't send any data to the /update/extract request handler; all the text
and metadata extraction happens on the Nutch side rather than relying on the
ExtractRequestHandler provided by Solr. Underneath, Nutch uses Tika, the same
technology as the ExtractRequestHandler provided by Solr, so there shouldn't be
any great difference.

By default Nutch doesn't boost anything, as it is Solr's job to boost the
content in the different fields, which is what happens when you run a query
against Solr. Nutch calculates the LinkRank, which is a variation of the famous
PageRank (or the OPIC score, another scoring algorithm implemented in
Nutch, which I believe is the default in Nutch 2.x). What you can do is use the
headings: map the heading tags into different fields and then apply
different boosts to each field.

The general idea with Nutch is to break the web page into pieces and store each
piece in a different field in Solr; then you can tweak your relevance function
using the values you see fit, so you don't need to write any plugin to
accomplish this (at least for the h1, h2, etc. example you provided; if you
want to extract other parts of the webpage you'll need to write your own plugin
to do so).
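For example (the field names are assumptions about how you map the headings on the Nutch side), an edismax request can then weight those fields differently:

/select?defType=edismax&q=example+query&qf=title^10+h1^5+h2^3+content^1&wt=json

A match in title or h1 then contributes more to the score than the same match in the body text.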

Nutch is highly customizable: you can write a plugin for almost any piece of
logic, from parsers and indexers to URL filters, scoring algorithms,
protocols, and a long, long list more. Usually the plugins are not so
difficult to write; the hard part is knowing which extension point you need
to use, and that comes with experience and a good dive into the source code.

Hope this helps,

- Original Message -
From: Dan Davis dansm...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Monday, January 26, 2015 12:08:13 AM
Subject: [MASSMAIL]Weighting of prominent text in HTML

By examining solr.log, I can see that Nutch is using the /update request
handler rather than /update/extract.   So, this may be a more appropriate
question for the nutch mailing list. OTOH, y'all know the answer off the
top of your head.

Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
normal paragraph? Can this weighting be tuned without writing a plugin?
Is writing a plugin often needed because of the flexibility that is
needed in practice?

I wanted to call this post *Anatomy of a small scale search engine*, but
lacked the nerve ;)

Thanks, all and many,

Dan Davis, Systems/Applications Architect
National Library of Medicine





Weighting of prominent text in HTML

2015-01-25 Thread Dan Davis
By examining solr.log, I can see that Nutch is using the /update request
handler rather than /update/extract.   So, this may be a more appropriate
question for the nutch mailing list. OTOH, y'all know the answer off the
top of your head.

Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
normal paragraph? Can this weighting be tuned without writing a plugin?
Is writing a plugin often needed because of the flexibility that is
needed in practice?

I wanted to call this post *Anatomy of a small scale search engine*, but
lacked the nerve ;)

Thanks, all and many,

Dan Davis, Systems/Applications Architect
National Library of Medicine


Re: replicas goes in recovery mode right after update

2015-01-25 Thread Vijay Sekhri
Thank you for the reply, Erick.
I am sorry, I had the wrong information posted; I posted our DEV env
configuration by mistake.
After double-checking our stress and prod beta envs, where we found the
original issue, all the searchers have around 50 GB of RAM available and two
JVM instances running (on 2 different ports). Both instances have 12 GB
allocated; the remaining 26 GB is available for the OS. The 1st instance on
a host has the search1 collection (live collection) and the 2nd instance on
the same host has the search2 collection (for full indexing).

There is plenty of room for OS-related tasks. Our issue is not in any way
related to OS starvation, as our dashboards show.
We have been through
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
a lot of times, but we have two modes of operation:
a) 1st collection (live traffic) - heavy searches and medium indexing
b) 2nd collection (not serving traffic) - very heavy indexing, no searches

When our indexing finishes we swap the alias for these collections. So
essentially we need a configuration that can support both use cases
together. We have tried a lot of different configuration options and none
of them seems to work. My suspicion is that SolrCloud is unable to keep up
with the updates at the rate we are sending while it is trying to be
consistent with all the replicas.


On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Shawn directed you over here to the user list, but I see this note on
 SOLR-7030:
 All our searchers have 12 GB of RAM available and have quad core Intel(R)
 Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
 jboss and solr in it . All 12 GB is available as heap for the java
 process...

 So you have 12G physical memory and have allocated 12G to the Java process?
 This is an anti-pattern. If that's
 the case, your operating system is being starved for memory, probably
 hitting a state where it spends all of its
 time in stop-the-world garbage collection, eventually it doesn't respond to
 Zookeeper's ping so Zookeeper
 thinks the node is down and puts it into recovery. Where it spends a lot of
 time doing... essentially nothing.

 About the hard and soft commits: I suspect these are entirely unrelated,
 but here's a blog on what they do, you
 should pick the configuration that supports your use case (i.e. how much
 latency can you stand between indexing
 and being able to search?).


 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

 Here's one very good reason you shouldn't starve your op system by
 allocating all the physical memory to the JVM:
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html


 But your biggest problem is that you have far too much of your physical
 memory allocated to the JVM. This
 will cause you endless problems, you just need more physical memory on
 those boxes. It's _possible_ you could
 get by with less memory for the JVM, counterintuitive as it seems try 8G or
 maybe even 6G. At some point
 you'll hit OOM errors, but that'll give you a lower limit on what the JVM
 needs.

 Unless I've mis-interpreted what you've written, though, I doubt you'll get
 stable with that much memory allocated
 to the JVM.

 Best,
 Erick



 On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com
 wrote:

  We have a cluster of solr cloud server with 10 shards and 4 replicas in
  each shard in our stress environment. In our prod environment we will
 have
  10 shards and 15 replicas in each shard. Our current commit settings are
 as
  follows
 
   <autoSoftCommit>
     <maxDocs>50</maxDocs>
     <maxTime>18</maxTime>
   </autoSoftCommit>
   <autoCommit>
     <maxDocs>200</maxDocs>
     <maxTime>18</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>
 
 
  We indexed roughly 90 Million docs. We have two different ways to index
  documents a) Full indexing. It takes 4 hours to index 90 Million docs and
  the rate of docs coming to the searcher is around 6000 per second b)
  Incremental indexing. It takes an hour to indexed delta changes. Roughly
  there are 3 million changes and rate of docs coming to the searchers is
  2500
  per second
 
  We have two collections search1 and search2. When we do full indexing ,
 we
  do it in search2 collection while search1 is serving live traffic. After
 it
  finishes we swap the collection using aliases so that the search2
  collection serves live traffic while search1 becomes available for next
  full indexing run. When we do incremental indexing we do it in the
 search1
  collection which is serving live traffic.
 
  All our searchers have 12 GB of RAM available and have quad core Intel(R)
  Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
  jboss and solr in it . All 12 GB is available as heap for the java
  process.  We have observed 

Re: solr replication vs. rsync

2015-01-25 Thread Dan Davis
@Erick,

Problem space is not constant indexing.   I thought SolrCloud replicas were
replication, and you imply parallel indexing.  Good to know.

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:

 @Shawn: Cool table, thanks!

 @Dan:
 Just to throw a different spin on it, if you migrate to SolrCloud, then
 this question becomes moot as the raw documents are sent to each of the
 replicas so you very rarely have to copy the full index. Kind of a tradeoff
 between constant load because you're sending the raw documents around
 whenever you index and peak usage when the index replicates.

 There are a bunch of other reasons to go to SolrCloud, but you know your
 problem space best.

 FWIW,
 Erick

 On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org
 javascript:; wrote:

  On 1/24/2015 10:56 PM, Dan Davis wrote:
   When I polled the various projects already using Solr at my
  organization, I
   was greatly surprised that none of them were using Solr replication,
   because they had talked about replicating the data.
  
   But we are not Pinterest, and do not expect to be taking in changes one
   post at a time (at least the engineers don't - just wait until its used
  for
   a Crud app that wants full-text search on a description field!).
  Still,
   rsync can be very, very fast with the right options (-W for gigabit
   ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s
  over
   GigE previously.
  
   Does anyone have any numbers for how fast Solr replication goes, and
 what
   to do to tune it?
  
   I'm not enthusiastic to give-up recently tested cluster stability for a
   home grown mess, but I am interested in numbers that are out there.
 
  Numbers are included on the Solr replication wiki page, both in graph
  and numeric form.  Gathering these numbers must have been pretty easy --
  before the HTTP replication made it into Solr, Solr used to contain an
  rsync-based implementation.
 
  http://wiki.apache.org/solr/SolrReplication#Performance_numbers
 
  Other data on that wiki page discusses the replication config.  There's
  not a lot to tune.
 
  I run a redundant non-SolrCloud index myself through a different method
  -- my indexing program indexes each index copy completely independently.
   There is no replication.  This separation allows me to upgrade any
  component, or change any part of solrconfig or schema, on either copy of
  the index without affecting the other copy at all.  With replication, if
  something is changed on the master or the slave, you might find that the
  slave no longer works, because it will be handling an index created by
  different software or a different config.
 
  Thanks,
  Shawn
 
 



Re: solr replication vs. rsync

2015-01-25 Thread Dan Davis
Thanks!

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:

 @Shawn: Cool table, thanks!

 @Dan:
 Just to throw a different spin on it, if you migrate to SolrCloud, then
 this question becomes moot as the raw documents are sent to each of the
 replicas so you very rarely have to copy the full index. Kind of a tradeoff
 between constant load because you're sending the raw documents around
 whenever you index and peak usage when the index replicates.

 There are a bunch of other reasons to go to SolrCloud, but you know your
 problem space best.

 FWIW,
 Erick

 On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org
 javascript:; wrote:

  On 1/24/2015 10:56 PM, Dan Davis wrote:
   When I polled the various projects already using Solr at my
  organization, I
   was greatly surprised that none of them were using Solr replication,
   because they had talked about replicating the data.
  
   But we are not Pinterest, and do not expect to be taking in changes one
   post at a time (at least the engineers don't - just wait until its used
  for
   a Crud app that wants full-text search on a description field!).
  Still,
   rsync can be very, very fast with the right options (-W for gigabit
   ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s
  over
   GigE previously.
  
   Does anyone have any numbers for how fast Solr replication goes, and
 what
   to do to tune it?
  
   I'm not enthusiastic to give-up recently tested cluster stability for a
   home grown mess, but I am interested in numbers that are out there.
 
  Numbers are included on the Solr replication wiki page, both in graph
  and numeric form.  Gathering these numbers must have been pretty easy --
  before the HTTP replication made it into Solr, Solr used to contain an
  rsync-based implementation.
 
  http://wiki.apache.org/solr/SolrReplication#Performance_numbers
 
  Other data on that wiki page discusses the replication config.  There's
  not a lot to tune.
 
  I run a redundant non-SolrCloud index myself through a different method
  -- my indexing program indexes each index copy completely independently.
   There is no replication.  This separation allows me to upgrade any
  component, or change any part of solrconfig or schema, on either copy of
  the index without affecting the other copy at all.  With replication, if
  something is changed on the master or the slave, you might find that the
  slave no longer works, because it will be handling an index created by
  different software or a different config.
 
  Thanks,
  Shawn
 
 



replicas goes in recovery mode right after update

2015-01-25 Thread Vijay Sekhri
We have a cluster of solr cloud server with 10 shards and 4 replicas in
each shard in our stress environment. In our prod environment we will have
10 shards and 15 replicas in each shard. Our current commit settings are as
follows

<autoSoftCommit>
  <maxDocs>50</maxDocs>
  <maxTime>18</maxTime>
</autoSoftCommit>
<autoCommit>
  <maxDocs>200</maxDocs>
  <maxTime>18</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>


We indexed roughly 90 million docs. We have two different ways to index
documents: a) Full indexing. It takes 4 hours to index 90 million docs, and
the rate of docs coming to the searchers is around 6000 per second. b)
Incremental indexing. It takes an hour to index delta changes. Roughly
there are 3 million changes, and the rate of docs coming to the searchers
is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we
do it in the search2 collection while search1 is serving live traffic. After
it finishes we swap the collections using aliases, so that the search2
collection serves live traffic while search1 becomes available for the next
full indexing run. When we do incremental indexing we do it in the search1
collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and quad-core Intel(R)
Xeon(R) CPU X5570 @ 2.93GHz. There is only one Java process running, i.e.
JBoss with Solr in it. All 12 GB is available as heap for the Java process.
We have observed that the heap memory of the Java process averages around
8-10 GB. All searchers have a final index size of 9 GB, so in total there
are 9 x 10 (shards) = 90 GB worth of index files.

We have observed the following issue when we trigger indexing. About
10 minutes after we trigger indexing on 14 parallel hosts, the replicas
go into recovery mode. This happens on all the shards. In about 20
minutes more and more replicas start going into recovery mode. After about
half an hour all replicas except the leader are in recovery mode. We cannot
throttle the indexing load as that would increase our overall indexing time.
So to overcome this issue, we remove all the replicas before we trigger the
indexing and then add them back after the indexing finishes.

We observe the same behavior of replicas going into recovery when we do
incremental indexing. We cannot remove replicas during our incremental
indexing because the collection is also serving live traffic. We tried to
throttle our indexing speed, however the cluster still goes into recovery.

If we leave the cluster as it is, when the indexing finishes it eventually
recovers after a while. As it is serving live traffic we cannot have these
replicas go into recovery mode, because our tests have shown it also
degrades search performance.

We have tried different commit settings, like the ones below:

a) No auto soft commit, no auto hard commit, and a commit triggered at the
end of indexing
b) No auto soft commit, yes auto hard commit, and a commit at the end of
indexing
c) Yes auto soft commit, no auto hard commit
d) Yes auto soft commit, yes auto hard commit
e) Different frequency settings for the commits above. Please NOTE that we
have tried a 15 minute soft commit with a 30 minute hard commit, the same
time settings for both, and a 30 minute soft commit with a one hour hard
commit.

Unfortunately all of the above yield the same behavior. The replicas still
go into recovery. We have increased the Zookeeper timeout from 30 seconds to
5 minutes and the problem persists. Is there any setting that would fix
this issue?

-- 
Vijay Sekhri