Re: Facet Double Counting
Still the same. Could the reason be that if there are duplicate logs/documents, the facet query counts them, but when I do the search query, Solr eliminates the duplicates?

On Sat, Jan 24, 2015 at 11:47 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Harish, What happens when you purge deleted terms with 'solr/core/update?commit=true&expungeDeletes=true'? ahmet

On Sunday, January 25, 2015 1:59 AM, harish singh harish.sing...@gmail.com wrote:

Hi, I am noticing strange behavior with Solr facet searching. This is my facet query:

  params: {
    facet: true,
    sort: startTimeISO desc,
    debugQuery: true,
    facet.mincount: 1,
    facet.sort: count,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)),
    facet.limit: 100,
    facet.field: loginUserName,
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 0
  }

The result I am getting is:

  facet_counts: {
    facet_queries: { },
    facet_fields: {
      loginUserName: [
        harry,
        36,
        larry,
        10,
        Carey ]
    },
    facet_dates: { },
    facet_ranges: { }
  }

As you can see, the result shows a facet count of 36 for loginUserName=harry. So when I do a Solr search for the logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now. For some reason, I see double counting. Either faceting is double counting or search is half-counting? This is my Solr search query:

  params: {
    sort: startTimeISO desc,
    debugQuery: true,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)) AND (loginUserName:(harry)),
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 200
  }

This query gives only 18 logs, but the Solr facet query gave 36. Is there something incorrect in either (or both) of my queries? I am trying to debug it, but I think I am missing something silly.
RE: Facet Double Counting
harish singh [harish.sing...@gmail.com] wrote: As you see, the result is showing a facet count of 36 for loginUserName=harry. So when I do a Solr search for logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now.

If you have recently added or changed uniqueKey, and your index has multiple documents with the same key, that would explain the behaviour you describe. If so, I recommend you delete the index and rebuild it from scratch.
- Toke Eskildsen
Re: Facet Double Counting
Weird; optimize or expungeDeletes=true should do the trick. Can you try to optimize this time?

On Sunday, January 25, 2015 11:08 AM, harish singh harish.sing...@gmail.com wrote:

Still the same. Could the reason be that if there are duplicate logs/documents, the facet query counts them, but when I do the search query, Solr eliminates the duplicates?

On Sat, Jan 24, 2015 at 11:47 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Harish, What happens when you purge deleted terms with 'solr/core/update?commit=true&expungeDeletes=true'? ahmet

On Sunday, January 25, 2015 1:59 AM, harish singh harish.sing...@gmail.com wrote:

Hi, I am noticing strange behavior with Solr facet searching. This is my facet query:

  params: {
    facet: true,
    sort: startTimeISO desc,
    debugQuery: true,
    facet.mincount: 1,
    facet.sort: count,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)),
    facet.limit: 100,
    facet.field: loginUserName,
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 0
  }

The result I am getting is:

  facet_counts: {
    facet_queries: { },
    facet_fields: {
      loginUserName: [
        harry,
        36,
        larry,
        10,
        Carey ]
    },
    facet_dates: { },
    facet_ranges: { }
  }

As you can see, the result shows a facet count of 36 for loginUserName=harry. So when I do a Solr search for the logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now. For some reason, I see double counting. Either faceting is double counting or search is half-counting? This is my Solr search query:

  params: {
    sort: startTimeISO desc,
    debugQuery: true,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)) AND (loginUserName:(harry)),
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 200
  }

This query gives only 18 logs, but the Solr facet query gave 36. Is there something incorrect in either (or both) of my queries?
I am trying to debug it, but I think I am missing something silly.
Re: Facet Double Counting
Oh yes!! :) I tried faceting on the UUID field. All the UUIDs have count = 2, which probably explains why I am getting double counting in the facet result. So does this mean that when I do a facet query on facet.field=loginUserName, Solr does not look at the UUID? And is the unique field (UUID in this case) considered only for search queries?

On Sun, Jan 25, 2015 at 3:15 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

harish singh [harish.sing...@gmail.com] wrote: As you see, the result is showing a facet count of 36 for loginUserName=harry. So when I do a Solr search for logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now.

If you have recently added or changed uniqueKey, and your index has multiple documents with the same key, that would explain the behaviour you describe. If so, I recommend you delete the index and rebuild it from scratch.
- Toke Eskildsen
RE: Facet Double Counting
harish singh [harish.sing...@gmail.com] wrote: I tried the Faceting on the UUID field.

Nice debug trick. I'll remember that for next time.

So does this mean, when I do a facet query on facet.field=loginUserName, Solr does not look at the UUID?

Yes. For faceting, Solr only uses the internal docIDs and the facet field data.

And the unique field (UUID in this case) is considered only while Search Queries?

For a distributed setup, the documents are resolved from the shards using uniqueKey. I did not think this was the case for a non-distributed setup - for such a setup, I thought that the documents were resolved using internal docIDs. If your index is single-shard, then I was wrong.
- Toke Eskildsen
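The behavior discussed in this thread can be illustrated with a toy model (plain Python, not Solr code; the shard contents here are made-up examples): per-shard facet counts are summed with no reference to uniqueKey, while document results are resolved by uniqueKey, so duplicated documents inflate facet counts but collapse in search results.

```python
from collections import Counter

# Toy model: two shards that both ended up with copies of the same
# documents, e.g. after a uniqueKey change. Each doc is (uuid, loginUserName).
shard1 = [("u1", "harry"), ("u2", "harry")]
shard2 = [("u1", "harry"), ("u2", "harry")]  # duplicates of shard1

# Facet merge: per-shard term counts are simply summed -- uniqueKey is
# never consulted, so duplicates are counted twice.
facet = Counter()
for shard in (shard1, shard2):
    facet.update(name for _, name in shard)

# Document merge: results are resolved by uniqueKey, so duplicates collapse.
seen = {}
for shard in (shard1, shard2):
    for uuid, name in shard:
        seen.setdefault(uuid, name)

print(facet["harry"])  # 4: facet count is doubled
print(len(seen))       # 2: search returns each uuid only once
```

This matches the 36-vs-18 discrepancy reported above: every document existed twice, faceting counted both copies, and search returned each uuid once.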
Re: solr replication vs. rsync
On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in any numbers that are out there.

Numbers are included on the Solr replication wiki page, in both graph and numeric form. Gathering these numbers must have been pretty easy -- before the HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune. I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config.

Thanks, Shawn
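As a rough sanity check on the throughput quoted above (48 MB/s over GigE), a full-index copy time is simple arithmetic; the 9 GB index size used here is an illustrative assumption borrowed from the replicas-in-recovery thread elsewhere in this digest:

```python
# Back-of-the-envelope: time to copy a full index at the quoted rsync rate.
index_gb = 9          # assumed index size (from another thread in this digest)
throughput_mb_s = 48  # measured rsync rate over GigE

seconds = index_gb * 1024 / throughput_mb_s
print(round(seconds))          # 192 seconds
print(round(seconds / 60, 1))  # 3.2 minutes
```

At that rate a full copy is a few minutes, which is why the interesting question is less raw speed and more how often a full (rather than incremental) transfer is triggered.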
Sorting on a computed value
I'll bet some super user has figured this out. How can I perform a sort on a single computed field? I have a QParserPlugin that collapses docs based on data from multiple fields, and I am summing the values from one numerical field 'X'. I was going to use a DocTransformer to inject that summed value into the search results as a new field, but I have now realized that I need to be able to sort on this summed field. Without retrieving all results (which could be 1M+) in my app and sorting manually, is there any way to sort on my computed field within Solr? (Using Solr 4.9.)
--
View this message in context: http://lucene.472066.n3.nabble.com/Sorting-on-a-computed-value-tp4181875.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: replicas goes in recovery mode right after update
Shawn directed you over here to the user list, but I see this note on SOLR-7030: "All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e jboss and solr in it. All 12 GB is available as heap for the java process..."

So you have 12G of physical memory and have allocated 12G to the Java process? This is an anti-pattern. If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery, where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do. You should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?): https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems; you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM; counterintuitive as it seems, try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs. Unless I've mis-interpreted what you've written, though, I doubt you'll get stable with that much memory allocated to the JVM.
Best, Erick

On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

We have a cluster of solr cloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:

  <autoSoftCommit>
    <maxDocs>50</maxDocs>
    <maxTime>18</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>18</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

We indexed roughly 90 million docs. We have two different ways to index documents: a) Full indexing. It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6000 per second. b) Incremental indexing. It takes an hour to index the delta changes. Roughly there are 3 million changes, and the rate of docs coming to the searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes we swap the collections using aliases, so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing, we do it in the search1 collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed that the heap memory of the java process averages around 8-10 GB. All searchers have a final index size of 9 GB, so in total there are 9 x 10 (shards) = 90 GB worth of index files.

We have observed the following issue when we trigger indexing: in about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes, more and more replicas start going into recovery mode.
After about half an hour, all replicas except the leaders are in recovery mode. We cannot throttle the indexing load, as that would increase our overall indexing time. So to overcome this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes. We observe the same behavior of replicas going into recovery when we do incremental indexing. We cannot remove replicas during our incremental indexing because it is also serving live traffic. We tried to throttle our indexing speed, however the cluster still goes into recovery. If we leave the cluster as it is, when the indexing finishes it eventually recovers after a while. As it is serving live traffic, we cannot have these replicas go into recovery mode, because, as our tests have shown, it degrades the search performance. We have tried different commit settings like the below: a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing. b) No auto soft commit, yes auto hard commit, and a commit at the end of indexing. c) Yes auto
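For context on the memory advice in the message above, a quick back-of-the-envelope calculation using the figures from this thread shows why a 12 GB heap on a 12 GB box is a problem for Lucene's mmap-based index access:

```python
# Figures from the thread: 12 GB physical RAM per searcher, 12 GB heap,
# 9 GB index per searcher. Lucene relies on the OS page cache (via
# MMapDirectory) to keep index data hot, and the page cache can only
# use memory the JVM has not claimed.
physical_gb = 12
heap_gb = 12
index_gb = 9

free_for_page_cache = physical_gb - heap_gb
print(free_for_page_cache)             # 0 GB left for caching the index
print(index_gb - free_for_page_cache)  # 9 GB of index with nowhere to be cached
```

With nothing left for the page cache, every index read risks hitting disk, which is consistent with the stalls and recovery storms described above.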
Re: Unexplained leader initiated recovery after updates - SolrCmdDistributor no longer retries on RemoteSolrException
Hi Lindsey, Were you ever able to figure out the reason for this behavior? We are experiencing the same issue with solr cloud version 4.10: http://lucene.472066.n3.nabble.com/jira-Commented-SOLR-7030-replicas-goes-in-recovery-mode-right-after-update-td4181881.html https://issues.apache.org/jira/browse/SOLR-7030 We even tried removing the replicas to get around this issue. However, we cannot do that for the collection that is serving our live traffic. Any suggestions? Vijay
--
View this message in context: http://lucene.472066.n3.nabble.com/Re-Unexplained-leader-initiated-recovery-after-updates-SolrCmdDistributor-no-longer-retries-on-Remotn-tp4179309p4181882.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr replication vs. rsync
bq: I thought SolrCloud replicas were replication, and you imply parallel indexing

Absolutely! You couldn't get near-real-time indexing if you relied on replication a la 3x. And you also couldn't guarantee consistency. Say you have 1 shard, a leader and a follower (i.e. 2 replicas). Now you throw a doc to be indexed. The sequence is:

1. leader gets the doc
2. leader forwards the doc to the follower
3. leader and follower both add the doc to their local index (and tlog)
4. follower acks back to leader
5. leader acks back to client

So yes, the raw document is forwarded to all replicas before the leader responds to the client, the docs all get written to the tlogs, etc. That's the only way to guarantee that if the leader goes down, the follower can take over without losing documents.

Best, Erick

On Sun, Jan 25, 2015 at 6:15 PM, Dan Davis dansm...@gmail.com wrote:

@Erick, Problem space is not constant indexing. I thought SolrCloud replicas were replication, and you imply parallel indexing. Good to know.

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:

@Shawn: Cool table, thanks!

@Dan: Just to throw a different spin on it: if you migrate to SolrCloud, this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's a tradeoff between constant load (because you're sending the raw documents around whenever you index) and peak usage (when the index replicates). There are a bunch of other reasons to go to SolrCloud, but you know your problem space best.

FWIW, Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote:

On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data.
But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in any numbers that are out there.

Numbers are included on the Solr replication wiki page, in both graph and numeric form. Gathering these numbers must have been pretty easy -- before the HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune. I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config.

Thanks, Shawn
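The update sequence Erick describes above can be sketched as a toy model (plain Python, nothing Solr-specific) to make the consistency guarantee concrete: the client's ack implies every follower already holds the doc.

```python
# Toy model of the SolrCloud update flow: the leader acks to the client
# only after it has indexed locally and every follower has acked.
def index_doc(doc, leader, followers):
    leader.append(doc)        # leader adds doc to its local index (and tlog)
    acks = []
    for follower in followers:
        follower.append(doc)  # leader forwards the raw doc to the follower
        acks.append(True)     # follower acks back to the leader
    return all(acks)          # leader acks back to the client

leader, follower = [], []
client_acked = index_doc("doc1", leader, [follower])
print(client_acked)        # True
print("doc1" in follower)  # True: follower holds the doc before the client ack
```

Since the follower indexes before the client is acked, a leader failure after the ack cannot lose the document, which is the guarantee the message above explains.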
Re: replicas goes in recovery mode right after update
Ah, OK. Whew! Because I was wondering how you were running at _all_ if all the memory was allocated to the JVM ;)

What is your Zookeeper timeout? The original default was 15 seconds, and this has caused problems like this. Here's the scenario: you send a bunch of docs at the server, and eventually you hit a stop-the-world GC that takes longer than the Zookeeper timeout. So ZK thinks the node is down and initiates recovery. Eventually, you hit this on all the replicas.

Sometimes I've seen situations where the answer is giving a bit more memory to the JVM, say 2-4G in your case. The theory here (and this is a shot in the dark) is that your peak JVM requirements are close to your 12G, so the garbage collector spends enormous amounts of time collecting a small bit of memory, runs for some fraction of a second, and does it again. Adding more memory to the JVMs allows the parallel collections to work without so many stop-the-world GC pauses.

So what I'd do is turn on GC logging (probably on the replicas) and look for very long GC pauses. Mark Miller put together a blog here: https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ See the "getting a view into garbage collection" section. The smoking gun here is if you see full GC pauses that are longer than the ZK timeout.

90M docs in 4 hours across 10 shards is only 625/sec or so per shard. I've seen sustained indexing rates significantly above this; YMMV of course, a lot depends on the size of the docs.

What version of Solr, BTW? And when you say you fire a bunch of indexers, I'm assuming these are SolrJ clients and use CloudSolrServer?

Best, Erick

On Sun, Jan 25, 2015 at 4:10 PM, Vijay Sekhri sekhrivi...@gmail.com wrote: Thank you for the reply, Erick. I am sorry, I had the wrong information posted. I posted our DEV env configuration by mistake.
After double-checking our stress and prod beta envs, where we found the original issue, I found all the searchers have around 50 GB of RAM available and two instances of the JVM running (on 2 different ports). Both instances have 12 GB allocated. The remaining 26 GB is available for the OS. The 1st instance on a host has the search1 collection (the live collection) and the 2nd instance on the same host has the search2 collection (for full indexing). There is plenty of room for OS-related tasks. Our issue is not in any way related to OS starving, as shown by our dashboards.

We have been through https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ a lot of times, but we have two modes of operation: a) 1st collection (live traffic) - heavy searches and medium indexing; b) 2nd collection (not serving traffic) - very heavy indexing, no searches. When our indexing finishes we swap the alias for these collections. So essentially we need a configuration that can support both use cases together. We have tried a lot of different configuration options and none of them seems to work. My suspicion is that solr cloud is unable to keep up with the updates at the rate we are sending them while it is trying to stay consistent with all the replicas.

On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com wrote:

Shawn directed you over here to the user list, but I see this note on SOLR-7030: "All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e jboss and solr in it. All 12 GB is available as heap for the java process..."

So you have 12G of physical memory and have allocated 12G to the Java process? This is an anti-pattern.
If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery, where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do. You should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?): https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems; you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM; counterintuitive as it seems, try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs. Unless I've mis-interpreted
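The per-shard rate Erick quotes in the message above works out as follows (simple arithmetic, shown only to make the figures easy to check):

```python
# Indexing-rate arithmetic from the thread: 90M docs in 4 hours, 10 shards.
docs = 90_000_000
hours = 4
shards = 10

total_rate = docs / (hours * 3600)  # docs/sec across the whole cluster
per_shard = total_rate / shards
print(round(total_rate))  # 6250 docs/sec overall
print(round(per_shard))   # 625 docs/sec per shard
```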
Re: Indexed epoch time in Solr
Perhaps you could use a DocTransformer to convert the unix time field into any representation you want? You'll need to write a custom DocTransformer, but this is not a complex task. Regards,

- Original Message -
From: Ahmed Adel ahmed.a...@badrit.com
To: solr-user@lucene.apache.org
Sent: Monday, January 26, 2015 12:35:54 AM
Subject: Indexed epoch time in Solr

Hi All, Is there a way to convert a unix time field that is already indexed to ISO-8601 format in the query response? If this is not possible at the query level, what is the best way to copy this field to a new Solr standard date field? Thanks,
-- *Ahmed Adel* http://s.wisestamp.com/links?url=http%3A%2F%2Fwww.linkedin.com%2Fin%2F
---
XII Anniversary of the founding of the University of Informatics Sciences. 12 years of history alongside Fidel. December 12, 2014.
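Outside Solr, the conversion being asked about is trivial; here is a sketch in Python (shown only to illustrate the epoch-to-ISO-8601 mapping, not a Solr API; a custom DocTransformer would do the equivalent in Java):

```python
from datetime import datetime, timezone

def epoch_to_iso8601(epoch_seconds):
    """Convert a unix timestamp (seconds) to the ISO-8601 UTC form
    Solr uses for date fields, e.g. 2015-01-26T00:00:00Z."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(epoch_to_iso8601(0))           # 1970-01-01T00:00:00Z
print(epoch_to_iso8601(1422230400))  # 2015-01-26T00:00:00Z
```

For copying into a real Solr date field, the same conversion would have to happen at index time (e.g. in an update processor), since the string returned here is exactly the format Solr's date fields expect.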
Re: solr replication vs. rsync
@Shawn: Cool table, thanks!

@Dan: Just to throw a different spin on it: if you migrate to SolrCloud, this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's a tradeoff between constant load (because you're sending the raw documents around whenever you index) and peak usage (when the index replicates). There are a bunch of other reasons to go to SolrCloud, but you know your problem space best.

FWIW, Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote:

On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in any numbers that are out there.

Numbers are included on the Solr replication wiki page, in both graph and numeric form. Gathering these numbers must have been pretty easy -- before the HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune. I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication.
This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config. Thanks, Shawn
Indexed epoch time in Solr
Hi All, Is there a way to convert a unix time field that is already indexed to ISO-8601 format in the query response? If this is not possible at the query level, what is the best way to copy this field to a new Solr standard date field? Thanks,
-- *Ahmed Adel* http://s.wisestamp.com/links?url=http%3A%2F%2Fwww.linkedin.com%2Fin%2F
Re: [MASSMAIL]Weighting of prominent text in HTML
Hi Dan: Agreed, this question is more Nutch-related than Solr ;) Nutch doesn't send any data into the /update/extract request handler; all the text and metadata extraction happens on the Nutch side rather than relying on the ExtractingRequestHandler provided by Solr. Underneath, Nutch uses Tika, the same technology as the ExtractingRequestHandler, so there shouldn't be any great difference.

By default Nutch doesn't boost anything, as it is Solr's job to boost the different content in the different fields, which is what happens when you do a query against Solr. Nutch calculates the LinkRank, which is a variation of the famous PageRank (or the OPIC score, which is another scoring algorithm implemented in Nutch, and which I believe is the default in Nutch 2.x).

What you can do is use the headings and map the heading tags into different fields, and then apply different boosts to each field. The general idea with Nutch is to break the web page into pieces and store each piece in a different field in Solr; then you can tweak your relevance function using the values you see fit. So you don't need to write any plugin to accomplish this (at least for the h1, h2, etc. example you provided; if you want to extract other parts of the webpage you'll need to write your own plugin to do so).

Nutch is highly customizable; you can write a plugin for almost any piece of logic, from parsers to indexers, passing through URL filters, scoring algorithms, protocols, and a long, long list. Usually the plugins are not so difficult to write, but the problem is knowing which extension point you need to use; this comes with experience and taking a good dive into the source code.

Hope this helps,

- Original Message -
From: Dan Davis dansm...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Monday, January 26, 2015 12:08:13 AM
Subject: [MASSMAIL]Weighting of prominent text in HTML

By examining solr.log, I can see that Nutch is using the /update request handler rather than /update/extract.
So, this may be a more appropriate question for the nutch mailing list. OTOH, y'all know the answer off the top of your head. Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a normal paragraph? Can this weighting be tuned without writing a plugin? Is writing a plugin often needed because of the flexibility that is needed in practice? I wanted to call this post *Anatomy of a small scale search engine*, but lacked the nerve ;) Thanks, all and many, Dan Davis, Systems/Applications Architect, National Library of Medicine
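The per-field boosting described in the reply above can be expressed on the Solr side with an edismax `qf` parameter. A small sketch of the query parameters, where the field names (`h1`, `h2`, `content`) are hypothetical and assume Nutch has been configured to map heading tags into those fields:

```python
# Sketch of Solr query params that weight heading fields more heavily
# than body text. The field names are assumptions, not a standard schema.
params = {
    "defType": "edismax",
    "q": "search engine anatomy",
    "qf": "h1^5 h2^3 content^1",  # h1 matches score 5x, h2 3x vs body text
    "wt": "json",
}

# Rendered as a query string (illustrative only; a real client would
# URL-encode the values before sending them to /select).
query_string = "&".join(f"{k}={v}" for k, v in params.items())
print(query_string)
```

The boost numbers themselves are tuning knobs; the point is that the weighting lives in the Solr query, not in Nutch, so it can be changed without writing any plugin.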
Weighting of prominent text in HTML
By examining solr.log, I can see that Nutch is using the /update request handler rather than /update/extract. So, this may be a more appropriate question for the nutch mailing list. OTOH, y'all know the answer off the top of your head. Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a normal paragraph? Can this weighting be tuned without writing a plugin? Is writing a plugin often needed because of the flexibility that is needed in practice? I wanted to call this post *Anatomy of a small scale search engine*, but lacked the nerve ;) Thanks, all and many, Dan Davis, Systems/Applications Architect, National Library of Medicine
Re: replicas goes in recovery mode right after update
Thank you for the reply, Erick. I am sorry, I had the wrong information posted. I posted our DEV env configuration by mistake.

After double-checking our stress and prod beta envs, where we found the original issue, I found all the searchers have around 50 GB of RAM available and two instances of the JVM running (on 2 different ports). Both instances have 12 GB allocated. The remaining 26 GB is available for the OS. The 1st instance on a host has the search1 collection (the live collection) and the 2nd instance on the same host has the search2 collection (for full indexing). There is plenty of room for OS-related tasks. Our issue is not in any way related to OS starving, as shown by our dashboards.

We have been through https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ a lot of times, but we have two modes of operation: a) 1st collection (live traffic) - heavy searches and medium indexing; b) 2nd collection (not serving traffic) - very heavy indexing, no searches. When our indexing finishes we swap the alias for these collections. So essentially we need a configuration that can support both use cases together. We have tried a lot of different configuration options and none of them seems to work. My suspicion is that solr cloud is unable to keep up with the updates at the rate we are sending them while it is trying to stay consistent with all the replicas.

On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com wrote:

Shawn directed you over here to the user list, but I see this note on SOLR-7030: "All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e jboss and solr in it. All 12 GB is available as heap for the java process..."

So you have 12G of physical memory and have allocated 12G to the Java process? This is an anti-pattern.
If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery, where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do. You should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?): https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems; you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM; counterintuitive as it seems, try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs. Unless I've mis-interpreted what you've written, though, I doubt you'll get stable with that much memory allocated to the JVM.

Best, Erick

On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

We have a cluster of solr cloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:

  <autoSoftCommit>
    <maxDocs>50</maxDocs>
    <maxTime>18</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>18</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

We indexed roughly 90 million docs. We have two different ways to index documents: a) Full indexing.
It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6000 per second.
b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes, and the rate of docs coming to the searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes, we swap the collections using aliases so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing, we do it in the search1 collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and have quad-core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed ...
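Erick's point about heap versus page cache can be made concrete with a little arithmetic. This is a hedged back-of-envelope sketch using the numbers quoted in this thread (50 GB RAM per host, two 12 GB JVM heaps, a 9 GB index per searcher); the helper function and figures are illustrative, not from any Solr API.

```python
# Rough memory-budget check for a Solr host: how much RAM is left for the
# OS page cache once the JVM heaps are carved out, and whether the index
# can be fully cached (mmap'd) in what remains. Figures are illustrative.

def os_cache_headroom(total_gb, heap_gbs, index_gb):
    """Return (GB left for the OS page cache, whether the index fits in it)."""
    free = total_gb - sum(heap_gbs)
    return free, free >= index_gb

headroom, fits = os_cache_headroom(50, [12, 12], 9)
print(headroom, fits)  # 26 True -- 26 GB of cache headroom; the 9 GB index fits
```

By this accounting the updated 50 GB configuration is healthy, whereas the originally reported setup (12 GB physical, 12 GB heap) leaves zero headroom, which is exactly the anti-pattern Erick describes.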
Re: solr replication vs. rsync
@Erick, the problem space is not constant indexing. I thought SolrCloud replicas were replication, and you imply parallel indexing. Good to know.

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:
@Shawn: Cool table, thanks!
@Dan: Just to throw a different spin on it: if you migrate to SolrCloud, this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's kind of a tradeoff between constant load (because you're sending the raw documents around whenever you index) and peak usage when the index replicates. There are a bunch of other reasons to go to SolrCloud, but you know your problem space best.
FWIW,
Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote:
On 1/24/2015 10:56 PM, Dan Davis wrote:
When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, even though they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in numbers that are out there.

Numbers are included on the Solr replication wiki page, both in graph and numeric form. Gathering these numbers must have been pretty easy -- before HTTP replication made it into Solr, Solr used to contain an rsync-based implementation.
http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune.

I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or the schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config.

Thanks,
Shawn
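Dan's 48 MB/s rsync figure makes it easy to estimate what a full-index copy actually costs. A quick hedged sketch, borrowing the 9 GB per-searcher index size mentioned in the "replicas goes in recovery" thread (the numbers are illustrative, not measured on this cluster):

```python
# Back-of-envelope: wall-clock time to copy a full index over GigE at the
# rsync rate quoted in the thread (48 MB/s). Illustrative numbers only.

def copy_time_seconds(index_gb, rate_mb_per_s):
    """Seconds to transfer index_gb gigabytes at rate_mb_per_s megabytes/sec."""
    return index_gb * 1024 / rate_mb_per_s

t = copy_time_seconds(9, 48)
print(round(t))  # 192 -- i.e. a bit over 3 minutes per full copy
```

That peak-bandwidth burst every replication cycle is the tradeoff Erick describes against SolrCloud's steady per-document fan-out.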
Re: solr replication vs. rsync
Thanks!

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:
[quoted text snipped -- identical to the previous messages in this thread]
replicas goes in recovery mode right after update
We have a cluster of SolrCloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:

  <autoSoftCommit>
    <maxDocs>50</maxDocs>
    <maxTime>18</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>18</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

We indexed roughly 90 million docs. We have two different ways to index documents:
a) Full indexing. It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6000 per second.
b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes, and the rate of docs coming to the searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes, we swap the collections using aliases so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing, we do it in the search1 collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and have quad-core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed that the heap memory of the java process averages around 8-10 GB. All searchers have a final index size of 9 GB, so in total there is 9 GB x 10 shards = 90 GB worth of index files.

We have observed the following issue when we trigger indexing: about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes, more and more replicas go into recovery mode. After about half an hour, all replicas except the leaders are in recovery mode.
We cannot throttle the indexing load, as that would increase our overall indexing time. So to work around this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes. We observe the same behavior of replicas going into recovery when we do incremental indexing, and we cannot remove replicas during incremental indexing because that collection is also serving live traffic. We tried to throttle our indexing speed; however, the cluster still goes into recovery. If we leave the cluster as it is, it eventually recovers a while after the indexing finishes. But since it is serving live traffic, we cannot have these replicas go into recovery mode, because our tests have shown that it also degrades search performance.

We have tried different commit settings, like the below:
a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing
b) No auto soft commit, auto hard commit, and a commit at the end of indexing
c) Auto soft commit, no auto hard commit
d) Auto soft commit, auto hard commit
e) Different frequency settings for the commits above. Please NOTE that we have tried a 15-minute soft commit setting with a 30-minute hard commit setting, the same time settings for both, and a 30-minute soft commit with a one-hour hard commit setting.

Unfortunately, all of the above yield the same behavior: the replicas still go into recovery. We have also increased the ZooKeeper timeout from 30 seconds to 5 minutes and the problem persists. Is there any setting that would fix this issue?

--
* Vijay Sekhri *
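One thing a reply might suggest is throttling on the client side rather than in Solr: send documents in fixed-size batches with a pause between batches, capping the sustained update rate the replicas must absorb. A minimal hedged sketch of that idea; `send_batch` is a hypothetical placeholder for whatever client call (e.g. a SolrJ or pysolr add) actually posts the documents, and the batch sizes are illustrative:

```python
import time

def batches(docs, size):
    """Split docs into consecutive batches of at most `size` documents."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def throttled_index(docs, batch_size, max_batches_per_sec, send_batch):
    """Post docs in batches, sleeping between batches to cap the update rate.

    send_batch is a placeholder for the real indexing call (hypothetical).
    """
    pause = 1.0 / max_batches_per_sec
    for batch in batches(docs, batch_size):
        send_batch(batch)
        time.sleep(pause)

# Example of the batching alone: 10 docs in batches of 3
sizes = [len(b) for b in batches(list(range(10)), 3)]
print(sizes)  # [3, 3, 3, 1]
```

Whether this helps depends on why the replicas fall behind; if they are timing out on ZooKeeper pings during GC pauses, the memory sizing discussed earlier in the thread matters more than the send rate.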