Overseer Leader gone

2015-08-31 Thread Rishi Easwaran
Hi All,

I have a cluster whose overseer leader is gone. This is on Solr 4.10.3.
It is completely gone from ZooKeeper, and bouncing any instance does not start a 
new election process.
Has anyone experienced this issue before, and any ideas on how to fix it?

Thanks,
Rishi.


Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Rishi Easwaran
Worrying about data loss makes sense. If I understand the way Solr behaves, the new 
directory should only contain the missing/changed segments.
I guess that since our application is extremely write heavy, with lots of inserts and 
deletes, almost every segment is touched even during a short window, so for our 
deployment it appears that every segment is copied over when replicas get out of sync.

Thanks for clarifying this SolrCloud behaviour so we can put external 
steps in place to resolve the situation when it arises.
 

 

 

-Original Message-
From: Ramkumar R. Aiyengar andyetitmo...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 5, 2015 4:52 am
Subject: Re: Multiple index.timestamp directories using up disk space


Yes, data loss is the concern. If the recovering replica is not able to retrieve the 
files from the leader, it at least has an older copy.

Also, the entire index is not fetched from the leader, only the segments which have 
changed. The replica initially gets the file list from the leader, checks against 
what it has, and then downloads the difference -- then moves it to the main index. 
Note that this process can fail sometimes (say due to I/O errors, or due to a problem 
with the leader itself), in which case the replica drops all accumulated files from 
the leader and starts from scratch. If that happens, it needs to look back at its old 
index again to figure out what it needs to download on the next attempt.

Maybe with a fair number of assumptions which should usually hold good, you could 
still come up with a mechanism to drop existing files, but those won't hold good in 
case of serious issues with the cloud, and you could end up losing data. That's worse 
than using a bit more disk space!
On 4 May 2015 11:56, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Thanks for the responses Mark and Ramkumar.

 The question I had was, why does Solr need 2 copies at any given time, leading to 
 2x disk space usage.
 Not sure if this information is not published anywhere, and makes HW estimation 
 almost impossible for large scale deployment. Even if the copies are temporary, 
 this becomes really expensive, especially when using SSD in production, when the 
 complex size is over 400TB indexes, running 1000's of solr cloud shards.

 If a solr follower has decided that it needs to do replication from leader and 
 capture full copy snapshot. Why can't it delete the old information and replicate 
 from scratch, not requiring more disk space.
 Is the concern data loss (a case when both leader and follower lose data)?.

 Thanks,
 Rishi.







-Original Message-
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Apr 28, 2015 10:52 am
Subject: Re: Multiple index.timestamp directories using up disk space


If copies of the index are not eventually cleaned up, I'd file a JIRA to address the 
issue. Those directories should be removed over time. At times there will have to be 
a couple around at the same time and others may take a while to clean up.

- Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote:

 SolrCloud does need up to twice the amount of disk space as your usual index size 
 during replication. Amongst other things, this ensures you have a full copy of the 
 index at any point. There's no way around this, I would suggest you provision the 
 additional disk space needed.
 On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote:

  Hi All,

  We are seeing this problem with solr 4.6 and solr 4.10.3.
  For some reason, solr cloud tries to recover and creates a new index directory 
  (ex: index.20150420181214550), while keeping the older index as is. This creates 
  an issue where the disk space fills up and the shard never ends up recovering.
  Usually this requires a manual intervention of bouncing the instance and wiping 
  the disk clean to allow for a clean recovery.

  Any ideas on how to prevent solr from creating multiple copies of the index 
  directory?

  Thanks,
  Rishi.
 


 


Solr/ Solr Cloud meetup at Aol

2015-05-05 Thread Rishi Easwaran

 Hi All,

Aol is hosting a meetup in Dulles VA. The topic this time is Solr/ Solr Cloud. 

 http://www.meetup.com/Code-Brew/events/53217/

Thanks,
Rishi.

Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Rishi Easwaran
Hi Shawn, 

Thanks for clarifying the Lucene segment behaviour. We don't trigger optimize 
externally; could it be an internal Solr optimize? Is there a setting/knob to 
control when optimize occurs?

Thanks for pointing that out, we will monitor memory closely. I doubt memory is 
an issue, though: these are top-tier machines with 144GB RAM supporting 12 JVMs of 
4GB each, of which 9 JVMs are running in cloud mode writing to SSD, which should 
leave enough memory for the OS cache.

The behaviour we see is multiple huge directories for the same core. Till we 
figure out what's going on, the only option we are left with is to clean up 
the entire index to free up disk space and allow the replica to sync from 
scratch.

Thanks,
Rishi.  

 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 5, 2015 10:55 am
Subject: Re: Multiple index.timestamp directories using up disk space


On 5/5/2015 7:29 AM, Rishi Easwaran wrote:
 Worried about data loss makes sense. If I get the way solr behaves, the new 
 directory should only have missing/changed segments.
 I guess since our application is extremely write heavy, with lot of inserts and 
 deletes, almost every segment is touched even during a short window, so it appears 
 like for our deployment every segment is copied over when replicas get out of sync.

Once a segment is written, it is *NEVER* updated again.  This aspect of Lucene 
indexes makes Solr replication more efficient.  The ids of deleted documents are 
written to separate files specifically for tracking deletes.  Those files are 
typically quite small compared to the index segments.  Any new documents are 
inserted into new segments.

When older segments are merged, the information in all of those segments is copied 
to a single new segment (minus documents marked as deleted), and then the old 
segments are erased.  Optimizing replaces the entire index, and each replica of the 
index would be considered different, so an index recovery that happens after 
optimization might copy the whole thing.

If you are seeing a lot of index recoveries during normal operation, chances are 
that your Solr servers do not have enough resources, and the resource that has the 
most impact on performance is memory.  The amount of memory required for good Solr 
performance is higher than most people expect.  It's a normal expectation that 
programs require memory to run, but Solr has an additional memory requirement that 
often surprises them -- the need for a significant OS disk cache:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


 


Re: Solr Cloud reclaiming disk space from deleted documents

2015-05-04 Thread Rishi Easwaran
Sadly, with the size of our complex, splitting and adding more HW is not a viable 
long term solution.
I guess the options we have are to run optimize regularly and/or become proactively 
aggressive with our merges, even before SolrCloud gets into this situation.

Thanks,
Rishi.
 

 

 

-Original Message-
From: Gili Nachum gilinac...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Apr 27, 2015 4:18 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents


To prevent it from re-occurring you could monitor index size and, once above a 
certain size threshold, add another machine and split the shard between the existing 
and new machine.
On Apr 20, 2015 9:10 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 So is there anything that can be done from a tuning perspective, to recover a 
 shard that is 75%-90% full, other that get rid of the index and rebuild the data?
 Also to prevent this issue from re-occurring, looks like we need make our system 
 aggressive with segment merges using lower merge factor

 Thanks,
 Rishi.




-Original Message-
 From: Shawn Heisey apa...@elyograg.org
 To: solr-user solr-user@lucene.apache.org
 Sent: Mon, Apr 20, 2015 11:25 am
 Subject: Re: Solr Cloud reclaiming disk space from deleted documents

 On 4/20/2015 8:44 AM, Rishi Easwaran wrote:
  Yeah I noticed that. Looks like optimize won't work since on some disks we are 
  already pretty full.
  Any thoughts on increasing/decreasing <mergeFactor>10</mergeFactor> or 
  ConcurrentMergeScheduler to make solr do merges faster.

 You don't have to do an optimize to need 2x disk space.  Even normal merging, if 
 it happens just right, can require the same disk space as a full optimize.  Normal 
 Solr operation requires that you have enough space for your index to reach at 
 least double size on occasion.

 Higher merge factors are better for indexing speed, because merging happens less 
 frequently.  Lower merge factors are better for query speed, at least after the 
 merging finishes, because merging happens more frequently and there are fewer 
 total segments at any given moment.

 During a merge, there is so much I/O that query speed is often negatively affected.

 Thanks,
 Shawn





 


Re: Multiple index.timestamp directories using up disk space

2015-05-04 Thread Rishi Easwaran
Thanks for the responses Mark and Ramkumar.

The question I had was: why does Solr need 2 copies at any given time, leading to 
2x disk space usage?
This information does not seem to be published anywhere, and that makes HW 
estimation almost impossible for a large scale deployment. Even if the copies are 
temporary, this becomes really expensive, especially when using SSD in production, 
when the complex size is over 400TB of indexes running 1000's of solr cloud shards.

If a solr follower has decided that it needs to replicate from the leader and 
capture a full copy snapshot, why can't it delete the old information and replicate 
from scratch, not requiring more disk space?
Is the concern data loss (a case when both leader and follower lose data)?

Thanks,
Rishi.

 

 

 

-Original Message-
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Apr 28, 2015 10:52 am
Subject: Re: Multiple index.timestamp directories using up disk space


If copies of the index are not eventually cleaned up, I'd file a JIRA to address the 
issue. Those directories should be removed over time. At times there will have to be 
a couple around at the same time and others may take a while to clean up.

- Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote:

 SolrCloud does need up to twice the amount of disk space as your usual index size 
 during replication. Amongst other things, this ensures you have a full copy of the 
 index at any point. There's no way around this, I would suggest you provision the 
 additional disk space needed.
 On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote:

  Hi All,

  We are seeing this problem with solr 4.6 and solr 4.10.3.
  For some reason, solr cloud tries to recover and creates a new index directory 
  (ex: index.20150420181214550), while keeping the older index as is. This creates 
  an issue where the disk space fills up and the shard never ends up recovering.
  Usually this requires a manual intervention of bouncing the instance and wiping 
  the disk clean to allow for a clean recovery.

  Any ideas on how to prevent solr from creating multiple copies of the index 
  directory?

  Thanks,
  Rishi.
 


 


Re: Multiple index.timestamp directories using up disk space

2015-05-04 Thread Rishi Easwaran
Walter,

Unless I am missing something here... I completely get that when a few segments 
merge, Solr requires 2x the space of those segments to accomplish it.
Usually an index has multiple segment files, so this fragmented 2x space 
consumption is not an issue, even as merged segments grow bigger.

But what I am talking about is a copy of the whole index, as is, into a new 
directory. The new directory has no relation to the older index directory or its 
segments, so I'm not sure what merges are going on across directories/indexes, 
and why Solr needs the older index.

Thanks,
Rishi.

 

 

 

-Original Message-
From: Walter Underwood wun...@wunderwood.org
To: solr-user solr-user@lucene.apache.org
Sent: Mon, May 4, 2015 9:50 am
Subject: Re: Multiple index.timestamp directories using up disk space


One segment is in use, being searched. That segment (and others) are merged into a 
new segment. After the new segment is ready, searches are directed to the new copy 
and the old copies are deleted.

That is how two copies are needed.

If you cannot provide 2X the disk space, you will not have a stable Solr 
installation. You should consider a different search engine.

“Optimizing” (forced merges) will not help. It will probably cause failures more 
often because it always merges the largest segments.

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On May 4, 2015, at 3:53 AM, Rishi Easwaran rishi.easwa...@aol.com
wrote:

 Thanks for the responses Mark and Ramkumar.
 
 The question I
had was, why does Solr need 2 copies at any given time, leading to 2x disk space
usage. 
 Not sure if this information is not published anywhere, and makes HW
estimation almost impossible for large scale deployment. Even if the copies are
temporary, this becomes really expensive, especially when using SSD in
production, when the complex size is over 400TB indexes, running 1000's of solr
cloud shards. 
 
 If a solr follower has decided that it needs to do
replication from leader and capture full copy snapshot. Why can't it delete the
old information and replicate from scratch, not requiring more disk space.
 Is
the concern data loss (a case when both leader and follower lose data)?.
 

Thanks,
 Rishi.   
 
 
 
 
 
 
 
 -Original
Message-
 From: Mark Miller markrmil...@gmail.com
 To: solr-user
solr-user@lucene.apache.org
 Sent: Tue, Apr 28, 2015 10:52 am
 Subject:
Re: Multiple index.timestamp directories using up disk space
 
 
 If
copies of the index are not eventually cleaned up, I'd fill a JIRA
 to

address the issue. Those directories should be removed over time. At
 times

there will have to be a couple around at the same time and others may
 take

a while to clean up.
 
 - Mark
 
 On Tue, Apr 28, 2015 at 3:27 AM
Ramkumar
 R. Aiyengar 
 andyetitmo...@gmail.com wrote:
 
 SolrCloud
does need up to
 twice the amount of disk space as your usual
 index size
during replication.
 Amongst other things, this ensures you have
 a full
copy of the index at any
 point. There's no way around this, I would

suggest you provision the
 additional disk space needed.
 On 20 Apr 2015
23:21, Rishi Easwaran
 rishi.easwa...@aol.com wrote:
 
 Hi
All,
 
 We are seeing this
 problem with solr 4.6 and solr
4.10.3.
 For some reason, solr cloud tries to
 recover and creates a new
index
 directory - (ex:index.20150420181214550),
 while keeping the older
index
 as
 is. This creates an issues where the
 disk space fills up
and the shard
 never ends up recovering.
 Usually
 this requires a
manual intervention of  bouncing the instance and
 wiping
 the disk clean
to allow for a clean recovery.
 
 Any ideas on how to
 prevent solr
from creating multiple copies of index
 directory.
 
 

Thanks,
 Rishi.
 
 
 
 


 


Re: Solr Cloud reclaiming disk space from deleted documents

2015-05-04 Thread Rishi Easwaran
Thanks Shawn. Yeah, regular optimizes might be the route we take if this becomes a 
recurring issue.
I remember that in our old multicore deployment the CPU used to spike and the core 
almost became non-responsive.

My guess is that with the SolrCloud architecture, any slack from the leader while it 
is optimizing is picked up by the replica.
I was searching around for the optimize behaviour of SolrCloud and could not find 
much information.

Does anyone have experience running optimize for SolrCloud in a loaded production 
environment?

Thanks,
Rishi.
 
 

 

 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Mon, May 4, 2015 9:11 am
Subject: Re: Solr Cloud reclaiming disk space from deleted documents


On 5/4/2015 4:55 AM, Rishi Easwaran wrote:
 Sadly with the size of our complex, spiting and adding more HW is not a viable 
 long term solution.
 I guess the options we have are to run optimize regularly and/or become aggressive 
 in our merges proactively even before solr cloud gets into this situation.

If you are regularly deleting most of your index, or reindexing large parts of it, 
which effectively does the same thing, then regular optimizes may be required to 
keep the index size down, although you must remember that you need enough room for 
the core to grow in order to actually complete the optimize.  If the core is 75-90 
percent deleted docs, then you will not need 2x the core size to optimize it, 
because the new index will be much smaller.

Currently, SolrCloud will always optimize the entire collection when you ask for an 
optimize on any core, but it will NOT optimize all the replicas (cores) at the same 
time.  It will go through the cores that make up the collection and optimize each 
one in sequence.  If your index is sharded and replicated enough, hopefully that 
will make it possible for the optimize to complete even though the amount of disk 
space available may be low.

We have at least one issue in Jira where users have asked for optimize to honor 
distrib=false, which would allow the user to be in complete control of all 
optimizing, but so far that hasn't been implemented.  The volunteers that maintain 
Solr can only accomplish so much in the limited time they have available.

Thanks,
Shawn


 


Multiple index.timestamp directories using up disk space

2015-04-20 Thread Rishi Easwaran
Hi All,

We are seeing this problem with solr 4.6 and solr 4.10.3.
For some reason, solr cloud tries to recover and creates a new index directory 
(ex: index.20150420181214550), while keeping the older index as is. This creates an 
issue where the disk space fills up and the shard never ends up recovering.
Usually this requires a manual intervention of bouncing the instance and wiping the 
disk clean to allow for a clean recovery.

Any ideas on how to prevent solr from creating multiple copies of the index 
directory?

Thanks,
Rishi.


Re: Solr Cloud reclaiming disk space from deleted documents

2015-04-20 Thread Rishi Easwaran
Yeah, I noticed that. Looks like optimize won't work, since on some disks we are 
already pretty full.
Any thoughts on increasing/decreasing <mergeFactor>10</mergeFactor> or the 
ConcurrentMergeScheduler settings to make Solr do merges faster?


 

 

 

-Original Message-
From: Gili Nachum gilinac...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Sun, Apr 19, 2015 12:34 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents


I assume you don't have much free space available in your disk. Notice that during 
optimization (merge into a single segment) your shard replica space usage may peak 
to 2x-3x of its normal size until optimization completes.
Is it a problem? Not if optimization occurs over shards serially and your index is 
broken into many small shards.
On Apr 18, 2015 1:54 AM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Thanks Shawn for the quick reply.
 Our indexes are running on SSD, so 3 should be ok.
 Any recommendation on bumping it up?

 I guess we will have to run optimize for the entire solr cloud and see if we can 
 reclaim space.

 Thanks,
 Rishi.








 -Original Message-
 From: Shawn Heisey apa...@elyograg.org
 To: solr-user solr-user@lucene.apache.org
 Sent: Fri, Apr 17, 2015 6:22 pm
 Subject: Re: Solr Cloud reclaiming disk space from deleted documents

 On 4/17/2015 2:15 PM, Rishi Easwaran wrote:
  Running into an issue and wanted to see if anyone had some suggestions.
  We are seeing this with both solr 4.6 and 4.10.3 code.
  We are running an extremely update heavy application, with millions of writes 
  and deletes happening to our indexes constantly.  An issue we are seeing is that 
  solr cloud is not reclaiming the disk space that can be used for new inserts by 
  cleaning up deletes.

  We used to run optimize periodically with our old multicore set up, not sure if 
  that works for solr cloud.

  Num Docs: 28762340
  Max Doc: 48079586
  Deleted Docs: 19317246

  Version 1429299216227
  Gen 16525463
  Size 109.92 GB

  In our solrconfig.xml we use the following configs.

  <indexConfig>
    <!-- Values here affect all index writers and act as a default unless 
         overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>

    <mergeFactor>10</mergeFactor>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxThreadCount">3</int>
      <int name="maxMergeCount">15</int>
    </mergeScheduler>
    <ramBufferSizeMB>64</ramBufferSizeMB>
  </indexConfig>

 This part of my response won't help the issue you wrote about, but it can affect 
 performance, so I'm going to mention it.  If your indexes are stored on regular 
 spinning disks, reduce mergeScheduler/maxThreadCount to 1.  If they are stored on 
 SSD, then a value of 3 is OK.  Spinning disks cannot do seeks (read/write head 
 moves) fast enough to handle multiple merging threads properly.  All the seek 
 activity required will really slow down merging, which is a very bad thing when 
 your indexing load is high.  SSD disks do not have to seek, so multiple threads 
 are OK there.

 An optimize is the only way to reclaim all of the disk space held by deleted 
 documents.  Over time, as segments are merged automatically, deleted doc space 
 will be automatically recovered, but it won't be perfect, especially as segments 
 are merged multiple times into very large segments.

 If you send an optimize command to a core/collection in SolrCloud, the entire 
 collection will be optimized ... the cloud will do one shard replica (core) at a 
 time until the entire collection has been optimized.  There is no way (currently) 
 to ask it to only optimize a single core, or to do multiple cores simultaneously, 
 even if they are on different servers.

 Thanks,
 Shawn





 


Re: Solr Cloud reclaiming disk space from deleted documents

2015-04-20 Thread Rishi Easwaran
So is there anything that can be done from a tuning perspective to recover a shard 
that is 75%-90% full, other than getting rid of the index and rebuilding the data?
Also, to prevent this issue from re-occurring, it looks like we need to make our 
system aggressive with segment merges, using a lower merge factor.

Thanks,
Rishi.

 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Apr 20, 2015 11:25 am
Subject: Re: Solr Cloud reclaiming disk space from deleted documents


On 4/20/2015 8:44 AM, Rishi Easwaran wrote:
 Yeah I noticed that. Looks like optimize won't work since on some disks we are 
 already pretty full.
 Any thoughts on increasing/decreasing <mergeFactor>10</mergeFactor> or 
 ConcurrentMergeScheduler to make solr do merges faster.

You don't have to do an optimize to need 2x disk space.  Even normal merging, if it 
happens just right, can require the same disk space as a full optimize.  Normal Solr 
operation requires that you have enough space for your index to reach at least 
double size on occasion.

Higher merge factors are better for indexing speed, because merging happens less 
frequently.  Lower merge factors are better for query speed, at least after the 
merging finishes, because merging happens more frequently and there are fewer total 
segments at any given moment.

During a merge, there is so much I/O that query speed is often negatively affected.

Thanks,
Shawn
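
For illustration only (this sketch is not from the thread, and the values are made 
up): a more aggressive merge configuration for Solr 4.x along the lines discussed 
above could look like the following in solrconfig.xml. With TieredMergePolicy, 
maxMergeAtOnce/segmentsPerTier play the role of mergeFactor, and reclaimDeletesWeight 
biases merging toward segments with many deletes; verify each setter is honored by 
your exact Solr version before relying on it.

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- lower values mean fewer, larger segments and more frequent merging -->
    <int name="maxMergeAtOnce">5</int>
    <int name="segmentsPerTier">5</int>
    <!-- favor merging segments that contain many deleted docs (default is 2.0) -->
    <double name="reclaimDeletesWeight">3.0</double>
  </mergePolicy>
</indexConfig>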


 


Solr Cloud reclaiming disk space from deleted documents

2015-04-17 Thread Rishi Easwaran
Hi All,

Running into an issue and wanted to see if anyone had some suggestions.
We are seeing this with both solr 4.6 and 4.10.3 code.
We are running an extremely update heavy application, with millions of writes and 
deletes happening to our indexes constantly.  An issue we are seeing is that solr 
cloud is not reclaiming the disk space that can be used for new inserts by cleaning 
up deletes.

We used to run optimize periodically with our old multicore set up, not sure if 
that works for solr cloud.

Num Docs:28762340
Max Doc:48079586
Deleted Docs:19317246

Version 1429299216227
Gen 16525463
Size 109.92 GB

In our solrconfig.xml we use the following configs.

<indexConfig>
  <!-- Values here affect all index writers and act as a default unless 
       overridden. -->
  <useCompoundFile>false</useCompoundFile>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>

  <mergeFactor>10</mergeFactor>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">3</int>
    <int name="maxMergeCount">15</int>
  </mergeScheduler>
  <ramBufferSizeMB>64</ramBufferSizeMB>
</indexConfig>


Any suggestions on which tunables to adjust (mergeFactor, mergeScheduler thread 
counts, etc.) would be great.

Thanks,
Rishi.
 


Re: Solr Cloud reclaiming disk space from deleted documents

2015-04-17 Thread Rishi Easwaran
Thanks Shawn for the quick reply.
Our indexes are running on SSD, so 3 should be ok.
Any recommendation on bumping it up?

I guess we will have to run optimize for the entire solr cloud and see if we can 
reclaim space.

Thanks,
Rishi. 
 

 

 

 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Apr 17, 2015 6:22 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents


On 4/17/2015 2:15 PM, Rishi Easwaran wrote:
 Running into an issue and wanted to see if anyone had some suggestions.
 We are seeing this with both solr 4.6 and 4.10.3 code.
 We are running an extremely update heavy application, with millions of writes and 
 deletes happening to our indexes constantly.  An issue we are seeing is that solr 
 cloud is not reclaiming the disk space that can be used for new inserts by 
 cleaning up deletes.

 We used to run optimize periodically with our old multicore set up, not sure if 
 that works for solr cloud.

 Num Docs: 28762340
 Max Doc: 48079586
 Deleted Docs: 19317246

 Version 1429299216227
 Gen 16525463
 Size 109.92 GB

 In our solrconfig.xml we use the following configs.

 <indexConfig>
   <!-- Values here affect all index writers and act as a default unless 
        overridden. -->
   <useCompoundFile>false</useCompoundFile>
   <maxBufferedDocs>1000</maxBufferedDocs>
   <maxMergeDocs>2147483647</maxMergeDocs>
   <maxFieldLength>1</maxFieldLength>

   <mergeFactor>10</mergeFactor>
   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
     <int name="maxThreadCount">3</int>
     <int name="maxMergeCount">15</int>
   </mergeScheduler>
   <ramBufferSizeMB>64</ramBufferSizeMB>
 </indexConfig>

This part of my response won't help the issue you wrote about, but it can affect 
performance, so I'm going to mention it.  If your indexes are stored on regular 
spinning disks, reduce mergeScheduler/maxThreadCount to 1.  If they are stored on 
SSD, then a value of 3 is OK.  Spinning disks cannot do seeks (read/write head 
moves) fast enough to handle multiple merging threads properly.  All the seek 
activity required will really slow down merging, which is a very bad thing when 
your indexing load is high.  SSD disks do not have to seek, so multiple threads are 
OK there.
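
As a concrete illustration of that advice (a sketch only, not taken from this 
thread; the maxMergeCount value here is arbitrary), the mergeScheduler section for 
an index on spinning disks might look like this:

<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- one merge thread for spinning disks; 3 is reasonable on SSD -->
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
</indexConfig>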

An optimize is the only way to reclaim all of the disk space held by deleted 
documents.  Over time, as segments are merged automatically, deleted doc space will 
be automatically recovered, but it won't be perfect, especially as segments are 
merged multiple times into very large segments.

If you send an optimize command to a core/collection in SolrCloud, the entire 
collection will be optimized ... the cloud will do one shard replica (core) at a 
time until the entire collection has been optimized.  There is no way (currently) to 
ask it to only optimize a single core, or to do multiple cores simultaneously, even 
if they are on different servers.

Thanks,
Shawn
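
For reference, one common way to issue that optimize is to post an XML update 
message to the collection's update handler; a minimal sketch (host, port, and 
collection name are placeholders) looks like this:

<!-- POST this body to http://host:8983/solr/<collection>/update -->
<optimize waitSearcher="false" maxSegments="1"/>

The same thing can also be triggered by adding optimize=true as a request parameter 
on the update handler.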


 


Re: Basic Multilingual search capability

2015-02-26 Thread Rishi Easwaran
Hi Tom,

Thanks for your inputs.
I was planning to use a stopword filter, but will definitely make sure the lists are 
unique and don't step over each other.  I think for our system even going with a 
length of 50-75 should be fine; we will definitely adjust that number upward after 
doing some analysis on our input.
Just one clarification: when you say ICUFilterFactory, am I correct in thinking it's 
ICUFoldingFilterFactory?
 
Thanks,
Rishi.

 

 

-Original Message-
From: Tom Burton-West tburt...@umich.edu
To: solr-user solr-user@lucene.apache.org
Sent: Wed, Feb 25, 2015 4:33 pm
Subject: Re: Basic Multilingual search capability


Hi Rishi,

As others have indicated Multilingual search is very difficult to do well.

At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages.  We also added the
CJKBigramFilter to get better precision on CJK queries.  We don't use stop
words because stop words in one language are content words in another.  For
example die in German is a stopword but it is a content word in English.

Putting multiple languages in one index can affect word frequency
statistics which make relevance ranking less accurate.  So for example for
the English query Die Hard the word die would get a low idf score
because it occurs so frequently in German.  We realize that our  approach
does not produce the best results, but given the 400 languages, and limited
resources, we do our best to make search not suck for non-English
languages.   When we have the resources we are thinking about doing special
processing for a small fraction of the top 20 languages.  We plan to select
those languages that most need special processing and are relatively easy to
disambiguate from other languages.


If you plan on identifying languages (rather than scripts), you should be
aware that most language detection libraries don't work well on short texts
such as queries.

If you know that you have scripts for which you have content in only one
language, you can use script detection instead of language detection.


If you have German, a filter length of 25 might be too low (Because of
compounding). You might want to analyze a sample of your German text to
find a good length.

Tom

http://www.hathitrust.org/blogs/Large-scale-Search


On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com
wrote:

 Hi Alex,

 Thanks for the suggestions. These steps will definitely help out with our
 use case.
 Thanks for the idea about the lengthFilter to protect our system.

 Thanks,
 Rishi.







 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Feb 24, 2015 8:50 am
 Subject: Re: Basic Multilingual search capability


 Given the limited needs, I would probably do something like this:

 1) Put a language identifier in the UpdateRequestProcessor chain
 during indexing and route out at least known problematic languages,
 such as Chinese, Japanese, Arabic into individual fields
 2) Put everything else together into one field with ICUTokenizer,
 maybe also ICUFoldingFilter
 3) At the very end of that joint filter, stick in LengthFilter with
 some high number, e.g. 25 characters max. This will ensure that
 super-long words from non-space languages and edge conditions do not
 break the rest of your system.


 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org
 wrote:
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide
 basic search capability for any language. Ex: When the document contains
 hello
 or здравствуйте, the analyzer creates tokens and provides exact match
 search
 results.




 


Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran


Hi Trey,

Thanks for the detailed response and the link to the talk; it was very informative.
Yes, looking at the current system requirements, the ICUTokenizer might be the best 
bet for our use case.
The MultiTextField mentioned in SOLR-6492 has some cool features, and I'm definitely 
looking forward to trying it out once it's integrated into main.

 
Thanks,
Rishi.

 

 

-Original Message-
From: Trey Grainger solrt...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 1:40 am
Subject: Re: Basic Multilingual search capability


Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably the best bet you're going to
have if you really want to go that route and only need exact-match on the
tokens that are parsed. It won't work that well for all languages (CJK
languages, for example), but it will work fine for many.

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time
2) create a separate core per language-combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those language's analyzers
for each document/query.

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
http://solrinaction.com and soon to be contributed back to Solr via
SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may consider checking out the presentation and/or
slides to assess if one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales the least well after you add
more than a few languages to your queries). If you want to stay simple and
stick with the ICUTokenizer then it will work to a point, but some of the
problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.
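
As a rough illustration of the first strategy (this sketch is not from Trey's 
message; the field names are made up, though the language-specific types mirror the 
ones shipped in the 4.x example schema), the schema side might look like the 
following, with the query side searching across the fields via something like 
edismax's qf parameter:

<!-- schema.xml: one field per language, each with its own analysis chain -->
<field name="body_en" type="text_en" indexed="true" stored="true"/>
<field name="body_de" type="text_de" indexed="true" stored="true"/>
<field name="body_ja" type="text_ja" indexed="true" stored="true"/>
<!-- query side (illustrative): defType=edismax, qf="body_en body_de body_ja" -->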

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org
wrote:

 It isn’t just complicated, it can be impossible.

 Do you have content in Chinese or Japanese? Those languages (and some
 others) do not separate words with spaces. You cannot even do word search
 without a language-specific, dictionary-based parser.

 German is space separated, except many noun compounds are not
 space-separated.

 Do you have Finnish content? Entire prepositional phrases turn into word
 endings.

 Do you have Arabic content? That is even harder.

 If all your content is in space-separated languages that are not heavily
 inflected, you can kind of do OK with a language-insensitive approach. But
 it hits the wall pretty fast.

 One thing that does work pretty well is trademarked names (LaserJet, Coke,
 etc). Those are spelled the same in all languages and usually not inflected.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)

 On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com
 wrote:

  Hi Alex,
 
  There is no specific language list.
  For example: the documents that needs to be indexed are emails or any
 messages for a global customer base. The messages back and forth could be
 in any language or mix of languages.
 
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide basic search capability for any language. Ex: When the document
 contains hello or здравствуйте, the analyzer creates tokens and provides
 exact match search results.
 
  Now it would be great if it had capability to tokenize email addresses
 (ex:he...@aol.com- i think standardTokenizer already does this),
 filenames (здравствуйте.pdf

Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran
Hi Alex,

Thanks for the suggestions. These steps will definitely help out with our use 
case.
Thanks for the idea about the lengthFilter to protect our system.

Thanks,
Rishi.

 

 

 

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability


Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain
during indexing and route out at least known problematic languages,
such as Chinese, Japanese, Arabic into individual fields
2) Put everything else together into one field with ICUTokenizer,
maybe also ICUFoldingFilter
3) At the very end of that joint filter, stick in LengthFilter with
some high number, e.g. 25 characters max. This will ensure that
super-long words from non-space languages and edge conditions do not
break the rest of your system.
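
A minimal sketch of what steps 2 and 3 could look like in schema.xml (the type name 
is made up, and the ICU factories require the analysis-extras contrib jars on the 
classpath):

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- script-aware tokenization for mixed-language text -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization plus case/diacritic folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- guard against extremely long tokens from non-spaced or malformed input -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
  </analyzer>
</fieldType>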


Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:
 I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.

 


Re: Strange search behaviour when upgrading to 4.10.3

2015-02-23 Thread Rishi Easwaran
Thanks Shawn.
Just ran the analysis between 4.6 and 4.10. The only difference between the outputs 
seems to be that the positionLength value is set in 4.10. Does that mean anything?

Version 4.10 (SF)

  text            message
  raw_bytes       [6d 65 73 73 61 67 65]
  start           0
  end             7
  positionLength  1
  type            ALNUM
  position        1

Version 4.6 (SF)

  text            message
  raw_bytes       [6d 65 73 73 61 67 65]
  type            ALNUM
  start           0
  end             7
  position        1







Thanks,
Rishi.


 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Feb 20, 2015 6:51 pm
Subject: Re: Strange search behaviour when upgrading to 4.10.3


On 2/20/2015 4:24 PM, Rishi Easwaran wrote:
 Also, the tokenizer we use is very similar to the following.
 ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java
 ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex


 From the looks of it the text is being indexed as a single token and not 
broken across whitespace. 

I can't claim to know how analyzer code works.  I did manage to see the
code, but it doesn't mean much to me.

I would suggest using the analysis tab in the Solr admin interface.  On
that page, select the field or fieldType, set the verbose flag and
type the actual field contents into the index side of the page.  When
you click the Analyze Values button, it will show you what Solr does
with the input at index time.

Do you still have access to any machines (dev or otherwise) running the
old version with the custom component? If so, do the same things on the
analysis page for that version that you did on the new version, and see
whether it does something different.  If it does do something different,
then you will need to track down the problem in the code for your custom
analyzer.

Thanks,
Shawn


 


Basic Multilingual search capability

2015-02-23 Thread Rishi Easwaran
Hi All,

For our use case we don't really need to do a lot of manipulation of incoming text 
during index time: at most, removal of common stop words and tokenizing of emails/ 
filenames etc. if possible. We get text documents from our end users, which can be 
in any language (or sometimes a combination), and we cannot determine the language 
of the incoming text. Language detection at index time is not necessary.

Which analyzer is recommended to achieve basic multilingual search capability for a 
use case like this?
I have read a bunch of posts about using a combination of StandardTokenizer or 
ICUTokenizer, LowerCaseFilter and ReversedWildcardFilterFactory, but I'm looking for 
ideas, suggestions, and best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492  

 
Thanks,
Rishi.
 


Re: Basic Multilingual search capability

2015-02-23 Thread Rishi Easwaran
Hi Alex,

There is no specific language list.
For example: the documents that need to be indexed are emails or any messages for a 
global customer base. The messages back and forth could be in any language or a mix 
of languages.

I understand relevancy, stemming etc. become extremely complicated with multilingual 
support, but our first goal is to be able to tokenize and provide basic search 
capability for any language. Ex: when the document contains hello or здравствуйте, 
the analyzer creates tokens and provides exact match search results.

Now it would be great if it had the capability to tokenize email addresses 
(ex: he...@aol.com - I think StandardTokenizer already does this) and filenames 
(здравствуйте.pdf), but maybe we can use filters to accomplish that.

Thanks,
Rishi.
 
 
-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 5:49 pm
Subject: Re: Basic Multilingual search capability


Which languages are you expecting to deal with? Multilingual support
is a complex issue. Even if you think you don't need much, it is
usually a lot more complex than expected, especially around relevancy.

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:
 Hi All,

 For our use case we don't really need to do a lot of manipulation of incoming 
text during index time. At most removal of common stop words, tokenize emails/ 
filenames etc if possible. We get text documents from our end users, which can 
be in any language (sometimes combination) and we cannot determine the language 
of the incoming text. Language detection at index time is not necessary.

 Which analyzer is recommended to achive basic multilingual search capability 
for a use case like this.
 I have read a bunch of posts about using a combination standardtokenizer or 
ICUtokenizer, lowercasefilter and reverwildcardfilter factory, but looking for 
ideas, suggestions, best practices.

 http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
 http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
 https://issues.apache.org/jira/browse/SOLR-6492


 Thanks,
 Rishi.


 


Re: Basic Multilingual search capability

2015-02-23 Thread Rishi Easwaran
Hi Wunder,

Yes we do expect incoming documents to contain Chinese/Japanese/Arabic 
languages.

From what you have mentioned, it looks like we need to auto-detect the incoming 
content language and tokenize/filter after that.
But I thought the ICU tokenizer had the capability to do that 
(https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer): 
"This tokenizer processes multilingual text and tokenizes it appropriately based on 
its script attribute."
Or am I missing something?

Thanks,
Rishi.

 

 

-Original Message-
From: Walter Underwood wun...@wunderwood.org
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Feb 23, 2015 11:17 pm
Subject: Re: Basic Multilingual search capability


It isn’t just complicated, it can be impossible.

Do you have content in Chinese or Japanese? Those languages (and some others) 
do 
not separate words with spaces. You cannot even do word search without a 
language-specific, dictionary-based parser.

German is space separated, except many noun compounds are not space-separated.

Do you have Finnish content? Entire prepositional phrases turn into word 
endings.

Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily 
inflected, you can kind of do OK with a language-insensitive approach. But it 
hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke, 
etc). 
Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Hi Alex,
 
 There is no specific language list.  
 For example: the documents that needs to be indexed are emails or any 
 messages 
for a global customer base. The messages back and forth could be in any 
language 
or mix of languages.
 
 I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.
 
 Now it would be great if it had capability to tokenize email addresses 
(ex:he...@aol.com- i think standardTokenizer already does this),  filenames 
(здравствуйте.pdf), but maybe we can use filters to accomplish that. 
 
 Thanks,
 Rishi.
 
 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Mon, Feb 23, 2015 5:49 pm
 Subject: Re: Basic Multilingual search capability
 
 
 Which languages are you expecting to deal with? Multilingual support
 is a complex issue. Even if you think you don't need much, it is
 usually a lot more complex than expected, especially around relevancy.
 
 Regards,
   Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/
 
 
 On 23 February 2015 at 16:19, Rishi Easwaran rishi.easwa...@aol.com wrote:
 Hi All,
 
 For our use case we don't really need to do a lot of manipulation of 
 incoming 

 text during index time. At most removal of common stop words, tokenize 
 emails/ 

 filenames etc if possible. We get text documents from our end users, which 
 can 

 be in any language (sometimes combination) and we cannot determine the 
language 
 of the incoming text. Language detection at index time is not necessary.
 
 Which analyzer is recommended to achive basic multilingual search capability 
 for a use case like this.
 I have read a bunch of posts about using a combination standardtokenizer or 
 ICUtokenizer, lowercasefilter and reverwildcardfilter factory, but looking 
 for 

 ideas, suggestions, best practices.
 
 http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
 http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
 https://issues.apache.org/jira/browse/SOLR-6492
 
 
 Thanks,
 Rishi.
 
 
 


 


Re: Strange search behaviour when upgrading to 4.10.3

2015-02-20 Thread Rishi Easwaran
Hi Shawn,
Also, the tokenizer we use is very similar to the following.
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex


From the looks of it the text is being indexed as a single token and not broken 
across whitespace. 

Thanks,
Rishi. 

 

 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Feb 20, 2015 11:52 am
Subject: Re: Strange search behaviour when upgrading to 4.10.3


On 2/20/2015 9:37 AM, Rishi Easwaran wrote:
 We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search 4.10.3 
search results are not being returned, actually looks like only the first word 
in a sentence is getting indexed. 
 Ex: inserting This is a test message only returns results when searching 
 for 
content:this*. searching for content:test* or content:message* does not work 
with 4.10. Only searching for content:*message* works. This leads to me to 
believe there is something wrong with behaviour of our analyzer and tokenizers 

snip

  <fields>
    <field name="content" type="ourType" stored="false" indexed="true" 
           required="false" multiValued="true"/>
  </fields>

  <fieldType name="ourType" indexed="true" class="solr.TextField">
    <analyzer class="com.zimbra.cs.index.ZimbraAnalyzer"/>
  </fieldType>
  
 Looking at the release notes from solr and lucene
 http://lucene.apache.org/solr/4_10_1/changes/Changes.html
 http://lucene.apache.org/core/4_10_1/changes/Changes.html
 Nothing really sticks out, atleast to me.  Any help to get it working with 
4.10 would be great.

The links you provided lead to zero-byte files when I try them, so I
could not look deeper.

Have you recompiled your custom analysis components against the newer
versions of the Solr/Lucene libraries?  Anytime you're dealing with
custom components, you cannot assume that a component compiled to work
with one version of Solr will work with another version.  The internal
API does change, and there is less emphasis on avoiding API breaks in
minor Solr releases than there is with Lucene, because the vast majority
of Solr users are not writing their own code that uses the Solr API. 
Recompiling against the newer libraries may cause compiler errors that
reveal places in your code that require changes.

Thanks,
Shawn
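
One hedged way to act on that advice (a diagnostic suggestion only, not part of the 
original exchange; the type name is made up): temporarily point the content field at 
a stock analysis chain in schema.xml and re-index a test document, to confirm whether 
the missing tokens come from the custom ZimbraAnalyzer or from somewhere else.

<fieldType name="content_debug" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- stock chain for comparison against the custom analyzer's output -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>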


 


Re: Strange search behaviour when upgrading to 4.10.3

2015-02-20 Thread Rishi Easwaran
Yes, the analyzers and tokenizers were recompiled with the new version of 
solr/lucene, and there were some errors; most of them were related to using 
BytesRefBuilder, which I fixed.

Can you try these links.
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java

 

 

 

-Original Message-
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Feb 20, 2015 11:52 am
Subject: Re: Strange search behaviour when upgrading to 4.10.3


On 2/20/2015 9:37 AM, Rishi Easwaran wrote:
 We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search 4.10.3 
search results are not being returned, actually looks like only the first word 
in a sentence is getting indexed. 
 Ex: inserting This is a test message only returns results when searching 
 for 
content:this*. searching for content:test* or content:message* does not work 
with 4.10. Only searching for content:*message* works. This leads to me to 
believe there is something wrong with behaviour of our analyzer and tokenizers 

snip

  <fields>
    <field name="content" type="ourType" stored="false" indexed="true" 
           required="false" multiValued="true"/>
  </fields>

  <fieldType name="ourType" indexed="true" class="solr.TextField">
    <analyzer class="com.zimbra.cs.index.ZimbraAnalyzer"/>
  </fieldType>
  
 Looking at the release notes from solr and lucene
 http://lucene.apache.org/solr/4_10_1/changes/Changes.html
 http://lucene.apache.org/core/4_10_1/changes/Changes.html
 Nothing really sticks out, atleast to me.  Any help to get it working with 
4.10 would be great.

The links you provided lead to zero-byte files when I try them, so I
could not look deeper.

Have you recompiled your custom analysis components against the newer
versions of the Solr/Lucene libraries?  Anytime you're dealing with
custom components, you cannot assume that a component compiled to work
with one version of Solr will work with another version.  The internal
API does change, and there is less emphasis on avoiding API breaks in
minor Solr releases than there is with Lucene, because the vast majority
of Solr users are not writing their own code that uses the Solr API. 
Recompiling against the newer libraries may cause compiler errors that
reveal places in your code that require changes.

Thanks,
Shawn


 


Strange search behaviour when upgrading to 4.10.3

2015-02-20 Thread Rishi Easwaran
Hi,

We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search on 4.10.3, 
search results are not being returned; it actually looks like only the first word in 
a sentence is getting indexed.
Ex: inserting "This is a test message" only returns results when searching for 
content:this*. Searching for content:test* or content:message* does not work with 
4.10. Only searching for content:*message* works. This leads me to believe there is 
something wrong with the behaviour of our analyzer and tokenizers.

A little bit of background. 

We have had our own analyzer and tokenizer since pre-Solr 1.4, and they have been 
regularly updated. The analyzer works with solr 4.6, which we have running in 
production (I also tested that search works with solr 4.9.1).
It is very similar to the tokenizers and analyzers located here.
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/
But with modifications to work with the latest solr/lucene code, e.g. overriding 
createComponents.

The schema of the filed being analyzed is as follows

<fields>
  <field name="content" type="ourType" stored="false" indexed="true" 
         required="false" multiValued="true"/>
</fields>

<fieldType name="ourType" indexed="true" class="solr.TextField">
  <analyzer class="com.zimbra.cs.index.ZimbraAnalyzer"/>
</fieldType>
 
Looking at the release notes from solr and lucene
http://lucene.apache.org/solr/4_10_1/changes/Changes.html
http://lucene.apache.org/core/4_10_1/changes/Changes.html
Nothing really sticks out, at least to me.  Any help to get it working with 4.10 
would be great.

Thanks,
Rishi.


SOLR Talk at AOL Dulles Campus.

2014-07-08 Thread Rishi Easwaran
All, 
There is a tech talk on AOL Dulles campus tomorrow. Do swing by if you can and 
share it with your colleagues and friends. 
www.meetup.com/Code-Brew/events/192361672/
There will be free food and beer served at this event :)

Thanks,
Rishi.


Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-31 Thread Rishi Easwaran
The SSD is separated into logical volumes; each instance gets 100 GB of SSD disk 
space to write its index.
If I add them all up it's ~45GB out of 1TB of SSD disk space.
Not sure I get "You should not be running more than one instance of Solr per 
machine. One instance of Solr can run multiple indexes."
Yeah, I know that; we have been running 6-8 instances of Solr using the multicore 
ability since ~2008, supporting millions of small indexes.
Now we are looking at Solr Cloud with large indexes to see if we can leverage some 
of its benefits.
As many folks have experienced, the JVM, with its stop-the-world pauses, cannot GC 
using CMS within acceptable limits with very large heaps.
To utilize the H/W to its full potential, multiple instances on a single host is 
pretty common practice for us.


 

 

 

-Original Message-
From: Shawn Heisey s...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Sun, Mar 30, 2014 5:51 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2


On 3/30/2014 2:59 PM, Rishi Easwaran wrote:
 RAM shouldn't be a problem. 
 I have a box with 144GB RAM, running 12 instances with 4GB Java heap each.
 There are 9 instances wrting to 1TB of SSD disk space. 
  Other 3 are writing to SATA drives, and have autosoftcommit disabled.

This brought up more questions than it answered.  I was assuming that
you only had a total of 4GB of index data, but after reading this, I
think my assumption may be incorrect.  If you add up all the Solr index
data on the SSD, how much disk space does it take?

You should not be running more than one instance of Solr per machine.
One instance of Solr can run multiple indexes.  Running more than one
results in quite a lot of overhead, and it seems unlikely that you would
need to dedicate 48GB of total RAM to the Java heap.

Thanks,
Shawn


 


Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-30 Thread Rishi Easwaran
RAM shouldn't be a problem. 
I have a box with 144GB RAM, running 12 instances with 4GB Java heap each.
There are 9 instances writing to 1TB of SSD disk space. 
The other 3 are writing to SATA drives, and have autoSoftCommit disabled.

 

 

-Original Message-
From: Shawn Heisey elyog...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Mar 28, 2014 8:35 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2


On 3/28/2014 4:07 PM, Rishi Easwaran wrote:
 
  Shawn,
 
 I changed the autoSoftCommit value to 15000 (15 sec). 
 My index size is pretty small ~4GB and its running on a SSD drive with ~100 
 GB 
space on it. 
 Now I see the warn message every 15 seconds.
 
 The caches I think are minimal
 
 filterCache class=solr.FastLRUCache size=512 initialSize=512 
autowarmCount=0/
 
  queryResultCache class=solr.LRUCache size=512   
   
initialSize=512 autowarmCount=0/
  documentCache class=solr.LRUCache   size=512

initialSize=512   autowarmCount=0/
 
 queryResultMaxDocsCached200/queryResultMaxDocsCached
 
 I think still something is going on. I mean 15s on SSD drives is a long time 
to handle a 4GB index.

How much RAM do you have and what size is your max java heap?

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn


 


Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-28 Thread Rishi Easwaran
Hi Dmitry,

I thought auto soft commit was for NRT search (shouldn't it be optimized for 
search performance)? If I have to wait 10 minutes, how is it NRT? Or am I missing 
something?



 Thanks,
Rishi.

 

-Original Message-
From: Dmitry Kan solrexp...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Mar 28, 2014 1:02 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2


Hi Rishi,

Do you really need soft-commit every second? Can you make it 10 mins, for
example?

What is happening (subject to checking your logs) is that several
commits (looks like 2 in your case) are arriving in quick succession.
The system then starts warming up a searcher for each commit. This
is a waste of resources, because only one searcher will be used in the end,
so the others are warming in vain.

Just rethink your commit strategy with regards to the update frequency and
warming up time to avoid issues with this in the future.

Dmitry


On Thu, Mar 27, 2014 at 11:16 PM, Rishi Easwaran rishi.easwa...@aol.comwrote:

 All,

 I am running SOLR Cloud 4.6, everything looks ok, except for this warn
 message constantly in the logs.


 2014-03-27 17:09:03,982 WARN  [commitScheduler-15-thread-1] [] SolrCore -
 [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
 2014-03-27 17:09:05,517 WARN  [commitScheduler-15-thread-1] [] SolrCore -
 [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
 2014-03-27 17:09:06,774 WARN  [commitScheduler-15-thread-1] [] SolrCore -
 [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
 2014-03-27 17:09:08,085 WARN  [commitScheduler-15-thread-1] [] SolrCore -
 [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
 2014-03-27 17:09:09,114 WARN  [commitScheduler-15-thread-1] [] SolrCore -
 [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
 2014-03-27 17:09:10,238 WARN  [commitScheduler-15-thread-1] [] SolrCore -
 [index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

 Searched around a bit, looks like my solrconfig.xml is configured fine and
 verified there are no explicit commits sent by our clients.

 My solrconfig.xml
  autoCommit
 maxDocs1/maxDocs
 maxTime6/maxTime
 openSearcherfalse/openSearcher
 /autoCommit

autoSoftCommit
  maxTime1000/maxTime
/autoSoftCommit


 Any idea why its warning every second, the only config that has 1 second
 is softcommit.

 Thanks,
 Rishi.




-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan

 


Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-28 Thread Rishi Easwaran

 Shawn,

I changed the autoSoftCommit value to 15000 (15 sec). 
My index size is pretty small, ~4GB, and it's running on an SSD drive with ~100 GB 
of space on it. 
Now I see the warn message every 15 seconds.

The caches, I think, are minimal:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512"
             autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                  autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512"
               autowarmCount="0"/>

<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

I still think something is going on. I mean, 15s on SSD drives is a long time to 
handle a 4GB index.


Thanks,
Rishi.

 

-Original Message-
From: Shawn Heisey s...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Mar 28, 2014 3:28 pm
Subject: Re: SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2


On 3/28/2014 1:03 PM, Rishi Easwaran wrote:
 I thought auto soft commit was for NRT search (shouldn't it be optimized for 
search performance), if i have to wait 10 mins how is it NRT? or am I missing 
something?

You are correct, but once a second is REALLY often.  If the rest of your 
config is not set up properly, that's far too frequent.  With commits 
happening once a second, they must complete in less than a second, and 
that can be difficult to achieve.

A typical extreme NRT config requires small (or disabled) Solr caches, 
no cache autowarming, and enough free RAM (not allocated to programs) to 
cache all of the index data on the server.  If the index is very big, it 
may not be possible to get the commit time below one second, so you may 
need to go with something like 10 to 60 seconds.
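
One more option if only some updates really need fast visibility: have the client 
ask for it per request with commitWithin instead of a blanket one-second 
autoSoftCommit. A rough SolrJ sketch (the URL, field names, and the 10-second 
window are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("content", "This is a test message");
    // Ask Solr to make this document searchable within 10 seconds,
    // without the client issuing an explicit commit.
    server.add(doc, 10000);
    server.shutdown();
  }
}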

Thanks,
Shawn


 


SOLR Cloud 4.6 - PERFORMANCE WARNING: Overlapping onDeckSearchers=2

2014-03-27 Thread Rishi Easwaran
All,

I am running SOLR Cloud 4.6, everything looks ok, except for this warn message 
constantly in the logs.


2014-03-27 17:09:03,982 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:05,517 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:06,774 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:08,085 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:09,114 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
2014-03-27 17:09:10,238 WARN  [commitScheduler-15-thread-1] [] SolrCore - 
[index_shard16_replica1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Searched around a bit, looks like my solrconfig.xml is configured fine and 
verified there are no explicit commits sent by our clients.

My solrconfig.xml 
 autoCommit
maxDocs1/maxDocs
maxTime6/maxTime
openSearcherfalse/openSearcher
/autoCommit

   autoSoftCommit
 maxTime1000/maxTime
   /autoSoftCommit


Any idea why its warning every second, the only config that has 1 second is 
softcommit.

Thanks,
Rishi.



Re: Solr Cloud Hangs consistently .

2013-06-19 Thread Rishi Easwaran
Update!!

Got SOLR Cloud working: I was able to do 90k document inserts with 
replicationFactor=2 using my jmeter script, where previously it was getting stuck 
at 3k inserts or less.
After some investigation, I figured out that the ulimits for my process were not 
being set properly; the OS defaults were kicking in, which are very small for a 
server app. One of our install scripts had changed.
I had to raise the ulimits (-n, -u, -v), and for now no other issues are seen.
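
In case it helps anyone else hitting this, a quick way to confirm the limits the 
JVM actually ended up with, from inside the process, is the sketch below. It relies 
on the com.sun.management extension, so it only works on HotSpot-style JVMs.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdLimits {
  public static void main(String[] args) {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
      com.sun.management.UnixOperatingSystemMXBean unix =
          (com.sun.management.UnixOperatingSystemMXBean) os;
      // If "max fds" comes back at the OS default (often 1024), the ulimit
      // settings never reached the Solr process.
      System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
      System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
    }
  }
}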


 

 

-Original Message-
From: Rishi Easwaran rishi.easwa...@aol.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Jun 18, 2013 10:40 am
Subject: Re: Solr Cloud Hangs consistently .


Mark,

All I am doing are inserts, afaik search side deadlocks should not be an issue.

I am using Jmeter, standard test driver we use for most of our benchmarks and 
stats collection.
My jmeter.jmx file- http://apaste.info/79IS , maybe i overlooked something

 
Is there a benchmark script that solr community uses (preferably with jmeter), 
we are write heavy so at the moment focusing on inserts only.

Thanks,

Rishi.

 

 

-Original Message-
From: Yago Riveiro yago.rive...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 6:19 pm
Subject: Re: Solr Cloud Hangs consistently .


I do all the indexing through a HTTP POST, with replicationFactor=1 no problem, 
if is higher deadlock problems can appear

A stack trace like this 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862
 

is that I get

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 17, 2013 at 11:03 PM, Mark Miller wrote:

 If it actually happens with replicationFactor=1, it doesn't likely have 
anything to do with the update handler issue I'm referring to. In some cases 
like these, people have better luck with Jetty than Tomcat - we test it much 
more. For instance, it's setup to help avoid search side distributed deadlocks.
 
 In any case, there is something special about it - I do and have seen a lot 
 of 

heavy indexing to SolrCloud by me and others without running into this. Both 
with replicationFacotor=1 and greater. So there is something specific in how 
the 

load is being done or what features/methods are being used that likely causes 
it 

or makes it easier to cause.
 
 But again, the issue I know about involves threads that are not even created 
in the replicationFactor = 1 case, so that could be a first report afaik.
 
 - Mark
 
 On Jun 17, 2013, at 5:52 PM, Rishi Easwaran rishi.easwa...@aol.com 
(mailto:rishi.easwa...@aol.com) wrote:
 
  Update!!
  
  This happens with replicationFactor=1
  Just for kicks I created a collection with a 24 shards, replicationfactor=1 
cluster on my exisiting benchmark env.
  Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu 
most metrics looks fine.
  Only indication seems to be netstat showing incoming request not being read 
in.
  
  Yago,
  
  I saw your previous post 
  (http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
  Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets 
fixed, but no luck.
  Looks like this is a dominant and easily reproducible issue on SOLR cloud.
  
  
  Thanks,
  
  Rishi. 
  
  
  
  
  
  
  
  
  
  
  
  -Original Message-
  From: Yago Riveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com)
  To: solr-user solr-user@lucene.apache.org 
  (mailto:solr-user@lucene.apache.org)
  Sent: Mon, Jun 17, 2013 5:15 pm
  Subject: Re: Solr Cloud Hangs consistently .
  
  
  I can confirm that the deadlock happen with only 2 replicas by shard. I 
  need 


  shutdown one node that host a replica of the shard to recover the 
  indexation 


  capability.
  
  -- 
  Yago Riveiro
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
  On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:
  
   
   
   Hi All,
   
   I am trying to benchmark SOLR Cloud and it consistently hangs. 
   Nothing in the logs, no stack trace, no errors, no warnings, just seems 
stuck.
   
   A little bit about my set up. 
   I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each 
host 
   
  
  is configured to have 8 SOLR cloud nodes running at 4GB each.
   JVM configs: http://apaste.info/57Ai
   
   My cluster has 12 shards with replication factor 2- 
   http://apaste.info/09sA
   
   I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
  running this configuration in production in Non-Cloud form. 
   It got stuck repeatedly.
   
   I decided to upgrade to the latest and greatest of everything, SOLR 4.3, 
JDK7 
  and tomcat7. 
   It still shows same behaviour and hangs through the test.
   
   My test schema and config.
   Schema.xml - http://apaste.info/imah
   SolrConfig.xml - http://apaste.info/ku4F
   
   The test is pretty simple. its a jmeter test with update command via SOAP 
rpc 
  (round robin

Re: Solr Cloud Hangs consistently .

2013-06-18 Thread Rishi Easwaran
Mark,

All I am doing is inserts; AFAIK, search-side deadlocks should not be an issue.

I am using jmeter, the standard test driver we use for most of our benchmarks and 
stats collection.
My jmeter.jmx file: http://apaste.info/79IS (maybe I overlooked something).

 
Is there a benchmark script that the Solr community uses (preferably with jmeter)? 
We are write heavy, so at the moment we are focusing on inserts only.
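
For reference, the SolrJ equivalent of what the jmeter test does is roughly the 
sketch below (the ZooKeeper host string is a placeholder, the field names are the 
ones from my test schema, and the batch size is arbitrary):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class InsertDriver {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("testCloud");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 9000; i++) {
      String guid = "guid" + (i % 150);
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("compositeID", guid + "!" + i);   // keeps all of a guid's docs on one shard
      doc.addField("guid", guid);
      doc.addField("id", String.valueOf(i));
      doc.addField("subject", "subject " + i);
      doc.addField("body", "body text " + i);
      batch.add(doc);
      if (batch.size() == 100) { server.add(batch); batch.clear(); }
    }
    if (!batch.isEmpty()) { server.add(batch); }
    server.commit();
    server.shutdown();
  }
}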

Thanks,

Rishi.

 

 

-Original Message-
From: Yago Riveiro yago.rive...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 6:19 pm
Subject: Re: Solr Cloud Hangs consistently .


I do all the indexing through HTTP POST; with replicationFactor=1 there is no 
problem, but if it is higher, deadlock problems can appear.

A stack trace like this 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862
is what I get.

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 17, 2013 at 11:03 PM, Mark Miller wrote:

 If it actually happens with replicationFactor=1, it doesn't likely have 
anything to do with the update handler issue I'm referring to. In some cases 
like these, people have better luck with Jetty than Tomcat - we test it much 
more. For instance, it's setup to help avoid search side distributed deadlocks.
 
 In any case, there is something special about it - I do and have seen a lot 
 of 
heavy indexing to SolrCloud by me and others without running into this. Both 
with replicationFacotor=1 and greater. So there is something specific in how 
the 
load is being done or what features/methods are being used that likely causes 
it 
or makes it easier to cause.
 
 But again, the issue I know about involves threads that are not even created 
in the replicationFactor = 1 case, so that could be a first report afaik.
 
 - Mark
 
 On Jun 17, 2013, at 5:52 PM, Rishi Easwaran rishi.easwa...@aol.com 
(mailto:rishi.easwa...@aol.com) wrote:
 
  Update!!
  
  This happens with replicationFactor=1
  Just for kicks I created a collection with a 24 shards, replicationfactor=1 
cluster on my exisiting benchmark env.
  Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu 
most metrics looks fine.
  Only indication seems to be netstat showing incoming request not being read 
in.
  
  Yago,
  
  I saw your previous post 
  (http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
  Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets 
fixed, but no luck.
  Looks like this is a dominant and easily reproducible issue on SOLR cloud.
  
  
  Thanks,
  
  Rishi. 
  
  
  
  
  
  
  
  
  
  
  
  -Original Message-
  From: Yago Riveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com)
  To: solr-user solr-user@lucene.apache.org 
  (mailto:solr-user@lucene.apache.org)
  Sent: Mon, Jun 17, 2013 5:15 pm
  Subject: Re: Solr Cloud Hangs consistently .
  
  
  I can confirm that the deadlock happen with only 2 replicas by shard. I 
  need 

  shutdown one node that host a replica of the shard to recover the 
  indexation 

  capability.
  
  -- 
  Yago Riveiro
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
  On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:
  
   
   
   Hi All,
   
   I am trying to benchmark SOLR Cloud and it consistently hangs. 
   Nothing in the logs, no stack trace, no errors, no warnings, just seems 
stuck.
   
   A little bit about my set up. 
   I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each 
host 
   
  
  is configured to have 8 SOLR cloud nodes running at 4GB each.
   JVM configs: http://apaste.info/57Ai
   
   My cluster has 12 shards with replication factor 2- 
   http://apaste.info/09sA
   
   I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
  running this configuration in production in Non-Cloud form. 
   It got stuck repeatedly.
   
   I decided to upgrade to the latest and greatest of everything, SOLR 4.3, 
JDK7 
  and tomcat7. 
   It still shows same behaviour and hangs through the test.
   
   My test schema and config.
   Schema.xml - http://apaste.info/imah
   SolrConfig.xml - http://apaste.info/ku4F
   
   The test is pretty simple. its a jmeter test with update command via SOAP 
rpc 
  (round robin request across every node), adding in 5 fields from a csv file 
- 
  id, guid, subject, body, compositeID (guid!id).
   number of jmeter threads = 150. loop count = 20, num of messages to 
add/per 
  
  guid = 3; total 150*3*20 = 9000 documents. 
   
   When cloud gets stuck, i don't get anything in the logs, but when i run 
  netstat i see the following.
   Sample netstat on a stuck run. http://apaste.info/hr0O 
   hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
   
   At the moment my benchmarking efforts are at a stand still.
   
   Any help from the community would be great, I got some heap dumps and 
stack 
  dumps, but haven't

Re: SOLR Cloud - Disable Transaction Logs

2013-06-18 Thread Rishi Easwaran
SolrJ already has access to the ZooKeeper cluster state, and the network I/O 
bottleneck can be mitigated by issuing requests in parallel. 
You are only as slow as your slowest responding server, which with the current 
setup could be your single leader.

Wouldn't this lessen the burden on the leader, since it would no longer have to 
maintain transaction logs or distribute updates to replicas?
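
To be concrete about what the client can already see, the stock SolrJ cloud client 
exposes the live cluster state straight from ZooKeeper. A rough 4.x sketch (the zk 
host string and collection name are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Slice;

public class ShowClusterState {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    server.connect();  // force the ZooKeeper connection so cluster state is populated
    ClusterState state = server.getZkStateReader().getClusterState();
    for (Slice slice : state.getSlices("collection1")) {
      System.out.println(slice.getName() + " -> leader: " + slice.getLeader().getName());
    }
    server.shutdown();
  }
}

The parallel write-to-two-replicas idea above is the hypothetical part; 
CloudSolrServer does not do that today.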

 

 

 

-Original Message-
From: Shalin Shekhar Mangar shalinman...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Jun 18, 2013 2:05 am
Subject: Re: SOLR Cloud - Disable Transaction Logs


Yes, but at what cost? You are thinking of replacing disk IO with even
slower network IO. The transaction log is an append-only log -- it is
pretty cheap, especially if you compare it with the indexing process.
Plus your write requests/sec will drop a lot once you start doing
synchronous replication.


On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran rishi.easwa...@aol.comwrote:

 Shalin,

 Just some thoughts.

 Near Real time replication- don't we use solrCmdDistributor, which send
 requests immediately to replicas with a clonedRequest, as an option can't
 we achieve something similar form CloudSolrserver in Solrj instead of
 leader doing it. As long as 2 nodes receive writes and acknowledge.
 durability should be high.
 Peer-Sync and Recovery - Can we achieve that merging indexes from leader
 as needed, instead of replaying the transaction logs?

 Rishi.







 -Original Message-
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Mon, Jun 17, 2013 3:43 pm
 Subject: Re: SOLR Cloud - Disable Transaction Logs


 It is also necessary for near real-time replication, peer sync and
 recovery.


 On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran rishi.easwa...@aol.com
 wrote:

  Hi,
 
  Is there a way to disable transaction logs in SOLR cloud. As far as I can
  tell no.
  Just curious why do we need transaction logs, seems like an I/O intensive
  operation.
  As long as I have replicatonFactor 1, if a node (leader) goes down, the
  replica can take over and maintain a durable state of my index.
 
  I understand from the previous discussions, that it was intended for
  update durability and realtime get.
  But, unless I am missing something an ability to disable it in SOLR cloud
  if not needed would be good.
 
  Thanks,
 
  Rishi.
 
 


 --
 Regards,
 Shalin Shekhar Mangar.





-- 
Regards,
Shalin Shekhar Mangar.

 


Re: SOLR Cloud - Disable Transaction Logs

2013-06-18 Thread Rishi Easwaran

Erick,

We at AOL Mail have been using SOLR for quite a while; our system is pretty 
write heavy, and disk I/O is one of our bottlenecks. At present we use regular 
SOLR in the lotsOfCores configuration, and I am in the process of benchmarking 
SOLR Cloud for our use case. I don't have concrete data that tLogs are placing a 
lot of load on the system, but for a large-scale system like ours even minimal 
load gets magnified.

From the Cloud design, in a properly set up cluster you usually have 
replicas in different availability zones. The probability of losing more than 1 
availability zone at any given time should be pretty low. Why have tLogs if 
all replicas get the request on an update anyway? In theory, 1 replica must be 
able to commit eventually.

NRT is an optional feature and probably not tied to Cloud, correct?


Thanks,

Rishi.



 

 

-Original Message-
From: Erick Erickson erickerick...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Jun 18, 2013 4:07 pm
Subject: Re: SOLR Cloud - Disable Transaction Logs


bq: the replica can take over and maintain a durable
state of my index

This is not true. On an update, all the nodes in a slice
have already written the data to the tlog, not just the
leader. So if a leader goes down, the replicas have
enough local info to insure that data is not lost. Without
tlogs this would not be true since documents are not
durably saved until a hard commit.

tlogs save data between hard commits. As Yonik
explained to me once, soft commits are about
visibility, hard commits are about durability and
tlogs fill up the gap between hard commits.

So to reinforce Shalin's comment: yes, you can disable tlogs if
1) you don't want any of SolrCloud's HA/DR capabilities
2) NRT is unimportant

IOW if you're using 4.x just like you would 3.x in terms
of replication, HA/DR, etc. This is perfectly reasonable,
but don't get hung up on disabling tlogs.

And you haven't told us _why_ you want to do this. They
don't consume much memory or disk space unless you
have configured your hard commits (with openSearcher
true or false) to be quite long. Do you have any proof at
all that the tlogs are placing enough load on the system
to go down this road?

Best
Erick

On Tue, Jun 18, 2013 at 10:49 AM, Rishi Easwaran rishi.easwa...@aol.com wrote:
 SolrJ already has access to zookeeper cluster state. Network I/O bottleneck 
can be avoided by parallel requests.
 You are only as slow as your slowest responding server, which could be your 
single leader with the current set up.

 Wouldn't this lessen the burden of the leader, as he does not have to 
 maintain 
transaction logs or distribute to replicas?







 -Original Message-
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Jun 18, 2013 2:05 am
 Subject: Re: SOLR Cloud - Disable Transaction Logs


 Yes, but at what cost? You are thinking of replacing disk IO with even more
 slower network IO. The transaction log is a append-only log -- it is not
 pretty cheap especially so if you compare it with the indexing process.
 Plus your write request/sec will drop a lot once you start doing
 synchronous replication.


 On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran rishi.easwa...@aol.comwrote:

 Shalin,

 Just some thoughts.

 Near Real time replication- don't we use solrCmdDistributor, which send
 requests immediately to replicas with a clonedRequest, as an option can't
 we achieve something similar form CloudSolrserver in Solrj instead of
 leader doing it. As long as 2 nodes receive writes and acknowledge.
 durability should be high.
 Peer-Sync and Recovery - Can we achieve that merging indexes from leader
 as needed, instead of replaying the transaction logs?

 Rishi.







 -Original Message-
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Mon, Jun 17, 2013 3:43 pm
 Subject: Re: SOLR Cloud - Disable Transaction Logs


 It is also necessary for near real-time replication, peer sync and
 recovery.


 On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran rishi.easwa...@aol.com
 wrote:

  Hi,
 
  Is there a way to disable transaction logs in SOLR cloud. As far as I can
  tell no.
  Just curious why do we need transaction logs, seems like an I/O intensive
  operation.
  As long as I have replicatonFactor 1, if a node (leader) goes down, the
  replica can take over and maintain a durable state of my index.
 
  I understand from the previous discussions, that it was intended for
  update durability and realtime get.
  But, unless I am missing something an ability to disable it in SOLR cloud
  if not needed would be good.
 
  Thanks,
 
  Rishi.
 
 


 --
 Regards,
 Shalin Shekhar Mangar.





 --
 Regards,
 Shalin Shekhar Mangar.



 



Solr Cloud Hangs consistently .

2013-06-17 Thread Rishi Easwaran


Hi All,

I am trying to benchmark SOLR Cloud and it consistently hangs. 
Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.

A little bit about my set up. 
I have 3 benchmark hosts, each with 96GB RAM, 24 CPUs and a 1TB SSD. Each host 
is configured to run 8 SOLR Cloud nodes at 4GB heap each.
JVM configs: http://apaste.info/57Ai

My cluster has 12 shards with replication factor 2- http://apaste.info/09sA

I originally started with SOLR 4.2, Tomcat 5 and JDK 6, as we are already 
running this configuration in production in non-Cloud form. 
It got stuck repeatedly.

I decided to upgrade to the latest and greatest of everything: SOLR 4.3, JDK 7 
and Tomcat 7. 
It still shows the same behaviour and hangs during the test.

My test schema and config.
Schema.xml - http://apaste.info/imah
SolrConfig.xml - http://apaste.info/ku4F

The test is pretty simple: it's a jmeter test issuing update commands via SOAP RPC 
(round-robin requests across every node), adding 5 fields from a csv file: 
id, guid, subject, body, compositeID (guid!id).
Number of jmeter threads = 150, loop count = 20, number of messages to add per 
guid = 3; total 150*3*20 = 9000 documents.

When the cloud gets stuck, I don't get anything in the logs, but when I run 
netstat I see the following.
Sample netstat on a stuck run. http://apaste.info/hr0O 
hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.

 
At the moment my benchmarking efforts are at a stand still.

Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
If I can provide anything else to diagnose this issue. just let me know.

Thanks,

Rishi.










Re: Solr Cloud Hangs consistently .

2013-06-17 Thread Rishi Easwaran
Mark,

I got a few stack dumps of the instance that was stuck ssdtest-d03:8011

http://apaste.info/cofK
http://apaste.info/sv4M
http://apaste.info/cxUf

 


 I can get dumps of others if needed.

Thanks,

Rishi.

 

-Original Message-
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 1:57 pm
Subject: Re: Solr Cloud Hangs consistently .


Could you give a simple stack trace dump as well?

It's likely the distributed update deadlock that has been reported a few times 
now - I think usually with a replication factor greater than 2, but I can't be 
sure. The deadlock involves sending docs concurrently to replicas and I 
wouldn't 
have expected it to be so easily hit with only 2 replicas per shard. I should 
be 
able to tell from a stack trace though.

If it is that, it's on my short list to investigate (been there a long time now 
though - but I still hope to look at it soon).

- Mark

On Jun 17, 2013, at 1:44 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 
 
 Hi All,
 
 I am trying to benchmark SOLR Cloud and it consistently hangs. 
 Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.
 
 A little bit about my set up. 
 I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host 
is configured to have 8 SOLR cloud nodes running at 4GB each.
 JVM configs: http://apaste.info/57Ai
 
 My cluster has 12 shards with replication factor 2- http://apaste.info/09sA
 
 I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
running this configuration in production in Non-Cloud form. 
 It got stuck repeatedly.
 
 I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 
and tomcat7. 
 It still shows same behaviour and hangs through the test.
 
 My test schema and config.
 Schema.xml - http://apaste.info/imah
 SolrConfig.xml - http://apaste.info/ku4F
 
 The test is pretty simple. its a jmeter test with update command via SOAP rpc 
(round robin request across every node), adding in 5 fields from a csv file - 
id, guid, subject, body, compositeID (guid!id).
 number of jmeter threads = 150. loop count = 20, num of messages to add/per 
guid = 3; total 150*3*20 = 9000 documents.  
 
 When cloud gets stuck, i don't get anything in the logs, but when i run 
netstat i see the following.
 Sample netstat on a stuck run. http://apaste.info/hr0O 
 hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
 
 
 At the moment my benchmarking efforts are at a stand still.
 
 Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
 If I can provide anything else to diagnose this issue. just let me know.
 
 Thanks,
 
 Rishi.
 
 
 
 
 
 
 
 


 


Re: Solr Cloud Hangs consistently .

2013-06-17 Thread Rishi Easwaran
FYI, you can ignore the http4ClientExpiryService thread in the stack dumps.
It's a dummy executor service I created to test out something unrelated to 
this issue.
 

 

 

-Original Message-
From: Rishi Easwaran rishi.easwa...@aol.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 2:54 pm
Subject: Re: Solr Cloud Hangs consistently .


Mark,

I got a few stack dumps of the instance that was stuck ssdtest-d03:8011

http://apaste.info/cofK
http://apaste.info/sv4M
http://apaste.info/cxUf

 


 I can get dumps of others if needed.

Thanks,

Rishi.

 

-Original Message-
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 1:57 pm
Subject: Re: Solr Cloud Hangs consistently .


Could you give a simple stack trace dump as well?

It's likely the distributed update deadlock that has been reported a few times 
now - I think usually with a replication factor greater than 2, but I can't be 
sure. The deadlock involves sending docs concurrently to replicas and I 
wouldn't 

have expected it to be so easily hit with only 2 replicas per shard. I should 
be 

able to tell from a stack trace though.

If it is that, it's on my short list to investigate (been there a long time now 
though - but I still hope to look at it soon).

- Mark

On Jun 17, 2013, at 1:44 PM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 
 
 Hi All,
 
 I am trying to benchmark SOLR Cloud and it consistently hangs. 
 Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.
 
 A little bit about my set up. 
 I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host 
is configured to have 8 SOLR cloud nodes running at 4GB each.
 JVM configs: http://apaste.info/57Ai
 
 My cluster has 12 shards with replication factor 2- http://apaste.info/09sA
 
 I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
running this configuration in production in Non-Cloud form. 
 It got stuck repeatedly.
 
 I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 
and tomcat7. 
 It still shows same behaviour and hangs through the test.
 
 My test schema and config.
 Schema.xml - http://apaste.info/imah
 SolrConfig.xml - http://apaste.info/ku4F
 
 The test is pretty simple. its a jmeter test with update command via SOAP rpc 
(round robin request across every node), adding in 5 fields from a csv file - 
id, guid, subject, body, compositeID (guid!id).
 number of jmeter threads = 150. loop count = 20, num of messages to add/per 
guid = 3; total 150*3*20 = 9000 documents.  
 
 When cloud gets stuck, i don't get anything in the logs, but when i run 
netstat i see the following.
 Sample netstat on a stuck run. http://apaste.info/hr0O 
 hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
 
 
 At the moment my benchmarking efforts are at a stand still.
 
 Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
 If I can provide anything else to diagnose this issue. just let me know.
 
 Thanks,
 
 Rishi.
 
 
 
 
 
 
 
 


 

 


SOLR Cloud - Disable Transaction Logs

2013-06-17 Thread Rishi Easwaran
Hi,

Is there a way to disable transaction logs in SOLR Cloud? As far as I can tell, 
no.
Just curious why we need transaction logs; they seem like an I/O-intensive 
operation.
As long as I have replicationFactor > 1, if a node (the leader) goes down, a 
replica can take over and maintain a durable state of my index.

I understand from previous discussions that it was intended for update 
durability and realtime get.
But, unless I am missing something, an ability to disable it in SOLR Cloud when 
not needed would be good.

Thanks,

Rishi.  



Re: SOLR Cloud - Disable Transaction Logs

2013-06-17 Thread Rishi Easwaran
Shalin,
 
Just some thoughts.

Near-real-time replication: don't we use SolrCmdDistributor, which sends 
requests immediately to replicas as a cloned request? As an option, can't we 
achieve something similar from CloudSolrServer in SolrJ instead of the leader 
doing it? As long as 2 nodes receive and acknowledge the writes, durability 
should be high.
Peer sync and recovery: can we achieve that by merging indexes from the leader as 
needed, instead of replaying the transaction logs?

Rishi.

 

 

 

-Original Message-
From: Shalin Shekhar Mangar shalinman...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 3:43 pm
Subject: Re: SOLR Cloud - Disable Transaction Logs


It is also necessary for near real-time replication, peer sync and recovery.


On Tue, Jun 18, 2013 at 1:04 AM, Rishi Easwaran rishi.easwa...@aol.comwrote:

 Hi,

 Is there a way to disable transaction logs in SOLR cloud. As far as I can
 tell no.
 Just curious why do we need transaction logs, seems like an I/O intensive
 operation.
 As long as I have replicatonFactor 1, if a node (leader) goes down, the
 replica can take over and maintain a durable state of my index.

 I understand from the previous discussions, that it was intended for
 update durability and realtime get.
 But, unless I am missing something an ability to disable it in SOLR cloud
 if not needed would be good.

 Thanks,

 Rishi.




-- 
Regards,
Shalin Shekhar Mangar.

 


Spread the word - Opening at AOL Mail Team in Dulles VA

2013-06-17 Thread Rishi Easwaran
Hi All,

With the economy the way it is and many folks still looking, I figured this is as 
good a place as any to publish this. 

Just today, we got an opening for a mid-to-senior-level Software Engineer on our 
team.
Experience with SOLR is a big plus.
Feel free to have a look at the position.
http://www.linkedin.com/jobs?viewJob=jobId=6073910

If interested, send your current resume to rishi.easwa...@aol.com.
I will take it to my Director.   

This position is in Dulles, VA.

Thanks,

Rishi.


Re: Solr Cloud Hangs consistently .

2013-06-17 Thread Rishi Easwaran
Update!!

This happens with replicationFactor=1.
Just for kicks I created a collection with 24 shards and replicationFactor=1 on my 
existing benchmark env.
Same behaviour: SOLR Cloud just hangs. Nothing in the logs, and top/heap/cpu and 
most other metrics look fine.
The only indication seems to be netstat showing incoming requests not being read in.
 
Yago,

I saw your previous post 
(http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631)
Following it, last week I upgraded to SOLR 4.3 to see if the issue gets 
fixed, but no luck.
This looks like a prevalent and easily reproducible issue with SOLR Cloud.


Thanks,

Rishi. 





 

 

 

-Original Message-
From: Yago Riveiro yago.rive...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jun 17, 2013 5:15 pm
Subject: Re: Solr Cloud Hangs consistently .


I can confirm that the deadlock happens with only 2 replicas per shard. I need to 
shut down one node that hosts a replica of the shard to recover indexing 
capability.

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote:

 
 
 Hi All,
 
 I am trying to benchmark SOLR Cloud and it consistently hangs. 
 Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck.
 
 A little bit about my set up. 
 I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host 
is configured to have 8 SOLR cloud nodes running at 4GB each.
 JVM configs: http://apaste.info/57Ai
 
 My cluster has 12 shards with replication factor 2- http://apaste.info/09sA
 
 I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already 
running this configuration in production in Non-Cloud form. 
 It got stuck repeatedly.
 
 I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 
and tomcat7. 
 It still shows same behaviour and hangs through the test.
 
 My test schema and config.
 Schema.xml - http://apaste.info/imah
 SolrConfig.xml - http://apaste.info/ku4F
 
 The test is pretty simple. its a jmeter test with update command via SOAP rpc 
(round robin request across every node), adding in 5 fields from a csv file - 
id, guid, subject, body, compositeID (guid!id).
 number of jmeter threads = 150. loop count = 20, num of messages to add/per 
guid = 3; total 150*3*20 = 9000 documents. 
 
 When cloud gets stuck, i don't get anything in the logs, but when i run 
netstat i see the following.
 Sample netstat on a stuck run. http://apaste.info/hr0O 
 hycl-d20 is my jmeter host. ssd-d01/2/3 are my cloud hosts.
 
 At the moment my benchmarking efforts are at a stand still.
 
 Any help from the community would be great, I got some heap dumps and stack 
dumps, but haven't found a smoking gun yet.
 If I can provide anything else to diagnose this issue. just let me know.
 
 Thanks,
 
 Rishi. 


 


Re: shardkey

2013-06-12 Thread Rishi Easwaran
From my understanding:
In SOLR Cloud the CompositeIdDocRouter uses a hash-based doc router.
The compositeId router is the default if numShards > 1 at collection creation.
The compositeId router generates a hash from the uniqueKey defined in your 
schema.xml to route each document to a dedicated shard.

You can use select?q=xyz&shard.keys=uniquekey to focus your search so it hits only 
the shard that has your shard key.
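
In SolrJ that looks roughly like the sketch below (the URL, the user-1 prefix and 
the query string are made up; the shard.keys value just has to match the prefix 
used in the document ids):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ShardKeyQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    // Documents indexed with ids like "user-1!doc-42" are hashed on the "user-1"
    // prefix, so the same prefix confines the query to the owning shard.
    SolrQuery q = new SolrQuery("xyz");
    q.set("shard.keys", "user-1!");
    System.out.println(server.query(q).getResults().getNumFound() + " hits");
    server.shutdown();
  }
}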

 

 Thanks,

Rishi.

 

-Original Message-
From: Joshi, Shital shital.jo...@gs.com
To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org
Sent: Wed, Jun 12, 2013 10:01 am
Subject: shardkey


Hi,

We are using Solr 4.3.0 SolrCloud (5 shards, 10 replicas). I have couple 
questions on shard key. 

1. Looking at the admin GUI, how do I know which field is being used 
for shard 
key.
2. What is the default shard key used?
3. How do I override the default shard key?

Thanks. 

 


Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Hi All,

Historically we have used a single field in our schema as a uniqueKey.

<field name="docid"  type="string" indexed="true" stored="true"
       multiValued="false" required="true"/>
<field name="userid" type="string" indexed="true" stored="true"
       multiValued="false" required="true"/>
<uniqueKey>docid</uniqueKey>

We wanted to change this to a composite key, something like 
<uniqueKey>userid-docid</uniqueKey>.
I know I can auto-generate the composite key at document insert time, using custom 
code to generate a new field, but I wanted to know whether there is an inbuilt SOLR 
mechanism for doing this, which would save us from creating and storing an 
extra field.
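
(The custom client-side code I mean is trivial; a SolrJ sketch, with the URL as a 
placeholder and the compositeId field name just an example:)

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ClientSideCompositeKey {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    String userid = "12345";
    String docid = "1";
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("userid", userid);
    doc.addField("docid", docid);
    // The extra field we would rather not have to create and store:
    doc.addField("compositeId", userid + "-" + docid);
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}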

Thanks,

Rishi.






Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Thanks Jack, looks like that will do the trick from me. I will try it out. 

 

 

 

-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:

<updateRequestProcessorChain name="composite-id">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">docid_s</str>
    <str name="source">userid_s</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter">--</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Add documents such as:

curl "http://localhost:8983/solr/update?commit=true&update.chain=composite-id" \
-H 'Content-type:application/json' -d '
[{"title": "Hello World",
  "docid_s": "doc-1",
  "userid_s": "user-1",
  "comments_ss": ["Easy", "Fast"]}]'

And get results like:

"title":["Hello World"],
"docid_s":"doc-1",
"userid_s":"user-1",
"comments_ss":["Easy",
  "Fast"],
"id":"doc-1--user-1",

Add as many fields in whatever order you want using source in the clone 
update processor, and pick your composite key field name as well. And set 
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid, 
docid).

I used the standard Solr example schema, so I used dynamic fields for the 
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  field name=docidtype=string   indexed=true  stored=true 
multiValued=false required=true/
  field name=userid  type=string   indexed=true  stored=true 
multiValued=false required=true/
uniqueKeydocid/uniqueKey

Wanted to change this to a composite key something like 
uniqueKeyuserid-docid/uniqueKey.
I know I can auto generate compositekey at document insert time, using 
custom code to generate a new field, but wanted to know if there was an 
inbuilt SOLR mechanism of doing this. That would prevent us from creating 
and storing an extra field.

Thanks,

Rishi.





 


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Jack,

Not sure if this is the correct behaviour.
I set up the updateRequestProcessor chain as mentioned below, but it looks like the 
compositeId that is generated depends on the input field order.

For example, if my input comes in as

<field name="docid">1</field>
<field name="userid">12345</field>

I get the compositeId 1-12345. 

If I reverse the input to

<field name="userid">12345</field>
<field name="docid">1</field>

I get the compositeId 12345-1.

In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:

updateRequestProcessorChain name=composite-id
  processor class=solr.CloneFieldUpdateProcessorFactory
str name=sourcedocid_s/str
str name=sourceuserid_s/str
str name=destid/str
  /processor
  processor class=solr.ConcatFieldUpdateProcessorFactory
str name=fieldNameid/str
str name=delimiter--/str
  /processor
  processor class=solr.LogUpdateProcessorFactory /
  processor class=solr.RunUpdateProcessorFactory /
/updateRequestProcessorChain

Add documents such as:

curl 
http://localhost:8983/solr/update?commit=trueupdate.chain=composite-id; \
-H 'Content-type:application/json' -d '
[{title: Hello World,
  docid_s: doc-1,
  userid_s: user-1,
  comments_ss: [Easy, Fast]}]'

And get results like:

title:[Hello World],
docid_s:doc-1,
userid_s:user-1,
comments_ss:[Easy,
  Fast],
id:doc-1--user-1,

Add as many fields in whatever order you want using source in the clone 
update processor, and pick your composite key field name as well. And set 
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid, 
docid).

I used the standard Solr example schema, so I used dynamic fields for the 
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  field name=docidtype=string   indexed=true  stored=true 
multiValued=false required=true/
  field name=userid  type=string   indexed=true  stored=true 
multiValued=false required=true/
uniqueKeydocid/uniqueKey

Wanted to change this to a composite key something like 
uniqueKeyuserid-docid/uniqueKey.
I know I can auto generate compositekey at document insert time, using 
custom code to generate a new field, but wanted to know if there was an 
inbuilt SOLR mechanism of doing this. That would prevent us from creating 
and storing an extra field.

Thanks,

Rishi.





 


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
I thought the same, but that doesn't seem to be the case.


 

 

 

-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field 
names in the processor configuration:

str name=sourcedocid_s/str
str name=sourceuserid_s/str

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the 
compositeId that is generated is based on input order.

For example:
If my input comes in as
field name=docid1/field
field name=userid12345/field

I get the following compositeId1-12345.

If I reverse the input

field name=userid12345/field

field name=docid1/field
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:

updateRequestProcessorChain name=composite-id
  processor class=solr.CloneFieldUpdateProcessorFactory
str name=sourcedocid_s/str
str name=sourceuserid_s/str
str name=destid/str
  /processor
  processor class=solr.ConcatFieldUpdateProcessorFactory
str name=fieldNameid/str
str name=delimiter--/str
  /processor
  processor class=solr.LogUpdateProcessorFactory /
  processor class=solr.RunUpdateProcessorFactory /
/updateRequestProcessorChain

Add documents such as:

curl
http://localhost:8983/solr/update?commit=trueupdate.chain=composite-id; \
-H 'Content-type:application/json' -d '
[{title: Hello World,
  docid_s: doc-1,
  userid_s: user-1,
  comments_ss: [Easy, Fast]}]'

And get results like:

title:[Hello World],
docid_s:doc-1,
userid_s:user-1,
comments_ss:[Easy,
  Fast],
id:doc-1--user-1,

Add as many fields in whatever order you want using source in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 11:12 AM
To: solr-user@lucene.apache.org
Subject: Solr Composite Unique key from existing fields in schema

Hi All,

Historically we have used a single field in our schema as a uniqueKey.

  field name=docidtype=string   indexed=true  stored=true
multiValued=false required=true/
  field name=userid  type=string   indexed=true  stored=true
multiValued=false required=true/
uniqueKeydocid/uniqueKey

Wanted to change this to a composite key something like
uniqueKeyuserid-docid/uniqueKey.
I know I can auto generate compositekey at document insert time, using
custom code to generate a new field, but wanted to know if there was an
inbuilt SOLR mechanism of doing this. That would prevent us from creating
and storing an extra field.

Thanks,

Rishi.







 


Re: Solr Composite Unique key from existing fields in schema

2013-05-28 Thread Rishi Easwaran
Thanks Jack, That fixed it and guarantees the order.

As far as I can tell SOLR cloud 4.2.1 needs a uniquekey defined in its schema, 
or I get an exception.
SolrCore Initialization Failures
 * testCloud2_shard1_replica1: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
QueryElevationComponent requires the schema to have a uniqueKeyField. 

Now that I have an autogenerated composite-id, it has to become a part of my 
schema as uniquekey for SOLR cloud to work. 
<field name="docid"       type="string" indexed="true" stored="true"
       multiValued="false" required="true"/>
<field name="userid"      type="string" indexed="true" stored="true"
       multiValued="false" required="true"/>
<field name="compositeId" type="string" indexed="true" stored="true"
       multiValued="false" required="true"/>
<uniqueKey>compositeId</uniqueKey>

Is there a way to avoid having the compositeId field defined in my schema.xml? I 
would like to avoid the overhead of storing this field in my index.

Thanks,

Rishi.


 

 

 

-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 28, 2013 4:33 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The TL;DR response: Try this:

<updateRequestProcessorChain name="composite-id">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">userid_s</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">docid_s</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter">--</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

That will assure that the userid gets processed before the docid.

I'll have to review the contract for CloneFieldUpdateProcessorFactory to see 
what is or ain't guaranteed when there are multiple input fields - whether 
this is a bug or a feature or simply undefined.

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 3:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

I thought the same, but that doesn't seem to be the case.








-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 28, 2013 3:32 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


The order in the ID should be purely dependent on the order of the field
names in the processor configuration:

str name=sourcedocid_s/str
str name=sourceuserid_s/str

-- Jack Krupansky

-Original Message- 
From: Rishi Easwaran
Sent: Tuesday, May 28, 2013 2:54 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Composite Unique key from existing fields in schema

Jack,

No sure if this is the correct behaviour.
I set up updateRequestorPorcess chain as mentioned below, but looks like the
compositeId that is generated is based on input order.

For example:
If my input comes in as
field name=docid1/field
field name=userid12345/field

I get the following compositeId1-12345.

If I reverse the input

field name=userid12345/field

field name=docid1/field
I get the following compositeId 12345-1 .


In this case the compositeId is not unique and I am getting duplicates.

Thanks,

Rishi.



-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 28, 2013 12:07 pm
Subject: Re: Solr Composite Unique key from existing fields in schema


You can do this by combining the builtin update processors.

Add this to your solrconfig:

updateRequestProcessorChain name=composite-id
  processor class=solr.CloneFieldUpdateProcessorFactory
str name=sourcedocid_s/str
str name=sourceuserid_s/str
str name=destid/str
  /processor
  processor class=solr.ConcatFieldUpdateProcessorFactory
str name=fieldNameid/str
str name=delimiter--/str
  /processor
  processor class=solr.LogUpdateProcessorFactory /
  processor class=solr.RunUpdateProcessorFactory /
/updateRequestProcessorChain

Add documents such as:

curl
http://localhost:8983/solr/update?commit=trueupdate.chain=composite-id; \
-H 'Content-type:application/json' -d '
[{title: Hello World,
  docid_s: doc-1,
  userid_s: user-1,
  comments_ss: [Easy, Fast]}]'

And get results like:

title:[Hello World],
docid_s:doc-1,
userid_s:user-1,
comments_ss:[Easy,
  Fast],
id:doc-1--user-1,

Add as many fields in whatever order you want using source in the clone
update processor, and pick your composite key field name as well. And set
the delimiter string as well in the concat update processor.

I managed to reverse the field order from what you requested (userid,
docid).

I used the standard Solr example schema, so I used dynamic fields for the
two ids, but use your own field names

Re: Upgrading from SOLR 3.5 to 4.2.1 Results.

2013-05-20 Thread Rishi Easwaran
Sure Shalin, hopefully soon.
 

 

 

-Original Message-
From: Shalin Shekhar Mangar shalinman...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Sat, May 18, 2013 11:35 pm
Subject: Re: Upgrading from SOLR 3.5 to 4.2.1 Results.


Awesome news Rishi! Looking forward to your SolrCloud updates.


On Sat, May 18, 2013 at 12:59 AM, Rishi Easwaran rishi.easwa...@aol.comwrote:



 Hi All,

 Its Friday 3:00pm, warm  sunny outside and it was a good week. Figured
 I'd share some good news.
 I work for AOL mail team and we use SOLR for our mail search backend.
 We have been using it since pre-SOLR 1.4 and strong supporters of SOLR
 community.
 We deal with millions indexes and billions of requests a day across our
 complex.
 We finished full rollout of SOLR 4.2.1 into our production last week.

 Some key highlights:
 - ~75% Reduction in Search response times
 - ~50% Reduction in SOLR disk busy, which in turn helped with ~90%
 Reduction in errors
 - Garbage collection total stop time reduced by over 50%, moving application
 throughput into the 99.8% - 99.9% range
 - ~15% reduction in CPU usage

 We did not tune our application moving from 3.5 to 4.2.1, nor did we update Java.
 For the most part it was a binary upgrade, with patches for our special
 use case.

 Now going forward we are looking at prototyping SOLR Cloud for our search
 system, upgrading Java and Tomcat, and tuning our application further. Lots of
 fun stuff :)

 Have a great weekend everyone.
 Thanks,

 Rishi.







-- 
Regards,
Shalin Shekhar Mangar.

 


Re: Upgrading from SOLR 3.5 to 4.2.1 Results.

2013-05-20 Thread Rishi Easwaran
We use commodity H/W which we procured over the years as our complex grew.
We run on JDK 6 with Tomcat 5 (planning to upgrade to JDK 7 and Tomcat 7 soon).
We run them with about a 4GB heap, using the CMS garbage collector.
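
For illustration only, a minimal sketch of what a CMS setup along those lines
might look like for Tomcat (the flags beyond the 4GB heap are typical CMS
settings, not our actual production values):

export CATALINA_OPTS="-Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"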


 

 

 

-Original Message-
From: adityab aditya_ba...@yahoo.com
To: solr-user solr-user@lucene.apache.org
Sent: Sat, May 18, 2013 10:37 am
Subject: Re: Upgrading from SOLR 3.5 to 4.2.1 Results.


These numbers are really great. Would you mind sharing your h/w configuration
and JVM params?

thanks 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Upgrading-from-SOLR-3-5-to-4-2-1-Results-tp4064266p4064370.html
Sent from the Solr - User mailing list archive at Nabble.com.

 


Re: Upgrading from SOLR 3.5 to 4.2.1 Results.

2013-05-20 Thread Rishi Easwaran
No, we just upgraded to 4.2.1.
With the size of our complex and the effort required to apply our patches and roll out,
our upgrades are not that frequent.


 

 

-Original Message-
From: Noureddine Bouhlel nouredd...@ecotour.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, May 20, 2013 3:36 pm
Subject: Re: Upgrading from SOLR 3.5 to 4.2.1 Results.


Hi Rishi,

Have you done any tests with Solr 4.3?

Regards,


Cordialement,

BOUHLEL Noureddine



On 17 May 2013 21:29, Rishi Easwaran rishi.easwa...@aol.com wrote:



 Hi All,

 It's Friday 3:00pm, warm & sunny outside, and it was a good week. Figured
 I'd share some good news.
 I work for AOL mail team and we use SOLR for our mail search backend.
 We have been using it since pre-SOLR 1.4 and strong supporters of SOLR
 community.
 We deal with millions of indexes and billions of requests a day across our
 complex.
 We finished full rollout of SOLR 4.2.1 into our production last week.

 Some key highlights:
 - ~75% Reduction in Search response times
 - ~50% Reduction in SOLR disk busy, which in turn helped with ~90%
 Reduction in errors
 - Garbage collection total stop time reduced by over 50%, moving application
 throughput into the 99.8% - 99.9% range
 - ~15% reduction in CPU usage

 We did not tune our application moving from 3.5 to 4.2.1, nor did we update Java.
 For the most part it was a binary upgrade, with patches for our special
 use case.

 Now going forward we are looking at prototyping SOLR Cloud for our search
 system, upgrading Java and Tomcat, and tuning our application further. Lots of
 fun stuff :)

 Have a great weekend everyone.
 Thanks,

 Rishi.






 


Upgrading from SOLR 3.5 to 4.2.1 Results.

2013-05-17 Thread Rishi Easwaran


Hi All,

It's Friday 3:00pm, warm & sunny outside, and it was a good week. Figured I'd
share some good news.
I work for AOL mail team and we use SOLR for our mail search backend. 
We have been using it since pre-SOLR 1.4 and strong supporters of SOLR 
community.
We deal with millions of indexes and billions of requests a day across our complex.
We finished full rollout of SOLR 4.2.1 into our production last week. 

Some key highlights:
- ~75% Reduction in Search response times
- ~50% Reduction in SOLR disk busy, which in turn helped with ~90% Reduction
in errors
- Garbage collection total stop time reduced by over 50%, moving application
throughput into the 99.8% - 99.9% range
- ~15% reduction in CPU usage

We did not tune our application moving from 3.5 to 4.2.1, nor did we update Java.
For the most part it was a binary upgrade, with patches for our special use 
case.  

Now going forward we are looking at prototyping SOLR Cloud for our search
system, upgrading Java and Tomcat, and tuning our application further. Lots of fun
stuff :)

Have a great weekend everyone. 
Thanks,

Rishi. 






Re: SOLR Cloud Collection Management question.

2013-05-15 Thread Rishi Easwaran

Hi Anshum,

What if you have more nodes than shards * replicationFactor?
In the example below, I originally created the collection to use 6
shards * 2 replicationFactor = 12 nodes total.
Now I have added 6 more nodes, 18 nodes total. I just want to add 1 extra
replica per shard.

How will they get evenly distributed, and what are the determining criteria?

Thanks,

Rishi.



-Original Message-
From: Anshum Gupta ans...@anshumgupta.net
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 14, 2013 9:42 pm
Subject: Re: SOLR Cloud Collection Management question.


Hi Rishi,

If you have your cluster up and running, just add the nodes and they will
get evenly assigned to the shards. As of now, the replication factor is not
persisted.


On Wed, May 15, 2013 at 1:07 AM, Rishi Easwaran 
rishi.easwa...@aol.com wrote:


Ok, looks like... I have to go to every node, add a replica individually,
create the cores and add them to the collection.

ex:
http://newNode1:port/solr/admin/cores?action=CREATE&name=testCloud1_shard1_replica3&collection=testCloud1&shard=shard1&collection.configName=myconf

http://newNode2:port/solr/admin/cores?action=CREATE&name=testCloud1_shard2_replica3&collection=testCloud1&shard=shard2&collection.configName=myconf



Is there an easier way to do this?
Any ideas?

Thanks,

Rishi.


-Original Message-
From: Rishi Easwaran rishi.easwa...@aol.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 14, 2013 2:58 pm
Subject: SOLR Cloud Collection Management question.


Hi,

I am beginning to work on SOLR cloud implementation.
I created a collection using the collections API

http://myhost:port/solr/admin/collections?action=CREATE&name=testCloud1&numShards=6&replicationFactor=2&collection.configName=myconf&maxShardsPerNode=1

My cluster now has 6 shards and 2 replicas (1 leader & 1 replica) for
each shard.
Now I want to add extra replicas to each shard in my cluster without
changing the replicationFactor used to create the collection.
Any ideas on how to go about doing that?

Thanks,

Rishi.









--

Anshum Gupta
http://www.anshumgupta.net

 


SOLR Cloud Collection Management question.

2013-05-14 Thread Rishi Easwaran

Hi,

I am beginning to work on SOLR cloud implementation.
I created a collection using the collections API

http://myhost:port/solr/admin/collections?action=CREATE&name=testCloud1&numShards=6&replicationFactor=2&collection.configName=myconf&maxShardsPerNode=1

My cluster now has 6 shards and 2 replicas (1 leader & 1 replica) for
each shard.
Now I want to add extra replicas to each shard in my cluster without
changing the replicationFactor used to create the collection.

Any ideas on how to go about doing that?

Thanks,

Rishi.

  


Re: SOLR Cloud Collection Management question.

2013-05-14 Thread Rishi Easwaran
Ok, looks like... I have to go to every node, add a replica individually,
create the cores and add them to the collection.


ex:
http://newNode1:port/solr/admin/cores?action=CREATE&name=testCloud1_shard1_replica3&collection=testCloud1&shard=shard1&collection.configName=myconf

http://newNode2:port/solr/admin/cores?action=CREATE&name=testCloud1_shard2_replica3&collection=testCloud1&shard=shard2&collection.configName=myconf
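
For what it's worth, a minimal shell sketch of scripting these per-node CREATE
calls (the hostnames, the port, and the node-to-shard mapping below are
placeholders, not the actual topology):

# one core CREATE call per new node, cycling through the shards
for entry in newNode1:shard1 newNode2:shard2 newNode3:shard3 \
             newNode4:shard4 newNode5:shard5 newNode6:shard6; do
  node=${entry%%:*}; shard=${entry##*:}
  curl "http://$node:8983/solr/admin/cores?action=CREATE&name=testCloud1_${shard}_replica3&collection=testCloud1&shard=$shard&collection.configName=myconf"
done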



Is there an easier way to do this?
Any ideas?

Thanks,

Rishi.

-Original Message-
From: Rishi Easwaran rishi.easwa...@aol.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, May 14, 2013 2:58 pm
Subject: SOLR Cloud Collection Management question.


Hi,

I am beginning to work on SOLR cloud implementation.
I created a collection using the collections API

http://myhost:port/solr/admin/collections?action=CREATE&name=testCloud1&numShards=6&replicationFactor=2&collection.configName=myconf&maxShardsPerNode=1

My cluster now has 6 shards and 2 replicas (1 leader & 1 replica) for
each shard.
Now I want to add extra replicas to each shard in my cluster without
changing the replicationFactor used to create the collection.
Any ideas on how to go about doing that?

Thanks,

Rishi.