Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

2012-08-28 Thread Lance Norskog
SolrCloud supports this dynamic addition. SolrCloud makes copies of
the source documents and every Solr instance does its own indexing.
With replication, you only create the indexes once. When storing very
large documents, this is worthwhile.
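
Outside SolrCloud, plain HTTP Solr also has a CoreAdmin handler whose
CREATE action adds a core at runtime. A minimal sketch of building such a
request URL follows. The host, port, core name, and instance directory are
hypothetical, and it assumes the CoreAdmin handler is enabled in solr.xml.
The code only builds the URL; it does not contact a server.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CoreAdminUrl {
    /**
     * Builds a CoreAdmin CREATE request URL.
     * baseUrl is the Solr root, e.g. http://localhost:8983/solr (hypothetical).
     */
    static String createCoreUrl(String baseUrl, String coreName, String instanceDir) {
        return baseUrl + "/admin/cores?action=CREATE"
                + "&name=" + URLEncoder.encode(coreName, StandardCharsets.UTF_8)
                + "&instanceDir=" + URLEncoder.encode(instanceDir, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical values: one core per client, as in the setup described in this thread.
        System.out.println(createCoreUrl("http://localhost:8983/solr", "client42", "cores/client42"));
    }
}
```

The per-core config and schema files would still have to exist in the
instance directory before the request is sent.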

The only use case I have seen for EmbeddedSolrServer that really
makes sense is as Hadoop output.

On Mon, Aug 27, 2012 at 8:28 PM, KnightRider ksu.wildc...@gmail.com wrote:
 One other thing I forgot to mention: the multicore setup we have requires us
 to be able to add cores dynamically, and I am not sure if that's supported by
 HTTP Solr out of the box.






-- 
Lance Norskog
goks...@gmail.com


Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

2012-08-27 Thread KnightRider
Thanks for the reply, Lance.

From your post, my understanding is that Solr committers are more focused on
HTTP Solr than EmbeddedSolrServer, and that EmbeddedSolrServer may not be
tested for all features supported by HTTP Solr.
That said, can you please tell me if there is any justification or use case
for using EmbeddedSolrServer?
The reason I am asking is: if EmbeddedSolrServer is not advised by Solr
committers, then why don't they deprecate it and force users to go the HTTP
Solr route instead?
I am just trying to understand if there is any valid use case for using
EmbeddedSolrServer.

We currently have EmbeddedSolrServer with a multi-core setup (one core per
client; the size of each core/index is in the range of 20 GB - 70 GB)
integrated in our web application, and it has been working fine for us. After
reading the responses, I am now wondering if we should be moving towards HTTP
Solr, and what benefit we might get if EmbeddedSolrServer is replaced with
HTTP Solr.

For replication we have been using rsync tool and it has been working fine
for us.

Also, for our needs (below), do you suggest HTTP Solr or EmbeddedSolrServer?
1) Indexing speed is more important than flexibility.
2) We have huge text article/blog files (2 MB) that need to be parsed from
the filesystem and indexed.
Our index size will be in the range of 20 GB - 70 GB per core, and there is
a core for each client.
3) We need to store all the data in the index because we absolutely need the
highlighter feature working, and reading through the Solr documentation I
found that the highlighter can be used only when data is stored.
4) We also need to store positions and offsets because we need to be able to
use phrase queries and also need the position of the terms in search result
documents.

Thanks
K'Rider



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4003622.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

2012-08-27 Thread KnightRider
One other thing I forgot to mention: the multicore setup we have requires us
to be able to add cores dynamically, and I am not sure if that's supported by
HTTP Solr out of the box.



-
Thanks
-K'Rider
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4003623.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

2012-08-25 Thread Lance Norskog
A few other things:
Support: many of the Solr committers do not like the Embedded server.
It does not get much attention, so if you find problems with it you
may have to fix them and get someone to review and commit the fixes.
I'm not saying they sabotage it, there just is not much interest in
making it first-class.

Replication: you can replicate from the Embedded server with the old
rsync-based replicator. The Java Replication tool requires servlets.
If you are Unix-savvy, the rsync tool is fine.

Indexing speed:
1) You can use shards to split the index into pieces. This divides the
indexing work among the shards.
2) Do not store the giant data. A lot of sites instead archive the
datafile and index a link to the file. Giant stored fields cause
indexing speed to drop dramatically because stored data is not saved
just once: it is copied repeatedly during merging as new documents are
added. Index data is also copied around, but this tends to increase
sub-linearly since documents share terms.
3) Do not store positions and offsets. These allow you to do phrase
queries because they store the position of each word. They take a lot
of memory, and have to be copied around during merging.
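
To illustrate why large stored fields hurt, here is a rough model of how
often stored bytes get rewritten by merging. It is a simplified logarithmic
merge scheme (merge whenever a level holds mergeFactor segments), not
Lucene's actual merge policy, and the class and method names are made up
for the example.

```java
import java.util.ArrayList;
import java.util.List;

class MergeCopyModel {
    /**
     * Total bytes rewritten by merges after `flushes` new segments of
     * `segmentBytes` each, merging whenever a level reaches `mergeFactor` segments.
     */
    static long bytesCopied(int flushes, long segmentBytes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<List<Long>> levels = new ArrayList<>();
        long copied = 0;
        for (int i = 0; i < flushes; i++) {
            long incoming = segmentBytes;
            int level = 0;
            while (true) {
                if (level == levels.size()) {
                    levels.add(new ArrayList<>());
                }
                List<Long> segments = levels.get(level);
                segments.add(incoming);
                if (segments.size() < mergeFactor) {
                    break; // level not full yet, nothing to merge
                }
                long merged = 0;
                for (long size : segments) {
                    merged += size;
                }
                segments.clear();
                copied += merged;   // every byte in the merged segments is rewritten
                incoming = merged;  // the merged segment moves up one level
                level++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        long flushed = 1000L * 2_000_000L; // 1000 segments of 2 MB of stored data each
        long copied = bytesCopied(1000, 2_000_000L, 10);
        System.out.println("copies per stored byte: " + (double) copied / flushed);
    }
}
```

In this model each stored byte is rewritten roughly log(number of segments)
times to the base of the merge factor, so the cost grows with the size of
the stored data rather than with the number of distinct terms.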

On Thu, Aug 23, 2012 at 1:31 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 I know the following drawbacks of EmbeddedSolrServer:

 - org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(),
   which is called when handling an update request, produces a lot of
   garbage in memory and bloats it with expensive XML.
 - org.apache.solr.response.BinaryResponseWriter.getParsedResponse(SolrQueryRequest,
   SolrQueryResponse) does something similar on the response side: it just
   bloats your heap.

 For me, your task is covered by multiple cores. Anyway, if you are OK with
 EmbeddedSolrServer, let it be. Just be aware of the stream updates feature:
 http://wiki.apache.org/solr/ContentStream

 My average indexing speed estimate is for fairly small docs of less than 1K
 (which are always used for micro-benchmarking).

 Heavy analysis is the key argument for invoking updates in multiple threads.
 What are your CPU stats during indexing?




 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

2012-08-22 Thread ksu wildcats
Thanks for the reply Mikhail.

For our needs, speed is more important than flexibility, and we have huge
text files (e.g. blogs/articles of ~2 MB) that need to be read from our
filesystem and then stored in the index.

Our app creates a separate core per client (dynamically), and there is
one instance of EmbeddedSolrServer for each core that is used for adding
documents to the index.
Each document has about 10 fields, and one of the fields has ~2 MB of data
stored (stored=true, analyzed=true).
We also have logic built into our webapp to dynamically create the Solr
config files
(solrconfig and schema per core; filters/analyzers/handler values can be
different for each core)
for each core before creating an instance of EmbeddedSolrServer for that
core.
Another reason to go with EmbeddedSolrServer is to reduce the overhead of
transporting large data (~2 MB) over HTTP/XML.

We use this setup for building our master index, which then gets replicated
to slave servers
using the replication scripts provided by Solr.
We also have the Solr admin UI integrated into our webapp (using the admin
JSP handlers from the Solr admin UI).

We have been using this multi-core setup for more than a year now, and so far
we haven't run into any issues with EmbeddedSolrServer integrated into our
webapp.
However, I am now trying to figure out the impact if we allow multiple
threads to send requests to EmbeddedSolrServer (same core) for adding docs to
the index simultaneously.

Our understanding was that EmbeddedSolrServer would give us better
performance than HTTP Solr for our needs.
It's quite possible that we are wrong and HTTP Solr would have given us
similar or better performance.

Also, based on documentation from the Solr wiki, I am assuming that the
EmbeddedSolrServer API is the same as the one used by HTTP Solr.

That said, can you please tell me if there is any specific downside to using
EmbeddedSolrServer that could cause issues for us down the line?

I am also interested in your comment below about indexing 1 million docs in
a few minutes. Ideally we would like to get to that speed.
I am assuming this depends on the size of the docs and the type of
analyzer/tokenizer/filters being used. Correct?
Can you please share (or point me to documentation on) how to get this speed
for 1 million docs?
  - one million is a fairly small amount, in average it should be indexed
 in few mins. I doubt that you really need to distribute indexing
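
As a back-of-the-envelope check on how document size changes that estimate,
the sketch below scales a baseline time by total bytes. It assumes indexing
time grows linearly with the bytes to analyze, which is a simplification,
and it takes "a few minutes" to mean 3 minutes for 1 million docs of 1 KB.
Both figures are assumptions, not measurements.

```java
class IndexingEstimate {
    /**
     * Scales a baseline indexing time linearly by the total number of bytes.
     * Linear scaling is an assumption made for this estimate only.
     */
    static double estimateSeconds(double baselineSeconds, long baselineDocs, long baselineDocBytes,
                                  long targetDocs, long targetDocBytes) {
        double baselineBytes = (double) baselineDocs * baselineDocBytes;
        double targetBytes = (double) targetDocs * targetDocBytes;
        return baselineSeconds * (targetBytes / baselineBytes);
    }

    public static void main(String[] args) {
        // Assumed baseline: 1,000,000 docs of 1 KB indexed in 180 seconds.
        double seconds = estimateSeconds(180.0, 1_000_000L, 1024L,
                                         1_000_000L, 2L * 1024 * 1024);
        System.out.printf("estimated time for 2 MB docs: %.1f hours%n", seconds / 3600.0);
    }
}
```

Under these assumptions a 2 MB document is 2048 times the baseline size, so
the same million documents would take days rather than minutes on one
writer.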

Thanks
-K



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4002776.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

2012-08-21 Thread ksu wildcats
We have a webapp that has embedded Solr integrated in it.
It essentially handles creating a separate index (core) per client, and it is
currently set up such that there can only be one index write operation per
core.
Say we have 1 million documents that need to be indexed: our app reads
each document and writes it to the index (using the embedded Solr library).

I am looking into ways to speed up indexing, and I was wondering if it
would be possible to have our app run on multiple servers, with each server
indexing docs concurrently. I was thinking of keeping the index storage
on NFS so that it can be accessed by all servers.

I am not entirely sure, but from reading the documentation my understanding
is that we cannot have multiple index writers (even if they are running on
different servers) write to the same index directory simultaneously. Is that
correct?

If there is a limitation on concurrent writes to the same index directory,
do I need to have each server build a separate index (something like cores
within a core) and merge all the sub-indexes into the main index to speed up
indexing?

Please let me know if I am heading down the right path or if there are better
alternatives to speed up indexing time?
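
One way to picture the "separate index per server" idea is to route each
document to exactly one writer by hashing its id, so no two servers ever
write to the same index directory. The sketch below shows only the routing
step; the class name and the use of String.hashCode() are illustrative
choices, and combining the resulting indexes (by merging or by searching
across them) is outside its scope.

```java
class ShardRouter {
    /** Maps a document id to a shard number in the range [0, numShards). */
    static int shardFor(String docId, int numShards) {
        if (numShards < 1) {
            throw new IllegalArgumentException("numShards must be at least 1");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        String[] ids = {"article-1", "article-2", "blog-17"}; // hypothetical ids
        for (String id : ids) {
            System.out.println(id + " -> shard " + shardFor(id, 4));
        }
    }
}
```

Because the mapping is deterministic, a later update or delete for the same
id lands on the same shard.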



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index Concurrency

2007-05-11 Thread Yonik Seeley

On 5/10/07, joestelmach [EMAIL PROTECTED] wrote:

 Yes, coordination between the main index searcher, the index writer,
 and the index reader needed to delete other documents.

Can you point me to any documentation/code that describes this
implementation?


Look at SolrCore.getSearcher() and DirectUpdateHandler2.

-Yonik


Re: Index Concurrency

2007-05-10 Thread Otis Gospodnetic
Though, isn't there a recent patch to allow multiple indices under a single 
Solr instance in JIRA?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message -----
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, May 9, 2007 6:32:33 PM
Subject: Re: Index Concurrency

On 5/9/07, joestelmach [EMAIL PROTECTED] wrote:
 My first intuition is to give each user their own index. My thinking here is
 that querying would be faster (since each user's index would be much smaller
 than one big index,) and, more importantly, that I would dodge any
 concurrency issues stemming from multiple threads trying to update the same
 index simultaneously.  I realize that Lucene implements a locking mechanism
 to protect against concurrent access, but I seem to hit the lock access
 timeout quite easily with only a couple threads.

 After looking at solr, I would really like to take advantage of the many
 features it adds to Lucene, but it doesn't look like I'll be able to achieve
 multiple indexes.

No, not currently.  Start your implementation with just a single
index... unless it is very large, it will likely be fast enough.

Solr also handles all the concurrency issues, and you should never hit
lock access timeout when updating from multiple threads.

-Yonik





Re: Index Concurrency

2007-05-10 Thread joestelmach


 Yes, coordination between the main index searcher, the index writer,
 and the index reader needed to delete other documents.

Can you point me to any documentation/code that describes this
implementation?

 That's weird... I've never seen that.
 The lucene write lock is only obtained when the IndexWriter is created.
 Can you post the relevant part of the log file where the exception
 happens?

After doing some more testing, I believe it was a stale lock file that was
causing me to have these lock issues yesterday - sorry for the false alarm
:)
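
For anyone hitting the same thing, a quick check for a leftover lock file
might look like the sketch below. It assumes the lock is a file named
write.lock inside the index directory, which depends on the Lucene version
and lock factory in use. With native OS locks the mere presence of the file
does not mean the lock is held, so treat a positive result as a hint only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

class StaleLockCheck {
    /**
     * Returns true if indexDir contains a write.lock file (assumed name)
     * that was last modified more than maxAge ago.
     */
    static boolean hasOldWriteLock(Path indexDir, Duration maxAge) {
        Path lock = indexDir.resolve("write.lock");
        if (!Files.exists(lock)) {
            return false;
        }
        try {
            Instant modified = Files.getLastModifiedTime(lock).toInstant();
            return modified.isBefore(Instant.now().minus(maxAge));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical default path; pass the real index directory as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "solr/data/index");
        System.out.println("old write.lock present: " + hasOldWriteLock(indexDir, Duration.ofHours(1)));
    }
}
```

A lock file should only be removed by hand once it is certain that no
indexing process is still running against that directory.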

 Also, unless you have at least 6 CPU cores or so, you are unlikely to
 see greater throughput with 10 threads.  If you add multiple documents
 per HTTP-POST (such that HTTP latency is minimized), the best setting
 would probably be nThreads == nCores.  For a single doc per POST, more
 threads will serve to cover the latency and keep Solr busy.

I agree with your thinking here.  My requirement for a large number of
threads is somewhat of an artifact of my current system design.  I'm trying
not to serialize the system's processing at the point of indexing.
-- 
View this message in context: 
http://www.nabble.com/Index-Concurrency-tf3718634.html#a10424207
Sent from the Solr - User mailing list archive at Nabble.com.



Index Concurrency

2007-05-09 Thread joestelmach

Hello,

I'm a bit new to search indexing and I'm hoping some of you here can help me
with an e-mail application I'm working on.  I have a mail retrieval program
that accesses multiple POP accounts in parallel, and parses each message
into a database.  I would like to add a new document to a solr index each
time I process a message.

My first intuition is to give each user their own index. My thinking here is
that querying would be faster (since each user's index would be much smaller
than one big index,) and, more importantly, that I would dodge any
concurrency issues stemming from multiple threads trying to update the same
index simultaneously.  I realize that Lucene implements a locking mechanism
to protect against concurrent access, but I seem to hit the lock access
timeout quite easily with only a couple threads.

After looking at solr, I would really like to take advantage of the many
features it adds to Lucene, but it doesn't look like I'll be able to achieve
multiple indexes.

Am I completely off in thinking that I need multiple indexes?  Is there some
best practice for this sort of thing that I haven't stumbled upon?

Any advice would be greatly appreciated.

Thanks,
Joe
-- 
View this message in context: 
http://www.nabble.com/Index-Concurrency-tf3718634.html#a10403918
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index Concurrency

2007-05-09 Thread Yonik Seeley

On 5/9/07, joestelmach [EMAIL PROTECTED] wrote:

My first intuition is to give each user their own index. My thinking here is
that querying would be faster (since each user's index would be much smaller
than one big index,) and, more importantly, that I would dodge any
concurrency issues stemming from multiple threads trying to update the same
index simultaneously.  I realize that Lucene implements a locking mechanism
to protect against concurrent access, but I seem to hit the lock access
timeout quite easily with only a couple threads.

After looking at solr, I would really like to take advantage of the many
features it adds to Lucene, but it doesn't look like I'll be able to achieve
multiple indexes.


No, not currently.  Start your implementation with just a single
index... unless it is very large, it will likely be fast enough.

Solr also handles all the concurrency issues, and you should never hit
lock access timeout when updating from multiple threads.

-Yonik


Re: Index Concurrency

2007-05-09 Thread joestelmach

Yonik,

Thanks for your fast reply.

 No, not currently.  Start your implementation with just a single
 index... unless it is very large, it will likely be fast enough.

My index will get quite large.

 Solr also handles all the concurrency issues, and you should never hit
 lock access timeout when updating from multiple threads.

Does Solr provide any additional concurrency control over what Lucene
provides?  In my simple testing of indexing 2,000 messages, Solr would issue
lock access timeouts with as few as 10 threads.   Running all 2,000
messages through sequentially yields no problems at all.   Actually, I'm
able to churn through over 100,000 messages when no threads are involved.
Am I missing some concurrency settings?

Thanks,
Joe


-- 
View this message in context: 
http://www.nabble.com/Index-Concurrency-tf3718634.html#a10406382
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index Concurrency

2007-05-09 Thread Yonik Seeley

On 5/9/07, joestelmach [EMAIL PROTECTED] wrote:

Does solr provide any additional concurrency control over what Lucene
provides?


Yes, coordination between the main index searcher, the index writer,
and the index reader needed to delete other documents.


In my simple testing of indexing 2,000 messages, solr would issue
lock access timeouts with as little as 10 threads.


That's weird... I've never seen that.
The lucene write lock is only obtained when the IndexWriter is created.
Can you post the relevant part of the log file where the exception happens?

Also, unless you have at least 6 CPU cores or so, you are unlikely to
see greater throughput with 10 threads.  If you add multiple documents
per HTTP-POST (such that HTTP latency is minimized), the best setting
would probably be nThreads == nCores.  For a single doc per POST, more
threads will serve to cover the latency and keep Solr busy.
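
The batching and thread-count advice above can be sketched roughly as
follows. BatchSender is a hypothetical stand-in for whatever actually posts
one batch of documents to Solr, and the class and method names are made up
for the example; only the JDK thread pool calls are real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchedIndexer {
    /** Hypothetical hook: sends one batch of documents, e.g. as a single HTTP POST. */
    interface BatchSender {
        void send(List<String> docs);
    }

    /** Splits docs into consecutive batches of at most batchSize documents. */
    static List<List<String>> partition(List<String> docs, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<>(docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    /** Sends all batches with one worker thread per CPU core; returns the number of batches sent. */
    static int indexAll(List<String> docs, int batchSize, BatchSender sender) {
        int nThreads = Runtime.getRuntime().availableProcessors(); // nThreads == nCores
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> sender.send(batch)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any failure
            }
            return pending.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            docs.add("message-" + i); // placeholder documents
        }
        int sent = indexAll(docs, 100, batch -> { /* post the batch to Solr here */ });
        System.out.println("batches sent: " + sent);
    }
}
```

With one document per request instead, a pool larger than the core count
would be the knob to turn, since the extra threads mainly hide request
latency.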

-Yonik