date:20130311

What Solr version?

Are you mixing deletes and adds?

Do you have more than one shard for a collection per machine? ie are you 
oversharding?

Can you post the stack traces (using jstack, or jconsolr, or visualvm, or…)?


- Mark


On Mar 11, 2013, at 11:39 AM, yriveiro yago.rive...@gmail.com wrote:

 Hi,
 
 I have the next issue:
 
 I have a collection with a leader and a replica, both are synchronized.
 
 When I try to index data to this collection I have a timeout error (the
 output is python):
 
 (class 'requests.exceptions.Timeout',
 Timeout(TimeoutError(HTTPConnectionPool(host='192.168.20.50', port=8983):
 Request timed out. (timeout=60.0),),), traceback object at
 0x7f64c033b908)
 
 Now, I can't index any document to this collection because I have always the
 timeout error.
 
 In the tomcat I have about 100 thread stuck, 
 
 S 11393624 ms 0 KB30 KB   192.168.20.47   192.168.20.50   POST
 /solr/ST-4A46DF1563_0612/update?update.distrib=TOLEADERdistrib.from=http%3A%2F%2F192.168.20.48%3A8983%2Fsolr%2FST-4A46DF1563_0612%2Fwt=javabinversion=2
 HTTP/1.1
 
 Someone have any idea that what can be happening and why I can't index any
 document to the collection?
 
 
 
 -
 Best regards
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SolrCloud-index-timeout-tp4046348.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: writing doc to another collection from UpdateReqeustProcessor

Sure, seems reasonable.

- Mark

On Mar 11, 2013, at 11:52 AM, mike st. john mstj...@gmail.com wrote:

 Whats the best approach in writing the current doc inside an 
 UpdateRequestProcessor to another collection ?
 
 
 Would i just call up CloudSolrServer and process it as i normally would in 
 solrj?
 
 
 
 Thanks
 msj

Re: Some nodes have all the load

There is an open JIRA issue about trying to spread the leader load during 
elections. Was waiting to get reports that it was really a problem for someone 
though.

How much load were you putting on? How long were the nodes unresponsive? 
Unresponsive to everything? Just updates? Searches? What version of Solr? How 
many shards do you have? Collections?

- Mark

On Mar 11, 2013, at 11:41 AM, jimtronic jimtro...@gmail.com wrote:

 I was doing some rolling updates of my cluster ( 12 cores, 4 servers ) and I
 ended up in a situation where one node was elected leader by all the cores.
 This seemed very taxing to that one node. It was also still trying to serve
 query requests so it slowed everything down. I'm trying to do a lot of
 frequent atomic updates along with some periodic DIH syncs.
 
 My solution to this situation was to try to take the supreme leader out of
 the cluster and let the leader election start. This was not easy as there
 was so much load on it, I couldn't take it out gracefully. Some of my cores
 became unreachable for a while.
 
 This was all under fictitious load, but it made me nervous about high load
 production situation.
 
 I'm sure there's several things I'm doing wrong in all this, so I thought
 I'd see what you guys think.
 
 Jim
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Some-nodes-have-all-the-load-tp4046349.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: [Beginner] wants to contribute in open source project

2013-03-11 Thread Andy Lester


On Mar 11, 2013, at 11:14 AM, chandresh pancholi 
chandreshpancholi...@gmail.com wrote:

 I am beginner in this field. It would be great if you help me out. I love
 to code in java.
 can you guys share some link so that i can start contributing in
 solr/lucene project.


This article I wrote about getting started contributing to projects may give 
you some ideas.

http://blog.smartbear.com/software-quality/bid/167051/14-Ways-to-Contribute-to-Open-Source-without-Being-a-Programming-Genius-or-a-Rock-Star

I don't have tasks specifically for the Solr project (does Solr have such a 
list for newcomers to help on?) but I hope that you'll get some ideas.

xoa

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance

Re: [Beginner] wants to contribute in open source project

2013-03-11 Thread Tomás Fernández Löbbe

You can also take a look at http://wiki.apache.org/solr/HowToContribute

Tomás


On Mon, Mar 11, 2013 at 9:20 AM, Andy Lester a...@petdance.com wrote:


 On Mar 11, 2013, at 11:14 AM, chandresh pancholi 
 chandreshpancholi...@gmail.com wrote:

  I am beginner in this field. It would be great if you help me out. I love
  to code in java.
  can you guys share some link so that i can start contributing in
  solr/lucene project.


 This article I wrote about getting started contributing to projects may
 give you some ideas.


 http://blog.smartbear.com/software-quality/bid/167051/14-Ways-to-Contribute-to-Open-Source-without-Being-a-Programming-Genius-or-a-Rock-Star

 I don't have tasks specifically for the Solr project (does Solr have such
 a list for newcomers to help on?) but I hope that you'll get some ideas.

 xoa

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance

Re: Memory Guidance


On 3/10/2013 8:00 PM, jimtronic wrote:

I'm having trouble finding some problems while load testing my setup.

If you saw these numbers on your dashboard, would they worry you?

Physical Memory  97.6%
14.64 GB of 15.01 GB

File Descriptor Count  19.1%
196 of 1024

JVM-Memory  95%
1.67 GB (dark gray)
1.76 GB (med gray)
1.76 GB


What OS?  If it's a unix/linux environment, the full output of the 
'free' command will be important.  Generally speaking, it's normal for 
any computer (client or server, regardless of OS) to use all available 
memory when under load.


Thanks,
Shawn

Re: Memory Guidance


On 3/11/2013 11:14 AM, Shawn Heisey wrote:

On 3/10/2013 8:00 PM, jimtronic wrote:

I'm having trouble finding some problems while load testing my setup.

If you saw these numbers on your dashboard, would they worry you?

Physical Memory  97.6%
14.64 GB of 15.01 GB

File Descriptor Count  19.1%
196 of 1024

JVM-Memory  95%
1.67 GB (dark gray)
1.76 GB (med gray)
1.76 GB


What OS?  If it's a unix/linux environment, the full output of the
'free' command will be important.  Generally speaking, it's normal for
any computer (client or server, regardless of OS) to use all available
memory when under load.


Replying to myself.  The cold must be getting to me. :)

If nothing else is running on this server except for Solr, and your 
index is less than 15GB in size, these numbers would not worry me at 
all.  If your index is less than 30GB in size, you might still be OK, 
but at that point your index would exceed available RAM.  Chances are 
that you would be able to cache enough of it for good performance, 
depending on your schema.  The reason that I say this is that you have 
about 2GB of RAM give to Solr, leaving about 13-14GB for OS disk caching.


If the server is shared with other things, particularly a busy database 
or busy web server, then the above paragraph might not apply - you may 
not have enough resources for Solr to work effectively.


Thanks,
Shawn

Re: SolrCloud index timeout

2013-03-11 Thread yriveiro

Hi,

The version is the 4.1

I'm not mixing deletes and adds, are only adds.

I have a 4 nodes in 2 physical machines, 2 instances of tomcat in each
machine. In this case the leader is located in a diferent physical machine
that the replica. The collection has all shards in different nodes, I have
not oversharding.

The question of the stack I need install the visualvm and try to get the
stack.

I create the collection using the CORE API:

LEADER
curl
http://192.168.20.48:8983/solr/admin/cores\?action\=CREATE\name\=ST-0112\collection\=ST-0112\shard\=00\collection.configName\=statisticsBucket-regular

REPLICA
curl
http://192.168.20.50:8983/solr/admin/cores\?action\=CREATE\name\=ST-0112\collection\=ST-0112\shard\=00\collection.configName\=statisticsBucket-regular

The data folders have the content:

LEADER
drwxr-xr-x 2 root root  4096 Jan 30 17:40 index
drwxr-xr-x 2 root root 12288 Feb  5 13:28 index.20130130174052236
drwxr-xr-x 2 root root 36864 Mar 11 15:20 index.20130220001204140
-rw-r--r-- 1 root root78 Feb 20 00:13 index.properties
-rw-r--r-- 1 root root   251 Feb 20 00:13 replication.properties
drwxr-xr-x 2 root root  4096 Mar 11 15:19 tlog

REPLICA
drwxr-xr-x 2 root root 4096 Mar 11 15:59 index.20130228105843631
-rw-r--r-- 1 root root   78 Feb 28 10:59 index.properties
-rw-r--r-- 1 root root  208 Feb 28 10:59 replication.properties
drwxr-xr-x 2 root root 4096 Mar 11 12:17 tlog



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-update-timeout-tp4046348p4046385.html
Sent from the Solr - User mailing list archive at Nabble.com.

Solr replication takes long time

2013-03-11 Thread Victor Ruiz

Hi guys,

I have a problem with Solr replication. I have 2 solr servers (Solr 4.0.0) 1
master and 1 slave (8 processors,16GB RAM ,Ubuntu 11,  ext3,  each). In
every server, there are 2 independent instances of solr running (I tried
also multicore config, but having independent instances has for me better
performance), every instance having a differente collection. So, we have 2
masters in server 1, and 2 slaves in server 2.

Index size is currently (for the biggest collection) around 17 million
documents, with a total size near 12 GB. The files transferred every
replication cycle are typically not more than 100, with a total size not
bigger than 50MB. The other collection is not that big, just around 1
million docs and not bigger than 2 GB and not a high update ratio. The big
collection has a load around 200 queries per second (MoreLikeThis,
RealTimeGetHandler , TermVectorComponent mainly), and for the small one it
is below 50 queries per second

Replication has been working for long time with any problem, but in the last
weeks the replication cycles started to take long and long time for the big
collection, even more than 2 minutes, some times even more. During that
time, slaves are so overloaded, that many queries are timing out, despite
the timeout in my clients is 30 seconds. 

The servers are in same LAN, gigabit ethernet, so the broadband should not
be the bottleneck.

Since the index is receiving frequents updates and deletes (update handler
receives more than 200 request per second for the big collection, but not
more than 5 per second for the small one), I tried to use the
maxCommitsToKeep attribute, to ensure that no file was deleted during
replication, but it has no effect. 

My solrconfig.xml in the big collection is like that:

?xml version=1.0 encoding=UTF-8 ?

config

luceneMatchVersionLUCENE_40/luceneMatchVersion

directoryFactory name=DirectoryFactory
  
class=${solr.directoryFactory:solr.NRTCachingDirectoryFactory}/


indexConfig
mergeFactor3/mergeFactor

deletionPolicy class=solr.SolrDeletionPolicy

str name=maxCommitsToKeep10/str
str name=maxOptimizedCommitsToKeep1/str

str name=maxCommitAge6HOUR/str

/deletionPolicy

/indexConfig

jmx/

updateHandler class=solr.DirectUpdateHandler2

autoCommit
maxDocs2000/maxDocs
maxTime3/maxTime
/autoCommit

autoSoftCommit
maxTime500/maxTime
/autoSoftCommit

updateLog
str name=dir${solr.data.dir:}/str
/updateLog

/updateHandler

query
maxBooleanClauses2048/maxBooleanClauses

filterCache
class=solr.FastLRUCache
size=2048
initialSize=1024
autowarmCount=1024/

queryResultCache
class=solr.LRUCache
size=2048
initialSize=1024
autowarmCount=1024/


documentCache
class=solr.LRUCache
size=2048
initialSize=1024
autowarmCount=1024/

enableLazyFieldLoadingtrue/enableLazyFieldLoading

queryResultWindowSize50/queryResultWindowSize

queryResultMaxDocsCached50/queryResultMaxDocsCached

listener event=newSearcher class=solr.QuerySenderListener
arr name=queries
lst
str name=q*:*/str
str name=fqdate:[NOW/DAY-7DAY TO 
NOW/DAY+1DAY]/str
str name=rows1000/str
/lst
/arr
/listener
listener event=firstSearcher 
class=solr.QuerySenderListener
arr name=queries
lst
str name=q*:*/str
str name=fqdate:[NOW/DAY-7DAY TO 
NOW/DAY+1DAY]/str
str name=rows1000/str
/lst
/arr
/listener

useColdSearchertrue/useColdSearcher

maxWarmingSearchers4/maxWarmingSearchers
/query



requestHandler name=/replication class=solr.ReplicationHandler
lst

Re: Solr replication takes long time

Are you using Solr 4.1?

- Mark

On Mar 11, 2013, at 1:53 PM, Victor Ruiz bik1...@gmail.com wrote:

 Hi guys,
 
 I have a problem with Solr replication. I have 2 solr servers (Solr 4.0.0) 1
 master and 1 slave (8 processors,16GB RAM ,Ubuntu 11,  ext3,  each). In
 every server, there are 2 independent instances of solr running (I tried
 also multicore config, but having independent instances has for me better
 performance), every instance having a differente collection. So, we have 2
 masters in server 1, and 2 slaves in server 2.
 
 Index size is currently (for the biggest collection) around 17 million
 documents, with a total size near 12 GB. The files transferred every
 replication cycle are typically not more than 100, with a total size not
 bigger than 50MB. The other collection is not that big, just around 1
 million docs and not bigger than 2 GB and not a high update ratio. The big
 collection has a load around 200 queries per second (MoreLikeThis,
 RealTimeGetHandler , TermVectorComponent mainly), and for the small one it
 is below 50 queries per second
 
 Replication has been working for long time with any problem, but in the last
 weeks the replication cycles started to take long and long time for the big
 collection, even more than 2 minutes, some times even more. During that
 time, slaves are so overloaded, that many queries are timing out, despite
 the timeout in my clients is 30 seconds. 
 
 The servers are in same LAN, gigabit ethernet, so the broadband should not
 be the bottleneck.
 
 Since the index is receiving frequents updates and deletes (update handler
 receives more than 200 request per second for the big collection, but not
 more than 5 per second for the small one), I tried to use the
 maxCommitsToKeep attribute, to ensure that no file was deleted during
 replication, but it has no effect. 
 
 My solrconfig.xml in the big collection is like that:
 
 ?xml version=1.0 encoding=UTF-8 ?
 
 config
 
   luceneMatchVersionLUCENE_40/luceneMatchVersion
 
   directoryFactory name=DirectoryFactory
 
 class=${solr.directoryFactory:solr.NRTCachingDirectoryFactory}/
 
 
   indexConfig
   mergeFactor3/mergeFactor
 
   deletionPolicy class=solr.SolrDeletionPolicy
   
   str name=maxCommitsToKeep10/str
   str name=maxOptimizedCommitsToKeep1/str
   
   str name=maxCommitAge6HOUR/str
 
   /deletionPolicy
 
   /indexConfig
 
   jmx/
 
   updateHandler class=solr.DirectUpdateHandler2
 
   autoCommit
   maxDocs2000/maxDocs
   maxTime3/maxTime
   /autoCommit
 
   autoSoftCommit
   maxTime500/maxTime
   /autoSoftCommit
 
   updateLog
   str name=dir${solr.data.dir:}/str
   /updateLog
 
   /updateHandler
 
   query
   maxBooleanClauses2048/maxBooleanClauses
 
   filterCache
   class=solr.FastLRUCache
   size=2048
   initialSize=1024
   autowarmCount=1024/
 
   queryResultCache
   class=solr.LRUCache
   size=2048
   initialSize=1024
   autowarmCount=1024/
 
   
   documentCache
   class=solr.LRUCache
   size=2048
   initialSize=1024
   autowarmCount=1024/
 
   enableLazyFieldLoadingtrue/enableLazyFieldLoading
 
   queryResultWindowSize50/queryResultWindowSize
 
   queryResultMaxDocsCached50/queryResultMaxDocsCached
 
   listener event=newSearcher class=solr.QuerySenderListener
   arr name=queries
   lst
   str name=q*:*/str
   str name=fqdate:[NOW/DAY-7DAY TO 
 NOW/DAY+1DAY]/str
   str name=rows1000/str
   /lst
   /arr
   /listener
   listener event=firstSearcher 
 class=solr.QuerySenderListener
   arr name=queries
   lst
   str name=q*:*/str
   str name=fqdate:[NOW/DAY-7DAY TO 
 NOW/DAY+1DAY]/str
   str name=rows1000/str
   /lst
   /arr
   /listener
 
   useColdSearchertrue/useColdSearcher
 
   maxWarmingSearchers4/maxWarmingSearchers
   /query

Upgrade Solr3.5 to Solr4.1 - Index Reformat ?

Hello,

We are planning to upgrade our solr servers from version 3.5 to 4.1.
We have master slave configuration and the index size is quite big (i.e.
around 14 GB ).
1. Do we really need to re-format the whole index , when we upgrade to 4.1 ?
2. What will be the consequences - if we do not re-format and simply upgrade
war file and config files ( solrconfig.xml, schema.xml ) on all slaves and
master together. (Shutdown all master  slaves and then upgrade  startup) ?
3. If re-formatting is neccessary - then what is the best tool to achieve
it. ( How long does it usually take to re-format the index of size around
14GB ) ?

Thanks,
Feroz




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Upgrade-Solr3-5-to-Solr4-1-Index-Reformat-tp4046391.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr replication takes long time

2013-03-11 Thread Victor Ruiz

no, Solr 4.0.0, I wanted to update to Solr 4.1 but I read that there was an
issue with the replication, so I decided not to try it for now


Mark Miller-3 wrote
 Are you using Solr 4.1?
 
 - Mark
 
 On Mar 11, 2013, at 1:53 PM, Victor Ruiz lt;

 bik1979@

 gt; wrote:
 
 Hi guys,
 
 I have a problem with Solr replication. I have 2 solr servers (Solr
 4.0.0) 1
 master and 1 slave (8 processors,16GB RAM ,Ubuntu 11,  ext3,  each). In
 every server, there are 2 independent instances of solr running (I tried
 also multicore config, but having independent instances has for me better
 performance), every instance having a differente collection. So, we have
 2
 masters in server 1, and 2 slaves in server 2.
 
 Index size is currently (for the biggest collection) around 17 million
 documents, with a total size near 12 GB. The files transferred every
 replication cycle are typically not more than 100, with a total size not
 bigger than 50MB. The other collection is not that big, just around 1
 million docs and not bigger than 2 GB and not a high update ratio. The
 big
 collection has a load around 200 queries per second (MoreLikeThis,
 RealTimeGetHandler , TermVectorComponent mainly), and for the small one
 it
 is below 50 queries per second
 
 Replication has been working for long time with any problem, but in the
 last
 weeks the replication cycles started to take long and long time for the
 big
 collection, even more than 2 minutes, some times even more. During that
 time, slaves are so overloaded, that many queries are timing out, despite
 the timeout in my clients is 30 seconds. 
 
 The servers are in same LAN, gigabit ethernet, so the broadband should
 not
 be the bottleneck.
 
 Since the index is receiving frequents updates and deletes (update
 handler
 receives more than 200 request per second for the big collection, but not
 more than 5 per second for the small one), I tried to use the
 maxCommitsToKeep attribute, to ensure that no file was deleted during
 replication, but it has no effect. 
 
 My solrconfig.xml in the big collection is like that:
 
 ?xml version=1.0 encoding=UTF-8 ?
 
 
 config
 
  
 luceneMatchVersion
 LUCENE_40
 /luceneMatchVersion
 
  
 directoryFactory name=DirectoryFactory

 
 class=${solr.directoryFactory:solr.NRTCachingDirectoryFactory}/
 
 
  
 indexConfig
  
 mergeFactor
 3
 /mergeFactor
 
  
 deletionPolicy class=solr.SolrDeletionPolicy
  
  
 str name=maxCommitsToKeep
 10
 /str
  
 str name=maxOptimizedCommitsToKeep
 1
 /str
  
  
 str name=maxCommitAge
 6HOUR
 /str
 
  
 /deletionPolicy
 
  
 /indexConfig
 
  
 jmx/
 
  
 updateHandler class=solr.DirectUpdateHandler2
 
  
 autoCommit
  
 maxDocs
 2000
 /maxDocs
  
 maxTime
 3
 /maxTime
  
 /autoCommit
 
  
 autoSoftCommit
  
 maxTime
 500
 /maxTime
  
 /autoSoftCommit
 
  
 updateLog
  
 str name=dir
 ${solr.data.dir:}
 /str
  
 /updateLog
 
  
 /updateHandler
 
  
 query
  
 maxBooleanClauses
 2048
 /maxBooleanClauses
 
  
 filterCache

   class=solr.FastLRUCache
  size=2048
  initialSize=1024
  autowarmCount=1024/
 
  
 queryResultCache

   class=solr.LRUCache
  size=2048
  initialSize=1024
  autowarmCount=1024/
 
  
  
 documentCache

   class=solr.LRUCache
  size=2048
  initialSize=1024
  autowarmCount=1024/
 
  
 enableLazyFieldLoading
 true
 /enableLazyFieldLoading
 
  
 queryResultWindowSize
 50
 /queryResultWindowSize
 
  
 queryResultMaxDocsCached
 50
 /queryResultMaxDocsCached
 
  
 listener event=newSearcher class=solr.QuerySenderListener
  
 arr name=queries
  
 lst
  
 str name=q
 *:*
 /str
  
 str name=fq
 date:[NOW/DAY-7DAY TO NOW/DAY+1DAY]
 /str
  
 str name=rows
 1000
 /str
  
 /lst
  
 /arr
  
 /listener
  
 listener event=firstSearcher class=solr.QuerySenderListener
  
 arr name=queries
  
 lst
  
 str name=q
 *:*
 /str
  
 str name=fq
 date:[NOW/DAY-7DAY TO NOW/DAY+1DAY]

Re: Solr replication takes long time

Okay - yes, 4.0 is a better choice for replication than 4.1.

It almost sounds like you may be replicating the full index rather than just 
changes or something. 4.0 had a couple issues as well - a couple things that 
were discovered while writing stronger tests for 4.2.

4.2 is spreading onto mirrors now.

- Mark

On Mar 11, 2013, at 2:00 PM, Victor Ruiz bik1...@gmail.com wrote:

 no, Solr 4.0.0, I wanted to update to Solr 4.1 but I read that there was an
 issue with the replication, so I decided not to try it for now
 
 
 Mark Miller-3 wrote
 Are you using Solr 4.1?
 
 - Mark
 
 On Mar 11, 2013, at 1:53 PM, Victor Ruiz lt;
 
 bik1979@
 
 gt; wrote:
 
 Hi guys,
 
 I have a problem with Solr replication. I have 2 solr servers (Solr
 4.0.0) 1
 master and 1 slave (8 processors,16GB RAM ,Ubuntu 11,  ext3,  each). In
 every server, there are 2 independent instances of solr running (I tried
 also multicore config, but having independent instances has for me better
 performance), every instance having a differente collection. So, we have
 2
 masters in server 1, and 2 slaves in server 2.
 
 Index size is currently (for the biggest collection) around 17 million
 documents, with a total size near 12 GB. The files transferred every
 replication cycle are typically not more than 100, with a total size not
 bigger than 50MB. The other collection is not that big, just around 1
 million docs and not bigger than 2 GB and not a high update ratio. The
 big
 collection has a load around 200 queries per second (MoreLikeThis,
 RealTimeGetHandler , TermVectorComponent mainly), and for the small one
 it
 is below 50 queries per second
 
 Replication has been working for long time with any problem, but in the
 last
 weeks the replication cycles started to take long and long time for the
 big
 collection, even more than 2 minutes, some times even more. During that
 time, slaves are so overloaded, that many queries are timing out, despite
 the timeout in my clients is 30 seconds. 
 
 The servers are in same LAN, gigabit ethernet, so the broadband should
 not
 be the bottleneck.
 
 Since the index is receiving frequents updates and deletes (update
 handler
 receives more than 200 request per second for the big collection, but not
 more than 5 per second for the small one), I tried to use the
 maxCommitsToKeep attribute, to ensure that no file was deleted during
 replication, but it has no effect. 
 
 My solrconfig.xml in the big collection is like that:
 
 ?xml version=1.0 encoding=UTF-8 ?
 
 
 config
 
 
 luceneMatchVersion
 LUCENE_40
 /luceneMatchVersion
 
 
 directoryFactory name=DirectoryFactory
 

 class=${solr.directoryFactory:solr.NRTCachingDirectoryFactory}/
 
 
 
 indexConfig
 
 mergeFactor
 3
 /mergeFactor
 
 
 deletionPolicy class=solr.SolrDeletionPolicy
 
 
 str name=maxCommitsToKeep
 10
 /str
 
 str name=maxOptimizedCommitsToKeep
 1
 /str
 
 
 str name=maxCommitAge
 6HOUR
 /str
 
 
 /deletionPolicy
 
 
 /indexConfig
 
 
 jmx/
 
 
 updateHandler class=solr.DirectUpdateHandler2
 
 
 autoCommit
 
 maxDocs
 2000
 /maxDocs
 
 maxTime
 3
 /maxTime
 
 /autoCommit
 
 
 autoSoftCommit
 
 maxTime
 500
 /maxTime
 
 /autoSoftCommit
 
 
 updateLog
 
 str name=dir
 ${solr.data.dir:}
 /str
 
 /updateLog
 
 
 /updateHandler
 
 
 query
 
 maxBooleanClauses
 2048
 /maxBooleanClauses
 
 
 filterCache
 
  class=solr.FastLRUCache
 size=2048
 initialSize=1024
 autowarmCount=1024/
 
 
 queryResultCache
 
  class=solr.LRUCache
 size=2048
 initialSize=1024
 autowarmCount=1024/
 
 
 
 documentCache
 
  class=solr.LRUCache
 size=2048
 initialSize=1024
 autowarmCount=1024/
 
 
 enableLazyFieldLoading
 true
 /enableLazyFieldLoading
 
 
 queryResultWindowSize
 50
 /queryResultWindowSize
 
 
 queryResultMaxDocsCached
 50
 /queryResultMaxDocsCached
 
 
 listener event=newSearcher class=solr.QuerySenderListener
 
 arr name=queries
 
 lst
 
 str name=q
 *:*
 /str
 
 str name=fq
 date:[NOW/DAY-7DAY TO NOW/DAY+1DAY]
 /str
 
 str name=rows
 1000
 /str
 
 /lst

Re: Dynamic schema design: feedback requested

2013-03-11 Thread Yonik Seeley

On Wed, Mar 6, 2013 at 7:50 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 2) If you wish to use the /schema REST API for read and write operations,
 then schema information will be persisted under the covers in a data store
 whose format is an implementation detail just like the index file format.

This really needs to be driven by costs and benefits...
There are clear benefits to having a simple human readable / editable
file for the schema (whether it's on the local filesystem or on ZK).

 The ability to say my schema is a config file and i own it should always 
 exist (remove it over my dead body)

There are clear benefits to this being the persistence mechanism for
the REST API.

Even if the REST API persisted it's data in some binary format for
example, then there would still need to be import/export mechanisms
for the human readable/editable config file config file that should
always exist.  Why would we want any other intermediate format (i.e.
data that is not human readable)?  Seems like we should only
introduce that extra complexity if the benefits are great enough.
Actually, I just realized we already have this intermediate
representation - it's the in-memory IndexSchema object.

-Yonik
http://lucidworks.com

Re: Dynamic schema design: feedback requested


:  2) If you wish to use the /schema REST API for read and write operations,
:  then schema information will be persisted under the covers in a data store
:  whose format is an implementation detail just like the index file format.
: 
: This really needs to be driven by costs and benefits...
: There are clear benefits to having a simple human readable / editable
: file for the schema (whether it's on the local filesystem or on ZK).

The cost is the user complexity of understanding what changes are 
respected and when, and in hte implementation complexity of dealing with 
changes coming from multiple code paths (both files changed on disk and 
REST based request changes)

in the current model, the config file on disk is hte authority, it is read 
in it's entirety on core init/reload, and users have total ownership of 
that file -- changes are funneled through the user, into the config, and 
solr is a read only participant.  Since solr knows the only way schema 
information will ever change is when it reads that file, it can make 
internal assumptions about the consistency of that data.

in a model where a public REST API might be modifying solr's in memory 
state, solr can't neccessarily make those same assumptions, and the 
complexity of the system becomes a lot simpler if the Solr is 
the authority of the information about the schema, and we don't have to 
worry about what happens if comflicts arrise, eg: someone modifies the 
schema on disk, but hasn't (yet?) done a core reload, when a new REST 
request comes in to modify the schema data in some other way.



-Hoss

Re: Dynamic schema design: feedback requested


To revisit sarowes comment about how/when to decide if we are using the   
config file version of schema info (and hte API is read only) vs
internal managed state data version of schema info (and the API is
read/write)...

On Wed, 6 Mar 2013, Steve Rowe wrote:

: Two possible approaches:
: 
: a. When schema.xml is present, ...
...
: b. Alternatively, the reverse: ...
...
: I like option a. better, since it provides a stable situation for users 
: who don't want the new dynamic schema modification feature, and who want 
: to continue to hand edit schema.xml.  Users who want the new feature 
: would use a command-line tool to convert their schema.xml to 
: schema.json, then remove schema.xml from conf/.


The more i think about it, the less I like either a or b because both 
are completley implicit.

I think practically speaking, from a support standpoint, we should require 
an more explicit configuration of what *type* of schema management 
should be used, and then have code that sanity checks that and warns/fails 
if the configuraiton setting doesn't match what is found in the ./conf 
dir.

The situation i worry about, is whan a novice solr user takes over 
maintence of an existing setup that is using REST based schema management, 
and therefore has no schema.xml file.  The novice is reading 
docs/tutorials talking about how to achieve some goal, which make refrence 
to editing the schema.xml or adding XXX to the schema.xml or even 
worse in the cases of some CMSs: To upgrade to FooCMS vX.Y, replace your 
schema.xml with this file... but they have no schema.xml or any clear and 
obvious indication looking at what configs they do have of *why* there is 
no schema.xml, so maybe they try to add one.

I think it would be better to add some new option in solroconfig.xml that 
requires the user to be explicit about what type of management thye want 
to use, defaulting to schema.xml for back compat...

  schema type=conf 
  [maybe an optional file=path/to/schema.xml ?] /

...vs...

  schema type=managed 
  [this is where the mutable=true|false sarowe mentioned could live] 
/

The on core load:

1) if the configured schema type is file but there is no schema.xml 
file, ERROR loudly and fail fast.

2) if we see that the the configured schema type is file but we detected 
the existence of managed internal schema info (schema.json, zk nodes, 
whatever) then we should WARN that the managed internal data is being 
ignored.

3) if the configured schema type is managed but there is no manged 
internal schema info (schema.json, zk nodes, whatever) then ERROR loudly 
and fail fase (or maybe we create an empty schema for them?)

4) if we see that the the configured schema type is managed but we 
also detected the existence of a schema.xml config file, then
whatever) then we should WARN that the schema.xml is being 
ignored.

...although i could easily be convinced that all of those WARN 
sitautions should really be hard failures to reduce confusion -- depends 
on how easy we can make it to let users delete all internally manged 
schema info before switching to a type=conf schema.xml approach.


-Hoss

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?


On 3/11/2013 11:56 AM, feroz_kh wrote:

We are planning to upgrade our solr servers from version 3.5 to 4.1.
We have master slave configuration and the index size is quite big (i.e.
around 14 GB ).
1. Do we really need to re-format the whole index , when we upgrade to 4.1 ?
2. What will be the consequences - if we do not re-format and simply upgrade
war file and config files ( solrconfig.xml, schema.xml ) on all slaves and
master together. (Shutdown all master  slaves and then upgrade  startup) ?
3. If re-formatting is neccessary - then what is the best tool to achieve
it. ( How long does it usually take to re-format the index of size around
14GB ) ?


If you are replicating from 3.5 to 4.1, then your index will be in the 
3.5 format.  If you upgrade both the master where you index and the 
slave(s), existing index files will be in the old format, new index 
segments will be in the new format.  If you were to optimize your index 
after upgrading, it would completely replace it with the new format.


For me on a fast I/O subsystem (six 1TB SATA drives in RAID10), it takes 
about ten minutes to optimize a 22GB index on Solr 3.5.  Solr 4.1 needs 
to compress stored fields, which means extra CPU time, but less time 
actually writing to disk, so it would be about the same or possibly less.


Thanks,
Shawn

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?

2013-03-11 Thread Tomás Fernández Löbbe

Hi Feroz, due to Lucene's backward compatibility policy (
http://wiki.apache.org/lucene-java/BackwardsCompatibility ), a Solr 4.1
instance should be able to read an index generated by a Solr 3.5 instance.
This would not be true if you need to change the schema. Also, be careful
because Solr 4.1 could and will change the index files and will make them
unreadable by Solr 3.5 (so you should make a backup in case you need to
revert to 3.5 for some reason).
This means, that if you can't shutdown your whole application all together,
you could update the slaves first, and then the masters. Replacing all
servers together will also work.

That said, you should not use 4.1 if you are using Master/Slave, there are
some known bugs in that specific feature in 4.1 that were fixed for 4.2.

Tomás

On Mon, Mar 11, 2013 at 10:56 AM, feroz_kh feroz.kh2...@gmail.com wrote:

Hello,

We are planning to upgrade our solr servers from version 3.5 to 4.1.
We have master slave configuration and the index size is quite big (i.e.
around 14 GB ).
1. Do we really need to re-format the whole index , when we upgrade to 4.1
?
2. What will be the consequences - if we do not re-format and simply
upgrade
war file and config files ( solrconfig.xml, schema.xml ) on all slaves and
master together. (Shutdown all master slaves and then upgrade startup)
?
3. If re-formatting is neccessary - then what is the best tool to achieve
it. ( How long does it usually take to re-format the index of size around
14GB ) ?

Thanks,
Feroz

--
View this message in context:
http://lucene.472066.n3.nabble.com/Upgrade-Solr3-5-to-Solr4-1-Index-Reformat-tp4046391.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Dynamic schema design: feedback requested

2013-03-11 Thread Yonik Seeley

On Mon, Mar 11, 2013 at 2:50 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :  2) If you wish to use the /schema REST API for read and write operations,
 :  then schema information will be persisted under the covers in a data store
 :  whose format is an implementation detail just like the index file format.
 :
 : This really needs to be driven by costs and benefits...
 : There are clear benefits to having a simple human readable / editable
 : file for the schema (whether it's on the local filesystem or on ZK).

 The cost is the user complexity of understanding what changes are
 respected and when

There is going to be a cost to understanding any feature.  This
doesn't deal with the answer to the question are we better off with
or without this feature.

, and in hte implementation complexity of dealing with
 changes coming from multiple code paths (both files changed on disk and
 REST based request changes)

Right - and these should be quantifiable going forward.
In ZK mode, we need concurrency control anyway, so depending on the
design, there may be really no cost at all.
In local FS mode, it might be a very low cost (simply check the
timestamp on the file for example).  Code to re-read the schema and
merge changes needs to be there anyway for cloud mode it seems.  *If*
we needed to, we could just assert that the schema file is the
persistence mechanism, as opposed to the system of record, hence if
you hand edit it and then use the API to change it, your hand edit may
be lost.  Or we may decide to do away with local FS mode altogether.

I guess my main point is, we shouldn't decide a priori that using the
API means you can no longer hand edit.

My thoughts on this are probably heavily influenced on how I initially
envisioned implementation working in cloud mode (which I thought about
first since it's harder).  A human readable file on ZK that represents
the system of record for the schema seemed to be the best.  I never
even considered making it non-human readable (and thus non-editable by
hand).

-Yonik
http://lucidworks.com

Re: Solr replication takes long time

2013-03-11 Thread Victor Ruiz

Thanks for your answer Mark. I think I'll try to update to 4.2. I'll keep you
updated.

Anyway, I'd not say that the full index is replicated, I've been monitoring
the replication process in the Solr admin console and there I see that
usually not more than 50-100 files are transferrend, the total size is
rarely greater than 50MB. Is this info trustable?

Victor

Mark Miller-3 wrote
 Okay - yes, 4.0 is a better choice for replication than 4.1.
 
 It almost sounds like you may be replicating the full index rather than
 just changes or something. 4.0 had a couple issues as well - a couple
 things that were discovered while writing stronger tests for 4.2.
 
 4.2 is spreading onto mirrors now.
 
 - Mark
 
 On Mar 11, 2013, at 2:00 PM, Victor Ruiz lt;

 bik1979@

 gt; wrote:
 
 no, Solr 4.0.0, I wanted to update to Solr 4.1 but I read that there was
 an
 issue with the replication, so I decided not to try it for now
 
 
 Mark Miller-3 wrote
 Are you using Solr 4.1?
 
 - Mark
 
 On Mar 11, 2013, at 1:53 PM, Victor Ruiz lt;
 
 bik1979@
 
 gt; wrote:
 
 Hi guys,
 
 I have a problem with Solr replication. I have 2 solr servers (Solr
 4.0.0) 1
 master and 1 slave (8 processors,16GB RAM ,Ubuntu 11,  ext3,  each). In
 every server, there are 2 independent instances of solr running (I
 tried
 also multicore config, but having independent instances has for me
 better
 performance), every instance having a differente collection. So, we
 have
 2
 masters in server 1, and 2 slaves in server 2.
 
 Index size is currently (for the biggest collection) around 17 million
 documents, with a total size near 12 GB. The files transferred every
 replication cycle are typically not more than 100, with a total size
 not
 bigger than 50MB. The other collection is not that big, just around 1
 million docs and not bigger than 2 GB and not a high update ratio. The
 big
 collection has a load around 200 queries per second (MoreLikeThis,
 RealTimeGetHandler , TermVectorComponent mainly), and for the small one
 it
 is below 50 queries per second
 
 Replication has been working for long time with any problem, but in the
 last
 weeks the replication cycles started to take long and long time for the
 big
 collection, even more than 2 minutes, some times even more. During that
 time, slaves are so overloaded, that many queries are timing out,
 despite
 the timeout in my clients is 30 seconds. 
 
 The servers are in same LAN, gigabit ethernet, so the broadband should
 not
 be the bottleneck.
 
 Since the index is receiving frequents updates and deletes (update
 handler
 receives more than 200 request per second for the big collection, but
 not
 more than 5 per second for the small one), I tried to use the
 maxCommitsToKeep attribute, to ensure that no file was deleted during
 replication, but it has no effect. 
 
 My solrconfig.xml in the big collection is like that:
 
 ?xml version=1.0 encoding=UTF-8 ?
 
 
 
 config
 

 
 luceneMatchVersion
 LUCENE_40
 
 /luceneMatchVersion
 

 
 directoryFactory name=DirectoryFactory

 
  
 class=${solr.directoryFactory:solr.NRTCachingDirectoryFactory}/
 
 

 
 indexConfig

 
 mergeFactor
 3
 
 /mergeFactor
 

 
 deletionPolicy class=solr.SolrDeletionPolicy


 
 str name=maxCommitsToKeep
 10
 
 /str

 
 str name=maxOptimizedCommitsToKeep
 1
 
 /str


 
 str name=maxCommitAge
 6HOUR
 
 /str
 

 
 /deletionPolicy
 

 
 /indexConfig
 

 
 jmx/
 

 
 updateHandler class=solr.DirectUpdateHandler2
 

 
 autoCommit

 
 maxDocs
 2000
 
 /maxDocs

 
 maxTime
 3
 
 /maxTime

 
 /autoCommit
 

 
 autoSoftCommit

 
 maxTime
 500
 
 /maxTime

 
 /autoSoftCommit
 

 
 updateLog

 
 str name=dir
 ${solr.data.dir:}
 
 /str

 
 /updateLog
 

 
 /updateHandler
 

 
 query

 
 maxBooleanClauses
 2048
 
 /maxBooleanClauses
 

 
 filterCache

 
 class=solr.FastLRUCache
size=2048
initialSize=1024
autowarmCount=1024/
 

 
 queryResultCache

 
 class=solr.LRUCache
size=2048
initialSize=1024
autowarmCount=1024/
 


 
 documentCache

 
 class=solr.LRUCache
size=2048
initialSize=1024
autowarmCount=1024/
 

 
 enableLazyFieldLoading
 true
 
 /enableLazyFieldLoading
 

 
 queryResultWindowSize
 50
 
 /queryResultWindowSize
 

 
 queryResultMaxDocsCached
 50

RE: Need help with delta import

2013-03-11 Thread Xavier Pell

This is absolutely a sintax error, I had the same problem, and with
dih.delta.id it solves all my problems. Thanks to god and the special
person who post the answer in this page.

You have to revise your sintax in queries for delta import and watch the
catalina (i use tomcat) log file for any errors.

Regards,

question about syntax for multiple terms in filter query

2013-03-11 Thread geeky2

hello everyone,

i have a question on the filter query syntax for multiple terms, after
reading this:

http://wiki.apache.org/solr/CommonQueryParameters#fq

i see from the above that two (2) syntax constructs are supported

fq=term1:foo  fq=term2:bar

and

fq=+term1:foo +term2:bar

is there a reason why i would want to use one syntax over the other?

does the first syntax support the and operand as well as the ?

thx
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-about-syntax-for-multiple-terms-in-filter-query-tp4046442.html
Sent from the Solr - User mailing list archive at Nabble.com.

PostingsHighlighter and analysis

2013-03-11 Thread Trey Hyde

debug=timing has told me for a very long time that 99% of my query time for 
slow queries is in the highlighting component so I've been eagerly awaiting the 
postingshighlighter for quite some time.  Mean query times 50ms or less, with 
certain queries able to generate  30s worth of highlighting.Now that it's 
here I've been somewhat disappointed since I can't use it since so many common 
analyzers emit tokens out of order, which, apparently is not compatible with 
storeOffsetsWithPositions.

The only analyzer that is in the bad list according to LUCENE-4641 that is 
really critical to our searches is the WordDelimiter filer.

My current index time filter config (which I believe has bee unchanged for me 
for 5+ years):
 filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=1 
generateWordParts=1
generateNumberParts=1 catenateWords=1 
catenateNumbers=1 catenateAll=0/

Does anyone have any suggestions deal with this?   Perhaps limiting certain 
options will always produce tokens in order?

Thanks

Trey Hyde 
Director of Engineering
Email th...@centraldesktop.com

Central Desktop. Work together in ways you never thought possible. 
Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Google+  |  
Blog

Re: question about syntax for multiple terms in filter query

2013-03-11 Thread Otis Gospodnetic

Hello Mark,

I think fq=+term1:foo +term2:bar doesn't actually result in 2 filters being
created/used, while fq=term1:foofq=term2:bar does

Otis
--
Solr  ElasticSearch Support
http://sematext.com/






On Mon, Mar 11, 2013 at 4:41 PM, geeky2 gee...@hotmail.com wrote:

 hello everyone,

 i have a question on the filter query syntax for multiple terms, after
 reading this:

 http://wiki.apache.org/solr/CommonQueryParameters#fq

 i see from the above that two (2) syntax constructs are supported

 fq=term1:foo  fq=term2:bar

 and

 fq=+term1:foo +term2:bar

 is there a reason why i would want to use one syntax over the other?

 does the first syntax support the and operand as well as the ?

 thx
 mark




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/question-about-syntax-for-multiple-terms-in-filter-query-tp4046442.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: question about syntax for multiple terms in filter query

2013-03-11 Thread Jack Krupansky


Point number 3 from that wiki says it all:

3.The document sets from each filter query are cached independently. Thus, 
concerning the previous examples: use a single fq containing two mandatory 
clauses if those clauses appear together often, and use two separate fq 
params if they are relatively independent.


FWIW, there is not  operator in Lucene/Solr query syntax. There is the 
 operator which is equivalent to AND, but each of the ampersands must 
be URL-encoded as %26 to use them in a query in a URL.


So, yes, you can use the AND operator, as:

 fq=term1:foo AND fq=term2:bar

or

 fq=term1:foo %26%26 fq=term2:bar

Note that this is not valid in a URL:

 fq=term1:foo  fq=term2:bar

It must be written as:

 fq=term1:foo fq=term2:bar

The  marks the start of a new query parameter - but that is query in 
the sense of the URL query, not a Solr query. The  must be immediately 
followed by the parameter name and an =.


-- Jack Krupansky

-Original Message- 
From: geeky2

Sent: Monday, March 11, 2013 4:41 PM
To: solr-user@lucene.apache.org
Subject: question about syntax for multiple terms in filter query

hello everyone,

i have a question on the filter query syntax for multiple terms, after
reading this:

http://wiki.apache.org/solr/CommonQueryParameters#fq

i see from the above that two (2) syntax constructs are supported

fq=term1:foo  fq=term2:bar

and

fq=+term1:foo +term2:bar

is there a reason why i would want to use one syntax over the other?

does the first syntax support the and operand as well as the ?

thx
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-about-syntax-for-multiple-terms-in-filter-query-tp4046442.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to set Configuration setting for search

2013-03-11 Thread Otis Gospodnetic

Hello Deepshikha,

No need for regular expressions once you index some data try using
keywords... like Google. :)

Otis
--
Solr  ElasticSearch Support
http://sematext.com/





On Mon, Mar 11, 2013 at 6:05 AM, Deepshikha Raghav raghavd...@in.ibm.comwrote:

 Hi Team ,

 In Solr how to setFREE TEXT SEARCH configuration.
 Is there any regular expression setting so that I can configure to obtain
 search results.


 With Warm Regards
 Deepshikha Raghav
 IBM , Gurgaon
 ---
 Mobile-+91-8800140037

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?

Thanks Shawn.
So if we have new segments in 4.1 format and all old files in 3.5 format at
the same time, then will it cause any performance degradation on slaves
while reading index files ( which will contain both 3.5 formatted and 4.1
formatted files)?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Upgrade-Solr3-5-to-Solr4-1-Index-Reformat-tp4046391p4046469.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?

Thanks Tomas!
I see the latest available version is 4.1 - but you have suggested a 4.2
version, where can i grab 4.2 version from?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Upgrade-Solr3-5-to-Solr4-1-Index-Reformat-tp4046391p4046471.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Dynamic schema design: feedback requested


: we needed to, we could just assert that the schema file is the
: persistence mechanism, as opposed to the system of record, hence if
: you hand edit it and then use the API to change it, your hand edit may
: be lost.  Or we may decide to do away with local FS mode altogether.

presuming that it's just a persistence mechanism, but also assuming that 
the user may edit directly, still creates burdens/complexity in when solr 
reads/writes to that file -- even if we say that user edits to that file 
might be overridden (ie: does solr garuntee if/when that the file will be 
written to if you use the REST api to modify things? -- that's going to be 
important if we let people read//edit that file)

: I guess my main point is, we shouldn't decide a priori that using the
: API means you can no longer hand edit.

and my point is we should build a feature where solr has the ability to 
read/write some piece of information, we should start with the asumption 
that it's OK for us to decide that a priori, and not walk into things 
assuming we have to support a lot of much more complicated uses cases.  if 
at some point during the implementation we find that supporting a more lax 
it's ok, you can edit this by hand approach won't be a burden, then so 
be it -- we can relax that a priori assertion.

: My thoughts on this are probably heavily influenced on how I initially

my thoughts on this are based directly on:

A) the observations of the confusion  implementation complexity 
observed in the dual nature of solr.xml over the years.

B) having spent a lot of time maintining code that did programatic 
read/writing of solr schema.xml files while also trying to treat them as 
config files that users were allowed to hand edit -- it's a pain in the 
ass.

: envisioned implementation working in cloud mode (which I thought about
: first since it's harder).  A human readable file on ZK that represents
: the system of record for the schema seemed to be the best.  I never

1) i never said the data couldn't/shouldn't be human readable -- i said it 
should be an implementation detail (ie: subject to change automaticly on 
upgrade just like hte index format), and that end users shouldn't be 
allowed to edit it arbitrarily

2) cloud mode, as i understand it, is actaully much *easier* (if you want 
to allow arbitrary user edits to these files) because you can set ZK 
watches on those nodes, so any code that is maintaining interal state 
based on them (ie: REST API round trip serialization code that just read 
the file in to modify the DOM before writing it back out) can be notified 
if the file has changed.  I also beleive i was told that writes to files
in ZK are atomic, which also means you never have to wory about reading 
partial data in the middle of someone else's write.

in the general situation of config files on disk we can't even try to 
enforce a lock file type approach, because we shouldn't assume a user will 
remember to obey our locks before editing the file.

If you  sarowe  others feel that:

1) it's important to allow arbitrary user editing of schema.xml files in 
zk mode even when REST read/writes are enabled
2) that allowing arbitrary user edits w/o risk of conflict or complexity 
in the REST read/write code is easy to implement in ZK mode
3) it's reasonable to require ZK mode in order to suppot read/write mode 
in the REST API

...that that would certainly resolve my concern's stemming from B 
above.  i'm still worried about A, but perhaps the ZK nature of things 
and the watches  atomicity provided there will reduce confusion.

But as long as we are talking about this REST api supporting reads  
writes to schema info even when running in single node mode with files on 
disk -- i think it is a *HUGE* fucking mistake to start with the 
assumption that the serialization mechanism of the REST api needs to be 
able to play nicely with arbitrary user editing of schema.xml.


-Hoss

Re: Some nodes have all the load

2013-03-11 Thread jimtronic

The load test was fairly heavy (ie lots of users) and designed to mimic a
fully operational system with lots of users doing normal things.

There were two things I gleaned from the logs:

PERFORMANCE WARNING: Overlapping onDeckSearchers=2 appeared for several of
my more active cores

and

The non-leaders were throwing errors saying that the leader as not
responding while trying to forward updates. (sorry can't find that specific
error now)

My best guess is that it has something to do with the commits.

 a. frequent user generated writes using
/update?commitWithin=500waitFlush=falsewaitSearcher=false
 b. softCommit set to 3000
 c. autoCommit set to 300,000 and openSearcher false
 d. I'm also doing frequent periodic DIH updates. I guess this is
commit=true by default.

Should I omit commitWithin and set DIH to commit=false and just let soft and
autocommit do their jobs?

Cheers,
Jim





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Some-nodes-have-all-the-load-tp4046349p4046476.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?


On 3/11/2013 3:39 PM, feroz_kh wrote:

Thanks Shawn.
So if we have new segments in 4.1 format and all old files in 3.5 format at
the same time, then will it cause any performance degradation on slaves
while reading index files ( which will contain both 3.5 formatted and 4.1
formatted files)?


There should be no performance degradation.  Solr 4.1 should perform at 
least as well as 3.5 and in many cases it will perform better.  Your 
index on disk will get smaller when converted to 4.1 format, and may 
become faster.


Thanks,
Shawn

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?


On 3/11/2013 3:43 PM, feroz_kh wrote:

Thanks Tomas!
I see the latest available version is 4.1 - but you have suggested a 4.2
version, where can i grab 4.2 version from?


It is already accessible from many mirrors.  Because it is not yet 
accessible from a large enough percentage of mirrors, the URL hasn't 
been updated on the main website yet.  Here is the URL:


http://www.apache.org/dyn/closer.cgi/lucene/solr/4.2.0

If the mirror that gets chosen for you automatically does not yet have 
it, just try another mirror.  There is no information on the download 
list about where each mirror is, so you'll just have to guess, or look 
them up to see where they are.


Thanks,
Shawn

Re: [Beginner] wants to contribute in open source project


: This article I wrote about getting started contributing to projects may give 
you some ideas.
: 
: 
http://blog.smartbear.com/software-quality/bid/167051/14-Ways-to-Contribute-to-Open-Source-without-Being-a-Programming-Genius-or-a-Rock-Star

Or pehaps even the followup i did of Andy's article layering his advice 
directly on to Solr...

http://searchhub.org/2012/03/26/14-ways-to-contribute-to-solr/




-Hoss

Re: How to Integrate Solr With Hbase

2013-03-11 Thread Bharat Mallampati

We do have same kind of scenario in our application also.

The way we are achieving it is we have a batch process to read the data
from Hbase using Hbase API  and write it to SOLR using SOLRJ API.


Thanks
Bharat



On Mon, Mar 11, 2013 at 5:38 AM, kamaci furkankam...@gmail.com wrote:

 I have crawled data into Hbase with my Nutch. How can I use Solr to index
 the
 data at Hbase? (Is there any solution from Nutch side, you are welcome)

 PS: I am new to such kind of technologies and I run Solr from under example
 folder as startup.jar



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-Integrate-Solr-With-Hbase-tp4046297.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Some nodes have all the load


On 3/11/2013 3:52 PM, jimtronic wrote:

The load test was fairly heavy (ie lots of users) and designed to mimic a
fully operational system with lots of users doing normal things.

There were two things I gleaned from the logs:

PERFORMANCE WARNING: Overlapping onDeckSearchers=2 appeared for several of
my more active cores

and

The non-leaders were throwing errors saying that the leader as not
responding while trying to forward updates. (sorry can't find that specific
error now)

My best guess is that it has something to do with the commits.

  a. frequent user generated writes using
/update?commitWithin=500waitFlush=falsewaitSearcher=false
  b. softCommit set to 3000
  c. autoCommit set to 300,000 and openSearcher false
  d. I'm also doing frequent periodic DIH updates. I guess this is
commit=true by default.

Should I omit commitWithin and set DIH to commit=false and just let soft and
autocommit do their jobs?


I've just locate a previous message on this list from Mark Miller saying 
that in Solr 4, commitWithin is a soft commit.


You should definitely wait for Mark or another committer to verify what 
I'm saying in the small novel I am writing below.


My personal opinion is that you should have frequent soft commits (auto, 
manual, commitWithin, or some combination) along with less frequent (but 
not infrequent) autoCommit with openSearcher=false.  The autoCommit 
(which is a hard commit) does two things - ensures that the transaction 
logs do not grow out of control, and persists changes to disk.  If you 
have auto soft commits and updateLog is enabled, I would say that you 
are pretty safe using commit=false on your DIH updates.


If Mark agrees with what I have said, and your config/schema checks out 
OK with expected norms, you may be running into bugs.  It might also be 
a case of not enough CPU/RAM resources for the system load.  You never 
responded in another thread with the output of the 'free' command, or 
the size of your indexes.  Putting 13 busy Solr cores onto one box is 
overkill, unless the machine has 16-32 CPU cores *and* plenty of fast 
RAM to cache all your indexes in the OS disk cache.  Based on what 
you're saying here and in the other thread, you probably need a java 
heap size of 4GB or 8GB, heavily tuned JVM garbage collection options, 
and depending on the size of your indexes, 16GB may not be enough total 
system RAM.


IMHO, you should not use trunk (5.0) for anything that you plan to one 
day run in production.  Trunk is very volatile, large-scale changes 
sometimes get committed with only minimal testing.  The dev branch named 
branch_4x (currently 4.3) is kept reasonably stable almost all of the 
time.  Version 4.2 has just been released - it is already available on 
the faster mirrors and there should be a release announcement within a 
day from now.


If this is not being set up in anticipation for a production deployment, 
then trunk would be fine, but bugs are to be expected.  If the same 
problems do not happen in 4.2 or branch_4x, then I would move the 
discussion to the dev list.


Thanks,
Shawn

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?

Thanks Tomas/Shawn!

One more question related to backward compatibilty.
Previously we had upgraded our solr master/slaves from 1.4 version to 3.5
version - We didn't reformat the whole index then. So i believe there will
be some files with 1.4 format present in our index.

Now when we upgrade from 3.5 to 4.1/or4.2  - Can we expect solr slave
version 4.x to read both 1.4 and 3.5 formatted indices, without any issues ?

Thanks,
Feroz



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Upgrade-Solr3-5-to-Solr4-1-Index-Reformat-tp4046391p4046500.html
Sent from the Solr - User mailing list archive at Nabble.com.

[ANNOUNCE] Apache Solr 4.2 released

2013-03-11 Thread Robert Muir

March 2013, Apache Solr™ 4.2 available
The Lucene PMC is pleased to announce the release of Apache Solr 4.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.2 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 4.2 Release Highlights:

* A read side REST API for the schema. Always wanted to introspect the
schema over http? Now you can. Looks like the write side will be
coming next.

* DocValues have been integrated into Solr. DocValues can be loaded up
a lot faster than the field cache and can also use different
compression algorithms as well as in RAM or on Disk representations.
Faceting, sorting, and function queries all get to benefit. How about
the OS handling faceting and sorting caches off heap? No more tuning
60 gigabyte heaps? How about a snappy new per segment DocValues
faceting method? Improved numeric faceting? Sweet.

* Collection Aliasing. Got time based data? Want to re-index in a
temporary collection and then swap it into production? Done. Stay
tuned for Shard Aliasing.

* Collection API responses. The collections API was still very new in
4.0, and while it improved a fair bit in 4.1, responses were certainly
needed, but missed the cut off. Initially, we made the decision to
make the Collection API super fault tolerant, which made responses
tougher to do. No one wants to hunt through logs files to see how
things turned out. Done in 4.2.

* Interact with any collection on any node. Until 4.2, you could only
interact with a node in your cluster if it hosted at least one replica
of the collection you wanted to query/update. No longer - query any
node, whether it has a piece of your intended collection or not and
get a proxied response.

* Allow custom shard names so that new host addresses can take over
for retired shards. Working on Amazon without elastic ips? This is for
you.

* Lucene 4.2 optimizations such as compressed term vectors.

Solr 4.2 also includes many other new features as well as numerous
optimizations and bugfixes.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases.  It is possible that the mirror you
are using may not have replicated the release yet.  If that is the
case, please try another mirror.  This also goes for Maven access.

Happy searching,
Lucene/Solr developers

Re: Dynamic schema design: feedback requested

2013-03-11 Thread Yonik Seeley

On Mon, Mar 11, 2013 at 5:51 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 : I guess my main point is, we shouldn't decide a priori that using the
 : API means you can no longer hand edit.

 and my point is we should build a feature where solr has the ability to
 read/write some piece of information, we should start with the asumption
 that it's OK for us to decide that a priori, and not walk into things
 assuming we have to support a lot of much more complicated uses cases.  if
 at some point during the implementation we find that supporting a more lax
 it's ok, you can edit this by hand approach won't be a burden, then so
 be it -- we can relax that a priori assertion.

I guess I like a more breadth-first method (or at least that's what it
feels like to me).
You keep both options in mind as you proceed, and don't start off a
hard assertion either way.
It would be nice to support editing by hand... but if it becomes too
burdensome, c'est la vie.

If the persistence format we're going to use is nicely human readable,
then I'm good.  We can disagree on philosophies, but I'm not sure that
it amounts to much in the way of concrete differences at this point.
What concerned me was talk of starting to treat this as more of a
black box.

-Yonik
http://lucidworks.com

RE: DataDirectory: relative path doesn't work

2013-03-11 Thread Patrick Mi

Thanks for fixing the wiki page http://wiki.apache.org/solr/SolrConfigXml
now it says this:
'If this directory is not absolute, then it is relative to the directory
you're in when you start SOLR.'

It will be nice if you drop me a line here after you make the change on the
document ...

-Original Message-
From: Patrick Mi [mailto:patrick...@touchpointgroup.com] 
Sent: Tuesday, 26 February 2013 5:49 p.m.
To: solr-user@lucene.apache.org
Subject: DataDirectory: relative path doesn't work 

I am running Solr4.0/Tomcat 7 on Centos6

According to this page http://wiki.apache.org/solr/SolrConfigXml if
dataDir is not absolute, then it is relative to the instanceDir of the
SolrCore.

However the index directory is always created under the directory where I
start the Tomcat (startup.sh) rather than under instanceDir of the SolrCore.

Am I doing something wrong in configuration?

Regards,
Patrick

Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?


On 3/11/2013 5:59 PM, feroz_kh wrote:

One more question related to backward compatibilty.
Previously we had upgraded our solr master/slaves from 1.4 version to 3.5
version - We didn't reformat the whole index then. So i believe there will
be some files with 1.4 format present in our index.

Now when we upgrade from 3.5 to 4.1/or4.2  - Can we expect solr slave
version 4.x to read both 1.4 and 3.5 formatted indices, without any issues ?


If you think that you've got index files from 1.4 still hanging around, 
you should optimize the indexes in 3.5 before upgrading further, to 
convert the index.  The new version will NOT read index segments that old.


Thanks,
Shawn

Re: Some nodes have all the load


On Mar 11, 2013, at 5:52 PM, jimtronic jimtro...@gmail.com wrote:

 Should I omit commitWithin and set DIH to commit=false and just let soft and
 autocommit do their jobs?

Yeah, that's one valid option. You def are not able to keep up with the current 
commit / open searcher level. It looks like DIH will do a hard commit which 
will likely open a new searcher as well - that's not good - you should stick to 
soft commits and the infrequent hard commit. Then the commitWithin is fairly 
aggressive at 500ms. Whether or not you can keep up with this varies with a lot 
of factors and features and settings - clearly you are not currently able to 
keep up.

- Mark

Re: Some nodes have all the load