Re: update to 4.3

2013-05-06 Thread Arkadi Colson

Any tips on what to do with the configuration files?
Where do I have to store them and what should they look like? Any examples?


May 07, 2013 6:16:27 AM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal 
performance in production environments was not found on the 
java.library.path: 
/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

May 07, 2013 6:16:28 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-8983"]
May 07, 2013 6:16:28 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["ajp-bio-8009"]
May 07, 2013 6:16:28 AM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 621 ms
May 07, 2013 6:16:28 AM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
May 07, 2013 6:16:28 AM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.39
May 07, 2013 6:16:28 AM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive 
/usr/local/apache-tomcat-7.0.39/webapps/solr.war
log4j:WARN No appenders could be found for logger 
(org.apache.solr.servlet.SolrDispatchFilter).

log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for 
more info.
May 07, 2013 6:16:33 AM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/host-manager
May 07, 2013 6:16:33 AM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/docs
May 07, 2013 6:16:33 AM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/manager
May 07, 2013 6:16:34 AM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/ROOT
May 07, 2013 6:16:34 AM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/examples

May 07, 2013 6:16:34 AM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-8983"]
May 07, 2013 6:16:34 AM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-8009"]
May 07, 2013 6:16:34 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 6000 ms

BR,
Arkadi

On 05/06/2013 10:13 PM, Jan Høydahl wrote:

Hi,

The reason is that from Solr 4.3 you need to provide the SLF4J logger jars of 
choice
when deploying Solr to an external servlet container.

Simplest is to copy all jars from example/lib/ext into tomcat/lib

cd solr-4.3.0/example/lib/ext
cp * /usr/local/apache-tomcat-7.0.39/lib/

Please see CHANGES.TXT for more info 
http://lucene.apache.org/solr/4_3_0/changes/Changes.html#4.3.0.upgrading_from_solr_4.2.0

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
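To the config-file question at the top of this thread: with the jars above in tomcat/lib, log4j only needs a log4j.properties on the classpath, and tomcat/lib works for that too. The 4.3.0 release ships a sample under example/resources/log4j.properties that can be copied there as a starting point; a minimal hand-written one might look like the sketch below (the log file path is only a placeholder):

log4j.rootLogger=INFO, file
# roll the log file instead of letting it grow without bound
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/solr/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n

With a config like that in place, the "log4j:WARN No appenders could be found" lines in the startup output above should go away.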

On 6 May 2013 at 16:50, Arkadi Colson wrote:


Hi

After update to 4.3 I got this error:

May 06, 2013 2:30:08 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-8983"]
May 06, 2013 2:30:08 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["ajp-bio-8009"]
May 06, 2013 2:30:08 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 610 ms
May 06, 2013 2:30:08 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
May 06, 2013 2:30:08 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.39
May 06, 2013 2:30:08 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive 
/usr/local/apache-tomcat-7.0.39/webapps/solr.war
May 06, 2013 2:30:45 PM org.apache.catalina.util.SessionIdGenerator 
createSecureRandom
INFO: Creation of SecureRandom instance for session ID generation using 
[SHA1PRNG] took [36,697] milliseconds.
May 06, 2013 2:30:45 PM org.apache.catalina.core.StandardContext startInternal
SEVERE: Error filterStart
May 06, 2013 2:30:45 PM org.apache.catalina.core.StandardContext startInternal
SEVERE: Context [/solr] startup failed due to previous errors
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/host-manager
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/docs
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/manager
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDi

Re: ConcurrentUpdateSolrServer "Missing ContentType" error on SOLR 4.2.1

2013-05-06 Thread Shawn Heisey
> I apologize for intruding, Shawn. Do you know what can cause empty params
> (i.e. params={})?

I've got no idea what is causing this problem on your system. All of the
ideas I've had so far don't seem to apply.

Can you run a packet sniffer on your client to see whether the client is
sending the right info?

Thanks,
Shawn




Re: ConcurrentUpdateSolrServer "Missing ContentType" error on SOLR 4.2.1

2013-05-06 Thread Ravi Solr
I apologize for intruding, Shawn. Do you know what can cause empty params
(i.e. params={})?

Ravi


On Mon, May 6, 2013 at 5:47 PM, Shawn Heisey  wrote:

> On 5/6/2013 1:25 PM, cleardot wrote:
>
>> My SolrJ client uses ConcurrentUpdateSolrServer to index > 50Gs of docs
>> to a
>> SOLR 3.6 instance on my Linux box.  When running the same client against
>> SOLR 4.2.1 on EC2 I got the following:
>>
>
> 
>
>
>  SOLR 4.2.1 log error
>> ==
>> INFO: [mycore] webapp=/solr path=/update params={} {} 0 0
>> May 6, 2013 6:13:55 PM org.apache.solr.common.SolrException log
>> SEVERE: org.apache.solr.common.SolrException: Missing ContentType
>>
>
> This isn't the first time I've seen empty params in a Solr log on this
> list, but the other one was with 3.6.2 for both server and client.  Is
> "params={}" what actually got logged, or did you remove the stuff there to
> sanitize your logs on a public list?
>
> Are you by chance setting the response parser on your solr server object
> to something besides the Binary (javabin) parser?  If you are, could you
> remove the setParser call in your client code?  The only time you need to
> change the parser is when you're using SolrJ with a version of Solr that
> does not have the same javabin version.  The javabin version was v1 in Solr
> 1.4.1 and earlier, then v2 in 3.1.0 and later.  The other response parsers
> are less efficient than javabin.
>
> Thanks,
> Shawn
>
>


Re: What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

2013-05-06 Thread Alexandre Rafalovitch
Has this logic (default constructor or version flag) changed due to
LUCENE-4877? I reran my tool and suddenly a huge number of Factories
acquired a new constructor (e.g. MappingCharFilterFactory).

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Mar 6, 2013 at 2:49 PM, Chris Hostetter
 wrote:
>
> : *) Have a default empty constructor
> :
> : My preliminary tests seem to indicate this is the case. Am I missing
> : anything.
>
> Any analyzer that has an empty constructor *or* a constructor that takes in
> a Lucene "Version" object may be specified.
>
> I've updated the wiki to make this more clear...
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
>
> For CharFilters, Tokenizers, TokenFilters: they must have a factory of the
> appropriate type (CharFilterFactory, TokenizerFactory, TokenFilterFactory)
>
>
> -Hoss


RE: Solr Cloud with large synonyms.txt

2013-05-06 Thread David Parks
Wouldn't it make more sense to only store a pointer to a synonyms file in
zookeeper? Maybe just make the synonyms file accessible via http so other
boxes can copy it if needed? Zookeeper was never meant for storing
significant amounts of data.


-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com] 
Sent: Tuesday, May 07, 2013 4:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud with large synonyms.txt

See discussion here
http://lucene.472066.n3.nabble.com/gt-1MB-file-to-Zookeeper-td3958614.html

One idea was compression. Perhaps if we add gzip support to SynonymFilter it
can read synonyms.txt.gz which would then fit larger raw dicts?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 6 May 2013 at 18:32, Son Nguyen wrote:

> Hello,
> 
> I'm building a Solr Cloud (version 4.1.0) with 2 shards and a ZooKeeper
(the ZooKeeper is on a different machine, version 3.4.5).
> I've tried to start with a 1.7MB synonyms.txt, but got a
"ConnectionLossException":
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /configs/solr1/synonyms.txt
>at
org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>at
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>at
org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:270)
>at
org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:267)
>at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>at
org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:267)
>at
org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:436)
>at
org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:315)
>at
org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:1135)
>at
org.apache.solr.cloud.ZkController.uploadConfigDir(ZkController.java:955)
>at
org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:285)
>... 43 more
> 
> I did some research on the internet and found out that it is because the ZooKeeper
znode size limit is 1MB. I tried to increase the system property
"jute.maxbuffer" but it didn't work.
> Does anyone have experience of dealing with it?
> 
> Thanks,
> Son



Re: Questions about the performance of Solr

2013-05-06 Thread Mikhail Khludnev
Hello,

start from http://wiki.apache.org/solr/CommonQueryParameters#fq
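As a rough illustration only (the field names come from the query quoted below, the values are placeholders): the clauses that stay the same across many searches can be moved into fq parameters, which Solr caches separately from the main query. With SolrJ that might look like:

SolrQuery q = new SolrQuery("message1:message OR message2:message");      // the part that varies
q.addFilterQuery("time:[2013-01-01T00:00:00Z TO 2013-02-01T00:00:00Z]");  // cached filter query
q.addFilterQuery("category:(1 OR 2)");
q.setRows(10);
QueryResponse rsp = solrServer.query(q);  // solrServer is an already configured SolrServer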




On Mon, May 6, 2013 at 11:42 AM, joo  wrote:

> Search speed has dropped now that more than seventy million documents are
> loaded; a query takes about 50 seconds, and I cannot tell whether that is
> simply expected at this size.
> I would like to know whether the problem is with the query I use and how to
> optimize it in Solr.
> The query I use is, for example:
> time:[time TO time] AND category:(1,2) AND (message1:message OR
> message2:message)
> If the query itself is not the problem, please advise which part I should
> look at.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Questions-about-the-performance-of-Solr-tp4060988.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Indexing off of the production servers

2013-05-06 Thread Erick Erickson
Nope. There is no replication, as in replication of the indexed
document in the normal flow. The _raw_ document is forwarded to all
replicas and upon return from the replicas, the raw document has been
written to each individual transaction log on each replica.
"replication" implies the _indexed_ form of the document is what's
forwarded to the replicas, and that's not the case.

It's somewhat confusing, but _if_ a replica goes down, when it comes
back up if it's "too far" out of date then an old-style replication of
the whole index is performed. But absent that it's all raw documents
forwarded to replicas from the leader.

Otherwise, how could you hope that a replica could take over without
loss of data? The leader could have gone down before it forwarded the
docs but after it responded to the client.

Best
Erick

On Mon, May 6, 2013 at 10:43 AM, Furkan KAMACI  wrote:
> Hi Erick;
>
> Thanks for your answer. I have read that at somewhere:
>
> I believe "redirect" from replica to leader would happen only at
> index time, so a doc first gets indexed to leader and from there it's
> replicated to non-leader shards.
>
> Is that true? I want to get this clear in my mind; otherwise I will ask a
> separate question about what happens for indexing and querying in
> SolrCloud.
>
> 2013/5/6 Shawn Heisey 
>
>> On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote:
>> > Excellent idea !
>> > And it is possible to use collection aliasing with the CREATEALIAS to
>> > make this transparent for the query side.
>> >
>> > ex. with 2 collections named :
>> > collection_1
>> > collection_2
>> >
>> >
>> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
>> >
>> > "collectionalias" is now a virtual collection pointing to collection_1.
>> >
>> > Index on collection_2, then :
>> >
>> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
>> >
>> > "collectionalias" now is an alias to collection_2.
>> >
>> >
>> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
>>
>> Awesome idea, Andre! I was wondering whether you might have to delete
>> the original alias before creating the new one, but a quick look at the
>> issue for collection aliasing shows that this isn't the case.
>>
>> https://issues.apache.org/jira/browse/SOLR-4497
>>
>> The wiki doesn't mention the DELETEALIAS action.  I won't have time
>> right now to update the wiki.
>>
>> Thanks,
>> Shawn
>>
>>
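For completeness, the delete side of the Collections API should take the same shape as the CREATEALIAS calls quoted above (hedging slightly, since the exact call is not spelled out in this thread):

/collections?action=DELETEALIAS&name=collectionalias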


Re: Is there a way to remove caches in SOLR?

2013-05-06 Thread Shawn Heisey

On 5/6/2013 5:38 PM, bbarani wrote:

I am trying to create performance metrics for SOLR. I don't want the searcher
to warm up when I issue a query since I am trying to collect metrics for
cold search. Is there a way to disable warming?


Set the autowarmCount to 0 in each of the cache definitions.  That will 
prevent a new searcher from being warmed up when you commit.  To 
completely disable the cache, set the size to 0 as well.


You'll also want to remove the newSearcher and firstSearcher pieces from
the <query> section of solrconfig.xml for testing without cache warming.
 The example config has some uncommented config for these.  The example 
queries you'll find there are unlikely to return results, but just 
running the search can pre-warm things.


This may be something you already know, but you'll want to have 
benchmark data with the caches and the warming enabled as well as data 
without it, as that is how things will likely run in the real world.


Thanks,
shawn



Re: Is there a way to remove caches in SOLR?

2013-05-06 Thread varun srivastava
make size 0


On Mon, May 6, 2013 at 4:38 PM, bbarani  wrote:

> I am trying to create performance metrics for SOLR. I don't want the
> searcher
> to warm up when I issue a query since I am trying to collect metrics for
> cold search. Is there a way to disable warming?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-there-a-way-to-remove-caches-in-SOLR-tp4061216.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Is there a way to remove caches in SOLR?

2013-05-06 Thread bbarani
I am trying to create performance metrics for SOLR. I don't want the searcher
to warm up when I issue a query since I am trying to collect metrics for
cold search. Is there a way to disable warming?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-remove-caches-in-SOLR-tp4061216.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: ConcurrentUpdateSolrServer "Missing ContentType" error on SOLR 4.2.1

2013-05-06 Thread Shawn Heisey

On 5/6/2013 4:06 PM, cleardot wrote:

Shawn,

I didn't sanitize the log other than the ec2 servername.  The constructor is

ConcurrentUpdateSolrServer solrServer
  = new ConcurrentUpdateSolrServer(solrUrl,
solrBufferCount, solrThreadCount);

and I don't use setParser at all.

But the SolrJ client is using apache-solr-core-3.6.1.jar and
apache-solr-solrj-3.6.1.jar while the server is 4.2.1.  Maybe I do need to
use setParser?


The javabin versions between those two are compatible, but the older 
SolrJ may not send a content type when using javabin, and the newer Solr 
seems to require it.  Perhaps if you do change the parser, it might 
force the issue.  XML isn't quite as efficient as javabin, but it's not 
like it will be super slow either.  Add the following after creating the 
server object:


solrServer.setParser(new XMLResponseParser());

This probably will require adding an import:

import org.apache.solr.client.solrj.impl.XMLResponseParser;

You also might try upgrading SolrJ on your client app.  I'm using SolrJ 
4.2.1 with two different server versions - 3.5.0 and 4.2.1.  It works 
perfectly.


Thanks,
Shawn



Re: ConcurrentUpdateSolrServer "Missing ContentType" error on SOLR 4.2.1

2013-05-06 Thread cleardot
Shawn,

I didn't sanitize the log other than the ec2 servername.  The constructor is

   ConcurrentUpdateSolrServer solrServer
 = new ConcurrentUpdateSolrServer(solrUrl,
solrBufferCount, solrThreadCount);

and I don't use setParser at all.

But the SolrJ client is using apache-solr-core-3.6.1.jar and
apache-solr-solrj-3.6.1.jar while the server is 4.2.1.  Maybe I do need to
use setParser?

DK




--
View this message in context: 
http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrServer-Missing-ContentType-error-on-SOLR-4-2-1-tp4061160p4061197.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: ConcurrentUpdateSolrServer "Missing ContentType" error on SOLR 4.2.1

2013-05-06 Thread Shawn Heisey

On 5/6/2013 1:25 PM, cleardot wrote:

My SolrJ client uses ConcurrentUpdateSolrServer to index > 50Gs of docs to a
SOLR 3.6 instance on my Linux box.  When running the same client against
SOLR 4.2.1 on EC2 I got the following:





SOLR 4.2.1 log error
==
INFO: [mycore] webapp=/solr path=/update params={} {} 0 0
May 6, 2013 6:13:55 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Missing ContentType


This isn't the first time I've seen empty params in a Solr log on this 
list, but the other one was with 3.6.2 for both server and client.  Is 
"params={}" what actually got logged, or did you remove the stuff there 
to sanitize your logs on a public list?


Are you by chance setting the response parser on your solr server object 
to something besides the Binary (javabin) parser?  If you are, could you 
remove the setParser call in your client code?  The only time you need 
to change the parser is when you're using SolrJ with a version of Solr 
that does not have the same javabin version.  The javabin version was v1 
in Solr 1.4.1 and earlier, then v2 in 3.1.0 and later.  The other 
response parsers are less efficient than javabin.


Thanks,
Shawn



Re: List of Solr Query Parsers

2013-05-06 Thread Jan Høydahl
Added. Please try editing the page now.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 6 May 2013 at 19:58, Roman Chyla wrote:

> Hi Jan,
> My login is RomanChyla
> Thanks,
> 
> Roman
> On 6 May 2013 10:00, "Jan Høydahl"  wrote:
> 
>> Hi Roman,
>> 
>> This sounds great! Please register as a user on the WIKI and give us your
>> username here, then we'll grant you editing karma so you can edit the page
>> yourself! The NEAR/5 syntax is really something I think we should get into
>> the default lucene parser. Can't wait to have a look at your code.
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>> On 6 May 2013 at 15:41, Roman Chyla wrote:
>> 
>>> Hi Jan,
>>> Please add this one
>> http://29min.wordpress.com/category/antlrqueryparser/
>>> - I can't edit the wiki
>>> 
>>> This parser is written with ANTLR and on top of lucene modern query
>> parser.
>>> There is a version which implements Lucene standard QP as well as a
>> version
>>> which includes proximity operators, multi token synonym handling and all
>> of
>> solr qparsers using function syntax - i.e. for a query like: multi
>> synonym
>>> NEAR/5 edismax(foo)
>>> 
>>> I would like to create a JIRA ticket soon
>>> 
>>> Thanks
>>> 
>>> Roman
>>> On 6 May 2013 09:21, "Jan Høydahl"  wrote:
>>> 
 Hi,
 
 I just added a Wiki page to try to gather a list of all known Solr query
 parsers in one place, both those which are part of Solr and those in
>> JIRA
 or 3rd party.
 
 http://wiki.apache.org/solr/QueryParser
 
 If you known about other cool parsers out there, please add to the list.
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
 
>> 
>> 



Re: Solr Cloud with large synonyms.txt

2013-05-06 Thread Jan Høydahl
See discussion here 
http://lucene.472066.n3.nabble.com/gt-1MB-file-to-Zookeeper-td3958614.html

One idea was compression. Perhaps if we add gzip support to SynonymFilter it 
can read synonyms.txt.gz which would then fit larger raw dicts?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 6 May 2013 at 18:32, Son Nguyen wrote:

> Hello,
> 
> I'm building a Solr Cloud (version 4.1.0) with 2 shards and a ZooKeeper (the
> ZooKeeper is on a different machine, version 3.4.5).
> I've tried to start with a 1.7MB synonyms.txt, but got a 
> "ConnectionLossException":
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for /configs/solr1/synonyms.txt
>at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:270)
>at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:267)
>at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:267)
>at 
> org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:436)
>at 
> org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:315)
>at 
> org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:1135)
>at 
> org.apache.solr.cloud.ZkController.uploadConfigDir(ZkController.java:955)
>at 
> org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:285)
>... 43 more
> 
> I did some research on the internet and found out that it is because the ZooKeeper znode
> size limit is 1MB. I tried to increase the system property "jute.maxbuffer"
> but it didn't work.
> Does anyone have experience of dealing with it?
> 
> Thanks,
> Son



Open position: Senior Information Retrieval Engineer, Zurich, Switzerland

2013-05-06 Thread Toan V Luu
We are looking for an engineer with a strong background in information
retrieval and the Solr/Lucene platform. Native German or French speakers
are preferred. Please contact us if you are interested in this position:
http://local-ch.github.io/senior-ir-engineer.html.
Thanks.
Toan Luu.


ConcurrentUpdateSolrServer "Missing ContentType" error on SOLR 4.2.1

2013-05-06 Thread cleardot
My SolrJ client uses ConcurrentUpdateSolrServer to index > 50Gs of docs to a
SOLR 3.6 instance on my Linux box.  When running the same client against
SOLR 4.2.1 on EC2 I got the following:


SolrJ client error

request: http://ec2-103-x-x-x.compute-3.amazonaws.com/solr/mycore/update
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:189)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
at java.lang.Thread.run(Thread.java:595)
12593 [pool-1-thread-2] INFO
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer - finished:
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner@7eb1cc87
12593 [pool-1-thread-5] INFO
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer - Status for:
doc-123b166-a2fd-11e0-94b3-842b2b170032 is 400
12593 [pool-1-thread-5] ERROR
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer - error
java.lang.Exception: Bad Request



SOLR 4.2.1 log error
==
INFO: [mycore] webapp=/solr path=/update params={} {} 0 0
May 6, 2013 6:13:55 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Missing ContentType
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:78)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:679)



I have other SolrJ clients using the 3.6 Solr jars and CommonsHttpSolrServer
to index to SOLR 4.2.1 with no problem; the issue seems to be with
ConcurrentUpdateSolrServer.

UpdateRequestHandler.java seems to be the source of the error.

  

I'm wondering if the issue is with the 4.2.1 solrconfig settings for the
/update request handler, currently just

 

Also my ConcurrentUpdateSolrServer constructor does not specify an http
client; I made a few failed attempts at setting the content type that way.

any help appreciated!

DK



--
View this message in context: 
http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrServer-Missing-ContentType-error-on-SOLR-4-2-1-tp4061160.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why is SolrCloud doing a full copy of the index?

2013-05-06 Thread Otis Gospodnetic
Hi,

I just looked at SPM monitoring we have for Solr servers that run
search-lucene.com.  One of them has 1-2 collections/minute.  Another
one closer to 10.  These are both small servers with small JVM heaps.
Here is a graph of one of them:

https://apps.sematext.com/spm/s/104ppwguao

Just looked at some other Java servers we have running, not Solr, and
I see close to 60 small collections per minute.

So these numbers will vary a lot depending on the heap size and other
JVM settings, as well as the actual code/usage. :)

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Mon, May 6, 2013 at 4:39 PM, Shawn Heisey  wrote:
> On 5/6/2013 1:39 PM, Michael Della Bitta wrote:
>>
>> Hi Shawn,
>>
>> Thanks a lot for this entry!
>>
>> I'm wondering, when you say "Garbage collections that happen more often
>> than ten or so times per minute may be an indication that the heap size is
>> too small," do you mean *any* collections, or just full collections?
>
>
> My gut reaction is any collection, but in extremely busy environments a rate
> of ten per minute might be a very slow day on a setup that's working
> perfectly.
>
> As I wrote that particular bit, I was thinking that any number I put there
> was probably wrong for some large subset of users, but I wanted to finish
> putting down my thoughts and improve it later.
>
> Thanks,
> Shawn
>


Re: Why is SolrCloud doing a full copy of the index?

2013-05-06 Thread Shawn Heisey

On 5/6/2013 1:39 PM, Michael Della Bitta wrote:

Hi Shawn,

Thanks a lot for this entry!

I'm wondering, when you say "Garbage collections that happen more often
than ten or so times per minute may be an indication that the heap size is
too small," do you mean *any* collections, or just full collections?


My gut reaction is any collection, but in extremely busy environments a 
rate of ten per minute might be a very slow day on a setup that's 
working perfectly.


As I wrote that particular bit, I was thinking that any number I put 
there was probably wrong for some large subset of users, but I wanted to 
finish putting down my thoughts and improve it later.


Thanks,
Shawn



Re: Query Elevation exception on shard queries

2013-05-06 Thread varun srivastava
Thanks Ravi. So then it is a bug.


On Mon, May 6, 2013 at 12:04 PM, Ravi Solr  wrote:

> Varun,
>  Since our cores were totally disjoint i.e. they pertain to two
> different applications which may or may not have results for a given query,
> we moved the elevation outside of solr into our java code. As long as both
> cores had some results to return for a given query elevation would work.
>
> Thanks,
>
> Ravi
>
>
> On Sat, May 4, 2013 at 1:54 PM, varun srivastava  >wrote:
>
> > Hi Ravi,
> >  I am getting the same problem. Did you get any solution?
> >
> > Thanks
> > Varun
> >
> >
> > On Fri, Mar 29, 2013 at 11:48 AM, Ravi Solr  wrote:
> >
> > > Hello,
> > >   We have a Solr 3.6.2 multicore setup, where each core is a
> complete
> > > index for one application. In our site search we use sharded query to
> > query
> > > two cores at a time. The issue is, if one core has docs but the other core
> > > doesn't for an elevated query, solr is throwing a 500 error. I would
> > really
> > > appreciate it if somebody can point me in the right direction on how to
> > > avoid this error, the following is my query
> > >
> > >
> > >
> >
> [#|2013-03-29T13:44:55.609-0400|INFO|sun-appserver2.1|org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=httpSSLWorkerThread-9001-0;|[core1]
> > > webapp=/solr path=/select/
> > >
> > >
> >
> params={q=civil+war&start=0&rows=10&shards=localhost:/solr/core1,localhost:/solr/core2&hl=true&hl.fragsize=0&hl.snippets=5&hl.simple.pre=&hl.simple.post=&hl.fl=body&fl=*&facet=true&facet.field=type&facet.mincount=1&facet.method=enum&fq=pubdate:[2005-01-01T00:00:00Z+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+24+Hours"}pubdate:[NOW/DAY-1DAY+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+7+Days"}pubdate:[NOW/DAY-7DAYS+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+60+Days"}pubdate:[NOW/DAY-60DAYS+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+12+Months"}pubdate:[NOW/DAY-1YEAR+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"All+Since+2005"}pubdate:[*+TO+NOW/DAY%2B1DAY]}
> > > status=500 QTime=15 |#]
> > >
> > >
> > > As you can see the 2 cores are core1 and core2. The core1 has data for
> the
> > > query 'civil war' however core2 doesn't have any data. We have the
> 'civil
> > > war' in the elevate.xml which causes Solr to throw a SolrException as
> > > follows. However if I remove the elevate entry for this query,
> everything
> > > works well.
> > >
> > > *type* Status report
> > >
> > > *message*Index: 1, Size: 0 java.lang.IndexOutOfBoundsException: Index:
> 1,
> > > Size: 0 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at
> > > java.util.ArrayList.get(ArrayList.java:322) at
> > > org.apache.solr.common.util.NamedList.getVal(NamedList.java:137) at
> > >
> > >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue$ShardComparator.sortVal(ShardDoc.java:221)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue$2.compare(ShardDoc.java:260)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:160)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:101)
> > > at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:223)
> at
> > > org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:132) at
> > >
> > >
> >
> org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:148)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:786)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:587)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:566)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:283)
> > > at
> > >
> > >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376) at
> > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> > > at
> > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> > > at
> > >
> > >
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
> > > at
> > >
> > >
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
> > > at
> > >
> > >
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:313)
> > > at
> > >
> > >
> >
> org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:287)
> > > at
> > >
> > >
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:218)
> > > at
> > >
> > >
> >
> org.apache.catalina.core.StandardPipeline.doInvoke(StandardPi

Re: update to 4.3

2013-05-06 Thread Jan Høydahl
Hi,

The reason is that from Solr 4.3 you need to provide the SLF4J logger jars of 
choice
when deploying Solr to an external servlet container.

Simplest is to copy all jars from example/lib/ext into tomcat/lib

cd solr-4.3.0/example/lib/ext
cp * /usr/local/apache-tomcat-7.0.39/lib/

Please see CHANGES.TXT for more info 
http://lucene.apache.org/solr/4_3_0/changes/Changes.html#4.3.0.upgrading_from_solr_4.2.0

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 6 May 2013 at 16:50, Arkadi Colson wrote:

> Hi
> 
> After update to 4.3 I got this error:
> 
> May 06, 2013 2:30:08 PM org.apache.coyote.AbstractProtocol init
> INFO: Initializing ProtocolHandler ["http-bio-8983"]
> May 06, 2013 2:30:08 PM org.apache.coyote.AbstractProtocol init
> INFO: Initializing ProtocolHandler ["ajp-bio-8009"]
> May 06, 2013 2:30:08 PM org.apache.catalina.startup.Catalina load
> INFO: Initialization processed in 610 ms
> May 06, 2013 2:30:08 PM org.apache.catalina.core.StandardService startInternal
> INFO: Starting service Catalina
> May 06, 2013 2:30:08 PM org.apache.catalina.core.StandardEngine startInternal
> INFO: Starting Servlet Engine: Apache Tomcat/7.0.39
> May 06, 2013 2:30:08 PM org.apache.catalina.startup.HostConfig deployWAR
> INFO: Deploying web application archive 
> /usr/local/apache-tomcat-7.0.39/webapps/solr.war
> May 06, 2013 2:30:45 PM org.apache.catalina.util.SessionIdGenerator 
> createSecureRandom
> INFO: Creation of SecureRandom instance for session ID generation using 
> [SHA1PRNG] took [36,697] milliseconds.
> May 06, 2013 2:30:45 PM org.apache.catalina.core.StandardContext startInternal
> SEVERE: Error filterStart
> May 06, 2013 2:30:45 PM org.apache.catalina.core.StandardContext startInternal
> SEVERE: Context [/solr] startup failed due to previous errors
> May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /usr/local/apache-tomcat-7.0.39/webapps/host-manager
> May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /usr/local/apache-tomcat-7.0.39/webapps/docs
> May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /usr/local/apache-tomcat-7.0.39/webapps/manager
> May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /usr/local/apache-tomcat-7.0.39/webapps/ROOT
> May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory 
> /usr/local/apache-tomcat-7.0.39/webapps/examples
> May 06, 2013 2:30:45 PM org.apache.coyote.AbstractProtocol start
> INFO: Starting ProtocolHandler ["http-bio-8983"]
> May 06, 2013 2:30:45 PM org.apache.coyote.AbstractProtocol start
> INFO: Starting ProtocolHandler ["ajp-bio-8009"]
> May 06, 2013 2:30:45 PM org.apache.catalina.startup.Catalina start
> INFO: Server startup in 37541 ms
> 
> Any idea?
> 
> -- 
> Kind regards
> 
> Arkadi Colson
> 
> Smartbit bvba • Hoogstraat 13 • 3670 Meeuwen
> T +32 11 64 08 80 • F +32 11 64 08 81
> 



Re: iterate through each document in Solr

2013-05-06 Thread Dmitry Kan
Hi Ming,

Quoting my answer on a different thread (
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201210.mbox/%3ccaonbidbuzzsaqctdhtlxlgeoori_ghrjbt-84bm0zb-fsps...@mail.gmail.com%3E
):

> > [code]
> > Directory indexDir = FSDirectory.open(new File(pathToDir));
> > IndexReader input = IndexReader.open(indexDir, true);
> >
> > FieldSelector fieldSelector = new SetBasedFieldSelector(
> > null, // to retrieve all stored fields
> > Collections.emptySet());
> >
> > int maxDoc = input.maxDoc();
> > for (int i = 0; i < maxDoc; i++) {
> > if (input.isDeleted(i)) {
> > // deleted document found, retrieve it
> > Document document = input.document(i, fieldSelector);
> > // analyze its field values here...
> > }
> > }
> > [/code]

Have a look here for the code of a complete standalone example. It does a
different thing with the Lucene index, so *do not* run it on your
index.

Dmitry
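For reference, a rough Lucene 4.x equivalent of the quoted 3.x snippet above; the path and field handling are placeholders:

// Open the index and walk every live (non-deleted) document.
DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(pathToDir)));
Bits liveDocs = MultiFields.getLiveDocs(reader);  // null when the index has no deletions
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i)) {
        continue;                                 // skip deleted documents
    }
    Document doc = reader.document(i);            // loads the stored fields
    // ... use doc.get("your_field") here ...
}
reader.close();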



On Mon, May 6, 2013 at 7:36 PM, Mingfeng Yang  wrote:

> Hi Dmitry,
>
> My index is not sharded, and since its size is so big, sharding won't help
> much on the paging issue.  Do you know any API which can help read from
> the lucene binary index directly? It would be nice if we could just scan
> through the docs directly.
>
> Thanks!
> Ming-
>
>
> On Mon, May 6, 2013 at 3:33 AM, Dmitry Kan  wrote:
>
> > Are you doing it once? Is your index sharded? If so, can you ask each
> shard
> > individually?
> > Another way would be to do it on Lucene level, i.e. read from the binary
> > indices (API exists).
> >
> > Dmitry
> >
> >
> > On Mon, May 6, 2013 at 5:48 AM, Mingfeng Yang 
> > wrote:
> >
> > > Dear Solr Users,
> > >
> > > Does anyone know what is the best way to iterate through each document
> > in a
> > > Solr index with billion entries?
> > >
> > > I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each
> time
> > > and then change start value, but it got very slow after getting through
> > > about 10 million docs.
> > >
> > > Thanks,
> > > Ming-
> > >
> >
>


Re: Why is SolrCloud doing a full copy of the index?

2013-05-06 Thread Michael Della Bitta
Hi Shawn,

Thanks a lot for this entry!

I'm wondering, when you say "Garbage collections that happen more often
than ten or so times per minute may be an indication that the heap size is
too small," do you mean *any* collections, or just full collections?


Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Sat, May 4, 2013 at 1:55 PM, Shawn Heisey  wrote:

> On 5/4/2013 11:45 AM, Shawn Heisey wrote:
> > Advance warning: this is a long reply.
>
> I have condensed some relevant performance problem information into the
> following wiki page:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Anyone who has additional information for this page, feel free to add
> it.  I hope I haven't made too many mistakes!
>
> Thanks,
> Shawn
>
>


Re: Query Elevation exception on shard queries

2013-05-06 Thread Ravi Solr
Varun,
 Since our cores were totally disjoint i.e. they pertain to two
different applications which may or may not have results for a given query,
we moved the elevation outside of solr into our java code. As long as both
cores had some results to return for a given query elevation would work.

Thanks,

Ravi


On Sat, May 4, 2013 at 1:54 PM, varun srivastava wrote:

> Hi Ravi,
>  I am getting the same problem. Did you get any solution?
>
> Thanks
> Varun
>
>
> On Fri, Mar 29, 2013 at 11:48 AM, Ravi Solr  wrote:
>
> > Hello,
> >   We have a Solr 3.6.2 multicore setup, where each core is a complete
> > index for one application. In our site search we use sharded query to
> query
> > two cores at a time. The issue is, If one core has docs but other core
> > doesn't for an elevated query solr is throwing a 500 error. I woudl
> really
> > appreciate it if somebody can point me in the right direction on how to
> > avoid this error, the following is my query
> >
> >
> >
> [#|2013-03-29T13:44:55.609-0400|INFO|sun-appserver2.1|org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=httpSSLWorkerThread-9001-0;|[core1]
> > webapp=/solr path=/select/
> >
> >
> params={q=civil+war&start=0&rows=10&shards=localhost:/solr/core1,localhost:/solr/core2&hl=true&hl.fragsize=0&hl.snippets=5&hl.simple.pre=&hl.simple.post=&hl.fl=body&fl=*&facet=true&facet.field=type&facet.mincount=1&facet.method=enum&fq=pubdate:[2005-01-01T00:00:00Z+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+24+Hours"}pubdate:[NOW/DAY-1DAY+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+7+Days"}pubdate:[NOW/DAY-7DAYS+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+60+Days"}pubdate:[NOW/DAY-60DAYS+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"Past+12+Months"}pubdate:[NOW/DAY-1YEAR+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3D"All+Since+2005"}pubdate:[*+TO+NOW/DAY%2B1DAY]}
> > status=500 QTime=15 |#]
> >
> >
> > As you can see the 2 cores are core1 and core2. The core1 has data for the
> > query 'civil war' however core2 doesn't have any data. We have the 'civil
> > war' in the elevate.xml which causes Solr to throw a SolrException as
> > follows. However if I remove the elevate entry for this query, everything
> > works well.
> >
> > *type* Status report
> >
> > *message*Index: 1, Size: 0 java.lang.IndexOutOfBoundsException: Index: 1,
> > Size: 0 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at
> > java.util.ArrayList.get(ArrayList.java:322) at
> > org.apache.solr.common.util.NamedList.getVal(NamedList.java:137) at
> >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue$ShardComparator.sortVal(ShardDoc.java:221)
> > at
> >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue$2.compare(ShardDoc.java:260)
> > at
> >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:160)
> > at
> >
> >
> org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:101)
> > at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:223) at
> > org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:132) at
> >
> >
> org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:148)
> > at
> >
> >
> org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:786)
> > at
> >
> >
> org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:587)
> > at
> >
> >
> org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:566)
> > at
> >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:283)
> > at
> >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376) at
> >
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> > at
> >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> > at
> >
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
> > at
> >
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
> > at
> >
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:313)
> > at
> >
> >
> org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:287)
> > at
> >
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:218)
> > at
> >
> >
> org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:648)
> > at
> >
> >
> org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:593)
> > at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:94) at
> >
> >
> com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:98)
> > at
> >
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:

Re: Solr 4.3 and SLF4j

2013-05-06 Thread Mark Miller
You need all the same jars that are in the lib/ext folder of the default jetty 
distribution. Those are the logging jars, those are what you need. All you can 
do is swap out impls (see the SLF4j documentation). You must have all those 
jars as a start, and if you don't want to use log4j, you can switch impls. 

- Mark
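(For reference, and give or take exact version numbers, the lib/ext folder of the 4.3.0 binary release holds slf4j-api, slf4j-log4j12, jcl-over-slf4j, jul-to-slf4j and log4j 1.2.x, so those are the artifacts to look for when mapping this onto Ubuntu packages.)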

On May 6, 2013, at 1:55 PM, Jonatan Fournier  wrote:

> Hi,
> 
> I've read from http://wiki.apache.org/solr/SolrLogging that Solr no longer
> ships with Logging jars bundled into the WAR file.
> 
> For simplicity in package management, other than Solr, I'm trying to stay
> with stock packages from Ubuntu 12.04 (e.g. Tomcat7 etc.)
> 
> Now I'm trying to find out what do I need to install to meet the Solr
> Logging requirements, using Ubuntu packages if possible at all.
> 
> Initially I thought having 'libslf4j-java' would be enough but that still
> gave me that Tomcat 7 error at startup:
> 
> May 06, 2013 1:28:00 PM org.apache.catalina.core.StandardContext filterStart
> SEVERE: Exception starting filter SolrRequestFilter
> org.apache.solr.common.SolrException: Could not find necessary SLF4j
> logging jars. If using Jetty, the SLF4j logging jars need to go in the
> jetty lib/ext directory. For other containers, the corresponding directory
> should be used. For more information, see:
> http://wiki.apache.org/solr/SolrLogging
> at
> org.apache.solr.servlet.SolrDispatchFilter.(SolrDispatchFilter.java:105)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> at java.lang.Class.newInstance0(Class.java:374)
> at java.lang.Class.newInstance(Class.java:327)
> at
> org.apache.catalina.core.DefaultInstanceManager.newInstance(DefaultInstanceManager.java:125)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:256)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
> at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:103)
> at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
> at
> org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
> at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
> at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
> at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
> at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
> at
> org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:649)
> at
> org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1581)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
> at
> org.apache.solr.servlet.SolrDispatchFilter.(SolrDispatchFilter.java:103)
> ... 24 more
> Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
> at
> org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1701)
> at
> org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1546)
> ... 25 more
> 
> Anybody testing 4.3 on Tomcat at the moment? Any help would be appreciated
> related to Tomcat configuration etc.
> 
> Cheers,
> 
> /jonatan



Re: List of Solr Query Parsers

2013-05-06 Thread Roman Chyla
Hi Jan,
My login is RomanChyla
Thanks,

Roman
On 6 May 2013 10:00, "Jan Høydahl"  wrote:

> Hi Roman,
>
> This sounds great! Please register as a user on the WIKI and give us your
> username here, then we'll grant you editing karma so you can edit the page
> yourself! The NEAR/5 syntax is really something I think we should get into
> the default lucene parser. Can't wait to have a look at your code.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 6 May 2013 at 15:41, Roman Chyla wrote:
>
> > Hi Jan,
> > Please add this one
> http://29min.wordpress.com/category/antlrqueryparser/
> > - I can't edit the wiki
> >
> > This parser is written with ANTLR and on top of lucene modern query
> parser.
> > There is a version which implements Lucene standard QP as well as a
> version
> > which includes proximity operators, multi token synonym handling and all
> of
> > solr qparsers using function syntax - i.e. for a query like: multi
> synonym
> > NEAR/5 edismax(foo)
> >
> > I would like to create a JIRA ticket soon
> >
> > Thanks
> >
> > Roman
> > On 6 May 2013 09:21, "Jan Høydahl"  wrote:
> >
> >> Hi,
> >>
> >> I just added a Wiki page to try to gather a list of all known Solr query
> >> parsers in one place, both those which are part of Solr and those in
> JIRA
> >> or 3rd party.
> >>
> >>  http://wiki.apache.org/solr/QueryParser
> >>
> >> If you known about other cool parsers out there, please add to the list.
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >>
> >>
>
>


Solr 4.3 and SLF4j

2013-05-06 Thread Jonatan Fournier
Hi,

I've read from http://wiki.apache.org/solr/SolrLogging that Solr no longer
ships with Logging jars bundled into the WAR file.

For simplicity in package management, other than Solr, I'm trying to stay
with stock packages from Ubuntu 12.04 (e.g. Tomcat7 etc.)

Now I'm trying to find out what do I need to install to meet the Solr
Logging requirements, using Ubuntu packages if possible at all.

Initially I thought having 'libslf4j-java' would be enough but that still
gave me that Tomcat 7 error at startup:

May 06, 2013 1:28:00 PM org.apache.catalina.core.StandardContext filterStart
SEVERE: Exception starting filter SolrRequestFilter
org.apache.solr.common.SolrException: Could not find necessary SLF4j
logging jars. If using Jetty, the SLF4j logging jars need to go in the
jetty lib/ext directory. For other containers, the corresponding directory
should be used. For more information, see:
http://wiki.apache.org/solr/SolrLogging
at
org.apache.solr.servlet.SolrDispatchFilter.(SolrDispatchFilter.java:105)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at java.lang.Class.newInstance0(Class.java:374)
at java.lang.Class.newInstance(Class.java:327)
at
org.apache.catalina.core.DefaultInstanceManager.newInstance(DefaultInstanceManager.java:125)
at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:256)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
at
org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:103)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
at
org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
at
org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:649)
at
org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1581)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
at
org.apache.solr.servlet.SolrDispatchFilter.(SolrDispatchFilter.java:103)
... 24 more
Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
at
org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1701)
at
org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1546)
... 25 more

Anybody testing 4.3 on Tomcat at the moment? Any help would be appreciated
related to Tomcat configuration etc.

Cheers,

/jonatan


Re: how to quickly export data from SolrCloud

2013-05-06 Thread Kevin Osborn
This is actually something I will do quite frequently. I basically export
from Solr into a CSV file as part of a workflow sequence.

CSV is nice and fast, but does not have the ZooKeeper integration that I
like with SolrJ.


On Mon, May 6, 2013 at 10:11 AM, Shawn Heisey  wrote:

> On 5/6/2013 10:48 AM, Kevin Osborn wrote:
>
>> I am looking to export a large amount of data from Solr. This export will
>> be done by a Java application and then written to file. Initially, I was
>> thinking of using direct HTTP calls and using the CSV response writer. And
>> then my Java application can quickly parse each line from a stream.
>>
>> But, with SolrCloud, I prefer to use SolrJ due to its communication with
>> Zookeeper. Is there any way to use the CSV response writer with SolrJ?
>>
>> Would the overhead of using SolrJ's "solrbin" format be much slower than
>> the CSV response writer?
>>
>
> What do you intend to do with the exported data?  If you're going to use
> it to import into a new Solr index, you might be better off using the
> dataimport handler with SolrEntityProcessor.  Just point it at one of your
> servers and include the collection name in the URL.
>
> If the export will have other uses and CSV format will work for you, that
> would probably be more efficient than something you could whip together
> quickly with SolrJ.  If you've got really excellent java skills and have a
> lot of time to work on it, you might be able to write something efficient,
> but Solr can already do it.
>
> If you plan to page through your data rather than grab it all with one
> query, it is MUCH more efficient to use a range query on a field with
> sequential data than to use the start and rows parameters.  This is
> *especially* true if you're using a sharded index, which is typically the
> case with SolrCloud.
>
> By the way, I am assuming that this process will be a one-time (or very
> rare) thing for migration purposes, or possibly something that you
> occasionally do for some kind of index verification.  If this is something
> that you'll be doing all the time, then you probably want to develop a
> SolrJ application.
>
> Thanks,
> Shawn
>
>


-- 
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614
[image: CNET Content Solutions]


Re: Tokenize Sentence and Set Attribute

2013-05-06 Thread Jack Krupansky
Sounds like a very ambitious project. I'm sure you COULD do it in Solr, but 
not in very short order.


Check out some discussion of simply searching within sentences:
http://markmail.org/message/aoiq62a4mlo25zzk?q=apache#query:apache+page:1+mid:aoiq62a4mlo25zzk+state:results

First, how do you expect to use/query the corpus?  In other words, what are 
your user requirements? They will determine what structure the Solr index, 
analysis chains, and custom search components will need.


Also, check out the Solr OpenNLP wiki:
http://wiki.apache.org/solr/OpenNLP

And see "LUCENE-2899: Add OpenNLP Analysis capabilities as a module":
https://issues.apache.org/jira/browse/LUCENE-2899
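
For the filtering step at the end of such a chain, the Lucene-side code can be quite small.
Below is a minimal sketch, assuming an earlier component has already copied the tagger's
output into each token's TypeAttribute; the class name and tag values are made up for
illustration, and position increments are not adjusted (a production version would extend
FilteringTokenFilter instead):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class KeepByPosTagFilter extends TokenFilter {
    private final Set<String> keepTags;                       // e.g. {"NN", "VB"}
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    public KeepByPosTagFilter(TokenStream input, Set<String> keepTags) {
        super(input);
        this.keepTags = keepTags;
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            if (keepTags.contains(typeAtt.type())) {
                return true;                                   // keep this token
            }
            // otherwise drop it and look at the next token
        }
        return false;
    }
}

The harder part is the sentence-level step: because the tagger wants the whole word list of
a sentence at once, the tagging component has to buffer a sentence's worth of tokens before
emitting them, which is essentially the approach the OpenNLP work above takes.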

-- Jack Krupansky

-Original Message- 
From: Rendy Bambang Junior

Sent: Monday, May 06, 2013 11:41 AM
To: solr-user@lucene.apache.org
Subject: Tokenize Sentence and Set Attribute

Hello,

I am trying to use part of speech tagger for bahasa Indonesia to filter
tokens in Solr.
The tagger receive input as word list of a sentence and return tag array.

I think the process should be like this:
- tokenize sentence
- tokenize word
- pass it into the tagger
- set attribute using tagger output
- pass it into a FilteringTokenFilter implementation

Is it possible to do this in Solr/Lucene? If it is, how?

I've read about a similar solution for the Japanese language, but since I lack an
understanding of Japanese, it didn't help much.

--
Regards,
Rendy Bambang Junior
Informatics Engineering '09
Bandung Institute of Technology 



Re: how to quickly export data from SolrCloud

2013-05-06 Thread Shawn Heisey

On 5/6/2013 10:48 AM, Kevin Osborn wrote:

I am looking to export a large amount of data from Solr. This export will
be done by a Java application and then written to file. Initially, I was
thinking of using direct HTTP calls and using the CSV response writer. And
then my Java application can quickly parse each line from a stream.

But, with SolrCloud, I prefer to use SolrJ due to its communication with
Zookeeper. Is there any way to use the CSV response writer with SolrJ?

Would the overhead of using SolrJ's "solrbin" format be much slower than
the CSV response writer?


What do you intend to do with the exported data?  If you're going to use 
it to import into a new Solr index, you might be better off using the 
dataimport handler with SolrEntityProcessor.  Just point it at one of 
your servers and include the collection name in the URL.


If the export will have other uses and CSV format will work for you, 
that would probably be more efficient than something you could whip 
together quickly with SolrJ.  If you've got really excellent java skills 
and have a lot of time to work on it, you might be able to write 
something efficient, but Solr can already do it.


If you plan to page through your data rather than grab it all with one 
query, it is MUCH more efficient to use a range query on a field with 
sequential data than to use the start and rows parameters.  This is 
*especially* true if you're using a sharded index, which is typically 
the case with SolrCloud.
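
To make that concrete, here is a minimal SolrJ sketch of key-based batching instead of
start/rows paging. It assumes a sortable unique key field called "id" whose values need no
query escaping; the ZooKeeper address and collection name are placeholders. CloudSolrServer
is used so the client stays ZooKeeper-aware:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class KeyPagedExport {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
        solr.setDefaultCollection("collection1");

        String lastKey = null;                            // highest key seen so far
        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            if (lastKey != null) {
                // only documents with a key greater than the last batch's maximum
                q.addFilterQuery("id:{" + lastKey + " TO *]");
            }
            q.set("sort", "id asc");
            q.setRows(500);
            SolrDocumentList docs = solr.query(q).getResults();
            if (docs.isEmpty()) {
                break;                                    // nothing left to fetch
            }
            for (SolrDocument doc : docs) {
                // write the document out here
                lastKey = doc.getFieldValue("id").toString();
            }
        }
        solr.shutdown();
    }
}

Each batch stays cheap because the filter query keeps narrowing the candidate set, instead
of making Solr skip an ever larger number of documents the way a growing start value does.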


By the way, I am assuming that this process will be a one-time (or very 
rare) thing for migration purposes, or possibly something that you 
occasionally do for some kind of index verification.  If this is 
something that you'll be doing all the time, then you probably want to 
develop a SolrJ application.


Thanks,
Shawn



how to quickly export data from SolrCloud

2013-05-06 Thread Kevin Osborn
I am looking to export a large amount of data from Solr. This export will
be done by a Java application and then written to file. Initially, I was
thinking of using direct HTTP calls and using the CSV response writer. And
then my Java application can quickly parse each line from a stream.

But, with SolrCloud, I prefer to use SolrJ due to its communication with
Zookeeper. Is there any way to use the CSV response writer with SolrJ?

Would the overhead of using SolrJ's "solrbin" format be much slower than
the CSV response writer?

-- 
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614
[image: CNET Content Solutions]


Re: iterate through each document in Solr

2013-05-06 Thread Mingfeng Yang
Andre,

Thanks for the info!  Unfortunately, my Solr is on version 3.6, and it looks
like those options are not available. :(

Ming-


On Mon, May 6, 2013 at 5:32 AM, Andre Bois-Crettez wrote:

> On 05/06/2013 06:03 AM, Michael Sokolov wrote:
>
>> On 5/5/13 7:48 PM, Mingfeng Yang wrote:
>>
>>> Dear Solr Users,
>>>
>>> Does anyone know what is the best way to iterate through each document
>>> in a
>>> Solr index with billion entries?
>>>
>>> I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
>>> and then change start value, but it got very slow after getting through
>>> about 10 million docs.
>>>
>>> Thanks,
>>> Ming-
>>>
>>>  You need to use a unique and stable sort key and get documents>
>> sortkey.  For example, if you have a unique key, retrieve documents
>> ordered by the unique key, and for each batch get documents>  max (key)
>> from the previous batch
>>
>> -Mike
>>
>>  There is more details on the wiki :
> http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore
>
>
> --
> André Bois-Crettez
>
> Search technology, Kelkoo
> http://www.kelkoo.com/
>
>
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 8, rue du Sentier 75002 Paris
> 425 093 069 RCS Paris
>
> This message and its attachments are confidential and intended exclusively
> for their addressees. If you are not the intended recipient of this message,
> please delete it and notify the sender.
>


Re: iterate through each document in Solr

2013-05-06 Thread Mingfeng Yang
Hi Dmitry,

My index is not sharded, and since its size is so big, sharding won't help
much on the paging issue.  Do you know of any API which can help read from
the Lucene binary index directly? It would be nice if we could just scan
through the docs directly.

Thanks!
Ming-


On Mon, May 6, 2013 at 3:33 AM, Dmitry Kan  wrote:

> Are you doing it once? Is your index sharded? If so, can you ask each shard
> individually?
> Another way would be to do it on Lucene level, i.e. read from the binary
> indices (API exists).
>
> Dmitry
>
>
> On Mon, May 6, 2013 at 5:48 AM, Mingfeng Yang 
> wrote:
>
> > Dear Solr Users,
> >
> > Does anyone know what is the best way to iterate through each document
> in a
> > Solr index with billion entries?
> >
> > I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
> > and then change start value, but it got very slow after getting through
> > about 10 million docs.
> >
> > Thanks,
> > Ming-
> >
>


Solr Cloud with large synonyms.txt

2013-05-06 Thread Son Nguyen
Hello,

I'm building a Solr Cloud (version 4.1.0) with 2 shards and a ZooKeeper (the 
ZooKeeper is on a different machine, version 3.4.5).
I've tried to start with a 1.7MB synonyms.txt, but got a 
"ConnectionLossException":
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for /configs/solr1/synonyms.txt
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:270)
at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:267)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:267)
at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:436)
at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:315)
at org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:1135)
at org.apache.solr.cloud.ZkController.uploadConfigDir(ZkController.java:955)
at org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:285)
... 43 more

I did some research on the internet and found out that this is because the ZooKeeper
znode size limit is 1MB. I tried to increase the system property "jute.maxbuffer", but
it didn't work.
Does anyone have experience of dealing with it?
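
(A general ZooKeeper note rather than anything specific to this setup: jute.maxbuffer is
read at JVM startup and has to be raised to the same value on every ZooKeeper server and on
every client JVM, i.e. each Solr instance, for example with -Djute.maxbuffer=4194304 for
roughly 4 MB; raising it on only one side usually has no visible effect.)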

Thanks,
Son


RE: Indexing off of the production servers

2013-05-06 Thread David Parks
So, am I following this correctly: this proposed solution would give us a way
to index a collection on an offline/dev Solr Cloud instance and *move* that
pre-prepared index to the production server using an alias/rename trick?

That seems like a reasonably doable solution. I also wonder how much work it
is to build the shards programmatically (e.g. directly in a hadoop/java
environment), cutting out the extra step of needing another Solr instance
running on a staging environment somewhere. Then using this technique to
swap in the shards.

I might do something like this first and then look into simplifying, and
further automating, later on. And if it is indeed possible to build a hadoop
driver for indexing, I think that would be a useful tool for the community
at large. So I'm still curious about it, at least as a thought exercise, if
nothing else.

Thanks,
Dave


-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, May 06, 2013 9:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

Thanks for your answer. I have read that at somewhere:

I believe "redirect" from replica to leader would happen only at index time,
so a doc first gets indexed to leader and from there it's replicated to
non-leader shards.

Is that true? I want to make clear the things in my mind otherwise I want to
ask a separate question about what happens for indexing and querying at
SolrCloud.

2013/5/6 Shawn Heisey 

> On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote:
> > Excellent idea !
> > And it is possible to use collection aliasing with the CREATEALIAS 
> > to make this transparent for the query side.
> >
> > ex. with 2 collections named :
> > collection_1
> > collection_2
> >
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
> >
> > "collectionalias" is now a virtual collection pointing to collection_1.
> >
> > Index on collection_2, then :
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
> >
> > "collectionalias" now is an alias to collection_2.
> >
> >
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
>
> Awesome idea, Andre! I was wondering whether you might have to delete 
> the original alias before creating the new one, but a quick look at 
> the issue for collection aliasing shows that this isn't the case.
>
> https://issues.apache.org/jira/browse/SOLR-4497
>
> The wiki doesn't mention the DELETEALIAS action.  I won't have time 
> right now to update the wiki.
>
> Thanks,
> Shawn
>
>



solr.LatLonType type vs solr.SpatialRecursivePrefixTreeFieldType

2013-05-06 Thread bbarani
Hi,

I am currently using SOLR 4.2 to index geospatial data. I have configured my
geospatial field as below.



  

I just want to make sure that I am using the correct SOLR class for
performing geospatial search since I am not sure which of the 2
class(LatLonType vs  SpatialRecursivePrefixTreeFieldType) will be supported
by future versions of SOLR.

I assume latlong is an upgraded version of
SpatialRecursivePrefixTreeFieldType, can someone please confirm if I am
right?

Thanks,
Barani 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-LatLonType-type-vs-solr-SpatialRecursivePrefixTreeFieldType-tp4061113.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: custom tokenizer error

2013-05-06 Thread Sarita Nair
baseTokenizer is reset in the #reset method.

Sarita




 From: Jack Krupansky 
To: solr-user@lucene.apache.org 
Sent: Sunday, May 5, 2013 1:37 PM
Subject: Re: custom tokenizer error
 

I didn't notice any call to the "reset" method for your base tokenizer.

Is there any reason that you didn't just use char filters to replace colon 
and periods with spaces?

-- Jack Krupansky

-Original Message- 
From: Sarita Nair
Sent: Friday, May 03, 2013 2:43 PM
To: solr-user@lucene.apache.org
Subject: custom tokenizer error

I am using a custom Tokenizer, as part of analysis chain, for a Solr (4.2.1) 
field. On trying to index, Solr throws a NullPointerException.
The unit tests for the custom tokenizer work fine. Any ideas as to what is 
it that I am missing/doing incorrectly will be appreciated.

Here is the relevant schema.xml excerpt:

    
    
    
    
    
    
    

Here are the relevant pieces of the Tokenizer:

/**
 * Intercepts each token produced by {@link StandardTokenizer#incrementToken()}
 * and checks for the presence of a colon or period. If found, splits the token
 * on the punctuation mark and adjusts the term and offset attributes of the
 * underlying {@link TokenStream} to create additional tokens.
 */
public class EmbeddedPunctuationTokenizer extends Tokenizer {

    private static final Pattern PUNCTUATION_SYMBOLS = Pattern.compile("[:.]");

    private StandardTokenizer baseTokenizer;

    private CharTermAttribute termAttr;

    private OffsetAttribute offsetAttr;

    private /*@Nullable*/ String tokenAfterPunctuation = null;

    private int currentOffset = 0;

    public EmbeddedPunctuationTokenizer(final Reader reader) {
        super(reader);
        baseTokenizer = new StandardTokenizer(Version.MINIMUM_LUCENE_VERSION, reader);
        // Two TokenStreams are in play here: the one underlying the current
        // instance and the one underlying the StandardTokenizer. The attribute
        // instances must be associated with both.
        termAttr = baseTokenizer.addAttribute(CharTermAttribute.class);
        offsetAttr = baseTokenizer.addAttribute(OffsetAttribute.class);
        this.addAttributeImpl((CharTermAttributeImpl) termAttr);
        this.addAttributeImpl((OffsetAttributeImpl) offsetAttr);
    }

    @Override
    public void end() throws IOException {
        baseTokenizer.end();
        super.end();
    }

    @Override
    public void close() throws IOException {
        baseTokenizer.close();
        super.close();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        baseTokenizer.reset();
        currentOffset = 0;
        tokenAfterPunctuation = null;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        if (tokenAfterPunctuation != null) {
            // Do not advance the underlying TokenStream if the previous call
            // found an embedded punctuation mark and set aside the substring
            // that follows it. Set the attributes instead from the substring,
            // bearing in mind that the substring could contain more embedded
            // punctuation marks.
            adjustAttributes(tokenAfterPunctuation);
        } else if (baseTokenizer.incrementToken()) {
            // No remaining substring from a token with embedded punctuation: save
            // the starting offset reported by the base tokenizer as the current
            // offset, then proceed with the analysis of token it returned.
            currentOffset = offsetAttr.startOffset();
            adjustAttributes(termAttr.toString());
        } else {
            // No more tokens in the underlying token stream: return false
            return false;
        }
        return true;
    }

    private void adjustAttributes(final String token) {
        Matcher m = PUNCTUATION_SYMBOLS.matcher(token);
        if (m.find()) {
            int index = m.start();
            offsetAttr.setOffset(currentOffset, currentOffset + index);
            termAttr.copyBuffer(token.toCharArray(), 0, index);
            tokenAfterPunctuation = token.substring(index + 1);
            // Given that the incoming token had an embedded punctuation mark,
            // the starting offset for the substring following the punctuation
            // mark will be 1 beyond the end of the current token, which is the
            // substring preceding embedded punctuation mark.
            currentOffset = offsetAttr.endOffset() + 1;
        } else if (tokenAfterPunctuation != null) {
            // Last remaining substring following a previously detected embedded
            // punctuation mark: adjust attributes based on its values.
            int length = tokenAfterPunctuation.length();
            termAttr.copyBuffer(tokenAfterPunctuation.toCharArray(), 0, length);
            offsetAttr.setOffset(currentOffset, currentOffset + length);
            tokenAfterPunctuation = null;
        }
        // Implied else: neither is true so attributes from base tokenizer need
        // no adjustments.
    }
}

Solr throws the following error, in the 'else if' block of #incrementToken

    2013-04-29 14:19:48,920 [http-thread-pool-8080(3)] ERROR org.apache.solr.core.SolrCore - java.lang.NullPointerException
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
    at org.apache.lucene.analysis.

Tokenize Sentence and Set Attribute

2013-05-06 Thread Rendy Bambang Junior
Hello,

I am trying to use part of speech tagger for bahasa Indonesia to filter
tokens in Solr.
The tagger receive input as word list of a sentence and return tag array.

I think the process should be like this:
- tokenize sentence
- tokenize word
- pass it into the tagger
- set attribute using tagger output
- pass it into a FilteringTokenFilter implementation

Is it possible to do this in Solr/Lucene? If it is, how?

I've read about a similar solution for the Japanese language, but since I lack an
understanding of Japanese, it didn't help much.

-- 
Regards,
Rendy Bambang Junior
Informatics Engineering '09
Bandung Institute of Technology


Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-06 Thread Mark Miller
ClusterState is kept in memory and Solr is notified of ClusterState updates by 
ZooKeeper when a change happens - Solr then grabs the latest ClusterState. If 
ZooKeeper goes down, Solr keeps using the in memory ClusterState it has and 
simply stops getting any new ClusterState updates until ZooKeeper comes back.

- Mark

On May 6, 2013, at 2:59 AM, Furkan KAMACI  wrote:

> Hi Mark;
> 
> You said: "So it's pretty simple - when you lost the ability to talk to ZK,
> everything keeps working based on the most recent clusterstate - except
> that updates are blocked and you cannot add new nodes to the cluster."
> Where do the nodes keep the cluster state? When a query comes to a node that is a
> replica of another shard, how will the query return accurate results?
> 
> 2013/5/5 Jack Krupansky 
> 
>> Is soul retrieval possible when ZooKeeper is down?
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Mark Miller
>> Sent: Sunday, May 05, 2013 2:19 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: disaster recovery scenarios for solr cloud and zookeeper
>> 
>> 
>> When Solr loses it's connection to ZooKeeper, updates will start being
>> rejected. Read requests will continue as normal. This is regardless of how
>> long ZooKeeper is down.
>> 
>> So it's pretty simple - when you lost the ability to talk to ZK,
>> everything keeps working based on the most recent clusterstate - except
>> that updates are blocked and you cannot add new nodes to the cluster. You
>> are essentially in steady state.
>> 
>> The ZK clients will continue trying to reconnect so that when ZK comes
>> back updates while start being accepted again and new nodes may join the
>> cluster.
>> 
>> - Mark
>> 
>> On May 3, 2013, at 3:21 PM, Dennis Haller  wrote:
>> 
>> Hi,
>>> 
>>> Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is
>>> expected to have a very high (perfect?) availability. With 3 or 5
>>> zookeeper
>>> nodes, it is possible to manage zookeeper maintenance and online
>>> availability to be close to %100. But what is the worst case for Solr if
>>> for some unanticipated reason all Zookeeper nodes go offline?
>>> 
>>> Could someone comment on a couple of possible scenarios for which all ZK
>>> nodes are offline. What would happen to Solr and what would be needed to
>>> recover in each case?
>>> 1) brief interruption, say <2 minutes,
>>> 2) longer downtime, say 60 min
>>> 
>>> Thanks
>>> Dennis
>>> 
>> 
>> 



Re: Duplicated Documents Across shards

2013-05-06 Thread Shawn Heisey
> Oops... you're right, and before I started writing that response I had the
> thought that these should be "shardDir", but even that is confused. I
> think
> "replicaDir" or "collectionReplica" or "shardReplicaDir" or...
> "collectionShardReplicaDir" - the latter is wordy, but is explicit. I'd
> reserve "coreDir" for "old" Solr.

Many naming choices in Solr could use an overhaul. As you've pointed out,
some names work well for single shard per machine setups. Others work well
for cloud but not so well for non cloud.

I think that things are headed towards always using zookeeper, and
possibly towards always being cloud, even for nonredundant and nonsharded
single index deployments.

I will start a couple of discussion threads on the dev list.

Thanks,
Shawn




Re: Log Monitor System for SolrCloud and Logging to log4j at SolrCloud?

2013-05-06 Thread Steve Rowe
Done - see http://markmail.org/message/66vpwk42ih6uxps7

On May 6, 2013, at 5:29 AM, Furkan KAMACI  wrote:

> Is there any road map for Solr when will Solr 4.3 be tagged at svn?
> 
> 2013/4/26 Mark Miller 
> 
>> Slf4j is meant to work with existing frameworks - you can set it up to
>> work with log4j, and Solr will use log4j by default in the about to be
>> released 4.3.
>> 
>> http://wiki.apache.org/solr/SolrLogging
>> 
>> - Mark
>> 
>> On Apr 26, 2013, at 7:19 AM, Furkan KAMACI  wrote:
>> 
>>> I want to use GrayLog2 to monitor my logging files for SolrCloud.
>> However I
>>> think that GrayLog2 works with log4j and logback. Solr uses slf4j.
>>> How can I solve this problem and what logging monitoring system does
>> folks
>>> use?
>> 
>> 



Re: Solr on Amazon EC2

2013-05-06 Thread Stephane Gamard
Hi Rajesh,

The rule of thumb when it comes to Solr and the cloud is: run your own instance. There are
so many differences (subtle, but they can be painful) between Solr releases that it is best
that you know which one you are using. Solr is also packaged to work directly out of the box
(using the Jetty starter: start.jar). My recommendation would be to load up a generic Ubuntu
install, download the Solr binary distribution and start from there. When you have Solr
running the way you want, make an image (snapshot) and that will be your Solr base image.

--Stephane Gamard

On May 6, 2013, Rajesh Nikam (rajeshni...@gmail.com) wrote:

Hello,

I am looking into how to do document classification for categorization of
html documents. I see Solr/Lucene + MoreLikeThis that suits to find similar
documents for given document.

I am able to do classification using Lucene + MoreLikeThis example.

Then I was looking for how to host Solr on Amazon EC2. I see bitnami
provide AMI images for the same.
I see there are 4000+ AMI IDs to select from. I am not sure which to use ?

Could you please let me know which is correct image to use in this case ?
Or how to create new image with tomcat + Solr and save it for future usage ?

Thanks,
Rajesh


update to 4.3

2013-05-06 Thread Arkadi Colson

Hi

After update to 4.3 I got this error:

May 06, 2013 2:30:08 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-8983"]
May 06, 2013 2:30:08 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["ajp-bio-8009"]
May 06, 2013 2:30:08 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 610 ms
May 06, 2013 2:30:08 PM org.apache.catalina.core.StandardService 
startInternal

INFO: Starting service Catalina
May 06, 2013 2:30:08 PM org.apache.catalina.core.StandardEngine 
startInternal

INFO: Starting Servlet Engine: Apache Tomcat/7.0.39
May 06, 2013 2:30:08 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive 
/usr/local/apache-tomcat-7.0.39/webapps/solr.war
May 06, 2013 2:30:45 PM org.apache.catalina.util.SessionIdGenerator 
createSecureRandom
INFO: Creation of SecureRandom instance for session ID generation using 
[SHA1PRNG] took [36,697] milliseconds.
May 06, 2013 2:30:45 PM org.apache.catalina.core.StandardContext 
startInternal

SEVERE: Error filterStart
May 06, 2013 2:30:45 PM org.apache.catalina.core.StandardContext 
startInternal

SEVERE: Context [/solr] startup failed due to previous errors
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/host-manager
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/docs
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/manager
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/ROOT
May 06, 2013 2:30:45 PM org.apache.catalina.startup.HostConfig 
deployDirectory
INFO: Deploying web application directory 
/usr/local/apache-tomcat-7.0.39/webapps/examples

May 06, 2013 2:30:45 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-8983"]
May 06, 2013 2:30:45 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-8009"]
May 06, 2013 2:30:45 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 37541 ms

Any idea?

--
Met vriendelijke groeten

Arkadi Colson

Smartbit bvba • Hoogstraat 13 • 3670 Meeuwen
T +32 11 64 08 80 • F +32 11 64 08 81



Re: Atomic Update and stored copy-fields

2013-05-06 Thread raulgrande83
We have defined those copyfield destinations as stored because we have
experienced some problems when highlighting in them. These fields have
different Tokenizers and Analyzers. We have found that if we search in one
of them but highlight on a different one, some words that don't match the
original query appear highlighted in the results.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Atomic-Update-and-stored-copy-fields-tp4059129p4061095.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing off of the production servers

2013-05-06 Thread Furkan KAMACI
Hi Erick;

Thanks for your answer. I have read this somewhere:

I believe "redirect" from replica to leader would happen only at
index time, so a doc first gets indexed to leader and from there it's
replicated to non-leader shards.

Is that true? I want to get things clear in my mind; otherwise I want
to ask a separate question about what happens for indexing and querying in
SolrCloud.

2013/5/6 Shawn Heisey 

> On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote:
> > Excellent idea !
> > And it is possible to use collection aliasing with the CREATEALIAS to
> > make this transparent for the query side.
> >
> > ex. with 2 collections named :
> > collection_1
> > collection_2
> >
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
> >
> > "collectionalias" is now a virtual collection pointing to collection_1.
> >
> > Index on collection_2, then :
> >
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
> >
> > "collectionalias" now is an alias to collection_2.
> >
> >
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
>
> Awesome idea, Andre! I was wondering whether you might have to delete
> the original alias before creating the new one, but a quick look at the
> issue for collection aliasing shows that this isn't the case.
>
> https://issues.apache.org/jira/browse/SOLR-4497
>
> The wiki doesn't mention the DELETEALIAS action.  I won't have time
> right now to update the wiki.
>
> Thanks,
> Shawn
>
>


Re: Duplicated Documents Across shards

2013-05-06 Thread Jack Krupansky
Oops... you're right, and before I started writing that response I had the 
thought that these should be "shardDir", but even that is confused. I think 
"replicaDir" or "collectionReplica" or "shardReplicaDir" or... 
"collectionShardReplicaDir" - the latter is wordy, but is explicit. I'd 
reserve "coreDir" for "old" Solr.


Maybe "collectionDir" is fine for single node, single shard, single replica 
Solr, and would throw an error if number of shards or replicas was greater 
than 1. Otherwise, "replicaDir" would be sufficient and brief.


I don't care so much exactly what the name is, so long as it accurately 
conveys its meaning.


Just to be clear, although the more modern Solr term "collection" came into 
use when SolrCloud was introduced, it is not solely a SolrCloud term. Even a 
"single core" Solr is using a "collection" (that happens to be single-core 
and single-shard and single-replica.) To wit, the stock Solr example, which 
is not SolrCloud, is named "collection1".


-- Jack Krupansky

-Original Message- 
From: Shawn Heisey

Sent: Monday, May 06, 2013 10:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Duplicated Documents Across shards

On 5/6/2013 7:44 AM, Jack Krupansky wrote:

I think if we had a more comprehensible term for a "collection
configuration directory", a lot of the confusion would go away. I mean,
what the heck is an "instance" anyway? How does "instanceDir" relate to
an "instance" of the Solr "server"? Sure, I know that it is the parent
directory of the collection configuration (conf directory) or a
"collection directory", but how would a mere mortal grok that? I mean,
"instance" sounds like it's at a higher level than the collection itself
- that's why people tend to think it's the same for all cores in a Solr
"instance".

We should reconsider the name of that term. My choice: collectionDir.


I think that might lead to just as much confusion as instanceDir,
because it's for a core, not a collection.  A name like coreDir would
avoid that confusion.

If you actually are using collections, then you'll be using SolrCloud.
A SolrCloud installation with maxShardsPerNode>1 will have more than one
core for the same collection on each node, so collectionDir would be
very confusing.

I was initially thinking a good name would be coreConfDir or confDir,
but that only makes sense in situations where dataDir is also present.
The Collections API creates cores without a dataDir parameter, and many
solr.xml files are created manually without dataDir.

Thanks,
Shawn 



Re: Indexing off of the production servers

2013-05-06 Thread Shawn Heisey
On 5/6/2013 7:55 AM, Andre Bois-Crettez wrote:
> Excellent idea !
> And it is possible to use collection aliasing with the CREATEALIAS to
> make this transparent for the query side.
> 
> ex. with 2 collections named :
> collection_1
> collection_2
> 
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
> 
> "collectionalias" is now a virtual collection pointing to collection_1.
> 
> Index on collection_2, then :
> /collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
> 
> "collectionalias" now is an alias to collection_2.
> 
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API

Awesome idea, Andre! I was wondering whether you might have to delete
the original alias before creating the new one, but a quick look at the
issue for collection aliasing shows that this isn't the case.

https://issues.apache.org/jira/browse/SOLR-4497

The wiki doesn't mention the DELETEALIAS action.  I won't have time
right now to update the wiki.

Thanks,
Shawn



Re: Duplicated Documents Across shards

2013-05-06 Thread Shawn Heisey
On 5/6/2013 7:44 AM, Jack Krupansky wrote:
> I think if we had a more comprehensible term for a "collection
> configuration directory", a lot of the confusion would go away. I mean,
> what the heck is an "instance" anyway? How does "instanceDir" relate to
> an "instance" of the Solr "server"? Sure, I know that it is the parent
> directory of the collection configuration (conf directory) or a
> "collection directory", but how would a mere mortal grok that? I mean,
> "instance" sounds like it's at a higher level than the collection itself
> - that's why people tend to think it's the same for all cores in a Solr
> "instance".
> 
> We should reconsider the name of that term. My choice: collectionDir.

I think that might lead to just as much confusion as instanceDir,
because it's for a core, not a collection.  A name like coreDir would
avoid that confusion.

If you actually are using collections, then you'll be using SolrCloud.
A SolrCloud installation with maxShardsPerNode>1 will have more than one
core for the same collection on each node, so collectionDir would be
very confusing.

I was initially thinking a good name would be coreConfDir or confDir,
but that only makes sense in situations where dataDir is also present.
The Collections API creates cores without a dataDir parameter, and many
solr.xml files are created manually without dataDir.

Thanks,
Shawn



Solr on Amazon EC2

2013-05-06 Thread Rajesh Nikam
Hello,

I am looking into how to do document classification for categorization of
html documents. I see Solr/Lucene + MoreLikeThis that suits to find similar
documents for given document.

I am able to do classification using Lucene + MoreLikeThis example.

Then I was looking for how to host Solr on Amazon EC2. I see bitnami
provide AMI images for the same.
I see there are 4000+ AMI IDs to select from. I am not sure which to use ?

Could you please let me know which is correct image to use in this case ?
Or how to create new image with tomcat + Solr and save it for future usage ?

Thanks,
Rajesh


Re: Memory problems with HttpSolrServer

2013-05-06 Thread Shawn Heisey
On 5/6/2013 1:32 AM, Rogowski, Britta wrote:
> Hi!
> 
> When I write from our database to a HttpSolrServer, (using a 
> LinkedBlockingQueue to write just one document at a time), I run into memory 
> problems (due to various constraints, I have to remain on a 32-bit system, so 
> I can use at most 2 GB RAM).
> 
> If I use an EmbeddedSolrServer (to write locally), I have no such problems. 
> Just now, I tried out ConcurrentUpdateSolrServer (with a queue size of 1, but 
> 3 threads to be safe), and that worked out fine too. I played around with 
> various GC options and monitored memory with jconsole and jmap, but only 
> found out that there's lots of byte arrays, SolrInputFields and Strings 
> hanging around.
> 
> Since ConcurrentUpdateSolrServer works, I'm happy, but I was wondering if 
> people were aware of the memory issue around HttpSolrServer.

Is it memory usage within the JVM, or OS allocation for the java process
that you are looking at?

There are no known memory problems with current versions of SolrJ, and
none that I know about with older versions.  At the time you wrote this,
4.2.1 was the latest version, but now several hours later, 4.3.0 has
been released.

I have a SolrJ app that I've been using since 3.5.0, currently using
4.2.1.  It creates 32 separate HttpSolrServer instances, to keep all my
shards up to date.  It runs for weeks or months at a time and is
currently using about 25MB of RAM within the JVM.  When special reindex
requests happen, memory usage may briefly go up to a few hundred MB.  It
will typically allocate the entire 1GB heap at the OS level, but I could
run it with a smaller heap and have no trouble.

After I gathered those numbers, I restarted the application.  Memory
usage is still low, and the OS shows only 106MB in use.

I suspect that your java code may have a memory leak.  I'm not sure why
the leak isn't happening with the concurrent object, that's very very
weird.  ConcurrentUpdateSolrServer uses HttpSolrServer internally.  When
you use HttpSolrServer, are you reusing one object or creating a new one
for every request?  You should create one HttpSolrServer object for
every separate Solr core and then use that object for the life of your
application.  It is completely thread safe.
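
As an illustration of that pattern (the URL, core and field names are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class Indexer {
    // One HttpSolrServer per Solr core, created once and reused for the
    // life of the application; the object is thread safe.
    private static final HttpSolrServer SOLR =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    public void index(String id, String title) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", title);
        SOLR.add(doc);   // reuse the shared instance; never build a new one per request
    }

    public void commit() throws Exception {
        SOLR.commit();
    }
}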

There is a large caveat with ConcurrentUpdateSolrServer.  If you are
using try/catch blocks to trap request errors and take action, you
should be aware that this object will never throw an error.  Even if a
request fails or your Solr server is down, your application will never know.

Why do I need 32 HttpSolrServer objects? I have 2 index chains, 7 shards
per chain, with a live core and a build core per shard.  That is 28
separate cores.  There are four Solr servers, so I need four additional
objects for CoreAdmin requests.

Thanks,
Shawn



Re: A Comma /aSpace in a Query argument

2013-05-06 Thread Jack Krupansky
Oops, and I neglected to mention that you can escape a single character with 
a backslash, or you can enclose the entire term in double quotes:


q=myfield:aa,bb
q=myfield:"aa,bb"

q=myfield:aa\ bb
q=myfield:"aa bb"

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Monday, May 06, 2013 9:51 AM
To: solr-user@lucene.apache.org
Subject: Re: A Comma /aSpace in a Query argument

Commas, dots, hyphens, slashes, and semicolons do not need to be "escaped",
but spaces do.

But... be careful, because some analyzers will throw away all or most
punctuation. You may have to resort to the white space analyzer to preserve
punctuation characters, but you can add char filters to eliminate
punctuation that you don't want to keep.

-- Jack Krupansky

-Original Message- 
From: Peter Schütt

Sent: Monday, May 06, 2013 9:34 AM
To: solr-user@lucene.apache.org
Subject: A Comma /aSpace in a Query argument

Hallo,

I want to use a comma as part of a query argument.

E.G.

q=myfield:aa,bb

and "aa,bb" is the value of the field.

Do I have to mask it?

And what is about a space in an argument

q=myfield:aa bb

and "aa bb" is the value of the field.

Thanks for any hint.

Ciao
 Peter Schütt 



Re: List of Solr Query Parsers

2013-05-06 Thread Jan Høydahl
Hi Roman,

This sounds great! Please register as a user on the WIKI and give us your 
username here, then we'll grant you editing karma so you can edit the page 
yourself! The NEAR/5 syntax is really something I think we should get into the 
default lucene parser. Can't wait to have a look at your code.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

6. mai 2013 kl. 15:41 skrev Roman Chyla :

> Hi Jan,
> Please add this one http://29min.wordpress.com/category/antlrqueryparser/
> - I can't edit the wiki
> 
> This parser is written with ANTLR and on top of lucene modern query parser.
> There is a version which implements Lucene standard QP as well as a version
> which includes proximity operators, multi token synonym handling and all of
> solr qparsers using function syntax - i.e., for a query like: multi synonym
> NEAR/5 edismax(foo)
> 
> I would like to create a JIRA ticket soon
> 
> Thanks
> 
> Roman
> On 6 May 2013 09:21, "Jan Høydahl"  wrote:
> 
>> Hi,
>> 
>> I just added a Wiki page to try to gather a list of all known Solr query
>> parsers in one place, both those which are part of Solr and those in JIRA
>> or 3rd party.
>> 
>>  http://wiki.apache.org/solr/QueryParser
>> 
>> If you known about other cool parsers out there, please add to the list.
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>> 



Re: Indexing off of the production servers

2013-05-06 Thread Andre Bois-Crettez

Excellent idea !
And it is possible to use collection aliasing with the CREATEALIAS to
make this transparent for the query side.

ex. with 2 collections named :
collection_1
collection_2

/collections?action=CREATEALIAS&name=collectionalias&collections=collection_1
"collectionalias" is now a virtual collection pointing to collection_1.

Index on collection_2, then :
/collections?action=CREATEALIAS&name=collectionalias&collections=collection_2
"collectionalias" now is an alias to collection_2.

http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
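
For completeness, a bare-bones sketch of flipping the alias from application code; since
CREATEALIAS simply overwrites an existing alias, re-running it with a different collection
repoints the alias. Host, port and names are the same illustrative ones as above:

import java.io.InputStream;
import java.net.URL;

public class SwitchAlias {
    public static void main(String[] args) throws Exception {
        // Re-running this with a different target repoints the alias;
        // no DELETEALIAS is needed first.
        String target = "collection_2";
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATEALIAS&name=collectionalias&collections=" + target;
        try (InputStream in = new URL(url).openStream()) {
            while (in.read() != -1) {
                // drain the response; a failure would surface as an IOException
            }
        }
    }
}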


André

On 05/06/2013 03:05 PM, Upayavira wrote:

In non-SolrCloud mode, you can index to another core, and then swap
cores. You could index on another box, ship the index files to your
production server, create a core pointing at these files, then swap this
core with the original one.
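
(For the swap step itself the CoreAdmin SWAP action is enough; illustratively, something
like http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=build, where the
host and core names are placeholders.)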

If you can tell your search app to switch to using a different
collection, you could achieve what you want with solrcloud.

You index to a different collection, which is running on different set
of SolrCloud nodes from your production search. Once indexing is
complete, you create cores on your production boxes for this new
collection. Once indexes have synced, you can switch your app to use
this new collection, thus publishing your new index. You can then delete
the cores on the boxes you were using for indexing.

Now, that's not transparent, but would be do-able.

Upayavira

On Mon, May 6, 2013, at 01:37 PM, David Parks wrote:

I'm less concerned with fully utilizing a hadoop cluster (due to having
fewer shards than I have hadoop reduce slots) as I am with just
off-loading
the whole indexing process. We may just want to re-index the whole thing
to
add some index time boosts or whatever else we conjure up to make queries
faster and better quality. We're doing a lot of work on optimization
right
now.

To re-index the whole thing is a 5-10 hour process for us, so when we
move
some update to production that requires full re-indexing (every week or
so),
right now we're just re-building new instances of solr to handle the
re-indexing and then copying the final VMs to the production environment
(slow process). I'm leery of letting a heavy duty full re-index process
loose for 10 hours on production on a regular basis.

It doesn't sound like there are any pre-built processes for doing this
now
though. I thought I had heard of master/slave hierarchy in 3.x that would
allow us to designate a master to do indexing and let the slaves pull
finished indexes from the master, so I thought maybe something like that
followed into solr cloud. Eric might be right in that it's not worth the
effort if there isn't some existing strategy.

Dave


-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com]
Sent: Monday, May 06, 2013 7:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

I think that even if you use Map/Reduce you will not parallelize your
indexing any further, because indexing parallelizes only as much as the number
of leaders you have in your SolrCloud, doesn't it?

2013/5/6 Erick Erickson


The only problem with using Hadoop (or whatever) is that you need to
be sure that documents end up on the same shard, which means that you
have to use the same routing mechanism that SolrCloud uses. The custom
doc routing may help here

My very first question, though, would be whether this is necessary.
It might be sufficient to just throttle the rate of indexing, or just
do the indexing during off hours or Have you measured an indexing
degradation during your heavy indexing? Indexing has costs, no
question, but it's worth asking whether the costs are heavy enough to
be worth the bother..

Best
Erick

On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI
wrote:

1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
use Map/Reduce jobs you split your workload, process it, and then
reduce step takes into account. Let me explain you new SolrCloud
architecture. You start your SolrCloud with a numShards parameter.
Let's assume that you have 5 shards. Then you will have 5 leader at
your SolrCloud. These

leaders

will be responsible for indexing your data. It means that your
indexing workload will divided into 5 so it means that you have
parallelized your data as like Map/Reduce jobs.

Let's assume that you have added 10 new Solr nodes into your SolrCloud.
They will be added as a replica for each shard. Then you will have 5
shards, 5 leaders of them and every shard has 2 replica. When you
send a query into a SolrCloud every replica will help you for
searching and if

you

add more replicas to your SolrCloud your search performance will

improve.


2013/5/6 David Parks


I've had trouble figuring out what options exist if I want to
perform

all

indexing off of the production servers (I'd like to keep them only
for

user

queries).



We index data in batches roughly daily, ideally I'd index all solr
cloud shards off

Re: A Comma /aSpace in a Query argument

2013-05-06 Thread giovanni.bricc...@banzai.it

Try escaping it with a \


Giovanni

On 06/05/13 15:34, Peter Schütt wrote:

Hallo,

I want to use a comma as part of a query argument.

E.G.

q=myfield:aa,bb

and "aa,bb" is the value of the field.

Do I have to mask it?

And what is about a space in an argument

q=myfield:aa bb

and "aa bb" is the value of the field.

Thanks for any hint.

Ciao
   Peter Schütt





Re: A Comma /aSpace in a Query argument

2013-05-06 Thread Jack Krupansky
Commas, dots, hyphens, slashes, and semicolons do not need to be "escaped", 
but spaces do.


But... be careful, because some analyzers will throw away all or most 
punctuation. You may have to resort to the white space analyzer to preserve 
punctuation characters, but you can add char filters to eliminate 
punctuation that you don't want to keep.


-- Jack Krupansky

-Original Message- 
From: Peter Schütt

Sent: Monday, May 06, 2013 9:34 AM
To: solr-user@lucene.apache.org
Subject: A Comma /aSpace in a Query argument

Hallo,

I want to use a comma as part of a query argument.

E.G.

q=myfield:aa,bb

and "aa,bb" is the value of the field.

Do I have to mask it?

And what is about a space in an argument

q=myfield:aa bb

and "aa bb" is the value of the field.

Thanks for any hint.

Ciao
 Peter Schütt 



Re: Duplicated Documents Across shards

2013-05-06 Thread Jack Krupansky
I think if we had a more comprehensible term for a "collection configuration 
directory", a lot of the confusion would go away. I mean, what the heck is 
an "instance" anyway? How does "instanceDir" relate to an "instance" of the 
Solr "server"? Sure, I know that it is the parent directory of the 
collection configuration (conf directory) or a "collection directory", but 
how would a mere mortal grok that? I mean, "instance" sounds like it's at a 
higher level than the collection itself - that's why people tend to think 
it's the same for all cores in a Solr "instance".


We should reconsider the name of that term. My choice: collectionDir.

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Monday, May 06, 2013 7:39 AM
To: solr-user@lucene.apache.org
Subject: Re: Duplicated Documents Across shards

Having multiple cores point to the same index is, except for
special circumstances where one of the cores is guaranteed to
be read only, a Bad Thing.

So it sounds like you've found your issue...

Best
Erick

On Mon, May 6, 2013 at 4:44 AM, Iker Mtnz. Apellaniz
 wrote:

Thanks Erick,
  I think we found the problem. When defining the cores for both shards we
define both of them in the same instanceDir, like this:



  Each shard should have its own folder, so the final configuration should
be like this:
instanceDir="1_collection/shard2/"

name="1_collection" config="solrconfig.xml" collection="1_collection"/>
instanceDir="1_collection/shard4/"

name="1_collection" config="solrconfig.xml" collection="1_collection"/>

Can anyone confirm this?

Thanks,
  Iker


2013/5/4 Erick Erickson 


Sounds like you've explicitly routed the same document to two
different shards. Document replacement only happens locally to a
shard, so the fact that you have documents with the same ID on two
different shards is why you're getting duplicate documents.

Best
Erick

On Fri, May 3, 2013 at 3:44 PM, Iker Mtnz. Apellaniz
 wrote:
> We are currently using version 4.2.
> We have made tests with a single document and it gives us a 2 document
> count. But if we force to shard into te first machine, the one with a
> unique shard, the count gives us 1 document.
> I've tried using distrib=false parameter, it gives us no duplicate
> documents, but the same document appears to be in two different shards.
>
> Finally, about the separate directories, we have only one directory for the
> data in each physical machine and collection, and I don't see any subfolder
> for the different shards.
>
> Is it possible that we have something wrong with the dataDir configuration
> to use multiple shards in one machine?
>
> <dataDir>${solr.data.dir:}</dataDir>
> <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
>
>
>
> 2013/5/3 Erick Erickson 
>
>> What version of Solr? The custom routing stuff is quite new so
>> I'm guessing 4x?
>>
>> But this shouldn't be happening. The actual index data for the
>> shards should be in separate directories, they just happen to
>> be on the same physical machine.
>>
>> Try querying each one with &distrib=false to see the counts
>> from single shards, that may shed some light on this. It vaguely
>> sounds like you have indexed the same document to both shards
>> somehow...
>>
>> Best
>> Erick
>>
>> On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
>>  wrote:
>> > Hi,
>> >   We have currently a solrCloud implementation running 5 shards in 3
>> > physical machines, so the first machine will have shard number 1, the
>> > second machine shards 2 & 4, and the third shards 3 & 5. We noticed that
>> > while querying, numFoundDocs decreased when we increased the start param.
>> >   After some investigation we found that the documents in shards 2 to 5
>> > were being counted twice. Querying shard 2 will give you back the
>> > results for shards 2 & 4, and the same thing for shards 3 & 5. Our guess
>> > is that the physical index for both shards 2 & 4 is shared, so the shards
>> > don't know which part of it is for each one.
>> >   The uniqueKey is correctly defined, and we have tried using the shard
>> > prefix (shard1!docID).
>> >
>> >   Is there any way to solve this problem when a single physical machine
>> > hosts several shards?
>> >   Is it a "real" problem or does it just affect facet & numResults?
>> >
>> > Thanks
>> >Iker
>> >
>> > --
>> > /** @author imartinez*/
>> > Person me = *new* Developer();
>> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
>> > me.setTwit("@mitxino77 ");
>> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*,
>> World"]});
>> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
>> > me.setWebs({*urbasaabentura.com, ikertxef.com*});
>> > *return* me;
>>
>
>
>
> --
> /** @author imartinez*/
> Person me = *new* Developer();
> me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
> me.setTwit("@mitxino77 ");
> me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*,
World"]});
> me.setSkil

Re: List of Solr Query Parsers

2013-05-06 Thread Roman Chyla
Hi Jan,
Please add this one http://29min.wordpress.com/category/antlrqueryparser/
- I can't edit the wiki

This parser is written with ANTLR and on top of lucene modern query parser.
There is a version which implements Lucene standard QP as well as a version
which includes proximity operators, multi token synonym handling and all of
solr qparsers using function syntax - i.e., for a query like: multi synonym
NEAR/5 edismax(foo)

I would like to create a JIRA ticket soon

Thanks

Roman
On 6 May 2013 09:21, "Jan Høydahl"  wrote:

> Hi,
>
> I just added a Wiki page to try to gather a list of all known Solr query
> parsers in one place, both those which are part of Solr and those in JIRA
> or 3rd party.
>
>   http://wiki.apache.org/solr/QueryParser
>
> If you known about other cool parsers out there, please add to the list.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
>


A Comma /aSpace in a Query argument

2013-05-06 Thread Peter Schütt
Hallo,

I want to use a comma as part of a query argument.

E.G.

q=myfield:aa,bb

and "aa,bb" is the value of the field.

Do I have to mask it?

And what is about a space in an argument

q=myfield:aa bb

and "aa bb" is the value of the field.

Thanks for any hint.

Ciao
  Peter Schütt



Re: List of Solr Query Parsers

2013-05-06 Thread Jack Krupansky

Jan,

I have a full 80-page chapter on query parsers in the new book on Lucene and 
Solr. Send me an email if you would like to be a reviewer. It integrates the 
descriptions of Solr query parser, dismax, and edismax so that it's not as 
difficult to figure out which is which and how they compare. It doesn't 
cover non-committed query parsers, but does include surround, et al, all 
with lots of examples, and includes all the query-related parameters (except 
groups like facets, highlight, grouping, stats, etc. that each have separate 
chapters), again with lots of examples.


The book:
http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957

-- Jack Krupansky

-Original Message- 
From: Jan Høydahl

Sent: Monday, May 06, 2013 9:20 AM
To: solr-user@lucene.apache.org
Subject: List of Solr Query Parsers

Hi,

I just added a Wiki page to try to gather a list of all known Solr query 
parsers in one place, both those which are part of Solr and those in JIRA or 
3rd party.


 http://wiki.apache.org/solr/QueryParser

If you known about other cool parsers out there, please add to the list.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com 



Re: Indexing off of the production servers

2013-05-06 Thread Erick Erickson
bq:  I thought I had heard of master/slave hierarchy in 3.x that would
allow us to designate a master to do indexing and let the slaves pull
finished indexes from the master, so I thought maybe something like that
followed into solr cloud.

You can still do this in Solr4 if you choose, but not in cloud mode. The
tradeoff is that you sacrifice the automatic fail-over etc if you use Solr4
in non-cloud mode. But in non-cloud mode it's just like 3.x in this
respect.

You could, in fact, take total control of this via HTTP commands, see:
http://wiki.apache.org/solr/SolrReplication#HTTP_API
So you can just turn replication completely off on your master, do your
indexing, then turn replication back on via HTTP commands. You lose
the automatic sharding (i.e you have to take care to send the docs to
the right shards) and you lose the automatic fail-over etc from SolrCloud.
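
(Illustratively, with the stock ReplicationHandler mapped at /replication, those commands
look like http://master:8983/solr/corename/replication?command=disablereplication and the
matching command=enablereplication; the host and core name are placeholders.)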

Otherwise, Upayavira's comments might be where you want to go

FWIW,
Erick

On Mon, May 6, 2013 at 8:37 AM, David Parks  wrote:
> I'm less concerned with fully utilizing a hadoop cluster (due to having
> fewer shards than I have hadoop reduce slots) as I am with just off-loading
> the whole indexing process. We may just want to re-index the whole thing to
> add some index time boosts or whatever else we conjure up to make queries
> faster and better quality. We're doing a lot of work on optimization right
> now.
>
> To re-index the whole thing is a 5-10 hour process for us, so when we move
> some update to production that requires full re-indexing (every week or so),
> right now we're just re-building new instances of solr to handle the
> re-indexing and then copying the final VMs to the production environment
> (slow process). I'm leery of letting a heavy duty full re-index process
> loose for 10 hours on production on a regular basis.
>
> It doesn't sound like there are any pre-built processes for doing this now
> though. I thought I had heard of master/slave hierarchy in 3.x that would
> allow us to designate a master to do indexing and let the slaves pull
> finished indexes from the master, so I thought maybe something like that
> followed into solr cloud. Eric might be right in that it's not worth the
> effort if there isn't some existing strategy.
>
> Dave
>
>
> -Original Message-
> From: Furkan KAMACI [mailto:furkankam...@gmail.com]
> Sent: Monday, May 06, 2013 7:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing off of the production servers
>
> Hi Erick;
>
> I think that even if you use Map/Reduce you will not parallelize your
> indexing any further, because indexing parallelizes only as much as the number of
> leaders you have in your SolrCloud, doesn't it?
>
> 2013/5/6 Erick Erickson 
>
>> The only problem with using Hadoop (or whatever) is that you need to
>> be sure that documents end up on the same shard, which means that you
>> have to use the same routing mechanism that SolrCloud uses. The custom
>> doc routing may help here
>>
>> My very first question, though, would be whether this is necessary.
>> It might be sufficient to just throttle the rate of indexing, or just
>> do the indexing during off hours or Have you measured an indexing
>> degradation during your heavy indexing? Indexing has costs, no
>> question, but it's worth asking whether the costs are heavy enough to
>> be worth the bother..
>>
>> Best
>> Erick
>>
>> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI 
>> wrote:
>> > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
>> > use Map/Reduce jobs you split your workload, process it, and then
>> > reduce step takes into account. Let me explain you new SolrCloud
>> > architecture. You start your SolrCluoud with a numShards parameter.
>> > Let's assume that you have 5 shards. Then you will have 5 leader at
>> > your SolrCloud. These
>> leaders
>> > will be responsible for indexing your data. It means that your
>> > indexing workload will divided into 5 so it means that you have
>> > parallelized your data as like Map/Reduce jobs.
>> >
>> > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
>> > They will be added as a replica for each shard. Then you will have 5
>> > shards, 5 leaders of them and every shard has 2 replica. When you
>> > send a query into a SolrCloud every replica will help you for
>> > searching and if
>> you
>> > add more replicas to your SolrCloud your search performance will
> improve.
>> >
>> >
>> > 2013/5/6 David Parks 
>> >
>> >> I've had trouble figuring out what options exist if I want to
>> >> perform
>> all
>> >> indexing off of the production servers (I'd like to keep them only
>> >> for
>> user
>> >> queries).
>> >>
>> >>
>> >>
>> >> We index data in batches roughly daily, ideally I'd index all solr
>> >> cloud shards offline, then move the final index files to the solr
>> >> cloud
>> instance
>> >> that needs it and flip a switch and have it use the new index.
>> >>
>> >>
>> >>
>> >> Is this possible via either:
>> >>
>> >> 1.   Doing the

List of Solr Query Parsers

2013-05-06 Thread Jan Høydahl
Hi,

I just added a Wiki page to try to gather a list of all known Solr query 
parsers in one place, both those which are part of Solr and those in JIRA
or from third parties.

  http://wiki.apache.org/solr/QueryParser

If you know about other cool parsers out there, please add them to the list.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: Indexing off of the production servers

2013-05-06 Thread Erick Erickson
bq:  Your data will be indexed by shard leaders while your replicas
are responsible for querying.

This is not true in SolrCloud mode. When you send a document
to Solr, upon return that document has been sent to every replica
for the appropriate shard and entered in the transaction log. It is
indexed on every node for a given shard.

In SolrCloud, there isn't much distinction between leaders and
replicas. A leader is just a replica with a few additional responsibilities.
One of those responsibilities is ensuring that docs with the same
ID sent to several nodes at once are resolved appropriately, which is
why the leader gets the updates forwarded to it. But from that point,
the doc is sent to every replica associated with that leader (shard)
and indexed there.

The bits about SolrJ being "leader aware" are partly in place, but
currently the docs are sent to _a_ leader, not necessarily the
leader of the shard they will eventually end up on. That's on the
roadmap, but not there yet.

FWIW,
Erick

On Mon, May 6, 2013 at 9:03 AM, Furkan KAMACI  wrote:
> Hi Dave;
>
> I think that when you do indexing you can use CloudSolrServer so you can
> learn from Zookeeper that where you data will go and then send your data to
> there. This will speed up you when indexing and gives benefit of
> Map/Reduce. Your data will be indexed by shard leaders while your replicas
> are responsible for querying. Also even if you are not satisfied with you
> query performance you can add more replica. If you want to improve your
> indexing you can define more shards at your system (beginning with Solr 4.3
> shard splitting will be a new feature for Solr.)
>
> 2013/5/6 David Parks 
>
>> I'm less concerned with fully utilizing a hadoop cluster (due to having
>> fewer shards than I have hadoop reduce slots) as I am with just off-loading
>> the whole indexing process. We may just want to re-index the whole thing to
>> add some index time boosts or whatever else we conjure up to make queries
>> faster and better quality. We're doing a lot of work on optimization right
>> now.
>>
>> To re-index the whole thing is a 5-10 hour process for us, so when we move
>> some update to production that requires full re-indexing (every week or
>> so),
>> right now we're just re-building new instances of solr to handle the
>> re-indexing and then copying the final VMs to the production environment
>> (slow process). I'm leery of letting a heavy duty full re-index process
>> loose for 10 hours on production on a regular basis.
>>
>> It doesn't sound like there are any pre-built processes for doing this now
>> though. I thought I had heard of master/slave hierarchy in 3.x that would
>> allow us to designate a master to do indexing and let the slaves pull
>> finished indexes from the master, so I thought maybe something like that
>> followed into solr cloud. Eric might be right in that it's not worth the
>> effort if there isn't some existing strategy.
>>
>> Dave
>>
>>
>> -Original Message-
>> From: Furkan KAMACI [mailto:furkankam...@gmail.com]
>> Sent: Monday, May 06, 2013 7:06 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing off of the production servers
>>
>> Hi Erick;
>>
>> I think that even if you use Map/Reduce you will not parallelize you
>> indexing because indexing will parallelize as much as how many leaders you
>> have at your SolrCloud, isn't it?
>>
>> 2013/5/6 Erick Erickson 
>>
>> > The only problem with using Hadoop (or whatever) is that you need to
>> > be sure that documents end up on the same shard, which means that you
>> > have to use the same routing mechanism that SolrCloud uses. The custom
>> > doc routing may help here
>> >
>> > My very first question, though, would be whether this is necessary.
>> > It might be sufficient to just throttle the rate of indexing, or just
>> > do the indexing during off hours or Have you measured an indexing
>> > degradation during your heavy indexing? Indexing has costs, no
>> > question, but it's worth asking whether the costs are heavy enough to
>> > be worth the bother..
>> >
>> > Best
>> > Erick
>> >
>> > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI 
>> > wrote:
>> > > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
>> > > use Map/Reduce jobs you split your workload, process it, and then
>> > > reduce step takes into account. Let me explain you new SolrCloud
>> > > architecture. You start your SolrCluoud with a numShards parameter.
>> > > Let's assume that you have 5 shards. Then you will have 5 leader at
>> > > your SolrCloud. These
>> > leaders
>> > > will be responsible for indexing your data. It means that your
>> > > indexing workload will divided into 5 so it means that you have
>> > > parallelized your data as like Map/Reduce jobs.
>> > >
>> > > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
>> > > They will be added as a replica for each shard. Then you will have 5
>> > > shards, 5 leaders of them and every shard has 2

[ANNOUNCE] Apache Solr 4.3 released

2013-05-06 Thread Simon Willnauer
May 2013, Apache Solr™ 4.3 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.3.

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.3 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.3.0 Release Highlights:

* Tired of maintaining core information in solr.xml? Now you can configure
  Solr to automatically find cores by walking an arbitrary directory.

* Shard Splitting: You can now split SolrCloud shards to expand your cluster as
  you grow.

* The read side schema REST API has been improved and expanded upon: all schema
  information is now available and the full live schema can now be returned in
  json or xml.  Ground work is included for the upcoming write side of the
  schema REST API.

* Spatial queries can now search for indexed shapes by "IsWithin",
"Contains" and
  "IsDisjointTo" relationships, in addition to typical "Intersects".

* Faceting now supports local parameters for faceting on the same field with
  different options.

* Significant performance improvements for minShouldMatch (mm) queries due to
  skipping resulting in up to 4000% faster queries.

* Various new highlighting configuration parameters.

* A new solr.xml format that is closer to that of solrconfig.xml. The example
  still uses the old format, but 4.4 will ship with the new format.

* Lucene 4.3.0 bug fixes and optimizations.

Solr 4.3.0 also includes many other new features as well as numerous
optimizations and bugfixes.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

Happy searching,
Lucene/Solr developers


Re: Indexing off of the production servers

2013-05-06 Thread Upayavira
In non-SolrCloud mode, you can index to another core, and then swap
cores. You could index on another box, ship the index files to your
production server, create a core pointing at these files, then swap this
core with the original one.
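
A rough sketch of that last step, with hypothetical core names "live" and
"rebuilt" (untested; adjust paths and the CoreAdmin URL to your layout):

  http://prod:8983/solr/admin/cores?action=CREATE&name=rebuilt&instanceDir=rebuilt&dataDir=/path/to/shipped/index
  http://prod:8983/solr/admin/cores?action=SWAP&core=live&other=rebuilt

After the SWAP, queries against "live" are served by the freshly shipped
index, and the old one is still available under "rebuilt" if you need to
roll back.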

If you can tell your search app to switch to using a different
collection, you could achieve what you want with solrcloud.

You index to a different collection, which is running on a different set
of SolrCloud nodes from your production search. Once indexing is
complete, you create cores on your production boxes for this new
collection. Once indexes have synced, you can switch your app to use
this new collection, thus publishing your new index. You can then delete
the cores on the boxes you were using for indexing.
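
If your version has collection aliases (added around Solr 4.2), the switch
can also be done without reconfiguring the search app: point the app at an
alias and re-point the alias after each rebuild, something like this (a
sketch with made-up names):

  http://prod:8983/solr/admin/collections?action=CREATEALIAS&name=search&collections=products_v2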

Now, that's not transparent, but would be do-able.

Upayavira

On Mon, May 6, 2013, at 01:37 PM, David Parks wrote:
> I'm less concerned with fully utilizing a hadoop cluster (due to having
> fewer shards than I have hadoop reduce slots) as I am with just
> off-loading
> the whole indexing process. We may just want to re-index the whole thing
> to
> add some index time boosts or whatever else we conjure up to make queries
> faster and better quality. We're doing a lot of work on optimization
> right
> now.
> 
> To re-index the whole thing is a 5-10 hour process for us, so when we
> move
> some update to production that requires full re-indexing (every week or
> so),
> right now we're just re-building new instances of solr to handle the
> re-indexing and then copying the final VMs to the production environment
> (slow process). I'm leery of letting a heavy duty full re-index process
> loose for 10 hours on production on a regular basis.
> 
> It doesn't sound like there are any pre-built processes for doing this
> now
> though. I thought I had heard of master/slave hierarchy in 3.x that would
> allow us to designate a master to do indexing and let the slaves pull
> finished indexes from the master, so I thought maybe something like that
> followed into solr cloud. Eric might be right in that it's not worth the
> effort if there isn't some existing strategy.
> 
> Dave
> 
> 
> -Original Message-
> From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
> Sent: Monday, May 06, 2013 7:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing off of the production servers
> 
> Hi Erick;
> 
> I think that even if you use Map/Reduce you will not parallelize you
> indexing because indexing will parallelize as much as how many leaders
> you
> have at your SolrCloud, isn't it?
> 
> 2013/5/6 Erick Erickson 
> 
> > The only problem with using Hadoop (or whatever) is that you need to 
> > be sure that documents end up on the same shard, which means that you 
> > have to use the same routing mechanism that SolrCloud uses. The custom 
> > doc routing may help here
> >
> > My very first question, though, would be whether this is necessary.
> > It might be sufficient to just throttle the rate of indexing, or just 
> > do the indexing during off hours or Have you measured an indexing 
> > degradation during your heavy indexing? Indexing has costs, no 
> > question, but it's worth asking whether the costs are heavy enough to 
> > be worth the bother..
> >
> > Best
> > Erick
> >
> > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI 
> > wrote:
> > > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you 
> > > use Map/Reduce jobs you split your workload, process it, and then 
> > > reduce step takes into account. Let me explain you new SolrCloud 
> > > architecture. You start your SolrCluoud with a numShards parameter. 
> > > Let's assume that you have 5 shards. Then you will have 5 leader at 
> > > your SolrCloud. These
> > leaders
> > > will be responsible for indexing your data. It means that your 
> > > indexing workload will divided into 5 so it means that you have 
> > > parallelized your data as like Map/Reduce jobs.
> > >
> > > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
> > > They will be added as a replica for each shard. Then you will have 5 
> > > shards, 5 leaders of them and every shard has 2 replica. When you 
> > > send a query into a SolrCloud every replica will help you for 
> > > searching and if
> > you
> > > add more replicas to your SolrCloud your search performance will
> improve.
> > >
> > >
> > > 2013/5/6 David Parks 
> > >
> > >> I've had trouble figuring out what options exist if I want to 
> > >> perform
> > all
> > >> indexing off of the production servers (I'd like to keep them only 
> > >> for
> > user
> > >> queries).
> > >>
> > >>
> > >>
> > >> We index data in batches roughly daily, ideally I'd index all solr 
> > >> cloud shards offline, then move the final index files to the solr 
> > >> cloud
> > instance
> > >> that needs it and flip a switch and have it use the new index.
> > >>
> > >>
> > >>
> > >> Is this possible via either:
> > >>
> > >> 1.   Doing the indexing in Hadoop?? (this wou

Re: Indexing off of the production servers

2013-05-06 Thread Furkan KAMACI
Hi Dave;

I think that when you do indexing you can use CloudSolrServer, so you can
learn from ZooKeeper where your data will go and then send it directly
there. This will speed up your indexing and gives you a Map/Reduce-like
benefit. Your data will be indexed by the shard leaders while your replicas
are responsible for querying. Also, even if you are not satisfied with your
query performance, you can add more replicas. If you want to improve your
indexing you can define more shards in your system (beginning with Solr 4.3,
shard splitting is a new feature of Solr).

2013/5/6 David Parks 

> I'm less concerned with fully utilizing a hadoop cluster (due to having
> fewer shards than I have hadoop reduce slots) as I am with just off-loading
> the whole indexing process. We may just want to re-index the whole thing to
> add some index time boosts or whatever else we conjure up to make queries
> faster and better quality. We're doing a lot of work on optimization right
> now.
>
> To re-index the whole thing is a 5-10 hour process for us, so when we move
> some update to production that requires full re-indexing (every week or
> so),
> right now we're just re-building new instances of solr to handle the
> re-indexing and then copying the final VMs to the production environment
> (slow process). I'm leery of letting a heavy duty full re-index process
> loose for 10 hours on production on a regular basis.
>
> It doesn't sound like there are any pre-built processes for doing this now
> though. I thought I had heard of master/slave hierarchy in 3.x that would
> allow us to designate a master to do indexing and let the slaves pull
> finished indexes from the master, so I thought maybe something like that
> followed into solr cloud. Eric might be right in that it's not worth the
> effort if there isn't some existing strategy.
>
> Dave
>
>
> -Original Message-
> From: Furkan KAMACI [mailto:furkankam...@gmail.com]
> Sent: Monday, May 06, 2013 7:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing off of the production servers
>
> Hi Erick;
>
> I think that even if you use Map/Reduce you will not parallelize you
> indexing because indexing will parallelize as much as how many leaders you
> have at your SolrCloud, isn't it?
>
> 2013/5/6 Erick Erickson 
>
> > The only problem with using Hadoop (or whatever) is that you need to
> > be sure that documents end up on the same shard, which means that you
> > have to use the same routing mechanism that SolrCloud uses. The custom
> > doc routing may help here
> >
> > My very first question, though, would be whether this is necessary.
> > It might be sufficient to just throttle the rate of indexing, or just
> > do the indexing during off hours or Have you measured an indexing
> > degradation during your heavy indexing? Indexing has costs, no
> > question, but it's worth asking whether the costs are heavy enough to
> > be worth the bother..
> >
> > Best
> > Erick
> >
> > On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI 
> > wrote:
> > > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you
> > > use Map/Reduce jobs you split your workload, process it, and then
> > > reduce step takes into account. Let me explain you new SolrCloud
> > > architecture. You start your SolrCluoud with a numShards parameter.
> > > Let's assume that you have 5 shards. Then you will have 5 leader at
> > > your SolrCloud. These
> > leaders
> > > will be responsible for indexing your data. It means that your
> > > indexing workload will divided into 5 so it means that you have
> > > parallelized your data as like Map/Reduce jobs.
> > >
> > > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
> > > They will be added as a replica for each shard. Then you will have 5
> > > shards, 5 leaders of them and every shard has 2 replica. When you
> > > send a query into a SolrCloud every replica will help you for
> > > searching and if
> > you
> > > add more replicas to your SolrCloud your search performance will
> improve.
> > >
> > >
> > > 2013/5/6 David Parks 
> > >
> > >> I've had trouble figuring out what options exist if I want to
> > >> perform
> > all
> > >> indexing off of the production servers (I'd like to keep them only
> > >> for
> > user
> > >> queries).
> > >>
> > >>
> > >>
> > >> We index data in batches roughly daily, ideally I'd index all solr
> > >> cloud shards offline, then move the final index files to the solr
> > >> cloud
> > instance
> > >> that needs it and flip a switch and have it use the new index.
> > >>
> > >>
> > >>
> > >> Is this possible via either:
> > >>
> > >> 1.   Doing the indexing in Hadoop?? (this would be ideal as we
> have
> > a
> > >> significant investment in a hadoop cluster already), or
> > >>
> > >> 2.   Maintaining a separate "master" server that handles indexing
> > and
> > >> the nodes that receive user queries update their index from there
> > >> (I
> > seem
> > >> to
> > >> recall reading about this co

replication between solr 3.1 and 4.x

2013-05-06 Thread elrond
Is it possible to replicate Solr between different versions? (In my case
between 3.1 (master) and 4.x (slave).)

All i get is:

May 06, 2013 2:28:00 PM org.apache.solr.handler.SnapPuller fetchFileList
SEVERE: No files to download for index generation: 3





--
View this message in context: 
http://lucene.472066.n3.nabble.com/replication-between-solr-3-1-and-4-x-tp4061053.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Indexing off of the production servers

2013-05-06 Thread David Parks
I'm less concerned with fully utilizing a hadoop cluster (due to having
fewer shards than I have hadoop reduce slots) as I am with just off-loading
the whole indexing process. We may just want to re-index the whole thing to
add some index time boosts or whatever else we conjure up to make queries
faster and better quality. We're doing a lot of work on optimization right
now.

To re-index the whole thing is a 5-10 hour process for us, so when we move
some update to production that requires full re-indexing (every week or so),
right now we're just re-building new instances of solr to handle the
re-indexing and then copying the final VMs to the production environment
(slow process). I'm leery of letting a heavy duty full re-index process
loose for 10 hours on production on a regular basis.

It doesn't sound like there are any pre-built processes for doing this now
though. I thought I had heard of master/slave hierarchy in 3.x that would
allow us to designate a master to do indexing and let the slaves pull
finished indexes from the master, so I thought maybe something like that
followed into solr cloud. Eric might be right in that it's not worth the
effort if there isn't some existing strategy.

Dave


-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, May 06, 2013 7:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing off of the production servers

Hi Erick;

I think that even if you use Map/Reduce you will not parallelize you
indexing because indexing will parallelize as much as how many leaders you
have at your SolrCloud, isn't it?

2013/5/6 Erick Erickson 

> The only problem with using Hadoop (or whatever) is that you need to 
> be sure that documents end up on the same shard, which means that you 
> have to use the same routing mechanism that SolrCloud uses. The custom 
> doc routing may help here
>
> My very first question, though, would be whether this is necessary.
> It might be sufficient to just throttle the rate of indexing, or just 
> do the indexing during off hours or Have you measured an indexing 
> degradation during your heavy indexing? Indexing has costs, no 
> question, but it's worth asking whether the costs are heavy enough to 
> be worth the bother..
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI 
> wrote:
> > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you 
> > use Map/Reduce jobs you split your workload, process it, and then 
> > reduce step takes into account. Let me explain you new SolrCloud 
> > architecture. You start your SolrCluoud with a numShards parameter. 
> > Let's assume that you have 5 shards. Then you will have 5 leader at 
> > your SolrCloud. These
> leaders
> > will be responsible for indexing your data. It means that your 
> > indexing workload will divided into 5 so it means that you have 
> > parallelized your data as like Map/Reduce jobs.
> >
> > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
> > They will be added as a replica for each shard. Then you will have 5 
> > shards, 5 leaders of them and every shard has 2 replica. When you 
> > send a query into a SolrCloud every replica will help you for 
> > searching and if
> you
> > add more replicas to your SolrCloud your search performance will
improve.
> >
> >
> > 2013/5/6 David Parks 
> >
> >> I've had trouble figuring out what options exist if I want to 
> >> perform
> all
> >> indexing off of the production servers (I'd like to keep them only 
> >> for
> user
> >> queries).
> >>
> >>
> >>
> >> We index data in batches roughly daily, ideally I'd index all solr 
> >> cloud shards offline, then move the final index files to the solr 
> >> cloud
> instance
> >> that needs it and flip a switch and have it use the new index.
> >>
> >>
> >>
> >> Is this possible via either:
> >>
> >> 1.   Doing the indexing in Hadoop?? (this would be ideal as we have
> a
> >> significant investment in a hadoop cluster already), or
> >>
> >> 2.   Maintaining a separate "master" server that handles indexing
> and
> >> the nodes that receive user queries update their index from there 
> >> (I
> seem
> >> to
> >> recall reading about this configuration in 3.x, but now we're using 
> >> solr
> >> cloud)
> >>
> >>
> >>
> >> Is there some ideal solution I can use to "protect" the production 
> >> solr instances from degraded performance during large index 
> >> processing
> periods?
> >>
> >>
> >>
> >> Thanks!
> >>
> >> David
> >>
> >>
>



Re: Memory problems with HttpSolrServer

2013-05-06 Thread Andre Bois-Crettez

On 05/06/2013 09:32 AM, Rogowski, Britta wrote:

Hi!

When I write from our database to a HttpSolrServer, (using a 
LinkedBlockingQueue to write just one document at a time), I run into memory 
problems (due to various constraints, I have to remain on a 32-bit system, so I 
can use at most 2 GB RAM).

If I use an EmbeddedSolrServer (to write locally), I have no such problems. 
Just now, I tried out ConcurrentUpdateSolrServer (with a queue size of 1, but 3 
threads to be safe), and that worked out fine too. I played around with various 
GC options and monitored memory with jconsole and jmap, but only found out that 
there's lots of byte arrays, SolrInputFields and Strings hanging around.

Since ConcurrentUpdateSolrServer works, I'm happy, but I was wondering if 
people were aware of the memory issue around HttpSolrServer.

Regards,

Britta Rogowski

We are not memory constrained so we cannot confirm the problem with
HttpSolrServer, but how often do you commit?
Having an autocommit set to a few minutes may help reduce memory usage
during indexing.
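
Something like this in the updateHandler section of solrconfig.xml would do
that (a sketch; tune the interval to your data volume):

  <autoCommit>
    <maxTime>300000</maxTime>          <!-- hard commit every 5 minutes -->
    <openSearcher>false</openSearcher> <!-- keep the current searcher open -->
  </autoCommit>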

Is the memory usage on the Solr server side, or in your feeder code ?

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Is indexing large documents still an issue?

2013-05-06 Thread Bai Shen
You can still use highlighting without returning the content.  Just set
content as your alternate highlight field.   Then if no highlights are
returned you will receive the content.  Make sure you set a character limit
so you don't get the whole thing.  I use 300.

Does that make sense?  This is what I add to my query string.

"&hl=true&hl.fl=content&hl.snippets=3&hl.alternateField=content&hl.maxAlternateFieldLength=300"


On Thu, May 2, 2013 at 7:32 AM, adfel70  wrote:

> Well, returning the content field for highlighting is within my
> requirements.
> Did you solve this in some other way? or you just didn't have to?
>
>
> Bai Shen wrote
> > The only issue I ran into was returning the content field.  Once I
> > modified
> > my query to avoid that, I got good performance.
> >
> > Admittedly, I only have about 15-20k documents in my index ATM, but most
> > of
> > them are in the multiMB range with a current max of 250MB.
> >
> >
> > On Thu, May 2, 2013 at 7:05 AM, adfel70 <
>
> > adfel70@
>
> > > wrote:
> >
> >> Hi,
> >> In previous versions of solr, indexing documents with large fields
> caused
> >> performance degradation.
> >>
> >> Is this still the case in solr 4.2?
> >>
> >> If so, and I'll need to chunk the document and index many document
> parts,
> >> can anyony give a general idea of what field/document size solr CAN
> >> handle?
> >>
> >> thanks.
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Is-indexing-large-documents-still-an-issue-tp4060425.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-indexing-large-documents-still-an-issue-tp4060425p4060431.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: iterate through each document in Solr

2013-05-06 Thread Andre Bois-Crettez

On 05/06/2013 06:03 AM, Michael Sokolov wrote:

On 5/5/13 7:48 PM, Mingfeng Yang wrote:

Dear Solr Users,

Does anyone know what is the best way to iterate through each document in a
Solr index with billion entries?

I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
and then change start value, but it got very slow after getting through
about 10 million docs.

Thanks,
Ming-


You need to use a unique and stable sort key and get documents > sortkey.
For example, if you have a unique key, retrieve documents ordered by the
unique key, and for each batch get documents > max(key) from the previous
batch.

-Mike


There are more details on the wiki:
http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore
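
Concretely, the pattern Michael describes looks something like this (a
sketch, assuming the uniqueKey field is called "id" and sorts cleanly):

  first batch:
    /select?q=*:*&sort=id+asc&rows=500
  every later batch, where LAST_ID is the highest id from the previous one:
    /select?q=*:*&sort=id+asc&rows=500&fq=id:{LAST_ID TO *]

That keeps the cost per request roughly constant instead of degrading the
way a large start offset does.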


--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Log Monitor System for SolrCloud and Logging to log4j at SolrCloud?

2013-05-06 Thread Furkan KAMACI
Is there any roadmap for Solr? When will Solr 4.3 be tagged in svn?

2013/4/26 Mark Miller 

> Slf4j is meant to work with existing frameworks - you can set it up to
> work with log4j, and Solr will use log4j by default in the about to be
> released 4.3.
>
> http://wiki.apache.org/solr/SolrLogging
>
> - Mark
>
> On Apr 26, 2013, at 7:19 AM, Furkan KAMACI  wrote:
>
> > I want to use GrayLog2 to monitor my logging files for SolrCloud.
> However I
> > think that GrayLog2 works with log4j and logback. Solr uses slf4j.
> > How can I solve this problem and what logging monitoring system does
> folks
> > use?
>
>


Re: Indexing off of the production servers

2013-05-06 Thread Furkan KAMACI
Hi Erick;

I think that even if you use Map/Reduce you will not parallelize your
indexing any further, because indexing only parallelizes across as many
leaders as you have in your SolrCloud, doesn't it?

2013/5/6 Erick Erickson 

> The only problem with using Hadoop (or whatever) is that you
> need to be sure that documents end up on the same shard, which
> means that you have to use the same routing mechanism that
> SolrCloud uses. The custom doc routing may help here
>
> My very first question, though, would be whether this is necessary.
> It might be sufficient to just throttle the rate of indexing, or just do
> the
> indexing during off hours or Have you measured an indexing
> degradation during your heavy indexing? Indexing has costs, no
> question, but it's worth asking whether the costs are heavy enough
> to be worth the bother..
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI 
> wrote:
> > 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you use
> > Map/Reduce jobs you split your workload, process it, and then reduce step
> > takes into account. Let me explain you new SolrCloud architecture. You
> > start your SolrCluoud with a numShards parameter. Let's assume that you
> > have 5 shards. Then you will have 5 leader at your SolrCloud. These
> leaders
> > will be responsible for indexing your data. It means that your indexing
> > workload will divided into 5 so it means that you have parallelized your
> > data as like Map/Reduce jobs.
> >
> > Let's assume that you have added 10 new Solr nodes into your SolrCloud.
> > They will be added as a replica for each shard. Then you will have 5
> > shards, 5 leaders of them and every shard has 2 replica. When you send a
> > query into a SolrCloud every replica will help you for searching and if
> you
> > add more replicas to your SolrCloud your search performance will improve.
> >
> >
> > 2013/5/6 David Parks 
> >
> >> I've had trouble figuring out what options exist if I want to perform
> all
> >> indexing off of the production servers (I'd like to keep them only for
> user
> >> queries).
> >>
> >>
> >>
> >> We index data in batches roughly daily, ideally I'd index all solr cloud
> >> shards offline, then move the final index files to the solr cloud
> instance
> >> that needs it and flip a switch and have it use the new index.
> >>
> >>
> >>
> >> Is this possible via either:
> >>
> >> 1.   Doing the indexing in Hadoop?? (this would be ideal as we have
> a
> >> significant investment in a hadoop cluster already), or
> >>
> >> 2.   Maintaining a separate "master" server that handles indexing
> and
> >> the nodes that receive user queries update their index from there (I
> seem
> >> to
> >> recall reading about this configuration in 3.x, but now we're using solr
> >> cloud)
> >>
> >>
> >>
> >> Is there some ideal solution I can use to "protect" the production solr
> >> instances from degraded performance during large index processing
> periods?
> >>
> >>
> >>
> >> Thanks!
> >>
> >> David
> >>
> >>
>


Re: Scores dilemma after providing boosting with bq as same weigtage for 2 condition

2013-05-06 Thread Erick Erickson
Try adding &debugQuery=true to your query, the resulting
data will show you exactly how the doc score is calculated.

Warning: reading the explain can be a bit challenging, but
that's the only way to really understand why docs scored as
they did.
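
In other words, take the query from your mail and append &debugQuery=true,
e.g.:

  http://localhost:8080/solr?rows=900&fq=(articleTopic:"Food" OR
  articleTopic:"Office")&bq=(articleTopic:"Food" OR
  articleTopic:"Office")^1.2&fl=title,description,documentId,score&debugQuery=true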

Best
Erick

On Mon, May 6, 2013 at 7:33 AM, nishi  wrote:
> While giving bq with same weightage as ^1.2 below on two values of
> articleTopic, the result always coming all of the "Office" on the top
> somehow. What other factors would influence at this scenario when there is
> no keyword search also?
>
> http://localhost:8080/solr?rows=900&fq=(articleTopic:"Food" OR
> articleTopic:"Office")&bq=(articleTopic:"Food" OR
> articleTopic:"Office")^1.2&fl=title,description,documentId,score&sort=score
> desc
>
> Also with other fields adding at bq, still the results favoured somehow to
> "Office" one:
> http://localhost:8080/solr?rows=900&fq=(articleTopic:"Food" OR
> articleTopic:"Office")&bq=(articleSrc:"News" OR articleSrc:"Blog")^0.5 OR
> (articleTopic:"Food" OR
> articleTopic:"Office")^1.2&fl=title,description,documentId,score&sort=score
> desc
>
> Please advice what might be the reason/factors which doesn't balance the
> result with both articleTopic and instead always favored the result with the
> "Food" somehow in the score.
>
> Thanks in advance
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Scores-dilemma-after-providing-boosting-with-bq-as-same-weigtage-for-2-condition-tp4061035.html
> Sent from the Solr - User mailing list archive at Nabble.com.


query regarding the multiple documents

2013-05-06 Thread Rohan Thakur
hi all

I have indexed documents in Solr for search purposes, and now, for
auto-suggestion, I want to index new data: the popular query terms searched
by users on the website, along with their frequencies. Since this data has
no relation to the product data on which I have built the search, can I
index this new table in the same data-config.xml within a new document tag?
If so, how do I configure the request handler for auto-suggestion so that it
searches only this new index and not the main search documents?

thanks
regards
rohan


Re: solr adding unique values

2013-05-06 Thread Erick Erickson
Depends on your goal here. I'm guessing you're using
atomic updates, in which case you need to use "set"
rather than "add" as the former replaces the contents.
See: http://wiki.apache.org/solr/UpdateJSON#Solr_4.0_Example
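
With the field names from your example, something like this (a sketch)
replaces the whole field rather than appending to it:

  {"id": "a", "lists": {"set": ["list_a"]}}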

If you're simply re-indexing the documents, just send the entire
fresh document to solr and it'll replace the earlier document
completely.

Best
Erick

On Mon, May 6, 2013 at 4:14 AM, Nikhil Kumar  wrote:
> Hey,
>I have recently started using solr, I have a list of users, which are
> subscribed to some lists.
> eg.
> user a[
> id:a
> liists[
>  list_a
>]
> ]
> user b[
>id:b
> liists[
>  list_a
>]
> ]
> I am using {"id": a, "lists":{"add":"list_a"}} to add particular list a
> user.
> but what is happening if I use the same command again, it again adds the
> same list, which i want to avoid.
> user a[
> id:a
> liists[
>  list_a,
>  list_a
>]
> ]
> I searched the documentation and tutorials, i found
>
>-
>
>overwrite = "true" | "false" — default is "true", meaning newer
>documents will replace previously added documents with the same uniqueKey.
>-
>
>commitWithin = "(milliseconds)" if the "commitWithin" attribute is
>present, the document will be added within that time. [image: ]
>Solr1.4 . See
> CommitWithin
>-
>
>(deprecated) allowDups = "true" | "false" — default is "false"
>-
>
>(deprecated) overwritePending = "true" | "false" — default is negation
>of allowDups
>-
>
>(deprecated) overwriteCommitted = "true"|"false" — default is negation
>of allowDups
>
>
>but using overwrite and allowDups didn't solve the problem either, seems
>because there is no unique id but just value.
>
>So the question is how to solve this problem?
>
> --
> Thank You and Regards,
> Nikhil Kumar
> +91-9916343619
> Technical Analyst
> Hashed In Technologies Pvt. Ltd.


Re: Duplicated Documents Across shards

2013-05-06 Thread Iker Mtnz. Apellaniz
Thank you very much Erick,
  That was the real problem: we had two cores sharing the same folder and
core_name. Here is the definitive version of the solr.xml, tested and
working correctly.




Thanks everybody
 Iker


2013/5/6 Erick Erickson 

> Having multiple cores point to the same index is, except for
> special circumstances where one of the cores is guaranteed to
> be read only, a Bad Thing.
>
> So it sounds like you've found your issue...
>
> Best
> Erick
>
> On Mon, May 6, 2013 at 4:44 AM, Iker Mtnz. Apellaniz
>  wrote:
> > Thanks Erick,
> >   I think we found the problem. When defining the cores for both shards
> we
> > define both of them in the same instanceDir, like this:
> >  > name="1_collection" config="solrconfig.xml" collection="1_collection"/>
> >  > name="1_collection" config="solrconfig.xml" collection="1_collection"/>
> >
> >   Each shard should have its own folder, so the final configuration
> should
> > be like this:
> > <core instanceDir="1_collection/shard2/"
> > name="1_collection" config="solrconfig.xml" collection="1_collection"/>
> > <core instanceDir="1_collection/shard4/"
> > name="1_collection" config="solrconfig.xml" collection="1_collection"/>
> >
> > Can anyone confirm this?
> >
> > Thanks,
> >   Iker
> >
> >
> > 2013/5/4 Erick Erickson 
> >
> >> Sounds like you've explicitly routed the same document to two
> >> different shards. Document replacement only happens locally to a
> >> shard, so the fact that you have documents with the same ID on two
> >> different shards is why you're getting duplicate documents.
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, May 3, 2013 at 3:44 PM, Iker Mtnz. Apellaniz
> >>  wrote:
> >> > We are currently using version 4.2.
> >> > We have made tests with a single document and it gives us a 2 document
> >> > count. But if we force to shard into te first machine, the one with a
> >> > unique shard, the count gives us 1 document.
> >> > I've tried using distrib=false parameter, it gives us no duplicate
> >> > documents, but the same document appears to be in two different
> shards.
> >> >
> >> > Finally, about the separate directories, We have only one directory
> for
> >> the
> >> > data in each physical machine and collection, and I don't see any
> >> subfolder
> >> > for the different shards.
> >> >
> >> > Is it possible that we have something wrong with the dataDir
> >> configuration
> >> > to use multiple shards in one machine?
> >> >
> >> > <dataDir>${solr.data.dir:}</dataDir>
> >> > <directoryFactory name="DirectoryFactory"
> >> > class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
> >> >
> >> >
> >> >
> >> > 2013/5/3 Erick Erickson 
> >> >
> >> >> What version of Solr? The custom routing stuff is quite new so
> >> >> I'm guessing 4x?
> >> >>
> >> >> But this shouldn't be happening. The actual index data for the
> >> >> shards should be in separate directories, they just happen to
> >> >> be on the same physical machine.
> >> >>
> >> >> Try querying each one with &distrib=false to see the counts
> >> >> from single shards, that may shed some light on this. It vaguely
> >> >> sounds like you have indexed the same document to both shards
> >> >> somehow...
> >> >>
> >> >> Best
> >> >> Erick
> >> >>
> >> >> On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
> >> >>  wrote:
> >> >> > Hi,
> >> >> >   We have currently a solrCloud implementation running 5 shards in
> 3
> >> >> > physical machines, so the first machine will have the shard number
> 1,
> >> the
> >> >> > second machine shards 2 & 4, and the third shards 3 & 5. We noticed
> >> that
> >> >> > while queryng numFoundDocs decreased when we increased the start
> >> param.
> >> >> >   After some investigation we found that the documents in shards 2
> to
> >> 5
> >> >> > were being counted twice. Querying to shard 2 will give you back
> the
> >> >> > results for shard 2 & 4, and the same thing for shards 3 & 5. Our
> >> guess
> >> >> is
> >> >> > that the physical index for both shard 2&4 is shared, so the shards
> >> don't
> >> >> > know which part of it is for each one.
> >> >> >   The uniqueKey is correctly defined, and we have tried using shard
> >> >> prefix
> >> >> > (shard1!docID).
> >> >> >
> >> >> >   Is there any way to solve this problem when a unique physical
> >> machine
> >> >> > shares shards?
> >> >> >   Is it a "real" problem os it just affects facet & numResults?
> >> >> >
> >> >> > Thanks
> >> >> >Iker
> >> >> >
> >> >> > --
> >> >> > /** @author imartinez*/
> >> >> > Person me = *new* Developer();
> >> >> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
> >> >> > me.setTwit("@mitxino77 ");
> >> >> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*,
> >> >> World"]});
> >> >> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
> >> >> > me.setWebs({*urbasaabentura.com, ikertxef.com*});
> >> >> > *return* me;
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > /** @author imartinez*/
> >> > Person me = *new* Developer();
> >> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
>

Re: Indexing off of the production servers

2013-05-06 Thread Erick Erickson
The only problem with using Hadoop (or whatever) is that you
need to be sure that documents end up on the same shard, which
means that you have to use the same routing mechanism that
SolrCloud uses. The custom doc routing may help here

My very first question, though, would be whether this is necessary.
It might be sufficient to just throttle the rate of indexing, or just do the
indexing during off hours or Have you measured an indexing
degradation during your heavy indexing? Indexing has costs, no
question, but it's worth asking whether the costs are heavy enough
to be worth the bother..

Best
Erick

On Mon, May 6, 2013 at 5:04 AM, Furkan KAMACI  wrote:
> 1-2) Your aim for using Hadoop is probably Map/Reduce jobs. When you use
> Map/Reduce jobs you split your workload, process it, and then reduce step
> takes into account. Let me explain you new SolrCloud architecture. You
> start your SolrCluoud with a numShards parameter. Let's assume that you
> have 5 shards. Then you will have 5 leader at your SolrCloud. These leaders
> will be responsible for indexing your data. It means that your indexing
> workload will divided into 5 so it means that you have parallelized your
> data as like Map/Reduce jobs.
>
> Let's assume that you have added 10 new Solr nodes into your SolrCloud.
> They will be added as a replica for each shard. Then you will have 5
> shards, 5 leaders of them and every shard has 2 replica. When you send a
> query into a SolrCloud every replica will help you for searching and if you
> add more replicas to your SolrCloud your search performance will improve.
>
>
> 2013/5/6 David Parks 
>
>> I've had trouble figuring out what options exist if I want to perform all
>> indexing off of the production servers (I'd like to keep them only for user
>> queries).
>>
>>
>>
>> We index data in batches roughly daily, ideally I'd index all solr cloud
>> shards offline, then move the final index files to the solr cloud instance
>> that needs it and flip a switch and have it use the new index.
>>
>>
>>
>> Is this possible via either:
>>
>> 1.   Doing the indexing in Hadoop?? (this would be ideal as we have a
>> significant investment in a hadoop cluster already), or
>>
>> 2.   Maintaining a separate "master" server that handles indexing and
>> the nodes that receive user queries update their index from there (I seem
>> to
>> recall reading about this configuration in 3.x, but now we're using solr
>> cloud)
>>
>>
>>
>> Is there some ideal solution I can use to "protect" the production solr
>> instances from degraded performance during large index processing periods?
>>
>>
>>
>> Thanks!
>>
>> David
>>
>>


Re: Duplicated Documents Across shards

2013-05-06 Thread Erick Erickson
Having multiple cores point to the same index is, except for
special circumstances where one of the cores is guaranteed to
be read only, a Bad Thing.

So it sounds like you've found your issue...

Best
Erick

On Mon, May 6, 2013 at 4:44 AM, Iker Mtnz. Apellaniz
 wrote:
> Thanks Erick,
>   I think we found the problem. When defining the cores for both shards we
> define both of them in the same instanceDir, like this:
>  name="1_collection" config="solrconfig.xml" collection="1_collection"/>
>  name="1_collection" config="solrconfig.xml" collection="1_collection"/>
>
>   Each shard should have its own folder, so the final configuration should
> be like this:
>  name="1_collection" config="solrconfig.xml" collection="1_collection"/>
>  name="1_collection" config="solrconfig.xml" collection="1_collection"/>
>
> Can anyone confirm this?
>
> Thanks,
>   Iker
>
>
> 2013/5/4 Erick Erickson 
>
>> Sounds like you've explicitly routed the same document to two
>> different shards. Document replacement only happens locally to a
>> shard, so the fact that you have documents with the same ID on two
>> different shards is why you're getting duplicate documents.
>>
>> Best
>> Erick
>>
>> On Fri, May 3, 2013 at 3:44 PM, Iker Mtnz. Apellaniz
>>  wrote:
>> > We are currently using version 4.2.
>> > We have made tests with a single document and it gives us a 2 document
>> > count. But if we force to shard into te first machine, the one with a
>> > unique shard, the count gives us 1 document.
>> > I've tried using distrib=false parameter, it gives us no duplicate
>> > documents, but the same document appears to be in two different shards.
>> >
>> > Finally, about the separate directories, We have only one directory for
>> the
>> > data in each physical machine and collection, and I don't see any
>> subfolder
>> > for the different shards.
>> >
>> > Is it possible that we have something wrong with the dataDir
>> configuration
>> > to use multiple shards in one machine?
>> >
>> > <dataDir>${solr.data.dir:}</dataDir>
>> > <directoryFactory name="DirectoryFactory"
>> > class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
>> >
>> >
>> >
>> > 2013/5/3 Erick Erickson 
>> >
>> >> What version of Solr? The custom routing stuff is quite new so
>> >> I'm guessing 4x?
>> >>
>> >> But this shouldn't be happening. The actual index data for the
>> >> shards should be in separate directories, they just happen to
>> >> be on the same physical machine.
>> >>
>> >> Try querying each one with &distrib=false to see the counts
>> >> from single shards, that may shed some light on this. It vaguely
>> >> sounds like you have indexed the same document to both shards
>> >> somehow...
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
>> >>  wrote:
>> >> > Hi,
>> >> >   We have currently a solrCloud implementation running 5 shards in 3
>> >> > physical machines, so the first machine will have the shard number 1,
>> the
>> >> > second machine shards 2 & 4, and the third shards 3 & 5. We noticed
>> that
>> >> > while queryng numFoundDocs decreased when we increased the start
>> param.
>> >> >   After some investigation we found that the documents in shards 2 to
>> 5
>> >> > were being counted twice. Querying to shard 2 will give you back the
>> >> > results for shard 2 & 4, and the same thing for shards 3 & 5. Our
>> guess
>> >> is
>> >> > that the physical index for both shard 2&4 is shared, so the shards
>> don't
>> >> > know which part of it is for each one.
>> >> >   The uniqueKey is correctly defined, and we have tried using shard
>> >> prefix
>> >> > (shard1!docID).
>> >> >
>> >> >   Is there any way to solve this problem when a unique physical
>> machine
>> >> > shares shards?
>> >> >   Is it a "real" problem os it just affects facet & numResults?
>> >> >
>> >> > Thanks
>> >> >Iker
>> >> >
>> >> > --
>> >> > /** @author imartinez*/
>> >> > Person me = *new* Developer();
>> >> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
>> >> > me.setTwit("@mitxino77 ");
>> >> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*,
>> >> World"]});
>> >> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
>> >> > me.setWebs({*urbasaabentura.com, ikertxef.com*});
>> >> > *return* me;
>> >>
>> >
>> >
>> >
>> > --
>> > /** @author imartinez*/
>> > Person me = *new* Developer();
>> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
>> > me.setTwit("@mitxino77 ");
>> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*,
>> World"]});
>> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
>> > *return* me;
>>
>
>
>
> --
> /** @author imartinez*/
> Person me = *new* Developer();
> me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
> me.setTwit("@mitxino77 ");
> me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*, World"]});
> me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
> *return* me;


Scores dilemma after providing boosting with bq as same weigtage for 2 condition

2013-05-06 Thread nishi
While giving bq the same weightage of ^1.2 below on two values of
articleTopic, the results somehow always come back with all of the "Office"
documents on top. What other factors would influence this scenario when
there is also no keyword search?

http://localhost:8080/solr?rows=900&fq=(articleTopic:"Food" OR
articleTopic:"Office")&bq=(articleTopic:"Food" OR
articleTopic:"Office")^1.2&fl=title,description,documentId,score&sort=score
desc

Also, with other fields added to bq, the results are still somehow skewed
towards the "Office" ones:
http://localhost:8080/solr?rows=900&fq=(articleTopic:"Food" OR
articleTopic:"Office")&bq=(articleSrc:"News" OR articleSrc:"Blog")^0.5 OR
(articleTopic:"Food" OR
articleTopic:"Office")^1.2&fl=title,description,documentId,score&sort=score
desc

Please advise what the reason/factors might be that the score does not
balance the results across both articleTopic values and instead somehow
always favors the "Food" results.

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Scores-dilemma-after-providing-boosting-with-bq-as-same-weigtage-for-2-condition-tp4061035.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: action=CREATE

2013-05-06 Thread Erick Erickson
http://wiki.apache.org/solr/CoreAdmin#CREATE

"Core properties can be specified when creating a new core using
optional property.name=value request parameters, similar to the <property>
tag inside solr.xml."

haven't tried it myself though...
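
So presumably something along these lines would work (untested sketch, with
a made-up value for the language property):

http://localhost:8080/solr/admin/cores?action=CREATE&name=MajorIndex&instanceDir=cores/major&property.language=en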

Best
Erick

On Mon, May 6, 2013 at 3:04 AM, Peter Kirk  wrote:
> Hi
>
> I have a core definition in solr.xml which looks like the following:
>
> <core name="MajorIndex" instanceDir="cores/major">
>   <property name="language" value="..."/>
> </core>
>
> If I instead want to create this core with a CREATE command, how do I also 
> supply a property - like "language" in the above?
>
> For example, some sort of request:
> http://localhost:8080/solr/admin/cores?action=CREATE&name=MajorIndex&instanceDir=cores/major&property=???
>
> Thanks!
>
>
>


Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-06 Thread Erick Erickson
If I understand correctly, each of the nodes has a copy of the state
as of the last time there was a ZK quorum and operates off that
so the cluster can keep chugging along with updates disabled.

Of course if the state of your cluster changes (i.e. nodes come or
go), ZK is no longer available to tell everyone about the change etc

Best
Erick

On Mon, May 6, 2013 at 2:59 AM, Furkan KAMACI  wrote:
> Hi Mark;
>
> You said: "So it's pretty simple - when you lost the ability to talk to ZK,
> everything keeps working based on the most recent clusterstate - except
> that updates are blocked and you cannot add new nodes to the cluster."
> Where do nodes keep the cluster state? When a query comes to a node that
> is a replica of another shard, how will the query return accurate results?
>
> 2013/5/5 Jack Krupansky 
>
>> Is soul retrieval possible when ZooKeeper is down?
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Mark Miller
>> Sent: Sunday, May 05, 2013 2:19 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: disaster recovery scenarios for solr cloud and zookeeper
>>
>>
>> When Solr loses it's connection to ZooKeeper, updates will start being
>> rejected. Read requests will continue as normal. This is regardless of how
>> long ZooKeeper is down.
>>
>> So it's pretty simple - when you lost the ability to talk to ZK,
>> everything keeps working based on the most recent clusterstate - except
>> that updates are blocked and you cannot add new nodes to the cluster. You
>> are essentially in steady state.
>>
>> The ZK clients will continue trying to reconnect so that when ZK comes
>> back updates while start being accepted again and new nodes may join the
>> cluster.
>>
>> - Mark
>>
>> On May 3, 2013, at 3:21 PM, Dennis Haller  wrote:
>>
>>  Hi,
>>>
>>> Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is
>>> expected to have a very high (perfect?) availability. With 3 or 5
>>> zookeeper
>>> nodes, it is possible to manage zookeeper maintenance and online
>>> availability to be close to %100. But what is the worst case for Solr if
>>> for some unanticipated reason all Zookeeper nodes go offline?
>>>
>>> Could someone comment on a couple of possible scenarios for which all ZK
>>> nodes are offline. What would happen to Solr and what would be needed to
>>> recover in each case?
>>> 1) brief interruption, say <2 minutes,
>>> 2) longer downtime, say 60 min
>>>
>>> Thanks
>>> Dennis
>>>
>>
>>


Re: Did something change with Payloads?

2013-05-06 Thread hariistou
Hi,

I realized that there is no mistake in the way Lucene writes
postings/payloads; there is no flaw there.

The problem may be with the way scorePayload() is implemented:
we need to use both payload.bytes and payload.offset to compute the score.

So, please ignore my previous message in this thread.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Did-something-change-with-Payloads-tp4049561p4061030.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr adding unique values

2013-05-06 Thread Nikhil Kumar
Hey,
   I have recently started using Solr. I have a list of users, each of which
is subscribed to some lists, e.g.:
user a[
  id: a
  lists[
    list_a
  ]
]
user b[
  id: b
  lists[
    list_a
  ]
]
I am using {"id": a, "lists":{"add":"list_a"}} to add a particular list to a
user, but if I use the same command again, it adds the same list again,
which I want to avoid:
user a[
  id: a
  lists[
    list_a,
    list_a
  ]
]
I searched the documentation and tutorials, and I found:

   - overwrite = "true" | "false" — default is "true", meaning newer
     documents will replace previously added documents with the same uniqueKey.
   - commitWithin = "(milliseconds)": if the "commitWithin" attribute is
     present, the document will be added within that time. (Solr 1.4; see
     CommitWithin.)
   - (deprecated) allowDups = "true" | "false" — default is "false"
   - (deprecated) overwritePending = "true" | "false" — default is negation
     of allowDups
   - (deprecated) overwriteCommitted = "true" | "false" — default is negation
     of allowDups

But using overwrite and allowDups didn't solve the problem either, it seems,
because there is no unique id, just a value.

So the question is: how do I solve this problem?

-- 
Thank You and Regards,
Nikhil Kumar
+91-9916343619
Technical Analyst
Hashed In Technologies Pvt. Ltd.


Re: iterate through each document in Solr

2013-05-06 Thread Dmitry Kan
Are you doing this just once? Is your index sharded? If so, can you query each
shard individually?
Another way would be to do it at the Lucene level, i.e. read from the binary
indices directly (an API exists for this).
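
A rough sketch of that Lucene-level approach, assuming the Lucene 4.x API (the
index path is illustrative); it walks the stored fields of every live document
without going through Solr paging:

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class IterateIndex {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader =
            DirectoryReader.open(FSDirectory.open(new File("/path/to/solr/data/index")));
        Bits liveDocs = MultiFields.getLiveDocs(reader); // null when there are no deletions
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // skip deleted documents
            }
            Document doc = reader.document(i); // stored fields only
            // process the document here, e.g. doc.get("id")
        }
        reader.close();
    }
}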

Dmitry


On Mon, May 6, 2013 at 5:48 AM, Mingfeng Yang  wrote:

> Dear Solr Users,
>
> Does anyone know what is the best way to iterate through each document in a
> Solr index with billion entries?
>
> I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
> and then change start value, but it got very slow after getting through
> about 10 million docs.
>
> Thanks,
> Ming-
>


Re: Indexing off of the production servers

2013-05-06 Thread Furkan KAMACI
1-2) Your aim in using Hadoop is probably Map/Reduce jobs. With Map/Reduce you
split your workload, process it in parallel, and then a reduce step combines
the results. Let me explain the new SolrCloud architecture. You start your
SolrCloud with a numShards parameter. Let's assume that you have 5 shards.
Then you will have 5 leaders in your SolrCloud. These leaders are responsible
for indexing your data, which means your indexing workload is divided by 5;
you have parallelized your indexing much like a Map/Reduce job.

Let's assume that you then add 10 new Solr nodes to your SolrCloud. They will
be added as replicas for the shards. You will then have 5 shards, each with a
leader and 2 replicas. When you send a query to SolrCloud, every replica helps
with searching, and adding more replicas to your SolrCloud improves search
performance.


2013/5/6 David Parks 

> I've had trouble figuring out what options exist if I want to perform all
> indexing off of the production servers (I'd like to keep them only for user
> queries).
>
>
>
> We index data in batches, roughly daily. Ideally I'd index all Solr Cloud
> shards offline, then move the final index files to the Solr Cloud instance
> that needs them, flip a switch, and have it use the new index.
>
>
>
> Is this possible via either:
>
> 1.   Doing the indexing in Hadoop? (This would be ideal, as we have a
> significant investment in a Hadoop cluster already), or
>
> 2.   Maintaining a separate "master" server that handles indexing, with
> the nodes that receive user queries updating their index from there (I seem
> to recall reading about this configuration in 3.x, but now we're using Solr
> Cloud)
>
>
>
> Is there some ideal solution I can use to "protect" the production solr
> instances from degraded performance during large index processing periods?
>
>
>
> Thanks!
>
> David
>
>


Re: Duplicated Documents Across shards

2013-05-06 Thread Iker Mtnz. Apellaniz
Thanks Erick,
  I think we found the problem. When defining the cores for both shards we
define both of them in the same instanceDir, like this:



  Each shard should have its own folder, so the final configuration should
be like this:



Can anyone confirm this?

Thanks,
  Iker


2013/5/4 Erick Erickson 

> Sounds like you've explicitly routed the same document to two
> different shards. Document replacement only happens locally to a
> shard, so the fact that you have documents with the same ID on two
> different shards is why you're getting duplicate documents.
>
> Best
> Erick
>
> On Fri, May 3, 2013 at 3:44 PM, Iker Mtnz. Apellaniz
>  wrote:
> > We are currently using version 4.2.
> > We have made tests with a single document and it gives us a 2 document
> > count. But if we force it to the shard on the first machine, the one with
> > a single shard, the count gives us 1 document.
> > I've tried using the distrib=false parameter; it gives us no duplicate
> > documents, but the same document appears to be in two different shards.
> >
> > Finally, about the separate directories, We have only one directory for
> the
> > data in each physical machine and collection, and I don't see any
> subfolder
> > for the different shards.
> >
> > Is it possible that we have something wrong with the dataDir
> configuration
> > to use multiple shards in one machine?
> >
> > <dataDir>${solr.data.dir:}</dataDir>
> > <directoryFactory name="DirectoryFactory"
> > class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
> >
> >
> >
> > 2013/5/3 Erick Erickson 
> >
> >> What version of Solr? The custom routing stuff is quite new so
> >> I'm guessing 4x?
> >>
> >> But this shouldn't be happening. The actual index data for the
> >> shards should be in separate directories, they just happen to
> >> be on the same physical machine.
> >>
> >> Try querying each one with &distrib=false to see the counts
> >> from single shards, that may shed some light on this. It vaguely
> >> sounds like you have indexed the same document to both shards
> >> somehow...
> >>
> >> Best
> >> Erick
> >>
> >> On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
> >>  wrote:
> >> > Hi,
> >> >   We currently have a SolrCloud implementation running 5 shards on 3
> >> > physical machines, so the first machine will have shard number 1,
> the
> >> > second machine shards 2 & 4, and the third shards 3 & 5. We noticed
> that
> >> > while querying, numFoundDocs decreased when we increased the start
> param.
> >> >   After some investigation we found that the documents in shards 2 to
> 5
> >> > were being counted twice. Querying to shard 2 will give you back the
> >> > results for shard 2 & 4, and the same thing for shards 3 & 5. Our
> guess
> >> is
> >> > that the physical index for both shard 2&4 is shared, so the shards
> don't
> >> > know which part of it is for each one.
> >> >   The uniqueKey is correctly defined, and we have tried using shard
> >> prefix
> >> > (shard1!docID).
> >> >
> >> >   Is there any way to solve this problem when a unique physical
> machine
> >> > shares shards?
> >> >   Is it a "real" problem or does it just affect facet & numResults?
> >> >
> >> > Thanks
> >> >Iker
> >> >
> >> > --
> >> > /** @author imartinez*/
> >> > Person me = *new* Developer();
> >> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
> >> > me.setTwit("@mitxino77 ");
> >> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*,
> >> World"]});
> >> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
> >> > me.setWebs({*urbasaabentura.com, ikertxef.com*});
> >> > *return* me;
> >>
> >
> >
> >
> > --
> > /** @author imartinez*/
> > Person me = *new* Developer();
> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
> > me.setTwit("@mitxino77 ");
> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*,
> World"]});
> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
> > *return* me;
>



-- 
/** @author imartinez*/
Person me = *new* Developer();
me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
me.setTwit("@mitxino77 ");
me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*, World"]});
me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
*return* me;


Indexing off of the production servers

2013-05-06 Thread David Parks
I've had trouble figuring out what options exist if I want to perform all
indexing off of the production servers (I'd like to keep them only for user
queries).

 

We index data in batches, roughly daily. Ideally I'd index all Solr Cloud
shards offline, then move the final index files to the Solr Cloud instance
that needs them, flip a switch, and have it use the new index.

 

Is this possible via either:

1.   Doing the indexing in Hadoop? (This would be ideal, as we have a
significant investment in a Hadoop cluster already), or

2.   Maintaining a separate "master" server that handles indexing, with the
nodes that receive user queries updating their index from there (I seem to
recall reading about this configuration in 3.x, but now we're using Solr
Cloud)

 

Is there some ideal solution I can use to "protect" the production solr
instances from degraded performance during large index processing periods?

 

Thanks!

David



Questions about the performance of Solr

2013-05-06 Thread joo
Search speed has dropped now that more than 70 million documents are loaded.
A query takes about 50 seconds, and I cannot tell whether that is simply what
to expect at this document count.
I would like to know whether there is a problem with the query I use, or
whether it needs optimizing on the Solr side.
The query I use looks like this, for example:
time:[time TO time] AND category:(1,2) AND (message1:message OR message2:message)
If the query itself is not the problem, please advise which part I should look
at.
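
As an aside, one common way to express such a query through SolrJ is sketched
below (not from this thread; the URL, field values and date range are made up,
and moving the structured clauses into filter queries is a general Solr
practice that lets them be cached independently of the main query):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery();
        // free-text part of the request
        query.setQuery("message1:message OR message2:message");
        // structured clauses as filter queries, cached separately by Solr
        query.addFilterQuery("time:[2013-01-01T00:00:00Z TO 2013-05-01T00:00:00Z]");
        query.addFilterQuery("category:(1 OR 2)");
        QueryResponse response = server.query(query);
        System.out.println("numFound: " + response.getResults().getNumFound());
    }
}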



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Questions-about-the-performance-of-Solr-tp4060988.html
Sent from the Solr - User mailing list archive at Nabble.com.


without the indexed property is set to true by default?

2013-05-06 Thread joo
I have a field on which I did not explicitly set the indexed property, and
searching against that field does not work.
Is the indexed attribute set to true by default, or do I need to turn it on?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/without-the-indexed-property-is-set-to-true-by-default-tp4060973.html
Sent from the Solr - User mailing list archive at Nabble.com.


Memory problems with HttpSolrServer

2013-05-06 Thread Rogowski, Britta
Hi!

When I write from our database to an HttpSolrServer (using a
LinkedBlockingQueue to write just one document at a time), I run into memory
problems (due to various constraints I have to remain on a 32-bit system, so I
can use at most 2 GB of RAM).

If I use an EmbeddedSolrServer (to write locally), I have no such problems.
Just now, I tried out ConcurrentUpdateSolrServer (with a queue size of 1, but 3
threads to be safe), and that worked out fine too. I played around with various
GC options and monitored memory with jconsole and jmap, but only found that
there are lots of byte arrays, SolrInputFields and Strings hanging around.
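
For what it's worth, a minimal sketch of the setup that worked here, assuming
SolrJ 4.x (the URL, core name and document fields are illustrative):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexFeeder {
    public static void main(String[] args) throws Exception {
        // bounded queue of 1 document, 3 background threads, as described above
        ConcurrentUpdateSolrServer server =
            new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 1, 3);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        server.add(doc); // returns quickly; the background threads stream the update

        server.blockUntilFinished(); // wait for the queue to drain
        server.commit();
        server.shutdown();
    }
}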

Since ConcurrentUpdateSolrServer works, I'm happy, but I was wondering if 
people were aware of the memory issue around HttpSolrServer.

Regards,

Britta Rogowski



__

Britta Rogowski
Senior Developer

Wolters Kluwer Deutschland
Online Product Development
Feldstiege 100
48161 Münster

Tel +49 (2533) 9300-251
Fax
brogow...@wolterskluwer.de

Wolters Kluwer Deutschland GmbH | Feldstiege 100 | D-48161 Münster |
HRB 58843 Amtsgericht Köln | Geschäftsführer: Dr. Ulrich Hermann (Vorsitz), 
Michael Gloss, Christian Lindemann, Frank Schellmann | USt.-ID.Nr. DE188836808



action=CREATE

2013-05-06 Thread Peter Kirk
Hi

I have a core definition in solr.xml which looks like the following:





If I instead want to create this core with a CREATE command, how do I also 
supply a property - like "language" in the above?

For example, some sort of request:
http://localhost:8080/solr/admin/cores?action=CREATE&name=MajorIndex&instanceDir=cores/major&property=???

Thanks!





Re: disaster recovery scenarios for solr cloud and zookeeper

2013-05-06 Thread Furkan KAMACI
Hi Mark;

You said: "So it's pretty simple - when you lost the ability to talk to ZK,
everything keeps working based on the most recent clusterstate - except
that updates are blocked and you cannot add new nodes to the cluster."
Where nodes keeps cluster stat? When a query comes to a node that is at
another shard's replica, how query will return accurately?

2013/5/5 Jack Krupansky 

> Is soul retrieval possible when ZooKeeper is down?
>
> -- Jack Krupansky
>
> -Original Message- From: Mark Miller
> Sent: Sunday, May 05, 2013 2:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: disaster recovery scenarios for solr cloud and zookeeper
>
>
> When Solr loses its connection to ZooKeeper, updates will start being
> rejected. Read requests will continue as normal. This is regardless of how
> long ZooKeeper is down.
>
> So it's pretty simple - when you lose the ability to talk to ZK,
> everything keeps working based on the most recent clusterstate - except
> that updates are blocked and you cannot add new nodes to the cluster. You
> are essentially in steady state.
>
> The ZK clients will continue trying to reconnect so that when ZK comes
> back, updates will start being accepted again and new nodes may join the
> cluster.
>
> - Mark
>
> On May 3, 2013, at 3:21 PM, Dennis Haller  wrote:
>
>  Hi,
>>
>> Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is
>> expected to have a very high (perfect?) availability. With 3 or 5
>> zookeeper
>> nodes, it is possible to manage zookeeper maintenance and online
>> availability to be close to 100%. But what is the worst case for Solr if
>> for some unanticipated reason all Zookeeper nodes go offline?
>>
>> Could someone comment on a couple of possible scenarios for which all ZK
>> nodes are offline. What would happen to Solr and what would be needed to
>> recover in each case?
>> 1) brief interruption, say <2 minutes,
>> 2) longer downtime, say 60 min
>>
>> Thanks
>> Dennis
>>
>
>