Re: CDCR: Help With Tlog Growth Issues

2016-12-02 Thread Renaud Delbru

Hi Shalin,

when the buffer is enabled, tlogs are not removed anymore, even if they 
were replicated [1]:
"When buffering updates, the updates log will store all the updates 
indefinitely. "


Once you disable the buffer, all the old tlogs should be cleaned (the 
next time the tlog cleaning process is triggered).
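
For example (host, port and collection name below are placeholders, not
values taken from this thread), you can disable the buffer on the source
cluster and then re-check the QUEUES action to confirm that the tlog
counters start dropping once the cleaning process runs:

curl 'http://source-host:8983/solr/<collection>/cdcr?action=DISABLEBUFFER'
curl 'http://source-host:8983/solr/<collection>/cdcr?action=QUEUES&wt=json'

In the QUEUES response, tlogTotalSize and tlogTotalCount should decrease
after the next cleaning cycle.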


The buffer is useful in scenarios where you want to ensure that the source
cluster will not clean updates until the target clusters are fully
initialized. For example, let's say we perform a whole-index replication
(SOLR-6465): while the whole-index replication is in progress, the source
cluster should buffer updates until the replication is completed,
otherwise we might miss some updates.


[1] 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462#CrossDataCenterReplication(CDCR)-TheBufferElement


Kind Regards
--
Renaud Delbru

On 01/12/2016 17:58, Shalin Shekhar Mangar wrote:

Even if the buffer is enabled, the old tlogs should be removed once the
updates in those tlogs have been replicated to the target. So the real
question is why they haven't been removed automatically?

On Thu, Dec 1, 2016 at 9:13 PM, Renaud Delbru  wrote:

Hi Thomas,

Looks like the buffer is enabled on the update log, so even though the updates
were replicated, the tlogs are not removed.

What is the output of the command `/cdcr?action=STATUS` on both clusters?

If you see the buffer reported as `enabled` in the response, then the buffer
is enabled.
To disable it, you should run the command `/cdcr?action=DISABLEBUFFER`.
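
For example (host names are placeholders; use your own collection name):

curl 'http://source-host:8983/solr/<collection>/cdcr?action=STATUS'
curl 'http://target-host:8983/solr/<collection>/cdcr?action=STATUS'

The response reports the state of the replication process and of the
buffer; if the buffer is reported as enabled on a cluster, run the
DISABLEBUFFER action above on that cluster.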

Kind Regards
--
Renaud Delbru

On 10/11/2016 23:09, Thomas Tickle wrote:


I am having an issue with cdcr that I could use some assistance in
resolving.

I followed the instructions found here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

The CDCR is set up with a single source to a single target.  Both the
source and target clusters are identically set up as 3 machines, each running
an external ZooKeeper and a Solr instance.  I’ve enabled the data
replication and successfully seen the documents replicated from the source
to the target with no errors in the log files.

However, when examining the /cdcr?action=QUEUES command, I noticed that
the tlogTotalSize and tlogTotalCount were alarmingly high.  Checking the
data directory for each shard, I was able to confirm that there were several
thousand log files of 3-4 MB each.  They added up to almost 35 GB of
tlogs.  Obviously, this amount of tlogs causes a serious issue when trying
to restart a Solr server after activities such as patching.

*Is it normal for old tlogs to never get removed in a CDCR setup?*


Thomas Tickle



Nothing in this message is intended to constitute an electronic signature
unless a specific statement to the contrary is included in this message.

Confidentiality Note: This message is intended only for the person or
entity to which it is addressed. It may contain confidential and/or
privileged material. Any review, transmission, dissemination or other use,
or taking of any action in reliance upon this message by persons or entities
other than the intended recipient is prohibited and may be unlawful. If you
received this message in error, please contact the sender and delete it from
your computer.









Re: CDCR: Help With Tlog Growth Issues

2016-12-01 Thread Renaud Delbru

Hi Thomas,

Looks like the buffer is enabled on the update log, so even though the
updates were replicated, the tlogs are not removed.

What is the output of the command `/cdcr?action=STATUS` on both clusters?

If you see the buffer reported as `enabled` in the response, then the
buffer is enabled.

To disable it, you should run the command `/cdcr?action=DISABLEBUFFER`.

Kind Regards
--
Renaud Delbru

On 10/11/2016 23:09, Thomas Tickle wrote:


I am having an issue with cdcr that I could use some assistance in 
resolving.


I followed the instructions found here: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462


The CDCR is set up with a single source to a single target.  Both the
source and target clusters are identically set up as 3 machines, each
running an external ZooKeeper and a Solr instance.  I’ve enabled the
data replication and successfully seen the documents replicated from
the source to the target with no errors in the log files.


However, when examining the /cdcr?action=QUEUES command, I noticed
that the tlogTotalSize and tlogTotalCount were alarmingly high.
Checking the data directory for each shard, I was able to confirm that
there were several thousand log files of 3-4 MB each.  They added up
to almost 35 GB of tlogs.  Obviously, this amount of tlogs causes a
serious issue when trying to restart a Solr server after activities
such as patching.


*Is it normal for old tlogs to never get removed in a CDCR setup?*


Thomas Tickle



Nothing in this message is intended to constitute an electronic 
signature unless a specific statement to the contrary is included in 
this message.


Confidentiality Note: This message is intended only for the person or 
entity to which it is addressed. It may contain confidential and/or 
privileged material. Any review, transmission, dissemination or other 
use, or taking of any action in reliance upon this message by persons 
or entities other than the intended recipient is prohibited and may be 
unlawful. If you received this message in error, please contact the 
sender and delete it from your computer. 




Re: how to sampling search result

2016-09-30 Thread Renaud Delbru
Some people in the Elasticsearch community are using random scoring [1]
to sample a document subset from the search results. Maybe something
similar could be implemented for Solr?

There are probably more efficient sampling solutions than this one, but
this one is likely the most straightforward to implement.


[1] 
https://www.elastic.co/guide/en/elasticsearch/guide/current/random-scoring.html
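
As a very rough sketch of the idea on the Solr side (this assumes a schema
that defines the random_* dynamic field backed by solr.RandomSortField, as
the example schemas do; the seed suffix and collection name are
illustrative), a random ordering of the matching documents can be obtained
by sorting on such a field:

http://localhost:8983/solr/<collection>/select?q=...&sort=random_1234+desc

Changing the seed (the suffix of the field name) changes the ordering.
Note that this only randomises the order of the result set; it does not by
itself compute facets or stats on a 50% sample, so a dedicated sampling
component would still be needed for that.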


--
Renaud Delbru

On 27/09/16 15:57, googoo wrote:

Hi,

Is it possible to sample based on the "search result"?
For example, run the query first, and the search result returns 1 million documents.
With random sampling, 50% (500K) of those documents would be used for facets and stats.

The sampling needs to be based on the "search result".

Thanks,
Yongtao



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-sampling-search-result-tp4298269.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: CDCR (Solr6.x) does not start

2016-07-05 Thread Renaud Delbru

Hi Uwe,

At first glance, your configuration seems correct;
see my comments below.

On 28/06/16 15:36, Uwe Reh wrote:

9. Start CDCR
http://SOURCE:s_port/solr/scoll/cdcr?action=start&wt=json

{"responseHeader":{"status":0,"QTime":13},"status":["process","started","buffer","enabled"]}


! (not even a single query to the target's zookeeper ??)


Indeed, you should have observed communication between the source
cluster and the target ZooKeeper. Do you see any errors in the log of
the source cluster? Or a log message such as:

"Unable to instantiate the log reader for target collection ..."



10. Enter some test data into the SOURCE

11. Explicit commit in SOURCE
http://SOURCE:s_port/solr/scoll/update?commit=true&opensearcher=true
!! (at least now there should be some traffic, or?)


Replication should start even if no commit has been sent to the source 
cluster.




12. Check errors and queues
http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=queues&wt=json

{"responseHeader":{"status":0,"QTime":0},"queues":[],"tlogTotalSize":135,"tlogTotalCount":1,"updateLogSynchronizer":"stopped"}


http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=errors&wt=json

{"responseHeader":{"status":0,"QTime":0},"errors":[]}

! Why is the queues element empty?


The empty queue seems to indicate there is an issue, and that cdcr was
unable to instantiate the replicator for the target cluster.
Just to be sure, your source cluster has 4 shards, but no replicas? If
it has replicas, can you ensure that you execute these commands on the
shard leaders.
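
One way to check this (reusing the host and collection names from your
commands above): the CLUSTERSTATUS action of the Collections API reports
which replica is the leader of each shard, and the cdcr actions can then be
sent to that leader core directly, e.g.:

curl 'http://SOURCE:s_port/solr/admin/collections?action=CLUSTERSTATUS&collection=scoll'
curl 'http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=queues&wt=json'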


Kind Regards
--
Renaud Delbru



Re: Solr6 CDCR issue with a 3 cloud design

2016-07-05 Thread Renaud Delbru

Hi Dmitry,

On 28/06/16 13:19, dmitry.medve...@barclays.com wrote:

No ERRORS and queue size is equal to 0.
Should I extend the logging level to Max maybe? Currently it's the default.

How can I know, if a commit operation has been sent to the 2 target clusters 
after the replication? What command should I run to check this?
I submit new doc/s to my ACTIVE/PRIMARY cloud and that's all.


Commits are not replicated from the source cluster to the target
cluster. You have to manually send a commit to the target cluster
if you want to see all the pending docs:


curl 'http://target_cluster:8983/solr/collection_name/update?commit=true'

You can also try to configure an autocommit on your target cluster [1]

[1] 
https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-autoCommit


Kind regards
--
Renaud Delbru



-Original Message-
From: Renaud Delbru [mailto:renaud@siren.solutions]
Sent: Tuesday, June 14, 2016 11:57
To: solr-user@lucene.apache.org
Subject: Re: Solr6 CDCR issue with a 3 cloud design

Hi dmitry,

Was a commit operation sent to the 2 target clusters after the replication?
Replicated documents will not appear until a commit operation is sent.

What is the output of the monitoring actions QUEUES and ERRORS ? Are you seeing 
any errors reported ? Are you seeing the queue size not equal to 0 ?

--
Renaud Delbru

On 09/06/16 08:55, dmitry.medve...@barclays.com wrote:

I've set up a 3 cloud CDCR: Source => Target1-Source2 => Target2 CDCR
environment, and the replication process works perfectly, but:

when I shut down the Target1-Source2 cloud (the mediator, to test for
resilience), index/push some docs to the Source1 cloud, and bring the
Target1-Source2 cloud back online after several minutes, then only part of
the docs are replicated to the 2 Target clouds (7 of 10 docs tested).

Anyone has an idea what is the reason for such a behavior?

Configurations attached.

Thanks in advance,

Dmitry Medvedev.

___

This message is for information purposes only, it is not a
recommendation, advice, offer or solicitation to buy or sell a product
or service nor an official confirmation of any transaction. It is
directed at persons who are professionals and is not intended for
retail customer use. Intended for recipient only. This message is
subject to the terms at: www.barclays.com/emaildisclaimer
<http://www.barclays.com/emaildisclaimer>.

For important disclosures, please see:
www.barclays.com/salesandtradingdisclaimer
<http://www.barclays.com/salesandtradingdisclaimer> regarding market
commentary from Barclays Sales and/or Trading, who are active market
participants; and in respect of Barclays Research, including
disclosures relating to specific issuers, please see 
http://publicresearch.barclays.com.

___



___

This message is for information purposes only, it is not a recommendation, 
advice, offer or solicitation to buy or sell a product or service nor an 
official confirmation of any transaction. It is directed at persons who are 
professionals and is not intended for retail customer use. Intended for 
recipient only. This message is subject to the terms at: 
www.barclays.com/emaildisclaimer.

For important disclosures, please see: 
www.barclays.com/salesandtradingdisclaimer regarding market commentary from 
Barclays Sales and/or Trading, who are active market participants; and in 
respect of Barclays Research, including disclosures relating to specific 
issuers, please see http://publicresearch.barclays.com.

___





Re: Regarding CDCR SOLR 6

2016-06-21 Thread Renaud Delbru

Hi,

On 15/06/16 03:18, Bharath Kumar wrote:

Hi Renaud,

Thank you so much for your response. It is very helpful and it helped me
understand the need for turning on buffering.

Is it recommended to keep the buffering enabled all the time on the
source cluster? If the target cluster is up and running and the cdcr is
started, can i turn off the buffering on the source site?


Yes, there is no need to keep buffering on if your target cluster is up and
running and cdcr replication is started.



As you have mentioned, the transaction logs are kept on the source
cluster, until the data is replicated on the target cluster, once the
cdcr is started. Is there a possibility that target cluster is out of
sync with the source cluster and we need to do a hard recovery from the
source cluster to sync up the target cluster?


If the target cluster goes down while cdcr is replicating, there should
be no loss of information. The source cluster will try from time to time
to communicate with the target and continue the replication until the
target cluster is back up and running. Until it can resume
communication, the source cluster will keep a pointer to where the
replication should resume, and therefore the update log will not be
cleaned up past this point.


The pointer on the source cluster is not persistent (maybe that could be
something to implement). Therefore, if the source cluster is restarted,
the pointer will be lost, and the buffer should be activated until the
target cluster is up and running.
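
As a sketch of that workflow (host and collection names are placeholders):
once the source cluster is back up, enable the buffer, start cdcr again,
and only disable the buffer once the target is reachable and replication
has resumed:

curl 'http://source-host:8983/solr/<collection>/cdcr?action=ENABLEBUFFER'
curl 'http://source-host:8983/solr/<collection>/cdcr?action=START'
# check QUEUES/ERRORS until updates are flowing to the target again
curl 'http://source-host:8983/solr/<collection>/cdcr?action=DISABLEBUFFER'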




Also I have the below configuration on the source cluster to synchronize
the update logs:

  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>

Regarding the monitoring of the replication, I am planning to add a
script to check the queue size, to make sure the disk is not full in
case the target site is down and the transaction log size keeps growing
on the source site.
Is there any other recommended approach?


The best is to use the monitoring API, which provides some metrics on how
the replication is going. In the cwiki [1], there are also some
recommendations on how to monitor the system.


[1] 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
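
For example, a minimal check along those lines (host, collection name and
the alert threshold are all illustrative, and the JSON is parsed with a
simple grep rather than a proper parser):

#!/bin/sh
QUEUES=$(curl -s 'http://source-host:8983/solr/<collection>/cdcr?action=QUEUES&wt=json')
COUNT=$(echo "$QUEUES" | grep -o '"tlogTotalCount":[0-9]*' | head -1 | cut -d: -f2)
if [ "${COUNT:-0}" -gt 10000 ]; then
  echo "CDCR tlog count is $COUNT, check whether the target cluster is reachable"
fi

The tlogTotalSize value can be monitored in the same way if you prefer to
alert on disk usage.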


Kind Regards
--
Renaud Delbru


Thanks again, your inputs were very helpful.

On Tue, Jun 14, 2016 at 7:10 PM, Bharath Kumar
mailto:bharath.mvku...@gmail.com>> wrote:

Hi Renaud,

Thank you so much for your response. It is very helpful and it
helped me understand the need for turning on buffering.

Is it recommended to keep the buffering enabled all the time on the
source cluster? If the target cluster is up and running and the cdcr
is started, can i turn off the buffering on the source site?

As you have mentioned, the transaction logs are kept on the source
cluster, until the data is replicated on the target cluster, once
the cdcr is started, is there a possibility that if on the target
cluster



On Tue, Jun 14, 2016 at 6:50 AM, Davis, Daniel (NIH/NLM) [C]
mailto:daniel.da...@nih.gov>> wrote:

I must chime in to clarify something - in case 2, would the
source cluster eventually start a log reader on its own?   That
is, would the CDCR heal over time, or would manual action be
required?

    -----Original Message-
From: Renaud Delbru [mailto:renaud@siren.solutions
<mailto:renaud@siren.solutions>]
Sent: Tuesday, June 14, 2016 4:51 AM
To: solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org>
Subject: Re: Regarding CDCR SOLR 6

Hi Bharath,

The buffer is useful when you need to buffer updates on the
source cluster before starting cdcr, if the source cluster might
receive updates in the meanwhile and you want to be sure to not
miss them.

To understand this better, you need to understand how cdcr clean
transaction logs. Cdcr when started (with the START action) will
instantiate a log reader for each target cluster. The position
of the log reader will indicate cdcr which transaction logs it
can clean. If all the log readers are beyond a certain point,
then cdcr can clean all the transaction logs up to this point.

However, there might be cases when the source cluster will be up
without any log readers instantiated:
1) The source cluster is started, but cdcr is not started yet
2) the source cluster is started, cdcr is started, but the
target cluster was not accessible when cdcr was started. In this
case, cdcr will not be able to instantiate a log reader for this
cluster.

In these two scenarios, if updates are received by the source
cluster, then they might be cleaned out from the transaction log
as per the normal update log cleaning procedure.
 

Re: Encryption to Solr indexes – Using Custom Codec

2016-06-21 Thread Renaud Delbru

Hi,

Maybe it is the way you created the jar? Why not apply the patch to
lucene/solr trunk and use `ant jar` instead to get the codecs jar created
for you?
Also, I think the directory where you put the jars should be called
"lib" instead of "Lib".

You can also try to use the lib directives in your solrconfig.xml [1].

[1] 
https://cwiki.apache.org/confluence/display/solr/Lib+Directives+in+SolrConfig
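
A rough sketch of that route (the patch file name, branch, and module path
are assumptions; adapt them to whatever the patch attached to LUCENE-6966
actually targets):

git clone https://github.com/apache/lucene-solr.git
cd lucene-solr
patch -p1 < LUCENE-6966.patch      # patch file name is illustrative
cd lucene/codecs && ant jar        # let the build create the codecs jar

The resulting jar can then be placed in the core's lib/ directory (note the
lower-case name) or referenced with a lib directive in solrconfig.xml.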


--
Renaud Delbru

On 20/06/16 15:42, Sidana, Mohit wrote:

Hello,

As Part of my studies I am exploring the solutions which can be used for
Lucene/Solr Index encryption.

I found the patch open on Apache JIRA- Codec for index-level encryption
<https://issues.apache.org/jira/browse/LUCENE-6966>(LUCENE-6966).
https://issues.apache.org/jira/browse/LUCENE-6966 and I am currently
trying to test this Custom codec with Solr to perform secure search over
some sensitive records.

I've decided to follow the path described in the Solr wiki, setting up the
SimpleText codec, and then tried to use the encrypted codec source.

*Here are the additional details.*

I've created a basic jar file out of this source code (built it as a jar
from Eclipse using the Maven plugin).

The Solr installation I'm using to test this is Solr 6.0.0 unzipped,
started via its embedded Jetty server and using a single core.

I've placed my jar with the codec in [My_Core\ instance Dir.]\ Lib

In:

[$SolrDir]\Solr\ My_Core \conf\*solrconfig.xml*

I've added the following lines:

  [codecFactory declaration stripped by the mailing list archive]

Then in the *schema.xml* file, I've declared some field and field Types
that should use this codec:



  [field and fieldType declarations stripped by the mailing list archive]

I'm pretty sure I've followed all the steps described in the Solr Wiki;
however, when I actually try to use the custom codec implementation (named
"Encrypted Codec") to index some sample CSV data using the simple post tool:

java -Dtype=text/csv -Durl=http://localhost:8983/solr/My_Core/update
-jar post.jar Sales.csv

and I have also tried doing the same with SolrJ but I have faced the
same error.

SolrClient server = new HttpSolrClient("http://localhost:8983/solr/My_Core");

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1234");
doc.addField("name", "A lovely summer holiday");

try {
    server.add(doc);
    server.commit();
    System.out.println("Document added!");
} catch (SolrServerException | IOException e) {
    e.printStackTrace();
}

I get the attached errors in Solr log.

org.apache.solr.common.SolrException: Exception writing document id
b3e01ada-d0f1-4ddf-ad6a-2828bfe619a3 to the index; possible analysis error.

 at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:181)

 at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)

 at
org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)

 at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)

 at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)

 at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)

 at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)

 at
org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)

 at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)

 at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.

Re: Solr6 CDCR issue with a 3 cloud design

2016-06-14 Thread Renaud Delbru

Hi dmitry,

Was a commit operation sent to the 2 target clusters after the
replication? Replicated documents will not appear until a commit
operation is sent.


What is the output of the monitoring actions QUEUES and ERRORS ? Are you 
seeing any errors reported ? Are you seeing the queue size not equal to 0 ?


--
Renaud Delbru

On 09/06/16 08:55, dmitry.medve...@barclays.com wrote:

I've set up a 3 cloud CDCR: Source => Target1-Source2 => Target2 CDCR
environment, and the replication process works perfectly, but:

when I shut down the Target1-Source2 cloud (the mediator, to test for
resilience), index/push some docs to the Source1 cloud, and bring the
Target1-Source2 cloud back online after several minutes, then only part of the
docs are replicated to the 2 Target clouds (7 of 10 docs tested).

Anyone has an idea what is the reason for such a behavior?

Configurations attached.

Thanks in advance,

Dmitry Medvedev.

___

This message is for information purposes only, it is not a
recommendation, advice, offer or solicitation to buy or sell a product
or service nor an official confirmation of any transaction. It is
directed at persons who are professionals and is not intended for retail
customer use. Intended for recipient only. This message is subject to
the terms at: www.barclays.com/emaildisclaimer
<http://www.barclays.com/emaildisclaimer>.

For important disclosures, please see:
www.barclays.com/salesandtradingdisclaimer
<http://www.barclays.com/salesandtradingdisclaimer> regarding market
commentary from Barclays Sales and/or Trading, who are active market
participants; and in respect of Barclays Research, including disclosures
relating to specific issuers, please see http://publicresearch.barclays.com.

___





Re: Regarding CDCR SOLR 6

2016-06-14 Thread Renaud Delbru

Hi Bharath,

The buffer is useful when you need to buffer updates on the source
cluster before starting cdcr, if the source cluster might receive
updates in the meantime and you want to be sure not to miss them.


To understand this better, you need to understand how cdcr cleans
transaction logs. Cdcr, when started (with the START action), will
instantiate a log reader for each target cluster. The position of the
log reader will indicate to cdcr which transaction logs it can clean. If
all the log readers are beyond a certain point, then cdcr can clean all
the transaction logs up to this point.


However, there might be cases when the source cluster will be up without 
any log readers instantiated:

1) The source cluster is started, but cdcr is not started yet
2) the source cluster is started, cdcr is started, but the target 
cluster was not accessible when cdcr was started. In this case, cdcr 
will not be able to instantiate a log reader for this cluster.


In these two scenarios, if updates are received by the source cluster, 
then they might be cleaned out from the transaction log as per the 
normal update log cleaning procedure.
That is where the buffer becomes useful. When you know that while 
starting up your clusters and cdcr, you will be in one of these two 
scenarios, then you can activate the buffer to be sure to not miss 
updates. Then when the source and target clusters are properly up and 
cdcr replication is properly started, you can turn off this buffer.


--
Renaud Delbru

On 14/06/16 06:41, Bharath Kumar wrote:

Hi,

I have set up cross data center replication using Solr 6, and I want to know why
the buffer needs to be enabled on the source cluster. Even if the buffer is
not enabled, I am able to replicate the data between the source and target
sites. What are the advantages of enabling the buffer on the source site? If
I enable the buffer, the transaction logs are never deleted and over a
period of time we run out of disk space. Can you please let me know why
enabling the buffer is required?





Re: Need Help with Solr 6.0 Cross Data Center Replication

2016-06-08 Thread Renaud Delbru

Hi,

unfortunately no, I haven't had the time to reproduce your settings with 
separated zookeeper instances.


I'll update if I have something.
--
Renaud Delbru

On 07/06/16 16:55, Satvinder Singh wrote:

Hi,

Any updates on this??

Thanks

Satvinder Singh



   Security Systems Engineer
 satvinder.si...@nc4.com
 804.744.9630  x273 direct
 703.989.8030 cell



 www.NC4.com <http://www.NC4.com>












On 5/19/16, 8:41 AM, "Satvinder Singh"  wrote:


Hi,

So this is what I did:

I created solr as a service. Below are the steps I followed for that:--

$ tar xzf solr-X.Y.Z.tgz solr-X.Y.Z/bin/install_solr_service.sh 
--strip-components=2

$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt/solr1 -d 
/var/solr1 -u solr -s solr1 -p 8501
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt/solr2 -d 
/var/solr2 -u solr -s solr2 -p 8502

Then to start it in cloud I modified the solr1.cmd.in and solr2.cmd.in in 
/etc/defaults/
I added ZK_HOST=192.168.56.103:2181,192.168.56.103:2182,192.168.56.103:2183 
(192.168.56.103 is where my 3 zookeeper instances are)

Then I started the 2 solr services solr1 and solr2

Then I created the configset
/bin/solr zk -upconfig -z 
192.168.56.103:2181,192.168.56.103:2182,192.168.56.103:2183 -n Liferay -d 
server/solr/configsets/sample_techproducts_configs/conf

Then I created the collection using:
http://192.168.56.101:8501/solr/admin/collections?action=CREATE&name=dingdong&numShards=1&replicationFactor=2&collection.configName=liferay
This created fine

Then I deleted the solrconfig.xml from the ZooKeeper Liferay configset.

Then I uploaded the new solrconfig.xml to the configset.

When I do a reload on the collections I get the error, or when I create a new
collection I get the error.

Thanks

Satvinder Singh



Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com





?


-Original Message-
From: Renaud Delbru [mailto:renaud@siren.solutions]
Sent: Thursday, May 19, 2016 7:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

I have reproduced your steps and the cdcr request handler started successfully. 
I have attached to this mail the config sets I have used.
It is simply the sample_techproducts_config configset with your solrconfig.xml.

I have used solr 6.0.0 with the following commands:

$ ./bin/solr start -cloud

$ ./bin/solr create_collection -c test_cdcr -d cdcr_configs

Connecting to ZooKeeper at localhost:9983 ...
Uploading /solr-6.0.0/server/solr/configsets/cdcr_configs/conf for config 
test_cdcr to ZooKeeper at localhost:9983

Creating new collection 'test_cdcr' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=test_cdcr&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=test_cdcr

{
   "responseHeader":{
 "status":0,
 "QTime":5765},
   "success":{"127.0.1.1:8983_solr":{
   "responseHeader":{
 "status":0,
 "QTime":4426},
   "core":"test_cdcr_shard1_replica1"}}}

$ curl http://localhost:8983/solr/test_cdcr/cdcr?action=STATUS



03stoppedenabled 



The difference is that I have used the embedded zookeeper, not a separate 
ensemble.

Could you please provide the commands you used to create the collection ?

Kind Regards
--
Renaud Delbru

On 16/05/16 16:55, Satvinder Singh wrote:

I also am using a zk ensemble with 3 nodes on each side.

Thanks


Satvinder Singh



Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com





?


-Original Message-
From: Satvinder Singh [mailto:satvinder.si...@nc4.com]
Sent: Monday, May 16, 2016 11:54 AM
To: solr-user@lucene.apache.org
Subject: RE: Need Help with Solr 6.0 Cross Data Center Replication

Hi,

So the way I am doing it is: for both the Target and Source side, I took a
copy of the sample_techproducts_config configset and created one configset.
Then I modified the solrconfig.xml in there, both for the Target and Source
side. And then created the collection, and I get the errors. I get the error if
I create a new collection or try to reload an existing collection after the
solrconfig update.
Attached is the log and configs.
Thanks

Satvinder Singh



Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com





?


-Original Message-
From: Renaud Delbru [mailto:renaud@siren.solutions]
Sent: Monday, May 16, 2016 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

Hi,

Re: Solr 6 CDCR does not work

2016-05-30 Thread Renaud Delbru

Hi Adam,

Could you check the response of the monitoring commands [1], QUEUES,
ERRORS, OPS? This might help understanding whether documents are flowing or
whether there are issues.


Also, do you have an autoCommit configured on the target? CDCR does not
replicate commits, and therefore you have to send a commit command on the
target to ensure that the latest replicated documents are visible.


[1] 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462#CrossDataCenterReplication%28CDCR%29-Monitoringcommands
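
For example (using the host and core names from the configuration quoted
below; adjust them to your setup):

curl 'http://sourceip:8983/solr/corehol/cdcr?action=QUEUES'
curl 'http://sourceip:8983/solr/corehol/cdcr?action=ERRORS'
curl 'http://sourceip:8983/solr/corehol/cdcr?action=OPS'
curl 'http://targetip:8983/solr/corehol/update?commit=true'

The first three show whether updates are queued, failing or flowing; the
last one makes already-replicated documents visible on the target.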


--
Renaud Delbru

On 29/05/16 12:10, Adam Majid Sanjaya wrote:

I’m testing Solr 6 CDCR, but it seems it’s not working.

Source configuration:

   
 targetzkip:2181
 corehol
 corehol
   

   
 1
 1000
 128
   

   
 5000
   



   
 ${solr.ulog.dir:}
   


Target(s) configuration:

   
 disabled
   



   
   



   
 cdcr-proccessor-chain
   



   
 ${solr.ulog.dir:}
   


Source Log: no cdcr
Target Log: no cdcr

Create a core (solrconfig.xml modification directly from the folder
data_driven_schema_configs):
#bin/solr create -c corehol -p 8983

Start cross-data center replication by running the START command on the
source data center
http://sourceip::8983/solr/corehol/cdcr?action=START

Disable buffer by running the DISABLEBUFFER command on the target data
center
http://targetip::8983/solr/corehol/cdcr?action=DISABLEBUFFER

The documents are not replicated to the target zone.

What should I examine?





Re: Inconsistent Solr document count on Target clouds when replicating data in Solr6 CDCR

2016-05-19 Thread Renaud Delbru

Hi Dmitry,

You can activate debug logging and see more information, such as the number
of documents replicated by the cdcr replicator thread, etc.


However, I think that the issue is that the indexes on the target instances
are not refreshed, and therefore some of the documents indexed are not yet
visible. Cdcr does not replicate commit operations, and lets the target
cluster handle the refresh. You can try to manually execute a commit
operation on the target cluster and see if all the documents appear.


Kind Regards
--
Renaud Delbru

On 19/05/16 17:39, dmitry.medve...@barclays.com wrote:

I've come across a weird problem which I'm trying to debug at the moment, and 
was just wondering if anyone has stumbled across it too:

I have an active-passive-passive configuration (1 Source cloud, 2 targets), and 
NOT all the documents are being replicated to the target clouds. Example: 3 
docs are being pushed/indexed on the Source cloud, S1, S2, S3, and only 2 docs 
can be found (almost immediately) on the Target clouds, say T1, T3. The 
behavior is NOT consistent.

I feel like it's a configuration issue, but it could also be a bug. How can I 
debug this issue?

What log files should I examine?

I couldn't find anything in the logs (of both the Source & Target clouds).



Source configuration:



10.88.52.219:9983,10.36.75.4:9983
demo
demo



2
10
128



1000





${solr.ulog.dir:}





Target(s) configuration:



disabled










cdcr-proc-chain





${solr.ulog.dir:}




Thnx,
Dmitry Medvedev
Tech search leader
BARCLAYS CAPITAL
Search Platform Engineering
Global Technology Infrastructure Services  (GTIS)
Barclays Capital, Atidim High-Tech Industrial Park, Tel Aviv 61580
* DDI : +972-3-5452462 * Mobile : +972-545874521
* 
dmitry.medve...@barclayscapital.com<mailto:dmitry.medve...@barclayscapital.com>

P Please consider the environment before printing this email


___

This message is for information purposes only, it is not a recommendation, 
advice, offer or solicitation to buy or sell a product or service nor an 
official confirmation of any transaction. It is directed at persons who are 
professionals and is not intended for retail customer use. Intended for 
recipient only. This message is subject to the terms at: 
www.barclays.com/emaildisclaimer.

For important disclosures, please see: 
www.barclays.com/salesandtradingdisclaimer regarding market commentary from 
Barclays Sales and/or Trading, who are active market participants; and in 
respect of Barclays Research, including disclosures relating to specific 
issuers, please see http://publicresearch.barclays.com.

___





Re: Need Help with Solr 6.0 Cross Data Center Replication

2016-05-19 Thread Renaud Delbru

Hi Abdel,

Have you reloaded the collection [1] after uploading the configuration
to ZooKeeper?


[1] 
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api2
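
For example (host and collection name are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=<collection>'

The cdcr request handler and the CdcrUpdateLog are only picked up once the
cores are reloaded (or the collection is re-created) after the new
solrconfig.xml has been uploaded to ZooKeeper.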


--
Renaud Delbru

On 16/05/16 17:29, Abdel Belkasri wrote:

Thanks Renaud.

Here is my setup:

1- I have created 2 sites: Main (source) and DR (traget).
2- Both sites are the same before configuring CDCR
3- The collections (source and target) are created before configuring CDCR
4- collections are created using interactive mode: accepting most defaults
except the ports (gettingstarted collection)
5- I have a zookeeper ensemble too.
6- I change the solrconfig.xml, then I upload using the command:
# upload configset to zookeeper
zkcli.bat -cmd upconfig -zkhost  localhost:2181 -confname gettingstarted
-solrhome C:\solr\solr-6-cloud\solr-6.0.0 -confdir
C:\solr\solr-6-cloud\solr-6.0.0\server\solr\configsets\basic_configs\conf

Renaud, can you send your config files...

Thanks,
--Abdel.

On Mon, May 16, 2016 at 12:16 PM, Satvinder Singh 
wrote:


Thank you.

To summarize this is what I have, all VMS running on Centos7 :

Source Side
 |___ 1 VM running 3 Zookeeper instances on port 2181, 2182 and
2183 (ZOOKEEPER 3.4.8)(Java 1.8.0_91)
 |___ 1 VM running 2 solr 6.0 instances on port 8501, 8502 (Solr
6.0) (Java 1.8.0_91)
 |___ sample_techproducts_config copied as 'liferay', and used to
create collections, that is where I am
  modifying the solrconfig.xml


Target Side
 |___ 1 VM running 3 Zookeeper instances on port 2181, 2182 and
2183 (ZOOKEEPER 3.4.8)(Java 1.8.0_91)
 |___ 1 VM running 2 solr 6.0 instances on port 8501, 8502 (Solr
6.0) (Java 1.8.0_91)
 |___ sample_techproducts_config copied as 'liferay', and used to
create collections, that is where I am
  modifying the solrconfig.xml


Thanks
Satvinder Singh
Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com








-Original Message-
From: Renaud Delbru [mailto:renaud@siren.solutions]
Sent: Monday, May 16, 2016 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

Thanks Satvinder,
Tomorrow, I'll try to reproduce the issue with your steps and will let you
know.

Regards
--
Renaud Delbru

On 16/05/16 16:53, Satvinder Singh wrote:

Hi,

So the way I am doing it is: for both the Target and Source side, I
took a copy of the sample_techproducts_config configset and created one
configset. Then I modified the solrconfig.xml in there, both for the Target
and Source side. And then created the collection, and I get the errors. I
get the error if I create a new collection or try to reload an existing
collection after the solrconfig update.

Attached is the log and configs.
Thanks

Satvinder Singh



Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com








-Original Message-
From: Renaud Delbru [mailto:renaud@siren.solutions]
Sent: Monday, May 16, 2016 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

Hi,

I have tried to reproduce the problem, but was unable to.
I have downloaded the Solr 6.0 distribution, added to the solr config
the cdcr request handler and modified the update handler to register the
CdcrUpdateLog, then started Solr in cloud mode and created a new collection
using my solr config. The cdcr request handler starts properly and does not
complain about the update log.


Could you provide more background on how to reproduce the issue ? E.g.,

how do you create a new collection with the cdcr configuration.

Are you trying to configure CDCR on collections that were created prior

to the CDCR configuration ?


@Erik: I have noticed a small issue in the CDCR page of the reference

guide. In the code snippet in Configuration -> Source Configuration, the
 element is nested within the .


Thanks
Regards
--
Renaud Delbru

On 15/05/16 23:13, Abdel Belkasri wrote:

Erick,

I tried the new configuration. The same issue that Satvinder is
having. The log updater cannot be instantiated...

class="solr.CdcrUpdateLog"

for some reason that class is causing a problem!

Anyway, anyone has a config that works?

Regards,
--Abdel

On Fri, May 13, 2016 at 11:57 AM, Erick Erickson

wrote:


I changed the CDCR doc, Oliver could you take a glance and see if it
is clear now? All I changed was the sample solrconfig sections

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=626
8
7462

Thanks,
Erick

On Fri, May 13, 2016 at 6:23 AM, Oliver Rudolph
 wrote:

Hi,

I had the same problem. The documentation is kind of misleading here.

You

must not add a new  element to your config but
update the existing . All you need to do is add the
class="solr.CdcrUpdateLog" element to the  element
inside your existing . Hope this helps!


Mit freundl

Re: Need Help with Solr 6.0 Cross Data Center Replication

2016-05-16 Thread Renaud Delbru

Thanks Satvinder,
Tomorrow, I'll try to reproduce the issue with your steps and will let 
you know.


Regards
--
Renaud Delbru

On 16/05/16 16:53, Satvinder Singh wrote:

Hi,

So the way I am doing it is: for both the Target and Source side, I took a
copy of the sample_techproducts_config configset and created one configset.
Then I modified the solrconfig.xml in there, both for the Target and Source
side. And then created the collection, and I get the errors. I get the error if
I create a new collection or try to reload an existing collection after the
solrconfig update.
Attached is the log and configs.
Thanks

Satvinder Singh



Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com








-Original Message-
From: Renaud Delbru [mailto:renaud@siren.solutions]
Sent: Monday, May 16, 2016 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Need Help with Solr 6.0 Cross Data Center Replication

Hi,

I have tried to reproduce the problem, but was unable to.
I have downloaded the Solr 6.0 distribution, added to the solr config the cdcr
request handler and modified the update handler to register the CdcrUpdateLog,
then started Solr in cloud mode and created a new collection using my solr
config. The cdcr request handler starts properly and does not complain about
the update log.

Could you provide more background on how to reproduce the issue ? E.g., how do 
you create a new collection with the cdcr configuration.
Are you trying to configure CDCR on collections that were created prior to the 
CDCR configuration ?

@Erik: I have noticed a small issue in the CDCR page of the reference guide. In the code 
snippet in Configuration -> Source Configuration, the  element is 
nested within the .

Thanks
Regards
--
Renaud Delbru

On 15/05/16 23:13, Abdel Belkasri wrote:

Erick,

I tried the new configuration. The same issue that Satvinder is
having. The log updater cannot be instantiated...

class="solr.CdcrUpdateLog"

for some reason that class is causing a problem!

Anyway, anyone has a config that works?

Regards,
--Abdel

On Fri, May 13, 2016 at 11:57 AM, Erick Erickson

wrote:


I changed the CDCR doc, Oliver could you take a glance and see if it
is clear now? All I changed was the sample solrconfig sections

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=6268
7462

Thanks,
Erick

On Fri, May 13, 2016 at 6:23 AM, Oliver Rudolph
 wrote:

Hi,

I had the same problem. The documentation is kind of misleading here.

You

must not add a new  element to your config but update
the existing . All you need to do is add the
class="solr.CdcrUpdateLog" element to the  element inside
your existing . Hope this helps!


Mit freundlichen Grüßen / Kind regards

Oliver Rudolph

IBM Deutschland Research & Development GmbH Vorsitzender des
Aufsichtsrats: Martina Koederitz
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht

Stuttgart,

HRB 243294













Disclaimer: This message is intended only for the use of the individual or 
entity to which it is addressed and may contain information which is 
privileged, confidential, proprietary, or exempt from disclosure under 
applicable law. If you are not the intended recipient or the person responsible 
for delivering the message to the intended recipient, you are strictly 
prohibited from disclosing, distributing, copying, or in any way using this 
message. If you have received this communication in error, please notify the 
sender and destroy and delete any copies you may have received.





Re: Need Help with Solr 6.0 Cross Data Center Replication

2016-05-16 Thread Renaud Delbru

Hi,

I have tried to reproduce the problem, but was unable to.
I have downloaded the Solr 6.0 distribution, added to the solr config
the cdcr request handler and modified the update handler to register the
CdcrUpdateLog, then started Solr in cloud mode and created a new
collection using my solr config. The cdcr request handler starts
properly and does not complain about the update log.


Could you provide more background on how to reproduce the issue ? E.g., 
how do you create a new collection with the cdcr configuration.
Are you trying to configure CDCR on collections that were created prior 
to the CDCR configuration ?


@Erik: I have noticed a small issue in the CDCR page of the reference 
guide. In the code snippet in Configuration -> Source Configuration, the 
 element is nested within the .


Thanks
Regards
--
Renaud Delbru

On 15/05/16 23:13, Abdel Belkasri wrote:

Erick,

I tried the new configuration. The same issue that Satvinder is having. The
log updater cannot be instantiated...

class="solr.CdcrUpdateLog"

for some reason that class is causing a problem!

Anyway, anyone has a config that works?

Regards,
--Abdel

On Fri, May 13, 2016 at 11:57 AM, Erick Erickson 
wrote:


I changed the CDCR doc, Oliver could you take a glance and see if it
is clear now? All I changed was the sample solrconfig sections

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

Thanks,
Erick

On Fri, May 13, 2016 at 6:23 AM, Oliver Rudolph
 wrote:

Hi,

I had the same problem. The documentation is kind of misleading here.

You

must not add a new  element to your config but update the
existing . All you need to do is add the
class="solr.CdcrUpdateLog" element to the  element inside your
existing . Hope this helps!


Mit freundlichen Grüßen / Kind regards

Oliver Rudolph

IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martina Koederitz
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht

Stuttgart,

HRB 243294















Re: Cross Data Center Replication - ERROR

2016-05-12 Thread Renaud Delbru

Hi Abdel,

Your configuration looks OK regarding the cdcr update log.
Could you tell us a bit more about your Solr installation? More
specifically, do the Solr instances, both source and target, contain
a collection that was created prior to the configuration of cdcr?


Best,
--
Renaud Delbru

On 11/05/16 20:46, Abdel Belkasri wrote:

Hi there,



I am trying to configure Cross Data Center Replication using solr 6.0.

I am having issues creating collections or reloading old collections with
the new solrconfig.xml on both the target and source side. I keep getting
the error
“org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Solr instance is not configured with the cdcr update log”





This is my config on the Source



   

   

 disabled

   

   

   

 

 

   

   

   

 cdcr-proc-chain

   

   

   

 

 ${solr.ulog.dir:}

 500

 20

 65536

 

   





This is the config on the Target side:



   

   

 disabled

   

   

   

 

 

   

   

   

 cdcr-proc-chain

   

   

   

 

 ${solr.ulog.dir:}

 500

 20

 65536

 

   





HOW SOLR IS RUNNING:

The ZKHOSTS parameter is set in the solr.in.sh file under /etc/default,
and when you start the solr service it will start in cloud mode.





Any help would be great.



Thanks


--Abdel.





Re: SolrCloud - Fails to delete documents when some shard is down

2016-03-21 Thread Renaud Delbru

On 21/03/16 14:43, Erick Erickson wrote:

Hmmm, you say "where I have many shards and
can't have one problem  causing no deletion of old data.".

You then have a shard that, when it comes back up still
has all the old data and that _is_ acceptable? Seems like
that would be jarring to the users when some portion of the
docs in their collection reappeared...

But no, there's no similar option for update that I know of.
Solr tries very hard for consistency and this would lead
to an inconsistent data state.

What is the root cause of your shard going down? That's
the fundamental problem here...


As Erick said, what would cause a full shard and all its replicas to go
down at the same time?
Usually, if a shard has multiple replicas and one node goes down, a
replica on one of the other nodes should take over the lead for this shard,
and the delete queries should work.

--
Renaud Delbru



Best,
Erick

On Mon, Mar 21, 2016 at 7:08 AM, Tali Finelt  wrote:


Hi,

I am using Solr 4.10.2.

When one of the shards in my environment is down and fails to recover -
The process of deleting documents from other shards fails as well.

For example,
When running:
https://:8983/solr//update?stream.body=
*:*&commit=true

I get the following error message:
No registered leader was found after waiting for 4000ms ,
collection:  slice:  

This causes a problem in a big environment where I have many shards and
can't have one problem  causing no deletion of old data.

Is there a way around that?

To Query on data in such cases, I use shards.tolerant=true parameter to
get results even if some shards are down.
Is there something similar for this case?

Thanks,
Tali








Re: Solr 6.0

2016-02-25 Thread Renaud Delbru

Hi Shawn,

On 25/02/16 14:07, Shawn Heisey wrote:

The CDCR functionality is currently present in the master branch, but I
do not know for sure whether it will be included in the 6.0 release.  I
am not involved with that feature and have no idea how stable the code is.

CDCR is stable and has been running for months now in a large production
deployment without any known issues.
Erick, who took care of committing it into the trunk, was planning to
release it as part of 6.0.

--
Renaud Delbru


Re: Index complex JSON data in SOLR

2014-11-20 Thread Renaud Delbru

Hi David,

You might want to look at SIREn 1.4 [1], a plugin for Lucene/Solr that
includes an update handler [2] which mimics the Elasticsearch index API. You
can push JSON documents to the API and it will dynamically flatten and
index the JSON documents into a set of fields (similar to
Elasticsearch). It also indexes the full JSON into a SIREn field to
support nested queries.


[1] http://siren.solutions/siren/downloads/
[2] http://siren.solutions/manual/solr-configuration-update-handler.html

--
Renaud Delbru

On 11/15/2014 10:05 PM, David Lee wrote:

Hi All,

How do I index complex JSON data in SOLR? For example,

{prices:[{state:"CA", price:"101.0"}, {state:"NJ",
price:"102.0"},{state:"CO", price:"102.0"}]}


It's simple in ElasticSearch, but in SOLR it always reports the following
error:
"Error parsing JSON field value. Unexpected OBJECT_START"


Thanks,
DL



[ANN] SIREn, a Lucene/Solr plugin for rich JSON data search

2014-07-23 Thread Renaud Delbru
One of the coolest features of Lucene/Solr is its ability to index 
nested documents using a Blockjoin approach.


While this works well for small documents and document collections, it
becomes unsustainable for larger ones: Blockjoin works by splitting the
original document into many documents, one per nested record.


For example, a single USPTO patent (XML format converted to JSON) will 
end up being over 1500 documents in the index. This has massive 
implications on performance and scalability.


Introducing SIREn

SIREn is an open source plugin for Solr for indexing and searching rich 
nested JSON data.


SIREn uses a sophisticated "tree indexing" design which ensures that the
index is not artificially inflated. This ensures that many types of
nested queries can be up to 3x faster. Further, depending on
the data, memory requirements for faceting can be up to 10x lower. As
such, SIREn allows you to use Solr for larger and more complex datasets,
especially so for sophisticated analytics. (You can read our whitepaper
to find out more [1].)


SIREn is also truly schemaless - it even allows you to change the type 
of a property between documents without being restricted by a defined 
mapping. This can be very useful for data integration scenarios where 
data is described in different ways in different sources.


You only need a few minutes to download and try SIREn [2]. It comes with 
a detailed manual [3] and you have access to the code on GitHub [4].


We look forward to hearing your feedback.

[1] 
http://siren.solutions/siren/resources/whitepapers/comparing-siren-1-2-and-lucenes-blockjoin-performance-a-uspto-patent-search-scenario/

[2] http://siren.solutions/siren/downloads/
[3] http://siren.solutions/manual/preface.html
[4] https://github.com/sindicetech/siren
--
Renaud Delbru
CTO
SIREn Solutions


Re: Obtaining query AST?

2011-05-31 Thread Renaud Delbru

Hi,

Have a look at the flexible query parser of Lucene (contrib package)
[1]. It provides a framework to easily create different parsing logic.
You should be able to access the AST and to modify as you want how it
is translated into a Lucene query (look at processors and pipeline
processors).
Once you have your own query parser, it is straightforward to
plug it into Solr.


[1] http://lucene.apache.org/java/3_1_0/api/contrib-queryparser/index.html
--
Renaud Delbru

On 31/05/11 19:24, dar...@ontrenet.com wrote:

Hi,
  I want to write my own query expander. It needs to obtain the AST
(abstract syntax tree) of an already parsed query string, navigate to
certain parts of it (words) and make logical phrases of those words by
adding to the AST - where necessary.

This cannot be done to the string because the query logic cannot be
semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed
first.

How can this be done with SolrJ?

thanks for any tips.
Darren






Resolved- Re: Replication Error - Index fetch failed - File Not Found & OverlappingFileLockException

2011-05-30 Thread Renaud Delbru

Hi,

I found out the problem by myself.
The reason was a bad deployment of Solr on Tomcat. Two instances of
solr were instantiated instead of one. The two instances were managing
the same indexes, and therefore were trying to write at the same time.


My apologies for the noise created on the ml,
--
Renaud Delbru

On 30/05/11 21:52, Renaud Delbru wrote:

Hi,

For months, we were using apache solr 3.1.0 snapshots without problems.
Recently, we have upgraded our index to apache solr 3.1.0,
and also moved to a multi-core infrastructure (4 core per nodes, each
core having its own index).

We found that one of the index slaves started to show failures, i.e.,
query errors. By looking at the log, we observed some errors during the
latest snappull, due to two types of exceptions:
- java.io.FileNotFoundException: File does not exist ...
and
- java.nio.channels.OverlappingFileLockException: null

Then, after the failed pull, the index started to show some index
related failure:

java.io.IOException: read past EOF at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207)]


However, after manually restarting the node, everything went back to
normal.

You can find a more detailed log at [1].

We are afraid to see this problem occurring again. Have you some idea on
what can be the cause ? Or a solution to avoid such problem ?

[1] http://pastebin.com/vbnyrUgJ

Thanks in advance




Replication Error - Index fetch failed - File Not Found & OverlappingFileLockException

2011-05-30 Thread Renaud Delbru

Hi,

For months, we were using apache solr 3.1.0 snapshots without problems.
Recently, we have upgraded our index to apache solr 3.1.0,
and also moved to a multi-core infrastructure (4 core per nodes, each 
core having its own index).


We found that one of the index slaves started to show failures, i.e.,
query errors. By looking at the log, we observed some errors during the
latest snappull, due to two types of exceptions:

- java.io.FileNotFoundException: File does not exist ...
and
- java.nio.channels.OverlappingFileLockException: null

Then, after the failed pull, the index started to show some index 
related failure:


java.io.IOException: read past EOF at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207)]


However, after manually restarting the node, everything went back to normal.

You can find a more detailed log at [1].

We are afraid to see this problem occurring again. Have you some idea on 
what can be the cause ? Or a solution to avoid such problem ?


[1] http://pastebin.com/vbnyrUgJ

Thanks in advance
--
Renaud Delbru


Re: Indexing documents with "complex multivalued fields"

2011-05-23 Thread Renaud Delbru

Hi,

you could look at this recent thread [1], it is similar to your problem.

[1] 
http://search.lucidimagination.com/search/document/33ec1a98d3f93217/search_across_related_correlated_multivalue_fields_in_solr#1f66876c782c78d5

--
Renaud Delbru

On 23/05/11 14:40, anass talby wrote:

Hi,

I'm new to Solr and would like to index documents that have complex
multivalued fields. I want to do something like:


1

  1
  red


  2
  green

...

  ...


How can i do this with solr

thanks in advance.





Re: Support for huge data set?

2011-05-13 Thread Renaud Delbru

Hi,

Our system [1] consists of 220+ million semi-structured web documents
(RDF, Microformats, etc.), with fairly small documents (a few kB) and
large documents (a few MB). Each document has, in addition, a dozen
additional fields for indexing and storing metadata about the document.


It runs on top of Solr 3.1 with the following configuration:
- 2 master indexes
- 2 slaves indexes
Each server is a quad-core with 32Gb of Ram, and 4 SATA drives in RAID10.

The indexing performance is quite good. We can reindex our full data
collection in less than a day (using only the two master indexes). Live
updates (a few million documents per day) are processed continuously by
our masters. We replicate the changes every hour to the slave indexes.
Query performance is also OK (you can try it by yourself on [1]).


As a side note, we are using Solr 3.1 plus a plugin we have developed
for indexing semi-structured data. This plugin adds much more data
to the index than plain Solr, so you can expect even better performance
with plain Solr (with respect to indexing performance).


[1] http://sindice.com
--
Renaud Delbru

On 12/05/11 17:59, atreyu wrote:

Hi,

I have about 300 million docs (or 10TB data) which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are b/t two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this size?  I am
trying to avoid the heavy cost of GSA, etc...

Thanks.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Search across related/correlated multivalue fields in Solr

2011-04-27 Thread Renaud Delbru

On 27/04/11 19:50, Walter Underwood wrote:

This kind of thing is really easy in an XML database. That is an XPath 
expression, not even a search.


Indeed, in fact SIREn is based on an XML IR technique, i.e., a simplified 
node-based indexing scheme.

--
Renaud Delbru


Re: Search across related/correlated multivalue fields in Solr

2011-04-27 Thread Renaud Delbru

On 27/04/11 19:37, Renaud Delbru wrote:

Hi Jason,

On 27/04/11 19:25, Jason Rutherglen wrote:

Renaud,

Can you provide a brief synopsis of how your system works?


SIREn provides a new "field type" for Solr. In this particular SIREn
field, the data is not a piece of text, but is organised in a table.
Then, SIREn provides query objects to query a specific cell (or group of
cells) of this table, a specific row (or group of rows), etc.

So, let's take ronotica's example:
you want to index a 1:N relationship between students and educations.
Your Solr document will look like:
doc {
student_id: 100
firstname: john
lastname: doe
education: {
[[2008], [OHIO_ST]],
[[2010], [YALE]]
}
}

where student_id, firstname and lastname are normal solr fields, and
education is a siren field. This field represents a table with two
columns, degreeYear and Institution, and where each row represents an
entry, or record, associated with the student.

Then, you can use SIREn to query a document having a row matching 2010
and Yale. In this case, SIREn will not return you the john doe student.


I meant "to query a document having a row matching 2010 and *OHIO_ST*" 
instead of Yale. Sorry for the confusion.


Re: Search across related/correlated multivalue fields in Solr

2011-04-27 Thread Renaud Delbru

Hi Jason,

On 27/04/11 19:25, Jason Rutherglen wrote:

Renaud,

Can you provide a brief synopsis of how your system works?


SIREn provides a new "field type" for Solr. In this particular SIREn 
field, the data is not a piece of text, but is organised in a table. 
Then, SIREn provides query objects to query a specific cell (or group of 
cells) of this table, a specific row (or group of rows), etc.


So, let's take ronotica's example:
you want to index a 1:N relationship between students and educations.
Your Solr document will look like:
doc {
  student_id: 100
  firstname: john
  lastname: doe
  education: {
[[2008], [OHIO_ST]],
[[2010], [YALE]]
  }
}

where student_id, firstname and lastname are normal solr fields, and 
education is a siren field. This field represents a table with two 
columns, degreeYear and Institution, and where each row represents an 
entry, or record, associated with the student.


Then, you can use SIREn to query a document having a row matching 2010 
and Yale. In this case, SIREn will not return you the john doe student.


I hope my brief synopsis and example are clear;
let me know if there is something that you don't understand (maybe in 
private).


Regards,
--
Renaud Delbru


Re: Search across related/correlated multivalue fields in Solr

2011-04-27 Thread Renaud Delbru

Hi,

you might want to look at the SIREn plugin [1,2], which allows you to 
index and query 1:N relationships such as yours, in a tabular data 
format [3].


[1] http://siren.sindice.com/
[2] https://github.com/rdelbru/SIREn
[3] 
https://dev.deri.ie/confluence/display/SIREn/Indexing+and+Searching+Tabular+Data


Kind Regards,
--
Renaud Delbru

On 27/04/11 18:30, ronotica wrote:

The nature of my project is such that search is needed and specifically
search across related entities. We want to perform several queries involving
a correlation between two or more properties of a given entity in a
collection.

To put things in context, here is a snippet of the domain:

Student { firstname, lastname }
Education { degreeCode, degreeYear, institution }

The database tables look like so:

STUDENT
--
STUDENT_ID  FNAME    LNAME
100         John     Doe
200         Rasheed  Jones
300         Mary     Hampton

EDUCATION
-
EDUCATION_ID  DEGREE_CODE  DEGREE_YR  INSTITUTION  STUDENT_ID
1             MD           2008       OHIO_ST      100
2             PHD          2010       YALE         100
3             MS           2007       OHIO_ST      200
4             MD           2010       YALE         300

A student can have many educations. Currently, our documents look like this
in solr:

DOC_ID  STUDENT_ID  FNAME    LNAME    DEGREE_CODE  DEGREE_YR  INSTITUTION
100     100         John     Doe      MD PHD       2008 2010  OHIO_ST YALE
101     200         Rasheed  Jones    MS           2007       OHIO_ST
102     300         Mary     Hampton  MD           2010       YALE

Searching for all students who graduated from OHIO_ST in 2010 currently
gives a hit (John Doe) when it shouldn't.

What is the best way to overcome this issue in Solr? This is only
happening when I am searching across correlated fields, mainly because the
data has been denormalized and Lucene has no notion of relationships between
the various fields.

One way that has come to mind is to have separate documents for "education"
and perform multiple searches to get at an answer. Besides this, is there
any other way? Does Solr provide any elegant solution for this?

Any help will be greatly appreciated.

Thanks.

PS: We have about 15 of these kinds of relationships, all relating to the
student, and would like to perform searches on each of them.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-across-related-correlated-multivalue-fields-in-Solr-tp2871176p2871176.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Queries with undetermined field count

2011-04-07 Thread Renaud Delbru

Hi,

SIREn [1], a Lucene/Solr plugin, allows you to perform queries across an 
undetermined number of fields, even if you have hundreds of thousands of 
fields. It might be helpful for your scenario.


[1] http://siren.sindice.com
--
Renaud Delbru

On 07/04/11 19:18, jisenhart wrote:


I have a question on how to set up queries not having a predetermined
field list to search on.

Here are some sample docs,

1234
hihello
lalachika chika boom boom


1235
foobarhappy happy joy
joy
some textsome more words to
search

.
.
.

4567
bedrock
memeyou you
super duperare we done?


Now a given user, say fred, belongs to any number of groups, say
only fred, and group1 for this example.
A query on 'foo' is easy if I know that fred belongs to only these two:

_fred:foo OR _group1:foo //will find a hit on doc 1235

However, a user can belong to any number of groups. How do I perform
such a search if the user's group list is arbitrarily large?

Could I somehow make use of reference docs like so:


fred
fredgroup1

.
.
.

wilma
wilmagroup1group5group9group11group31group40







Re: Matching on a multi valued field

2011-04-05 Thread Renaud Delbru

Hi,

you could try the SIREn plugin [1] which supports multi-valued fields.

[1] http://siren.sindice.com
--
Renaud Delbru

On 29/03/11 21:57, Brian Lamb wrote:

Hi all,

I have a field set up like this:



And I have some records:

RECORD1

   man's best friend
   pooch


RECORD2

   man's worst enemy
   friend to no one


Now if I do a search such as:
http://localhost:8983/solr/search/?q=*:*&fq={!q.op=AND df=common_names}man's
friend

Both records are returned. However, I only want RECORD1 returned. I
understand why RECORD2 is returned but how can I structure my query so that
only RECORD1 is returned?

Thanks,

Brian Lamb





Re: Triggering optimise based on time interval

2011-02-16 Thread Renaud Delbru

Mainly technical administration effort.

We are trying to have a solr packaging that
- minimises the effort to deploy the system on a machine.
- reduces errors when deploying
- centralises the logic of the Solr system

Ideally, we would like to have a central place (e.g., solrconfig) where 
the logic of the system is configured.
In that case, the system administrator does not have to bother with a 
long list of tasks and checkpoints every time we need to release a new 
version of the solr system, or extend our clusters. He should just have 
to take the new release, ship it on a machine, and start up solr.

--
Renaud Delbru

On 16/02/11 13:15, Stefan Matheis wrote:

Renaud,

just because i'm interested in .. what are your concerns about using
cron for that?

Stefan

On Wed, Feb 16, 2011 at 2:12 PM, Renaud Delbru  wrote:

Hi,

We would like to trigger an optimise every x hours. From what I can see,
there is nothing in Solr (3.1-SNAPSHOT) that enables us to do such a thing.
We have a master-slave configuration. The masters are tuned for fast
indexing (large merge factor). However, for the moment, the master index is
replicated as is to the slaves, and therefore it does not provide very
fast query times.
Our idea was
- to configure the replication so that it only happens after an optimise,
and
- schedule a partial optimise in order to reduce the number of segments
every x hours for faster querying.
We do not want to rely on a cron job for executing the partial optimise every
x hours; we would prefer to configure this directly within the Solr
config.

Our first idea was to create a SolrEventListener that would be triggered
postCommit and would be in charge of executing an optimise at a regular
time interval. Is this a good approach? Or are there other solutions to
achieve this?

Thanks,
--
Renaud Delbru





Triggering optimise based on time interval

2011-02-16 Thread Renaud Delbru

Hi,

We would like to trigger an optimise every x hours. From what I can see, 
there is nothing in Solr (3.1-SNAPSHOT) that enables us to do such a thing.
We have a master-slave configuration. The masters are tuned for fast 
indexing (large merge factor). However, for the moment, the master index 
is replicated as is to the slaves, and therefore it does not provide 
very fast query times.

Our idea was
- to configure the replication so that it only happens after an 
optimise, and
- schedule a partial optimise in order to reduce the number of segments 
every x hours for faster querying.
We do not want to rely on a cron job for executing the partial optimise 
every x hours; we would prefer to configure this directly within the 
Solr config.


Our first idea was to create a SolrEventListener that would be 
triggered postCommit and would be in charge of executing an 
optimise at a regular time interval. Is this a good approach? Or are there 
other solutions to achieve this?


Thanks,
--
Renaud Delbru
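
A minimal standalone sketch of the periodic partial optimise discussed above, assuming SolrJ 3.x's CommonsHttpSolrServer; the master URL and the 6-hour interval are illustrative. This is essentially what a cron job would do; an in-process SolrEventListener variant would wrap the same optimize call behind a timestamp check in its postCommit callback.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PeriodicOptimizer {
  public static void main(String[] args) throws Exception {
    // Illustrative master URL; replace with the real master core.
    final SolrServer master = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Run an optimise every 6 hours; if replication is configured to trigger
    // "after optimize", the merged index is then shipped to the slaves.
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try {
          master.optimize(true, true); // waitFlush, waitSearcher
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }, 6, 6, TimeUnit.HOURS);
  }
}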


Re: Filter Query, Filter Cache and Hit Ratio

2011-01-29 Thread Renaud Delbru

Thanks a lot, this totally makes sense but it was hard to figure this out.

cheers
--
Renaud Delbru

On 28/01/11 20:39, cbenn...@job.com wrote:

Ooops,

I meant NOW/DAY


-Original Message-
From: cbenn...@job.com [mailto:cbenn...@job.com]
Sent: Friday, January 28, 2011 3:37 PM
To: solr-user@lucene.apache.org
Subject: RE: Filter Query, Filter Cache and Hit Ratio

Hi,

You've used NOW in the range query which will give a date/time accurate
to
the millisecond, try using NOW\DAY

Colin.


-Original Message-
From: Renaud Delbru [mailto:renaud.del...@deri.org]
Sent: Friday, January 28, 2011 2:22 PM
To: solr-user@lucene.apache.org
Subject: Filter Query, Filter Cache and Hit Ratio

Hi,

I am looking for some more information on how the filter cache is
working, and how the hits are incremented.

We are using filter queries for certain predefined values, such as
timestamp:[2011-01-21T00:00:00Z+TO+NOW] (which is the current day). From
what I understand from the documentation:
"the filter cache stores the results of any filter queries ("fq"
parameters) that Solr is explicitly asked to execute. (Each filter is
executed and cached separately. When it's time to use them to limit the
number of results returned by a query, this is done using set
intersections.)"
So, we were imagining that if two consecutive queries (like the one above)
were using the same timestamp filter query, the second query would take
advantage of the filter cache, and we would see the number of hits
increasing (a hit on the cached timestamp filter query). However, this is
not the case: the number of hits on the filter cache does not increase
and stays very low. Is this normal?

INFO: [] webapp=/siren path=/select


params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=t

rue&fq=timestamp:[2011-01-
21T00:00:00Z+TO+NOW]&fq=domain:my.wordpress.com&fsv=true}
hits=0 status=0 QTime=139
INFO: [] webapp=/siren path=/select


params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=t

rue&fq=timestamp:[2011-01-
21T00:00:00Z+TO+NOW]&fq=domain:syours.wordpress.com&fsv=true}
hits=0 status=0 QTime=138

--
Renaud Delbru











Filter Query, Filter Cache and Hit Ratio

2011-01-28 Thread Renaud Delbru

Hi,

I am looking for some more information on how the filter cache is 
working, and how the hits are incremented.


We are using filter queries for certain predefined values, such as 
timestamp:[2011-01-21T00:00:00Z+TO+NOW] (which is the current day). From 
what I understand from the documentation:
"the filter cache stores the results of any filter queries ("fq" 
parameters) that Solr is explicitly asked to execute. (Each filter is 
executed and cached separately. When it's time to use them to limit the 
number of results returned by a query, this is done using set 
intersections.)"
So, we were imagining that if two consecutive queries (like the one above) 
were using the same timestamp filter query, the second query would take 
advantage of the filter cache, and we would see the number of hits 
increasing (a hit on the cached timestamp filter query). However, this is 
not the case: the number of hits on the filter cache does not increase 
and stays very low. Is this normal?


INFO: [] webapp=/siren path=/select 
params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=true&fq=timestamp:[2011-01-21T00:00:00Z+TO+NOW]&fq=domain:my.wordpress.com&fsv=true} 
hits=0 status=0 QTime=139
INFO: [] webapp=/siren path=/select 
params={wt=javabin&rows=0&version=2&fl=id,score&start=0&q=*:*&isShard=true&fq=timestamp:[2011-01-21T00:00:00Z+TO+NOW]&fq=domain:syours.wordpress.com&fsv=true} 
hits=0 status=0 QTime=138


--
Renaud Delbru
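
A small SolrJ sketch of the day-rounded filter that the NOW/DAY suggestion above leads to (illustrative, assuming the same timestamp and domain fields): because both endpoints are rounded to the day, every request issued during the same day produces an identical filter string and therefore reuses the same filterCache entry.

import org.apache.solr.client.solrj.SolrQuery;

public class RoundedFilterExample {
  // Builds the "documents of the current day for a given domain" query with
  // a cache-friendly, day-rounded timestamp filter.
  public static SolrQuery todaysDocs(String domain) {
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("timestamp:[NOW/DAY TO NOW/DAY+1DAY]");
    q.addFilterQuery("domain:" + domain);
    q.setRows(0);
    return q;
  }
}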



Re: Specifying an AnalyzerFactory in the schema

2011-01-25 Thread Renaud Delbru

Hi Chris,

On 24/01/11 21:18, Chris Hostetter wrote:

: I notice that in the schema, it is only possible to specify an Analyzer class,
: but not a Factory class as for the other elements (Tokenizer, Filter, etc.).
: This limits the use of this feature, as it is impossible to specify parameters
: for the Analyzer.
: I have looked at the IndexSchema implementation, and I think this requires a
: simple fix. Do I open an issue about it ?

Support for constructing Analyzers directly is very crude, and primarily
existed for making it easy for people with old indexes and analyzers to
keep working.

moving forward, Lucene/Solr eventually won't "ship" concrete Analyzer
implementations at all (at least, that's the last consensus I remember), so
enhancing support for loading Analyzers (or AnalyzerFactories) doesn't
make much sense.

Practically speaking, if you have an existing Analyzer that you want to
use in Solr, instead of writing an "AnalyzerFactory" for it, you could
just write a "TokenizerFactory" that wraps it instead -- functionally that
would let you achieve everything an AnalyzerFactory would, except that
Solr would already handle letting the schema.xml specify the
positionIncrementGap (which you could happily ignore if you wanted)

Thanks for the trick, I hadn't thought of doing that. This should 
work indeed.


cheers
--
Renaud Delbru


Specifying an AnalyzerFactory in the schema

2011-01-19 Thread Renaud Delbru

Hi,

I notice that in the schema, it is only possible to specify an Analyzer 
class, but not a Factory class as for the other elements (Tokenizer, 
Filter, etc.).
This limits the use of this feature, as it is impossible to specify 
parameters for the Analyzer.
I have looked at the IndexSchema implementation, and I think this 
requires a simple fix. Should I open an issue about it?


Regards,
--
Renaud Delbru


Re: Why does Solr commit block indexing?

2010-12-17 Thread Renaud Delbru

Hi Grant,

looking forward to a fix ;o). Such a fix would considerably improve 
Solr's update throughput (even if its performance is 
already quite impressive).


cheers
--
Renaud Delbru

On 17/12/10 13:05, Grant Ingersoll wrote:

I'm not sure if there is a issue open, but I know I've talked w/ Yonik about 
this and a few other changes to the DirectUpdateHandler2 in the past.  It does 
indeed need to be fixed.

-Grant

On Dec 17, 2010, at 7:04 AM, Renaud Delbru wrote:


Hi Michael,

thanks for your answer.
Is the Solr team aware of the problem? Is there an issue open about this, 
or ongoing work on it?

Regards,
--
Renaud Delbru

On 16/12/10 16:45, Michael McCandless wrote:

Unfortunately, (I think?) Solr currently commits by closing the
IndexWriter, which must wait for any running merges to complete, and
then opening a new one.

This is really rather silly because IndexWriter has had its own commit
method (which does not block ongoing indexing nor merging) for quite
some time now.

I'm not sure why we haven't switched over already... there must be
some trickiness involved.

Mike

On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru   wrote:

Hi,

See log at [1].
We are using the latest snapshot of lucene_branch3.1. We have configured
Solr to use the ConcurrentMergeScheduler:


When a commit() runs, it blocks indexing (all incoming update requests are
blocked until the commit operation is finished) ... at the end of the log we
notice a 4 minute gap during which none of the Solr clients trying to add
data receive any attention.
This is a bit annoying as it leads to timeout exceptions on the client side.
Here, the commit time is only 4 minutes, but it can be larger if there are
merges of large segments.
I thought Solr was able to handle commits and updates at the same time: the
commit operation should be done in the background, and the server should still
continue to receive update requests (maybe at a slower rate than normal).
But it looks like that is not the case. Is this normal behaviour?

[1] http://pastebin.com/KPkusyVb

Regards
--
Renaud Delbru


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search





Re: Why does Solr commit block indexing?

2010-12-17 Thread Renaud Delbru

Hi Michael,

thanks for your answer.
Is the Solr team aware of the problem? Is there an issue open 
about this, or ongoing work on it?


Regards,
--
Renaud Delbru

On 16/12/10 16:45, Michael McCandless wrote:

Unfortunately, (I think?) Solr currently commits by closing the
IndexWriter, which must wait for any running merges to complete, and
then opening a new one.

This is really rather silly because IndexWriter has had its own commit
method (which does not block ongoing indexing nor merging) for quite
some time now.

I'm not sure why we haven't switched over already... there must be
some trickiness involved.

Mike

On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru  wrote:

Hi,

See log at [1].
We are using the latest snapshot of lucene_branch3.1. We have configured
Solr to use the ConcurrentMergeScheduler:


When a commit() runs, it blocks indexing (all incoming update requests are
blocked until the commit operation is finished) ... at the end of the log we
notice a 4 minute gap during which none of the Solr clients trying to add
data receive any attention.
This is a bit annoying as it leads to timeout exceptions on the client side.
Here, the commit time is only 4 minutes, but it can be larger if there are
merges of large segments.
I thought Solr was able to handle commits and updates at the same time: the
commit operation should be done in the background, and the server should still
continue to receive update requests (maybe at a slower rate than normal).
But it looks like that is not the case. Is this normal behaviour?

[1] http://pastebin.com/KPkusyVb

Regards
--
Renaud Delbru





Why does Solr commit block indexing?

2010-12-16 Thread Renaud Delbru

Hi,

See log at [1].
We are using the latest snapshot of lucene_branch3.1. We have configured 
Solr to use the ConcurrentMergeScheduler:



When a commit() runs, it blocks indexing (all incoming update requests 
are blocked until the commit operation is finished) ... at the end of 
the log we notice a 4 minute gap during which none of the Solr clients 
trying to add data receive any attention.
This is a bit annoying as it leads to timeout exceptions on the client 
side. Here, the commit time is only 4 minutes, but it can be larger if 
there are merges of large segments.
I thought Solr was able to handle commits and updates at the same time: 
the commit operation should be done in the background, and the server 
should still continue to receive update requests (maybe at a slower rate than 
normal). But it looks like that is not the case. Is this normal behaviour?


[1] http://pastebin.com/KPkusyVb

Regards
--
Renaud Delbru
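
For reference, a minimal Lucene 3.1 sketch of the distinction described above: IndexWriter.commit() makes pending changes durable without closing the writer, so other threads can keep adding documents while merges continue in the background. The index path and field are illustrative.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CommitWithoutClose {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File("/tmp/index"));
    IndexWriterConfig cfg =
        new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
    IndexWriter writer = new IndexWriter(dir, cfg);

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

    // Durable checkpoint: does not close the writer, so other threads can
    // keep calling addDocument() while background merges carry on.
    writer.commit();

    // Only close when the application shuts down.
    writer.close();
  }
}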


Re: How to Transmit and Append Indexes

2010-11-21 Thread Renaud Delbru
Have you looked at Apache Nutch [1]? It is a distributed web crawling and 
search system, based on Lucene/Solr and Hadoop.


[1] http://nutch.apache.org/
--
Renaud Delbru

On 19/11/10 16:52, Bing Li wrote:

Hi, all,

I am working on a distributed searching system. Now I have one server only.
It has to crawl pages from the Web, generate indexes locally and respond to
users' queries. I think this is too much for one server to handle smoothly.

I plan to use at least two servers. The jobs to crawl pages and generate
indexes are done by one of them. After that, the new available indexes
should be transmitted to another one, which is responsible for responding to
users' queries. From the users' point of view, this system must be fast.
However, I don't know how I can get the additional indexes which I can
transmit. After transmission, how to append them to the old indexes? Does
the appending block searching?

Thanks so much for your help!

Bing Li





Re: How to extend IndexSchema and SchemaField

2010-10-09 Thread Renaud Delbru

 Hi Chris,

I have opened an issue (SOLR-2146 [1]) following that discussion.

[1] https://issues.apache.org/jira/browse/SOLR-2146

cheers
--
Renaud Delbru

On 14/09/10 01:06, Chris Hostetter wrote:

: Yes, I have thought of that, or even extending field type. But this does not
: work for my use case, since I can have multiple fields of a same type
: (therefore with the same field type, and same analyzer), but each one of them
: needs specific information. Therefore, I think the only "nice" way to achieve
: this is to have the possibility to add attributes to any field definition.

Right, at the moment custom FieldType classes can specify whatever
attributes they want to use in the <fieldtype> declaration -- but it's
not possible to specify arbitrary attributes that can be used in the
<field> declaration.

By all means, pelase open an issue requesting this as a feature.

I don't know that anyone explicitly set out to impose this limitation, but
one of the reasons it likely exists is because SchemaField is not
something that is intended to be customized -- while FieldType
objects are constructed once at startup, SchemaField objects are
frequently created on the fly when dealing with dynamicFields, so
initialization complexity is kept to a minimum.

That said -- this definitely seems like the type of use case that we 
should try to find *some* solution for -- even if it just means having 
Solr automatically create hidden FieldType instances for you on startup 
based on attributes specified in the <field> declaration that the corresponding 
FieldType class understands.


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!





Re: How to extend IndexSchema and SchemaField

2010-09-10 Thread Renaud Delbru

 Hi Charlie,

On 10/09/10 16:11, Charlie Jackson wrote:

Have you already explored the idea of using a custom analyzer for your
field? Depending on your use case, that might work for you.
Yes, I have thought of that, or even extending field type. But this does 
not work for my use case, since I can have multiple fields of the same 
type (therefore with the same field type, and same analyzer), but each 
one of them needs specific information. Therefore, I think the only 
"nice" way to achieve this is to have the possibility to add attributes 
to any field definition.


cheers
--
Renaud Delbru


Re: How to extend IndexSchema and SchemaField

2010-09-10 Thread Renaud Delbru

 Hi Javier,

On 10/09/10 07:15, Javier Diaz wrote:

Looking at the code we found out that there's no way to extend the schema.
Finally we copied part of the code that reads the schema in our
RequestHandler. It works but I'm not sure if it's the best way to do it. Let
me know if you want our code as an example.
So, do you mean you are duplicating part of the code for reading the 
schema and parsing the schema in your own way in your request handler?
If you could share the code so we can have a look, it would be helpful and 
inspiring. Cheers.

--
Renaud Delbru


Re: How to extend IndexSchema and SchemaField

2010-09-09 Thread Renaud Delbru

 Hi,

so I suppose there is no solution. Is there a chance that SchemaField 
will become extensible in the future? At the moment, all the field 
attributes (indexed, stored, etc.) are hardcoded inside SchemaField. Do 
you think it is worth opening an issue about it?

--
Renaud Delbru

On 07/09/10 16:13, Renaud Delbru wrote:

 Hi,

I would like to extend the field node in the schema.xml by adding new 
attributes. For example, I would like to be able to write:


And be able to access myattribute directly from IndexSchema and 
SchemaField objects. However, these two classes are final, and also 
not very easy to extend.

Are there any other solutions?

thanks,




How to extend IndexSchema and SchemaField

2010-09-07 Thread Renaud Delbru

 Hi,

I would like to extend the field node in the schema.xml by adding new 
attributes. For example, I would like to be able to write:


And be able to access myattribute directly from IndexSchema and 
SchemaField objects. However, these two classes are final, and also not 
very easy to extend.

Are there any other solutions?

thanks,
--
Renaud Delbru


Re: determine which value produced a hit in multivalued field type

2010-01-26 Thread Renaud Delbru

Hi,

SIREn [1] could provide you with such information (returning the value index in 
the multi-valued field). But currently, only a Lucene extension is 
available, and you'll have to modify the SIREn query 
operator a little to return the value position in the query results.


[1] http://siren.sindice.com/
--
Renaud Delbru

On 22/01/10 22:52, Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] wrote:

Hi,
If I have a multiValued field type of text, and I put values 
[cat,dog,green,blue] in it.  Is there a way to tell when I execute a query 
against that field for dog, that it was in the 1st element position for that 
multiValued field?

Thanks!
Tim


   




Re: Best wasy to solve Parent-Child relationship without Denormalizing?

2010-01-19 Thread Renaud Delbru

Hi,

SIREn [1] could help you to solve this task (look at the different 
indexing examples). But currently, only a Lucene extension is available. 
If you want to use it in Solr, you will have to implement your own 
Solr plugin (which should require only a limited amount of work).


[1] http://siren.sindice.com/
--
Renaud Delbru

On 19/01/10 13:14, karthi_1986 wrote:

Hi,

Here is an extract of my data schema in which my user should be able to
issue the following search:
company_description:pharmaceutical AND product_description:cosmetic

[Company profile]
 Company name
 Company url
 Company description
 Company user rating

[Product profile]
 Product name
 Product category
 Product description
 Product rating

So, I'm expecting a result where all cosmetic products created by
pharmaceutical companies are returned.

The problem is, I've read in posts a year old that this parent-child
relationship can only be solved by indexing the denormalized data together.
However, I'm dealing with 10,000,000 companies with possibly 10 products
each, so my data requirements are going to be HUGGEE!!

Is there a new feature in Solr which can handle this for me without the need
for de-normalization?
   




Re: how to scan dynamic field without specifying each field in query

2009-09-03 Thread Renaud Delbru

Hi,

maybe SIREn [1] can help you for this task. SIREn is a Lucene plugin 
that allows you to index and query tabular data. You can for example create 
a SIREn field "foo", index n values in n cells, and then query a 
specific cell or a range of cells. Unfortunately, the Solr plugin is not 
yet available, and therefore you will have to write your own query 
syntax and parser for this task.


Regards,

[1] http://siren.sindice.com
--
Renaud Delbru

gdeconto wrote:

thx for the reply.

you mean into a multivalue field?  possible, but was wondering if there was
something more flexible than that.  the ability to use a function (ie
myfunction) would open up some possibilities for more complex searching and
search syntax.

I could write my own query parser with special extended syntax, but that is
farther than I wanted to go.



Manepalli, Kalyan wrote:
  

You can copy the dynamic fields value into a different field and query on
that field.

Thanks,
Kalyan Manepalli





  




Re: Indexing arbitrary RDF resources

2009-03-26 Thread Renaud Delbru

Hi,

Here in DERI [1], we are working on an extension for Lucene / Solr to 
handle RDF data and structured queries. The engine is currently in use 
in the Sindice [2] search engine. We are planning to release our 
extension, called SIREn (for Semantic Information Retrieval Engine), as 
open source in the following month.


The approach with dynamic fields could work, but it has strong 
limitations when dealing with a large number of fields. Among them, it 
will lead to data duplication in the dictionary (the dictionary will 
become quickly very large since multiple fields / predicate can have 
identical terms) and it will be very inefficient to ask queries across 
all the fields. Our work overcomes such problems. We are also currently 
working on supporting join queries among entities / documents that are 
not of the most simple kind.


If you want to know more, you can contact our team (or send me directly 
an email). Maybe, it could be a good idea to join our efforts.


[1] http://www.deri.ie/
[2] http://www.sindice.com/
--
Regards,
Renaud Delbru

re...@gmx.net wrote:

Hey, all!

I'm planning a project where I want to write software that takes an RDF class 
and uses that information to dynamically support indexing and faceted searching 
of resources of that type. This would (as I imagine it) function with dynamic 
fields in all required data types and multiplicities and a mapping from 
properties to field names.
The project will be part of the open CMS software drupal which already has a 
working Solr integration module. You can find details about my project idea 
here:
http://groups.drupal.org/node/20589

Has something like this already been done or thought of by anyone her? Does 
anyone here have hints or remarks regarding the idea?

Thanks in advance for any comments!
  




Re: Store content out of solr

2009-02-17 Thread Renaud Delbru
A common approach (for web search engines) is to use HBase [1] as a 
"Document Repository". Each document indexed inside Solr will have an 
entry (row, identified by the document URL) in the HBase table. This 
works great when you deal with a large data collection (it scales better 
than a SQL database). The trade-off is that it is slightly slower than 
a local database.


[1] http://hadoop.apache.org/hbase/
--
Renaud Delbru

roberto wrote:

Hello,

We are indexing information from different sources, so we would like to
centralize the information content so I can retrieve it using the ID
provided by Solr.

Has anyone done something like this, and do you have any advice? I am
thinking of storing the information in a database like MySQL.

Thanks,
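
A minimal sketch of the HBase document-repository idea above, assuming the classic HBase client Put/Get API (HBase 0.90+); the table name and column family are illustrative assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DocumentRepository {
  private final HTable table;

  public DocumentRepository() throws Exception {
    // Illustrative table name; one row per document, keyed by the same ID
    // (e.g., the URL) that is indexed in Solr.
    table = new HTable(HBaseConfiguration.create(), "documents");
  }

  // Store the raw content under the document ID.
  public void put(String id, byte[] content) throws Exception {
    Put put = new Put(Bytes.toBytes(id));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), content);
    table.put(put);
  }

  // Retrieve the content for a document ID returned by a Solr query.
  public byte[] get(String id) throws Exception {
    Result result = table.get(new Get(Bytes.toBytes(id)));
    return result.getValue(Bytes.toBytes("content"), Bytes.toBytes("raw"));
  }
}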
  




Re: [ANN] Lucid Imagination

2009-02-09 Thread Renaud Delbru

Hi Mark,

Mark Miller wrote:
Hey Renaud - in the future, it's probably best to direct Gaze questions 
(unless it directly relates to Solr) to supp...@lucidimagination.com.

Right, I was not aware of this mailing list.


Gaze is a tool thats stores RequestHandler statistics avgs (over small 
intervals) for long time ranges, and then lets you view graphs of that 
data, either in (basically) real-time or for specific time ranges.


There is a Readme explaining install included with the Gaze download. 
Gaze is pre-installed in the LucidImagination certified distribution 
of Solr, and in the Readme html file for that, you will find 
instructions on enabling gaze (you uncomment the Gaze request handler 
in solrconfig.xml).
Ok, the Gaze documentation can only be found in the distribution file. 
It is what I was trying to find (I was looking on the Lucid Imagination 
website).
Gaze is implemented as a Solr RequestHandler plugin and an additional 
webapp. The RequestHandler plugin pings chosen request handlers every 
interval to collect RequestHandler statistics. This info is stored in 
RRD databases (this is done so that Gaze has a *very* minimal overhead 
- its meant for production use). The webapp is an interface to 
selecting which RequestHandlers you want to be monitored and other 
settings, as well as graph views of the collected data. There are also 
some other little info tools that display server/jvm and index 
statistics.

Gaze features look quite nice and useful.

Thanks for your reply,
Regards
--
Renaud Delbru



Re: [ANN] Lucid Imagination

2009-02-06 Thread Renaud Delbru

Hi,

I can't find any documentation about Solr Gaze. How can I use it?

Thanks,
Regards
--
Renaud Delbru

Grant Ingersoll wrote:

Hi Lucene and Solr users,

As some of you may know, Yonik, Erik, Sami, Mark and I teamed up with
Marc Krellenstein to create a company to provide commercial
support (with SLAs), training, value-add components and services to
users of Lucene and Solr.  We have been relatively quiet up until now 
as we prepare our

offerings, but I am now pleased to announce the official launch of
Lucid Imagination.  You can find us at http://www.lucidimagination.com/
and learn more about us at http://www.lucidimagination.com/About/.

We have also launched a beta search site dedicated to searching all
things in the Lucene ecosystem: Lucene, Solr, Tika, Mahout, Nutch,
Droids, etc.  It's powered, of course, by Lucene via Solr (we'll
provide details in a separate message later about our setup.)  You can
search the Lucene family of websites, wikis, mail archives and JIRA 
issues all in one place.

To try it out, browse to http://www.lucidimagination.com/search/.

Any and all feedback is welcome at f...@lucidimagination.com.

Thanks,
Grant

--
Grant Ingersoll
http://www.lucidimagination.com/













Re: Solr 1.3 Maven Artifact Problem

2008-10-23 Thread Renaud Delbru

Hi,

About the second point, it was my mistake (a source dependency problem 
in Eclipse).

--
Renaud Delbru

Renaud Delbru wrote:

Hi,

I am using the Solr 1.3 Maven artifacts from [1]. It seems that these 
artifacts are not correct. I have noticed that:
1) solr-core artifact contains org.apache.solr.client.solrj packages, 
and at the same time, the solr-core artifact depends on the solr-solrj 
artifact.
2) the source jar does not match the compiled class: I found different 
method fingerprints in EmbeddedSolr and in CoreDescriptor


Does anyone encounter the same problem?

[1] http://repo1.maven.org/maven2/org/apache/solr/

Regards,


Solr 1.3 Maven Artifact Problem

2008-10-23 Thread Renaud Delbru

Hi,

I am using the Solr 1.3 Maven artifacts from [1]. It seems that these 
artifacts are not correct. I have noticed that:
1) solr-core artifact contains org.apache.solr.client.solrj packages, 
and at the same time, the solr-core artifact depends on the solr-solrj 
artifact.
2) the source jar does not match the compiled class: I found different 
method fingerprints in EmbeddedSolr and in CoreDescriptor


Does anyone encounter the same problem?

[1] http://repo1.maven.org/maven2/org/apache/solr/

Regards,
--
Renaud Delbru


Re: Slow deleteById request

2008-07-15 Thread Renaud Delbru

Hi,

I think the reason was indeed maxPendingDeletes, which was configured to 
1000.
After updating to a Solr nightly build with Lucene 2.4, the issue 
seems to have disappeared.


Thanks for your advices.
--
Renaud Delbru

Mike Klaas wrote:


On 1-Jul-08, at 10:44 PM, Chris Hostetter wrote:

>
> : Yes, updating to a newer version of nightly Solr build could solve 
> the
> : problem, but I am a little afraid to do it since solr-trunk has 
> switched to

> : lucene 2.4-dev.
>
> but did you check wether or not you have maxPendingDeletes 
> configured as

> yonik asked?
>
> That would explain exactly waht you are seeing ... after a certain 
> number
> of deletes have passed, the next one would automaticly force a 
> commit (and
> a newSearcher) and (i believe) subsequent deletes would block until 
> the

> commit is done ... which sounds like exactly what you describe.

It shouldn't cause a commit, just a flushing of deletes.  However, 
deletes count toward both maxDocs and maxTime for  
purposes, so that is the likely explanation.


-Mike



Re: Slow deleteById request

2008-07-01 Thread Renaud Delbru

Yonik Seeley wrote:

I'd try the latest nightly solr build... it now lets Lucene manage the deletes.
  
Yes, updating to a newer nightly Solr build could solve the 
problem, but I am a little afraid to do it since solr-trunk has switched 
to lucene 2.4-dev.


Thanks for your answers, Yonik.
--
Renaud Delbru



Re: Slow deleteById request

2008-07-01 Thread Renaud Delbru

Hi Yonik,

We are not sending a commit with a delete. It happens when using the 
following command:
curl http://mydomain.net:8080/index/update -s -H 'Content-type:text/xml; 
charset=utf-8' -d "<delete><id>http://example.org/</id></delete>"
or using the SolrJ deleteById method (that does not execute a commit as 
far as I know).


The strange thing is that it is not always reproducible. Ten or so delete 
requests will be executed fast (in a few ms), then a batch of a few delete 
requests will take 10, 20 or even 30 seconds.


By looking more precisely at the log, it seems that, in fact,  the 
delete request triggers the opening of a new searcher, with its 
auto-warming. On a large index (our case), I heard that it can take 
quite some time. Anyway, I do not have a precise explanation for this 
problem.
This is not a big issue in our case, since it occurs for only a few requests 
and since other concurrent requests will be handled by the other searcher.


--
Renaud Delbru


Yonik Seeley wrote:

That's very strange... are you sending a commit with the delete perhaps?
If so, the whole request would block until a new searcher is registered.

-Yonik

On Tue, Jul 1, 2008 at 8:54 AM, Renaud Delbru <[EMAIL PROTECTED]> wrote:
  

Hi,

We experience very slow delete, taking more than 10 seconds. A delete is
executed using deleteById (from Solrj or from curl), at the same time
documents are being added.
By looking at the log (below), it seems that a delete by ID request is only
executed during the next commit (done automatically every 1000 added
documents), and that the process (Solrj or curl) executing the deleteById
request is blocked until the commit is performed.

Is it a normal behavior or a misconfiguration of our Solr server ?

Thanks in advance for insights.

[11:32:02.840]autowarming result for [EMAIL PROTECTED] main
[11:32:02.840]
 
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,c
umulative_evictions=4289}
[11:32:02.840]autowarming [EMAIL PROTECTED] main from [EMAIL PROTECTED] main
[11:32:02.840]
 
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}
[11:32:02.840]autowarming result for [EMAIL PROTECTED] main
[11:32:02.840]
 
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}
[11:32:02.840]Registered new searcher [EMAIL PROTECTED] main
[11:32:02.840]{delete=[http://example.org/]} 0 14212
[11:32:02.840]webapp=/index path=/update params={wt=xml&version=2.2}
status=0 QTime=14212
[11:32:02.840]DirectUpdateHandler2 deleting and removing dups for 217 ids
[11:32:02.840]Closing Writer DirectUpdateHandler2
[11:32:02.842]Closing [EMAIL PROTECTED] main
[11:32:02.842]
 
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
[11:32:02.842]
 
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289}
[11:32:02.842]
 
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}
[11:32:02.894]Opening [EMAIL PROTECTED] DirectUpdateHandler2
[11:32:03.566]DirectUpdateHandler2 docs deleted=0
[11:32:03.566]Closing [EMAIL PROTECTED] DirectUpdateHandler2

--
Renaud Delbru
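
A minimal SolrJ sketch of the delete call in question (illustrative, using the same endpoint and id as the curl command above): deleteById only queues the delete and sends no commit, so any blocking observed comes from the server side (pending deletes being flushed and a new searcher being warmed), not from the client.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DeleteByIdExample {
  public static void main(String[] args) throws Exception {
    // Same endpoint as the curl command above.
    SolrServer server = new CommonsHttpSolrServer("http://mydomain.net:8080/index");
    // Queues the delete; no explicit commit is sent here.
    server.deleteById("http://example.org/");
  }
}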




Re: Slow deleteById request

2008-07-01 Thread Renaud Delbru

Small precision,
we are using a nightly build of Solr 1.3 (one of the nightly build just 
before the integration of Lucene 2.4).

--
Renaud Delbru

Renaud Delbru wrote:

Hi,

We experience very slow delete, taking more than 10 seconds. A delete 
is executed using deleteById (from Solrj or from curl), at the same 
time documents are being added.
By looking at the log (below), it seems that a delete by ID request is 
only executed during the next commit (done automatically every 1000 
added documents), and that the process (Solrj or curl) executing the 
deleteById request is blocked until the commit is performed.


Is it a normal behavior or a misconfiguration of our Solr server ?

Thanks in advance for insights.

[11:32:02.840]autowarming result for [EMAIL PROTECTED] main
[11:32:02.840]  
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,c 


umulative_evictions=4289}
[11:32:02.840]autowarming [EMAIL PROTECTED] main from 
[EMAIL PROTECTED] main
[11:32:02.840]  
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} 


[11:32:02.840]autowarming result for [EMAIL PROTECTED] main
[11:32:02.840]  
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} 


[11:32:02.840]Registered new searcher [EMAIL PROTECTED] main
[11:32:02.840]{delete=[http://example.org/]} 0 14212
[11:32:02.840]webapp=/index path=/update params={wt=xml&version=2.2} 
status=0 QTime=14212

[11:32:02.840]DirectUpdateHandler2 deleting and removing dups for 217 ids
[11:32:02.840]Closing Writer DirectUpdateHandler2
[11:32:02.842]Closing [EMAIL PROTECTED] main
[11:32:02.842]  
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0} 

[11:32:02.842]  
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289} 

[11:32:02.842]  
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577} 


[11:32:02.894]Opening [EMAIL PROTECTED] DirectUpdateHandler2
[11:32:03.566]DirectUpdateHandler2 docs deleted=0
[11:32:03.566]Closing [EMAIL PROTECTED] DirectUpdateHandler2



Slow deleteById request

2008-07-01 Thread Renaud Delbru

Hi,

We experience very slow deletes, taking more than 10 seconds. A delete is 
executed using deleteById (from Solrj or from curl), at the same time 
documents are being added.
By looking at the log (below), it seems that a delete by ID request is 
only executed during the next commit (done automatically every 1000 
added documents), and that the process (Solrj or curl) executing the 
deleteById request is blocked until the commit is performed.


Is this normal behavior, or a misconfiguration of our Solr server?

Thanks in advance for insights.

[11:32:02.840]autowarming result for [EMAIL PROTECTED] main
[11:32:02.840]  
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,c

umulative_evictions=4289}
[11:32:02.840]autowarming [EMAIL PROTECTED] main from [EMAIL PROTECTED] main
[11:32:02.840]  
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}

[11:32:02.840]autowarming result for [EMAIL PROTECTED] main
[11:32:02.840]  
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}

[11:32:02.840]Registered new searcher [EMAIL PROTECTED] main
[11:32:02.840]{delete=[http://example.org/]} 0 14212
[11:32:02.840]webapp=/index path=/update params={wt=xml&version=2.2} 
status=0 QTime=14212

[11:32:02.840]DirectUpdateHandler2 deleting and removing dups for 217 ids
[11:32:02.840]Closing Writer DirectUpdateHandler2
[11:32:02.842]Closing [EMAIL PROTECTED] main
[11:32:02.842]  
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
[11:32:02.842]  
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=512,evictions=0,size=512,cumulative_lookups=238825,cumulative_hits=202879,cumulative_hitratio=0.84,cumulative_inserts=36255,cumulative_evictions=4289}
[11:32:02.842]  
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=2395306,cumulative_hits=1705483,cumulative_hitratio=0.71,cumulative_inserts=689823,cumulative_evictions=411577}

[11:32:02.894]Opening [EMAIL PROTECTED] DirectUpdateHandler2
[11:32:03.566]DirectUpdateHandler2 docs deleted=0
[11:32:03.566]Closing [EMAIL PROTECTED] DirectUpdateHandler2

--
Renaud Delbru


Re: How to get incrementPositionGap value from IndexSchema ?

2008-03-12 Thread Renaud Delbru

Hi Chris,

Thanks for your reply. Indeed, there is the getPositionIncrementGap 
method; I had forgotten about it.


I need this information to be able to configure my query processor. I 
have extended Solr with a new query parser to be able to search documents 
at a sentence-based granularity. Each sentence is a Fieldable instance 
of a field 'sentences', and I execute span queries to match a 
boolean combination of terms at the sentence level, not the document level.

I hope this explanation is clear and makes sense.

Regards.


Chris Hostetter wrote:

: I am looking for a way to access the incrementPositionGap value defined for a
: field type in the schema.xml.

I think you mean "positionIncrementGap"

It's a property of the  in schema.xml, but internally it's 
passed to SolrAnalyzer.setPositionIncrementGap.  if you want to 
programaticly know what the "positionIncrementGap" is for any analyzer of 
any field or fieldtype regardless of wether or not it's a SolrAnalyzer, 
just use Analzer.getPositionIncrementGap(String fieldName) 


ie: myFieldType.getAnalyzer().getPositionIncrementGap(myFieldName)


If you don't mind me asking:  why do you want/need this information in 
your custom code?



-Hoss
  



--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/
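
A small sketch of the accessor Hoss describes above, as it might be used from custom code that has a SolrQueryRequest at hand; the field name is illustrative.

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;

public class PositionGapExample {
  // Returns the positionIncrementGap configured for the analyzer of the given field.
  public static int positionGap(SolrQueryRequest req, String fieldName) {
    IndexSchema schema = req.getSchema();
    SchemaField field = schema.getField(fieldName);
    return field.getType().getAnalyzer().getPositionIncrementGap(fieldName);
  }
}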


How to get incrementPositionGap value from IndexSchema ?

2008-03-10 Thread Renaud Delbru

Hi,

I am looking for a way to access the incrementPositionGap value defined 
for a field type in the schema.xml.
There is a getArgs method in FieldType class, but it is protected and I 
am not able to access it.

Is there another solution ?

Regards.

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/


Re: SpanQuery support

2008-02-05 Thread Renaud Delbru

Very nice, I will try this approach.

Thanks Yonik.

Yonik Seeley wrote:

On Feb 4, 2008 11:53 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
  

You could, but that would be the hard way (by a big margin).
There are pluggable query parsers now (see QParserPlugin)... but the
current missing piece is being able to specify a new parser plugin
from solrconfig.xml



Hmmm, it appears I forgot what I implemented already ;-)

Support for adding new parser plugins from solrconfig.xml already
exists (and I just added a test).
So add something like the following to your solrconfig.xml


And then implement FooQParserPlugin in Java to create your desired
query structures (span queries or whatever).  See other
implementations of FooQParserPlugin in Solr for guidance.

To use your "foo" parser, set it to the default query type by adding
defType="foo" to the request (or to the defaults for your handler).
You can also override the current query type via q=my query


-Yonik
  



--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/
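
A hypothetical sketch of the parser plugin Yonik outlines above, registered in solrconfig.xml with something like <queryParser name="foo" class="FooQParserPlugin"/>. The whitespace-to-SpanNear parsing, the field name and the slop are illustrative assumptions, not the actual behaviour of any existing parser.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class FooQParserPlugin extends QParserPlugin {
  public void init(NamedList args) {}

  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      public Query parse() {
        // Naive example: treat whitespace-separated terms as a SpanNear
        // over the "text" field, with a slop of 5, in order.
        String[] terms = getString().split("\\s+");
        SpanQuery[] clauses = new SpanQuery[terms.length];
        for (int i = 0; i < terms.length; i++) {
          clauses[i] = new SpanTermQuery(new Term("text", terms[i]));
        }
        return new SpanNearQuery(clauses, 5, true);
      }
    };
  }
}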


Re: Querying multiple dynamicField

2008-02-05 Thread Renaud Delbru
The idea was to keep a certain number of lines (or sentences) separate 
in a document without using the position-gap trick between field 
instances. I found the use of multiple dynamic fields to be a cleaner 
and more generic approach.
By using copyField, I duplicate data inside the index but I also lose 
the line distinction.


I think support for wildcards in the field name would be a good 
addition to Solr's features. This would give us the ability to query 
only a certain "type" of dynamic field (typeA_*, typeB_*, etc.).


Regards.

Lance Norskog wrote:

You can use the <copyField> directive to copy all 'sentence_*' fields into
one indexed field. You then have a named field that you can search against.

Lance Norskog

-Original Message-
From: Renaud Delbru [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 01, 2008 6:48 PM

To: solr-user@lucene.apache.org
Subject: Querying multiple dynamicField

Hi,

We would like to know if there is an efficient way to query multiple
dynamicField at the same time, using wildcard in the field name. For
example, we have a list of dynamic fields "sentence_*" and we would like to
execute a query on all the "sentence_*" fields.
Is there a way to execute such queries on Solr 1.3 / Lucene 2.3 ?

Regards.

--
Renaud Delbru
  



--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/


Re: SpanQuery support

2008-02-04 Thread Renaud Delbru

Yonik Seeley wrote:

On Feb 2, 2008 3:43 PM, Renaud Delbru <[EMAIL PROTECTED]> wrote:
  

I was looking at the discussion of SOLR-281. If I understand correctly,
the task would be to write my own search component class,
SpanQueryComponent that extends the SearchComponent class, then
overwriting the declaration of the "query searchComponent" in
solrconfig.xml:

Then, I will be able to use directly my own query syntax and query
component ? Is it correct ?



You could, but that would be the hard way (by a big margin).
There are pluggable query parsers now (see QParserPlugin)... but the
current missing piece is being able to specify a new parser plugin
from solrconfig.xml

-Yonik
  
I have looked at MoreLikeThisHandler.java. I saw that all the 
MoreLikeThis logic is defined inside the handler and through the inner 
class MoreLikeThisHelper.
Could I follow the same approach and define a ProximityHandler class 
that executes Lucene span queries based on some request parameters? Is that 
the right way to do it?


Regards.

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/


Re: SpanQuery support

2008-02-04 Thread Renaud Delbru

Hi Yonik,

Yonik Seeley wrote:

On Feb 2, 2008 3:43 PM, Renaud Delbru <[EMAIL PROTECTED]> wrote:
  

I was looking at the discussion of SOLR-281. If I understand correctly,
the task would be to write my own search component class,
SpanQueryComponent that extends the SearchComponent class, then
overwriting the declaration of the "query searchComponent" in
solrconfig.xml:

Then, I will be able to use directly my own query syntax and query
component ? Is it correct ?



You could, but that would be the hard way (by a big margin).
There are pluggable query parsers now (see QParserPlugin)... but the
current missing piece is being able to specify a new parser plugin
from solrconfig.xml

-Yonik
  

Hum, I would prefer to follow the easiest way ;o).
Could you briefly explain the easiest way? And give me some hints on 
which classes I need to extend to achieve my goal?


Regards.

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/


Re: SpanQuery support

2008-02-02 Thread Renaud Delbru

Thanks Yonik,

I was looking at the discussion of SOLR-281. If I understand correctly, 
the task would be to write my own search component class, 
SpanQueryComponent, that extends the SearchComponent class, then 
override the declaration of the "query" searchComponent in 
solrconfig.xml:


Then, I would be able to use my own query syntax and query 
component directly? Is that correct?


Regards.

Yonik Seeley wrote:

Solr 1.3 will have query parser plugins... so you could write your own
parser that utilized span queries.
-Yonik

On Feb 2, 2008 2:48 PM, Renaud Delbru <[EMAIL PROTECTED]> wrote:
  

Do you know if it is currently possible to use the SpanQuery feature of
Lucene in Solr 1.3. We would like to use nested span queries such as
(("A B") near ("C D")).
Do a request handler support such feature ? Or, any idea how could we
perform ?




--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/


SpanQuery support

2008-02-02 Thread Renaud Delbru

Hi,

Do you know if it is currently possible to use the SpanQuery feature of 
Lucene in Solr 1.3? We would like to use nested span queries such as 
(("A B") near ("C D")).
Does any request handler support such a feature? Or, any idea how we could 
proceed?


Regards.

--
Renaud Delbru


Querying multiple dynamicField

2008-02-01 Thread Renaud Delbru

Hi,

We would like to know if there is an efficient way to query multiple 
dynamicFields at the same time, using a wildcard in the field name. For 
example, we have a list of dynamic fields "sentence_*" and we would like 
to execute a query on all the "sentence_*" fields.

Is there a way to execute such queries on Solr 1.3 / Lucene 2.3 ?

Regards.

--
Renaud Delbru


Re: LSA Implementation

2007-11-26 Thread Renaud Delbru

LDA (Latent Dirichlet Allocation) is a similar technique that extends pLSI.
You can find some implementations in C++ and Java on the Web.

Grant Ingersoll wrote:
Interesting.  I am not a lawyer, but my understanding has always been 
that this is not something we could do.


The question has come up from time to time on the Lucene mailing list:
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=Latent+Semantic&search_type=AND 



That being said, there may be other approaches that do similar things 
that aren't covered by a patent, I don't know.


Is there something specific you want to do, or are you just going by 
the promise of better results using LSI?


I suppose if someone said they had a patch for Lucene/Solr that 
implemented it, we could ask on legal-discuss for advice.


-Grant

On Nov 26, 2007, at 1:13 PM, Eswar K wrote:


I was just searching for info on LSA and came across Semantic Indexing
project under a GNU license... which of course is still under development 
in C++

though.

- Eswar

On Nov 26, 2007 9:56 PM, Jack <[EMAIL PROTECTED]> wrote:


Interesting. Patents are valid for 20 years so it expires next year? :)
PLSA does not seem to have been patented, at least not mentioned in
http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

On Nov 26, 2007 6:58 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
patented, so it is not likely to happen unless the authors donate the
patent to the ASF.

-Grant



On Nov 26, 2007, at 8:23 AM, Eswar K wrote:


All,

Is there any plan to implement Latent Semantic Analysis as part of
Solr
anytime in the near future?

Regards,
Eswar


--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





--
Renaud Delbru,
E.C.S., M.Sc. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/