Re: Quick Questions

2013-03-08 Thread Upayavira
In example/cloud-scripts/ you will find a Solr specific zkCli tool to
upload/download configs.

You will need to reload a core/collection for the changes to take
effect.
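
For reference, a sketch of the commands (the ZooKeeper address, paths and
collection/config names here are illustrative, not from this thread):

# upload a config directory to ZooKeeper with the bundled script
example/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
  -confdir ./solr/collection1/conf -confname myconf

# then reload the collection so the new config takes effect
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1"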

Upayavira

On Fri, Mar 8, 2013, at 07:02 AM, Nathan Findley wrote:
 I am setting up solrcloud with zookeeper.
 
 - I am wondering if there are nicer ways to update the zookeeper config
 files (data-import) besides restarting a node with the bootstrap option?
 - Right now I kill the node manually in order to restart it. Is there a 
 better way to restart?
 
 Thanks,
 Nate
 
 -- 
 CTO
 Zenlok株式会社
 


SOLR - Recommendation on architecture

2013-03-08 Thread Kobe J
We are planning to use SOLR 4.1 for full text indexing. Following is the
hardware configuration of the web server that we plan to install SOLR on:-

*CPU*: 2 x Dual Core (4 cores)

*RAM:* 12GB

*Storage*: 212GB

*OS Version* – Windows 2008 R2



The dataset to be imported will have approx. 800k records, with 450 fields
per record. Query response time should be between 200ms and 800ms.



Please suggest if the current single server implementation should work fine
and if the specified configuration is enough for the requirement.


R: Query parsing issue

2013-03-08 Thread Francesco Valentini
Thank you very much,

I've tried both of the ways you suggested. Then I chose to
re-write the parse method by extending the ExtendedDismaxQParser class.

Francesco.

-Original Message-
From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com] 
Sent: Wednesday, March 6, 2013 19:39
To: solr-user@lucene.apache.org
Subject: Re: Query parsing issue

It should be easy to extend ExtendedDismaxQParser and do your pre-processing
in the parse() method before calling edismax's parse. Or maybe you could
change the way EDismax is splitting the input query into clauses by
extending the splitIntoClauses method?
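
A minimal sketch of the first approach, assuming the Solr 4.x APIs where
ExtendedDismaxQParser is a public class (the plugin class name and the
preprocess() hook are made up for illustration; this variant preprocesses
the raw string before handing it to edismax rather than inside an
overridden parse()):

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.ExtendedDismaxQParser;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class PreprocessingEdismaxQParserPlugin extends QParserPlugin {

    @Override
    public void init(NamedList args) {
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        // rewrite the raw query string before edismax ever sees it
        return new ExtendedDismaxQParser(preprocess(qstr), localParams, params, req);
    }

    // placeholder for whatever custom analysis/rewriting is needed
    private String preprocess(String raw) {
        return raw == null ? null : raw.trim();
    }
}

It would then be registered in solrconfig.xml with something like
<queryParser name="myedismax" class="com.example.PreprocessingEdismaxQParserPlugin"/>
and selected per request with defType=myedismax.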

Tomás


On Wed, Mar 6, 2013 at 6:37 AM, Francesco Valentini 
francesco.valent...@altiliagroup.com wrote:

 Hi,



 I’ve written my own analyzer to index and query a set of documents. At 
 indexing time everything goes well but

 now I have a problem in  query phase.

 I need to pass  the whole query string to my analyzer before the 
 edismax query parser begins its tasks.

 In other words I have to preprocess the raw query string.

 The phrase querying does not fit my needs because I don’t have to 
 match the entire set of terms/tokens.

 How can I achieve this?



 Thank you in advance.





 Francesco







Re: SOLR - Recommendation on architecture

2013-03-08 Thread Gora Mohanty
On 8 March 2013 14:19, Kobe J kobe.free.wo...@gmail.com wrote:
 We are planning to use SOLR 4.1 for full text indexing. Following is the
 hardware configuration of the web server that we plan to install SOLR on:-

 *CPU*: 2 x Dual Core (4 cores)

 *RAM:* 12GB

 *Storage*: 212GB

 *OS Version* – Windows 2008 R2
[...]

As with most things, the devil is in the details: what kind of
queries are you planning to run, and what search features
will you be using (e.g., faceting, sorting, highlighting)?
A desired query response time is meaningless without also
specifying the number of simultaneous users. Your best bet
is to set up a prototype, and benchmark your search.

Having said that, your proposed hardware seems more than
adequate for your needs. Two notes:
1. If possible, use SSDs or fast disks
2. I would not use Windows as a server platform

Regards,
Gora


Re: JoinQuery and scores

2013-03-08 Thread Upayavira
I would recommend reading up on Lucene scoring, there's a lot to
understand there.

The join query parser (triggered by the use of {!join} syntax) searches
for a list of documents matching the term specified, and provides a list
of matching IDs. It then performs a second search based upon those IDs.
It is that second search that will be scored, but given you are just
using IDs, there's no scoring to be done.

Given that your joining term 'jeans' will exist in documents on both
sides of the join, you could say:

http://localhost:8983/solr/ee/select?fq={!join%20from=oxparentid%20to=oxid}jeans&q=jeans

That would cause the term 'jeans' to be scored (the more common the term
in a document, the higher it scores, etc).

But by the sounds of it, it would be useful for you to understand better
how scoring calculations are done, so you can see *why* a score would be
the way it is.

Upayavira

On Fri, Mar 8, 2013, at 07:56 AM, Stefan Moises wrote:
 Hi Erick,
 
 if I try the same query without join I get different scores for each 
 hit... here is an example query:
 
 http://localhost:8983/solr/ee/select?facet=true&facet.mincount=1&facet.limit=-1&rows=10&fl=oxid,score,oxtitle&debugQuery=true&start=0&facet.sort=lex&facet.field=oxprice&facet.field=manuseo&facet.field=vendseo&facet.field=catpaths&facet.field=catpathstok&facet.field=att_EU-Groesse&facet.field=att_Schnitt&facet.field=att_Groesse&facet.field=att_Farbe&facet.field=att_Einsatzbereich&facet.field=att_Material&facet.field=att_Modell&facet.field=att_Anzeige&facet.field=att_Design&facet.field=att_Lieferumfang&facet.field=att_Washing&facet.field=att_Beschaffenheit&qt=dismax&q={!join%20from=oxparentid%20to=oxid}jeans
 
 Anything wrong with that?
 Every doc returned has a score of 1.0 with the join.
 Without join I get scores between 0.40337953 and 0.40530312.
 
 Thanks,
 Stefan
 
 Am 08.03.2013 03:21, schrieb Erick Erickson:
  What's the rest of your query? What you've indicated doesn't have any terms
  to score. Join can be thought of as a bit like a filter query in this
  sense; the fact that the join hit is just an inclusion/exclusion clause,
  not a scoring.
 
  Best
  Erick
 
 
  On Thu, Mar 7, 2013 at 10:32 AM, Stefan Moises moi...@shoptimax.de wrote:
 
  Hi List,
 
  we are using the JoinQuery (JoinQParserPlugin) via request parameter, e.g.
  {!join from=parentid to=productsid} in Solr 4.1 which works great for our
  purposes, but unfortunately, all docs returned get a score of 1.0... this
  makes the whole search pretty useless imho, since the results are sorted
  totally random of course 
  Is there any simple way to fix this or an explanation why this is the case?
 
  Thanks a lot in advance,
  Stefan
 
 
 
 
 
 -- 
 With best regards from Nürnberg,
 Stefan Moises
 
 ***
 Stefan Moises
 Senior Softwareentwickler
 Leiter Modulentwicklung
 
 shoptimax GmbH
 Guntherstraße 45 a
 90461 Nürnberg
 Amtsgericht Nürnberg HRB 21703
 GF Friedrich Schreieck
 
 Fax:  0911/25566-29
 moi...@shoptimax.de
 http://www.shoptimax.de
 ***
 
 


Re: Solr 4.x auto-increment/sequence/counter functionality.

2013-03-08 Thread mark12345
So I think I took the easiest option and created an UpdateRequestProcessor
implementation (I was unsure of the performance implications and object
model of ScriptUpdateProcessor).  The below
DocumentCreationDetailsProcessorFactory class seems to achieve my aim of
allowing me to sort my Solr documents by creation order (to an extent - I
don't think it is exactly the commit order), though the
auto-increment/sequence/counter values are not contiguous.

Solr Sort Parameter String:
sort=created_time_stamp_l asc, created_processing_sequence_number_l asc,
created_by_solr_thread_id_l asc, created_by_solr_core_name_s asc,
created_by_solr_shard_id_s asc


Any comments or feedback would be appreciated.

//
// UpdateRequestProcessor implementation
//
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.solr.cloud.CloudDescriptor;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DocumentCreationDetailsProcessorFactory extends
        UpdateRequestProcessorFactory {

    // shared across all processor instances created by this factory
    private static final AtomicLong processingSequenceNumber = new AtomicLong();

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new DocumentCreationDetailsProcessor(req, rsp, next,
                processingSequenceNumber);
    }
}

class DocumentCreationDetailsProcessor extends UpdateRequestProcessor {

    private final SolrQueryRequest req;

    @SuppressWarnings("unused")
    private final SolrQueryResponse rsp;

    @SuppressWarnings("unused")
    private final UpdateRequestProcessor next;

    private final AtomicLong processingSequenceNumber;

    public DocumentCreationDetailsProcessor(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next,
            AtomicLong processingSequenceNumber) {
        super(next);

        this.req = req;
        this.rsp = rsp;
        this.next = next;

        this.processingSequenceNumber = processingSequenceNumber;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {

        SolrInputDocument solrInputDocument = cmd.getSolrInputDocument();

        // wall-clock time and a JVM-wide sequence number for ordering
        solrInputDocument.addField("created_time_stamp_l",
                System.currentTimeMillis());

        solrInputDocument.addField("created_processing_sequence_number_l",
                processingSequenceNumber.incrementAndGet());

        String solrCoreName = null;
        String solrShardId = null;

        if (req != null
                && req.getCore() != null
                && req.getCore().getCoreDescriptor() != null) {

            SolrCore solrCore = req.getCore();
            CoreDescriptor coreDesc = null;
            CloudDescriptor cloudDesc = null;

            if (solrCore != null) {
                solrCoreName = solrCore.getName();
                coreDesc = req.getCore().getCoreDescriptor();

                if (coreDesc != null) {
                    cloudDesc = coreDesc.getCloudDescriptor();
                }

                if (cloudDesc != null) {
                    solrShardId = cloudDesc.getShardId();
                }
            }
        }

        solrInputDocument.addField("created_by_solr_thread_id_l",
                Thread.currentThread().getId());
        solrInputDocument.addField("created_by_solr_core_name_s",
                solrCoreName);
        solrInputDocument.addField("created_by_solr_shard_id_s",
                solrShardId);

        // pass it up the chain
        super.processAdd(cmd);
    }
}
//



//
//  Added the below for a bit of context
(http://wiki.apache.org/solr/SolrPlugins)
//

mkdir /opt/solr/instances/test/collection1/lib
cp /home/user/download/test-solr-plugins-0.0.1.jar /opt/solr/instances/test/collection1/lib/
chown root:tomcat7 /opt/solr/instances/test/collection1/lib/*

vim /opt/solr/instances/test/collection1/conf/solrconfig.xml
<updateRequestProcessorChain name="mychain">
  <processor
    class="com.test.solr.plugins.DocumentCreationDetailsProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


vim /opt/solr/instances/test/collection1/conf/solrconfig.xml
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-x-auto-increment-sequence-counter-functionality-tp4045125p4045725.html
Sent from the Solr - User mailing list archive at Nabble.com.


SOLR-3076 for beginners?

2013-03-08 Thread Uwe Reh

Hi,

Block join seems to be a really cool feature. Unfortunately I'm too dumb to
get the patch running. I don't even know where to start :-(


Is there an example, a how-to or a cookbook anywhere, other than using
Elasticsearch or bare Lucene?


Uwe


Re: SOLR - Recommendation on architecture

2013-03-08 Thread Jilal Oussama
I would not recommend Windows either


2013/3/8 Kobe J kobe.free.wo...@gmail.com

 We are planning to use SOLR 4.1 for full text indexing. Following is the
 hardware configuration of the web server that we plan to install SOLR on:-

 *CPU*: 2 x Dual Core (4 cores)

 *RAM:* 12GB

 *Storage*: 212GB

 *OS Version* – Windows 2008 R2



 The dataset to be imported will have approx. 800k records, with 450 fields
 per record. Query response time should be between 200ms and 800ms.



 Please suggest if the current single server implementation should work fine
 and if the specified configuration is enough for the requirement.



Re: SOLR - Recommendation on architecture

2013-03-08 Thread Upayavira
Because?

Upayavira

On Fri, Mar 8, 2013, at 09:27 AM, Jilal Oussama wrote:
 I would not recommend Windows either
 
 
 2013/3/8 Kobe J kobe.free.wo...@gmail.com
 
  We are planning to use SOLR 4.1 for full text indexing. Following is the
  hardware configuration of the web server that we plan to install SOLR on:-
 
  *CPU*: 2 x Dual Core (4 cores)
 
  *RAM:* 12GB
 
  *Storage*: 212GB
 
  *OS Version* – Windows 2008 R2
 
 
 
  The dataset to be imported will have approx. 800k records, with 450 fields
  per record. Query response time should be between 200ms and 800ms.
 
 
 
  Please suggest if the current single server implementation should work fine
  and if the specified configuration is enough for the requirement.
 


Re: Quick Questions

2013-03-08 Thread Nathan Findley

On 03/08/2013 05:06 PM, Upayavira wrote:

In example/cloud-scripts/ you will find a Solr specific zkCli tool to
upload/download configs.

You will need to reload a core/collection for the changes to take
effect.

Upayavira

On Fri, Mar 8, 2013, at 07:02 AM, Nathan Findley wrote:

I am setting up solrcloud with zookeeper.

- I am wondering if there are nicer ways to update the zookeeper config
files (data-import) besides restarting a node with the bootstrap option?
- Right now I kill the node manually in order to restart it. Is there a
better way to restart?

Thanks,
Nate

--
CTO
Zenlok株式会社



Ok that is good to know.

Using zookeeper I can see the following dataimport.properties:

last_index_time=2013-03-06 12\:02\:22
email_history.last_index_time=2013-03-06 12\:02\:22
...

The problem is that the last_index_time is not being changed when I run
a delta import.  Any ideas why?  If it is a permissions issue, I am a
bit confused because I am testing as the root user and don't see any
errors to indicate that zookeeper is failing to write to the filesystem.


Thanks,
Nate

--
CTO
Zenlok株式会社



Re: SOLR - Recommendation on architecture

2013-03-08 Thread Upayavira
If you are attempting to assess performance, you should use as many
records as you can muster. A Lucene index does start to struggle at a
certain size, and you may be getting close to that, depending upon the
size of your fields.

Are you suggesting that you would host other services on the server as
well? I would expect your Solr instance to want sole use of the server,
as an index of your size will demand it. 

Upayavira

On Fri, Mar 8, 2013, at 10:02 AM, kobe.free.wo...@gmail.com wrote:
 Thanks for your suggestion Gora.
 
 Yes, we are planning to use the faceting and sorting features. The number of
 simultaneous users would be around 500 per minute. We have preferred Windows
 since the server would also be hosting some of our Microsoft-based web
 applications. For prototyping, given the number of records we will be
 working with, how many records do you suggest we include?
 
 
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SOLR-Recommendation-on-architecture-tp4045718p4045734.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Mark document as hidden

2013-03-08 Thread lboutros
Dear all,

I would like to mark documents as hidden.
I could add a "hidden" field and set its value to true, but then the whole
document would have to be reindexed.
And external file fields are not searchable.
I could store the document keys in an external database and filter the
results by these ids, but if I have some millions of hidden documents, I
don't think that is a great idea.

Currently I will reindex the documents, but if someone has a better idea,
any help will be appreciated.

Ludovic.



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Mark document as hidden

2013-03-08 Thread Upayavira
Without java coding, you cannot filter on things that aren't in your
index. You would need to re-index the document, but maybe you could make
use of atomic updates to just change the hidden field without needing to
push the whole document again.
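
For example, something along these lines (assuming the uniqueKey is "id", a
stored "hidden" field, and that all other fields are stored, which atomic
updates require; the collection name and document key are placeholders):

curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"DOC123","hidden":{"set":true}}]'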

Upayavira

On Fri, Mar 8, 2013, at 11:40 AM, lboutros wrote:
 Dear all,
 
 I would like to mark documents as hidden.
 I could add a field hidden and pass the value to true, but the whole
 documents will be reindexed. 
 And External file fields are not searchable.
 I could store the document keys in an external database and filter the
 result with these ids. But if I have some millions of hidden documents, I
 don't think it is a great idea.
 
 Currently I will reindex the documents, but if someone has a better idea,
 any help will be appreciated.
 
 Ludovic.
 
 
 
 -
 Jouve
 France.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RessourceLoader newInstance

2013-03-08 Thread Peter Kirk
Hi

Can someone explain to me the point of the method public <T> T
newInstance(String cname, Class<T> expectedType) in the interface
org.apache.solr.common.ResourceLoader (or
org.apache.lucene.analysis.util.ResourceLoader)?

If I want to implement a ResourceLoader, what is the purpose of me implementing 
a newInstance method?

The other method in the interface (openResource), makes sense to me, but I'm 
not sure about newInstance.

Thanks,
Peter





Re: Mark document as hidden

2013-03-08 Thread Erik Hatcher
External file fields, via function queries, are still usable for filtering.  
Consider using the frange function query to filter out hidden documents. 
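
For instance, assuming an external file field named hidden_flag that holds 1
for hidden documents (the field name is illustrative), a filter like this
keeps only the visible ones:

fq={!frange l=0 u=0}hidden_flag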

Erik

On Mar 8, 2013, at 6:40, lboutros boutr...@gmail.com wrote:

 Dear all,
 
 I would like to mark documents as hidden.
 I could add a field hidden and pass the value to true, but the whole
 documents will be reindexed. 
 And External file fields are not searchable.
 I could store the document keys in an external database and filter the
 result with these ids. But if I have some millions of hidden documents, I
 don't think it is a great idea.
 
 Currently I will reindex the documents, but if someone has a better idea,
 any help will be appreciated.
 
 Ludovic.
 
 
 
 -
 Jouve
 France.
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RE: inconsistent number of results returned in solr cloud

2013-03-08 Thread Hardik Upadhyay
HI

I am using Solr 4.0 (not BETA), and have created a 2-shard, 2-replica
configuration.
But when I query Solr with a filter query it returns inconsistent result counts.
Without the filter query it consistently returns the same result count.
I don't understand why.

Can any one help in this?

Best Regards

Hardik Upadhyay




SolrCloud: port out of range:-1

2013-03-08 Thread roySolr
Hello,

I have some problems with SolrCloud and Zookeeper. I have 2 servers and I
want to have a Solr instance on both servers. Both Solr instances run an
embedded Zookeeper.

When I try to start the first one I get the error: port out of range:-1.

The command i run to start solr with embedded zookeeper:

java -Djetty.port=4110 -DzkRun=10.100.10.101:5110
-DzkHost=10.100.10.101:5110,10.100.10.102:5120 -Dbootstrap_conf=true
-DnumShards=1 -Xmx1024M -Xms512M -jar start.jar

It runs Solr on port 4110, the embedded zk on 5110. 

The -DzkHost parameter gives the URL of the localhost zk (5110) and the URL of the
other server (zk port). When I try to start this it gives the error: port out
of range:-1.

What's wrong?

Thanks
Roy







--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-port-out-of-range-1-tp4045804.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: inconsistent number of results returned in solr cloud

2013-03-08 Thread mike st. john

Check for duplicate ids.

A quick way is to facet on the id field and set the mincount to 2.
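
For example (assuming the uniqueKey field is named "id"):

http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=id&facet.mincount=2&facet.limit=-1

Any ids that come back with a count of 2 or more are duplicated across shards.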


-Mike

Hardik Upadhyay wrote:


HI

I am using solr 4.0 (Not BETA), and have created 2 shard 2 replica 
configuration.
But when I query solr with filter query it returns inconsistent result 
count.

Without filter query it returns same consistent result count.
I don't understand why?

Can any one help in this?

Best Regards

Hardik Upadhyay




Re: Mark document as hidden

2013-03-08 Thread lboutros
Excellent, Erik! It works perfectly.

Normal filter queries are cached. Is it the same for frange filter queries
like this one?

fq={!frange l=0 u=10}removed_revision

Thanks to both for your answers.

Ludovic.



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045817.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Mark document as hidden

2013-03-08 Thread lboutros
One more question: is there already a way to update the external file (add
values) from within Solr?

Ludovic.



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045823.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR - Recommendation on architecture

2013-03-08 Thread Walter Underwood
Your server seems to be about the right size but, as everyone else has said,
it depends on the kinds of queries.

Solr should be the only service on the system. Solr can make heavy use of the 
disk which will interfere with other processes. If you are lucky enough to get 
the system tuned to run from RAM, it can use 100% of CPU. Tuning Solr will be 
very difficult with other services sharing the same system.

If you need to meet an SLA, you will have a hard time doing that on a shared 
server. When you don't meet that SLA, it will be almost impossible to diagnose 
why.

Why not Windows?

* the Windows filesystem is not designed for heavy server use
* Windows does not allow open files to be deleted -- there are workarounds for 
this in Solr, but it is a continuing problem
* the Windows file cache is organized by file, not by block, which is 
inefficient for Solr's access pattern
* Java on Windows works, but has a number of workarounds and quirks
* the Solr community is almost all Unix users, so you will get much better help 
on Unix

wunder

On Mar 8, 2013, at 3:04 AM, Upayavira wrote:

 If you are attempting to assess performance, you should use as many
 records as you can muster. A Lucene index does start to struggle at a
 certain size, and you may be getting close to that, depending upon the
 size of your fields.
 
 Are you suggesting that you would host other services on the server as
 well? I would expect your Solr instance to want sole use of the server,
 as an index of your size will demand it. 
 
 Upayavira
 
 On Fri, Mar 8, 2013, at 10:02 AM, kobe.free.wo...@gmail.com wrote:
 Thanks for your suggestion Gora.
 
 Yes, we are planning to use the faceting and sorting features. The number of
 simultaneous users would be around 500 per minute. We have preferred Windows
 since the server would also be hosting some of our Microsoft-based web
 applications. For prototyping, given the number of records we will be
 working with, how many records do you suggest we include?
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SOLR-Recommendation-on-architecture-tp4045718p4045734.html
 Sent from the Solr - User mailing list archive at Nabble.com.







Re: Mark document as hidden

2013-03-08 Thread Erik Hatcher
Ludovic -

Yes, this query would be cached (unless you say cache=false).  
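
That is, the same filter with caching disabled would look like this:

fq={!frange l=0 u=10 cache=false}removed_revision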

Erik

On Mar 8, 2013, at 10:26 , lboutros wrote:

 Excellent Erik ! It works perfectly.
 
 Normal filter queries are cached. Is it the same for frange filter queries
 like this one ? :
 
 fq={!frange l=0 u=10}removed_revision
 
 Thanks to both for your answers.
 
 Ludovic.
 
 
 
 -
 Jouve
 France.
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045817.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Mark document as hidden

2013-03-08 Thread Erik Hatcher
The external file is maintained externally.  Solr only reads it, and does not 
have a facility to write to it, if that is what you're asking.  

Erik

On Mar 8, 2013, at 10:43 , lboutros wrote:

 One more question, is there already a way to update the external file (add
 values) in Solr ?
 
 Ludovic.
 
 
 
 -
 Jouve
 France.
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045823.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4

2013-03-08 Thread David Smiley (@MITRE.org)
The underlying index format is unchanged between SOLR-2155 and Solr 4 provided
that this is only about indexing points, and SOLR-2155 could only index
points anyway. To really ensure it's drop-in compatible, specify
maxLevels=12 *instead of* setting maxDistErr (which indirectly derives a
maxLevels) so you can be sure the levels are the same.
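
For example, a field type along these lines (the type name is arbitrary;
distErrPct="0" relates to the query-precision point discussed below):

<fieldType name="geohash" class="solr.SpatialRecursivePrefixTreeFieldType"
  geo="true" prefixTree="geohash" maxLevels="12" distErrPct="0"
  units="degrees" />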

Also, SOLR-2155 always did full query shape precision (to as much as the
maxLevels indexed length allows of course). By default,
SpatialRecursivePrefixTreeFieldType uses 2.5% of the query shape radius as
its accuracy, which buys a little more performance at the expense of
accuracy.  You can set distErrPct=0 if you require full precision.  For
example you might need a meter of indexed precision for the case when
someone zooms in really low to some small region, but if the search is a
huge area of an entire country or state, then do you truly need a meter
precision along the edge for that case too?  I think not.  distErrPct is
relative to the overall size of the query shape.  The default I think is
probably fine but people have observed its inaccuracy, particularly when a
point is plotted outside a drawn query box and thought it was a problem with
the spatial code when it's actually a configuration default.  0 is actually
quite scalable provided there isn't a ton of indexed data coinciding with
the query shape edge along the query shape's entire edge.

I'd be interested to hear if the Solr 4 version is faster/slower if you have
any benchmarks -- especially v4.2 due out soon, but earlier 4.x should be
nearly the same.

It's weird that you're seeing the stored value coming back in search results
as a geohash.  In Solr 4 you get precisely what you added.

~ David


Harley wrote
 Hi David Smiley: 
 We use 3rd-party software to load Solr 3.4, so the behavior needs to be
 transparent with the migration to 4.1, but I was expecting that I would
 need to rebuild the Solr database.
 
 I moved/added the old Solr 3.4 core to Solr 4.1 with only minor
 modifications (commented out the old spatial type and added the new one) and I
 was surprised I was able to query the data.
 
 The geohash is displayed as a hash, and not a coordinate, so I am checking
 my configuration of the geospatial class.
 
 
 
 
 
 Harley Powers Parks, GISP
 Booz | Allen | Hamilton
 Geospatial Visualization Web Developer
 
 WEB: https://www.apan.org
 
 USPACOM J73/APAN
 Pacific Warfighting Center Ford Island
  
 p: 808.472.7752
 c: 808.377.0632
 apan: 

 harley.parks@

 nipr:  

 harley.parks.ctr@

  
 CONTRACTOR: 
 Booz | Allen | Hamilton
  e: 

 parks_harley@

 
 
 
 -Original Message-
 From: David Smiley (@MITRE.org) [mailto:

 DSMILEY@

 ] 
 Sent: Wednesday, March 06, 2013 9:34 PM
 To: 

 solr-user@.apache

 Subject: Re: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4
 
 Hi Harley,
 
 See: http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
 In SOLR-2155 you had to explicitly specify the prefix encoding length,
 whereas in Solr 4 you specify how much precision you need and it figures
 out what the length is that satisfies that. When you first use the field,
 it'll log what the derived levels figure is (if you care).  The units are
 decimal degrees (0-180 from no distance to reverse side of the globe --
 aka latitudinal degrees).
 
 You can name the field type whatever you want, but I don't recommend
 geohash because this conflates it with an actual GeoHashField, and also
 it's more of an internal detail.
 
 You said you're having trouble with the migration... but what is the
 trouble?
 
 ~ David
 
 
 Harley wrote
 I'm having trouble migrating the geohash fields from my Solr 3.4 
 schema to the Solr 4 schema.
 
 this is the 3.4 type and class:
 
 <fieldType name="geohash" class="solr2155.solr.schema.GeoHashField"
   length="12"/>
 Is the Solr 4 spatial type below the right configuration for data that was
 stored in fields using the geohash type and class of the Solr 3.4 field
 type above?
 
 <fieldType name="geohash"
   class="solr.SpatialRecursivePrefixTreeFieldType"
   geo="true" distErrPct="0.025" maxDistErr="0.09" units="degrees"
   prefixTree="geohash" />
 Is units="degrees" decimal degrees? Example: 21.0345
 
  
 
 Harley Powers Parks, GISP
 
 
 
 
 
 -
  Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-to-Solr-4-tp4045416p4045470.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-to-Solr-4-tp4045416p4045835.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4

2013-03-08 Thread David Smiley (@MITRE.org)
You're supposed to add geo point data in "latitude, longitude" format,
although some other variations work.  Is your updating process supplying a
geohash instead?  If so, you could write a simple Solr UpdateRequestProcessor
to convert it to the expected format.  But that doesn't help the fact that
apparently your index already has a geohash for the stored field value.
Are all your fields either stored, or copied from a stored field?  It would
then be an option to dump all the data via CSV (take care with multi-valued
fields) and then load it into an empty instance.

You could optimize your index, which upgrades as a side-effect.  
FYI there's a Lucene IndexUpgrader you can use at the command line:
http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/index/IndexUpgrader.html
But again, if your stored field values are geohashes when you want a lat,lon
then this isn't going to fix that.
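
For the record, a sketch of invoking it from the command line (the jar and
index paths are illustrative):

java -cp lucene-core-4.1.0.jar org.apache.lucene.index.IndexUpgrader -verbose /path/to/solr/data/index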

~ David


Harley wrote
 David Smiley:
 Because we use 3rd-party software, I checked to see if this would still
 work... the search query still works. But adding data seems to be broken,
 likely because of the geohash type.
 
 So, below is the log file, which tells me to upgrade.
 
 If possible, it would be great to simply get the old 3.4 index working.
 What should my workflow be to get this working as is, and then to upgrade?
 
 I'm expecting to delete the data folder, then rebuild the index via the 3rd-party
 software adding data to Solr 4... is it possible to reindex the
 existing data folder?





-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-to-Solr-4-tp4045416p4045836.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Mark document as hidden

2013-03-08 Thread lboutros
Ok, thanks Erik.

Do you see any problem in modifying the Update handler in order to append
some  values to this file ?

Ludovic



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045839.html
Sent from the Solr - User mailing list archive at Nabble.com.


How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Andy Lester
We've got an 11,000,000-document index.  Most documents have a unique ID called 
flrid, plus a different ID called solrid that is Solr's PK.  For some 
searches, we need to be able to limit the searches to a subset of documents 
defined by a list of FLRID values.  The list of FLRID values can change between 
every search, and it will be rare enough to call it "never" that any two 
searches will have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

q=title:dogs AND 
(flrid:(123 125 139  34823) OR 
 flrid:(34837 ... 59091) OR 
 ... OR 
 flrid:(101294813 ... 103049934))

Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We have 
to subgroup to get past Solr's limit on the number of terms that can be 
ORed together.

The problem with this approach (besides that it's clunky) is that it seems to 
perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms or so.  
If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000 FLRIDs, 
that jumps up to about 75000ms.  We want it to be on the order of 1000-2000ms at 
most in all cases up to 100,000 FLRIDs.

How can we do this better?

Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No 
improvement.
* Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and 
doing a join between it and the main core, but if we do five or ten searches 
per second, it seems like Solr would die from all the commits.  The set of 
FLRIDs is unique between searches so there is no reuse possible.
* Considered: Translating FLRIDs to SolrIDs and then limiting on SolrID instead, 
so that Solr doesn't have to hit the documents in order to translate 
FLRID-to-SolrID to do the matching.

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull 
them from the app's Oracle database.
* Have Solr do big ORs as a set operation not as (what we assume is) a naive 
one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings 
of fqs in the query seems to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of 
situation a few times, but no answers that I see beyond what we're doing now.

* 
http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* 
http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* 
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* 
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks,
Andy

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance



Re: Mark document as hidden

2013-03-08 Thread lboutros
I could create an UpdateRequestProcessorFactory that could update this file,
it seems to be better ?



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mark-document-as-hidden-tp4045756p4045842.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
hi Andy,

It seems like a common type of operation and I would also be curious what
others think. My take on this is to create a compressed intbitset and send
it as a query filter, then have the handler decompress/deserialize it, and
use it as a filter query. We have already done experiments with intbitsets
and they are fast to send/receive.

look at page 20
http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component

it is not on my immediate list of tasks, but if you want to help, it can be
done sooner

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:

 We've got an 11,000,000-document index.  Most documents have a unique ID
 called flrid, plus a different ID called solrid that is Solr's PK.  For
 some searches, we need to be able to limit the searches to a subset of
 documents defined by a list of FLRID values.  The list of FLRID values can
 change between every search and it will be rare enough to call it never
 that any two searches will have the same set of FLRIDs to limit on.

 What we're doing right now is, roughly:

 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))

 Each of those FQs parentheticals can be 1,000 FLRIDs strung together.  We
 have to subgroup to get past Solr's limitations on the number of terms that
 can be ORed together.

 The problem with this approach (besides that it's clunky) is that it seems
 to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms
 or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
 FLRIDs, that jumps up to about 75000ms.  We want it be on the order of
 1000-2000ms at most in all cases up to 100,000 FLRIDs.

 How can we do this better?

 Things we've tried or considered:

 * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
 improvement.
 * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
 * Considered: dumping all the FLRIDs for a given search into another core
 and doing a join between it and the main core, but if we do five or ten
 searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse possible.
 * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
 instead, so that Solr doesn't have to hit the documents in order to
 translate FLRID-SolrID to do the matching.

 What we're hoping for:

 * An efficient way to pass a long set of IDs, or for Solr to be able to
 pull them from the app's Oracle database.
 * Have Solr do big ORs as a set operation not as (what we assume is) a
 naive one-at-a-time matching.
 * A way to create a match vector that gets passed to the query, because
 strings of fqs in the query seems to be a suboptimal way to do it.

 I've searched SO and the web and found people asking about this type of
 situation a few times, but no answers that I see beyond what we're doing
 now.

 *
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
 *
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
 *
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
 *
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

 Thanks,
 Andy

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance




Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Walter Underwood
First, terms used to subset the index should be a filter query, not part of the 
main query. That may help, because the filter query terms are not used for 
relevance scoring.

Have you done any system profiling? Where is the bottleneck: CPU or disk? There 
is no point in optimising things before you know the bottleneck.

Also, your latency goals may be impossible. Assume roughly one disk access per 
term in the query. You are not going to be able to do 100,000 random access 
disk IOs in 2 seconds, let alone process the results.

wunder

On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

 hi Andy,
 
 It seems like a common type of operation and I would be also curious what
 others think. My take on this is to create a compressed intbitset and send
 it as a query filter, then have the handler decompress/deserialize it, and
 use it as a filter query. We have already done experiments with intbitsets
 and it is fast to send/receive
 
 look at page 20
 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
 
 it is not on my immediate list of tasks, but if you want to help, it can be
 done sooner
 
 roman
 
 On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
 
 We've got an 11,000,000-document index.  Most documents have a unique ID
 called flrid, plus a different ID called solrid that is Solr's PK.  For
 some searches, we need to be able to limit the searches to a subset of
 documents defined by a list of FLRID values.  The list of FLRID values can
 change between every search and it will be rare enough to call it never
 that any two searches will have the same set of FLRIDs to limit on.
 
 What we're doing right now is, roughly:
 
q=title:dogs AND
(flrid:(123 125 139  34823) OR
 flrid:(34837 ... 59091) OR
 ... OR
 flrid:(101294813 ... 103049934))
 
 Each of those FQs parentheticals can be 1,000 FLRIDs strung together.  We
 have to subgroup to get past Solr's limitations on the number of terms that
 can be ORed together.
 
 The problem with this approach (besides that it's clunky) is that it seems
 to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms
 or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
 FLRIDs, that jumps up to about 75000ms.  We want it be on the order of
 1000-2000ms at most in all cases up to 100,000 FLRIDs.
 
 How can we do this better?
 
 Things we've tried or considered:
 
 * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
 improvement.
 * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
 * Considered: dumping all the FLRIDs for a given search into another core
 and doing a join between it and the main core, but if we do five or ten
 searches per second, it seems like Solr would die from all the commits.
 The set of FLRIDs is unique between searches so there is no reuse possible.
 * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
 instead, so that Solr doesn't have to hit the documents in order to
 translate FLRID-SolrID to do the matching.
 
 What we're hoping for:
 
 * An efficient way to pass a long set of IDs, or for Solr to be able to
 pull them from the app's Oracle database.
 * Have Solr do big ORs as a set operation not as (what we assume is) a
 naive one-at-a-time matching.
 * A way to create a match vector that gets passed to the query, because
 strings of fqs in the query seems to be a suboptimal way to do it.
 
 I've searched SO and the web and found people asking about this type of
 situation a few times, but no answers that I see beyond what we're doing
 now.
 
 *
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
 *
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
 *
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
 *
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
 
 Thanks,
 Andy
 
 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
 
 






Re: InvalidShapeException when using SpatialRecursivePrefixTreeFieldType with custom worldBounds

2013-03-08 Thread David Smiley (@MITRE.org)
Hi Jon.

If you're able to trigger an IndexOutOfBoundsException out of the prefix
tree then please file a bug (to the Lucene project, not Solr).  I'll look
into it when I have time.
I need to add a Wiki page on the use of spatial for time ranges; there are
some tricks to it.  Nevertheless you've demonstrated a bug.

~ David



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/InvalidShapeException-when-using-SpatialRecursivePrefixTreeFieldType-with-custom-worldBounds-tp4045351p4045864.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.1 UI fail to display result

2013-03-08 Thread Stefan Matheis
I know, it's a bit late on this thread, but for the record - filed and already 
fixed: https://issues.apache.org/jira/browse/SOLR-4349 



On Saturday, February 2, 2013 at 6:35 PM, J Mohamed Zahoor wrote:

 It works In chrome though...
 
 ./Zahoor@iPhone
 
 On 02-Feb-2013, at 4:34 PM, J Mohamed Zahoor zah...@indix.com 
 (mailto:zah...@indix.com) wrote:
 
   
   I'm not sure why .. but this sounds like the JSON Parser was called with 
   an HTML- or XML-String? After you hit the Execute Button on the 
   Website, on the top of the right content-area, there is a link - which is 
   what the UI will request .. if you open that in another browser-tab or 
   with curl/wget .. what is the response you get? Is that really JSON? Or 
   perhaps some kind of Error Message?
  
  The link itself does not seem to be okay. It shows only this for q=*:*
  
  http://localhost:8983/solr/collection1/select?
  
  But if i add a wt=json in another tab.. i get a json response.
  
  ./zahoor 




Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
I think we are speaking of a use case where the user wants to limit the search to
a collection of documents but there is no unifying (easy) way to select
those papers - besides a long query: id:1 OR id:5 OR id:90...

And no, a latency of several hundred milliseconds is perfectly achievable
with several hundred thousand ids; you should explore the link...

roman



On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood wun...@wunderwood.orgwrote:

 First, terms used to subset the index should be a filter query, not part
 of the main query. That may help, because the filter query terms are not
 used for relevance scoring.

 Have you done any system profiling? Where is the bottleneck: CPU or disk?
 There is no point in optimising things before you know the bottleneck.

 Also, your latency goals may be impossible. Assume roughly one disk access
 per term in the query. You are not going to be able to do 100,000 random
 access disk IOs in 2 seconds, let alone process the results.

 wunder

 On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

  hi Andy,
 
  It seems like a common type of operation and I would be also curious what
  others think. My take on this is to create a compressed intbitset and
 send
  it as a query filter, then have the handler decompress/deserialize it,
 and
  use it as a filter query. We have already done experiments with
 intbitsets
  and it is fast to send/receive
 
  look at page 20
 
 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
 
  it is not on my immediate list of tasks, but if you want to help, it can
 be
  done sooner
 
  roman
 
  On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
 
  We've got an 11,000,000-document index.  Most documents have a unique ID
  called flrid, plus a different ID called solrid that is Solr's PK.
  For
  some searches, we need to be able to limit the searches to a subset of
  documents defined by a list of FLRID values.  The list of FLRID values
 can
  change between every search and it will be rare enough to call it
 never
  that any two searches will have the same set of FLRIDs to limit on.
 
  What we're doing right now is, roughly:
 
 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))
 
  Each of those FQs parentheticals can be 1,000 FLRIDs strung together.
  We
  have to subgroup to get past Solr's limitations on the number of terms
 that
  can be ORed together.
 
  The problem with this approach (besides that it's clunky) is that it
 seems
  to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in
 50ms
  or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With
 100,000
  FLRIDs, that jumps up to about 75000ms.  We want it be on the order of
  1000-2000ms at most in all cases up to 100,000 FLRIDs.
 
  How can we do this better?
 
  Things we've tried or considered:
 
  * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.
  No
  improvement.
  * Tried: Putting the FLRIDs into the fq instead of the q.  No
 improvement.
  * Considered: dumping all the FLRIDs for a given search into another
 core
  and doing a join between it and the main core, but if we do five or ten
  searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse
 possible.
  * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
  instead, so that Solr doesn't have to hit the documents in order to
  translate FLRID-SolrID to do the matching.
 
  What we're hoping for:
 
  * An efficient way to pass a long set of IDs, or for Solr to be able to
  pull them from the app's Oracle database.
  * Have Solr do big ORs as a set operation not as (what we assume is) a
  naive one-at-a-time matching.
  * A way to create a match vector that gets passed to the query, because
  strings of fqs in the query seems to be a suboptimal way to do it.
 
  I've searched SO and the web and found people asking about this type of
  situation a few times, but no answers that I see beyond what we're doing
  now.
 
  *
 
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
  *
 
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
  *
 
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
  *
 
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
 
  Thanks,
  Andy
 
  --
  Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
 
 







Re: How to add shard in 4.2-snapshot

2013-03-08 Thread Mark Miller

On Mar 8, 2013, at 12:23 AM, Jam Luo cooljam2...@gmail.com wrote:

 Hi
 I use the 4.2-snapshot version, git sha id is
 f4502778b263849a827e89e45d37b33861f225f9 . I deployed a cluster with SolrCloud;
 there are 3 nodes, one core per node, each in a different shard. The JVM
 argument is -DnumShards=3.
 Now I must add a machine and add a shard, but when I start a new Solr instance and
 change the argument to numShards=4, the shard count does not change.
 
 How do I add a shard in this version? In 4.0, when I increased numShards, the new
 Solr instance would join a new shard, but now that no longer works.

Unless you control sharding yourself, you cannot currently add shards without 
some reindexing. Shard splitting is a feature that is coming soon though. You 
can easily add replicas currently, but not shards.

You can control shards yourself by not using numShards.
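
For instance, something along these lines (host, core, collection and shard
names are placeholders): with no numShards set, cores can be created and
assigned to explicitly named shards through the CoreAdmin API:

curl "http://newhost:8983/solr/admin/cores?action=CREATE&name=mycore_shard4&collection=mycollection&shard=shard4"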

- Mark

Re: Dynamic schema design: feedback requested

2013-03-08 Thread Steve Rowe
Hi Jan,

On Mar 6, 2013, at 4:50 PM, Jan Høydahl jan@cominvent.com wrote:
 Will ZK get pushed the serialized monolithic schema.xml / schema.json from 
 the node which changed it, and then trigger an update to the rest of the 
 cluster?

Yes.

 I was kind of hoping that once we have introduced ZK into the mix as our 
 centralized config server, we could start using it as such consistently. And 
 so instead of ZK storing a plain xml file, we split up the schema as native 
 ZK nodes […]

Erik Hatcher made the same suggestion on SOLR-3251: 
https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13571713page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13571713

My response on the issue: 
https://issues.apache.org/jira/browse/SOLR-3251?focusedCommentId=13572774page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13572774

In short, I'm not sure it's a good idea, and in any event, I don't want to 
implement this as part of the initial implementation - it could be added on 
later.

 multiple collections may share the same config set and thus schema, so what 
 happens if someone does not know this and hits PUT 
 localhost:8983/solr/collection1/schema and it affects also the schema for 
 collection2?

Hmm, that's a great question.  Querying against a named config rather than a 
collection/core would not be an improvement, though, since the relationship 
between the two wouldn't be represented in the request.

Maybe if there were requests that returned the collections using a particular 
named config, and vice versa, people could at least discover problematic 
dependencies before they send schema modification requests?  Or maybe such 
requests already exist?

Steve

Re: SolrCloud: port out of range:-1

2013-03-08 Thread Shawn Heisey

On 3/8/2013 7:37 AM, roySolr wrote:

java -Djetty.port=4110 -DzkRun=10.100.10.101:5110
-DzkHost=10.100.10.101:5110,10.100.10.102:5120 -Dbootstrap_conf=true
-DnumShards=1 -Xmx1024M -Xms512M -jar start.jar

It runs Solr on port 4110, the embedded zk on 5110.

The -DzkHost gives the urls of the localhost zk(5110) and the url of the
other server(zk port). When i try to start this it give the error: port out
of range:-1.


The full log line, ideally with several lines above and below for 
context, is going to be crucial for figuring this out.  Also, the 
contents of your solr.xml file may be important.


Thanks,
Shawn



Re: Dynamic schema design: feedback requested

2013-03-08 Thread Steve Rowe
On Mar 6, 2013, at 7:50 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
 I think it would make a lot of sense -- not just in terms of 
 implementation but also for end user clarity -- to have some simple, 
 straightforward to understand caveats about maintaining schema 
 information...
 
 1) If you want to keep schema information in an authoritative config file 
 that you can manually edit, then the /schema REST API will be read only. 
 
 2) If you wish to use the /schema REST API for read and write operations, 
 then schema information will be persisted under the covers in a data store 
 whose format is an implementation detail just like the index file format.
 
 3) If you are using a schema config file and you wish to switch to using 
 the /schema REST API for managing schema information, there is a 
 tool/command/API you can run to do so. 
 
 4) if you are using the /schema REST API for managing schema information, 
 and you wish to switch to using a schema config file, there is a 
 tool/command/API you can run to export the schema info in a config file 
 format.

+1

 ...whether or not the under-the-covers data store used by the REST 
 API is JSON, or some binary data, or an XML file (just schema.xml w/o 
 whitespace/comments) should be an implementation detail.  Likewise is the 
 question of whether some new config file formats are added -- it shouldn't 
 matter.
 
 If it's config it's config and the user owns it.
 If it's data it's data and the system owns it.

Calling the system-owned file 'schema.dat', rather than 'schema.json' (i.e., 
extension=format), would help to reinforce this black-box view.

Steve



Re: SolrCloud: port out of range:-1

2013-03-08 Thread Tomás Fernández Löbbe
A couple of comments about your deployment architecture too. You'll need to
change the zoo.cfg to make the Zookeeper ensemble work with two instances
as you are trying to do - have you? The example zoo.cfg is intended for a
single ZK instance, as described in the SolrCloud example. That said, a
two-instance ZK ensemble like the one you are intending to have doesn't make
much sense: if either of your Solr servers breaks (and since you are running
embedded, ZK will also stop), the whole cluster will be useless until you
start the server again.

Tomás


On Fri, Mar 8, 2013 at 12:26 PM, Shawn Heisey s...@elyograg.org wrote:

 On 3/8/2013 7:37 AM, roySolr wrote:

 java -Djetty.port=4110 -DzkRun=10.100.10.101:5110
 -DzkHost=10.100.10.101:5110,10**.100.10.102:5120http://10.100.10.102:5120-Dbootstrap_conf=true
 -DnumShards=1 -Xmx1024M -Xms512M -jar start.jar

 It runs Solr on port 4110, the embedded zk on 5110.

 The -DzkHost gives the urls of the localhost zk(5110) and the url of the
 other server(zk port). When i try to start this it give the error: port
 out
 of range:-1.


 The full log line, ideally with several lines above and below for context,
 is going to be crucial for figuring this out.  Also, the contents of your
 solr.xml file may be important.

 Thanks,
 Shawn




Re: SolrCloud: port out of range:-1

2013-03-08 Thread Walter Underwood
A two-server Zookeeper ensemble is actually less reliable than a one-server 
ensemble.

With two servers, Zookeeper stops working if either of them fails, so there is a 
higher probability that it will go down.

The minimum number for increased reliability is three servers.
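
For reference, a minimal sketch of a zoo.cfg for a three-node external ensemble
(hostnames and paths are placeholders, not from this thread):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888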

wunder

On Mar 8, 2013, at 12:33 PM, Tomás Fernández Löbbe wrote:

 A couple of comments about your deployment architecture too. You'll need to
 change the zoo.cfg to make the Zookeeper ensemble work with two instances
 as you are trying to do, have you? The example configuration with the
 zoo.cfg is intended for a single ZK instance as described in the SolrCloud
 example. That said, really a two instances ZK ensemble as the one you are
 intending to have doesn't make much sense, if ANY of your Solr servers
 break (which as you are running embedded, ZK will also stop), the whole
 cluster will be useless until you start the server again.
 
 Tomás
 
 
 On Fri, Mar 8, 2013 at 12:26 PM, Shawn Heisey s...@elyograg.org wrote:
 
 On 3/8/2013 7:37 AM, roySolr wrote:
 
 java -Djetty.port=4110 -DzkRun=10.100.10.101:5110
 -DzkHost=10.100.10.101:5110,10**.100.10.102:5120http://10.100.10.102:5120-Dbootstrap_conf=true
 -DnumShards=1 -Xmx1024M -Xms512M -jar start.jar
 
 It runs Solr on port 4110, the embedded zk on 5110.
 
  The -DzkHost gives the URL of the localhost zk (5110) and the URL of the
  other server (zk port). When I try to start this it gives the error: port
  out of range:-1.
 
 
 The full log line, ideally with several lines above and below for context,
 is going to be crucial for figuring this out.  Also, the contents of your
 solr.xml file may be important.
 
 Thanks,
 Shawn
 
 

--
Walter Underwood
wun...@wunderwood.org





Re: Dynamic schema design: feedback requested

2013-03-08 Thread Steve Rowe
On Mar 8, 2013, at 2:57 PM, Steve Rowe sar...@gmail.com wrote:
 multiple collections may share the same config set and thus schema, so what 
 happens if someone does not know this and hits PUT 
 localhost:8983/solr/collection1/schema and it affects also the schema for 
 collection2?
 
 Hmm, that's a great question.  Querying against a named config rather than a 
 collection/core would not be an improvement, though, since the relationship 
 between the two wouldn't be represented in the request.
 
 Maybe if there were requests that returned the collections using a particular 
 named config, and vice versa, people could at least discover problematic 
 dependencies before they send schema modification requests?  Or maybe such 
 requests already exist?

Also, this doesn't have to be either/or (collection/core vs. config) - we could 
have another API that's config-specific, e.g. for the fields resource:

collection-specific: http://localhost:8983/solr/collection1/schema/fields

config-specific: http://localhost:8983/solr/configs/configA/schema/fields
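
(For context, a quick sketch of how the collection-specific form would be
queried -- the hostname and collection name are the stock example's:)

  curl "http://localhost:8983/solr/collection1/schema/fields"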

Steve

update some fields vs replace the whole document

2013-03-08 Thread Mingfeng Yang
Generally speaking, which has better performance for Solr?
1. updating some fields or adding new fields to a document, or
2. replacing the whole document.

As I understand it, updating fields needs to find the corresponding doc
first and then replace the field values, while replacing the whole document
is just like adding a new document.  Is that right?


Re: update some fields vs replace the whole document

2013-03-08 Thread Upayavira
With an atomic update, you need to retrieve the stored fields in order
to build up the full document to insert back.

In either case, you'll have to locate the previous version and mark it
deleted before you can insert the new version.

I bet that the amount of time spent retrieving stored fields is matched
by the time saved by not having to transmit those fields over the wire,
although I'd be very curious to see someone actually test that.
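
(For anyone following along, a rough sketch of what an atomic update looks
like in Solr 4's XML update format -- the field names are invented, the
update log must be enabled, and in practice all fields need to be stored so
the unchanged ones can be carried over:)

  <add>
    <doc>
      <field name="id">doc-123</field>
      <!-- overwrite a single-valued field -->
      <field name="price" update="set">19.95</field>
      <!-- append to a multi-valued field -->
      <field name="tags" update="add">on_sale</field>
    </doc>
  </add>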

Upayavira

On Fri, Mar 8, 2013, at 09:51 PM, Mingfeng Yang wrote:
 Generally speaking, which has better performance for Solr?
 1. updating some fields or adding new fields into a document.
 or
 2. replacing the whole document.
 
 As I understand,  update fields need to search for the corresponding doc
 first, and then replace field values.  While replacing the whole document
 is just like adding new document.  Is it right?


Re: update some fields vs replace the whole document

2013-03-08 Thread Mingfeng Yang
Then what's the difference between adding a new document vs.
replacing/overwriting a document?

Ming-


On Fri, Mar 8, 2013 at 2:07 PM, Upayavira u...@odoko.co.uk wrote:

 With an atomic update, you need to retrieve the stored fields in order
 to build up the full document to insert back.

 In either case, you'll have to locate the previous version and mark it
 deleted before you can insert the new version.

 I bet that the amount of time spent retrieving stored fields is matched
 by the time saved by not having to transmit those fields over the wire,
 although I'd be very curious to see someone actually test that.

 Upayavira

 On Fri, Mar 8, 2013, at 09:51 PM, Mingfeng Yang wrote:
  Generally speaking, which has better performance for Solr?
  1. updating some fields or adding new fields into a document.
  or
  2. replacing the whole document.
 
  As I understand,  update fields need to search for the corresponding doc
  first, and then replace field values.  While replacing the whole document
  is just like adding new document.  Is it right?



Re: High QTime when wildcards in hl.fl are used

2013-03-08 Thread Karol Sikora
I've found more interesting information about using 
fastVectorHighlighting combined with wildcarded highlight fields, 
after testing on an isolated group of documents with text content.

fvh + fulltext_*: QTime ~4s (!)
fvh + fulltext_1234: QTime ~50ms
no fvh + fulltext_*: QTime ~600ms
no fvh + fulltext_1234: QTime ~500ms.

As we can see, very long query times are associated with using fvh 
combined with a wildcarded hl.fl.
In the source code I found that, when wildcards are used, the fields to 
highlight are computed by matching a regex in a loop over the fields the 
query returns for the document, so in this case, where only one field 
matches the given pattern, there should be no difference between using 
wildcards and not.
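
(One possible stop-gap suggested by the numbers above: resolve the matching
field names on the client side and pass them explicitly instead of the
wildcard -- fulltext_5678 below is just a placeholder for any other
per-document field:)

  slow:  ...&hl=true&hl.useFastVectorHighlighter=true&hl.fl=fulltext_*
  fast:  ...&hl=true&hl.useFastVectorHighlighter=true&hl.fl=fulltext_1234,fulltext_5678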


Any ideas?


On 08.03.2013 at 13:49, Karol Sikora wrote:

Hi all,

I'm currently stumbling over the following case:
I have indexed documents with fields named like fulltext_[some id].
I'm testing highlighting on a document which has only one such field, 
fulltext_1234.
When 'fulltext_*' is provided as hl.fl, QTime is horribly big (> 10s); 
when the explicit 'fulltext_1234' is provided, QTime is acceptable (~30ms).
I've found that using wildcards in hl.fl can increase QTime (see 
http://stackoverflow.com/questions/11774508/optimize-solr-highlighter), but 
it definitely should not cost this much.


I'm using the fastVectorHighlighter in both cases.
Any ideas why using wildcards causes such big QTimes? Maybe there is a 
workaround?

--
  
Karol Sikora

+48 781 493 788

Laboratorium EE
ul. Mokotowska 46A/23 | 00-543 Warszawa |
www.laboratorium.ee  |www.laboratorium.ee/facebook


--
Karol Sikora
+48 781 493 788

Laboratorium EE
ul. Mokotowska 46A/23 | 00-543 Warszawa |
www.laboratorium.ee | www.laboratorium.ee/facebook



Multiple Collections in one Zookeeper

2013-03-08 Thread jimtronic
Hi, 

I have a solrcloud cluster running several cores and pointing at one
zookeeper.

For performance reasons, I'd like to move one of the cores onto its own
dedicated cluster of servers. Can I use the same zookeeper to keep track of
both clusters?

Thanks!
Jim



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Collections-in-one-Zookeeper-tp4045936.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple Collections in one Zookeeper

2013-03-08 Thread Michael Della Bitta
Yes, but you'll need to append a subpath onto the zookeeper path for your
second cluster. For example:

zookeeper1.example.com,zookeeper2.example.com,zookeeper3.example.com/subpath
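
(A rough sketch of how the second cluster's nodes might then be started --
the client port 2181 and the chroot name are placeholders, the chroot node
may need to be created in ZooKeeper first, and this assumes your Solr
version accepts a chroot in zkHost:)

  java -Djetty.port=8983 \
       -DzkHost=zookeeper1.example.com:2181,zookeeper2.example.com:2181,zookeeper3.example.com:2181/cluster2 \
       -jar start.jar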
On Mar 8, 2013 6:46 PM, jimtronic jimtro...@gmail.com wrote:

 Hi,

 I have a solrcloud cluster running several cores and pointing at one
 zookeeper.

 For performance reasons, I'd like to move one of the cores on to it's own
 dedicated cluster of servers. Can I use the same zookeeper to keep track of
 both clusters.

 Thanks!
 Jim



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multiple-Collections-in-one-Zookeeper-tp4045936.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4

2013-03-08 Thread Parks, Harley
Yes. Success.
 
I was able to successfully migrate solr 3.4 w/ solr-2155 solrconfig.xml
and schema.xml; but I had to rebuild the database (solr index data
folder).  

<fieldType name="geohash_rpt"
    class="solr.SpatialRecursivePrefixTreeFieldType"
    geo="true" distErrPct="0" maxLevels="12" units="degrees"
    prefixTree="geohash"/>

Everything seems to be working.
I need to try and see if I can convert the old 3.4 database... but when
we upgrade, we always rebuild our solr index. 




-Original Message-
From: David Smiley (@MITRE.org) [mailto:dsmi...@mitre.org] 
Sent: Friday, March 08, 2013 6:56 AM
To: solr-user@lucene.apache.org
Subject: RE: Migrate Solr 3.4 w/ solr-1255 GeoHash to Solr 4

You're supposed to add geo point data in "latitude, longitude" format,
although some other variations work.  Is your updating process supplying
a geohash instead?  If so you could write a simple Solr
UpdateRequestProcessor to convert it to the expected format.  But that
doesn't help the fact that apparently your index already has a geohash
for the stored field value.
Are all your fields either stored, or unstored but copied from a
stored field?  It would then be an option to dump all data via CSV (take
care with multi-valued fields) and then load it into an empty instance.
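
(To make the UpdateRequestProcessor idea concrete, a minimal sketch -- the
field name, class names and the use of spatial4j's GeohashUtils are
assumptions, not anything confirmed in this thread:)

  import java.io.IOException;

  import com.spatial4j.core.context.SpatialContext;
  import com.spatial4j.core.io.GeohashUtils;
  import com.spatial4j.core.shape.Point;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  /** Hypothetical URP that rewrites a geohash string into "lat,lon" before indexing. */
  public class GeohashToLatLonProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
        SolrQueryResponse rsp, UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object val = doc.getFieldValue("geohash_rpt");   // field name is an assumption
          if (val instanceof String) {
            // spatial4j ships with Solr 4.x; decode() returns a Point (x=lon, y=lat)
            Point p = GeohashUtils.decode((String) val, SpatialContext.GEO);
            doc.setField("geohash_rpt", p.getY() + "," + p.getX());
          }
          super.processAdd(cmd);
        }
      };
    }
  }

(The factory would then be wired into an updateRequestProcessorChain in
solrconfig.xml and referenced from the update handler.)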

You could optimize your index, which upgrades as a side-effect.
FYI there's a Lucene IndexUpgrader you can use at the command line:
http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/index/IndexUpgrader.html
But again, if your stored field values are geohashes when you want a
lat,lon then this isn't going to fix that.
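
(Roughly, the command-line form looks like this -- the jar name/version and
the index path are placeholders:)

  java -cp lucene-core-4.1.0.jar org.apache.lucene.index.IndexUpgrader -verbose /path/to/solr/data/index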

~ David


Harley wrote
 David Smiley:
 Because we use 3rd party software... I checked to see if this would 
 still work... the search query still works. But adding data seems to be 
 broken, likely because of the geohash type.
 
 So, below is the log file,  which tells me to upgrade
 
 If possible, it would be great to simply get the old 3.4 index
working.
 What should my workflow be to get this working as is, then to upgrade?
 
 I'm expecting to delete the data folder, then rebuild the index via 
 3rd party software adding data to Solr 4... is it' possible to reindex

 the existing data folder?





-
 Author:
http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/Migrate-Solr-3-4-w-solr-1255-GeoHash-
to-Solr-4-tp4045416p4045836.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: update some fields vs replace the whole document

2013-03-08 Thread Jack Krupansky
Generally it will be more a matter of application semantics. Solr makes it 
reasonably efficient to completely overwrite the existing document and 
fields, if that is what you want. But, in some applications, it may be 
desirable to preserve some or most of the existing fields; whether that is 
easier to accomplish be completely regenerating the full document from data 
stored elsewhere in the application (e.g., a RDBMS) or doing a selective 
write will depend on the application. In some apps, the rest of the data may 
not be maintained separately, so a selective write makes more sense. Or, 
maybe the existing document contains metadata fields such as timestamps or 
counters that would get reset if the whole document was regenerated.
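
(As a concrete illustration of the counter case, an atomic update can bump
such a field in place while the rest of the stored fields are preserved --
field names here are invented:)

  <add>
    <doc>
      <field name="id">doc-123</field>
      <field name="view_count" update="inc">1</field>
    </doc>
  </add>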


-- Jack Krupansky

-Original Message- 
From: Mingfeng Yang

Sent: Friday, March 08, 2013 5:41 PM
To: solr-user@lucene.apache.org
Subject: Re: update some fields vs replace the whole document

Then what's the difference between adding a new document vs.
replacing/overwriting a document?

Ming-


On Fri, Mar 8, 2013 at 2:07 PM, Upayavira u...@odoko.co.uk wrote:


With an atomic update, you need to retrieve the stored fields in order
to build up the full document to insert back.

In either case, you'll have to locate the previous version and mark it
deleted before you can insert the new version.

I bet that the amount of time spent retrieving stored fields is matched
by the time saved by not having to transmit those fields over the wire,
although I'd be very curious to see someone actually test that.

Upayavira

On Fri, Mar 8, 2013, at 09:51 PM, Mingfeng Yang wrote:
 Generally speaking, which has better performance for Solr?
 1. updating some fields or adding new fields into a document.
 or
 2. replacing the whole document.

 As I understand,  update fields need to search for the corresponding doc
 first, and then replace field values.  While replacing the whole 
 document

 is just like adding new document.  Is it right?





Re: Search a folder with File name and retrieve all the files matched

2013-03-08 Thread Jan Høydahl
Since this is a POC you could simply run this command with the default example 
schema:

cd solr/example/exampledocs
java -Dauto -Drecursive=0 -jar post.jar path/to/folder

You will get the full file name with path in field resourcename
If you need to search just the filename, you can achieve that by adding a 
new field filename with a copyField resourcename->filename and a custom 
fieldType for filename with a PatternReplaceFilterFactory to remove the path.
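
(A rough schema.xml sketch of that suggestion -- the type name and the exact
regex are just one way to do it:)

  <field name="filename" type="filename_only" indexed="true" stored="true"/>
  <copyField source="resourcename" dest="filename"/>

  <fieldType name="filename_only" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- drop everything up to and including the last / or \ -->
      <filter class="solr.PatternReplaceFilterFactory"
              pattern=".*[/\\]" replacement="" replace="all"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>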

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 7 March 2013 at 22:11, Alexandre Rafalovitch arafa...@gmail.com wrote:

 You could use DataImportHandler with FileListEntityProcessor to get the
 file names in:
 http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
 
 Then, if it is recursive enumeration and not just one level, you probably
 want a tokenizer that splits on path separator characters (e.g. /). Or
 maybe you want to index filename as a separate field from full path (can do
 it in FileListEntityProcessor itself).
 
 And if you combined the list of files with inner entity using Tika, you can
 load the file content for searching as well:
 http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
 
 Regards,
   Alex.
 
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
 On Thu, Mar 7, 2013 at 3:39 PM, pavangolla pavango...@gmail.com wrote:
 
 HI,
 I am new to apache solr,
 
 I am doing a poc, where there is a folder (in sys or some repository) which
 has different files with diff extensions pdf, doc, xls..,
 
 I want to search with a file name and retrieve all the files with the name
 matching
 
 How do i proceed on this.
 
 Please help me on this.
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Search-a-folder-with-File-name-and-retrieve-all-the-files-matched-tp4045629.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: Search a folder with File name and retrieve all the files matched

2013-03-08 Thread Erik Hatcher
Thanks, Jan, for making the post tool do this type of thing.  Great stuff.

The filename would be a good one to add for out-of-the-box goodness.  We can 
easily add just the filename to the index with something like the patch below.  
And on that note, what else would folks want in an easy-to-use document search 
system like this?

Erik

Index: core/src/java/org/apache/solr/util/SimplePostTool.java
===================================================================
--- core/src/java/org/apache/solr/util/SimplePostTool.java  (revision 1450270)
+++ core/src/java/org/apache/solr/util/SimplePostTool.java  (working copy)
@@ -749,6 +749,7 @@
           urlStr = appendParam(urlStr, "resource.name=" + URLEncoder.encode(file.getAbsolutePath(), "UTF-8"));
         if(urlStr.indexOf("literal.id")==-1)
           urlStr = appendParam(urlStr, "literal.id=" + URLEncoder.encode(file.getAbsolutePath(), "UTF-8"));
+          urlStr = appendParam(urlStr, "literal.filename_s=" + URLEncoder.encode(file.getName(), "UTF-8"));
         url = new URL(urlStr);
       }
     } else {



On Mar 8, 2013, at 19:16 , Jan Høydahl wrote:

 Since this is a POC you could simply run this command with the default 
 example schema:
 
 cd solr/example/exampledocs
 java -Dauto -Drecursive=0 -jar post.jar path/to/folder
 
 You will get the full file name with path in field resourcename
 If you need to search just the filename, you can achieve that by adding a 
 new field filename with a copyField resourcename->filename and a custom 
 fieldType for filename with a PatternReplaceFilterFactory to remove the path.
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com
 
 On 7 March 2013 at 22:11, Alexandre Rafalovitch arafa...@gmail.com wrote:
 
 You could use DataImportHandler with FileListEntityProcessor to get the
 file names in:
 http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
 
 Then, if it is recursive enumeration and not just one level, you probably
 want a tokenizer that splits on path separator characters (e.g. /). Or
 maybe you want to index filename as a separate field from full path (can do
 it in FileListEntityProcessor itself).
 
 And if you combined the list of files with inner entity using Tika, you can
 load the file content for searching as well:
 http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
 
 Regards,
  Alex.
 
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
 On Thu, Mar 7, 2013 at 3:39 PM, pavangolla pavango...@gmail.com wrote:
 
 HI,
 I am new to apache solr,
 
 I am doing a poc, where there is a folder (in sys or some repository) which
 has different files with diff extensions pdf, doc, xls..,
 
 I want to search with a file name and retrieve all the files with the name
 matching
 
 How do i proceed on this.
 
 Please help me on this.
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Search-a-folder-with-File-name-and-retrieve-all-the-files-matched-tp4045629.html
 Sent from the Solr - User mailing list archive at Nabble.com.