Re: Basic auth on SolrCloud /admin/* calls

2013-03-29 Thread Isaac Hebsh
Hi Tim,
Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't
return any failure message; see the SOLR-4043 issue.)

As far as I know, you can't tell Solr to use authentication credentials
when communicating with other nodes. It's a bigger issue... For example, if you
want to protect the /update requestHandler so that unauthorized users can't
delete your whole collection, it can interfere with the replication process.

I think it's a necessary mechanism in a production environment... I'm curious
how people use SolrCloud in production without it.
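For what it's worth, the only thing the internal node-to-node hops are missing is an Authorization header on their HTTP requests. As a plain-Java illustration of what that header amounts to (class name and credentials here are made up; java.util.Base64 needs Java 8):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {
    // Build the value of the HTTP Authorization header for Basic auth:
    // "Basic " + base64("user:password"). Any client that controls its own
    // HTTP requests can add this header; Solr 4.x's internal requests don't.
    static String basicAuth(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        System.out.println(basicAuth("user", "pass")); // Basic dXNlcjpwYXNz
    }
}
```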





On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim tvaillanco...@ea.com wrote:

 Hey guys,

 I've recently setup basic auth under Jetty 8 for all my Solr 4.x
 '/admin/*' calls, in order to protect my Collections and Cores API.

 Although the security constraint is working as expected ('/admin/*' calls
 require Basic Auth or return 401), when I use the Collections API to create
 a collection, I receive a 200 OK for the Collections API CREATE call, but
 the background Cores API calls that are run on the Collections API's behalf
 fail Basic Auth on the other nodes with a 401 code, as I should have
 foreseen, but didn't.

 Is there a way to tell SolrCloud to use authentication on the internal Cores
 API calls that are spawned on the Collections API's behalf, or is this a new
 feature request?

 To reproduce:

 1.   Implement basic auth on '/admin/*' URIs.

 2.   Perform a CREATE Collections API call to a node (which will
 return 200 OK).

 3.   Notice all Cores API calls fail (Collection isn't created). See
 stack trace below from the node that was issued the CREATE call.

 The stack trace I get is:

 org.apache.solr.common.SolrException: Server at http://<HOST HERE>:8983/solr
 returned non ok status:401, message:Unauthorized
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:169)
 at
 org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)

 Cheers!

 Tim





Combining Solr Indexes at SolrCloud

2013-03-29 Thread Furkan KAMACI
Let's assume that I have two machines in a SolrCloud. If I want to shut down
one of them and combine its indexes into the other, how can I do that?


SOAP for Solr indexing mechanism

2013-03-29 Thread Furkan KAMACI
Is there any support for communication over SOAP for Solr's indexing
mechanism?


Parallel Indexing With Solr?

2013-03-29 Thread Furkan KAMACI
Does Solr allow parallelism (parallel computing) for indexing?


Re: Parallel Indexing With Solr?

2013-03-29 Thread Gora Mohanty
On 29 March 2013 14:56, Furkan KAMACI furkankam...@gmail.com wrote:
 Does Solr allow parallelism (parallel computing) for indexing?

What do you mean by parallel computing in this context?

Solr can use multiple threads for indexing if that is what
you are asking.
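To sketch what client-side multi-threaded indexing looks like, here is a minimal batching skeleton; the class and the indexBatch stand-in are hypothetical, and a real client would call something like SolrServer.add on each batch instead of counting:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexer {
    static final AtomicInteger indexed = new AtomicInteger();

    // Stand-in for e.g. SolrServer.add(batch): just count documents.
    static void indexBatch(List<String> batch) {
        indexed.addAndGet(batch.size());
    }

    public static int run(int docs, int batchSize, int threads) throws Exception {
        indexed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < docs; i++) {
            batch.add("doc-" + i);
            if (batch.size() == batchSize) {
                final List<String> toSend = batch;    // hand the full batch to a worker
                pool.submit(() -> indexBatch(toSend));
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) indexBatch(batch);      // flush the remainder
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(1050, 100, 4)); // 1050
    }
}
```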

Regards,
Gora


Re: solrj sample code for solrcloud

2013-03-29 Thread Erick Erickson
Here's some indexing code, should get you started...

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

It's against 3.x as I remember, so there might be a bit of updating to do.

Best
Erick

On Thu, Mar 28, 2013 at 2:49 AM, Jeong-dae Ha sa2ntjul...@gmail.com wrote:
 Does anyone have solrj indexing and searching sample code?
 I could not find it on the internet.

 Thanks.


Need Help in Patching OPENNLP

2013-03-29 Thread karthicrnair
Hi All, 

I am very new to Solr and Java technology. I wonder if someone can
show me a way to patch the OpenNLP support into Solr.

I am simply stuck at the initial step, applying the patch to Solr 4.2. Any
pointer would be highly appreciated.

Thanks,
Karthic 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Need-Help-in-Patching-OPENNLP-tp4052362.html
Sent from the Solr - User mailing list archive at Nabble.com.


Realtime updates solrcloud

2013-03-29 Thread roySolr
Hello Guys,

I want to use the realtime update mechanism of SolrCloud. My setup is as
follows:

3 solr engines,
3 zookeeper instances(ensemble)

The setup works great, recovery, leader election etc.

The problem is the realtime updates: they become slow after the servers get
some traffic.

I'll try to explain. I test the realtime update with the following command:

curl http://SOLRURL:SOLRPORT/solr/update -H "Content-Type: text/xml" \
  --data-binary '<add><doc><field name="id">3504811</field><field name="website">http://www.google.nl</field></doc></add>'

I see this in logs of solr server:

Mar 29, 2013 12:38:51 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [collection1] webapp=/solr path=/update params={} {add=[3504811
(1430841858290876416)]} 0 35

The other solr servers get the following lines in the log:

INFO: [collection1] webapp=/solr path=/update
params={distrib.from=http://SOLRIP:SOLRPORT/solr/collection1/update&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[3504811 (1430844456234385408)]} 0 14

This looks good: the doc is added and the leader sends this doc to the other
Solr servers.

At first it takes about 1 second for the update to become visible :)

When I send some traffic to the server (200 q/s), the update takes about 30
seconds to become visible. Even after I stop the traffic, it still takes about
30 seconds for an update to become visible. How is this possible? The
solrconfig parts:

<autoCommit>
  <maxTime>60</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>2000</maxTime>
</autoSoftCommit>

Did I miss something?

Best Regards,
Roy



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Realtime-updates-solrcloud-tp4052370.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Too many fields to Sort in Solr

2013-03-29 Thread adityab
Hi Joel, 
I might have an answer for this. Initially my servers were on 3.5 and then I
moved to Solr 4.0. At that time I used the solrconfig.xml that was in the
example and updated it with the parameters I had changed in 3.5 for the
environment. There was no <codecFactory class="solr.SchemaCodecFactory"/> in
the 4.0 example solrconfig.xml file. We continued to use the same file and
updated the war to 4.1 and then 4.2 just by changing the luceneMatchVersion
in the existing solrconfig.xml file.

I was looking at the 4.2 example and comparing it with the one we have, and I
see that the codecFactory is in the example solrconfig.xml file:


  <codecFactory class="solr.SchemaCodecFactory"/>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Too-many-fields-to-Sort-in-Solr-tp4049139p4052374.html
Sent from the Solr - User mailing list archive at Nabble.com.


Suggestions for Customizing Solr Admin Page

2013-03-29 Thread Furkan KAMACI
I want to customize the Solr Admin Page. I think I will need more
complicated things to manage my cloud. I will separate my Solr cluster into
nodes that only index and nodes that only serve queries. I will index my
documents by category, into different collections.

In my admin page I will combine those collections and split a collection
into new ones; I will add, remove, and query documents, etc.

Here is an old topic about the Solr admin page:
http://lucene.472066.n3.nabble.com/Extending-Solr-s-Admin-functionality-td473974.html

My needs may change, and some of them could be served by the existing Solr
admin page. What do you suggest: extending the existing admin page, or
wrapping a new one around SolrJ? Which considerations should I keep in mind,
and how can I decide between them?


Re: Combining Solr Indexes at SolrCloud

2013-03-29 Thread Isaac Hebsh
Let's say you have machine A and machine B, and you want to shut down B.
If all the shards on B have replicas (on A), you can shut down B instantly.
If there is a shard on B that has no replica, you should create one on
machine A (using the Core API), let it replicate the whole shard's contents,
and then you are safe to shut down B.

[Changing the shard count of an existing collection is not possible for
now, so MERGEing cores is not relevant.]
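As a sketch of that Core API step (host, core, and collection names here are made up; the parameter names follow the Solr 4.x CoreAdmin CREATE action), the call is a single URL, built here as a plain string:

```java
public class CoreAdminUrl {
    // Build a Solr 4.x CoreAdmin CREATE url. Parameters: action, name
    // (the new core), collection, and shard (the shard it should replicate).
    static String createReplicaUrl(String host, String core,
                                   String collection, String shard) {
        return "http://" + host + "/solr/admin/cores?action=CREATE"
             + "&name=" + core
             + "&collection=" + collection
             + "&shard=" + shard;
    }

    public static void main(String[] args) {
        // Hypothetical names: create a replica of coll1/shard2 on machine A.
        System.out.println(createReplicaUrl("machineA:8983",
                "coll1_shard2_replica2", "coll1", "shard2"));
    }
}
```

Once the new core has caught up (its replication has finished), machine B can be stopped.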


On Fri, Mar 29, 2013 at 11:23 AM, Furkan KAMACI furkankam...@gmail.com wrote:

 Let's assume that I have two machines in a SolrCloud. If I want to shut down
 one of them and combine its indexes into the other, how can I do that?



Solr fuzzy search with WordDemiliterFilter

2013-03-29 Thread ilay raja
Hi

  I need to apply fuzzy search in production. It improves the search
results for spelling issues. However, it does not apply the analyzer
filters configured in schema.xml.
I know fuzzy and wildcard search won't apply the filters. But is there a way
to plug in the filters, or to write this logic at the client? I am not
getting any results for queries with numbers and special symbols (-). The
configuration in schema.xml:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>


How can I make sure that the filters applied at index time are also applied
to fuzzy search at query time, given that the configured filters are not
being applied?

Please help.


Re: SOAP for Solr indexing mechanism

2013-03-29 Thread Otis Gospodnetic
Nope.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Fri, Mar 29, 2013 at 4:54 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 Is there any support for communication over SOAP for Solr indexing
 mechanism?


Re: Solr fuzzy search with WordDemiliterFilter

2013-03-29 Thread Jack Krupansky
The use of the fuzzy query operator will suppress the Word Delimiter Filter 
at query time. That's just the way it works. You can't use both fuzzy query 
and WDF when WDF is splitting apart words, numbers, and case changes, and 
throwing away special characters as well.


To put it simply, at query time the user needs to close their eyes and 
imagine what transformations WDF is doing and then query based on that.


One workaround: copy to a separate field that does not use WDF. Then the 
user can use fuzzy query fine (other than that it is limited to an edit 
distance of 2) for that other field.
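A rough client-side sketch of that "imagine what WDF is doing" advice, assuming catenate-style behavior for a single term (this is a simplification, not WDF's exact output for every input; the class and method names are made up):

```java
import java.util.Locale;

public class FuzzyQueryPrep {
    // Approximate WDF's catenate behavior for a single term: lowercase and
    // drop delimiter characters, then append the fuzzy operator.
    static String fuzzyTerm(String userTerm, int maxEdits) {
        String folded = userTerm.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return folded + "~" + maxEdits;
    }

    public static void main(String[] args) {
        // "Wi-Fi" indexed through WDF with catenateWords=1 can match "wifi",
        // so fuzz against the catenated form rather than the raw input.
        System.out.println(fuzzyTerm("Wi-Fi", 2)); // wifi~2
    }
}
```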


-- Jack Krupansky




Re: Too many fields to Sort in Solr

2013-03-29 Thread Joel Bernstein
OK, that makes sense. How are DocValues working for you?






-- 
Joel Bernstein
Professional Services LucidWorks


trying to index postgresql database using solrj

2013-03-29 Thread taniamm2002
I'm new to Solr and my question may be easy, but I can't understand why.

I've got a table which I have already indexed in Solr (so I already have
the fields of this table in schema.xml). I added 2 new rows to my database
and now I try to index this table again, but this time from my Java
application using SolrJ.

But it gives me this exception every time:

1030 [pool-1-thread-1] ERROR
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer - error
java.lang.Exception: Bad Request

Bad Request

request: http://localhost:8983/solr/db/update
at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:161)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
1030 [pool-1-thread-1] INFO
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer - finished:
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@3f5b4f9c
Total Time Taken: 1296 milliseconds to index 100 SQL rows

You can see that at the end it shows the right number of rows, which means
that it reads from my database. Maybe the problem is that this table is
already indexed, I don't know.



public class ReadFromSolr {
  private Connection conn = null;
  private static StreamingUpdateSolrServer server;
  private Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
  private int _totalSql = 0;
  private long _start = System.currentTimeMillis();

  public static void main(String[] args) throws SolrServerException,
      SQLException, IOException {
    String url = "http://localhost:8983/solr/db/";

    ReadFromSolr idxer = new ReadFromSolr(url);
    try {
      idxer.doSqlDocuments();
      idxer.endIndexing();
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }

  private void doSqlDocuments() throws SQLException {
    try {
      Class.forName("org.postgresql.Driver");

      conn = DriverManager.getConnection(
          "jdbc:postgresql://localhost:5432/plovdivbizloca",
          "postgres", "tan");
      java.sql.Statement st = conn.createStatement();
      ResultSet rs = st.executeQuery("select * from pl_biz");

      while (rs.next()) {
        SolrInputDocument doc = new SolrInputDocument();

        Integer id = rs.getInt("id");
        String name = rs.getString("name");
        String midname = rs.getString("midname");
        String lastname = rs.getString("lastname");
        String frlsname = rs.getString("frlsname");
        String biz_subject = rs.getString("biz_subject");
        String company_type = rs.getString("company_type");
        String obshtina = rs.getString("obshtina");
        String main_office_town = rs.getString("main_office_town");
        String address = rs.getString("address");
        String role = rs.getString("role");
        String country = rs.getString("country");
        String nace_code = rs.getString("nace_code");
        String nace_text = rs.getString("nace_text");
        String zip_code = rs.getString("zip_code");
        String phone = rs.getString("phone");
        String fax = rs.getString("fax");
        String email = rs.getString("email");
        String web = rs.getString("web");
        String location = rs.getString("location");
        String geohash = rs.getString("geohash");
        Integer popularity = rs.getInt("popularity");

        doc.addField("id", id);
        doc.addField("name", name);
        doc.addField("midname", midname);
        doc.addField("lastnme", lastname); // sic: field name "lastnme"
        doc.addField("frlsname", frlsname);
        doc.addField("biz_subject", biz_subject);
        doc.addField("company_type", company_type);
        doc.addField("obshtina", obshtina);
        doc.addField("main_office_town", main_office_town);
        doc.addField("address", address);
        doc.addField("role", role);
        doc.addField("country", country);
        doc.addField("nace_code", nace_code);
        doc.addField("nace_text", nace_text);
        doc.addField("zip_code", zip_code);
        doc.addField("phone", phone);
        doc.addField("fax", fax);
        doc.addField("email", email);
        doc.addField("web", web);
        doc.addField("location", location);
        doc.addField("geohash", geohash);
        doc.addField("popularity", popularity);

        docs.add(doc);
        ++_totalSql;

        if (docs.size() > 100) {
          // Commit within 5 minutes.
          UpdateResponse resp = server.add(docs, 30);
          docs.clear();
        }
      }
    } catch (Exception ex) {
      ex.printStackTrace();
    } finally {
      if (conn != null) {
        conn.close();
      }
    }
  }

  private void endIndexing() throws IOException, SolrServerException {
    if (docs.size() > 0) { // Are there any documents left over?
      server.add(docs, 30);
    }
    server.commit();


    long endTime = 

Re: Parallel Indexing With Solr?

2013-03-29 Thread Otis Gospodnetic
Yes. You can index from any app that can hit Solr with multiple
threads. You can use StreamingUpdateSolrServer, at least in older
Solrs, to handle multi-threading for you. You can index from a
MapReduce job...

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Fri, Mar 29, 2013 at 5:26 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 Does Solr allow parallelism (parallel computing) for indexing?


DocValues vs stored fields?

2013-03-29 Thread Jack Krupansky
I’m still a little fuzzy on DocValues (maybe because I’m still grappling with 
how it does or doesn’t still relate to “Column Stride Fields”), so can anybody 
clue me in as to how useful DocValues is/are?

Are DocValues simply an alternative to “stored fields”?

If so, and if DocValues are so great, why aren’t we just switching Solr over to 
DocValues under the hood for all fields?

And if there are “issues” with DocValues that would make such a complete 
switchover less than absolutely desired, what are those issues?

In short, when should a user use DocValues over stored fields, and vice versa?

As things stand, all we’ve done is make Solr more confusing than it was before, 
without improving its OOBE. OOBE should be job one in Solr.

Thanks.

P.S., And if I actually want to do Column Stride Fields, is there a way to do 
that?

-- Jack Krupansky

Re: Parallel Indexing With Solr?

2013-03-29 Thread Furkan KAMACI
Can you tell me more about "You can index from a MapReduce job"? I use
Nutch, and it tells Solr to index and reindex. I know that I can use
MapReduce jobs on the Nutch side, but can I use MapReduce jobs on the Solr
side (i.e. for indexing etc.)?


2013/3/29 Otis Gospodnetic otis.gospodne...@gmail.com

 Yes.  You can index from any app that can hit SOlr with multiple
 threads.  You can use StreamingUpdateSolrServer, at least in older
 Solrs, to handle multi-threading for you.  You can index from a
 MapReduce job 

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Fri, Mar 29, 2013 at 5:26 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:
  Does Solr allow parallelism (parallel computing) for indexing?



Cannot find word with accent

2013-03-29 Thread Van Tassell, Kristian
I'm trying to find documents with this word:

général

It returns one hit for a document containing General.

If I search for g*ral I get 230 hits, of which some contain the word général.

I'm not sure where to begin looking; I believe everything is encoded correctly. 
The text_fr (French) fieldType configuration is essentially a boilerplate one 
from the Solr distribution.

Thanks in advance for any insight!
-Kristian


Re: Cannot find word with accent

2013-03-29 Thread Jack Krupansky

The French Light Stemmer Filter is folding the accents:

<filter class="solr.FrenchLightStemFilterFactory"/>

Try the Solr Admin UI Analysis page and you can see that the accents go away 
at the last step in analysis.


This behavior is hardwired into the Lucene FrenchLightStemmer norm method. 
It would be nice if somebody added an attribute to disable accent folding.


Try the French Minimal Stemmer Filter:

<filter class="solr.FrenchMinimalStemFilterFactory"/>

It doesn't do the accent folding, but it also does less stemming.
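For illustration, accent folding itself is just Unicode decomposition plus stripping combining marks. This plain-Java sketch shows the transformation (it is not the FrenchLightStemmer's actual code, just a demonstration of the behavior):

```java
import java.text.Normalizer;

public class AccentFolding {
    // Decompose to NFD (base letter + combining accent), then strip the
    // combining diacritical marks, leaving the bare ASCII letter.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "g\u00e9n\u00e9ral" is "général"
        System.out.println(fold("g\u00e9n\u00e9ral")); // general
    }
}
```

This is why an index-time analysis chain that folds accents finds "General" for the query "général": both sides end up as "general".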

-- Jack Krupansky




Synonyms problem

2013-03-29 Thread Plamen Mihaylov
Hey guys,

I have the following problem: I have a website with sports players, and I am
using Solr to index their data. I have defined synonyms like: NY, New York.
When I search for New York there are 145 results found, but when I search
for NY there are 142 results found. Why is there a difference, and how can I
fix this?

Configuration snippets:

synonyms.txt

...
NY, New York
...

--
schema.xml

...
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="false"/> -->

    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
    <!-- <filter class="solr.SnowballPorterFilterFactory" language="English"/> -->
  </analyzer>
  <analyzer type="query">
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="letterstops.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>


Thanks in advance.
Plamen


Re: DocValues vs stored fields?

2013-03-29 Thread Timothy Potter
Hi Jack,

I've just started to dig into this as well, so sharing what I know but
still some holes in my knowledge too.

DocValues == Column Stride Fields (best resource I know of so far is
Simon's preso from Lucene Rev 2011 -
http://www.slideshare.net/LucidImagination/column-stride-fields-aka-docvalues).
It's pretty dense but some nuggets I've gleaned from this are:

1) DocValues are more efficient in terms of memory usage and I/O
performance for building an alternative to FieldCache (slide 27 is very
impressive)
2) DocValues has a more efficient way to store primitive types, such as
packed ints
3) Faster random access to stored values
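To make point 2 concrete: packed ints store each value in only as many bits as the column's maximum value needs, instead of a full 32 or 64 per document. A toy illustration of the arithmetic (not Lucene's actual PackedInts API):

```java
public class PackedWidth {
    // Bits needed to represent any value in [0, maxValue].
    static int bitsRequired(long maxValue) {
        return maxValue == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxValue);
    }

    public static void main(String[] args) {
        // A per-document rating column with values 0..5 needs only 3 bits
        // per doc instead of 32 -- roughly a 10x smaller in-memory column.
        System.out.println(bitsRequired(5)); // 3
    }
}
```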

In terms of switch-over, you have to re-index to change your fields to use
DocValues on disk, which is why they are not enabled by default.

Lastly, another goal of DocValues is to allow updates to a single field w/o
re-indexing the entire doc. That's not implemented yet but I think still
planned.

Cheers,
 Tim





Re: Synonyms problem

2013-03-29 Thread Thomas Krämer | ontopica
Hi Plamen

You should set expand to true during indexing:

<analyzer type="index">

  <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
          ignoreCase="true" expand="true"/>


...

Greetings,

Thomas



-- 

ontopica GmbH
Prinz-Albert-Str. 2b
53113 Bonn
Germany
fon: +49-228-227229-22
fax: +49-228-227229-77
web: http://www.ontopica.de
ontopica GmbH
Sitz der Gesellschaft: Bonn

Geschäftsführung: Thomas Krämer, Christoph Okpue
Handelsregister: Amtsgericht Bonn, HRB 17852




Re: Synonyms problem

2013-03-29 Thread Walter Underwood
Also, all the filters need to come after the tokenizer. There are two synonym 
filters specified, one before the tokenizer and one after.

I'm surprised that works at all. Shouldn't that be a fatal error when loading 
the config?

wunder


--
Walter Underwood
wun...@wunderwood.org





dataimport

2013-03-29 Thread A. Lotfi
Hi,

When I hit the Execute button in the Query tab, I only see:

Last Update: 12:34:58
Indexing since 01s
Requests: 1 (1/s), Fetched: 0 (0/s), Skipped: 0, Processed: 0 (0/s)
Started: about an hour ago

I did not see any green entry saying Indexing Completed.

 Thanks

Re: Synonyms problem

2013-03-29 Thread Steve Rowe
The XPath expressions used to collect the charFilter sequence, the tokenizer, 
and the token filter sequence are evaluated independently of each other - see 
lines #244 through #251:

http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_2_0/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java?view=markup#l232

Steve

On Mar 29, 2013, at 12:37 PM, Walter Underwood wun...@wunderwood.org wrote:

 Also, all the filters need to be after the tokenizer. There are two synonym 
 filters specified, one before the tokenizer and one after.
 
 I'm surprised that works at all. Shouldn't that be a fatal error when loading 
 the config?
 
 wunder
 



Re: Synonyms problem

2013-03-29 Thread Plamen Mihaylov
Guys,

The line where expand is false is the commented-out one. I moved the synonym
filter after the tokenizer, but the result is the same.

Actual configuration:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
    <!-- <filter class="solr.SnowballPorterFilterFactory" language="English"/> -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="letterstops.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

2013/3/29 Walter Underwood wun...@wunderwood.org

 Also, all the filters need to be after the tokenizer. There are two
 synonym filters specified, one before the tokenizer and one after.

 I'm surprised that works at all. Shouldn't that be fatal error when
 loading the config?

 wunder


Re: Synonyms problem

2013-03-29 Thread Walter Underwood
There are several problems with this config.

Indexing uses the phonetic filter, but query does not. This almost guarantees 
that nothing will match. Numbers could match, if the filter passes them.

Query time has two stopword filters with different lists. Indexing only has 
one. This isn't fatal, but it is pretty weird. Is letterstops.txt trying to do 
the same thing as the length filter? If so, use the length filter in both 
places, or not at all. Deleting all single characters is a bad idea. You'll 
never find Vitamin C.

The same synonyms are used at index and query time, which is unnecessary. Only 
use synonyms at index time unless you really know what you are doing and have a 
special need.
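A minimal index-time-only synonym setup along these lines might look like the following sketch (field-type and file names are illustrative, not from the thread):

```xml
<!-- Sketch: synonyms expanded at index time only; names are placeholders. -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand="true" writes every synonym in a group into the index -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no synonym filter at query time -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this arrangement a query for either NY or New York hits the same indexed terms, without query-time multi-word synonym pitfalls.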

wunder

On Mar 29, 2013, at 9:53 AM, Plamen Mihaylov wrote:

 Guys,
 
 This is a commented line where expand is false. I moved the synonym filter
 after tokenizer, but the result is the same.
 

Add fuzzy to edismax specs?

2013-03-29 Thread Walter Underwood
I've implemented this for the second time, so it is probably time to contribute 
it. I find it really useful.

I've extended the query spec parser for edismax to also accept a tilde and to 
generate a FuzzyQuery. I used this at Netflix (on 1.3 with dismax), and 
re-implemented it for 3.3 here at Chegg. We've had it in production for nearly 
a year. I'll need to re-port this as part of our move to 4.x.

Here is what the spec looks like. This expands to a fuzzy search on title with 
a similarity of 0.75, and so on.

   <str name="qf">title~0.75^4 long_title^4 title_stem^2 author~0.75</str>

I'm not 100% sure I understand the spec parser in edismax, so I'd like some 
review when this is ready. I'd probably only do it for edismax.
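For context, such a qf spec would sit inside an edismax handler definition, roughly like this (the handler name is illustrative, and the tilde syntax is the proposed extension, not stock edismax):

```xml
<!-- Sketch only: handler name is a placeholder; ~0.75 is the proposed fuzzy extension. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title~0.75^4 long_title^4 title_stem^2 author~0.75</str>
  </lst>
</requestHandler>
```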

See: https://issues.apache.org/jira/browse/SOLR-629

wunder
--
Walter Underwood
wun...@wunderwood.org
Search Guy, Chegg.com



Re: Solr 4.2 - Slave Index version is higher than Master

2013-03-29 Thread adityab
+1 
I have observed this same issue: no change on the master, yet the slave is
bumped up to a higher index version. 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-2-Slave-Index-version-is-higher-than-Master-tp4049827p4052445.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr metrics in Codahale metrics and Graphite?

2013-03-29 Thread Walter Underwood
What are folks using for this?

wunder
--
Walter Underwood
wun...@wunderwood.org





RE: Basic auth on SolrCloud /admin/* calls

2013-03-29 Thread Vaillancourt, Tim
Yes, I should have mentioned this is under 4.2 Solr.

I sort of expected that what I'm doing might be unsupported, but basically my 
concern is that under the current Solr design, any client with connectivity to 
Solr's port can perform admin-level API calls like creating or dropping Cores 
or Collections.

I'm only aiming for '/solr/admin/*' calls to separate Application access from 
the Administrative access logically, and not the non-admin calls like 
'/update', although you can cause damage with '/update', too.
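For reference, the '/admin/*' constraint described above can be sketched in the webapp's web.xml roughly as follows (realm and role names are placeholders):

```xml
<!-- Sketch only: realm/role names are placeholders, not from the thread. -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr admin</web-resource-name>
    <url-pattern>/admin/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>solr-admin</role-name>
  </auth-constraint>
</security-constraint>
<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>solr</realm-name>
</login-config>
```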

I may try to patch the code to send Basic auth credentials on internal calls 
just for fun, but I'm thinking longer-term authentication should be 
implemented/added to the SOLR codebase (for at least admin calls) vs playing 
with security at the container level, and having the app inside the container 
aware of it.
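Patching the internal calls essentially means attaching an HTTP Basic Authorization header to every node-to-node request. A minimal sketch of how that header value is built (user and password are placeholders; this is not Solr's actual client code):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header value an internal request would attach."""
    # Basic auth is base64("user:password") prefixed with "Basic ".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token

# Hypothetical usage: add to the headers of every node-to-node admin call.
headers = {"Authorization": basic_auth_header("solr-admin", "secret")}
```

This is exactly the header the background Cores API calls are missing when they hit a node protected by the Jetty constraint.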

On the upside, in short testing I was able to get a Collection online using 
only the Cores API with curl calls and basic auth. Only the Collections API is 
affected, because the calls it makes to itself do not carry auth.

Cheers,

Tim

-Original Message-
From: Isaac Hebsh [mailto:isaac.he...@gmail.com] 
Sent: Friday, March 29, 2013 12:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Basic auth on SolrCloud /admin/* calls

Hi Tim,
Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't return 
any failure message; see the SOLR-4043 issue.)

As far as I know, you can't tell Solr to use authentication credentials when 
communicating with other nodes. It's a bigger issue; for example, if you want to 
protect the /update requestHandler so unauthorized users won't delete your 
whole collection, it can interfere with the replication process.

I think it's a necessary mechanism in a production environment... I'm curious how 
people use SolrCloud in production without it.





On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim tvaillanco...@ea.comwrote:

 Hey guys,

 I've recently setup basic auth under Jetty 8 for all my Solr 4.x 
 '/admin/*' calls, in order to protect my Collections and Cores API.

 Although the security constraint is working as expected ('/admin/*' 
 calls require Basic Auth or return 401), when I use the Collections 
 API to create a collection, I receive a 200 OK to the Collections API 
  CREATE call, but the background Cores API calls that are run on the 
 Collection API's behalf fail on the Basic Auth on other nodes with a 
 401 code, as I should have foreseen, but didn't.

 Is there a way to tell SolrCloud to use authentication on internal 
 Cores API calls that are spawned on Collections API's behalf, or is 
 this a new feature request?

 To reproduce:

 1.   Implement basic auth on '/admin/*' URIs.

 2.   Perform a CREATE Collections API call to a node (which will
 return 200 OK).

 3.   Notice all Cores API calls fail (Collection isn't created). See
 stack trace below from the node that was issued the CREATE call.

 The stack trace I get is:

 org.apache.solr.common.SolrException: Server at http://HOST HERE:8983/solr
 returned non ok status:401, message:Unauthorized
 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
 at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:169)
 at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)

 Cheers!

 Tim





Re: Synonyms problem

2013-03-29 Thread Plamen Mihaylov
Thank you very much, Walter. I removed most of the filters and now it returns
the same number of results. It now looks simply like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Can I ask you another question? I have Magento + Solr, and I have a
requirement to create an admin Magento module where I can add/remove
synonyms dynamically. Is this possible? I searched Google, but it seems not to
be possible.

Regards
Plamen

2013/3/29 Walter Underwood wun...@wunderwood.org

 There are several problems with this config.

 Indexing uses the phonetic filter, but query does not. This almost
 guarantees that nothing will match. Numbers could match, if the filter
 passes them.

 Query time has two stopword filters with different lists. Indexing only
 has one. This isn't fatal, but it is pretty weird. Is letterstops.txt
 trying to do the same thing as the length filter? If so, use the length
 filter in both places, or not at all. Deleting all single characters is
 a bad idea. You'll never find Vitamin C.

 The same synonyms are used at index and query time, which is unnecessary.
 Only use synonyms at index time unless you really know what you are doing
 and have a special need.

 wunder


Re: Solrcloud 4.1 Collection with multiple slices only use

2013-03-29 Thread Chris R
So, I upgraded to 4.2 this morning. I had gotten to the point where I was okay
with the collection creation process in 4.1 using the API rather than the
solr.xml file in 4.0, but now 4.2 doesn't seem to want to create the instanceDir.

E.g., the Dashboard reports the following when my solr.data.dir is set to
/data/solr in solrconfig.xml. However, the instance dirs aren't
created, yet the index and tlog dirs are:

Instance /data/solr/collection1_shard1_replica1
Data /data/solr
Index /data/solr/index



Chris

On Thu, Mar 28, 2013 at 7:48 PM, Mark Miller markrmil...@gmail.com wrote:


 On Mar 28, 2013, at 7:30 PM, Shawn Heisey s...@elyograg.org wrote:

  Can't you leave numShards out completely, then include a numShards
 parameter on a collection api CREATE url, possibly giving a different
 numShards to each collection?
 
  Thanks,
  Shawn
 

 Yes - that's why I say the collections API is the way forward - it has
 none of these limitations. The limitations are all around pre-configuring
 everything in solr.xml and not using the collections API.

 - Mark
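For reference, a Collections API CREATE call with a per-collection numShards looks roughly like this (host, port, collection name, and counts are placeholders):

```
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2'
```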


Query Elevation exception on shard queries

2013-03-29 Thread Ravi Solr
Hello,
  We have a Solr 3.6.2 multicore setup, where each core is a complete
index for one application. In our site search we use a sharded query to query
two cores at a time. The issue is: if one core has docs for an elevated query
but the other core doesn't, Solr throws a 500 error. I would really
appreciate it if somebody could point me in the right direction on how to
avoid this error. The following is my query:

[#|2013-03-29T13:44:55.609-0400|INFO|sun-appserver2.1|org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=httpSSLWorkerThread-9001-0;|[core1]
webapp=/solr path=/select/
params={q=civil+war&start=0&rows=10&shards=localhost:/solr/core1,localhost:/solr/core2&hl=true&hl.fragsize=0&hl.snippets=5&hl.simple.pre=<strong>&hl.simple.post=</strong>&hl.fl=body&fl=*&facet=true&facet.field=type&facet.mincount=1&facet.method=enum&fq=pubdate:[2005-01-01T00:00:00Z+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3DPast+24+Hours}pubdate:[NOW/DAY-1DAY+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3DPast+7+Days}pubdate:[NOW/DAY-7DAYS+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3DPast+60+Days}pubdate:[NOW/DAY-60DAYS+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3DPast+12+Months}pubdate:[NOW/DAY-1YEAR+TO+NOW/DAY%2B1DAY]&facet.query={!ex%3Ddt+key%3DAll+Since+2005}pubdate:[*+TO+NOW/DAY%2B1DAY]}
status=500 QTime=15 |#]


As you can see, the two cores are core1 and core2. core1 has data for the
query 'civil war', but core2 doesn't have any data. We have 'civil
war' in the elevate.xml, which causes Solr to throw a SolrException as
follows. However, if I remove the elevate entry for this query, everything
works well.

type: Status report

message: Index: 1, Size: 0
java.lang.IndexOutOfBoundsException: Index: 1, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.solr.common.util.NamedList.getVal(NamedList.java:137)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue$ShardComparator.sortVal(ShardDoc.java:221)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue$2.compare(ShardDoc.java:260)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:160)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:101)
at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:223)
at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:132)
at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:148)
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:786)
at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:587)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:566)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:283)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:313)
at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:287)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:218)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:648)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:593)
at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:94)
at com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:98)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:222)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:648)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:593)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:587)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:1093)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:166)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:648)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:593)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:587)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:1093)
at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:291)
at com.sun.enterprise.web.connector.grizzly.DefaultProcessorTask.invokeAdapter(DefaultProcessorTask.java:670)
at
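For reference, the elevate entry in question takes roughly this form in elevate.xml (the doc id is a placeholder, not from the thread):

```xml
<!-- Sketch only: the doc id is a placeholder. -->
<elevate>
  <query text="civil war">
    <doc id="SOME-DOC-ID"/>
  </query>
</elevate>
```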

Re: DocValues vs stored fields?

2013-03-29 Thread Marcin Rzewucki
Hi,
Atomic updates (single-field updates) do not depend on DocValues. They were
implemented in Solr 4.0 and work fine (but all fields have to be
retrievable). DocValues are supposed to be more efficient than the FieldCache.
Why are they not enabled by default? Maybe because they are not for all fields,
and because of their limitations (a field has to be single-valued, and required
or with a default value).
Regards.
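For reference, enabling DocValues on a field in schema.xml (Solr 4.2 syntax) looks roughly like this; the field name and type are illustrative:

```xml
<!-- Sketch only: field name/type are placeholders. -->
<field name="manufacturer" type="string" indexed="true" stored="false"
       docValues="true"/>
```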



On 29 March 2013 17:20, Timothy Potter thelabd...@gmail.com wrote:

 Hi Jack,

 I've just started to dig into this as well, so sharing what I know but
 still some holes in my knowledge too.

 DocValues == Column Stride Fields (best resource I know of so far is
 Simon's preso from Lucene Rev 2011 -

 http://www.slideshare.net/LucidImagination/column-stride-fields-aka-docvalues
 ).
 It's pretty dense but some nuggets I've gleaned from this are:

 1) DocValues are more efficient in terms of memory usage and I/O
 performance for building an alternative to FieldCache (slide 27 is very
 impressive)
 2) DocValues has a more efficient way to store primitive types, such as
 packed ints
 3) Faster random access to stored values

 In terms of switch-over, you have to re-index to change your fields to use
 DocValues on disk, which is why they are not enabled by default.

 Lastly, another goal of DocValues is to allow updates to a single field w/o
 re-indexing the entire doc. That's not implemented yet but I think still
 planned.

 Cheers,
  Tim



 On Fri, Mar 29, 2013 at 9:31 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  I’m still a little fuzzy on DocValues (maybe because I’m still grappling
  with how it does or doesn’t still relate to “Column Stride Fields”), so
 can
  anybody clue me in as to how useful DocValues is/are?
 
  Are DocValues simply an alternative to “stored fields”?
 
  If so, and if DocValues are so great, why aren’t we just switching Solr
  over to DocValues under the hood for all fields?
 
  And if there are “issues” with DocValues that would make such a complete
  switchover less than absolutely desired, what are those issues?
 
  In short, when should a user use DocValues over stored fields, and vice
  versa?
 
  As things stand, all we’ve done is make Solr more confusing than it was
  before, without improving its OOBE. OOBE should be job one in Solr.
 
  Thanks.
 
  P.S., And if I actually want to do Column Stride Fields, is there a way
 to
  do that?
 
  -- Jack Krupansky



Re: DocValues vs stored fields?

2013-03-29 Thread Otis Gospodnetic
Hi,

The current field update mechanism is not really a field update
mechanism.  It just looks like that from the outside.  DocValues
should make true field updates implementable.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Fri, Mar 29, 2013 at 3:30 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:
 Hi,
 Atomic updates (single-field updates) do not depend on DocValues. They were
 implemented in Solr 4.0 and work fine (but all fields have to be
 retrievable). DocValues are supposed to be more efficient than FieldCache.
 Why not enabled by default? Maybe because they are not for all fields and
 because of their limitations (a field has to be single-valued, and either
 required or have a default value).
 Regards.



 On 29 March 2013 17:20, Timothy Potter thelabd...@gmail.com wrote:

 Hi Jack,

 I've just started to dig into this as well, so sharing what I know but
 still some holes in my knowledge too.

 DocValues == Column Stride Fields (best resource I know of so far is
 Simon's preso from Lucene Rev 2011 -

 http://www.slideshare.net/LucidImagination/column-stride-fields-aka-docvalues
 ).
 It's pretty dense but some nuggets I've gleaned from this are:

 1) DocValues are more efficient in terms of memory usage and I/O
 performance for building an alternative to FieldCache (slide 27 is very
 impressive)
 2) DocValues has a more efficient way to store primitive types, such as
 packed ints
 3) Faster random access to stored values

 In terms of switch-over, you have to re-index to change your fields to use
 DocValues on disk, which is why they are not enabled by default.

 Lastly, another goal of DocValues is to allow updates to a single field w/o
 re-indexing the entire doc. That's not implemented yet but I think still
 planned.

 Cheers,
  Tim



 On Fri, Mar 29, 2013 at 9:31 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  I’m still a little fuzzy on DocValues (maybe because I’m still grappling
  with how it does or doesn’t still relate to “Column Stride Fields”), so
 can
  anybody clue me in as to how useful DocValues is/are?
 
  Are DocValues simply an alternative to “stored fields”?
 
  If so, and if DocValues are so great, why aren’t we just switching Solr
  over to DocValues under the hood for all fields?
 
  And if there are “issues” with DocValues that would make such a complete
  switchover less than absolutely desired, what are those issues?
 
  In short, when should a user use DocValues over stored fields, and vice
  versa?
 
  As things stand, all we’ve done is make Solr more confusing than it was
  before, without improving its OOBE. OOBE should be job one in Solr.
 
  Thanks.
 
  P.S., And if I actually want to do Column Stride Fields, is there a way
 to
  do that?
 
  -- Jack Krupansky



Re: Solrcloud 4.1 Collection with multiple slices only use

2013-03-29 Thread Mark Miller
Those are paths? /data/solr off the root?

When using the collections api, you really don't want to set an absolute data 
dir - it should be relative; I'd just take the default. Then, even though many 
shards share that solrconfig and data dir, they will all find a nice home 
relative to the instance dir. If you don't do this, you won't be able to 
over-shard, and things get tricky fast.
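For reference, a Collections API CREATE call that takes the default (relative) data dir can be assembled like this. This is just a sketch of the URL the API expects; the host, port, and collection name are placeholders, and no dataDir parameter is passed, so each core keeps its default location relative to the instance dir:

```python
from urllib.parse import urlencode

def create_collection_url(base_url, name, num_shards, replication_factor=1):
    """Build a Collections API CREATE URL. No dataDir is passed, so each
    core keeps the default data dir relative to its instance dir."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return "%s/admin/collections?%s" % (base_url, params)

print(create_collection_url("http://localhost:8983/solr", "mycollection", 2))
```

Issuing that URL with curl (or any HTTP client) against one node is enough; SolrCloud fans the core creation out to the other nodes itself.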

- Mark

On Mar 29, 2013, at 2:45 PM, Chris R corg...@gmail.com wrote:

 So, upgraded to 4.2 this morning.  I had gotten to the point where I was okay
 with the collection creation process in 4.1 using the API instead of the solr.xml
 file in 4.0, but now 4.2 doesn't seem to want to create the instanceDir?
 
 e.g. the Dashboard reports the following when my solr.data.dir is set to
 /data/solr in the solrconfig.xml.  However, the instance dirs aren't
 created, yet the index and tlog dirs are
 
 Instance /data/solr/collection1_shard1_replica1
 Data /data/solr
 Index /data/solr/index
 
 
 
 Chris
 
 On Thu, Mar 28, 2013 at 7:48 PM, Mark Miller markrmil...@gmail.com wrote:
 
 
 On Mar 28, 2013, at 7:30 PM, Shawn Heisey s...@elyograg.org wrote:
 
 Can't you leave numShards out completely, then include a numShards
 parameter on a collection api CREATE url, possibly giving a different
 numShards to each collection?
 
 Thanks,
 Shawn
 
 
 Yes - that's why I say the collections API is the way forward - it has
 none of these limitations. The limitations are all around pre-configuring
 everything in solr.xml and not using the collections API.
 
 - Mark



Re: Basic auth on SolrCloud /admin/* calls

2013-03-29 Thread Mark Miller
This has always been the case with Solr. Solr's security model is that clients 
should not have access to it - only trusted intermediaries should have access 
to it. Otherwise, it should be locked down at a higher level. That's been the 
case from day one and still is.

That said, someone did do some work on internode basic auth a while back, but 
it didn't raise a ton of interest yet.

- Mark

On Mar 29, 2013, at 2:09 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote:

 Yes, I should have mentioned this is under 4.2 Solr.
 
 I sort of expected what I'm doing might be unsupported, but basically my 
 concern is under the current SOLR design, any client with connectivity to 
 SOLR's port can perform Admin-level API calls like create/drop Cores or 
 Collections.
 
 I'm only aiming for '/solr/admin/*' calls to separate Application access 
 from the Administrative access logically, and not the non-admin calls like 
 '/update', although you can cause damage with '/update', too.
 
 I may try to patch the code to send Basic auth credentials on internal calls 
 just for fun, but I'm thinking longer-term authentication should be 
 implemented/added to the SOLR codebase (for at least admin calls) vs playing 
 with security at the container level, and having the app inside the container 
 aware of it.
 
 On the upside, in short testing I was able to get a Collection online using 
 the Cores API only, using curl calls w/basic auth. Only the Collections API is 
 affected, because the internal calls it makes do not carry auth.
 
 Cheers,
 
 Tim
 
 -Original Message-
 From: Isaac Hebsh [mailto:isaac.he...@gmail.com] 
 Sent: Friday, March 29, 2013 12:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Basic auth on SolrCloud /admin/* calls
 
 Hi Tim,
 Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't return 
 any failure message. see SOLR-4043 issue).
 
  As far as I know, you can't tell Solr to use authentication credentials when 
  communicating with other nodes. It's a bigger issue... for example, if you want to 
  protect the /update requestHandler so unauthorized users won't delete your 
  whole collection, it can interfere with the replication process.
  
  I think it's a necessary mechanism in a production environment... I'm curious 
  how people use SolrCloud in production w/o it.
 
 
 
 
 
 On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim 
 tvaillanco...@ea.comwrote:
 
 Hey guys,
 
 I've recently setup basic auth under Jetty 8 for all my Solr 4.x 
 '/admin/*' calls, in order to protect my Collections and Cores API.
 
 Although the security constraint is working as expected ('/admin/*' 
 calls require Basic Auth or return 401), when I use the Collections 
 API to create a collection, I receive a 200 OK to the Collections API 
  CREATE call, but the background Cores API calls that are run on the 
 Collection API's behalf fail on the Basic Auth on other nodes with a 
 401 code, as I should have foreseen, but didn't.
 
 Is there a way to tell SolrCloud to use authentication on internal 
 Cores API calls that are spawned on Collections API's behalf, or is 
 this a new feature request?
 
 To reproduce:
 
 1.   Implement basic auth on '/admin/*' URIs.
 
 2.   Perform a CREATE Collections API call to a node (which will
 return 200 OK).
 
 3.   Notice all Cores API calls fail (Collection isn't created). See
 stack trace below from the node that was issued the CREATE call.
 
 The stack trace I get is:
 
  org.apache.solr.common.SolrException: Server at http://HOST HERE:8983/solr
  returned non ok status:401, message:Unauthorized
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
  at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:169)
  at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
  at java.lang.Thread.run(Thread.java:662)
 
 Cheers!
 
 Tim
 
 
 



Re: DocValues vs stored fields?

2013-03-29 Thread Marcin Rzewucki
Hi Otis,

Currently, the whole record has to be stored on disk in order to update a single
field. Are you trying to say that it won't be necessary with the use of
DocValues ? Sounds great!
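For context, this is the kind of atomic update being discussed: a sketch of a JSON request body sent to /update (the field names are illustrative). Today Solr reconstructs the whole document from its stored fields to apply such an update:

```json
[
  {
    "id": "doc-1",
    "price": { "set": 19.99 },
    "views": { "inc": 1 }
  }
]
```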

Regards.


On 29 March 2013 20:51, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi,

 The current field update mechanism is not really a field update
 mechanism.  It just looks like that from the outside.  DocValues
 should make true field updates implementable.

 Otis
 --
  Solr & ElasticSearch Support
 http://sematext.com/





 On Fri, Mar 29, 2013 at 3:30 PM, Marcin Rzewucki mrzewu...@gmail.com
 wrote:
  Hi,
   Atomic updates (single-field updates) do not depend on DocValues. They were
   implemented in Solr 4.0 and work fine (but all fields have to be
   retrievable). DocValues are supposed to be more efficient than FieldCache.
   Why not enabled by default? Maybe because they are not for all fields and
   because of their limitations (a field has to be single-valued, and either
   required or have a default value).
  Regards.
 
 
 
  On 29 March 2013 17:20, Timothy Potter thelabd...@gmail.com wrote:
 
  Hi Jack,
 
  I've just started to dig into this as well, so sharing what I know but
  still some holes in my knowledge too.
 
  DocValues == Column Stride Fields (best resource I know of so far is
  Simon's preso from Lucene Rev 2011 -
 
 
 http://www.slideshare.net/LucidImagination/column-stride-fields-aka-docvalues
  ).
  It's pretty dense but some nuggets I've gleaned from this are:
 
  1) DocValues are more efficient in terms of memory usage and I/O
  performance for building an alternative to FieldCache (slide 27 is very
  impressive)
  2) DocValues has a more efficient way to store primitive types, such as
  packed ints
  3) Faster random access to stored values
 
  In terms of switch-over, you have to re-index to change your fields to
 use
  DocValues on disk, which is why they are not enabled by default.
 
  Lastly, another goal of DocValues is to allow updates to a single field
 w/o
  re-indexing the entire doc. That's not implemented yet but I think still
  planned.
 
  Cheers,
   Tim
 
 
 
  On Fri, Mar 29, 2013 at 9:31 AM, Jack Krupansky 
 j...@basetechnology.com
  wrote:
 
   I’m still a little fuzzy on DocValues (maybe because I’m still
 grappling
   with how it does or doesn’t still relate to “Column Stride Fields”),
 so
  can
   anybody clue me in as to how useful DocValues is/are?
  
   Are DocValues simply an alternative to “stored fields”?
  
   If so, and if DocValues are so great, why aren’t we just switching
 Solr
   over to DocValues under the hood for all fields?
  
   And if there are “issues” with DocValues that would make such a
 complete
   switchover less than absolutely desired, what are those issues?
  
   In short, when should a user use DocValues over stored fields, and
 vice
   versa?
  
   As things stand, all we’ve done is make Solr more confusing than it
 was
   before, without improving its OOBE. OOBE should be job one in Solr.
  
   Thanks.
  
   P.S., And if I actually want to do Column Stride Fields, is there a
 way
  to
   do that?
  
   -- Jack Krupansky
 



Re: per-fieldtype similarity not working

2013-03-29 Thread mike.vogel
Any example or suggestion for how to patch the wrapper so that the coord method
is called for the field type with the custom similarity?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/per-fieldtype-similarity-not-working-tp3987050p4052470.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solrcloud 4.1 Collection with multiple slices only use

2013-03-29 Thread Chris R
Yes, removing the absolute path cured the problem, but I feel like there
should be a better option than the default.  Given multiple collections,
there should be some ability within the API to lay down the directory
structure in a different way, e.g. ./collection/shard, as opposed to the
current auto-naming scheme.  If you wanted to do that now, you would have
to create all the collections, stop everything, modify solr.xmls, move
files, and restart: painful at best.  Some might say it's not
necessary.

Thanks,
Chris


On Fri, Mar 29, 2013 at 4:01 PM, Mark Miller markrmil...@gmail.com wrote:

 Those are paths? /data/solr off the root?

 When using the collections api, you really don't want to set an absolute
 data dir - it should be relative, I'd just take the default. Then, even
 though many shards shard that solrconfig and data dir, they will all find a
 nice home relative to the instance dir. If you don't do this, you won't be
 able to over shard, and things get tricky fast.

 - Mark

 On Mar 29, 2013, at 2:45 PM, Chris R corg...@gmail.com wrote:

  So, upgraded to 4.2 this morning.  I had gotten to the point where I okay
  with the collection creation process in 4.1 using the API vice the
 solr.xml
  file in 4.0, but now 4.2 doesn't seem to want to create the instanceDir?
 
  e.g. the Dashboard reports the following when my solr.data.dir is set to
  /data/solr in the solrconfig.xml.  However, the instance dirs aren't
  created, yet the index and tlog dirs are
 
  Instance /data/solr/collection1_shard1_replica1
  Data /data/solr
  Index /data/solr/index
 
 
 
  Chris
 
  On Thu, Mar 28, 2013 at 7:48 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  On Mar 28, 2013, at 7:30 PM, Shawn Heisey s...@elyograg.org wrote:
 
  Can't you leave numShards out completely, then include a numShards
  parameter on a collection api CREATE url, possibly giving a different
  numShards to each collection?
 
  Thanks,
  Shawn
 
 
  Yes - that's why I say the collections API is the way forward - it has
  none of these limitations. The limitations are all around
 pre-configuring
  everything in solr.xml and not using the collections API.
 
  - Mark




4.2 Admin UI

2013-03-29 Thread Chris R
I've noticed in the Admin UI that on some of my nodes the Core Selector
combo box doesn't populate.  Known issue?

Chris


Re: DocValues vs stored fields?

2013-03-29 Thread Marcin Rzewucki
By the way: even if a field has DocValues with the on-disk option enabled, it
has to have stored=true to be retrievable. Why?


On 29 March 2013 20:51, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi,

 The current field update mechanism is not really a field update
 mechanism.  It just looks like that from the outside.  DocValues
 should make true field updates implementable.

 Otis
 --
  Solr & ElasticSearch Support
 http://sematext.com/





 On Fri, Mar 29, 2013 at 3:30 PM, Marcin Rzewucki mrzewu...@gmail.com
 wrote:
  Hi,
   Atomic updates (single-field updates) do not depend on DocValues. They were
   implemented in Solr 4.0 and work fine (but all fields have to be
   retrievable). DocValues are supposed to be more efficient than FieldCache.
   Why not enabled by default? Maybe because they are not for all fields and
   because of their limitations (a field has to be single-valued, and either
   required or have a default value).
  Regards.
 
 
 
  On 29 March 2013 17:20, Timothy Potter thelabd...@gmail.com wrote:
 
  Hi Jack,
 
  I've just started to dig into this as well, so sharing what I know but
  still some holes in my knowledge too.
 
  DocValues == Column Stride Fields (best resource I know of so far is
  Simon's preso from Lucene Rev 2011 -
 
 
 http://www.slideshare.net/LucidImagination/column-stride-fields-aka-docvalues
  ).
  It's pretty dense but some nuggets I've gleaned from this are:
 
  1) DocValues are more efficient in terms of memory usage and I/O
  performance for building an alternative to FieldCache (slide 27 is very
  impressive)
  2) DocValues has a more efficient way to store primitive types, such as
  packed ints
  3) Faster random access to stored values
 
  In terms of switch-over, you have to re-index to change your fields to
 use
  DocValues on disk, which is why they are not enabled by default.
 
  Lastly, another goal of DocValues is to allow updates to a single field
 w/o
  re-indexing the entire doc. That's not implemented yet but I think still
  planned.
 
  Cheers,
   Tim
 
 
 
  On Fri, Mar 29, 2013 at 9:31 AM, Jack Krupansky 
 j...@basetechnology.com
  wrote:
 
   I’m still a little fuzzy on DocValues (maybe because I’m still
 grappling
   with how it does or doesn’t still relate to “Column Stride Fields”),
 so
  can
   anybody clue me in as to how useful DocValues is/are?
  
   Are DocValues simply an alternative to “stored fields”?
  
   If so, and if DocValues are so great, why aren’t we just switching
 Solr
   over to DocValues under the hood for all fields?
  
   And if there are “issues” with DocValues that would make such a
 complete
   switchover less than absolutely desired, what are those issues?
  
   In short, when should a user use DocValues over stored fields, and
 vice
   versa?
  
   As things stand, all we’ve done is make Solr more confusing than it
 was
   before, without improving its OOBE. OOBE should be job one in Solr.
  
   Thanks.
  
   P.S., And if I actually want to do Column Stride Fields, is there a
 way
  to
   do that?
  
   -- Jack Krupansky
 



Re: Solr 4.2 - Slave Index version is higher than Master

2013-03-29 Thread adityab
Something is really wrong with replication.
Check the attached document, which has the screenshots.
I re-indexed the master after adding new fields to the schema file (it's part
of config file replication).
The UI shows the master as gen '6', whereas in the slave's log the master gen is '7'.

The attached document has the screenshot captured. 
Replication_Issue_4.2.docx
http://lucene.472066.n3.nabble.com/file/n4052485/Replication_Issue_4.2.docx  



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-2-Slave-Index-version-is-higher-than-Master-tp4049827p4052485.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Too many fields to Sort in Solr

2013-03-29 Thread adityab
Joel, thanks for your excellent idea of using docValues. It's working exactly as
you described.
So far my unit test case has no issues and I see a low memory footprint. Will
be sending the build for performance testing, which should give comparable numbers.

Now i see another replication issue in 4.2. there is a thread on that. 

thanks
Aditya 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Too-many-fields-to-Sort-in-Solr-tp4049139p4052486.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Basic auth on SolrCloud /admin/* calls

2013-03-29 Thread Vaillancourt, Tim
Agreed, we don't have clients hitting Solr directly, it is used like a backend 
database in our usage by intermediaries, similar to say MySQL. Although 
restricting the access to Solr to fewer hosts is something, I still feel an 
application has no business being able to perform admin level calls, at least 
in my use case. This is being very nitpicky though.

We also open Solr's port to monitoring servers that shouldn't have access to 
admin calls, and, thinking paranoid, a compromised app using a single collection 
could affect the entire cloud with admin call access.

Seeing the long term plan is to leave this feature at the container level 
(which is totally valid), I think I'll continue with the basic auth approach I 
attempted and see what I can dig up on past efforts. I'll be sure to share what 
I've done.
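For anyone attempting the same patch, the credentials header itself is trivial to produce. A minimal sketch of the HTTP Basic scheme ("Basic " plus base64 of "user:password") that an internal HttpSolrServer call would need to send; the user and password are placeholders:

```python
import base64

def basic_auth_header(user, password):
    # HTTP Basic auth: "Basic " + base64("user:password")
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}

print(basic_auth_header("user", "pass"))
```

The hard part is not the header but threading it through every internal request path (shard handlers, replication, Collections API fan-out), which is why this really wants a codebase-level solution.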

Thanks Mark!

Tim

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, March 29, 2013 1:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Basic auth on SolrCloud /admin/* calls

This has always been the case with Solr. Solr's security model is that clients 
should not have access to it - only trusted intermediaries should have access 
to it. Otherwise, it should be locked down at a higher level. That's been the 
case from day one and still is.

That said, someone did do some work on internode basic auth a while back, but 
it didn't raise a ton of interest yet.

- Mark

On Mar 29, 2013, at 2:09 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote:

 Yes, I should have mentioned this is under 4.2 Solr.
 
 I sort of expected what I'm doing might be unsupported, but basically my 
 concern is under the current SOLR design, any client with connectivity to 
 SOLR's port can perform Admin-level API calls like create/drop Cores or 
 Collections.
 
 I'm only aiming for '/solr/admin/*' calls to separate Application access 
 from the Administrative access logically, and not the non-admin calls like 
 '/update', although you can cause damage with '/update', too.
 
 I may try to patch the code to send Basic auth credentials on internal calls 
 just for fun, but I'm thinking longer-term authentication should be 
 implemented/added to the SOLR codebase (for at least admin calls) vs playing 
 with security at the container level, and having the app inside the container 
 aware of it.
 
  On the upside, in short testing I was able to get a Collection online using 
  the Cores API only, using curl calls w/basic auth. Only the Collections API is 
  affected, because the internal calls it makes do not carry auth.
 
 Cheers,
 
 Tim
 
 -Original Message-
 From: Isaac Hebsh [mailto:isaac.he...@gmail.com]
 Sent: Friday, March 29, 2013 12:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Basic auth on SolrCloud /admin/* calls
 
 Hi Tim,
 Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't return 
 any failure message. see SOLR-4043 issue).
 
  As far as I know, you can't tell Solr to use authentication credentials when 
  communicating with other nodes. It's a bigger issue... for example, if you want to 
  protect the /update requestHandler so unauthorized users won't delete your 
  whole collection, it can interfere with the replication process.
  
  I think it's a necessary mechanism in a production environment... I'm curious 
  how people use SolrCloud in production w/o it.
 
 
 
 
 
 On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim 
 tvaillanco...@ea.comwrote:
 
 Hey guys,
 
 I've recently setup basic auth under Jetty 8 for all my Solr 4.x 
 '/admin/*' calls, in order to protect my Collections and Cores API.
 
 Although the security constraint is working as expected ('/admin/*' 
 calls require Basic Auth or return 401), when I use the Collections 
 API to create a collection, I receive a 200 OK to the Collections API 
  CREATE call, but the background Cores API calls that are run on the 
 Collection API's behalf fail on the Basic Auth on other nodes with a
 401 code, as I should have foreseen, but didn't.
 
 Is there a way to tell SolrCloud to use authentication on internal 
 Cores API calls that are spawned on Collections API's behalf, or is 
 this a new feature request?
 
 To reproduce:
 
 1.   Implement basic auth on '/admin/*' URIs.
 
 2.   Perform a CREATE Collections API call to a node (which will
 return 200 OK).
 
 3.   Notice all Cores API calls fail (Collection isn't created). See
 stack trace below from the node that was issued the CREATE call.
 
 The stack trace I get is:
 
  org.apache.solr.common.SolrException: Server at http://HOST HERE:8983/solr
  returned non ok status:401, message:Unauthorized
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
  at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:169)
  at
  

RE: Basic auth on SolrCloud /admin/* calls

2013-03-29 Thread Vaillancourt, Tim
Here we go:

https://issues.apache.org/jira/browse/SOLR-4470

Tim

-Original Message-
From: Vaillancourt, Tim [mailto:tvaillanco...@ea.com] 
Sent: Friday, March 29, 2013 3:25 PM
To: solr-user@lucene.apache.org
Subject: RE: Basic auth on SolrCloud /admin/* calls

Agreed, we don't have clients hitting Solr directly, it is used like a backend 
database in our usage by intermediaries, similar to say MySQL. Although 
restricting the access to Solr to fewer hosts is something, I still feel an 
application has no business being able to perform admin level calls, at least 
in my use case. This is being very nitpicky though.

We also open Solr's port to monitoring servers that shouldn't have access to 
admin calls, and, thinking paranoid, a compromised app using a single collection 
could affect the entire cloud with admin call access.

Seeing the long term plan is to leave this feature at the container level 
(which is totally valid), I think I'll continue with the basic auth approach I 
attempted and see what I can dig up on past efforts. I'll be sure to share what 
I've done.

Thanks Mark!

Tim

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Friday, March 29, 2013 1:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Basic auth on SolrCloud /admin/* calls

This has always been the case with Solr. Solr's security model is that clients 
should not have access to it - only trusted intermediaries should have access 
to it. Otherwise, it should be locked down at a higher level. That's been the 
case from day one and still is.

That said, someone did do some work on internode basic auth a while back, but 
it didn't raise a ton of interest yet.

- Mark

On Mar 29, 2013, at 2:09 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote:

 Yes, I should have mentioned this is under 4.2 Solr.
 
 I sort of expected what I'm doing might be unsupported, but basically my 
 concern is under the current SOLR design, any client with connectivity to 
 SOLR's port can perform Admin-level API calls like create/drop Cores or 
 Collections.
 
 I'm only aiming for '/solr/admin/*' calls to separate Application access 
 from the Administrative access logically, and not the non-admin calls like 
 '/update', although you can cause damage with '/update', too.
 
 I may try to patch the code to send Basic auth credentials on internal calls 
 just for fun, but I'm thinking longer-term authentication should be 
 implemented/added to the SOLR codebase (for at least admin calls) vs playing 
 with security at the container level, and having the app inside the container 
 aware of it.
 
  On the upside, in short testing I was able to get a Collection online using 
  the Cores API only, using curl calls w/basic auth. Only the Collections API is 
  affected, because the internal calls it makes do not carry auth.
 
 Cheers,
 
 Tim
 
 -Original Message-
 From: Isaac Hebsh [mailto:isaac.he...@gmail.com]
 Sent: Friday, March 29, 2013 12:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Basic auth on SolrCloud /admin/* calls
 
 Hi Tim,
 Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't return 
 any failure message. see SOLR-4043 issue).
 
  As far as I know, you can't tell Solr to use authentication credentials when 
  communicating with other nodes. It's a bigger issue... for example, if you want to 
  protect the /update requestHandler so unauthorized users won't delete your 
  whole collection, it can interfere with the replication process.
  
  I think it's a necessary mechanism in a production environment... I'm curious 
  how people use SolrCloud in production w/o it.
 
 
 
 
 
 On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim 
 tvaillanco...@ea.comwrote:
 
 Hey guys,
 
 I've recently setup basic auth under Jetty 8 for all my Solr 4.x 
 '/admin/*' calls, in order to protect my Collections and Cores API.
 
 Although the security constraint is working as expected ('/admin/*' 
 calls require Basic Auth or return 401), when I use the Collections 
 API to create a collection, I receive a 200 OK to the Collections API 
  CREATE call, but the background Cores API calls that are run on the 
 Collection API's behalf fail on the Basic Auth on other nodes with a
 401 code, as I should have foreseen, but didn't.
 
 Is there a way to tell SolrCloud to use authentication on internal 
 Cores API calls that are spawned on Collections API's behalf, or is 
 this a new feature request?
 
 To reproduce:
 
 1.   Implement basic auth on '/admin/*' URIs.
 
 2.   Perform a CREATE Collections API call to a node (which will
 return 200 OK).
 
 3.   Notice all Cores API calls fail (Collection isn't created). See
 stack trace below from the node that was issued the CREATE call.
 
 The stack trace I get is:
 
  org.apache.solr.common.SolrException: Server at http://HOST HERE:8983/solr
  returned non ok status:401, message:Unauthorized
  at
  

Re: Solr 4.2 - Slave Index version is higher than Master

2013-03-29 Thread Mark Miller
That's pretty weird stuff. As a workaround, you might stop replicating your 
conf files - that takes a sketchier path at the moment.

The key to solving this is to figure out how the heck the slave is increasing 
its gen… that should require a commit. In this case, *lots* of them. Commits 
that don't happen on the master. There should not be another way you can 
increase the gen…

Can you share your full logs?

- Mark

On Mar 29, 2013, at 5:03 PM, adityab aditya_ba...@yahoo.com wrote:

 Something is really wrong with replication.
 Check the attached document, which has the screenshots.
 I re-indexed the master after adding new fields to the schema file (it's part
 of config file replication).
 The UI shows the master as gen '6', whereas in the slave's log the master gen is '7'.
 
 The attached document has the screenshot captured. 
 Replication_Issue_4.2.docx
 http://lucene.472066.n3.nabble.com/file/n4052485/Replication_Issue_4.2.docx 
  
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-4-2-Slave-Index-version-is-higher-than-Master-tp4049827p4052485.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 4.2 - Slave Index version is higher than Master

2013-03-29 Thread adityab
@Mark attached are the full logs from both master and slave. Hope this might
be of some help.
console_master.log
http://lucene.472066.n3.nabble.com/file/n4052516/console_master.log  
console_slave.log
http://lucene.472066.n3.nabble.com/file/n4052516/console_slave.log  

Ignore the mbeans call in master log. I have a program that pings the master
every minute. 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-2-Slave-Index-version-is-higher-than-Master-tp4049827p4052516.html
Sent from the Solr - User mailing list archive at Nabble.com.


Getting better snippets in highlighting component

2013-03-29 Thread Jorge Luis Betancourt Gonzalez
Hi all:

I'm building a document search platform, basically indexing a lot of PDF 
files. Some of these files have an index, which means that when I query for 
"normativos" in my application (built using Symfony2+PHP+Solarium) I get a few 
results like this:

10
 6.2 Elementos normativos generales 
12
 6.3 Elementos normativos técnicos 
..32
 ANEXOS A Formas verbales (normativo

Which is a bit of a problem. Is there any way I can get rid of these dots? Is 
there any sort of relevance ranking for the snippets that the highlighting 
component returns? I mean, in this particular case the snippet came from the 
index page of the PDF, which I hardly think is the best snippet in the 
document for this particular query. Any thoughts on this? Is there any golden 
rule for treating cases like this?

Thanks a lot!
http://www.uci.cu


Re: Getting better snippets in highlighting component

2013-03-29 Thread Jack Krupansky
It looks like a table of contents. The dots are followed by the page number, 
followed by the text from the next table of contents entry, and repeat.


Even Google doesn't do anything special for this. For example, search for 
chapter 1 chapter 2 pdf:


[PDF]
2013 Publication 505 - Internal Revenue Service
www.irs.gov/pub/irs-pdf/p505.pdf
File Format: PDF/Adobe Acrobat
Mar 21, 2013 – Introduction . . . . . . . . . . . . . . . . . . 1. What's 
New for 2013 . . . . . . . . . . . . . 2. Reminders . . . . . . . . . . . . 
. . . . . . . 2. Chapter 1. Tax Withholding for ...


I'm sure somebody can come up with a clever heuristic to avoid this kind of 
thing.


Maybe simply truncate any sequence consisting only of whitespace and 
punctuation down to two or three characters or so.
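As a rough sketch of that idea: one could post-process the snippets on the 
application side before displaying them. The function name and regex below are 
illustrative only, not anything built into Solr or Solarium:

```python
import re

# Any run of three or more characters that are only whitespace, dots,
# middle dots, underscores, or hyphens (typical TOC dot leaders).
_LEADER_RUN = re.compile(r"[\s.\u00b7_\-]{3,}")

def clean_snippet(snippet: str) -> str:
    """Collapse dot-leader runs in a highlight snippet to a short ellipsis."""
    return _LEADER_RUN.sub(" ... ", snippet).strip()

# clean_snippet("6.2 Elementos normativos generales .......... 12")
# -> "6.2 Elementos normativos generales ... 12"
```

That doesn't make the snippet more relevant, of course; it just keeps the dot 
leaders from dominating the display.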


-- Jack Krupansky



Re: Getting better snippets in highlighting component

2013-03-29 Thread Jorge Luis Betancourt Gonzalez
Hi Jack:

Thanks for the reply. Exactly, I know it's a common thing to encounter these 
TOCs in a lot of files. I'm playing with the regex fragmenter to be a little 
more selective about the generated snippets, but no luck so far.
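In case it helps anyone else, this is roughly what the regex fragmenter setup 
from the stock 4.x example solrconfig.xml looks like (the values below are the 
example defaults, not a tested fix for this problem). Note that the character 
class in hl.regex.pattern does not include '.', so long dot-leader runs should 
break fragments rather than end up inside them:

```xml
<highlighting>
  <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- target fragment size, and how far the regex may stretch it -->
      <int name="hl.fragsize">70</int>
      <float name="hl.regex.slop">0.5</float>
      <!-- fragments must look like plain text: words, spaces, and a few
           punctuation marks, but no runs of dots -->
      <str name="hl.regex.pattern">[-\w ,/\n&quot;&apos;]{20,200}</str>
    </lst>
  </fragmenter>
</highlighting>
```

Selecting it per request with hl.fragmenter=regex (assuming the request 
handler doesn't already default to it) would be the way to try it out.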


http://www.uci.cu