RE: XPathentity processor on CLOB field

2015-06-18 Thread Pattabiraman, Meenakshisundaram

I got this working. The errors were due to a letter-case mistake: I was 
using 'datasource' instead of 'dataSource' in the entity that uses 
XPathEntityProcessor. The attribute was therefore ignored, and the entity 
inherited the JDBC data source of its parent entity.

I am pasting the complete data-config for anyone encountering the same problem.

<dataSource name="xmldata" type="FieldReaderDataSource"/>
<dataSource name="mbdev" driver="oracle.jdbc.driver.OracleDriver"
            url="jdbc:oracle:thin:@localhost:1521:orcl" user="orcl" password="orcl"/>
<document name="insight">
  <entity name="input" query="select * from test" logLevel="debug"
          dataSource="mbdev" transformer="ClobTransformer" onError="skip">
    <field column="LOAD_DATE" name="load_date"/>
    <field column="RESPONSE_XML" name="RESPONSE_XML" clob="true"/>
    <field column="id" name="id"/>
    <entity name="catReport" dataSource="xmldata"
            dataField="input.RESPONSE_XML" processor="XPathEntityProcessor"
            forEach="/DecisionServiceRs" rootEntity="true" logLevel="debug">
      <field column="event" xpath="/DecisionServiceRs/@event"/>
field column=policyNumber 

Re: Suggester for text array

2015-06-18 Thread Alessandro Benedetti
Hi Advait,
First of all, I suggest you study Solr a little bit [1], because your
requirements are actually really simple:

1) You can simply use more than one suggest dictionary if you care to keep
the suggestions separated (i.e. to keep track of whether a term comes from
the name or from the category).

If you don't care to keep them separated, simply use a copy field to copy
both fields into a single one.

2) Solr has supported multi-valued fields since the beginning.
I really suggest you split by comma in your indexer application,
providing the values to Solr already separated, because they are multiple
values for the category field (so it is not the analysis chain's
responsibility to split them).
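
As a rough illustration of the client-side split suggested above, here is a minimal sketch in plain Java. The class and method names are hypothetical, and the field name "category" is an assumption; in a SolrJ indexer each resulting value would typically be added to the multi-valued field one at a time.

```java
import java.util.ArrayList;
import java.util.List;

public class CategorySplitter {

    /** Splits a comma-separated category string into trimmed, non-empty values. */
    static List<String> splitCategories(String raw) {
        List<String> values = new ArrayList<>();
        if (raw == null) {
            return values;
        }
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> categories =
            splitCategories("Hair Care Products, Personal Care Products");
        // In an indexer, each value would be added separately to the
        // multi-valued field (assumed here to be called "category"),
        // e.g. with SolrInputDocument.addField("category", value).
        for (String category : categories) {
            System.out.println(category);
        }
    }
}
```

With the values indexed separately, a suggester built on that field can return "Personal Care Products" on its own rather than the whole comma-separated string.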



2015-06-18 13:43 GMT+01:00 Advait Suhas Pandit


 We run an ecommerce company and would like to use SOLR for our product
 database searches.

 We have products along with the categories that they belong to. In case
 the product belongs to more than 1 category, we have a comma separated
 field of categories.

 How do we do auto complete on -
 1. Multiple fields - product name, category
 2. On categories which are not first in the list in the case of the comma
 separated values
 E.g. If a product belongs to Hair Care Products, Personal Care Products
 how do we ensure that the suggester will even suggest if someone starts
 typing in Personal Care. Also, how do we show only Personal Care in the
 auto complete and not as Hair Care Products, Personal Care Products.



Benedetti Alessandro
Visiting card :

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England

Suggester for text array

2015-06-18 Thread Advait Suhas Pandit

We run an ecommerce company and would like to use SOLR for our product database 
searches.

We have products along with the categories that they belong to. In case the 
product belongs to more than 1 category, we have a comma separated field of 
categories.
How do we do auto complete on -
1. Multiple fields - product name, category
2. On categories which are not first in the list in the case of the comma 
separated values
E.g. If a product belongs to Hair Care Products, Personal Care Products how do 
we ensure that the suggester will even suggest if someone starts typing in 
Personal Care. Also, how do we show only Personal Care in the auto complete and 
not as Hair Care Products, Personal Care Products.


Re: Solr 5.2.1 on Solaris

2015-06-18 Thread Shawn Heisey
On 6/18/2015 8:05 AM, Bence Vass wrote:
 Is there any documentation on how to start Solr 5.2.1 on Solaris (Solaris
 10)? The script (solr start) doesn't work out of the box, is anyone running
 Solr 5.x on Solaris?

I think the biggest problem on Solaris will be the options used on the
ps command.  The ps usage in the solr script appears to be formulated
for the version of ps found on Linux and other free UNIX-like operating
systems, and I know from experience that those options don't work on Solaris.

The solr script also uses lsof, which I don't think is normally
installed on Solaris.  I'm not sure whether lsof is actually required,
or if the script will work without it.

I won't have time right away, but I will be able to look into this at
some point in the next few days and come up with a patch to make the
script work on Solaris.  If anybody else has the time and skill to do so
immediately, feel free to step in.


Re: Help: Problem in customized token filter

2015-06-18 Thread Aman Tandon
Please help: what am I doing wrong here? Please guide me.

With Regards
Aman Tandon

On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon


 I created a *token concat filter* to concatenate all the tokens from the token
 stream. It creates the concatenated token as expected.

 But when I post XML containing more than 30,000 documents, only the first
 document has data in that field.


 <field name="titlex" type="text" indexed="true" stored="false"
        required="false" omitNorms="false" multiValued="false" />

 <fieldType name="text" class="solr.TextField">
   <analyzer type="index">
     <charFilter class="solr.HTMLStripCharFilterFactory"/>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
             outputUnigrams="true" tokenSeparator=""/>
     <filter class="solr.SnowballPorterFilterFactory"
             language="English" protected="protwords.txt"/>
     <filter class="solr.SynonymFilterFactory"
             synonyms="stemmed_synonyms_text_prime_ex_index.txt" ignoreCase="true"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory"
             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords_text_prime_search.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"
             language="English" protected="protwords.txt"/>
   </analyzer>
 </fieldType>

 Please help me, The code for the filter is as follows, please take a look.

 Here is the picture of what filter is doing

 The code of concat filter is :



 import java.io.IOException;

 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
 import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
 import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

 public class ConcatenateWordsFilter extends TokenFilter {

   private CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
   private OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
   PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
   TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);

   // Accumulates the text of every alphanumeric token seen so far.
   private StringBuilder stringBuilder = new StringBuilder();
   // Set once the concatenated token has been emitted.
   private boolean exhausted = false;

   /**
    * Creates a new ConcatenateWordsFilter
    * @param input TokenStream that will be filtered
    */
   public ConcatenateWordsFilter(TokenStream input) {
     super(input);
   }

   /**
    * {@inheritDoc}
    */
   @Override
   public final boolean incrementToken() throws IOException {
     // Pass every input token through unchanged while collecting its text.
     while (!exhausted && input.incrementToken()) {
       char terms[] = charTermAttribute.buffer();
       int termLength = charTermAttribute.length();
       if (typeAtrr.type().equals("<ALPHANUM>")) {
         stringBuilder.append(terms, 0, termLength);
       }
       charTermAttribute.copyBuffer(terms, 0, termLength);
       return true;
     }

     // Input is used up: emit the concatenated token exactly once.
     if (!exhausted) {
       exhausted = true;
       String sb = stringBuilder.toString();
       System.err.println("The Data got is " + sb);
       int sbLength = sb.length();
       //posIncr.setPositionIncrement(0);
       charTermAttribute.copyBuffer(sb.toCharArray(), 0, sbLength);
       offsetAttribute.setOffset(offsetAttribute.startOffset(),
       stringBuilder.setLength(0);
       //typeAtrr.setType("CONCATENATED");
       return true;
     }

     return false;
   }
 }


 With Regards
 Aman Tandon

Solr 5.2.1 on Solaris

2015-06-18 Thread Bence Vass

Is there any documentation on how to start Solr 5.2.1 on Solaris (Solaris
10)? The script (solr start) doesn't work out of the box, is anyone running
Solr 5.x on Solaris?

- Thanks

Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-18 Thread Alessandro Benedetti
We would like more information, but the first thing I notice is that it
hardly makes any sense to use a string type for file content.

Can you give more details about the exception?
Have you debugged a little bit?
How does the Solr input document look before it is sent to Solr?

Furthermore, please give us the full stack trace. The message you posted is
almost useless without all the details...

2015-06-18 15:39 GMT+01:00 Paden


 I'm using Solr to pull information from a Database and a file system
 simultaneously. The database houses the file path of the file in the file
 system. It pulls all of those just fine. In fact, it combines the metadata
 from the database and the metadata from the file system great. The problem
 occurs when I try to index the text. The error does not occur at the point
 when it tries to add the field text to the document. The error occurs when
 I try to submit that document to Solr. It gives me this error,

 org.apache.solr.common.SolrException: Exception writing document id
 /some/filepath to the index; possible analysis error.

 This is how the field is defined in schema:

 <field name="text" type="string" indexed="true" stored="false"
        required="false" multiValued="true" />

 and this is the code I use to add it to the document:

 File file = new File(filepath);

 ContentHandler textHandler = new BodyContentHandler();

 Metadata metadata = new Metadata();

 ParseContext context = new ParseContext();

 InputStream input = new FileInputStream(file);


  autoParser.parse(input, textHandler, metadata, context);

 } catch (Exception e) {

   //prints out error message



 if(textHandler != null){





 } catch (Exception ex){




 I think it has something to do with how the field is defined in schema but I
 don't know. All the files that get error messages are PDF's if that helps.
 There are .doc s in the file system but they don't error out.


Solr Logging

2015-06-18 Thread rbkumar88

I want to log Solr search queries (with response times) and Solr indexing
activity separately, in different sets of log files.
Is there a convenient framework or way to do this?


Managed schema and schema.xml file

2015-06-18 Thread Steven White
Hi everyone,

I just upgraded from 5.1.0 to 5.2.1 and noticed a behavior change which I
consider a bug.

In my solrconfig.xml, I have the following:

   <!-- <schemaFactory class="ClassicIndexSchemaFactory"/> -->
   <schemaFactory class="ManagedIndexSchemaFactory">
     <bool name="mutable">true</bool>
     <str name="managedSchemaResourceName">my-schema.xml</str>
   </schemaFactory>

In 5.1.0 (and maybe prior ver.?) when I enable managed schema per the
above, the existing schema.xml file is left as-is, a copy of it is created
as schema.xml.bak and a new one is created based on the name I gave it

With 5.2.1 schema.xml is renamed to schema.xml.bak and my-schema.xml is
created (i.e. schema.xml is deleted).

Is this an expected behavior or is this a bug?  I see it as a bug because
if I revert the change I made in my solrconfig.xml (i.e. stop using the
managed schema):

  <schemaFactory class="ClassicIndexSchemaFactory"/>

Solr will not restart because it cannot find schema.xml.



Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-18 Thread Paden

I'm using Solr to pull information from a Database and a file system
simultaneously. The database houses the file path of the file in the file
system. It pulls all of those just fine. In fact, it combines the metadata
from the database and the metadata from the file system great. The problem
occurs when I try to index the text. The error does not occur at the point
when it tries to add the field text to the document. The error occurs when
I try to submit that document to Solr. It gives me this error, 

org.apache.solr.common.SolrException: Exception writing document id
/some/filepath to the index; possible analysis error. 

This is how the field is defined in schema:

<field name="text" type="string" indexed="true" stored="false"
       required="false" multiValued="true" />

and this is the code I use to add it to the document:

File file = new File(filepath); 

ContentHandler textHandler = new BodyContentHandler(); 

Metadata metadata = new Metadata();

ParseContext context = new ParseContext();

InputStream input = new FileInputStream(file);


 autoParser.parse(input, textHandler, metadata, context); 

} catch (Exception e) { 

  //prints out error message



if(textHandler != null){




} catch (Exception ex){ 




I think it has something to do with how the field is defined in schema but I
don't know. All the files that get error messages are PDF's if that helps.
There are .doc s in the file system but they don't error out. 

Re: Managed schema and schema.xml file

2015-06-18 Thread Shawn Heisey
On 6/18/2015 8:10 AM, Steven White wrote:
 In 5.1.0 (and maybe prior ver.?) when I enable managed schema per the
 above, the existing schema.xml file is left as-is, a copy of it is created
 as schema.xml.bak and a new one is created based on the name I gave it
 With 5.2.1 schema.xml is renamed to schema.xml.bak and my-schema.xml is
 created (e.g.: schema.xml is deleted).
 Is this an expected behavior or is this a bug?  I see it as a bug because
 if I revert the change I made in my solrconfig.xml back to (i.e.: not
 managed schema any more):
   schemaFactory class=ClassicIndexSchemaFactory/
 Solr will not restart because it cannot find schema.xml

As I understand it, the managed schema system will complain if it sees a
file named schema.xml -- having both the managed schema file and
schema.xml is confusing, so if the classic file exists, it's an error.

Because of that, if you switch your config from managed to classic
schema, you must also create the schema.xml file (or rename the managed
version).  Neither factory is aware of the other, so there's no
automated way to handle that.
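
For anyone switching back by hand, the "rename the managed version" step could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, the managed file name my-schema.xml is taken from the question, and it assumes Solr is stopped and that the file sits in the core's conf directory. It copies rather than renames, so the managed file is kept.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RestoreClassicSchema {

    /**
     * Copies conf/<managedName> to conf/schema.xml.
     * Files.copy refuses to overwrite, so an existing schema.xml is left alone.
     */
    static Path restore(Path confDir, String managedName) {
        try {
            return Files.copy(confDir.resolve(managedName), confDir.resolve("schema.xml"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the copy against a throwaway temp directory instead of a real core. */
    static boolean demo() {
        try {
            Path conf = Files.createTempDirectory("conf");
            Files.write(conf.resolve("my-schema.xml"),
                        "<schema/>".getBytes(StandardCharsets.UTF_8));
            Path restored = restore(conf, "my-schema.xml");
            return Files.exists(restored)
                && restored.getFileName().toString().equals("schema.xml");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

After the copy, solrconfig.xml would still need to point back at ClassicIndexSchemaFactory before Solr is restarted.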


Re: Dedupe in a SolrCloud

2015-06-18 Thread Markus Mirsberger
Thanks :) 
That is exactly what I was looking for. I only need to create the signature
once, so this works perfectly for me :)



 On 17.06.2015, at 20:32, Shalin Shekhar Mangar wrote:
 Comments inline:
 On Wed, Jun 17, 2015 at 3:18 PM, Markus.Mirsberger wrote:
 I am trying to use the dedupe feature to detect and mark near-duplicate
 content in my collections.
 I don't want to prevent duplicate content. I would like to detect it and keep
 it for further processing. That's why I'm using an extra field and not the
 document's unique field.
 Here is how I added it to solrconfig.xml:
 <requestHandler name="/update" class="solr.UpdateRequestHandler">
   <lst name="defaults">
     <str name="update.chain">fill_signature</str>
   </lst>
 </requestHandler>
 <updateRequestProcessorChain name="fill_signature">
   <processor class="solr.RunUpdateProcessorFactory" />
   <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">signature</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">content</str>
     <str name="quantRate">.2</str>
     <str name="minTokenLen">3</str>
   </updateProcessor>
 </updateRequestProcessorChain>
 When I initially add the documents to the cloud, everything works as expected:
 the documents are added and the signature will be created and
 The problem occurs when I want to update an existing document. In that
 case the update.chain=fill_signature parameter will of course be set too and
 I get a bad request error.
 I found this solr issue:
 Is it that problem I am running into?
 You haven't pasted the complete error response so I am guessing a bit
 here. It is possible that you are running into the same problem, i.e.
 the signature is being calculated again and, because the signature field is
 not multi-valued, this causes an error.
 Is it somehow possible to add parameters or set a specific update Handler
 when Im adding documents to the cloud using solrJ?
 Yes, any custom parameter can be added to a SolrJ request. There is a
 setParam(String param, String value) method available in
 AbstractUpdateRequest which can be used to set a custom update.chain
 for each SolrJ request.
 In that case I could either set the update.chain manually and remove it from
 the request handler, or write a second request handler which I only use when I
 want to set the signature field.
 I know I can do that manually when I'm using e.g. curl, but is it also possible
 with SolrJ? :)
 Shalin Shekhar Mangar.

Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-18 Thread Paden
USING Solr 5.1.0

This is the schema file

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="example" version="1.5">
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="_root_" type="string" indexed="true" stored="false"/>

   <field name="id" type="string" indexed="true" stored="true"
          required="false" multiValued="false" />
   <field name="filepath" type="string" indexed="true" stored="true"
          required="false" multiValued="false" />
   <field name="title" type="string" indexed="true" stored="true"
          required="false" multiValued="false" />
   <field name="author" type="string" indexed="true" stored="true"
          required="false" multiValued="false" />
   <field name="text" type="string" indexed="true" stored="false"
          required="false" multiValued="true" />
   <field name="key" type="string" indexed="true" stored="false"
          required="false" multiValued="false" />

   dynamicField name=*_name  type=text_general   multiValued=false
indexed=true  stored=true /

   dynamicField name=*_i  type=intindexed=true  stored=true/
   dynamicField name=*_is type=intindexed=true  stored=true 
   dynamicField name=*_s  type=string  indexed=true  stored=true /
   dynamicField name=*_ss type=string  indexed=true  stored=true
   dynamicField name=*_l  type=long   indexed=true  stored=true/
   dynamicField name=*_ls type=long   indexed=true  stored=true 
   dynamicField name=*_t  type=text_generalindexed=true 
   dynamicField name=*_txt type=text_general   indexed=true 
stored=true multiValued=true/
   dynamicField name=*_en  type=text_enindexed=true 
stored=true multiValued=true/
   dynamicField name=*_b  type=boolean indexed=true stored=true/
   dynamicField name=*_bs type=boolean indexed=true stored=true 
   dynamicField name=*_f  type=float  indexed=true  stored=true/
   dynamicField name=*_fs type=float  indexed=true  stored=true 
   dynamicField name=*_d  type=double indexed=true  stored=true/
   dynamicField name=*_ds type=double indexed=true  stored=true 

   dynamicField name=*_coordinate  type=tdouble indexed=true 
stored=false /

   dynamicField name=*_dt  type=dateindexed=true  stored=true/
   dynamicField name=*_dts type=dateindexed=true  stored=true
   dynamicField name=*_p  type=location indexed=true stored=true/

   dynamicField name=*_ti type=tintindexed=true  stored=true/
   dynamicField name=*_tl type=tlong   indexed=true  stored=true/
   dynamicField name=*_tf type=tfloat  indexed=true  stored=true/
   dynamicField name=*_td type=tdouble indexed=true  stored=true/
   dynamicField name=*_tdt type=tdate  indexed=true  stored=true/

   dynamicField name=*_c   type=currency indexed=true 

   dynamicField name=ignored_* type=ignored multiValued=true/
   dynamicField name=attr_* type=text_general indexed=true
stored=true multiValued=true/

   dynamicField name=random_* type=random /


fieldType name=string class=solr.StrField sortMissingLast=true /

fieldType name=boolean class=solr.BoolField

fieldType name=int class=solr.TrieIntField precisionStep=0
fieldType name=float class=solr.TrieFloatField precisionStep=0
fieldType name=long class=solr.TrieLongField precisionStep=0
fieldType name=double class=solr.TrieDoubleField precisionStep=0

fieldType name=tint class=solr.TrieIntField precisionStep=8
fieldType name=tfloat class=solr.TrieFloatField precisionStep=8
fieldType name=tlong class=solr.TrieLongField precisionStep=8
fieldType name=tdouble class=solr.TrieDoubleField precisionStep=8

fieldType name=date class=solr.TrieDateField precisionStep=0

fieldType name=tdate class=solr.TrieDateField precisionStep=6

fieldType name=binary class=solr.BinaryField/

fieldType name=random class=solr.RandomSortField indexed=true /

fieldType name=text_ws class=solr.TextField
tokenizer class=solr.WhitespaceTokenizerFactory/

fieldType name=text_general class=solr.TextField
  analyzer type=index
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true
words=stopwords.txt /

filter class=solr.LowerCaseFilterFactory/
  analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true
words=stopwords.txt /
filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
ignoreCase=true expand=true/
filter class=solr.LowerCaseFilterFactory/

Re: Collections API and adding new boxes

2015-06-18 Thread Erick Erickson
See particularly the ADDREPLICA command and the
node parameter. You might not even need the node
parameter since when you add a replica Solr does its
best to put the new replica on an underutilized node.
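
To make the ADDREPLICA call concrete, here is a small sketch that only builds the Collections API request URL. The action, collection, shard and node parameters are the ones the Collections API defines for ADDREPLICA; the class name, host, collection name and node name below are made up for illustration, and the code assumes Java 10 or later for URLEncoder.encode(String, Charset).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class AddReplicaUrl {

    /**
     * Builds an ADDREPLICA request URL. Pass null for node to let Solr
     * choose where the new replica goes.
     */
    static String build(String solrBaseUrl, String collection, String shard, String node) {
        StringBuilder url = new StringBuilder(solrBaseUrl)
            .append("/admin/collections?action=ADDREPLICA")
            .append("&collection=").append(encode(collection))
            .append("&shard=").append(encode(shard));
        if (node != null) {
            url.append("&node=").append(encode(node));
        }
        return url.toString();
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical values: adjust host, collection, shard and node name.
        System.out.println(build("http://localhost:8983/solr", "mycollection", "shard1", null));
        System.out.println(build("http://localhost:8983/solr", "mycollection", "shard1",
                                 "host2:8983_solr"));
    }
}
```

The resulting URL can be requested with a browser, curl or any HTTP client once the new node has registered itself in ZooKeeper.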


On Thu, Jun 18, 2015 at 2:58 PM, Shawn Heisey wrote:
 On 6/18/2015 3:23 PM, Jim.Musil wrote:
 Let's say I have a zookeeper ensemble with several Solr nodes connected to 
 it. I've created a collection successfully and all is well.

 What happens when I want to add another solr node?

 I've tried spinning one up and connecting it to zookeeper, but the new node 
 doesn't join the collection.  What's the expected next step?

 This is Solr 5.1.

 The new node will be part of the cloud as soon as it starts, but until
 you take action with the Collections API, it will not have any indexes
 on it.  SolrCloud does not automatically create replicas except in a
 very specific set of circumstances that I do not think are very common.

 You'll need to either create a new collection or take steps to modify
 your current collection(s) so that one or more shard replicas are
 located on the new node.


Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-18 Thread Erick Erickson
The stack trace is what gets returned to the client, right? It's often
much more informative to see the Solr log output, the error message
is often much more helpful there. By the time the exception bubbles
up through the various layers vital information is sometimes not returned
to the client in the error message.

One precaution I would take, since you've changed the schema, is to
_completely_ remove the index:
1. Shut down Solr.
2. rm -rf coreX/data
3. Restart Solr.
4. Try it again.

Lucene doesn't really care whether a field gets indexed one way in one
document and another way in the next, and having a field indexed in
different ways (string and text) in different documents of the same index
occasionally confuses things.


On Thu, Jun 18, 2015 at 10:31 AM, Paden wrote:
 Just rolling out a little bit more information as it is coming. I changed the
 field type in the schema to text_general and that didn't change a thing.

 Another thing is that it's consistently submitting/not submitting the same
 documents. I will run over it one time and it won't index a set of
 documents. When I clear the index and run the program again it
 submits/doesn't submit the same documents.

 And it will index certain PDF's it just won't index others. Which is weird
 because I printed the strings that are submitted to Solr and the ones that
 get submitted are really similar to the ones that aren't submitted.

 I can't post the actual strings for sensitivity reasons.

Re: Solr 4.10.4: Could not create instance of 'SolrInputDocument'

2015-06-18 Thread Erick Erickson
No clue whatsoever; you haven't provided nearly enough details. I rather
doubt that many people on this list really understand the interactions of
that technology stack. I certainly don't.

I'd ask on the ColdFusion list, as they're (apparently) the ones who've
integrated a Solr connector of sorts. What evidence do you have that using
a stock Solr is even possible? For all I know, the Solr provided with CF
has some kind of customizations
(maybe a plugin?) that is


On Thu, Jun 18, 2015 at 5:22 AM, Paul Revere wrote:
 Our web site is created using PaperThin's CommonSpot CMS in a ColdFusion 10 
 and Windows Server 2008 R2 environment, using Apache Solr 4.10.4 instead of 
 CF Solr. We create collections through the CMS interface and they do appear 
 in both the CMS and the Solr dashboard when created. However, when we try 
 indexing our collections through the CMS interface, our CMS error logs show 
 the entry 'Could not create instance of 'SolrInputDocument'' for each member 
 of the collection. This is not a fatal error, as the indexing appears to 
 cycle through all members, but each member errors out with log entries for 
 each member.  I've Googled this error message without success. What might 
 this error message indicate please??

Re: Help: Problem in customized token filter

2015-06-18 Thread Aman Tandon
Hi Steve,

  you never set exhausted to false, and when the filter got reused, *it
 incorrectly carried state from the previous document.*

Thanks for replying, but I am not able to understand this.

With Regards
Aman Tandon

On Fri, Jun 19, 2015 at 10:25 AM, Steve Rowe wrote:

 Hi Aman,

 The admin UI screenshot you linked to is from an older version of Solr -
 what version are you using?

 Lots of extraneous angle brackets and asterisks got into your email and
 made for a bunch of cleanup work before I could read or edit it.  In the
 future, please put your code somewhere people can easily read it and
 copy/paste it into an editor: into a github gist or on a paste service, etc.

 Looks to me like your use of “exhausted” is unnecessary, and is likely the
 cause of the problem you saw (only one document getting processed): you
 never set exhausted to false, and when the filter got reused, it
 incorrectly carried state from the previous document.

 Here’s a simpler version that’s hopefully more correct and more efficient
 (2 fewer copies from the StringBuilder to the final token).  Note: I didn’t
 test it:



Re: Help: Problem in customized token filter

2015-06-18 Thread Steve Rowe

My version won’t produce anything at all, since incrementToken() always returns 
false.

I updated the gist (at the same URL) to fix the problem by returning true from 
incrementToken() once and then false until reset() is called.  It also handles 
the case when the concatenated token is zero length by not emitting a token.
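
The reuse behaviour described above can be imitated without Lucene. The following is a simplified stand-in, not the gist and not Lucene API: ConcatFilter, setInput, next and analyze are hypothetical names. It shows a component that emits one concatenated token per document and then nothing until reset() is called; the third call skips reset() on purpose, to imitate a filter that never clears its flag between documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ConcatReuseSketch {

    /** Stand-in for a reusable filter that emits one concatenated token per document. */
    static class ConcatFilter {
        private Iterator<String> input;
        private final StringBuilder sb = new StringBuilder();
        private boolean done = false;

        void setInput(List<String> tokens) {
            input = tokens.iterator();
        }

        /** Clears per-document state; plays the role of TokenStream.reset(). */
        void reset() {
            done = false;
            sb.setLength(0);
        }

        /** Returns the concatenated token once, then null until reset() is called. */
        String next() {
            if (done) {
                return null;
            }
            while (input.hasNext()) {
                sb.append(input.next());
            }
            done = true;
            // A zero-length result is not emitted at all.
            return sb.length() == 0 ? null : sb.toString();
        }
    }

    /** Feeds one "document" through the same filter instance and collects its output. */
    static List<String> analyze(ConcatFilter filter, List<String> tokens, boolean callReset) {
        filter.setInput(tokens);
        if (callReset) {
            filter.reset();
        }
        List<String> out = new ArrayList<>();
        for (String token = filter.next(); token != null; token = filter.next()) {
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        ConcatFilter filter = new ConcatFilter();
        System.out.println(analyze(filter, Arrays.asList("mp3", "player"), true));
        System.out.println(analyze(filter, Arrays.asList("hair", "care"), true));
        // No reset: the stale "done" flag from the previous document suppresses output.
        System.out.println(analyze(filter, Arrays.asList("lost"), false));
    }
}
```

The empty result for the last document corresponds to the symptom in this thread, where only the first document ended up with data in the field.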


 On Jun 19, 2015, at 12:55 AM, Steve Rowe wrote:
 Hi Aman,
 The admin UI screenshot you linked to is from an older version of Solr - what 
 version are you using?
 Lots of extraneous angle brackets and asterisks got into your email and made 
 for a bunch of cleanup work before I could read or edit it.  In the future, 
 please put your code somewhere people can easily read it and copy/paste it 
 into an editor: into a github gist or on a paste service, etc.
 Looks to me like your use of “exhausted” is unnecessary, and is likely the 
 cause of the problem you saw (only one document getting processed): you never 
 set exhausted to false, and when the filter got reused, it incorrectly 
 carried state from the previous document.
 Here’s a simpler version that’s hopefully more correct and more efficient (2 
 fewer copies from the StringBuilder to the final token).  Note: I didn’t test 

Auto-suggest in Solr

2015-06-18 Thread Zheng Lin Edwin Yeo
I'm implementing an auto-suggest feature in Solr, and I'd like to achieve
the following:

For example, if the user enters "mp3", Solr might suggest "mp3 player",
"mp3 nano" and "mp3 music".
When the user enters "mp3 p", the suggestion should narrow down to mp3

Currently, when I type "mp3 p", the suggester returns words that
start with the letter "p" only, so I get results like "plan",
"production", etc.; it does not take the "mp3" token into consideration.

I'm using Solr 5.1 and below is my configuration:

In solrconfig.xml:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">

    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="indexPath">suggester_freetext_dir</str>

    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">Suggestion</str>
    <str name="weightField">Project</str>
    <str name="suggestFreeTextAnalyzerFieldType">suggestType</str>
    <int name="ngrams">5</int>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>

In schema.xml

<fieldType name="suggestType" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^a-zA-Z0-9]" replacement=" " />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="6" outputUnigrams="false"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^a-zA-Z0-9]" replacement=" " />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="6" outputUnigrams="true"/>
  </analyzer>
</fieldType>

Is there anything that I configured wrongly?


Re: Help: Problem in customized token filter

2015-06-18 Thread Steve Rowe
Hi Aman,

The admin UI screenshot you linked to is from an older version of Solr - what 
version are you using?

Lots of extraneous angle brackets and asterisks got into your email and made 
for a bunch of cleanup work before I could read or edit it.  In the future, 
please put your code somewhere people can easily read it and copy/paste it into 
an editor: into a github gist or on a paste service, etc.

Looks to me like your use of “exhausted” is unnecessary, and is likely the 
cause of the problem you saw (only one document getting processed): you never 
set exhausted to false, and when the filter got reused, it incorrectly carried 
state from the previous document.

Here’s a simpler version that’s hopefully more correct and more efficient (2 
fewer copies from the StringBuilder to the final token).  Note: I didn’t test 


 On Jun 18, 2015, at 11:33 AM, Aman Tandon wrote:
 Please help, what wrong I am doing here. please guide me.
 With Regards
 Aman Tandon
 On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon
 I created a *token concat filter* to concat all the tokens from token
 stream. It creates the concatenated token as expected.
 But when I am posting the xml containing more than 30,000 documents, then
 only first document is having the data of that field.
 *field name=titlex type=text indexed=true stored=false
 required=false omitNorms=false multiValued=false /*
 *fieldType name=text class=solr.TextField
 *  analyzer type=index*
 *charFilter class=solr.HTMLStripCharFilterFactory/*
 *tokenizer class=solr.StandardTokenizerFactory/*
 *filter class=solr.WordDelimiterFilterFactory
 generateWordParts=1 generateNumberParts=1 catenateWords=0
 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/*
 *filter class=solr.LowerCaseFilterFactory/*
 *filter class=solr.ShingleFilterFactory maxShingleSize=3
 outputUnigrams=true tokenSeparator=/*
 *filter class=solr.SnowballPorterFilterFactory
 language=English protected=protwords.txt/*
 *filter class=solr.SynonymFilterFactory
 synonyms=stemmed_synonyms_text_prime_ex_index.txt ignoreCase=true
 *  /analyzer*
 *  analyzer type=query*
 *tokenizer class=solr.StandardTokenizerFactory/*
 *filter class=solr.SynonymFilterFactory
 synonyms=synonyms.txt ignoreCase=true expand=true/*
 *filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords_text_prime_search.txt enablePositionIncrements=true /*
 *filter class=solr.WordDelimiterFilterFactory
 generateWordParts=1 generateNumberParts=1 catenateWords=0
 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/*
 *filter class=solr.LowerCaseFilterFactory/*
 *filter class=solr.SnowballPorterFilterFactory
 language=English protected=protwords.txt/*
 *  /analyzer**/fieldType*
 Please help me, The code for the filter is as follows, please take a look.
 Here is the picture of what filter is doing
 The code of concat filter is :
 *import org.apache.lucene.analysis.TokenFilter;*
 *import org.apache.lucene.analysis.TokenStream;*
 *import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;*
 *import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;*
 *import org.apache.lucene.analysis.tokenattributes.TypeAttribute;*
 *public class ConcatenateWordsFilter extends TokenFilter {*
 *  private CharTermAttribute charTermAttribute =
 *  private OffsetAttribute offsetAttribute =
 *  PositionIncrementAttribute posIncr =
 *  TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);*
 *  private StringBuilder stringBuilder = new StringBuilder();*
 *  private boolean exhausted = false;*
 *  /***
 *   * Creates a new ConcatenateWordsFilter*
 *   * @param input TokenStream that will be filtered*
 *   */*
 *  public ConcatenateWordsFilter(TokenStream input) {*
 *  }*
 *  /***
 *   * {@inheritDoc}*
 *   */*
 *  @Override*
 *  public final boolean incrementToken() throws IOException {*
 *while (!exhausted  input.incrementToken()) {*
 *  char terms[] = charTermAttribute.buffer();*
 *  int termLength = charTermAttribute.length();*
 *  if(typeAtrr.type().equals(ALPHANUM)){*
 * stringBuilder.append(terms, 0, 

How to append new data to index in solr?

2015-06-18 Thread ??????
 I'm a Solr user with a question. I want to append new data to an
existing index. Does Solr support appending new data to an index?
 Thanks for any reply.
Best wishes.

Re: Help: Problem in customized token filter

2015-06-18 Thread Aman Tandon
Yes I just saw.

With Regards
Aman Tandon

On Fri, Jun 19, 2015 at 10:39 AM, Steve Rowe wrote:


 My version won’t produce anything at all, since incrementToken() always
 returns false…

 I updated the gist (at the same URL) to fix the problem by returning true
 from incrementToken() once and then false until reset() is called.  It also
 handles the case when the concatenated token is zero length by not emitting
 a token.



Re: Help: Problem in customized token filter

2015-06-18 Thread Steve Rowe

Solr uses the same token filter instances over and over, calling reset() before 
sending each document through.  Your code sets “exhausted” to true and then 
never sets it back to false, so the next time the token filter instance is 
used, its “exhausted” value is still true, so no input stream tokens are 
concatenated ever again.

Does that make sense?
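
The reuse pattern described above can be modeled in plain Java. This is only a sketch: it does not use the Lucene API, the class and method names are hypothetical, and reset() takes the token list as an argument purely to keep the example self-contained. It shows the state being cleared on every reset(), one concatenated token being emitted per document, and nothing being emitted for an empty concatenation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * Simplified stand-in for a reusable token filter. It does NOT use the Lucene
 * API; names are hypothetical and only mirror the reset()/incrementToken()
 * lifecycle discussed in this thread.
 */
class ConcatFilterModel {
    private Iterator<String> input = new ArrayList<String>().iterator();
    private final StringBuilder buffer = new StringBuilder();
    private boolean done = false;   // plays the role of "exhausted"
    private String term = null;     // the current output token

    /** Called before each new document, as happens when an instance is reused. */
    void reset(List<String> tokens) {
        input = tokens.iterator();
        buffer.setLength(0);
        done = false;               // without this, only the first document works
        term = null;
    }

    /** Returns true once with the concatenated token, then false until reset(). */
    boolean incrementToken() {
        if (done) {
            return false;
        }
        while (input.hasNext()) {
            buffer.append(input.next());
        }
        done = true;
        if (buffer.length() == 0) {
            return false;           // emit nothing for an empty concatenation
        }
        term = buffer.toString();
        return true;
    }

    String term() {
        return term;
    }

    public static void main(String[] args) {
        ConcatFilterModel filter = new ConcatFilterModel();
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("solr", "train"),
            Arrays.asList("token", "filter"));
        for (List<String> doc : docs) {
            filter.reset(doc);      // same instance, fresh state per document
            while (filter.incrementToken()) {
                System.out.println(filter.term());
            }
        }
    }
}
```

If the `done = false` line in reset() is removed, the second document in main() produces no output, which matches the symptom of only the first document getting the field.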


 On Jun 19, 2015, at 1:10 AM, Aman Tandon wrote:
 Hi Steve,
 you never set exhausted to false, and when the filter got reused, *it
 incorrectly carried state from the previous document.*
 Thanks for replying, but I am not able to understand this.
 With Regards
 Aman Tandon

Re: How to do a Data sharding for data in a database table

2015-06-18 Thread Erick Erickson
You've repeated your original statement. Shawn's
observation is that 10M docs is a very small corpus
by Solr standards. You either have very demanding
document/search combinations or you have a poorly
tuned Solr installation.

On reasonable hardware I expect 25-50M documents to have
sub-second response time.

So what we're trying to do is be sure this isn't
an XY problem, from Hossman's apache page:

Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also:

So again, how would you characterize your documents? How many
fields? What do queries look like? How much physical memory on the
machine? How much memory have you allocated to the JVM?

You might review:


On Thu, Jun 18, 2015 at 3:23 PM, wwang525 wrote:
 The query without load is still under 1 second. But under load, response time
 can be much longer due to the queued up query.

 We would like to shard the data to something like 6 M / shard, which will
 still give a under 1 second response time under load.

 What are some best practice to shard the data? for example, we could shard
 the data by date range, but that is pretty dynamic, and we could shard data
 by some other properties, but if the data is not evenly distributed, you may
 not be able shard it anymore.

How to do a Data sharding for data in a database table

2015-06-18 Thread wwang525

We probably would like to shard the data since the response time for
demanding queries at  10M records is getting  1 second in a single request

I have not done any data sharding before. What are some recommended ways to
do data sharding? For example, maybe by a criterion with a list of specific

Re: MappingCharFilterFactory and start and end offsets

2015-06-18 Thread Steve Rowe
Hi Dmitry,

It’s weird that start and end offsets are the same - what do you see for the 
start/end of ‘$’, i.e. if you take out MCFF?  (I think it should be start:5, 

As far as offsets “respecting the remapped token”, are you asking for offsets 
to be set as if ‘dollarsign’ were part of the original text?  If so, there is 
no setting that would do that - the intent is for offsets to map to the 
*original* text.  You can work around this by performing the substitution prior 
to Solr analysis, e.g. in an update processor like RegexReplaceProcessorFactory.
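
As a rough illustration of that workaround, the sketch below performs the substitution on the raw string before any analysis, so that positions are counted against the rewritten text. It is plain Java with a literal String.replace, not an actual update processor; the class and method names are made up for the example.

```java
class PreAnalysisSubstitution {
    /** Replaces every literal '$' with " dollarsign " before the text is analyzed. */
    static String substitute(String original) {
        return original.replace("$", " dollarsign ");
    }

    public static void main(String[] args) {
        String original = "test $ test2";
        String rewritten = substitute(original);

        // Position of '$' in the original text.
        int origStart = original.indexOf('$');
        System.out.println("original:  '$' at " + origStart + "-" + (origStart + 1));

        // Position of the replacement token in the rewritten text.
        int start = rewritten.indexOf("dollarsign");
        int end = start + "dollarsign".length();
        System.out.println("rewritten: 'dollarsign' at " + start + "-" + end);
    }
}
```

The trade-off is that the stored value then also contains the rewritten text, since the change happens before the document reaches the analysis chain.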


 On Jun 18, 2015, at 3:07 AM, Dmitry Kan wrote:
 It looks like MappingCharFilter sets start and end offset to the same
 value. Can this be affected on by some setting?
 For a string: test $ test2 and mapping $ =  dollarsign  (we insert
 extra space to separate $ into its own token)
 we get:
 Ideally, we would like to have start and end offset respecting the remapped
 token. Can this be achieved with settings?
 Dmitry Kan
 Luke Toolbox:


Re: Error when submitting PDF to Solr w/text fields using SolrJ

2015-06-18 Thread Paden
Just rolling out a little bit more information as it is coming. I changed the
field type in the schema to text_general and that didn't change a thing. 

Another thing is that it's consistently submitting/not submitting the same
documents. I will run over it one time and it won't index a set of
documents. When I clear the index and run the program again it
submits/doesn't submit the same documents. 

It will index certain PDFs but not others, which is weird
because I printed the strings that are submitted to Solr and the ones that
get submitted are really similar to the ones that aren't submitted. 

I can't post the actual strings for sensitivity reasons. 

Collections API and adding new boxes

2015-06-18 Thread Jim . Musil

Let's say I have a zookeeper ensemble with several Solr nodes connected to it. 
I've created a collection successfully and all is well.

What happens when I want to add another solr node?

I've tried spinning one up and connecting it to zookeeper, but the new node 
doesn't join the collection.  What's the expected next step?

This is Solr 5.1.

Jim Musil

Re: How to do a Data sharding for data in a database table

2015-06-18 Thread wwang525
The query without load is still under 1 second. But under load, response time
can be much longer due to queued-up queries.

We would like to shard the data to something like 6M documents per shard, which
should still give a response time under 1 second under load.

What are some best practices for sharding the data? For example, we could shard
the data by date range, but that is pretty dynamic. We could also shard the data
by some other property, but if the data is not evenly distributed, you may
not be able to shard it anymore.
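
One common way around the uneven-distribution concern is to route on a hash of the unique key instead of a business property. The sketch below is only an illustration of that idea in plain Java: it uses String.hashCode, a made-up "doc-N" id scheme and a shard count of 2, and it is not how SolrCloud's own document routing is implemented.

```java
import java.util.HashMap;
import java.util.Map;

class HashShardingSketch {
    /** Maps a document id to a shard index in the range [0, numShards). */
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 2;  // hypothetical shard count
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            counts.merge(shardFor("doc-" + i, numShards), 1, Integer::sum);
        }
        // Prints how many of the sample ids landed on each shard.
        System.out.println(counts);
    }
}
```

Because the shard depends only on the id, the same document always lands on the same shard, and the split does not depend on how dates or other properties are distributed.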

Re: How to do a Data sharding for data in a database table

2015-06-18 Thread Jack Krupansky
10M doesn't sound too demanding.

How complex are your queries?

How complex is your data - like number of fields and size, like very large

Are you sure you have enough RAM to fully cache your index?

Are your queries compute-bound or I/O bound? If I/O-bound, get more RAM. If
compute-bound, sharding may help, but have to examine query complexity

-- Jack Krupansky

On Thu, Jun 18, 2015 at 2:05 PM, wwang525 wrote:


 We probably would like to shard the data since the response time for
 demanding queries at  10M records is getting  1 second in a single

 I have not done any data sharding before. What are some recommended way to
 do data sharding. For example, may be by a criteria with a list of specific

Re: Collections API and adding new boxes

2015-06-18 Thread Shawn Heisey
On 6/18/2015 3:23 PM, Jim.Musil wrote:
 Let's say I have a zookeeper ensemble with several Solr nodes connected to 
 it. I've created a collection successfully and all is well.

 What happens when I want to add another solr node?

 I've tried spinning one up and connecting it to zookeeper, but the new node 
 doesn't join the collection.  What's the expected next step?

 This is Solr 5.1.

The new node will be part of the cloud as soon as it starts, but until
you take action with the Collections API, it will not have any indexes
on it.  SolrCloud does not automatically create replicas except in a
very specific set of circumstances that I do not think are very common.

You'll need to either create a new collection or take steps to modify
your current collection(s) so that one or more shard replicas are
located on the new node.
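
For the second option, the Collections API has an ADDREPLICA action that takes collection, shard and node parameters. The sketch below only builds such a request URL in plain Java; the base URL, collection name, shard name and node name are placeholders, so substitute the values from your own cluster state.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class AddReplicaUrlSketch {
    /** URL-encodes a single query parameter value as UTF-8. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds a Collections API ADDREPLICA request URL. */
    static String addReplicaUrl(String solrBase, String collection, String shard, String node) {
        return solrBase + "/admin/collections?action=ADDREPLICA"
            + "&collection=" + enc(collection)
            + "&shard=" + enc(shard)
            + "&node=" + enc(node);
    }

    public static void main(String[] args) {
        // All four values below are hypothetical placeholders.
        System.out.println(addReplicaUrl(
            "http://localhost:8983/solr", "mycollection", "shard1", "newhost:8983_solr"));
    }
}
```

Requesting the resulting URL (for example with curl or a browser) asks Solr to create a replica of that shard on the named node.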


Re: How to create concatenated token

2015-06-18 Thread Aman Tandon
Hi Erick,

In the issue you forwarded to me, they want to make one token from all the
tokens received from the token stream, but in my case I want to keep the tokens
as they are and create an extra token that is the concatenation of all of them.

 I'd guess, is the case
 here. I mean do you really want to concatenate 50 tokens?

We are applying it to the *title field* of a product, so the max length can be
10, I guess, and even that will be a rare case.

With Regards
Aman Tandon

On Wed, Jun 17, 2015 at 7:16 PM, Erick Erickson

 If you used the JIRA I linked, vote for it, add any improvements etc.
 Anyone can attach a patch to a JIRA, you just have to create a login.

 That said, this may be too rare a use-case to deal with. I just thought
 of shingling which I should have suggested before that will work for
 concatenating small numbers of tokens which, I'd guess, is the case
 here. I mean do you really want to concatenate 50 tokens?


 On Wed, Jun 17, 2015 at 12:07 AM, Aman Tandon
  Dear Erick,
  e.g. Solr training
  *Porter:-*  solr  train
Position 1 2
  *Concatenated :-*   solr  train
 Position 1  2
  I did implemented the filter as per my requirement. Thank you so much for
  your help and guidance. So how could I contribute it to the solr.
  With Regards
  Aman Tandon
  On Wed, Jun 17, 2015 at 10:14 AM, Aman Tandon
  Hi Erick,
  Thank you so much, it will be helpful for me to learn how to save the
  state of token. I has no idea of how to save state of previous tokens
  to this it was difficult to generate a concatenated token in the last.
  So is there anything should I read to learn more about it.
  With Regards
  Aman Tandon
  On Wed, Jun 17, 2015 at 9:20 AM, Erick Erickson
  I really question the premise, but have a look at:
  Note that this is not committed and I haven't reviewed
  it so I don't have anything to say about that. And you'd
  have to implement it as a custom Filter.
  On Tue, Jun 16, 2015 at 5:55 PM, Aman Tandon
   Any guesses, how could I achieve this behaviour.
   With Regards
   Aman Tandon
   On Tue, Jun 16, 2015 at 8:15 PM, Aman Tandon
   e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr
   typo error
   e.g. Intent for solr training: fq=id:(234 456 545) title:(solr
   With Regards
   Aman Tandon
   On Tue, Jun 16, 2015 at 8:13 PM, Aman Tandon
   We has some business logic to search the user query in user
   finding the exact matching products.
   e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr
   As we can see it is phrase query so it will took more time than the
   single stemmed token query. There are also 5-7 words phrase query.
   want to reduce the search time by implementing this feature.
   With Regards
   Aman Tandon
   On Tue, Jun 16, 2015 at 6:42 PM, Alessandro Benedetti wrote:
   Can I ask you why you need to concatenate the tokens ? Maybe we
   better solution to concat all the tokens in one single big token .
   I find it difficult to understand the reasons behind tokenising,
   filtering and then un-tokenizing again :)
   It would be great if you explain a little bit better what you
   do !
   2015-06-16 13:26 GMT+01:00 Aman Tandon
I have a requirement to create the concatenated token of all the
created from the last item of my analyzer chain.
*Suppose my analyzer chain is :*
* tokenizer class=solr.WhitespaceTokenizerFactory /  filter
class=solr.WordDelimiterFilterFactory catenateAll=1
minGramSize=2 maxGramSize=15 side=front /filter
I want to create a concatenated token plugin to add at
along with the last token.
e.g. Solr training
*Porter:-*  solr  train
  Position 1 2
*Concatenated :-*   solr  train
   Position 1  2
Please help me out. How to create custom filter for this
With Regards
Aman Tandon
   Benedetti Alessandro
   Visiting card :

MappingCharFilterFactory and start and end offsets

2015-06-18 Thread Dmitry Kan

It looks like MappingCharFilter sets start and end offset to the same
value. Can this be influenced by some setting?

For a string: "test $ test2" and mapping "$" => " dollarsign " (we insert
extra space to separate $ into its own token)

we get:

Ideally, we would like to have start and end offset respecting the remapped
token. Can this be achieved with settings?

Dmitry Kan
Luke Toolbox:

Contribute the Customized Phonetic Filter to Apache Solr

2015-06-18 Thread Aman Tandon

We created a new phonetic filter, and it is working great on our products.
Most of our suppliers are Indian, so it is quite helpful for us in providing
the exact result, e.g.

1) rikshaw, still able to find the suppliers of rickshaw
2) telefone, still able to find the suppliers of telephone

We also analyzed our search satisfaction feedback; it improved by 13 percentage
points (from 54% to 67%) just after implementing the filter.

We would like to contribute it to Solr. How can I do that?

With Regards
Aman Tandon

Extended Dismax Query Parser with AND as default operator

2015-06-18 Thread Dirk Buchhorn

I have a question about the extended dismax query parser. If the default operator 
is changed to AND (q.op=AND), then the search results seem to be incorrect. I 
will explain it with some examples. For this test I use Solr v5.1 and the tika 
core from the example directory.
== Preparation ==
Add the following line to the schema.xml file
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
Change the field text to stored=true
Remove the multiValued attribute from the title and text fields (we don't need 
multivalued fields in our test)

Add test data (use curl or fiddler)
Header: Content-type: application/json
  {id:1, title:green, author:Jon, text:blue},
  {id:2, title:green, author:Jon Jessie, text:red},
  {id:3, title:yellow, author:Jessie, text:blue},
  {id:4, title:green, author:Jessie, text:blue},
  {id:5, title:blue, author:Jon, text:yellow},
  {id:6, title:red, author:Jon, text:green}

== Test ==
The following parameters are always set.
default operator is AND: q.op=AND
use the extended dismax query parser: defType=edismax
set the default query fields to title and text: qf=title text
sort: id asc

=== #1 test ===
q=red green
{ numFound:2,start:0,
{id:2,title:green,author:Jon Jessie,text:red},
parsedquery_toString: +(((text:green | title:green) (text:red | title:red))~2)

This test works as expected.

=== #2 test ===
We use a group
q=(red green)
Same response as test one.
parsedquery_toString: +(((text:green | title:green) (text:red | title:red))~2)

This test works as expected.

=== #3 test ===
q=green red author:Jessie
{ numFound:1,start:0,
  docs:[{id:2,title:green,author:Jon Jessie,text:red}]
parsedquery_toString: +(((text:green | title:green) (text:red | title:red) 

This test works as expected.

=== #4 test ===
q=(green red) author:Jessie
{ numFound:2,start:0,
{id:2,title:green,author:Jon Jessie,text:red},
parsedquery_toString: +text:green | title:green) (text:red | title:red)) 

The same result as in the 3rd test was expected. Why is no AND used for the query 

=== #5 test ===
q=(+green +red) author:Jessie
{ numFound:4,start:0,
{id:2,title:green,author:Jon Jessie,text:red},
parsedquery_toString: +((+(text:green | title:green) +(text:red | title:red)) 

Now AND is used for the group but the author is concatenated with OR. Why?

=== #6 test ===
q=(+green +red) +author:Jessie
{ numFound:3,start:0,
{id:2,title:green,author:Jon Jessie,text:red},
parsedquery_toString: +((+(text:green | title:green) +(text:red | title:red)) 

Still not the expected result.

=== #7 test ===
q=+(+green +red) +author:Jessie
{ numFound:1,start:0,
  docs:[{id:2,title:green,author:Jon Jessie,text:red}]
parsedquery_toString: +(+(+(text:green | title:green) +(text:red | title:red)) 

Now the result is ok. But if all operators must be given then q.op=AND is 

=== #8 test ===
q=green author:(Jon Jessie)
Four results were found, but one was expected. The query must be changed to 
'+green +author:(+Jon +Jessie)' to get the expected result.

Is this a bug in the extended dismax parser, or what is the reason for not 
consistently applying q.op=AND to the query expression?
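
To make the expected results explicit, the sketch below evaluates strict AND semantics in plain Java over the six test documents. It does not call Solr and does not reproduce eDisMax parsing; it assumes simple whitespace tokenization and exact term matching, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ExpectedAndSemantics {
    // Columns: id, title, author, text (the six test documents from above).
    static final String[][] DOCS = {
        {"1", "green", "Jon", "blue"},
        {"2", "green", "Jon Jessie", "red"},
        {"3", "yellow", "Jessie", "blue"},
        {"4", "green", "Jessie", "blue"},
        {"5", "blue", "Jon", "yellow"},
        {"6", "red", "Jon", "green"},
    };

    /** True if the whitespace-tokenized field value contains the term. */
    static boolean hasTerm(String fieldValue, String term) {
        return Arrays.asList(fieldValue.split("\\s+")).contains(term);
    }

    /**
     * Ids of documents where every qf term occurs in title or text
     * AND every author term occurs in author.
     */
    static List<String> match(List<String> qfTerms, List<String> authorTerms) {
        List<String> ids = new ArrayList<>();
        for (String[] d : DOCS) {
            boolean ok = true;
            for (String t : qfTerms) {
                ok &= hasTerm(d[1], t) || hasTerm(d[3], t);
            }
            for (String t : authorTerms) {
                ok &= hasTerm(d[2], t);
            }
            if (ok) {
                ids.add(d[0]);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        // Test #1: q=red green
        System.out.println(match(Arrays.asList("red", "green"), none));
        // Tests #3 and #4: green, red and author:Jessie, all required
        System.out.println(match(Arrays.asList("green", "red"), Arrays.asList("Jessie")));
        // Test #8: q=green author:(Jon Jessie)
        System.out.println(match(Arrays.asList("green"), Arrays.asList("Jon", "Jessie")));
    }
}
```

Under these assumptions, test #1 matches two documents and tests #3, #4 and #8 match only document 2, which is the behaviour I expected from q.op=AND.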

Kind regards

Dirk Buchhorn

Re: Contribute the Customized Phonetic Filter to Apache Solr

2015-06-18 Thread davidphilip cherian
Hi Aman,


On Thu, Jun 18, 2015 at 12:11 PM, Aman Tandon


 We created the new phonetic filter, It is working great on our products,
 mostly of our suppliers are Indian, it is quite helpful for us to provide
 the exact result e.g.

 1) rikshaw, still able to find the suppliers of rickshaw
 2) telefone, still able to find the suppliers of telephone

 We also analyzed our search satisfaction feedback, it improved by 13% (54%
 - 67%) just after implementing the same.

 And we want to contribute the same to solr, So how could I do it.

 With Regards
 Aman Tandon

facet query is not working

2015-06-18 Thread Midas A

I am not getting facet results .

field name=geolocation type=location indexed=true stored=true/ 
dynamicField name=*_coordinate type=tdouble indexed=true stored=

Re: facet query is not working

2015-06-18 Thread Mikhail Khludnev
isn't facet=true necessary?

On Thu, Jun 18, 2015 at 12:03 PM, Midas A wrote:


 I am not getting facet results .

 field name=geolocation type=location indexed=true stored=true/ 
 dynamicField name=*_coordinate type=tdouble indexed=true stored=

Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

Duplicate suggestions

2015-06-18 Thread jon kerling
I am using Solr 5.1. I'm getting duplicate suggestions when using my 
Solr suggester. I'm using AnalyzingInfixLookupFactory and 
DocumentDictionaryFactory. Can I configure it to suggest only different 

Here are the details of my configuration:

from solrconfig.xml:
searchComponent name=suggest class=solr.SuggestComponent
   lst name=suggester
  str name=namemySuggester1a/str
  str name=lookupImplAnalyzingInfixLookupFactory/str  
  str name=indexPathsuggester_infix_dir1a/str
  str name=allTermsRequiredtrue/str
  str name=dictionaryImplDocumentDictionaryFactory/str 
  str name=fieldf1/str
  str name=weightFieldweightField/str
  str name=suggestAnalyzerFieldTypetext_general/str
  str name=buildOnStartupfalse/str

  lst name=suggester
  str name=namemySuggester2a/str
  str name=lookupImplAnalyzingInfixLookupFactory/str  
  str name=indexPathsuggester_infix_dir2a/str
  str name=allTermsRequiredtrue/str
  str name=dictionaryImplDocumentDictionaryFactory/str 
  str name=fieldf2/str
  str name=weightFieldweightField/str
  str name=suggestAnalyzerFieldTypetext_general/str
  str name=buildOnStartupfalse/str

  requestHandler name=/suggest class=solr.SearchHandler startup=lazy
    lst name=defaults
  str name=suggesttrue/str
  str name=suggest.count6/str
  str name=suggest.dictionarymySuggester1a/str
  str name=suggest.dictionarymySuggester2a/str
    arr name=components

from schema.xml:
field name=f1 type=string indexed=true stored=true required=false multiValued=false /
field name=f2 type=string indexed=true stored=true required=false multiValued=false /
Field name=weightField  type=float  indexed=true  
** I am ignoring weightField; I'm not adding any values to it at all.

document example:
doc
    str name=f12015-04-01/str
    str name=f212:06:00/str
    str name=f3BOOO/str
    str name=f4/
    str name=f57.52.11.212/str
    str name=f67.52.11.213/str
    str 

After I build the suggester, I request suggestions and get a response like this:

?xml version=1.0 encoding=UTF-8?
   lst name=responseHeader
  int name=status0/int
  int name=QTime62/int
   lst name=suggest
  lst name=mySuggester2a
 lst name=12
int name=numFound6/int
arr name=suggestions
  str name=term18:34:lt;bgt;12lt;/bgt;/str
  long name=weight0/long
  str name=payload /
  str name=term18:34:lt;bgt;12lt;/bgt;/str
  long name=weight0/long
  str name=payload /
  str name=term18:35:lt;bgt;12lt;/bgt;/str
  long name=weight0/long
  str name=payload /
  str name=term18:35:lt;bgt;12lt;/bgt;/str
  long name=weight0/long
  str name=payload /
  str name=term18:35:lt;bgt;12lt;/bgt;/str
  long name=weight0/long
  str name=payload /
  str name=termlt;bgt;12lt;/bgt;:06:02/str
  long name=weight0/long
  str name=payload /
  lst name=mySuggester1a
 lst name=12
int name=numFound0/int
arr name=suggestions /

I would like to get this kind of suggester response (no duplicates):

?xml version=1.0 encoding=UTF-8?
   lst name=responseHeader
  int name=status0/int
  int name=QTime62/int
   lst name=suggest
  lst name=mySuggester2a
 lst name=12
int name=numFound3/int
arr name=suggestions
  str name=term18:34:lt;bgt;12lt;/bgt;/str
  long name=weight0/long
  str name=payload /
  str name=term18:35:lt;bgt;12lt;/bgt;/str
  long name=weight0/long
  str name=payload /
  str name=termlt;bgt;12lt;/bgt;:06:02/str
  long name=weight0/long
  str name=payload /
  lst name=mySuggester1a
 lst name=12
int name=numFound0/int
arr name=suggestions /
/response

Thank you.
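
Until the suggester itself can be configured this way, one option is to drop repeated terms on the client after the response arrives. The sketch below assumes the suggestions have already been parsed into a list of dicts with "term", "weight" and "payload" keys; the function name dedupe_suggestions is made up for illustration, and the sample data mirrors the response above.

```python
def dedupe_suggestions(suggestions):
    """Keep the first occurrence of each term, preserving the original order."""
    seen = set()
    result = []
    for suggestion in suggestions:
        term = suggestion["term"]
        if term not in seen:
            seen.add(term)
            result.append(suggestion)
    return result


# Sample data copied from the mySuggester2a response shown above.
terms = [
    "18:34:<b>12</b>", "18:34:<b>12</b>",
    "18:35:<b>12</b>", "18:35:<b>12</b>", "18:35:<b>12</b>",
    "<b>12</b>:06:02",
]
suggestions = [{"term": t, "weight": 0, "payload": ""} for t in terms]

unique = dedupe_suggestions(suggestions)
for suggestion in unique:
    print(suggestion["term"])
```

Because duplicates are removed after the fact, fewer than suggest.count entries may remain; requesting a larger suggest.count and trimming the list afterwards would compensate for that.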

Re: facet query is not working

2015-06-18 Thread Alessandro Benedetti
If he has not configured it in the appends or invariants of the request
handler, facet=true must be passed in the request to activate faceting.
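
For reference, a request that switches faceting on explicitly could be built like this. It is only a sketch: the helper name facet_params is made up, and the geofilt point and distances are invented values, since the original facet queries were not included in the question.

```python
from urllib.parse import urlencode


def facet_params(query, facet_queries):
    """Build a query string that always carries facet=true.

    Hypothetical helper; every facet query is sent as its own
    facet.query parameter.
    """
    params = [("q", query), ("facet", "true")]
    params.extend(("facet.query", fq) for fq in facet_queries)
    return urlencode(params)


# Illustrative values only: the point and the distances are made up.
qs = facet_params("*:*", [
    "{!geofilt sfield=geolocation pt=28.6,77.2 d=10}",
    "{!geofilt sfield=geolocation pt=28.6,77.2 d=50}",
])
print(qs)
```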

I haven't tried those specific facet queries.

I hope the problem was not simply that he didn't activate faceting ...

2015-06-18 10:35 GMT+01:00 Mikhail Khludnev

 isn't facet=true necessary?

 On Thu, Jun 18, 2015 at 12:03 PM, Midas A wrote:

  I am not getting facet results.
  field name=geolocation type=location indexed=true stored=true/
  dynamicField name=*_coordinate type=tdouble indexed=true stored=

 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics


Benedetti Alessandro

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England