Re: Another japanese analysis problem

2014-04-18 Thread Alexandre Rafalovitch
Did you read through the CJK article series? Maybe there is something
in there? 
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

Sorry, no help on actual Japanese.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Fri, Apr 18, 2014 at 12:50 PM, Shawn Heisey s...@elyograg.org wrote:
 On 4/10/2014 11:53 AM, Shawn Heisey wrote:
 My analysis chain includes CJKBigramFilter on both the index and query.
 I have outputUnigrams enabled on the index side, but it is disabled on
 the query side.  This has resulted in a problem with phrase queries.
 This is a subset of my index analysis for the three terms you can see in
 the ICUNF step, separated by spaces:

 https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png

 Note that in the CJKBF step, the second unigram is output at position 2,
 pushing the English terms to positions 3 and 4.

 When the customer runs a phrase filter query (lucene query parser) for
 the first two terms on this specific field, it doesn't match, because the
 query analysis doesn't output the unigrams and therefore the positions
 don't match.

 I would have expected both unigrams to be at position 1.  Is this a bug
 or expected behavior?

 It's been a week with no reply.

 First I worked around this problem by disabling outputUnigrams on the
 index side, to match the query side.  At that point, the customer was
 unable to do searches for a single character and find longer strings
 containing that character.  I knew this would happen ... I did tell our
 project manager, but I do not know whether it was communicated to the
 customer.

 Then I tried setting outputUnigrams to true on both index and query.
 Just as I had anticipated, the customer was unhappy with getting results
 where a word containing only one character of their multi-character
 search string was present.

 Re-stating the underlying problem and my question:

 The outputUnigrams option sets one of the unigrams from each bigram to
 the same position as the bigram, but then puts the other one at the next
 position, breaking phrase queries.  This sounds like a bug.  Is it a
 bug?  If not, I would REALLY like a config option to produce the
 behavior that I expected.
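
 A minimal sketch of the asymmetric configuration described above (the
 field type name is illustrative, not my exact config, which also has the
 ICU normalization steps):

   <fieldType name="text_cjk" class="solr.TextField">
     <analyzer type="index">
       <tokenizer class="solr.ICUTokenizerFactory"/>
       <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.ICUTokenizerFactory"/>
       <filter class="solr.CJKBigramFilterFactory" outputUnigrams="false"/>
     </analyzer>
   </fieldType>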

 Thanks,
 Shawn



'qt' parameter is not working in search call of SolrPhpClient

2014-04-18 Thread harshrossi
I am using SolrPhpClient for interacting with Solr via PHP.

I am using a custom request handler ( /select_test ) with 'edismax' feature
in Solr config file

  <requestHandler name="/select_test" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="wt">json</str>

      <str name="defType">edismax</str>
      <str name="qf">
        text name topic description
      </str>
      <str name="df">text</str>
      <str name="mm">100%</str>
      <str name="q.alt">*:*</str>
      <str name="rows">10</str>
      <str name="fl">*,score</str>

      <str name="mlt.qf">
        text name topic description
      </str>
      <str name="mlt.fl">text,name,topic,description</str>
      <int name="mlt.count">3</int>
    </lst>
  </requestHandler>

I set the value of the 'qt' parameter to '/select_test' in the $search_options
array and pass it as a parameter to the search function of the
Apache_Solr_Service as below:

$search_options = array(
    'qt' => '/select_test',
    'fq' => 'topic:games',
    'sort' => 'name desc'
);



$result = $solr->search($query, 0, 10, $search_options);

It does not call the request handler at all. The call goes to the default
'/select' handler in the Solr config file.

Just to confirm, I put the custom request handler code in the default handler
and it worked.

Why is this happening? Am I not setting it right?

Please help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/qt-parameter-is-not-working-in-search-call-of-SolrPhpClient-tp4131934.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr parallel update and total indexing Issue

2014-04-18 Thread ~$alpha`
There is a big issue with solr parallel update and total indexing

Total import syntax (working):
dataimport?command=full-import&commit=true&optimize=true

Update syntax (working):
curl '.../solr/update?softCommit=true' -H 'Content-type:application/json' -d
'[{"id":1870719,"column":{"set":11}}]'


Issue: if both are run in parallel, then a commit takes place in between.

Example: I have 10k records in total in the index. I fire a Solr query to
update 1000 records, and in between I fire a total import (full indexer).
What happens is that a commit takes place in between, i.e. until the total
indexer finishes I get only the limited set of records (1000).

How to solve this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-parallel-update-and-total-indexing-Issue-tp4131935.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Another japanese analysis problem

2014-04-18 Thread Shawn Heisey
On 4/18/2014 12:04 AM, Alexandre Rafalovitch wrote:
 Did you read through the CJK article series? Maybe there is something
 in there? 
 http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
 
 Sorry, no help on actual Japanese.

Almost everything I know about the Japanese language has been learned in
the last few weeks, working on this Solr config!

That blog series looks like really awesome information.  I will be
trying out some of what they've mentioned.  Thank you for pointing me in
that direction.  The author's index is a lot more complex than ours ...
I'm really hoping to avoid having a lot of copies of each field.  The
index is already relatively large.

I think I'll take my discussion about a possible bug in CJKBigramFilter
to the dev list.

Thanks,
Shawn



Re: Where to specify numShards when startup up a cloud setup

2014-04-18 Thread Liu Bo
Hi zzT

Putting numShards in core.properties also works.

I struggled a little bit while figuring out this configuration approach.
I knew I am not alone! ;-)
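
For reference, a minimal core.properties sketch with numShards set (all
names and values here are illustrative):

    name=collection1_shard1_replica1
    collection=collection1
    shard=shard1
    numShards=2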


On 2 April 2014 18:06, zzT zis@gmail.com wrote:

 It seems that I've figured out a configuration approach to this issue.

 I'm having the exact same issue and the only viable solutions found on the
 net till now are
 1) Pass -DnumShards=x when starting up Solr server
 2) Use the Collections API as indicated by Shawn.

 What I've noticed though - after making the call to /collections to create
 a collection - is that a new core entry is added inside solr.xml with the
 attribute numShards.

 So, right now I'm configuring solr.xml with the numShards attribute inside
 my core nodes. This way I don't have to worry about the annoying stuff
 you've already mentioned, e.g. waiting for Solr to start up etc.

 Of course the same logic applies here: the numShards param is meaningful
 only the first time. Even if you change it at a later point, the # of
 shards stays the same.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Where-to-specify-numShards-when-startup-up-a-cloud-setup-tp4078473p4128566.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
All the best

Liu Bo


Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair
Hello all,

I'm a fairly new Solr user and I need my search function to handle compound
words in German. I've searched through the archives and found that Solr
already has a filter factory made for such words, called
DictionaryCompoundWordTokenFilterFactory. I've already built a list of words
that I want split, and it seems like the filter is working correctly in most
cases. The majority of our searches are clothing items, so, for example,
/schwarzkleid/ (black dress) becomes /schwarz/ /kleid/, which is what
I want to happen. However, it seems like the keyword search is done using an
*OR* operator, so I'm seeing items that are either black or are dresses, but
I just want to see items that are both. I've also read that changing the
default operator in schema.xml or adding q.op as *AND* in solrconfig.xml
will rectify this issue, but nothing has changed in my query results; it
still uses the *OR* operator.
I've tried using extended dismax in my queries, but I am using the Solr PHP
library and I don't think it supports adding dismax filters to the queries
themselves (if I'm wrong, please correct me). By the way, I am using Zend
Framework 2.0 in the backend and am communicating with Solr through the Solr
PHP library: http://www.php.net/manual/tr/book.solr.php .

Any suggestions on how to change the operator after my compound word queries
have been split?
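
For context, a minimal sketch of the kind of field type I mean (the type
name, dictionary filename and size limits are illustrative, not my exact
config):

  <fieldType name="text_de_compound" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
              dictionary="german-compounds.txt"
              minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
              onlyLongestMatch="false"/>
    </analyzer>
  </fieldType>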

Thanks!

Ali



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Jack Krupansky
Make sure your field type has the autoGeneratePhraseQueries="true" attribute
(the default is false). q.op only applies to explicit terms, not to terms that
decompose into multiple terms. Confusing? Yes!
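
A sketch of where the attribute goes (the type name and analyzer contents
are illustrative):

  <fieldType name="text_de" class="solr.TextField"
             autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>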


-- Jack Krupansky

-Original Message- 
From: Alistair

Sent: Friday, April 18, 2014 6:11 AM
To: solr-user@lucene.apache.org
Subject: Having trouble with German compound words in Solr 4.7

Hello all,

I'm a fairly new Solr user and I need my search function to handle compound
words in German. I've searched through the archives and found that Solr
already has a Filter Factory made for such words called
DictionaryCompoundWordTokenFilterFactory. I've already built a list of words
that I want split, and it seems like the filter is working correctly in most
cases, the majority of our searches are clothing items so let's say
/schwarzkleid/ (black dress) becomes /schwarz/ /kleid/, which is what
I want to happen. However, it seems like the keyword search is done using an
*OR* operator. So I'm seeing items that are either black or are dresses but
I just want to see items that are both. I've also read that changing the
default operator in schema.xml or adding q.op as *AND* in the solrconfig.xml
will rectify this issue, but nothing has changed in my query results. It
still uses the *OR* operator.
I've tried using Extended dismax in my queries but I am using the Solr PHP
library and I don't think it supports adding Dismax filters to the queries
themselves (if I'm wrong, please correct me). By the way, I am using Zend
Framework 2.0 in the backend and am communicating with Solr through the Solr
PHP library:  Solr PHP http://www.php.net/manual/tr/book.solr.php  .

Any suggestions on how to change the operator after my compound word queries
have been split?

Thanks!

Ali



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964.html
Sent from the Solr - User mailing list archive at Nabble.com. 



space between search terms

2014-04-18 Thread kumar
Hi,

I have a field called "title". It has values such as "indira nagar"
as well as "indiranagar".

If I type either of the keywords, it has to display both results.

Can anybody help how can we do this?


I am using the title field in the following way:

<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.SynonymFilterFactory" ignoreCase="true"
            synonyms="synonyms_tf.txt" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: space between search terms

2014-04-18 Thread Jack Krupansky

Use an index-time synonym filter with a synonym entry:

indira nagar,indiranagar

But do not use that same filter at query time.

But, that may mess up some exact phrase queries, such as:

q="indiranagar xyz"

since the following term is actually positioned after the longest synonym.

To resolve that, use a sloppy phrase:

q="indiranagar xyz"~1

Or, set qs=1 for the edismax query parser.
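
A sketch of wiring the synonym filter in on the index side only (the
filename is illustrative):

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>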

-- Jack Krupansky

-Original Message- 
From: kumar

Sent: Friday, April 18, 2014 6:34 AM
To: solr-user@lucene.apache.org
Subject: space between search terms

Hi,

I Have a field called title. It is having a values called indira nagar
as well as indiranagar.

If i type any of the keywords it has to display both results.

Can anybody help how can we do this?


I am using the title field in the following way:

<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.SynonymFilterFactory" ignoreCase="true"
            synonyms="synonyms_tf.txt" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: multi word search for elevator (QueryElevationComponent) not working

2014-04-18 Thread Niranjan
Hi Remi ,

Thanks for your reply.

I tried setting the query_text for "apple ipod" and added the
required doc_id to elevate.
I got the result, but again I am not able to get the desired result for NLP
queries such as "ipod nano generation 5" or "apple ipod best music",
as both queries contain "ipod", for which I want my desired doc
ids to be elevated.

I also tried changing the QueryElevationComponent config as:

First with this:
<str name="queryFieldType">string</str>

Second time:
<str name="queryFieldType">text_general</str>

But no success.

Please let me know whether I am making the change you mentioned correctly.

Is there any other way in Solr to achieve this (promoted search)?
Please guide me.

Regards,
Niranjan






--
View this message in context: 
http://lucene.472066.n3.nabble.com/multi-word-search-for-elevator-QueryElevationComponent-not-working-tp4131016p4131971.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Alistair
Hey Jack,

thanks for the reply. I added autoGeneratePhraseQueries="true" to the
fieldType and now it's giving me even more results! I'm not sure if the
debug of my query will be helpful but I'll paste it just in case someone
might have an idea. This produces 113524 results, whereas if I manually
enter the query as keyword:schwarz AND keyword:kleid I only get 20283
results (which is the correct count).





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html
Sent from the Solr - User mailing list archive at Nabble.com.


QueryElevationComponent always reads config from zookeeper

2014-04-18 Thread ronak kirit
Hello,

I was looking into QueryElevationComponent component.

As per the spec (http://wiki.apache.org/solr/QueryElevationComponent), if
the config is not found in ZooKeeper, it should be loaded from the data
directory. However, I see a bug: it doesn't seem to be working even in the
latest 4.7.2 release.

I have checked the latest code and found this:
  Map<String, ElevationObj> getElevationMap(IndexReader reader, SolrCore core) throws Exception {
    synchronized (elevationCache) {
      Map<String, ElevationObj> map = elevationCache.get(null);
      if (map != null) return map;

      map = elevationCache.get(reader);
      if (map == null) {
        String f = initArgs.get(CONFIG_FILE);
        if (f == null) {
          throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
              "QueryElevationComponent must specify argument: " + CONFIG_FILE);
        }
        log.info("Loading QueryElevation from data dir: " + f);

        Config cfg;

        ZkController zkController =
            core.getCoreDescriptor().getCoreContainer().getZkController();
        if (zkController != null) {
          cfg = new Config(core.getResourceLoader(), f, null, null);
        } else {
          InputStream is = VersionedFile.getLatestFile(core.getDataDir(), f);
          cfg = new Config(core.getResourceLoader(), f, new InputSource(is), null);
        }

        map = loadElevationMap(cfg);
        elevationCache.put(reader, map);
      }
      return map;
    }
  }

As per this code, we will never be able to load the config from the data
directory if ZooKeeper exists.

Can we fix this issue?

Thanks,
Ronak


Re: cache warming questions

2014-04-18 Thread Kranti Parisa
cool, thanks.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Apr 17, 2014 at 11:37 PM, Erick Erickson erickerick...@gmail.comwrote:

 No, the 5 most recently used in a query will be used to autowarm.

 If you have things you _know_ are going to be popular fqs, you could
 put them in newSearcher queries.
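
 A sketch of what that looks like in solrconfig.xml (the fq value is
 illustrative):

   <listener event="newSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst>
         <str name="q">*:*</str>
         <str name="fq">price:[0 TO 100]</str>
       </lst>
     </arr>
   </listener>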

 Best,
 Erick

 On Thu, Apr 17, 2014 at 4:51 PM, Kranti Parisa kranti.par...@gmail.com
 wrote:
  Erik,
 
  I have a followup question on this topic.
 
  If we have used 10 unique FQs and we configure filterCache size=100 and
  autowarmCount=5, then which 5 out of the 10 will be repopulated in the
  case of a new searcher?
 
  I don't think there is a way to set the preference or there is?
 
 
  Thanks,
  Kranti K. Parisa
  http://www.linkedin.com/in/krantiparisa
 
 
 
  On Thu, Apr 17, 2014 at 5:25 PM, Matt Kuiper matt.kui...@issinc.com
 wrote:
 
  Ok,  that makes sense.
 
  Thanks again,
  Matt
 
  Matt Kuiper - Software Engineer
  Intelligent Software Solutions
  p. 719.452.7721 | matt.kui...@issinc.com
  www.issinc.com | LinkedIn: intelligent-software-solutions
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Thursday, April 17, 2014 9:26 AM
  To: solr-user@lucene.apache.org
  Subject: Re: cache warming questions
 
  Don't go overboard warming here, you often hit diminishing returns very
  quickly. For instance, if the size is 512 you might set your autowarm
 count
  to 16 and get the most bang for your buck. Beyond some (usually small)
  number, the additional work you put in to warming is wasted. This is
  especially true if your autocommit (soft, or hard with
  openSearcher=true) is short.
 
  So while you're correct in your sizing bit, practically it's rarely that
  complicated since the autowarm count is usually so much smaller than the
  size that there's no danger of swapping them out. YMMV of course.
 
  Best,
  Erick
 
  On Wed, Apr 16, 2014 at 10:33 AM, Matt Kuiper matt.kui...@issinc.com
  wrote:
   Thanks Erick, this is helpful information!
  
   So it sounds like, at minimum the cache size (at least for filterCache
  and queryResultCache) should be the sum of the autowarmCount for that
 cache
  and the number of queries defined for the newSearcher listener.
  Otherwise
  some items in the caches will be evicted right away.
  
   Matt
  
   -Original Message-
   From: Erick Erickson [mailto:erickerick...@gmail.com]
   Sent: Tuesday, April 15, 2014 5:21 PM
   To: solr-user@lucene.apache.org
   Subject: Re: cache warming questions
  
   bq: What does it mean that items will be regenerated or prepopulated
  from the current searcher's cache...
  
   You're right, the values aren't cached. They can't be since the
 internal
  Lucene document id is used to identify docs, and due to merging the
  internal ID may bear no relation to the old internal ID for a particular
  document.
  
   I find it useful to think of Solr's caches as a  map where the key is
  the query and the value is some representation of the found documents.
  The details of the value don't matter, so I'll skip them.
  
   What matters is the key. Consider the filter cache. You put something
  like fq=price:[0 TO 100] on a URL. Solr then uses the fq  clause as the
  key to the filterCache.
  
   Here's the sneaky bit. When you specify an autowarm count of N for the
  filterCache, when a new searcher is opened the first N keys from the map
  are re-executed in the new searcher's context and the results put into
 the
  new searcher's filterCache.
  
   bq:  ...how does auto warming and explicit warming work together?
  
   They're orthogonal. IOW, the autowarming for each cache is executed as
  well as the newSearcher static warming queries. Use the static queries
 to
  do things like fill the sort caches etc.
  
   Incidentally, this bears on why there's a firstSearcher and
  newSearcher. The newSearcher queries are run in addition to the cache
  autowarms. firstSearcher static queries are only run when a Solr server
 is
  started the first time, and there are no cache entries to autowarm. So
 the
  firstSearcher queries might be quite a bit more complex than newSearcher
  queries.
  
   HTH,
   Erick
  
   On Tue, Apr 15, 2014 at 1:55 PM, Matt Kuiper matt.kui...@issinc.com
  wrote:
   Hello,
  
   I have a few questions regarding how Solr caches are warmed.
  
   My understanding is that there are two ways to warm internal Solr
  caches (only one way for document cache and lucene FieldCache):
  
   Auto warming - occurs when there is a current searcher handling
  requests and new searcher is being prepared.  When a new searcher is
  opened, its caches may be prepopulated or autowarmed with cached
 object
  from caches in the old searcher. autowarmCount is the number of cached
  items that will be regenerated in the new searcher.
  http://wiki.apache.org/solr/SolrCaching#autowarmCount
  
   Explicit warming - where the static warming queries specified in
  

Re: Filtering Solr Queries

2014-04-18 Thread Erick Erickson
Is this a manageable list? That is, not a zillion names? If so, it
seems like you could do this with synonyms. Assuming your string_ci
bit is a string type, you'd need to change that to something like
KeywordTokenizerFactory followed by filters, and you might want to add
something like LowerCaseFilterFactory to the chain.
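
A sketch of the kind of synonyms.txt entry I mean (names illustrative):

  rajaji nagar,rajajinagar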

Best,
Erick

On Thu, Apr 17, 2014 at 9:47 PM, kumar pavan2...@gmail.com wrote:
 Hi,

 I am indexing the data using title, city and location fields.

 but different cities have the same location names, like "rajaji nagar"
 and "rajajinagar".

 When a user types

 "computers in rajaji nagar", it has to display results like "computers
 in rajajinagar" as well as "computers in rajaji nagar".

 I am using the following schema.


 <field name="city" type="string_ci" indexed="true" stored="false" />
 <field name="locality" type="string_ci" indexed="true" stored="false" />
 <field name="mytitle" type="textfullmatch" indexed="true" stored="false"
        multiValued="true" omitNorms="true" omitTermFreqAndPositions="true" />





 <fieldType name="textfullmatch" class="solr.TextField">
   <analyzer type="index">
     <charFilter class="solr.MappingCharFilterFactory"
                 mapping="mapping-ISOLatin1Accent.txt"/>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])"
             replacement=" " replace="all"/>
     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50"
             minGramSize="2"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
   </analyzer>
   <analyzer type="query">
     <charFilter class="solr.MappingCharFilterFactory"
                 mapping="mapping-ISOLatin1Accent.txt"/>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])"
             replacement=" " replace="all"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?"
             replacement="$1" replace="all"/>
     <filter class="solr.SynonymFilterFactory" ignoreCase="true"
             synonyms="synonyms_fsw.txt" expand="true" />
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>







 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Filtering-Solr-Queries-tp4131924.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Having trouble with German compound words in Solr 4.7

2014-04-18 Thread Siegfried Goeschl
Hi Alistair,

quick email before getting my plane - I worked with similar requirements in the 
past and tuning SOLR can be tricky

* are you hitting the same SOLR query handler (application versus manual 
checking)?
* turn on debugging for your application SOLR queries so you see what query is 
actually executed
* one thing I always do for prototyping is setting up the Solritas GUI using 
the same query handler as the application server

Cheers,

Siegfried Goeschl


On 18 Apr 2014, at 06:06, Alistair ali...@gmail.com wrote:

 Hey Jack,
 
 thanks for the reply. I added autoGeneratePhraseQueries=true to the
 fieldType and now it's giving me even more results! I'm not sure if the
 debug of my query will be helpful but I'll paste it just in case someone
 might have an idea. This produces 113524 results, whereas if I manually
 enter the query as keyword:schwarz AND keyword:kleid I only get 20283
 results (which is the correct one). 
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: 'qt' parameter is not working in search call of SolrPhpClient

2014-04-18 Thread Erick Erickson
You're confusing a couple of things here. the /select_test can be
accessed by pointing your URL at it rather than using qt, i.e. the
destination you're going to will be
http://server:port/solr/collection/select_test
rather than
http://server:port/solr/collection/select
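
If your client version has no way to change the search servlet, a minimal
sketch that hits the custom handler directly over HTTP (host, port and
core name are illustrative):

  $url = 'http://localhost:8983/solr/collection1/select_test'
       . '?q=' . urlencode($query)
       . '&fq=' . urlencode('topic:games')
       . '&sort=' . urlencode('name desc')
       . '&wt=json';
  $result = json_decode(file_get_contents($url), true);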

Best,
Erick

On Thu, Apr 17, 2014 at 11:31 PM, harshrossi harshro...@gmail.com wrote:
 I am using SolrPhpClient for interacting with Solr via PHP.

 I am using a custom request handler ( /select_test ) with 'edismax' feature
 in Solr config file

   <requestHandler name="/select_test" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <str name="wt">json</str>

       <str name="defType">edismax</str>
       <str name="qf">
         text name topic description
       </str>
       <str name="df">text</str>
       <str name="mm">100%</str>
       <str name="q.alt">*:*</str>
       <str name="rows">10</str>
       <str name="fl">*,score</str>

       <str name="mlt.qf">
         text name topic description
       </str>
       <str name="mlt.fl">text,name,topic,description</str>
       <int name="mlt.count">3</int>
     </lst>
   </requestHandler>

 I set the value for 'qt' parameter as '/select_test' in the $search_options
 array and pass it as parameter to the search function of the
 Apache_Solr_Service as below:

 $search_options = array(
     'qt' => '/select_test',
     'fq' => 'topic:games',
     'sort' => 'name desc'
 );



 $result = $solr->search($query, 0, 10, $search_options);

 It does not call the request handler at all. The call goes to the default
 '/select' handler in solr config file.

 Just to confirm I put the custom request handler code in default handler and
 it worked.

 Why is this happening? Am I not setting it right?

 Please help!



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/qt-parameter-is-not-working-in-search-call-of-SolrPhpClient-tp4131934.html
 Sent from the Solr - User mailing list archive at Nabble.com.


multi-field suggestions

2014-04-18 Thread Michael Sokolov
I've been working on getting AnalyzingInfixSuggester to make suggestions 
using tokens drawn from multiple fields.  I've done this by copying 
tokens from each of those fields into a destination field, and building 
suggestions using that destination field.  This allows me to use 
different analysis strategies for each of the fields, which I need, but 
it doesn't address a couple of remaining issues:


1. Some source fields are more important than others, and it would be 
good to be able to give their tokens greater weight somehow


2. The threshold is applied equally across all tokens, but for some 
fields we want to suggest singletons (threshold=0), while for others we 
want to use the threshold to exclude low-frequency terms.


I looked a little bit at how to extend the whole framework from Solr on 
down to handle multiple source fields intrinsically, rather than using 
the copying technique, and it looks like I could possibly manage 
something like this by extending DocumentDictionary and plugging in a 
different DictionaryFactory.  Does that sound like a good approach?  Is 
there some better way to approach this problem?


Thanks

-Mike

PS Sorry for the cross-post; I realized after I hit send this was 
probably a better question for solr-user than lucene...


Re: solr parallel update and total indexing Issue

2014-04-18 Thread Erick Erickson
try not setting softCommit=true, that's going to take the current
state of your index and make it visible. If your DIH process has
deleted all your records, then that's the current state.

Personally I wouldn't try to mix-n-match like this, the results will
take forever to get right. If you absolutely must do something like
this, I'd use collection aliasing to rebuild my index in a different
collection then switch from the old to new one in a controlled
fashion.

Best,
Erick

On Thu, Apr 17, 2014 at 11:37 PM, ~$alpha` lavesh.ra...@gmail.com wrote:
 There is a big issue with solr parallel update and total indexing

 Total import syntax (working):
 dataimport?command=full-import&commit=true&optimize=true

 Update syntax (working):
 curl '.../solr/update?softCommit=true' -H 'Content-type:application/json' -d
 '[{"id":1870719,"column":{"set":11}}]'


 Issue: if both are run in parallel, then a commit takes place in between.

 Example: I have 10k records in total in the index. I fire a Solr query to
 update 1000 records, and in between I fire a total import (full indexer).
 What happens is that a commit takes place in between, i.e. until the
 total indexer finishes I get only the limited set of records (1000).

 How to solve this?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/solr-parallel-update-and-total-indexing-Issue-tp4131935.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can I reconstruct text from tokens?

2014-04-18 Thread Michael Sokolov
I believe you could use term vectors to retrieve all the terms in a 
document, with their offsets.  Retrieving them from the inverted index 
would be expensive since the index is term-oriented, not 
document-oriented.  Without tv, I think you essentially have to scan the 
entire term dictionary looking for terms in your document. So that will 
cost you probably more than it's worth?
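
A sketch of reading the terms back from a document's term vector (Lucene
4.x API; the field name is illustrative):

  // uses org.apache.lucene.index.* and org.apache.lucene.util.BytesRef
  Terms tv = reader.getTermVector(docId, "text");
  if (tv != null) {
    TermsEnum te = tv.iterator(null);
    BytesRef term;
    while ((term = te.next()) != null) {
      System.out.println(term.utf8ToString());
    }
  }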


-Mike

On 04/16/2014 11:50 AM, Alexandre Rafalovitch wrote:

Hello,

If I use very basic tokenizers, e.g. space based and no filters, can I
reconstruct the text from the tokenized form?

So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?

I know we store enough information, but I don't know internal API
enough to know what I should be looking at for reconstruction
algorithm.

Any hints?

The XY problem is that I want to store large amount of very repeatable
text into Solr. I want the index to be as small as possible, so
thought if I just pre-tokenized, my dictionary will be quite small.
And I will be reconstructing some final form anyway.

The other option is to just use compressed fields on stored field, but
I assume that does not take cross-document efficiencies into account.
And, it will be a read-only index after build, so I don't care about
updates messing things up.

Regards,
Alex

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency




Re: Indexing Big Data With or Without Solr

2014-04-18 Thread Vineet Mishra
Thanks Furkan, I will definitely give it a try then.

Thanks again!




On Tue, Apr 15, 2014 at 7:53 PM, Furkan KAMACI furkankam...@gmail.comwrote:

 Hi Vineet;

 I've been using SolrCloud for such kind of Big Data and I think that you
 should consider to use it. If you have any problems you can ask it here.

 Thanks;
 Furkan KAMACI


 2014-04-15 13:20 GMT+03:00 Vineet Mishra clearmido...@gmail.com:

  Hi All,
 
  I have worked with Solr 3.5 to implement real-time search on some 100GB
  of data. That worked fine but was a little slow on complex queries
  (multiple group/joined queries).
  But now I want to index some really big data (around 4 TB or even more).
  Can SolrCloud be the solution for it? If not, what could be the best
  possible solution in this case?

  Stats for the previous implementation:
  It was a master/slave architecture with multiple standalone instances
  of Solr 3.5. There were around 12 Solr instances running on different
  machines.

  Things to consider for the next implementation:
  Since all the data is sensor data, the factors of duplicity and
  uniqueness must be considered.

  Really urgent - please treat this as a priority and suggest a set of
  feasible solutions.
 
  Regards
 



Boost Search results

2014-04-18 Thread A Laxmi
Hi,


When I started to compare the search results with the two options below, I
see a lot of difference in the search results, especially in the URLs that
show up at the top (from a relevancy perspective).

(1) Nutch 2.2.1 (with Solr 4.0)
(2) Bing custom search set-up

I wonder how I should tweak the boost parameters to get the best results at
the top, like Bing and Google do.

Please suggest why I see a difference and what parameters are best to
configure in Solr to achieve what I see from Bing or Google search
relevancy.

Here is what I've got in solrconfig.xml:

  <str name="defType">edismax</str>
  <str name="qf">
    text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
  </str>
  <str name="q.alt">*:*</str>
  <str name="rows">10</str>
  <str name="fl">*,score</str>


Thanks


Re: Can I reconstruct text from tokens?

2014-04-18 Thread Ramkumar R. Aiyengar
Sorry, didn't think this through. You're right, still the same problem..
On 16 Apr 2014 17:40, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Why? I want stored=false, at which point multivalued field is just offset
 values in the dictionary. Still have to reconstruct from offsets.

 Or am I missing something?

 Regards,
  Alex
 On 16/04/2014 10:59 pm, Ramkumar R. Aiyengar andyetitmo...@gmail.com
 wrote:

  Logically if you tokenize and put the results in a multivalued field, you
  should be able to get all values in sequence?
  On 16 Apr 2014 16:51, Alexandre Rafalovitch arafa...@gmail.com
 wrote:
 
   Hello,
  
   If I use very basic tokenizers, e.g. space based and no filters, can I
   reconstruct the text from the tokenized form?
  
   So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
  
   I know we store enough information, but I don't know internal API
   enough to know what I should be looking at for reconstruction
   algorithm.
  
   Any hints?
  
   The XY problem is that I want to store large amount of very repeatable
   text into Solr. I want the index to be as small as possible, so
   thought if I just pre-tokenized, my dictionary will be quite small.
   And I will be reconstructing some final form anyway.
  
   The other option is to just use compressed fields on stored field, but
   I assume that does not take cross-document efficiencies into account.
   And, it will be a read-only index after build, so I don't care about
   updates messing things up.
  
   Regards,
  Alex
  
   Personal website: http://www.outerthoughts.com/
   Current project: http://www.solr-start.com/ - Accelerating your Solr
   proficiency
  
 



Re: Boost Search results

2014-04-18 Thread Markus Jelsma
Hi - replicating full-featured search engine behaviour is not going to work
with Nutch and Solr out of the box. You are missing a thousand features such
as proper main content extraction, deduplication, classification of content
and hub or link pages, and much more. These things are possible to implement,
but you may want to start by having your Solr request handler better
configured; to begin with, your qf parameter does not have Nutch's default
title and content fields selected.


A Laxmi a.lakshmi...@gmail.com wrote: Hi,


When I started to compare the search results with the two options below, I
see a lot of difference in the search results esp. the* urls that show up
on the top *(*Relevancy *perspective).

(1) Nutch 2.2.1 (with *Solr 4.0*)
(2) Bing custom search set-up

I wonder how should I tweak the boost parameters to get the best results on
the top like how Bing, Google does.

Please suggest why I see a difference and what parameters are best to
configure in Solr to achieve what I see from Bing, or Google search
relevancy.

Here is what i got in solrconfig.xml:

<str name="defType">edismax</str>
<str name="qf">
  text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>


Thanks


Re: Boost Search results

2014-04-18 Thread A Laxmi
Hi Markus, Yes, you are right. I passed the qf from my front-end framework
(PHP which uses SolrClient). This is how I got it set-up:

$this->solr->set_param('defType', 'edismax');
$this->solr->set_param('qf', 'title^10 content^5 url^5');

where you can see qf = title^10 content^5 url^5






On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi, replicating full features search engine behaviour is not going to work
 with nutch and solr out of the box. You are missing a thousand features
 such as proper main content extraction, deduplication, classification of
 content and hub or link pages, and much more. These things are possible to
 implement but you may want to start with having you solr request handler
 better configured, to begin with, your qf parameter does not have nutchs
 default title and content field selected.


 A Laxmi a.lakshmi...@gmail.com wrote: Hi,


 When I started to compare the search results with the two options below, I
 see a lot of difference in the search results esp. the* urls that show up
 on the top *(*Relevancy *perspective).

 (1) Nutch 2.2.1 (with *Solr 4.0*)
 (2) Bing custom search set-up

 I wonder how should I tweak the boost parameters to get the best results on
 the top like how Bing, Google does.

 Please suggest why I see a difference and what parameters are best to
 configure in Solr to achieve what I see from Bing, or Google search
 relevancy.

 Here is what i got in solrconfig.xml:

 <str name="defType">edismax</str>
 <str name="qf">
   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
 </str>
 <str name="q.alt">*:*</str>
 <str name="rows">10</str>
 <str name="fl">*,score</str>


 Thanks



Re: Boost Search results

2014-04-18 Thread A Laxmi
Markus, as I mentioned in my last email, I have got the qf with title,
content and url. That doesn't help a whole lot. Could you please advise
whether there are any other parameters I should consider for the Solr
request handler config, or whether the boosts I have for title, content
and url in the qf parameter have to be modified?

Thanks for your help..


On Fri, Apr 18, 2014 at 4:08 PM, A Laxmi a.lakshmi...@gmail.com wrote:

 Hi Markus, Yes, you are right. I passed the qf from my front-end framework
 (PHP which uses SolrClient). This is how I got it set-up:

 $this->solr->set_param('defType', 'edismax');
 $this->solr->set_param('qf', 'title^10 content^5 url^5');

 where you can see qf = title^10 content^5 url^5






 On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma markus.jel...@openindex.io
  wrote:

 Hi, replicating full features search engine behaviour is not going to
 work with nutch and solr out of the box. You are missing a thousand
 features such as proper main content extraction, deduplication,
 classification of content and hub or link pages, and much more. These
 things are possible to implement but you may want to start with having you
 solr request handler better configured, to begin with, your qf parameter
 does not have nutchs default title and content field selected.


  A Laxmi a.lakshmi...@gmail.com wrote: Hi,


 When I started to compare the search results with the two options below, I
 see a lot of difference in the search results esp. the* urls that show up
 on the top *(*Relevancy *perspective).

 (1) Nutch 2.2.1 (with *Solr 4.0*)
 (2) Bing custom search set-up

 I wonder how should I tweak the boost parameters to get the best results
 on
 the top like how Bing, Google does.

 Please suggest why I see a difference and what parameters are best to
 configure in Solr to achieve what I see from Bing, or Google search
 relevancy.

 Here is what i got in solrconfig.xml:

 <str name="defType">edismax</str>
 <str name="qf">
   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
 </str>
 <str name="q.alt">*:*</str>
 <str name="rows">10</str>
 <str name="fl">*,score</str>


 Thanks





Re: Can I reconstruct text from tokens?

2014-04-18 Thread Erick Erickson
Luke actually does this, or attempts to. The doc you assemble is lossy
though

It doesn't have stop words
All capitalization is lost
original terms for synonyms are lost
all punctuation is lost
I don't  think you can do this unless you store term information.
it's slow.
original words that are stemmed are lost
Anything you do with, say, ngrams will definitely be strange.
etc.

Basically, all the filters in the analysis chain may change what goes
into the index, that's their job. Each step may lose information.

FWIW,
Erick


On Fri, Apr 18, 2014 at 12:36 PM, Ramkumar R. Aiyengar
andyetitmo...@gmail.com wrote:
 Sorry, didn't think this through. You're right, still the same problem..
 On 16 Apr 2014 17:40, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Why? I want stored=false, at which point multivalued field is just offset
 values in the dictionary. Still have to reconstruct from offsets.

 Or am I missing something?

 Regards,
  Alex
 On 16/04/2014 10:59 pm, Ramkumar R. Aiyengar andyetitmo...@gmail.com
 wrote:

  Logically if you tokenize and put the results in a multivalued field, you
  should be able to get all values in sequence?
  On 16 Apr 2014 16:51, Alexandre Rafalovitch arafa...@gmail.com
 wrote:
 
   Hello,
  
   If I use very basic tokenizers, e.g. space based and no filters, can I
   reconstruct the text from the tokenized form?
  
   So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
  
   I know we store enough information, but I don't know internal API
   enough to know what I should be looking at for reconstruction
   algorithm.
  
   Any hints?
  
   The XY problem is that I want to store large amount of very repeatable
   text into Solr. I want the index to be as small as possible, so
   thought if I just pre-tokenized, my dictionary will be quite small.
   And I will be reconstructing some final form anyway.
  
   The other option is to just use compressed fields on stored field, but
   I assume that does not take cross-document efficiencies into account.
   And, it will be a read-only index after build, so I don't care about
   updates messing things up.
  
   Regards,
  Alex
  
   Personal website: http://www.outerthoughts.com/
   Current project: http://www.solr-start.com/ - Accelerating your Solr
   proficiency
  
 



Re: space between search terms

2014-04-18 Thread Ahmet Arslan
Hi Jack,

I am planning to extract and publish such words for the Turkish language,
but I am not sure how to utilize them.

I wonder if there is a more flexible solution that works at query time only.
That would not require reindexing every time a new item is added.

Ahmet


On Friday, April 18, 2014 1:47 PM, Jack Krupansky j...@basetechnology.com 
wrote:
Use an index-time synonym filter with a synonym entry:

indira nagar,indiranagar

But do not use that same filter at query time.

But, that may mess up some exact phrase queries, such as:

q="indiranagar xyz"

since the following term is actually positioned after the longest synonym.

To resolve that, use a sloppy phrase:

q="indiranagar xyz"~1

Or, set qs=1 for the edismax query parser.

-- Jack Krupansky


-Original Message- 
From: kumar
Sent: Friday, April 18, 2014 6:34 AM
To: solr-user@lucene.apache.org
Subject: space between search terms

Hi,

I Have a field called title. It is having a values called indira nagar
as well as indiranagar.

If i type any of the keywords it has to display both results.

Can anybody help how can we do this?


I am using the title field in the following way:

<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.SynonymFilterFactory" ignoreCase="true"
            synonyms="synonyms_tf.txt" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: space between search terms

2014-04-18 Thread Erick Erickson
Ahmet:

Yeah, the index .vs. query time bit is a pain. Often what people will
do is take their best shot at index time, then accumulate omissions
and use that list for query time. Then whenever they can/need to
re-index, merge the query-time list into the index time list and start
over.

Not an ideal solution by any means, but one that people have made to work.

Best,
Erick

On Fri, Apr 18, 2014 at 4:38 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Hi Jack,

 I am planning to extract and publish such words for Turkish language. But I 
 am not sure how to utilize them.

 I wonder if there is a more flexible solution that will work query time only. 
 That would not require reindexing every time a new item is added.

 Ahmet


 On Friday, April 18, 2014 1:47 PM, Jack Krupansky j...@basetechnology.com 
 wrote:
 Use an index-time synonym filter with a synonym entry:

 indira nagar,indiranagar

 But do not use that same filter at query time.

 But, that may mess up some exact phrase queries, such as:

 q="indiranagar xyz"

 since the following term is actually positioned after the longest synonym.

 To resolve that, use a sloppy phrase:

 q="indiranagar xyz"~1

 Or, set qs=1 for the edismax query parser.

 -- Jack Krupansky


 -Original Message-
 From: kumar
 Sent: Friday, April 18, 2014 6:34 AM
 To: solr-user@lucene.apache.org
 Subject: space between search terms

 Hi,

 I Have a field called title. It is having a values called indira nagar
 as well as indiranagar.

 If i type any of the keywords it has to display both results.

 Can anybody help how can we do this?


 I am using the title field in the following way:

 <fieldType name="title" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <charFilter class="solr.MappingCharFilterFactory"
                 mapping="mapping-ISOLatin1Accent.txt" />
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             splitOnCaseChange="1"
             splitOnNumerics="1"
             preserveOriginal="1" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
   </analyzer>
   <analyzer type="query">
     <charFilter class="solr.MappingCharFilterFactory"
                 mapping="mapping-ISOLatin1Accent.txt" />
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             splitOnCaseChange="1"
             splitOnNumerics="1"
             preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
     <filter class="solr.SynonymFilterFactory" ignoreCase="true"
             synonyms="synonyms_tf.txt" expand="true" />
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: space between search terms

2014-04-18 Thread Jack Krupansky
The LucidWorks Search query parser does indeed support multi-word synonyms 
at query time.


I vaguely recall some Jira traffic on supporting multi-word synonyms at 
query time for some special cases, but a review of CHANGES.txt does not find 
any such changes that made it into a release, yet.


The simplest approach for now is to do the query-time synonym expansion in 
your app layer as a preprocessor.
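
A trivial sketch of that kind of preprocessing (Java; the synonym pair is
illustrative):

  String expandQuery(String q) {
    // Expand a known multi-word synonym before sending the query to Solr.
    return q.replace("indiranagar", "(indiranagar OR \"indira nagar\")");
  }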


-- Jack Krupansky

-Original Message- 
From: Ahmet Arslan

Sent: Friday, April 18, 2014 7:38 PM
To: solr-user@lucene.apache.org
Subject: Re: space between search terms

Hi Jack,

I am planning to extract and publish such words for Turkish language. But I 
am not sure how to utilize them.


I wonder if there is a more flexible solution that will work query time 
only. That would not require reindexing every time a new item is added.


Ahmet


On Friday, April 18, 2014 1:47 PM, Jack Krupansky j...@basetechnology.com 
wrote:

Use an index-time synonym filter with a synonym entry:

indira nagar,indiranagar

But do not use that same filter at query time.

But, that may mess up some exact phrase queries, such as:

q="indiranagar xyz"

since the following term is actually positioned after the longest synonym.

To resolve that, use a sloppy phrase:

q="indiranagar xyz"~1

Or, set qs=1 for the edismax query parser.

-- Jack Krupansky


-Original Message- 
From: kumar

Sent: Friday, April 18, 2014 6:34 AM
To: solr-user@lucene.apache.org
Subject: space between search terms

Hi,

I Have a field called title. It is having a values called indira nagar
as well as indiranagar.

If i type any of the keywords it has to display both results.

Can anybody help how can we do this?


I am using the title field in the following way:

<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all" />
    <filter class="solr.SynonymFilterFactory" ignoreCase="true"
            synonyms="synonyms_tf.txt" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>



--
View this message in context:
http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html
Sent from the Solr - User mailing list archive at Nabble.com. 



need help from hard core solr experts - out of memory error

2014-04-18 Thread Candygram For Mongo
I have lots of log files and other files to support this issue (sometimes
referenced in the text below) but I am not sure the best way to submit.  I
don't want to overwhelm and I am not sure if this email will accept graphs
and charts.  Please provide direction and I will send them.


Issue Description



We are getting Out Of Memory errors when we try to execute a full import
using the Data Import Handler.  This error originally occurred on a
production environment with a database containing 27 million records.  Heap
memory was configured for 6GB and the server had 32GB of physical memory.
 We have been able to replicate the error on a local system with 6 million
records.  We set the memory heap size to 64MB to accelerate the error
replication.  The indexing process has been failing in different scenarios.
 We have 9 test cases documented.  In some of the test cases we increased
the heap size to 128MB.  In our first test case we set heap memory to 512MB
which also failed.





Environment Values Used

SOLR/Lucene version: 4.2.1

JVM version:

Java(TM) SE Runtime Environment (build 1.7.0_07-b11)

Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

Indexer startup command:

set JVMARGS= -XX:MaxPermSize=364m -Xss256K -Xmx128m -Xms128m

java  %JVMARGS% ^

-Dcom.sun.management.jmxremote.port=1092 ^

-Dcom.sun.management.jmxremote.ssl=false ^

-Dcom.sun.management.jmxremote.authenticate=false ^

-jar start.jar

SOLR indexing HTTP request parameters:

webapp=/solr path=/dataimport
params={clean=false&command=full-import&wt=javabin&version=2}



The information we use for the database retrieve using the Data Import
Handler is as follows:



<dataSource
    name="org_only"
    type="JdbcDataSource"
    driver="oracle.jdbc.OracleDriver"
    url="jdbc:oracle:thin:@{server name}:1521:{database name}"
    user="{username}"
    password="{password}"
    readOnly="false"
/>





*The Query (simple, single table)*

select

NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(100)), 'null')
  as SOLR_ID,

'STU.ACCT_ADDRESS_ALL'
  as SOLR_CATEGORY,

NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(255)), ' ') as ADDRESSALLRID,

NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_TYPE as varchar2(255)), ' ') as ADDRESSALLADDRTYPECD,

NVL(cast(STU.ACCT_ADDRESS_ALL.LONGITUDE as varchar2(255)), ' ') as ADDRESSALLLONGITUDE,

NVL(cast(STU.ACCT_ADDRESS_ALL.LATITUDE as varchar2(255)), ' ') as ADDRESSALLLATITUDE,

NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_NAME as varchar2(255)), ' ') as ADDRESSALLADDRNAME,

NVL(cast(STU.ACCT_ADDRESS_ALL.CITY as varchar2(255)), ' ') as ADDRESSALLCITY,

NVL(cast(STU.ACCT_ADDRESS_ALL.STATE as varchar2(255)), ' ') as ADDRESSALLSTATE,

NVL(cast(STU.ACCT_ADDRESS_ALL.EMAIL_ADDR as varchar2(255)), ' ') as ADDRESSALLEMAILADDR

from STU.ACCT_ADDRESS_ALL



You can see this information in the database.xml file.



Our main solrconfig.xml file contains the following differences compared to
a newly downloaded solrconfig.xml file
(file:///D:/Solr%20Full%20Indexing%20issue/solrconfig%20(default%20content).xml,
the original content).



<config>

  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

  <!-- Our libraries containing customized filters -->

  <lib path="../../../../default/lib/common.jar" />

  <lib path="../../../../default/lib/webapp.jar" />

  <lib path="../../../../default/lib/commons-pool-1.4.jar" />

  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <directoryFactory name="DirectoryFactory"
                    class="org.apache.solr.core.StandardDirectoryFactory" />

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">database.xml</str>
    </lst>
  </requestHandler>

</config>





*Custom Libraries*



The common.jar contains a customized TokenFiltersFactory implementation
that we use for indexing.  These filters apply some special treatment to
the fields read from the database.  How those classes are used is described
in the schema.xml file.  The webapp.jar file contains other related
classes.  The commons-pool-1.4.jar is an Apache library used for instance
reuse.



The logic used in the TokenFiltersFactory is contained in the following
files:




ConcatFilterFactory.java
(file:///D:/Solr%20Full%20Indexing%20issue/source%20files/ConcatFilterFactory.java)

ConcatFilter.java
(file:///D:/Solr%20Full%20Indexing%20issue/source%20files/ConcatFilter.java)

MDFilterSchemaFactory.java
(file:///D:/Solr%20Full%20Indexing%20issue/source%20files/MDFilterSchemaFactory.java)

MDFilter.java
(file:///D:/Solr%20Full%20Indexing%20issue/source%20files/MDFilter.java)

MDFilterPoolObjectFactory.java
(file:///D:/Solr%20Full%20Indexing%20issue/source%20files/MDFilterPoolObjectFactory.java)


Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Walter Underwood
I see heap size commands for 128 Meg and 512 Meg. That will certainly run out 
of memory. Why do you think you have 6G of heap with these settings?

-Xmx128m -Xms128m
-Xmx512m -Xms512m
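
If the intent was the 6GB heap described in the issue, the startup command
would have to say so explicitly, along these lines (a sketch; the other
values are copied from the report, and the exact sizing is an assumption):

    set JVMARGS= -XX:MaxPermSize=364m -Xss256K -Xmx6g -Xms6g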

wunder

On Apr 18, 2014, at 5:15 PM, Candygram For Mongo 
candygram.for.mo...@gmail.com wrote:

 I have lots of log files and other files to support this issue (sometimes
 referenced in the text below) but I am not sure the best way to submit. [...]

Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Candygram For Mongo
We consistently reproduce this problem on multiple systems configured with
6GB and 12GB of heap space.  To quickly reproduce many cases for
troubleshooting we reduced the heap space to 64, 128 and 512MB.  With 6 or
12GB configured it takes hours to see the error.


On Fri, Apr 18, 2014 at 5:54 PM, Walter Underwood wun...@wunderwood.orgwrote:

 I see heap size commands for 128 Meg and 512 Meg. That will certainly run
 out of memory. Why do you think you have 6G of heap with these settings?

  -Xmx128m -Xms128m
  -Xmx512m -Xms512m

 wunder

 On Apr 18, 2014, at 5:15 PM, Candygram For Mongo 
 candygram.for.mo...@gmail.com wrote:

  I have lots of log files and other files to support this issue (sometimes
  referenced in the text below) but I am not sure the best way to submit. [...]


is there any way to post images and attachments to this mailing list?

2014-04-18 Thread Candygram For Mongo



Re: is there any way to post images and attachments to this mailing list?

2014-04-18 Thread A Laxmi
Just upload them to Google Drive and share the link with this group.


On Fri, Apr 18, 2014 at 9:15 PM, Candygram For Mongo 
candygram.for.mo...@gmail.com wrote:





Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Candygram For Mongo
I have uploaded several files, including the problem description with
graphics, to this link on Google Drive:

https://drive.google.com/folderview?id=0B7UpFqsS5lSjWEhxRE1NN2tMNTQ&usp=sharing

I shared it with this address (solr-user@lucene.apache.org) so I am hoping
it can be accessed by people in the group.


On Fri, Apr 18, 2014 at 5:15 PM, Candygram For Mongo 
candygram.for.mo...@gmail.com wrote:

 I have lots of log files and other files to support this issue (sometimes
 referenced in the text below) but I am not sure the best way to submit. [...]


Re: Boost Search results

2014-04-18 Thread Aman Tandon
I guess you can apply a deboost to the url field.
Lakshmi, it would be easier to make suggestions if you also provided an
example of what you want to achieve.
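
As a rough sketch of such a deboost (the weights here are made-up starting
points, not recommendations), the url field can simply be weighted far below
title and content in the edismax qf; a weight under 1.0 demotes documents
that match only in the URL:

    qf=title^10 content^5 url^0.1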

On Saturday, April 19, 2014, A Laxmi a.lakshmi...@gmail.com wrote:
 Markus, like I mentioned in my last email, I have got the qf with title,
 content and url. That doesn't help a whole lot. Could you please advise
 whether there are any other parameters I should consider for the Solr
 request handler config, or whether the numbers I have for title, content,
 and url in the qf parameter have to be modified?

 Thanks for your help..


 On Fri, Apr 18, 2014 at 4:08 PM, A Laxmi a.lakshmi...@gmail.com wrote:

 Hi Markus, yes, you are right. I passed the qf from my front-end framework
 (PHP, which uses SolrClient). This is how I have it set up:

 $this->solr->set_param('defType','edismax');
 $this->solr->set_param('qf','title^10 content^5 url^5');

 where you can see qf = title^10 content^5 url^5






 On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma 
markus.jel...@openindex.io
  wrote:

 Hi, replicating full-featured search engine behaviour is not going to
 work with Nutch and Solr out of the box. You are missing a thousand
 features such as proper main content extraction, deduplication,
 classification of content and hub or link pages, and much more. These
 things are possible to implement, but you may want to start by having your
 Solr request handler better configured. To begin with, your qf parameter
 does not have Nutch's default title and content fields selected.


 A Laxmi a.lakshmi...@gmail.com wrote: Hi,


 When I started to compare the search results from the two options below, I
 saw a lot of difference, especially in the *urls that show up on top*
 (*relevancy* perspective).

 (1) Nutch 2.2.1 (with *Solr 4.0*)
 (2) Bing custom search set-up

 I wonder how I should tweak the boost parameters to get the best results
 on top, like Bing and Google do.

 Please suggest why I see a difference, and which parameters are best to
 configure in Solr to achieve the search relevancy I see from Bing or
 Google.

 Here is what I have in solrconfig.xml:

 <str name="defType">edismax</str>
 <str name="qf">
   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
 </str>
 <str name="q.alt">*:*</str>
 <str name="rows">10</str>
 <str name="fl">*,score</str>


 Thanks





-- 
Sent from Gmail Mobile


Re: Indexing Big Data With or Without Solr

2014-04-18 Thread Aman Tandon
Vineet, please share your findings after you set up SolrCloud.
Are you using Jetty or Tomcat?

On Saturday, April 19, 2014, Vineet Mishra clearmido...@gmail.com wrote:
 Thanks Furkan, I will definitely give it a try then.

 Thanks again!




 On Tue, Apr 15, 2014 at 7:53 PM, Furkan KAMACI furkankam...@gmail.com
wrote:

 Hi Vineet;

 I've been using SolrCloud for that kind of Big Data and I think that you
 should consider using it. If you have any problems you can ask them here.

 Thanks;
 Furkan KAMACI


 2014-04-15 13:20 GMT+03:00 Vineet Mishra clearmido...@gmail.com:

  Hi All,
 
  I have worked with Solr 3.5 to implement real-time search on some 100GB
  of data. That worked fine, but was a little slow on complex queries
  (multiple group/join queries).
  But now I want to index some real Big Data (around 4TB or even more). Can
  SolrCloud be a solution for it? If not, what could be the best possible
  solution in this case?
 
  *Stats for the previous implementation:*
  It was a master-slave architecture with multiple standalone instances of
  Solr 3.5. There were around 12 Solr instances running on different
  machines.
 
  *Things to consider for the next implementation:*
  Since all the data is sensor data, duplication and uniqueness are the key
  factors.
 
  *Really urgent: please treat this as a priority and suggest a set of
  feasible solutions.*
 
  Regards
 



-- 
Sent from Gmail Mobile
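
For reference, creating a SolrCloud collection for this kind of load goes
through the Collections API; a sketch, where the collection name, shard
count, and replication factor are assumptions for illustration:

    http://localhost:8983/solr/admin/collections?action=CREATE&name=sensordata&numShards=8&replicationFactor=2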


Re: need help from hard core solr experts - out of memory error

2014-04-18 Thread Shawn Heisey
On 4/18/2014 6:15 PM, Candygram For Mongo wrote:
 We are getting Out Of Memory errors when we try to execute a full import
 using the Data Import Handler.  This error originally occurred on a
 production environment with a database containing 27 million records.  Heap
 memory was configured for 6GB and the server had 32GB of physical memory.
  We have been able to replicate the error on a local system with 6 million
 records.  We set the memory heap size to 64MB to accelerate the error
 replication.  The indexing process has been failing in different scenarios.
  We have 9 test cases documented.  In some of the test cases we increased
 the heap size to 128MB.  In our first test case we set heap memory to 512MB
 which also failed.

One characteristic of a JDBC connection is that unless you tell it
otherwise, it will try to retrieve the entire resultset into RAM before
any results are delivered to the application.  It's not Solr doing this,
it's JDBC.

In this case, there are 27 million rows in the resultset.  It's highly
unlikely that this much data (along with the rest of Solr's memory
requirements) will fit in 6GB of heap.

JDBC has a built-in way to deal with this.  It's called fetchSize.  By
using the batchSize parameter on your JdbcDataSource config, you can set
the JDBC fetchSize.  Set it to something small, between 100 and 1000,
and you'll probably get rid of the OOM problem.

http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource
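
Applied to the Oracle dataSource from the original message, the change is a
single attribute; a sketch, where the batchSize value is an assumption to
tune and everything else is as posted:

    <dataSource
        name="org_only"
        type="JdbcDataSource"
        driver="oracle.jdbc.OracleDriver"
        url="jdbc:oracle:thin:@{server name}:1521:{database name}"
        user="{username}"
        password="{password}"
        readOnly="false"
        batchSize="500"
    />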

If you had been using MySQL, I would have recommended that you set
batchSize to -1.  This sets fetchSize to Integer.MIN_VALUE, which tells
the MySQL driver to stream results instead of trying to either batch
them or return everything.  I'm pretty sure that the Oracle driver
doesn't work this way -- you would have to modify the dataimport source
code to use their streaming method.
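
For comparison, a MySQL dataSource configured to stream would look roughly
like this (connection details here are hypothetical):

    <dataSource
        type="JdbcDataSource"
        driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost:3306/mydb"
        user="{username}"
        password="{password}"
        batchSize="-1"
    />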

Thanks,
Shawn