Re: Fastest way to import a large number of documents in SolrCloud

2014-05-02 Thread Alexander Kanarsky
If you build your index in Hadoop, read this (it is about Cloudera
Search, but to my understanding it should also work with the Solr Hadoop
contrib since 4.7):
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html


On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote:

 Hi guys,

 What would you say is the fastest way to import data into SolrCloud?
 Our use case: each day do a single import of a big number of documents.

 Should we use SolrJ/DataImportHandler/other? Or is there perhaps a bulk
 import feature in Solr? I came upon this promising link:
 http://wiki.apache.org/solr/UpdateCSV
 Any idea how UpdateCSV compares performance-wise with
 SolrJ/DataImportHandler?

 If SolrJ, should we split the data in chunks and start multiple clients at
 once? In this way we could perhaps take advantage of the number of
 servers in the SolrCloud configuration?

 Either way, after the import is finished, should we do an optimize or a
 commit or none (
 http://wiki.solarium-project.org/index.php/V1:Optimize_command)?

 Any tips and tricks to perform this process the right way are gladly
 appreciated.

 Thanks,
 Costi
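
For reference, the SolrJ route mentioned above usually means batching adds
against CloudSolrServer; a minimal sketch (the ZooKeeper hosts, collection
name, and field names are assumptions):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // CloudSolrServer routes documents to the correct shard leaders
    CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    solr.setDefaultCollection("collection1");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 1000000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      doc.addField("title", "document " + i);
      batch.add(doc);
      if (batch.size() == 1000) {   // send in chunks instead of one at a time
        solr.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) solr.add(batch);
    solr.commit();                  // single commit once the import is done
    solr.shutdown();
  }
}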



Re: timeAllowed is not honored

2014-05-02 Thread Toke Eskildsen
On Thu, 2014-05-01 at 23:38 +0200, Shawn Heisey wrote:
 I was surprised to read that fc uses less memory.

I think that is an error in the documentation. Except for special cases,
such as asking for all facet values on a high cardinality field, I would
estimate that enum uses less memory than fc.

- Toke Eskildsen, State and University Library, Denmark




Re: timeAllowed is not honored

2014-05-02 Thread Toke Eskildsen
On Thu, 2014-05-01 at 23:03 +0200, Aman Tandon wrote:
 So can you explain how enum is faster than the default?

The fundamental difference is that enum iterates terms and counts how
many of the documents associated with each term are in the hits, while fc
iterates all hits and updates a counter for the term associated with each
document.

Simplified a bit: enum goes terms->docs, fc goes hits->terms. enum
wins when there are relatively few unique terms and is much less
affected by index updates than fc. As Shawn says, you are best off
testing.
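
For anyone testing the two: facet.method can be set per request or per
field; illustrative requests (the field name "category" is made up):

  q=*:*&facet=true&facet.field=category&facet.method=enum
  q=*:*&facet=true&facet.field=category&f.category.facet.method=enum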

 We are planning to move to SolrCloud with Solr 4.7.1, so will 14 GB
 of RAM be sufficient, or should we increase it?

Switching to SolrCloud does not change your fundamental memory
requirements for searching. The merging part adds some overhead, but
with a heap of 14GB, I would be surprised if that would require an
increase.

Consider using DocValues for facet fields with many unique values, to
get both speed and low memory usage at the cost of increased index
size.
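
In schema.xml that is just an attribute on the field; a minimal example
(the field name is illustrative):

  <field name="category" type="string" indexed="true" stored="false"
         docValues="true"/>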

- Toke Eskildsen, State and University Library, Denmark




Re: Block Join Score Highlighting

2014-05-02 Thread StrW_dev
Mikhail Khludnev wrote
 Hello,

 Score support is addressed at
 https://issues.apache.org/jira/browse/SOLR-5882.
 Highlighting is another story. Be aware of
 http://heliosearch.org/expand-block-join/ - it might be somewhat useful
 for your problem.

Thanks for the reply! My score question is answered with that.

I already tried expanding based on that exact article. With expanding I
might be able to search in the parent, filter the children, and also return
the children based on the same filter query.
However, this doesn't give me the most relevant child, and it certainly won't
allow me to use the boost of that child in the score of the parent document.
I am forced to search at the child level, as this allows me to use the unique
boost of the child to influence the score.

What I would need is to return snippets based on the search in the parent,
but right now snippets are based on the returned document.







Re: XSLT Caching Warning

2014-05-02 Thread Christopher Gross
I have a few transforms that I need to do, but I set the cache
lifetime very high.  I'm just trying to resolve the error messages that pop up.

If it's something that I can ignore, then that's OK, I just wanted to be
sure.

Thanks!

-- Chris


On Thu, May 1, 2014 at 10:32 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 I think the key message here is:
 "simplistic XSLT caching mechanism is not appropriate for high load
 scenarios."

 As in, maybe this is not really a production-level component. One
 exception is given, and it is not just the lifetime; it is also the
 single-transform condition.

 Are you satisfying both of those conditions? If so, it's probably OK
 to just ignore the warning.
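
 For illustration, if you do use a single transform, the lifetime can
 simply be raised; a sketch with an arbitrary one-day value:

   <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
     <int name="xsltCacheLifetimeSeconds">86400</int>
   </queryResponseWriter>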

 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Fri, May 2, 2014 at 3:28 AM, Christopher Gross cogr...@gmail.com
 wrote:
  I get this warning when Solr (4.7.2) starts:
  WARN  org.apache.solr.util.xslt.TransformerProvider - The
  TransformerProvider's simplistic XSLT caching mechanism is not
  appropriate for high load scenarios, unless a single XSLT transform is
  used and xsltCacheLifetimeSeconds is set to a sufficiently high value.
 
  The solrconfig.xml setting is:
    <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
      <int name="xsltCacheLifetimeSeconds">10</int>
    </queryResponseWriter>
 
  Is there a different class that I should be using?  Is there a higher
  number than 10 that will do the trick?
 
  Thanks!
 
  -- Chris



Export big extract from Solr to [My]SQL

2014-05-02 Thread Per Steffensen

Hi

I want to make extracts from my Solr to MySQL. Are there any tools around
that can help me perform such a task? I find a lot about data-import from
SQL when googling, but nothing about export/extract. It is not all of the
data in Solr I need to extract, only the documents that fulfill a
normal Solr query, but the number of documents fulfilling it will
(potentially) be huge.


Regards, Per Steffensen


Re: Export big extract from Solr to [My]SQL

2014-05-02 Thread Siegfried Goeschl

Hi Per,

basically I see three options:

* use a lot of memory to cope with huge result sets
* use result set paging
* SOLR 4.7 supports cursors
  (https://issues.apache.org/jira/browse/SOLR-5463); see the SolrJ
  sketch below
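
A minimal SolrJ sketch of the cursor approach (SolrJ 4.7+; the URL, the
query, and the "id" uniqueKey are placeholders, and the MySQL write is
left to JDBC):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class SolrToMySql {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("somefield:somevalue"); // the "normal Solr query"
    q.setRows(500);                                     // page size
    q.setSort(SortClause.asc("id"));                    // cursor needs a sort ending on uniqueKey
    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = solr.query(q);
      for (SolrDocument doc : rsp.getResults()) {
        // hand each document to your JDBC/MySQL writer here
      }
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) break;                   // cursor did not advance: done
      cursor = next;
    }
    solr.shutdown();
  }
}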


Cheers,

Siegfried Goeschl

On 02.05.14 13:32, Per Steffensen wrote:

Hi

I want to make extracts from my Solr to MySQL. Are there any tools around
that can help me perform such a task? I find a lot about data-import from
SQL when googling, but nothing about export/extract. It is not all of the
data in Solr I need to extract, only the documents that fulfill a
normal Solr query, but the number of documents fulfilling it will
(potentially) be huge.

Regards, Per Steffensen




Displaying ExternalFileField values in CSVResponse - Solr 4.6

2014-05-02 Thread Sanjeev Pragada
Hi,

We are using Solr 4.6 to index and search our ecommerce product details. We
are using the ExternalFileField option to incorporate some ranking signals.
The problem I am facing currently is that the values of the
ExternalFileField are not displayed in the CSVResponse of Solr. However, I
am able to get the values for other response formats such as XML, JSON,
Python, etc. Can anyone please let me know if there is a way to display the
values in the CSVResponse. I don't prefer to use other response formats, as
these responses are fatter than the CSV response and parsing them involves
additional performance cost. If the required functionality is available in
a later version of Solr, I will be able to upgrade to it. Any help would be
great.

Thanks,
Sanjeev

PostingsHighlighter complains about no offsets

2014-05-02 Thread Michael Sokolov
I've been wanting to try out the PostingsHighlighter, so I added 
storeOffsetsWithPositions to my field definition, enabled the 
highlighter in solrconfig.xml,  reindexed and tried it out.  When I 
issue a query I'm getting this error:


field 'text' was indexed without offsets, cannot highlight

java.lang.IllegalArgumentException: field 'text' was indexed without offsets,
cannot highlight
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545)
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467)
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392)
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)

I've been trying to figure out why the field wouldn't have offsets
indexed, but I just can't see it.  Is there something in the analysis
chain that could be stripping out offsets?



This is the field definition:

<field name="text" type="text_en" indexed="true" stored="true"
       multiValued="false" termVectors="true" termPositions="true"
       termOffsets="true" storeOffsetsWithPositions="true"/>


(Yes I know PH doesn't require term vectors; I'm keeping them around for 
now while I experiment)


<fieldType name="text_en" class="solr.TextField"
           positionIncrementGap="100">

  <analyzer type="index">
    <!-- We are indexing mostly HTML so we need to ignore the tags -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or
         protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            stemEnglishPossessive="1" protected="protwords.txt"/>
    <!-- This deals with contractions -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or
         protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"/>
    <!-- setting tokenSeparator="" solves issues with compound
         words and improves phrase search -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
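
For reference, the PostingsHighlighter is switched on in solrconfig.xml
roughly like this in 4.x (a sketch following the Solr documentation):

  <searchComponent class="solr.HighlightComponent" name="highlight">
    <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
  </searchComponent>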


Re: What are the best practices for multiple-language support in SolrCloud?

2014-05-02 Thread Nicole Lacoste
Hi Shamik,

I don't have an answer for you, just a couple of comments.

Why not use dynamic field definitions in the schema? As you say, most of
your fields are not analysed; you just add a language tag (_en, _fr, _de,
...) to the field when you index or query.  Then you can add languages as
you need without having to touch the schema.  For fields that you do
analyse (stop words or synonyms) you'll have to explicitly define a
field type for them.  My experience with docs that are in two or three main
languages is that single core vs. multi-core has not been that critical;
sharding and replication made a bigger difference for us.  You could put
English in one core and everything else in another.
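
A sketch of the dynamic-field idea in schema.xml (the per-language field
types are assumed to be defined elsewhere):

  <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
  <dynamicField name="*_de" type="text_de" indexed="true" stored="true"/>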

What we tried to do was just index stuff to the same field, that is, French
and English getting indexed into the contents or title field (we have our own
tokenizer and filter chain so did actually analyse them differently), but we
got into lots of problems with tf-idf, so I'd advise against that. The
motivation was that we wanted multilingual results. Terry's approach here
is much better, and as you thought is addressing the multilingual
requirement, but I still don't think it totally addresses the tf-idf
problem. So if you don't need multilingual, don't go that route.

I am curious to see what other people think.

Niki


Roll up query with original facets

2014-05-02 Thread Darin Amos
Hello All,

I am having a query issue I cannot seem to find the correct answer for. I am 
searching against a list of items and returning facets for that list of items. 
I would like to group the result set on a field such as a “parentItemId”. 
parentItemId maps to other documents within the same core. I would like my 
query to return the documents that match parentItemId, but still return the 
facets of the original query.

Is this possible with SOLR 4.3 that I am running? I can provide more details if 
needed, thanks!

Darin

Re: PostingsHighlighter complains about no offsets

2014-05-02 Thread Michael Sokolov
I checked using the analysis admin page, and I believe there are offsets
being generated (I assume start/end = offsets).  So I don't know; I am going
to try reindexing again.  Maybe I neglected to reload the config before I
indexed last time.


-Mike

On 05/02/2014 09:34 AM, Michael Sokolov wrote:
 I've been wanting to try out the PostingsHighlighter, so I added
 storeOffsetsWithPositions to my field definition, enabled the
 highlighter in solrconfig.xml, reindexed and tried it out.  When I
 issue a query I'm getting this error:

 field 'text' was indexed without offsets, cannot highlight

 [snip - stack trace and field/fieldType definitions quoted in full above]




Re: Displaying ExternalFileField values in CSVResponse - Solr 4.6

2014-05-02 Thread Ahmet Arslan
Hi Sanjeev,

Here is the relevant jira: https://issues.apache.org/jira/browse/SOLR-5423,
which has fix versions 4.7.1, 4.8, and 5.0.

So I recommend downloading and using the latest 4.8.0 version.
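
On a fixed version the external file values come back like any other
function query in fl; an illustrative request ("rank" assumed to be the
ExternalFileField):

  fl=id,rank:field(rank)&wt=csv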

Ahmet





On Friday, May 2, 2014 2:46 PM, Sanjeev Pragada sanje...@rediff.co.in wrote:
Hi,

We are using Solr 4.6 to index and search our ecommerce product details. We
are using the ExternalFileField option to incorporate some ranking signals.
The problem I am facing currently is that the values of the
ExternalFileField are not displayed in the CSVResponse of Solr. However, I
am able to get the values for other response formats such as XML, JSON,
Python, etc. Can anyone please let me know if there is a way to display the
values in the CSVResponse. I don't prefer to use other response formats, as
these responses are fatter than the CSV response and parsing them involves
additional performance cost. If the required functionality is available in
a later version of Solr, I will be able to upgrade to it. Any help would be
great.

Thanks,
Sanjeev


Re: Block Join Score Highlighting

2014-05-02 Thread Mikhail Khludnev
On Fri, May 2, 2014 at 2:34 PM, StrW_dev r.j.bamb...@structweb.nl wrote:

 Mikhail Khludnev wrote
  Hello,

  Score support is addressed at
  https://issues.apache.org/jira/browse/SOLR-5882.
  Highlighting is another story. Be aware of
  http://heliosearch.org/expand-block-join/ - it might be somewhat useful
  for your problem.

 Thanks for the reply! My score question is answered with that.

But you forgot to vote for that issue!
Regarding highlighting, unfortunately I've never worked with it, hence no
quick help from my side.
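
For the record, a sketch of what SOLR-5882 enables once applied (the score
local param follows the patch, and the field names are made up):

  q={!parent which="doc_type:parent" score=max}description:foo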


 I already tried expanding based on that exact article. With expanding I
 might be able to search in the parent, filter the children, and also
 return the children based on the same filter query.
 However, this doesn't give me the most relevant child, and it certainly
 won't allow me to use the boost of that child in the score of the parent
 document.
 I am forced to search at the child level, as this allows me to use the
 unique boost of the child to influence the score.

 What I would need is to return snippets based on the search in the parent,
 but right now snippets are based on the returned document.









-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Export big extract from Solr to [My]SQL

2014-05-02 Thread simon
The cursor-based deep paging in 4.7+ works very well and the performance on
large extracts (for us, maybe  up to 100K documents) is excellent, though
it will obviously depend on the number and size of fields that you need to
 pull. I wrote a Perl module to do the extractions from Solr without
problems (and DBI takes care of  writing to a database).

I'm probably going to rewrite in Python since the final destination of many
of our extracts is Tableau,  which has  a Python API for creating TDEs
(Tableau data extracts)

regards

-Simon


On Fri, May 2, 2014 at 7:43 AM, Siegfried Goeschl sgoes...@gmx.at wrote:

 Hi Per,

  basically I see three options:

  * use a lot of memory to cope with huge result sets
  * use result set paging
  * SOLR 4.7 supports cursors (https://issues.apache.org/
  jira/browse/SOLR-5463)

 Cheers,

 Siegfried Goeschl


 On 02.05.14 13:32, Per Steffensen wrote:

 Hi

  I want to make extracts from my Solr to MySQL. Are there any tools around
  that can help me perform such a task? I find a lot about data-import from
  SQL when googling, but nothing about export/extract. It is not all of the
  data in Solr I need to extract, only the documents that fulfill a
  normal Solr query, but the number of documents fulfilling it will
  (potentially) be huge.

 Regards, Per Steffensen





Re: Searching for tokens does not return any results

2014-05-02 Thread Erick Erickson
bq:  but this index was created using a Java program using Lucene interface

Elaborating a bit on Koji's comment...

The fact that you used Lucene to index the doc means that the analysis
page is almost, but not quite entirely, useless on the indexing side.
It's looking at your field definition in schema.xml and running your
input stream through the indexing portion of your analysis chain
constructed from the schema. What's actually in your index though was
put there by raw Lucene. So your Lucene program _must_ create an
analysis chain that is absolutely identical to what's in your schema
for the admin/analysis page to be accurate.

Quick test: go to your admin/schema-browser page or use the
TermsComponent
(https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
or Luke to examine the actual tokens in your field. My bet is that
you'll see that the actual terms are not what you expect and almost
certainly not what the admin/analysis page shows on the index side.
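
For example, assuming the default /terms handler (collection name and
prefix are illustrative):

  http://localhost:8983/solr/collection1/terms?terms.fl=DBASE_LOCAT_NM_TEXT&terms.prefix=crd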

Keeping an independent Lucene program that puts data into your index
with raw Lucene aligned with your schema is, as you can see, something
of a problem. If at all possible, consider letting Solr do the
indexing and sending it documents with SolrJ, here's a reference:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

By the way, I want to compliment you on your post. You did all the right things:
* defined your problem clearly
* added the critical bit (index created with Lucene). This is especially
  relevant I think
* illustrated the input and output
* told us what the problem was
* gave us the field definitions
* showed the results of some of your investigation

Best
Erick

On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi k...@r.email.ne.jp wrote:
 Hi Yetkin, welcome!

 I think StandardAnalyzer of Lucene is the problem you are facing.

 Why don't you have another field using StandardAnalyzer and see how it
 tokenizes CRD_PROD on the Solr admin GUI?

 I forget the details, but we can use Lucene's Analyzer in schema.xml
 something like this:

 <fieldType ...>
   <analyzer class="solr.StandardAnalyzer"/>
 </fieldType>

 Koji
 --
 http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


 (2014/05/01 23:04), Yetkin Ozkucur wrote:

 Hello everyone,

 I am new to SOLR and this is my first post in this list.
 I have been working on this problem for a couple of days. I tried
 everything I found on Google, but it looks like I am missing something.

 Here is my problem:
 I have a field called: DBASE_LOCAT_NM_TEXT
 It contains values like: CRD_PROD
 The goal is to be able to search this field either by putting the exact
 string CRD_PROD or part of it (tokenized by _)  like CRD or PROD

 Currently:
 This query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD
 But this does not: q=DBASE_LOCAT_NM_TEXT:CRD
 I want to understand why the second query does not return any results

 Here is how I configured the field:
 <field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true"
        stored="true" required="false" multiValued="false"/>

 And Here is how I configured the field type :
  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <filter class="solr.WordDelimiterFilterFactory"
              preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"
              splitOnCaseChange="1"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"
              protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <filter class="solr.WordDelimiterFilterFactory"
              preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
              catenateWords="0" catenateNumbers="0" catenateAll="0"
              splitOnCaseChange="1"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"
              protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

 I am also using the analysis panel in the SOLR admin console. It shows
 this:
 WT  CRD_PROD

 WDF CRD_PROD
 CRD
 PROD
 CRDPROD

 SF  CRD_PROD
 CRD
 PROD
 CRDPROD

 LCF crd_prod
 crd
 prod
 crdprod

 SKMF crd_prod
 crd
 prod
 crdprod

 RDTF crd_prod
 crd
 prod
 crdprod


 I am not sure if it is related or not, but this index was created using a
 Java program using the Lucene interface. It used StandardAnalyzer for
 writing, and the field was configured as tokenized, indexed, and stored.

Re: Shards don't return documents in same order

2014-05-02 Thread Erick Erickson
Francois:

Yes, there are several means to examine the raw terms in the index:
* The admin/schema-browser page
* TermsComponent:
  https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
* Luke

The schema-browser is all set up for you; it's easiest. The
TermsComponent should be directly usable too; I believe it's
configured by default in solrconfig.xml. Luke takes a bit of setup but
is a great tool.

Did you re-index from scratch on all shards? I presume your ordering
is still not the same on all shards... the order I'd expect would be:
mb20140410a
mb20140410anew
mb20140411a

Best,
Erick


On Thu, May 1, 2014 at 8:27 AM, Francois Perron
francois.per...@ticketmaster.com wrote:
 Hi Erick,

   thank you for your response.  You are right, I changed alphaOnlySort to
 keep letters and numbers and to remove some articles (a, an, the).

 This is the filetype definition :

 <fieldType name="alphaOnlySort" class="solr.TextField"
            sortMissingLast="true" omitNorms="true">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.TrimFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" replace="all"
             replacement="" pattern="(\b(a|an|the)\b|[^a-z,0-9])"/>
   </analyzer>
 </fieldType>


 Then, I tested each name with the admin UI on each server, and these are
 the results:

 server1

 MB20140410A = mb20140410a
 MB20140411A = mb20140411a
 MB20140410A-New = mb20140410anew

 server2

 MB20140410A = mb20140410a
 MB20140411A = mb20140411a
 MB20140410A-New = mb20140410anew

 server3

 MB20140410A = mb20140410a
 MB20140411A = mb20140411a
 MB20140410A-New = mb20140410anew

 Unfortunately, all results are identical, so is there a way to view the data
 actually indexed in these documents?  Could it be a problem with a particular
 server?  All configs are in ZooKeeper, so all cores should have the same
 config, right?  Is there any way to force a replica to resynchronize?

 Regards,

 Francois.

 
 From: Erick Erickson [erickerick...@gmail.com]
 Sent: April 30, 2014, 16:36
 To: solr-user@lucene.apache.org
 Subject: Re: Shards don't return documents in same order

 Hmmm, take a look at the admin/analysis page for these inputs for
 alphaOnlySort. If you're using the stock Solr distro, you're probably
 not considering the effects of PatternReplaceFilterFactory, which is
 removing all non-letters. So these three terms reduce to

 mba
 mba
 mbanew

 You can look at the actual indexed terms by the admin/schema-browser as well.

 That said, unless you transposed the order because you were
 concentrating on the numeric part, the doc with MB20140410A-New should
 always be sorting last.

 All of which is irrelevant if you're doing something else with
 alphaOnlySort, so please paste in the fieldType definition if you've
 changed it.

 What gets returned in the doc for _stored_ data is a verbatim copy,
 NOT the output of the analysis chain, which can be confusing.

 Oh, and Solr uses the internal lucene doc ID to break ties, and docs
 on different replicas can have different internal Lucene doc IDs
 relative to each other as a result of merging so that's something else
 to watch out for.

 Best,
 Erick

 On Wed, Apr 30, 2014 at 1:06 PM, Francois Perron
 francois.per...@ticketmaster.com wrote:
 Hi guys,

   I have a small SolrCloud setup (3 servers, 1 collection with 1 shard and 3
 replicas).  In my schema, I have an alphaOnlySort field with a copyField.

 This is a part of my managed-schema:

 <field name="_root_" type="string" indexed="true" stored="false"/>
 <field name="_uid" type="string" multiValued="false" indexed="true"
        required="true" stored="true"/>
 <field name="_version_" type="long" indexed="true" stored="true"/>
 <field name="event_id" type="string" indexed="true" stored="true"/>
 <field name="event_name" type="text_general" indexed="true"
        stored="true"/>
 <field name="event_name_sort" type="alphaOnlySort"/>

 with the copyField:

   <copyField source="event_name" dest="event_name_sort"/>


 The problem is: I query my collection with a sort on my alphaOnlySort field,
 but on one of my servers the sort order is not the same.

 On servers 1 and 2, I have this result:

 <doc>
   <str name="event_name">MB20140410A</str>
 </doc>
 <doc>
   <str name="event_name">MB20140410A-New</str>
 </doc>
 <doc>
   <str name="event_name">MB20140411A</str>
 </doc>



 and on the third one, this:

 <doc>
   <str name="event_name">MB20140410A</str>
 </doc>
 <doc>
   <str name="event_name">MB20140411A</str>
 </doc>
 <doc>
   <str name="event_name">MB20140410A-New</str>
 </doc>


 The doc named MB20140411A should be at the end ...

 Any idea ?

 Regards


Re: Fastest way to import a large number of documents in SolrCloud

2014-05-02 Thread Erick Erickson
re: optimize after every import

This is not recommended in 4.x unless and until you have evidence that
it really does help; reviews are very mixed, and it's been renamed
"force merge" in 4.x just so people don't think "Of course I want to do
this, who wouldn't?".

bq: Doing a commit instead of optimize is usually bringing the master
and slave nodes down
This isn't expected unless you're committing far too frequently. I'd
recommend against doing any commits except, possibly, a single commit
after all the clients have finished indexing. But even that isn't
necessary.

In batch mode in SolrCloud, reasonable settings are:
autocommit: 15 seconds WITH openSearcher=false
autosoftcommit: the interval it takes you to run all your indexing.

Seems odd, but here's the background:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
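
In solrconfig.xml those settings look roughly like this (the soft commit
interval below is an illustrative one hour; use whatever covers your
indexing run):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>          <!-- hard commit every 15s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>3600000</maxTime>        <!-- soft commit makes docs visible -->
  </autoSoftCommit>
</updateHandler>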

Best,
Erick

On Thu, May 1, 2014 at 11:12 PM, Alexander Kanarsky
kanarsky2...@gmail.com wrote:
 If you build your index in Hadoop, read this (it is about Cloudera
 Search, but to my understanding it should also work with the Solr Hadoop
 contrib since 4.7):
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html


 On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote:
 [snip - original question quoted at the start of this thread]



Re: Roll up query with original facets

2014-05-02 Thread Erick Erickson
I think this might be what you're looking for:
http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
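
The idea is to tag the fq and exclude that tag when faceting; an
illustrative request (field names made up):

  q=skates&fq={!tag=pid}parentItemId:123&facet=true&facet.field={!ex=pid}category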

Best,
Erick

On Fri, May 2, 2014 at 7:19 AM, Darin Amos dari...@gmail.com wrote:
 Hello All,

 I am having a query issue I cannot seem to find the correct answer for. I am 
 searching against a list of items and returning facets for that list of 
 items. I would like to group the result set on a field such as a 
 “parentItemId”. parentItemId maps to other documents within the same core. I 
 would like my query to return the documents that match parentItemId, but 
 still return the facets of the original query.

 Is this possible with SOLR 4.3 that I am running? I can provide more details 
 if needed, thanks!

 Darin


ANNOUNCE: Apache Solr Reference Guide for 4.8

2014-05-02 Thread Chris Hostetter


The Lucene PMC is pleased to announce that there is a new version of the 
Solr Reference Guide available for Solr 4.8.


The 396 page PDF serves as the definitive user's manual for Solr 4.8. It 
can be downloaded from the Apache mirror network:


https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/


-Hoss


Can't use 2 highlighting components in the same solrconfig

2014-05-02 Thread Cario, Elaine
Hoping someone can help me...

I'm trying to use both the PostingsHighlighter and the FastVectorHighlighter in 
the same solrconfig (selection driven by different request handlers), but once 
I define 2 search components in the config, it always picks the Postings 
Highlighter (even if I never reference it in any request handler).

Is this even possible to do?  (I'm using 4.7.1).

I think the culprit is some specific code in SolrCore.loadSearchComponents(), 
which specifically overwrites the highlighting component with the contents of 
the postingshighlight component - so the components map has 2 entries, but 
they both point to the same highlighting class (the PostingsHighlighter).

It seems pretty deliberate (it only does it for the highlighter!), but 
wondering if there is some reason to allow only one version of the highlighter 
to be used.

We're using 2 highlighters since the FVH is slow when creating snippets for a 
search result list (10-50 documents), so we turned to the PH (which is 
definitely faster, even though it doesn't keep phrases together, but that's a 
post for another day).  But we like FVH for highlighting query terms in the 
full document, once the user clicks on a result.  The plan is to use the PH in 
a search request handler, and the FVH in a document view request handler.

Thanks.



RE: Searching for tokens does not return any results

2014-05-02 Thread Yetkin Ozkucur
Erick, Koji, Ahmet:

Thank you all for your answers! I think I found the problem and I am on the
right track to fix it.

1- As you suggested, the problem was in the Java code populating the index. The
analyzer in the Java code had to be consistent with the one defined in SOLR. I
was able to achieve my goal by creating a slightly customized analyzer.
2- Being able to see the tokens in the index was key to debugging the problem. I
downloaded Luke (well, a tweaked version of it for Lucene 4.4) to be able to see
the tokens. I did not know SOLR had that terms component. That is a good tip too.

Have a good weekend.

Thanks,
Yetkin

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, May 02, 2014 11:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching for tokens does not return any results

bq: but this index was created using a Java program using Lucene interface

Elaborating a bit on Koji's comment...
[snip - full message quoted earlier in the digest]

Re: Searching for tokens does not return any results

2014-05-02 Thread Erick Erickson
Glad to hear it!

You shouldn't really have to customize the analyzer to get it to
behave as it would if you just used Solr to ingest documents, just
chain things together. That's what Solr does after all. Of course you
may have special needs that are better served by more customization.
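
For illustration, "chaining things together" in raw Lucene 4.4 to mirror a
schema looks roughly like this (only the first two stages shown; the
remaining filters from schema.xml would be chained the same way, in the
same order):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class SchemaMirroringAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Mirrors <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
    // Mirrors <filter class="solr.LowerCaseFilterFactory"/>; chain
    // WordDelimiterFilter, StopFilter, etc. here in schema order.
    TokenStream result = new LowerCaseFilter(Version.LUCENE_44, source);
    return new TokenStreamComponents(source, result);
  }
}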

TermsComponent is a useful tool. Note that you also get raw terms if
you use the admin/schema-browser page, identify your field, and then
click the "show term info" button. That technique is somewhat limited,
though. The schema-browser page is especially useful for very small
indexes and/or test cases, I'll admit. I do vaguely remember something
not being right with the schema-browser at one point, though, so it might
not work as I expect for 4.4.

Best,
Erick

On Fri, May 2, 2014 at 1:56 PM, Yetkin Ozkucur yetkin.ozku...@asg.com wrote:
 Erick, Koji, Ahmet:

 Thank you all for your answers! I think I found the problem and I am on the
 right track to fix it.
 [snip - full message quoted earlier in the digest]

Spellchecking - looking for general advice

2014-05-02 Thread Maciej Dziardziel
Hi

I was looking at spellcheck (Direct and FileBased) and testing what they can do.
Direct works fine most of the time, but I'd like to find a solution for
a few corner cases:

1) Having "recruted" and "recruiter" in the index, "recruter" should
suggest the latter.
Obviously the distance to the former is smaller, so it may be
completely arbitrary,
and perhaps must be handled on the application side rather than in Solr.
2) "restraunt" doesn't suggest "restaurant" - I assume the distance
is too big for that.

Those are a few examples of queries that spellcheck gets (according to
my requirements) wrong.
For now I am just looking at possible solutions, and I'd need to come
up with an initial concept
to have something to show to users and get more feedback, likely with
more cases
to correct.

I'd like to know if there are some tweaks to the spellcheck component I
could make (or perhaps other ways of doing this with Solr),
or am I forced to hardcode a list of all the corrections that go beyond
what spellcheck can do?
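
For reference, the DirectSolrSpellChecker knobs live in the spellcheck
search component; a sketch with illustrative values (note that maxEdits is
capped at 2, and "restraunt" -> "restaurant" is edit distance 3, which is
why Direct cannot suggest it):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">direct</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="field">text</str>
      <int name="maxEdits">2</int>
      <int name="minPrefix">0</int>
      <float name="accuracy">0.4</float>
    </lst>
  </searchComponent>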

One solution I am considering is to put a list of those special cases
into FileBasedSpellChecker (it seems to be more relaxed, and handles the
"restraunt" case well) and fall back to Direct if this yields no
results... though I am not sure yet how well that would work in
practice if the list of misspelled words grows beyond the few I have now.
It most likely wouldn't scale.

Another possibility would be to analyze the list of user queries that
yield few results and check if there is a spellchecked
version that improves them... but that seems to require a human to
review the corrections.

Yet another thing I was thinking about would be to pull the terms into a
separate spellchecker (like aspell) and see if it does a better job or
is more tweakable.

That's a bit open ended problem, so any advice welcome.

--
Maciej Dziardziel
fied...@gmail.com


Re: ANNOUNCE: Apache Solr Reference Guide for 4.8

2014-05-02 Thread Alexandre Rafalovitch
Somebody should create an offline search interface for it. :-)

Regards,
Alex
On 02/05/2014 11:53 pm, Chris Hostetter hoss...@apache.org wrote:


 The Lucene PMC is pleased to announce that there is a new version of the
 Solr Reference Guide available for Solr 4.8.

 The 396 page PDF serves as the definitive user's manual for Solr 4.8. It
 can be downloaded from the Apache mirror network:

 https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/


 -Hoss