Re: Fw: highlighting on hl.alternateField (copyField target) doesn't highlight

2014-06-11 Thread jay list
Answer to myself:
Using solr.KeywordTokenizerFactory together with
solr.WordDelimiterFilterFactory preserves the original phone number and also
adds a concatenated token that contains no spaces.

input:  12345 67890
tokens: 12345 67890, 12345, 67890, 1234567890

Two advantages: I don't need another field, and the highlighter works as
expected.
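
For reference, a minimal sketch of such a field type (the filter parameters
below are an assumption on my part, not copied from my schema):

<fieldType name="phone" class="solr.TextField">
  <analyzer>
    <!-- keep the whole input as one token, then let the word delimiter
         split on the space, catenate the digits and keep the original -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateNumberParts="1"
            catenateNumbers="1"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>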
Best Regards.

 Sent: Thursday, 5 June 2014, 09:14
 From: jay list jay.l...@web.de
 To: solr-user@lucene.apache.org
 Subject: Fw: highlighting on hl.alternateField (copyField target) doesn't
 highlight

 Anybody knowing this issue?
 
  Sent: Tuesday, 3 June 2014, 09:11
  From: jay list jay.l...@web.de
  To: solr-user@lucene.apache.org
  Subject: highlighting on hl.alternateField (copyField target) doesn't
  highlight
 
  
  Hello,
   
  I'm trying to implement a user-friendly search for phone numbers. These
  numbers consist of two digit groups, like 12345 67890.
   
  Finally I want the phone number highlighted in the search result,
  regardless of whether the hit came from the field tel or the copyField
  tel2.
   
  The field tel is split by a StandardTokenizer into the two tokens 12345
  AND 67890.
  I also want to catch the people who enter 1234567890 without any space.
  For that I use a copyField to tel2, which runs a
  solr.PatternReplaceCharFilterFactory to eliminate non-digits, followed by
  a solr.KeywordTokenizerFactory, as sketched below.
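
  A sketch of that tel2 analysis chain (the regex is illustrative, my
  actual pattern may differ):

  <fieldType name="phone_compact" class="solr.TextField">
    <analyzer>
      <!-- strip all non-digits before tokenizing -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="[^0-9]" replacement=""/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>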
   
  In both cases the search hits as expected.
   
  The highlighter works well for tel or tel2, but I always want the
  highlight on the field tel!
  Using f.tel.hl.alternateField=tel2 returns the field value without any
  highlighting.
   
  <lst name="params">
    <str name="q">tel2:1234567890</str>
    <str name="f.tel.hl.alternateField">tel2</str>
    <str name="hl">true</str>
    <str name="hl.requireFieldMatch">true</str>
    <str name="hl.simple.pre">&lt;em&gt;</str>
    <str name="hl.simple.post">&lt;/em&gt;</str>
    <str name="hl.fl">tel,tel2</str>
    <str name="fl">tel,tel2</str>
    <str name="wt">xml</str>
    <str name="fq">typ:person</str>
  </lst>
  
  ...
  
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="uid">user1</str>
      <str name="tel">12345 67890</str>
      <str name="tels">12345 67890</str>
    </doc>
  </result>
  
  ...
  
  <lst name="highlighting">
    <lst name="user1">
      <arr name="tel">
        <str>123456 67890</str> <!-- here should be a highlight -->
      </arr>
      <arr name="tels">
        <str>&lt;em&gt;123456 67890&lt;/em&gt;</str>
      </arr>
    </lst>
  </lst>
  
  Any idea? Or do I have to change my Velocity macros to always look for a
  different highlighted field?
  Best Regards



Can we do conditional boosting using edismax?

2014-06-11 Thread Shamik Bandopadhyay
Hi,

  I'm using the edismax parser to perform runtime boosting. Here's my sample
request handler entry.

<str name="qf">text^2 title^3</str>
<str name="bq">Source:Blog^3 Source2:Videos^2</str>
<str name="bf">recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^2.0</str>

As you can see, I'm adding weights to text and title, as well as boosting
on source. What I'm trying to find out is whether there's a way to change
the weights based on Source. E.g. for source Blog I would like the boost
text^3 title^2, while for source Videos I prefer text^2 title^3.

Any pointers will be appreciated.

Thanks,
Shamik


Re: How Can I modify the DocList and DocSet in solr

2014-06-11 Thread Vishnu Mishra
Thanks for the reply. I found one solution: modify the DocList and DocSet
after searching. Look at the following code snippet.

private void sortByRecordIDNew(SolrIndexSearcher.QueryResult result,
    ResponseBuilder rb) throws IOException {

  DocList docList = result.getDocListAndSet().docList;

  // projectSort == 0 -> descending order, otherwise ascending
  SortedMap<Integer, Integer> sortedMap;
  if (projectSort == 0) {
    sortedMap = new TreeMap<Integer, Integer>(Collections.reverseOrder());
  } else {
    sortedMap = new TreeMap<Integer, Integer>();
  }

  DocIterator iterator = docList.iterator();
  while (iterator.hasNext()) {
    int docId = iterator.nextDoc();
    Document d = rb.req.getSearcher().doc(docId);
    // dbData is a map from the unique key in schema.xml to the
    // record id from the database
    Integer val = dbData.get(d.get(ID));
    sortedMap.put(val, docId);
  }

  float[] scores = new float[docList.size()];
  int[] docs = new int[docList.size()];
  int docCounter = 0;
  float maxScore = 0;

  for (int recordID : sortedMap.keySet()) {
    int docId = sortedMap.get(recordID);
    scores[docCounter] = 1.0f;
    docs[docCounter] = docId;
    docCounter++;
  }

  docList = new DocSlice(0, docCounter, docs, scores, 0, maxScore);
  result.setDocList(docList);
}


Call this method from QueryComponent's process method after the search.
In the code above, the DocList is sorted in ascending or descending order
depending on the user's requirement. It works for me.






Re: Can we do conditional boosting using edismax?

2014-06-11 Thread Ahmet Arslan
Hi Shamik,

Yes, it is possible with the map and query functions.

Please see Jan's example:
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
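
A sketch of the idea with edismax (Source:Blog and the weight values come
from your mail; the exact composition is my assumption, adjust as needed):

  q=your+search+terms&defType=edismax
  &boost=map(query($blogq,0),0,0,1,3)
  &blogq=Source:Blog

Here query($blogq,0) scores 0 for documents that do not match Source:Blog,
and map(x,0,0,1,3) maps that 0 to a multiplicative boost of 1 while any
other score becomes 3, so Blog documents get boosted by 3. Per-field weight
switching can be built by nesting such expressions the same way.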



On Wednesday, June 11, 2014 9:34 AM, Shamik Bandopadhyay sham...@gmail.com 
wrote:
Hi,

  I'm using the edismax parser to perform runtime boosting. Here's my sample
request handler entry.

<str name="qf">text^2 title^3</str>
<str name="bq">Source:Blog^3 Source2:Videos^2</str>
<str name="bf">recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^2.0</str>

As you can see, I'm adding weights to text and title, as well as boosting
on source. What I'm trying to find out is whether there's a way to change
the weights based on Source. E.g. for source Blog I would like the boost
text^3 title^2, while for source Videos I prefer text^2 title^3.

Any pointers will be appreciated.

Thanks,
Shamik



Re: Performance/scaling with custom function queries

2014-06-11 Thread Robert Krüger
Would Solr use multithreading to process the records of a function query
as described above? In my scenario concurrent searches are not the issue;
rather, the speed of a single query is the optimization target. Or will I
have to set up distributed search to achieve that?

Thanks,

Robert

On Tue, Jun 10, 2014 at 10:11 AM, Robert Krüger krue...@lesspain.de wrote:
 Great, I was hoping for that. In my case I will have to deal with the
 worst-case scenario, i.e. all documents matching the query, because the
 only criterion is the fingerprint and the result of the
 distance/similarity function, which will have to be executed for every
 document. However, I am dealing with a scenario where there will not be
 many concurrent users.

 Thank you.

 On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein joels...@gmail.com wrote:
 You only need to have fast access to the fingerprint field so only that
 field needs to be in memory. You'll want to review how Lucene DocValues and
 FieldCache work. Sorting is done with a PriorityQueue so only the top N
 docs are kept in memory.

 You'll only need to access the fingerprint field values for documents that
 match the query, so it won't be a full table scan unless all the docs match
 the query.

 Sounds like an interesting project. Please keep us posted.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 let's say I have an index that contains a field of type BinaryField
 called fingerprint that stores a few (let's say 100) bytes that are
 some kind of digital fingerprint-like thing.

 Let's say I want to perform queries on that field to achieve sorting or
 filtering based on a kind of custom distance function customDistance,
 i.e. I input a reference fingerprint and Solr returns either all documents
 sorted by customDistance(referenceFingerprint, documentFingerprint), or
 uses that in an frange expression for filtering.
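
 For illustration, the filter variant might look like this (customDistance
 is my hypothetical function name, registered via a custom
 ValueSourceParser; $ref would carry the reference fingerprint):

 fq={!frange l=0 u=0.25}customDistance(fingerprint,$ref)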

 I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I do
 understand that using function queries with a custom function is
 definitely expensive, as it results in what the SQL world calls a full
 table scan, i.e. data from all documents needs to be touched to select the
 correct documents or to sort by a function's result.

 Given all that, and provided I have to use a custom function for my
 needs, I would like to know a few more details about Solr's architecture
 to understand what I have to look out for.

 I will have potentially millions of records. Does the data contained in
 other index fields play a role for RAM usage when I only use the
 fingerprint field for sorting and searching? I am hoping to calculate that
 my RAM should be able to accommodate the fingerprint data of all available
 documents for the queries to be fast, but not the fingerprint data plus
 all other indexed or stored data.

 Example: my fingerprint data needs 100 bytes per document, my other
 indexed field data needs 900 bytes per document. Will I need 100 MB or
 1 GB to fit all the data needed to process one query in memory?

 Are there other things to be aware of?

 Thanks,

 Robert




 --
 Robert Krüger
 Managing Partner
 Lesspain GmbH  Co. KG

 www.lesspain-software.com



-- 
Robert Krüger
Managing Partner
Lesspain GmbH  Co. KG

www.lesspain-software.com


Re: Documents Added Not Available After Commit (Both Soft and Hard)

2014-06-11 Thread Justin Sweeney
Thanks for the input!

Erick - To clarify, we see the No Uncommitted Changes message repeatedly
for a number of commits (not a consistent number each time this happens),
and then eventually we see a commit that successfully finds changes, at
which point the documents become available.

Shalin - That bug looks like it could be related to our case. Did you
notice any impact of the bug in situations where there were not just
pending deletes by term? In our case we are adding documents; we do have
some deletes, but the bulk are adds. We can see the adds logged in the Solr
log prior to the No Uncommitted Changes message.

Either way, it may be useful for us to upgrade and see if it fixes the
issue. I'll let you know if that works out once we get a chance to do that.

Thanks,
Justin


On Mon, Jun 9, 2014 at 3:02 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 I think this may be the same bug as LUCENE-5289 which was fixed in 4.5.1.
 Can you upgrade to 4.5.1 and see if that solves the problem?




 On Fri, Jun 6, 2014 at 7:17 PM, Justin Sweeney justin.sweene...@gmail.com
 
 wrote:

  Hi,
 
  An application I am working on indexes documents to a Solr index. This
 Solr
  index is setup as a single node, without any replication. This index is
  running Solr 4.5.0.
 
  We have noticed an issue lately that is causing some problems for our
  application. The problem is that we add/update a number of documents in
 the
  Solr index and we have the index setup to autoCommit (hard) once every 30
  minutes. In the Solr logs, I am able to see the add command to Solr and I
  can also see Solr start the hard commit. When this hard commit occurs, we
  see the following message:
  INFO  - 2014-06-04 20:13:55.135;
  org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
  Skipping IW.commit.
 
  This only happens sometimes, but Solr will go hours (we have seen 6-12
  hours of this behavior) before it does a hard commit that finds changes.
  After the hard commit where the changes are found, we are then able to
  search for and find the documents that were added hours ago, but up until
  that point the documents are not searchable.
 
  We tried enabling autoSoftCommit every 5 minutes in the hope that this
  would help, but we are seeing the same behavior.
 
  Here is a sampling of the logs showing this occurring (I've trimmed it
 down
  to just show what is happening):
 
  INFO  - 2014-06-05 20:00:41.300;
   org.apache.solr.update.processor.LogUpdateProcessor; [zoomCollection]
    webapp=/solr path=/update params={wt=javabin&version=2}
  {add=[359453225]} 0
   0
  
   INFO  - 2014-06-05 20:00:41.376;
   org.apache.solr.update.processor.LogUpdateProcessor; [zoomCollection]
    webapp=/solr path=/update params={wt=javabin&version=2}
  {add=[347170717]} 0
   1
  
   INFO  - 2014-06-05 20:00:51.527;
   org.apache.solr.update.DirectUpdateHandler2; start
  
 
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
  
   INFO  - 2014-06-05 20:00:51.533;
  org.apache.solr.search.SolrIndexSearcher;
   Opening Searcher@257c43d main
  
   INFO  - 2014-06-05 20:00:51.533;
   org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
  
   INFO  - 2014-06-05 20:00:51.545;
  org.apache.solr.core.QuerySenderListener;
   QuerySenderListener sending requests to Searcher@257c43d
   main{StandardDirectoryReader(segments_acl:1367002775953
   _2f28(4.5):C13583563/4081507 _2gl6(4.5):C2754573/193533
   _2g21(4.5):C1046256/296354 _2ge2(4.5):C835858/206139
   _2gqd(4.5):C383500/31051 _2gmu(4.5):C125197/32491
 _2grl(4.5):C46906/1255
   _2gpj(4.5):C66480/16562 _2gra(4.5):C364/22 _2gr1(4.5):C36064/2556
   _2gqg(4.5):C42504/21515 _2gqm(4.5):C26821/12659
 _2gqu(4.5):C24172/10240
   _2gqy(4.5):C697/215 _2gr2(4.5):C878/352 _2gr7(4.5):C28135/11775
   _2gr9(4.5):C3276/1341 _2grb(4.5):C5/1 _2grc(4.5):C3247/1219
  _2grd(4.5):C6/1
   _2grf(4.5):C5/2 _2grg(4.5):C23659/10967 _2grh(4.5):C1 _2grj(4.5):C1
   _2grk(4.5):C5160/1482 _2grm(4.5):C1210/351 _2grn(4.5):C3957/1372
   _2gro(4.5):C7734/2207 _2grp(4.5):C220/36)}
  
   INFO  - 2014-06-05 20:00:51.546; org.apache.solr.core.SolrCore;
   [zoomCollection] webapp=null path=null
    params={event=newSearcher&q=d_name:ibm&distrib=false} hits=38 status=0
   QTime=0
  
   INFO  - 2014-06-05 20:00:51.546;
  org.apache.solr.core.QuerySenderListener;
   QuerySenderListener done.
  
   INFO  - 2014-06-05 20:00:51.547; org.apache.solr.core.SolrCore;
   [zoomCollection] Registered new searcher Searcher@257c43d
   main{StandardDirectoryReader(segments_acl:1367002775953
   _2f28(4.5):C13583563/4081507 _2gl6(4.5):C2754573/193533
   _2g21(4.5):C1046256/296354 _2ge2(4.5):C835858/206139
   _2gqd(4.5):C383500/31051 _2gmu(4.5):C125197/32491
 _2grl(4.5):C46906/1255
   _2gpj(4.5):C66480/16562 _2gra(4.5):C364/22 _2gr1(4.5):C36064/2556
   _2gqg(4.5):C42504/21515 _2gqm(4.5):C26821/12659
 _2gqu(4.5):C24172/10240
   _2gqy(4.5):C697/215 

Hunspell inaccuracies with Solr 4.8.1 and French dictionaries

2014-06-11 Thread Quévat Benoît
Hello,

I just moved from Solr 4.6 to Solr 4.8.1 and I notice differences in the
way Hunspell works.

Some changes are fixes (due to
https://issues.apache.org/jira/browse/LUCENE-5483, I assume) but other
changes look like regressions.
To check this, I have compared the results shown in the Analysis tab of the
Solr admin with the results obtained using the hunspell -m command with the
same dictionaries.

Command line results:
$ hunspell -m -d /DATA/solr-adscope-fr/adscope-fr/conf/fr-moderne
bricolait
bricolait  st:bricoler po:v1it is:iimp is:3sg
instituteur
instituteur  st:institutrice po:nom is:mas is:sg

Solr Analysis tab results (I'm using HunspellStemFilterFactory):

bricolait -> bricolait
instituteur -> instituteur

The dictionary and affix file are available at this address: 
http://www.dicollecte.org/download.php?prj=fr

As shown above, the words bricolait and instituteur are correctly stemmed
on the command line but not with the Solr filter.
These examples were working correctly with Solr 4.6.

Is it something I should open a JIRA issue about?

Thanks,

Benoît.


Non-Heap OOM Error with Small Index Size

2014-06-11 Thread msoltow
While running a Solr-based web application on Tomcat 6, we have been
repeatedly running into out-of-memory issues. However, these OOM errors are
not related to the Java heap. A snapshot of our Solr dashboard just before
the OOM error reported:

Physical memory: 7.13/7.29 GB
JVM-Memory: 57.90 MB - 3.05 GB - 3.56 GB

In addition, the top command displays:
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
25382 tomcat20   0 9716m 6.8g 175m S 47.9 92.6 196:13.70 java

We're unsure why the physical memory usage is so much higher than the JVM
usage, especially given that the size of our index is roughly 500 MB. We
were originally using OpenJDK, and we tried switching to the Oracle JDK
with no luck.

Is it normal for physical memory usage to be this high?  We do not want to
upgrade our RAM if the problem is really just an error in the configuration.

I've attached environment info below, as well as an excerpt of the latest
OOM error report.

Thank you very much in advance.

Kind regards,
Michael


Additional info about our application:
We index documents from a remote location by retrieving them via a REST API. 
The entire remote repository is crawled at regular intervals by our
application.  Twenty-five documents are loaded at a time (the page size
provided by the API), and we manually commit each set of twenty-five
documents.  We do have auto-commit (but not auto-soft-commit) enabled with a
time of 60s, but an auto-commit has never actually occurred.

Solr Info:
Solr 4.8.0
524 MB Index Size
31 Fields
Just under 3000 documents
Directory factory is MMapDirectory
Caches enabled with default settings/size limits

Selected JVM Arguments:
-XX:MaxPermSize=128m
-Dorg.apache.pdfbox.baseParser.pushBackSize=524288
-Xmx4096m
-Xms1024m

Environment:
64-bit AWS EC2 running CentOS 6.5
Tomcat 6.0.24
7.5 GB RAM
Tried using both Oracle JDK 1.7.0_60 and Open JDK

OOM Log Entry:
OpenJDK 64-Bit Server VM warning: INFO:
os::commit_memory(0x000773c8, 366477312, 0) failed; error='Cannot
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 366477312 bytes for
committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-18372/hs_error.log

OOM Error Report Snippet:
# Native memory allocation (malloc) failed to allocate 366477312 bytes for
committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2769), pid=18372, tid=140031038150400
#
# JRE version: OpenJDK Runtime Environment (7.0_55-b13) (build
1.7.0_55-mockbuild_2014_04_16_12_11-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.51-b03 mixed mode linux-amd64
compressed oops)





Re: Performance/scaling with custom function queries

2014-06-11 Thread david.w.smi...@gmail.com
On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger krue...@lesspain.de wrote:

 Or will I have to set up distributed search to achieve that?


Yes — you have to shard it to achieve that.  The shards could be on the
same node.

There were some discussions this year in JIRA about being able to do
thread-per-segment but it’s not quite there yet.  FWIW I think it would be
a nice option for some use-cases (like yours).

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley


Re: Performance/scaling with custom function queries

2014-06-11 Thread Joel Bernstein
In Solr 4.9 there is a feature called RankQueries that allows you to plug
in your own ranking collector. So, if you wanted to write a ranking/sorting
collector that used a thread per segment, you could cleanly plug it in.

Joel Bernstein
Search Engineer at Heliosearch


On Wed, Jun 11, 2014 at 9:39 AM, david.w.smi...@gmail.com 
david.w.smi...@gmail.com wrote:

 On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger krue...@lesspain.de
 wrote:

  Or will I have to set up distributed search to achieve that?


 Yes — you have to shard it to achieve that.  The shards could be on the
 same node.

 There were some discussions this year in JIRA about being able to do
 thread-per-segment but it’s not quite there yet.  FWIW I think it would be
 a nice option for some use-cases (like yours).

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley



Solr search

2014-06-11 Thread Shay Sofer
Hi,

Any suggestion for a tokenizer / filter / other solution that supports the
following searches in Solr?

Use Case              Input      Solr should return
--------------------  ---------  --------------------------------------------
All Results           *          All results
Prefix Search         Text*      All data starting with Text (prefix search)
Exact Search          Auto Text  Exact match: only Auto Text
Partial (substring)   *Text*     All strings containing the text


Now I'm using KeywordTokenizerFactory and WordDelimiterFilterFactory.

My issue is with exact search:
When I have a document named hello_world and I do an exact search for
hello, I get hello_world as a result (I want to get only documents named
hello).

Thanks in advance,
Shay.



Re: Solr search

2014-06-11 Thread Shawn Heisey
 Hi,

 Any suggestion for a tokenizer / filter / other solution that supports
 the following searches in Solr?

 Use Case              Input      Solr should return
 --------------------  ---------  -------------------------------------------
 All Results           *          All results
 Prefix Search         Text*      All data starting with Text (prefix search)
 Exact Search          Auto Text  Exact match: only Auto Text
 Partial (substring)   *Text*     All strings containing the text

 Now I'm using KeywordTokenizerFactory and WordDelimiterFilterFactory.

 My issue is with exact search:
 When I have a document named hello_world and I do an exact search for
 hello, I get hello_world as a result (I want to get only documents named
 hello).

The WordDelimiterFilter will split on the underscore, which means that the
term hello is in the index for that document. Leave that filter out if
you really do want an exact match.
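
A minimal sketch of such an exact-match field type (the lowercase filter is
my own assumption, for case-insensitive matching):

<fieldType name="string_exact" class="solr.TextField">
  <analyzer>
    <!-- the whole input becomes a single token; no split on _ -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>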

Searching for * by itself is not how you match all documents. It may
work, but it is a wildcard search, which under the covers means a search
for every term in the index for that field. It's SLOW. The special
shortcut *:* (this must be the entire query with no field name, and I'm
assuming the standard query parser here) is what you want for all
documents. In terms of user input, this is what you want to use when the
user leaves the search box empty. If you're using dismax or edismax, then
you would send an empty q parameter or leave it off entirely, and define a
default q.alt parameter in solrconfig.xml, set to *:* for all docs.
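
As a sketch, the solrconfig.xml defaults would look something like this
(handler name assumed):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- match all documents when the user submits an empty query -->
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>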

Thanks,
Shawn






How to retrieve entire field value (text_general) in custom function?

2014-06-11 Thread Costi Muraru
I have a text_general field and want to use its value in a custom function.
I'm unable to do so: it seems that the tokenizer messes this up and only a
fraction of the entire value is retrieved. See below for more details.

<doc>
  <str name="id">1</str>
  <str name="field_t">term1 term2 term3</str>
  <long name="_version_">1470628088879513600</long>
</doc>
<doc>
  <str name="id">2</str>
  <str name="field_t">x1 x2 x3</str>
  <long name="_version_">1470628088907825152</long>
</doc>


public class MyFunction extends ValueSource {

    private final ValueSource valueSource; // source wrapping the field_t field

    public MyFunction(ValueSource valueSource) {
        this.valueSource = valueSource;
    }

    @Override
    public FunctionValues getValues(Map context, AtomicReaderContext
            readerContext) throws IOException {
        final FunctionValues values = valueSource.getValues(context,
                readerContext);
        return new StrDocValues(this) {

            @Override
            public String strVal(int doc) {
                // returns only a single indexed term, not the original value
                return values.strVal(doc);
            }
        };
    }
}

Tried with SOLR 4.8.1.

Function returns:
- term3 (for first document)
- null (for the second document)

I want the function to return:
- term1 term2 term3 (for first document)
- x1 x2 x3 (for the second document)

How can I achieve this? I tried to Google it, but no luck. I also looked
through the Solr code but could not find anything similar.

Thanks!
Costi


Problem faceting

2014-06-11 Thread marcos palacios
Hello everyone.



I’m having problems with the performance of queries with facets; the time
spent resolving a query is very high.



The index has 10 million documents, each with 100 fields.

The server has 8 cores and 56 GB of RAM, running with Jetty and this
memory configuration: -Xms24096m -Xmx44576m



When I do a query with 20 facets, the time spent is 4-5 seconds. If the
same request is issued a second time, the query itself becomes fast but the
facet time stays high, as the debug output below shows.



Debug query, first execution:

<double name="time">6037.0</double>
<lst name="query"><double name="time">265.0</double></lst>
<lst name="facet"><double name="time">5772.0</double></lst>



Debug query, second execution:

<double name="time">6037.0</double>
<lst name="query"><double name="time">1.0</double></lst>
<lst name="facet"><double name="time">4872.0</double></lst>





What can I do? Why are the facets not cached?





Thank you, Marcos


Re: How to retrieve entire field value (text_general) in custom function?

2014-06-11 Thread Shawn Heisey
On 6/11/2014 9:30 AM, Costi Muraru wrote:
 I have a text_general field and want to use its value in a custom function.
 I'm unable to do so: it seems that the tokenizer messes this up and only a
 fraction of the entire value is retrieved. See below for more details.

Low-level Lucene details are where my knowledge falls extremely short
... but if you are accessing data in the index itself, you're going to
get terms, not the original value.  You need to access the stored data
or docValues to see the full original text.
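
A rough, untested sketch of the stored-data route, as a variant of the
getValues above (assumes field_t is stored; note that loading stored fields
per document inside a function is slow):

@Override
public FunctionValues getValues(Map context,
        final AtomicReaderContext readerContext) throws IOException {
    return new StrDocValues(this) {
        @Override
        public String strVal(int doc) {
            try {
                // read the stored (original) value instead of indexed terms
                return readerContext.reader().document(doc).get("field_t");
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    };
}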

I can't answer the question of whether or not this is something that is
accessible (or even makes sense) at the level where your custom code
lives, because I simply don't understand those details.

Thanks,
Shawn



moving to new core.properties setup

2014-06-11 Thread David Santamauro


I have configured many tomcat+solrCloud setups, but I'm now trying to
research the new core.properties configuration.


I have a functioning ZooKeeper, to which I manually loaded a configuration
using:


zkcli.sh -cmd upconfig \
  -zkhost xx.xx.xx.xx:2181 \
  -d /test/conf \
  -n test

My solr.xml looks like:

<solr>
  <str name="coreRootDirectory">/test/data</str>
  <bool name="sharedSchema">true</bool>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">8080</int>
    <str name="hostContext">${hostContext:/test}</str>
    <int name="zkClientTimeout">${zkClientTimeout:3}</int>
    <str name="zkhost">xx.xx.xx.xx:2181</str>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
</solr>

... all fine. I start Tomcat and I see:

 Loading container configuration from /test/solr.xml
[...]
 Looking for core definitions underneath /test/data
 Found 0 core definitions

which is expected, as I have not created any cores or collections yet.

Then, trying to create a collection

wget -O- \
'http://xx.xx.xx.xx/test/admin/collections?action=CREATE&name=testCollection&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.config=test&property.dataDir=/test/data/testCollection&property.instanceDir=/test'

I get:

 org.apache.solr.common.SolrException: Solr instance is not running in 
SolrCloud mode.


Hrmmm, here I am confused. I have a working ZooKeeper, I have a loaded
configuration, I have an empty data directory (no collections, cores,
core.properties, etc.) and I have specified the zkHost configuration
parameter in my solr.xml (yes, IP:port is correct).


What exactly am I missing?

thanks for the help.

David



Re: Can we do conditional boosting using edismax?

2014-06-11 Thread shamik
Thanks Ahmet, I'll give it a shot.





Implementing Hive query in Solr

2014-06-11 Thread Vivekanand Ittigi
Hi,

My requirement is to execute this Hive query in Solr:

SELECT SUM(Primary_cause_vaR), collect_set(skuType), RiskType, market,
       collect_set(primary_cause)
FROM bil_tos
WHERE skuType = 'Product'
GROUP BY RiskType, market;

I can implement the sum and group-by operations in Solr using the
StatsComponent (roughly as sketched below), but I have no idea how to
implement collect_set() in Solr.
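
For reference, the part I have working looks roughly like this (note that
stats.facet computes stats per facet field independently, so it is not a
true multi-field group-by):

  q=*:*&rows=0
  &stats=true
  &stats.field=Primary_cause_vaR
  &stats.facet=RiskType
  &stats.facet=market
  &fq=skuType:Product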

collect_set() is used in Hive queries.
Please provide an equivalent function for collect_set in Solr, or links, or
how to achieve it. It'd be a great help.


Thanks,
Vivek