Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-05-31 Thread Maciej Ł. PCSS

Shawn, thank you for your response.

In the end, my search is based on two kinds of fields (string and text, 
both ignoring case and special characters) that can contain text in 
any language, but mainly Polish or English. The two main 
requirements were:

1) Google-like search for quick lookups,
2) Precise multi-criteria search.

For the first requirement we use the "ascii_ignorecase_text" field type 
(below). For the second we use "ascii_ignorecase_string". Very often 
the customer knows only part of an identifier, sample name, user's 
surname, or address, and still wants to search by that partial 
information. The application is for exploring a scientific database of 
biological samples, each of which has many attributes.


Considering the above I'm fine with the following type definitions:

[The two fieldType definitions were stripped by the list archive; only the
attribute positionIncrementGap="100" survives from each.]

Thank you for your help!

Regards
Maciej Łabędzki


On 02.02.2017 at 16:55, Shawn Heisey wrote:

On 2/2/2017 8:15 AM, Maciej Ł. PCSS wrote:

Regardless of the value of such a use case, there is another thing
that remains unclear to me.

Does SOLR support a simple and silly 'exact substring match'? I mean,
is it possible to search for (actually filter by) a raw substring
without tokenization and without any kind of processing or simplifying
of the searched information? By a 'raw substring' I mean a character
string that can contain, among other things, non-letters (colons,
brackets, etc.): basically everything the user is able to type on a
keyboard.

Is this use case within SOLR's technical capabilities, even if that
means a big efficiency cost?

Because you want to do substring matches, things are somewhat more
complicated than if you wanted to do a full exact-string-only query.

First I'll tackle the full exact query idea, because the info is also
important for substrings:

If the class in the fieldType is "solr.StrField" then the input will be
indexed exactly as it is sent, with all characters preserved, and the
query must supply every one of those characters to match.

On the query side, you would need to escape any special characters in
the query string -- spaces, colons, and several other characters.
Escaping is done with the backslash.  If you are manually constructing
URL parameters for an HTTP request, you would also need to be aware of
URL encoding.  Some Solr libraries (like SolrJ) are capable of handling
all the URL encoding for you.
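
The escaping and URL-encoding steps described above can be sketched in Python using only the standard library. This is a minimal illustration, not a Solr client: the field name sample_id and the helper names are made up, and the character set is taken from the Lucene classic query parser's list of special characters, with space added as noted for Solr.

```python
from urllib.parse import urlencode

# Characters the Lucene classic query parser treats as special, plus space.
# '&' and '|' are escaped individually, which also covers '&&' and '||'.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/ ')


def escape_query_value(value: str) -> str:
    """Backslash-escape every special character in a raw query value."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in value)


def build_select_params(field: str, raw_value: str) -> str:
    """Return a URL-encoded query string for an exact match on one field."""
    query = f"{field}:{escape_query_value(raw_value)}"
    return urlencode({"q": query})


# 'sample_id' is a hypothetical field name used only for illustration.
print(escape_query_value("abc def (1:2)"))
print(build_select_params("sample_id", "abc def (1:2)"))
```

The two steps are kept separate on purpose: backslash escaping is for the query parser, while URL encoding is for HTTP transport, and a client library may already take care of the latter.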

Matching *substrings* with StrField would involve either a regular
expression query (with .* before and after) or a wildcard query, which
Erick described in his reply.

An alternate way to do substring matches is the NGram or EdgeNGram
filters, and not using wildcards or regex.  This method will increase
your index size, possibly by a large amount.  To use this method, you'd
need to switch back to solr.TextField, use the keyword tokenizer, and
then follow that with the appropriate NGram filter.  Depending on your
exact needs, you might only do the NGram filter on the index side, or
you might need it on both index and query analysis.  Escaping special
characters on the query side would still be required.
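
To make the trade-off of the n-gram approach concrete, here is a toy Python model of it. It is only a sketch of the idea, not of Solr's actual analysis chain: each whole field value is treated as one token (as the keyword tokenizer would), lowercased, and expanded into every substring between a minimum and maximum length. The gram sizes and sample documents are arbitrary assumptions.

```python
def ngrams(value: str, min_gram: int, max_gram: int) -> set[str]:
    """All substrings of `value` with length between min_gram and max_gram."""
    return {
        value[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(value) - n + 1)
    }


def build_index(docs: dict[str, str], min_gram: int, max_gram: int) -> dict[str, set[str]]:
    """Map each lowercased n-gram to the ids of the documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, value in docs.items():
        for gram in ngrams(value.lower(), min_gram, max_gram):
            index.setdefault(gram, set()).add(doc_id)
    return index


# Hypothetical documents: id -> field value.
docs = {"1": "abc def ghi", "2": "xyz"}
index = build_index(docs, min_gram=2, max_gram=7)

# A substring query becomes a plain term lookup, with no wildcard needed.
print(sorted(index.get("c def g", set())))
# The number of indexed terms is far larger than the number of documents.
print(len(index))
```

In this model a lookup only succeeds when the query length falls between the minimum and maximum gram size, which is one reason the gram sizes and the query-side analysis need to be chosen together.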

The full list of characters that require escaping is at the end of this
page:

http://lucene.apache.org/core/6_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

Note that it shows && and || as special characters, even though these
are in fact two characters each.  Typically even a single instance of
these characters requires escaping.  Solr will also need spaces to be
escaped.

Thanks,
Shawn




Re: Indexing of documents in more than one step (SOLRJ)

2017-02-15 Thread Maciej Ł. PCSS

No, that's not the case. In both steps I'm indexing documents with the 
same set of IDs (I mean the values of the 'id' field).


Maciej


On 15.02.2017 at 11:07, Emir Arnautovic wrote:
I did not have time to test it or look at the code, but can you check 
whether this happens when no document with fields a, b, c exists yet 
and you try to update it with d, e, f using partial update syntax?


Emir


On 15.02.2017 09:25, Maciej Ł. PCSS wrote:

Dear All,
how should I handle the following scenario using SOLRJ?  Index a 
collection of documents (fill fields a, b, c). Then index the same 
collection but this time fill fields d, e, f.


In pseudo-code it would be: step1(collectionX); step2(collectionX); 
solrCommit();


See my observations below:
- first step is done by calling SolrInputDocument.addField(fieldName, 
value); and this works fine.
- if I do the same for the second step then all fields in my 
documents get removed;
- for that reason I need to call 
SolrInputDocument.addField(fieldName, Collections.singletonMap("set", 
value)); and then it's fine
- but for some fields, if I make the call above, the indexed 
values look like "{set=value}" instead of just "value".
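
For reference, the singletonMap("set", value) call in the list above corresponds to Solr's atomic-update syntax, in which a field value is wrapped in a map whose key is the modifier. The following Python sketch (not SolrJ) only shows the shape of the equivalent JSON payload. The document id and field values are made up, and the sketch does not attempt to explain the "{set=value}" symptom.

```python
import json


def atomic_update(doc_id: str, fields: dict[str, object]) -> dict[str, object]:
    """Build one atomic-update document: each field is wrapped in {"set": value}."""
    doc: dict[str, object] = {"id": doc_id}  # the unique key is sent as a plain value
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc


# Second indexing step: only fields d, e, f are sent; "sample-1" is a made-up id.
payload = [atomic_update("sample-1", {"d": "value d", "e": "value e", "f": 42})]
print(json.dumps(payload, indent=2))
```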


Can somebody explain this strange behaviour to me?

Regards
Maciej







Indexing of documents in more than one step (SOLRJ)

2017-02-15 Thread Maciej Ł. PCSS

Dear All,
how should I handle the following scenario using SOLRJ?  Index a 
collection of documents (fill fields a, b, c). Then index the same 
collection but this time fill fields d, e, f.


In pseudo-code it would be: step1(collectionX); step2(collectionX); 
solrCommit();


See my observations below:
- first step is done by calling SolrInputDocument.addField(fieldName, 
value); and this works fine.
- if I do the same for the second step then all fields in my documents 
get removed;
- for that reason I need to call SolrInputDocument.addField(fieldName, 
Collections.singletonMap("set", value)); and then it's fine
- but for some fields, if I make the call above, the indexed 
values look like "{set=value}" instead of just "value".


Can somebody explain this strange behaviour to me?

Regards
Maciej



Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-02-02 Thread Maciej Ł. PCSS

Hi Erick, All,

Regardless of the value of such a use case, there is another thing that 
remains unclear to me.


Does SOLR support a simple and silly 'exact substring match'? I mean, is 
it possible to search for (actually filter by) a raw substring without 
tokenization and without any kind of processing or simplifying of the 
searched information? By a 'raw substring' I mean a character string 
that can contain, among other things, non-letters (colons, brackets, 
etc.): basically everything the user is able to type on a keyboard.


Is this use case within SOLR's technical capabilities, even if that means 
a big efficiency cost?


Regards
Maciej


On 30.01.2017 at 17:12, Erick Erickson wrote:

Well, the usual Solr solution to leading and trailing wildcards is to
ngram the field. You can get the entire field (including spaces) to be
analyzed as a whole by using KeywordTokenizer. Sometimes you wind up
using a copyField to support this and search against one or the other
as necessary.

You can do this with KeywordTokenizer and '*a bcd ef*', but that'll be
slow for the exact same reason the SQL query is slow: it has to
examine every value in every document to find the terms that match, then
search on those.
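
The cost difference can be illustrated with a deliberately simplified Python model of a sorted term dictionary. This is not how Lucene is implemented internally; it only shows why a known prefix lets a lookup jump to the right place, while a leading wildcard leaves no starting point and forces a test of every term. The terms and helper names are made up.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

# Hypothetical sorted term dictionary of a keyword-tokenized field.
terms = sorted(["abc def ghi", "abd", "bcd ef", "xa bcd efz", "zzz"])


def prefix_lookup(prefix: str) -> list[str]:
    """Binary-search to the first candidate, then read only matching terms."""
    start = bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


def substring_scan(pattern: str) -> tuple[list[str], int]:
    """A leading wildcard gives no starting point, so every term is tested."""
    examined = 0
    matches = []
    for term in terms:
        examined += 1
        if fnmatchcase(term, pattern):
            matches.append(term)
    return matches, examined


print(prefix_lookup("ab"))
print(substring_scan("*a bcd ef*"))
```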

There's some index size cost here so you'll have to test.

Really go back to your use-case to see if this is _really_ necessary
though. Often people think it is because that's the only way they've
been able to search at all in SQL and it can turn out that there are
other ways to solve it. IOW, this could be an XY problem.

Best,
Erick

On Mon, Jan 30, 2017 at 1:52 AM, Maciej Ł. PCSS <labed...@man.poznan.pl> wrote:

Hi All,

What solution have you applied in your implementations?

Regards
Maciej


On 24.01.2017 at 14:10, Maciej Ł. PCSS wrote:

Dear SOLR users,

please point me to the right solution to my problem. I'm using SOLR to
implement a Google-like search in my application, and that scenario is
working fine.

However, in specific use cases I need to filter documents that include a
specific substring in a given field. It's an SQL-like query similar to
this:

SELECT * FROM table WHERE someField LIKE '%c def g%'

I expect to match documents having someField = 'abc def ghi'. That means I
expect to match parts of words.

As I understand it, SOLR, as an inverted index, works with tokens rather
than character strings and therefore looks for whole words (not substrings).

Is there any solution for such an issue?

Regards
Maciej Łabędzki






Re: Query structure

2017-02-01 Thread Maciej Ł. PCSS

You should be able to put 'facetMetatagDatePrefix4:2015 OR 
facetMetatagDatePrefix4:2016' into the filter query.
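
As a rough sketch of that suggestion, the request parameters could be assembled like this in Python. The helper name is made up, and the phrase quotes around "this text" are an assumption about what the original query intended; only the field names and values come from the question.

```python
from urllib.parse import urlencode


def build_query_string(main_query: str, filter_query: str) -> str:
    """URL-encode a main query ('q') and one filter query ('fq')."""
    return urlencode({"q": main_query, "fq": filter_query})


params = build_query_string(
    'defaultSearchField:"this text"',  # phrase quotes are an assumption
    "facetMetatagDatePrefix4:2015 OR facetMetatagDatePrefix4:2016",
)
print(params)
```

Keeping the date restriction in its own parameter means the OR only applies inside the filter, so no extra parentheses are needed in the main query.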


Maciej


On 01.02.2017 at 13:43, KRIS MUSSHORN wrote:

I really need some guidance on this query structure issue.

I've got to get this solved today for my employer.

"Help me Obiwan. You're my only hope"

K

- Original Message -

From: "KRIS MUSSHORN" 
To: solr-user@lucene.apache.org
Sent: Tuesday, January 31, 2017 12:31:13 PM
Subject: Query structure

I have defaultSearchField and facetMetatagDatePrefix4 fields that are 
correctly populated with values in SOLR 5.4.1.

If I execute the query q=defaultSearchField:this text
I get the 7 docs that match.
There are three docs in 2015 and one doc in 2016 per the facet counts in the 
results.
If I then run q=defaultSearchField:this text AND facetMetatagDatePrefix4:2015 I get 
the correct 3 documents.

How would I structure my query to get defaultSearchField:this text AND 
(facetMetatagDatePrefix4:2015 OR facetMetatagDatePrefix4:2016) and return only 
4 docs?

TIA,
Kris







Re: Query structure

2017-02-01 Thread Maciej Ł. PCSS

Why not use a filter query? I mean the 'fq' param.

Regards
Maciej









Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-01-31 Thread Maciej Ł. PCSS

Thank you Erick. Yes, I'm still thinking about that use case.

Regards
Maciej








Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-01-30 Thread Maciej Ł. PCSS

Hi All,

What solution have you applied in your implementations?

Regards
Maciej






SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-01-24 Thread Maciej Ł. PCSS

Dear SOLR users,

please point me to the right solution to my problem. I'm using SOLR to 
implement a Google-like search in my application, and that scenario is 
working fine.


However, in specific use cases I need to filter documents that include a 
specific substring in a given field. It's an SQL-like query 
similar to this:


SELECT * FROM table WHERE someField LIKE '%c def g%'

I expect to match documents having someField = 'abc def ghi'. That means 
I expect to match parts of words.


As I understand it, SOLR, as an inverted index, works with tokens rather 
than character strings and therefore looks for whole words (not substrings).


Is there any solution for such an issue?

Regards
Maciej Łabędzki