Re: What happens if you don't set positionIncrementGap

2014-10-12 Thread Jack Krupansky
Read the Lucene analysis package summary section entitled "Field Section 
Boundaries":

http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/analysis/package-summary.html

TL;DR - if you leave it as the default, then a word at the end of one 
section and a word at the start of the next section would be an exact phrase 
match. You might ask why Lucene chose that default - I don't know, but Solr 
"best practice" is the opposite. I suspect that Solr chose a large number 
like 100 so that a phrase query could use a significant slop like 10 and 
still not match across sections.
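
For illustration, a multiValued text field type with the gap set looks roughly
like this in schema.xml (a sketch - the type and analyzer names are just the
usual example-schema style):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With a gap of 0, a multiValued field holding the values "now is the time" and
"for all good men" would match the phrase query "time for"; with a gap of 100,
that phrase only matches if the slop is 100 or more.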


In my e-book I have a section entitled "Position Increment Gap" in Chapter 2 
"Analyzers Overview" that details the reasoning as well. There is also 
another section with the same title in the Term Vector Component chapter 
that runs through an example in more detail.


See:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Sunday, October 12, 2014 7:40 PM
To: solr-user
Subject: What happens if you don't set positionIncrementGap

Hello,

I am working on - yet another - minimal schema, which involves the
settings that are matching defaults (or non-harming if defaults are
used). The one I am trying to figure out now is: positionIncrementGap

We set it to 100 in all text field definitions. Does that mean it is
NOT set to some reasonable number by default?

I tried to trace it and all I can find is a default value in
SolrAnalyzer, which is 0.

But if it is 0 (zero), then why do we explicitly define it to be 0 in all
non-text fields? That would seem to be redundant and - frankly - confusing.

Regards,
   Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 



Re: DateMathParser question

2014-10-10 Thread Jack Krupansky

Sounds reasonable. File a Jira!

-- Jack Krupansky

-Original Message- 
From: Jamie Johnson 
Sent: Friday, October 10, 2014 11:45 AM 
To: solr-user@lucene.apache.org 
Subject: DateMathParser question 


I have found that DateMathParser is extremely useful in providing nice
labels back to clients, but having to bring in all of solr-core to get it
is causing us issues in our current implementation.  Are there any thoughts
about moving this to another jar (say solr-utils?) that would allow clients
to leverage this functionality?


Re: does one need to reindex when changing similarity class

2014-10-09 Thread Jack Krupansky
The similarity class is only invoked at query time, so it doesn't 
participate in indexing.
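
For example, switching the similarity for the whole schema looks something like
this in schema.xml (BM25 here is only an illustration, not a recommendation):

  <similarity class="solr.BM25SimilarityFactory"/>

It can also be set per field type, inside the <fieldType> element.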


-- Jack Krupansky

-Original Message- 
From: Markus Jelsma

Sent: Thursday, October 9, 2014 6:59 AM
To: solr-user@lucene.apache.org
Subject: RE: does one need to reindex when changing similarity class

Hi - no you don't have to, although maybe you do if you changed how norms are
encoded.

Markus



-Original message-

From:elisabeth benoit 
Sent: Thursday 9th October 2014 12:26
To: solr-user@lucene.apache.org
Subject: does one need to reindex when changing similarity class

I've read somewhere that we do have to reindex when changing similarity
class. Is that right?

Thanks again,
Elisabeth





Re: Best way to index wordpress blogs in solr

2014-10-08 Thread Jack Krupansky
The LucidWorks product has built-in crawler support, so you could crawl one or
more web sites.


http://lucidworks.com/product/fusion/

-- Jack Krupansky

-Original Message- 
From: Vishal Sharma

Sent: Tuesday, October 7, 2014 2:08 PM
To: solr-user@lucene.apache.org
Subject: Best way to index wordpress blogs in solr

Hi,

I am trying to find out whether there is any best practice for indexing
WordPress blogs in a Solr index. Can someone help with the architecture
I should be setting up?

Do I need to write separate scripts to crawl WordPress and then push the posts
to Solr using its API?




*Vishal Sharma* | TL, Grazitti Interactive | T: +1 650 641 1754
E: vish...@grazitti.com
www.grazitti.com



Re: Edismax parser and boosts

2014-10-08 Thread Jack Krupansky
Definitely sounds like a bug! File a Jira. Thanks for reporting this. What 
release of Solr?




-- Jack Krupansky
-Original Message- 
From: Pawel Rog

Sent: Wednesday, October 8, 2014 3:57 PM
To: solr-user@lucene.apache.org
Subject: Edismax parser and boosts

Hi,
I use edismax query with q parameter set as below:

q=foo^1.0+AND+bar

For such a query for the same document I see different (lower) scoring
value than for

q=foo+AND+bar

By default the boost of a term is 1 as far as I know, so why does the scoring differ?

When I check the debugQuery output, the parsedQuery for "foo^1.0+AND+bar"
is a Boolean query in which one of the clauses is a phrase query "foo 1.0 bar". It
seems that the edismax parser takes the whole q parameter as a phrase without
removing the boost value and adds it as a boolean clause. Is it a bug or
should it work like that?

--
Paweł Róg 



Re: eDisMax parser and special characters

2014-10-08 Thread Jack Krupansky
Hyphen is a "prefix operator" and is normally followed by a term to indicate 
that the term "must not" be present. So, your query has a syntax error. The 
two query parsers differ in how they handle various errors. In the case of 
edismax, it quotes operators and then tries again, so the hyphen gets 
quoted, and then analyzed to nothing for text fields but is still a string 
for string fields.


-- Jack Krupansky

-Original Message- 
From: Lanke,Aniruddha

Sent: Wednesday, October 8, 2014 4:38 PM
To: solr-user@lucene.apache.org
Subject: Re: eDisMax parser and special characters

Sorry for the delayed reply - here is more information:

Schema that we are using - http://pastebin.com/WQAJCCph
Request Handler in config - http://pastebin.com/Y0kP40WF

Some analysis -

Search term: red -
Parser eDismax
No results show up
(+((DisjunctionMaxQuery((name_starts_with:red^9.0 | 
name_parts_starts_with:red^6.0 | s_detail:red | name:red^12.0 | 
s_detail_starts_with:red^3.0 | s_detail_parts_starts_with:red^2.0)) 
DisjunctionMaxQuery((name_starts_with:-^9.0 | 
s_detail_starts_with:-^3.0)))~2))/no_coord


Search term: red -
Parser dismax
Results are returned
(+DisjunctionMaxQuery((name_starts_with:red^9.0 | 
name_parts_starts_with:red^6.0 | s_detail:red | name:red^12.0 | 
s_detail_starts_with:red^3.0 | s_detail_parts_starts_with:red^2.0)) 
())/no_coord


Why do we see the variation in the results between dismax and eDismax?


On Oct 8, 2014, at 8:59 AM, Erick Erickson <erickerick...@gmail.com> wrote:


There's not much information here.
What's the doc look like?
What is the analyzer chain for it?
What is the output when you add &debug=query?

Details matter. A lot ;)

Best,
Erick

On Wed, Oct 8, 2014 at 6:26 AM, Michael Joyner <mich...@newsrx.com> wrote:

Try escaping special chars with a "\"


On 10/08/2014 01:39 AM, Lanke,Aniruddha wrote:

We are using a eDisMax parser in our configuration. When we search using
the query term that has a ‘-‘ we don’t get any results back.

Search term: red - yellow
This doesn’t return any data back but







Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?

2014-10-08 Thread Jack Krupansky
The source code uses the Java Character.isWhitespace method, which
specifically excludes the non-breaking space characters.


The Javadoc contract for WhitespaceTokenizer is too vague, especially since 
Unicode has so many... subtleties.


Personally, I'd go along with treating non-breaking white space as white 
space here.


And update the Lucene Javadoc contract to be more explicit.
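
In the meantime, a query-side workaround might look something like this (a
sketch - assuming the decoded character really is U+00A0; the charFilter maps
it to a plain space before the tokenizer runs):

  <fieldType name="text_ws" class="solr.TextField">
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\u00A0" replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>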

-- Jack Krupansky

-Original Message- 
From: Markus Jelsma

Sent: Wednesday, October 8, 2014 10:16 AM
To: solr-user@lucene.apache.org ; solr-user
Subject: RE: WhitespaceTokenizer to consider incorrectly encoded c2a0?

Alexandre - I am sorry if I was not clear; this is about queries, it all
happens at query time. Yes, we can do the substitution with the regex
replace filter, but I would propose adding this weird exception to
WhitespaceTokenizer so Lucene deals with it by itself.


Markus

-Original message-

From:Alexandre Rafalovitch 
Sent: Wednesday 8th October 2014 16:12
To: solr-user 
Subject: Re: WhitespaceTokenizer to consider incorrectly encoded c2a0?

Is this a suggestion for a JIRA ticket? Or a question on how to solve
it? If the latter, you could probably stick a RegEx replacement in the
UpdateRequestProcessor chain and be done with it.

As to why? I would look for the rest of the MSWord-generated
artifacts, such as "smart" quotes, extra-long dashes, etc.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 8 October 2014 09:59, Markus Jelsma  wrote:
> Hi,
>
> For some crazy reason, some users somehow manage to substitute a
> perfectly normal space with a badly encoded non-breaking space. Properly
> URL encoded, this then becomes %c2a0, and depending on the encoding you
> use to view it, you probably see Â followed by a space. For example:

>
> Because c2a0 is not considered whitespace (indeed, it is not real 
> whitespace, that is 00a0) by the Java Character class, the 
> WhitespaceTokenizer won't split on it, but the WordDelimiterFilter still 
> does, somehow mitigating the problem as it becomes:

>
> HTMLSCF een abonnement
> WT een abonnement
> WDF een eenabonnement abonnement
>
> Should the WhitespaceTokenizer not include this weird edge case?
>
> Cheers,
> Markus





Re: dismax query does not match with additional field in qf

2014-10-07 Thread Jack Krupansky
Your query term seems particularly inappropriate for dismax - think simple 
keyword queries.


Also, don't confuse dismax and edismax - maybe you want the latter. The 
former is for... simple keyword queries.


I'm still not sure what your actual use case really is. In particular, are 
you trying to do a full, exact match on the string field, or a substring 
match? You can do the latter with wildcards or regex, but normally the 
former (exact match) is used.


Maybe simply enclosing the complex term in quotes to make it a phrase query 
is what you need - that would do an exact match on the string field, but a 
tokenized phrase match on the text field, and support partial matches on the 
text field as a phrase of contiguous terms.


-- Jack Krupansky

-Original Message- 
From: Andreas Hubold

Sent: Tuesday, October 7, 2014 12:08 PM
To: solr-user@lucene.apache.org
Subject: Re: dismax query does not match with additional field in qf

Okay, sounds reasonable. However I didn't expect this when reading the
documentation of the dismax query parser.

Especially the need to escape special characters (and which ones) was
not clear to me as the dismax query parser "is designed to process
simple phrases (without complex syntax) entered by users" and "special
characters (except AND and OR) are escaped" by the parser - as written
on https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

Do you know if the new Simple Query Parser has the same behaviour when
searching across multiple fields? Or could it be used instead to search
across "text_general" and "string" fields of arbitrary content without
additional query preprocessing to get results for matches in any of
these fields (as in field1:STUFF OR field2:STUFF).

Thank you,
Andreas

Jack Krupansky wrote on 10/07/2014 05:24 PM:
I think what is happening is that your last term, the naked apostrophe is 
analyzing to zero terms and simply being ignored, but when you add the 
extra field, a string field, you now have another term in the query, and 
you have mm set to 100%, so that "new" term must match. It probably fails 
because you have no naked apostrophe term in that field in the index.


Probably none of your string field terms were matching before, but that 
wasn't apparent since the tokenized text matched. But with this naked 
apostrophe term, there is no way to tell Lucene to match "no" term, so it 
required the string term to match, which won't happen since only the full 
string is indexed.


Generally, you need to escape all special characters in a query. Then 
hopefully your string field will match.


-- Jack Krupansky

-Original Message- From: Andreas Hubold
Sent: Tuesday, September 30, 2014 11:14 AM
To: solr-user@lucene.apache.org
Subject: dismax query does not match with additional field in qf

Hi,

I ran into a problem with the Solr dismax query parser. We're using Solr
4.10.0 and the field types mentioned below are taken from the example
schema.xml.

In a test we have a document with rather strange content in a field
named "name_tokenized" of type "text_general":

abc_<iframe src='loadLocale.js' onload='javascript:document.XSSed="name"' width=0 height=0>


(It's a test for XSS bug detection, but that doesn't matter here.)

I can find the document when I use the following dismax query with qf
set to field "name_tokenized" only:

http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2

If I submit exactly the same query but add another field "feederstate"
to the qf parameter, I don't get any results anymore. The field is of
type "string".

http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2%20feederstate

The decoded value of q is: abc_<iframe src='loadLocale.js' onload='javascript:document.XSSed="name"'

The parsed query (from debugQuery) is:

DisjunctionMaxQuery((feederstate:abc_<iframe | ((name_tokenized:abc_ name_tokenized:iframe)^2.0))~0.1)
DisjunctionMaxQuery((feederstate:src='loadLocale.js' | 
((name_tokenized:src name_tokenized:loadlocale.js)^2.0))~0.1)
DisjunctionMaxQuery((feederstate:onload='javascript:document.XSSed= | 
((name_tokenized:onload 
name_tokenized:javascript:document.xssed)^2.0))~0.1)

DisjunctionMaxQuery((feederstate:name | name_tokenized:name^2.0)~0.1)
DisjunctionMaxQuery((feederstate:')~0.1)
  )~5)

  DisjunctionMaxQuery((textbody:"abc_ iframe src loadlocale.js onload 
javascript:document.xssed name" | name_tokenized:"abc_ iframe src 
loadlocale.js onload javascript:document.xssed name"^2.0)~0.1)

)/no_coord


I've configured the handler with mm=100% so that all
of the 5 dismax queries at the top must match. But this one does not
match:


DisjunctionMaxQuery(

Re: dismax query does not match with additional field in qf

2014-10-07 Thread Jack Krupansky
I think what is happening is that your last term, the naked apostrophe is 
analyzing to zero terms and simply being ignored, but when you add the extra 
field, a string field, you now have another term in the query, and you have 
mm set to 100%, so that "new" term must match. It probably fails because you 
have no naked apostrophe term in that field in the index.


Probably none of your string field terms were matching before, but that 
wasn't apparent since the tokenized text matched. But with this naked 
apostrophe term, there is no way to tell Lucene to match "no" term, so it 
required the string term to match, which won't happen since only the full 
string is indexed.


Generally, you need to escape all special characters in a query. Then 
hopefully your string field will match.


-- Jack Krupansky

-Original Message- 
From: Andreas Hubold

Sent: Tuesday, September 30, 2014 11:14 AM
To: solr-user@lucene.apache.org
Subject: dismax query does not match with additional field in qf

Hi,

I ran into a problem with the Solr dismax query parser. We're using Solr
4.10.0 and the field types mentioned below are taken from the example
schema.xml.

In a test we have a document with rather strange content in a field
named "name_tokenized" of type "text_general":

abc_<iframe src='loadLocale.js' onload='javascript:document.XSSed="name"' width=0 height=0>


(It's a test for XSS bug detection, but that doesn't matter here.)

I can find the document when I use the following dismax query with qf
set to field "name_tokenized" only:

http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2

If I submit exactly the same query but add another field "feederstate"
to the qf parameter, I don't get any results anymore. The field is of
type "string".

http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2%20feederstate

The decoded value of q is: abc_<iframe src='loadLocale.js' onload='javascript:document.XSSed="name"'

The parsed query (from debugQuery) is:

DisjunctionMaxQuery((feederstate:abc_<iframe | ((name_tokenized:abc_ name_tokenized:iframe)^2.0))~0.1)
DisjunctionMaxQuery((feederstate:src='loadLocale.js' | 
((name_tokenized:src name_tokenized:loadlocale.js)^2.0))~0.1)
DisjunctionMaxQuery((feederstate:onload='javascript:document.XSSed= | 
((name_tokenized:onload name_tokenized:javascript:document.xssed)^2.0))~0.1)

DisjunctionMaxQuery((feederstate:name | name_tokenized:name^2.0)~0.1)
DisjunctionMaxQuery((feederstate:')~0.1)
  )~5)

  DisjunctionMaxQuery((textbody:"abc_ iframe src loadlocale.js onload 
javascript:document.xssed name" | name_tokenized:"abc_ iframe src 
loadlocale.js onload javascript:document.xssed name"^2.0)~0.1)

)/no_coord


I've configured the handler with mm=100% so that all
of the 5 dismax queries at the top must match. But this one does not match:

DisjunctionMaxQuery((feederstate:')~0.1)


I'd expect that an additional field in the qf parameter would not lead
to fewer matches.
Okay, the above example is a rather crude test but I'd like to
understand it. Is this a bug in Solr?

I've also found https://issues.apache.org/jira/browse/SOLR-3047 which
sounds somewhat similar.

Regards,
Andreas 



Re: Advise on an architecture with lot of cores

2014-10-07 Thread Jack Krupansky
You'll have to do a proof of concept test to determine how many collections 
Solr/SolrCloud can handle.


With a very large number of customers you may have to do sharding of the 
clusters themselves - limit each cluster to however many
customers/collections work well (100? 250?) and then have separate clusters
for larger groups of customers, maybe with a smaller cluster with a 
collection that maps the customer ID to a Solr cluster, and then the 
application layer can direct requests to the Solr cluster that owns that 
customer.


-- Jack Krupansky

-Original Message- 
From: Manoj Bharadwaj

Sent: Tuesday, October 7, 2014 8:27 AM
To: solr-user@lucene.apache.org
Subject: Advise on an architecture with lot of cores

Hi folks,

My team inherited a Solr setup with an architecture that has a core for
every customer. We have a few different types of cores, say "A", "B", "C",
and for each one of these there is a core per customer - namely "A1",
"A2"..., "B1", "B2"... Overall we have over 600 cores. We don't know the
history behind the current design - the exact reasons why it was done the
way it was done - but one probable consideration was to keep each customer's
data separate from the others.

We want to go to a single-core-per-type architecture, and move to SolrCloud
as well in the near future to achieve sharding via the features the cloud
provides.

Further aspects such as monitoring become easier as well. We will need to
watch and tune the caches for the different pattern of hits that we see.

Is there anything else to evaluate before we move to a single core per type
setup?

We are using 4.4.0 currently and will be moving to latest 4.10.1 as a part
of the redesign as well.

Regards
Manoj 



Re: Flexible search field analyser/tokenizer configuration

2014-10-04 Thread Jack Krupansky
Thanks for the clarification. Now... "fq" is simply another query, with 
normal query syntax. You wrote two field names as if they were query terms, 
but that's not meaningful query syntax. Sorry, but there is no such feature 
in Solr.


Although the qf parameter of dismax and edismax can be used to apply a boost 
to all un-fielded terms for a field, you otherwise need to apply any boost 
on a term, not a field.
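
For example, with edismax the per-field boosts belong in qf (and optionally pf
for phrase boosting) rather than in fq - roughly like this, shown unencoded for
readability and reusing your field names:

  q=Ballonnenboog&defType=edismax&qf=title_search_global^10.0 description_search^0.3&pf=title_search_global^100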


-- Jack Krupansky

-Original Message- 
From: PeterKerk

Sent: Saturday, October 4, 2014 10:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Flexible search field analyser/tokenizer configuration

In English, I think this part:
(title_search_global:(Ballonnenboog) OR
title_search_global:"Ballonnenboog"^100)
is looking for a match on "Ballonnenboog" in the title and gives a boost if it
occurs exactly like this.

The second part does the same but for the description_search field, and
with an OR operator (so I would think it would not eliminate all matches):

(description_search:(Ballonnenboog) OR
description_search:"Ballonnenboog"^100)

And finally this part:

title_search_global^10.0+description_search^0.3

Gives a higher boost to the occurrence of the query in title_search_global
field than description_search field.

But something must be wrong with my analysis :)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Flexible-search-field-analyser-tokenizer-configuration-tp4161624p4162660.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Flexible search field analyser/tokenizer configuration

2014-10-04 Thread Jack Krupansky
What exactly do you think that filter query is doing? Explain it in plain 
English.


My guess is that it eliminates all your document matches.

-- Jack Krupansky

-Original Message- 
From: PeterKerk

Sent: Saturday, October 4, 2014 12:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Flexible search field analyser/tokenizer configuration

Ok, that field now totally works, thanks again!

I've removed the wildcard to benefit from ranking and boosting and am now
trying to combine this field with another, but I have some difficulties
figuring out the right query.

I want to search on the occurence of the keyword in the title field
(title_search_global) of a document OR in the description field
(description_search)
and if it occurs in the title field give that the largest boost, over a
minor boost in the description_search field.

Here's what I have now on query "Ballonnenboog"

http://localhost:8983/solr/tt-shop/select?q=(title_search_global%3A(Ballonnenboog)+OR+title_search_global%3A%22Ballonnenboog%22%5E100)+OR+description_search%3A(Ballonnenboog)&fq=title_search_global%5E10.0%2Bdescription_search%5E0.3&fl=id%2Ctitle&wt=xml&indent=true

But it returns 0 results, even though there are results that have
"Ballonnenboog" in the title_search_global field.

What am I missing?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Flexible-search-field-analyser-tokenizer-configuration-tp4161624p4162638.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr + Federated Search Question

2014-10-03 Thread Jack Krupansky

Yes, either term can be used to confuse people equally well!

-- Jack Krupansky

-Original Message- 
From: Alejandro Calbazana

Sent: Thursday, October 2, 2014 3:28 PM
To: solr-user@lucene.apache.org ; Ahmet Arslan
Subject: Re: Solr + Federated Search Question

Thanks Ahmet.  Yay!  New term :)  Although it does look like "federated"
and "metasearch" can be  used interchangeably.

Alejandro

On Thu, Oct 2, 2014 at 2:37 PM, Ahmet Arslan 
wrote:


Hi Alejandro,

So your example is better called as "metasearch". Here a quotation from a
book.

"Instead of retrieving information from a single information source using
one search engine, one can utilize multiple search engines or a single
search engine retrieving documents from a plethora of document 
collections.

A scenario where multiple engines are used is known as metasearch, while
the scenario where a single engine retrieves from multiple collections is
known as federation. In both these scenarios, the final result of the
retrieval effort needs to be a single, unified ranking of documents, based
on several ranked lists."

Ahmet


On Thursday, October 2, 2014 7:29 PM, Alejandro Calbazana <
acalbaz...@gmail.com> wrote:
Ahmet,Jeff,

Thanks.  Some terms are a bit overloaded.  By "federated", I do mean the
ability to query multiple, disparate, repositories.  So, no.  All of my
data would not necessarily be in Solr.  Solr would be one of several -
databases, filesystems, document stores, etc...  that I would like to
"plug-in".  The content in each repository would be of different types 
(the

shape/schema of the content would differ significantly).

Thanks,

Alejandro




On Wed, Oct 1, 2014 at 9:47 AM, Jack Krupansky 
wrote:

> Alejandro, you'll have to clarify how you are using the term "federated
> search". I mean, technically Ahmet is correct in that Solr queries can 
> be

> fanned out to shards and the results from each shard aggregated
> ("federated") into a single result list, but... more traditionally,
> "federated" refers to "disparate" databases or search engines.
>
> See:
> http://en.wikipedia.org/wiki/Federated_search
>
> So, please tell us a little more about what you are really trying to do.
>
> I mean, is all of your data in Solr, in multiple collections, or on
> multiple Solr servers, or... is only some of your data in Solr and some
is
> in other search engines?
>
> Another approach taken with Solr is that indeed all of your source data
> may be in "disparate databases", but you perform an ETL (Extract,
> Transform, and Load) process to ingest all of that data into Solr and
then
> simply directly search the data within Solr.
>
> -- Jack Krupansky
>
> -Original Message- From: Ahmet Arslan
> Sent: Wednesday, October 1, 2014 9:35 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr + Federated Search Question
>
> Hi,
>
> Federation is possible. Solr has distributed search support with shards
> parameter.
>
> Ahmet
>
>
>
> On Wednesday, October 1, 2014 4:29 PM, Alejandro Calbazana <
> acalbaz...@gmail.com> wrote:
> Hello,
>
> I have a general question about Solr in a federated search context.  I
> understand that Solr does not do federated search and that  different
tools
> are often used to incorporate Solr indexes into a federated/enterprise
> search solution.  Does anyone have recommendations on any products (open
> source or otherwise) that addresses this space?
>
> Thanks,
>
> Alejandro
>






Re: Regarding Default Scoring For Solr

2014-10-03 Thread Jack Krupansky
That's a reasonable description for Solr/Lucene scoring, but use the latest 
release:

http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
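
In short, the "practical scoring function" described there boils down to roughly:

  score(q,d) = coord(q,d) * queryNorm(q)
               * sum over query terms t of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

and the per-document breakdown of exactly those factors is what you see when
you add debugQuery=true to a request.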

-- Jack Krupansky

-Original Message- 
From: mdemarco123

Sent: Thursday, October 2, 2014 6:06 PM
To: solr-user@lucene.apache.org
Subject: Regarding Default Scoring For Solr

If I add &fl=*,score to the end of my query string I get a score back.
Is this the default score? I did read some info on scoring and it is
detailed, granular, and conceptual, but because of limited time I can't go into
the how's of the score calculation at the moment. Are the links below a
good start for understanding the default calculation, or can it be put in a
more tutorial-like fashion?

http://www.lucenetutorial.com/advanced-topics/scoring.html
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Regarding-Default-Scoring-For-Solr-tp4162411.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr + Federated Search Question

2014-10-01 Thread Jack Krupansky
Alejandro, you'll have to clarify how you are using the term "federated 
search". I mean, technically Ahmet is correct in that Solr queries can be 
fanned out to shards and the results from each shard aggregated 
("federated") into a single result list, but... more traditionally, 
"federated" refers to "disparate" databases or search engines.


See:
http://en.wikipedia.org/wiki/Federated_search

So, please tell us a little more about what you are really trying to do.

I mean, is all of your data in Solr, in multiple collections, or on multiple 
Solr servers, or... is only some of your data in Solr and some is in other 
search engines?


Another approach taken with Solr is that indeed all of your source data may 
be in "disparate databases", but you perform an ETL (Extract, Transform, and 
Load) process to ingest all of that data into Solr and then simply directly 
search the data within Solr.


-- Jack Krupansky

-Original Message- 
From: Ahmet Arslan

Sent: Wednesday, October 1, 2014 9:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr + Federated Search Question

Hi,

Federation is possible. Solr has distributed search support with shards 
parameter.


Ahmet



On Wednesday, October 1, 2014 4:29 PM, Alejandro Calbazana 
 wrote:

Hello,

I have a general question about Solr in a federated search context.  I
understand that Solr does not do federated search and that  different tools
are often used to incorporate Solr indexes into a federated/enterprise
search solution.  Does anyone have recommendations on any products (open
source or otherwise) that addresses this space?

Thanks,

Alejandro 



Re: Adding filter in custom query parser

2014-10-01 Thread Jack Krupansky
Unless you consider yourself to be a "Solr expert", it would be best to 
implement such query translation in an application layer.


-- Jack Krupansky

-Original Message- 
From: sagarprasad

Sent: Wednesday, October 1, 2014 3:27 AM
To: solr-user@lucene.apache.org
Subject: Adding filter in custom query parser

I am a newbie with Solr and OpenNLP. I am trying to do a POC and want to write
a custom parser which can parse the query string using NLP and create an
appropriate Solr query with filters.

For example, "red shirt under 20$" should be translated to q=shirt&fq=price:[*
TO 20], and possibly apply the color to one of the attributes of the indexed doc.

In the parser's overridden method, how can I add the filter and pass the query
back?

Any help pointers / sample code  will be helpful.

-Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Adding-filter-in-custom-query-parser-tp4162044.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Wildcard search makes no sense!!

2014-10-01 Thread Jack Krupansky
The presence of a wildcard in a query term short circuits some portions of 
the analysis process. Some token filters like lower case can still be 
performed on the query terms, but others, like stemming, cannot. So, either 
simplify the analysis (be more selective of what token filters you use), or 
you will have to modify your query terms so that you manually simulate the 
token transformations that your text analysis is performing.


Take one of your indexed terms that you think should match and send it 
through the Solr Admin UI analysis page for the query field and see what the 
source token gets analyzed into - that's what your wildcard prefix must 
match. Sometimes (usually!) you will be surprised. For example, if the field 
uses a stemming filter, "Capital" may well be indexed as "capit", which would 
explain why "capit*" matches but "capita*" and "capital*" do not.


-- Jack Krupansky

-Original Message- 
From: Wayne W

Sent: Wednesday, October 1, 2014 7:16 AM
To: solr-user@lucene.apache.org
Subject: Wildcard search makes no sense!!

Hi,

I don't understand this at all. We are indexing some contact names. When we
do a standard query:

query 1: capi*
result: Capital Health

query 2: capit*
result: Capital Health

query 3: capita*
result: 

query 4: capital*
result: 

I understand (as we are using Solr 3.5) that the wildcard search does not
actually return the query without the wildcard so I understand at least why
query 4 is not working ( I need to use: capital* OR capital ). What I don't
understand is why query 3 is not working.

Also if we place in the text field the following 3 contacts:

j...@capitalhealth.com
f...@capitalhealth.com
Capital Heath

When searching for:

query A: capita*
result: j...@capitalhealth.com, f...@capitalhealth.com

query B: capit*
result: j...@capitalhealth.com, f...@capitalhealth.com, Capital Heath


What is going on and how can I solve this?
many thanks as I'm really stuck on this 



Re: Boost Query (bq) syntax/usage

2014-09-30 Thread Jack Krupansky
The parsing of bq will be according to the main query parser (defType 
parameter) or any localParam-specified query parser, as well as all the 
other query parameters (q.op, mm, qf, etc.) This should be true for both 
dismax and edismax. In theory, you could have the main query be parsed with 
dismax and then specify edismax for bq using the localParam notation.
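
For example, something like this (an untested sketch) uses the localParam
notation to force the boost query itself through edismax, independent of the
main query parser:

  bq={!edismax}Source2:sfdc^6 Source2:downloads^5 Source2:topics^3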


-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Tuesday, September 30, 2014 8:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Boost Query (bq) syntax/usage

The "+" signs in the parsed boost query indicated the terms were ANDed
together, but maybe you can use the q.op and mm parameters to change the
default operator (I forget!).

-- Jack Krupansky
-Original Message- 
From: shamik

Sent: Tuesday, September 30, 2014 7:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Boost Query (bq) syntax/usage

Thanks a lot Jack, makes sense. Just curious, if we used the following bq
entry in solrconfig.xml

Source2:sfdc^6 Source2:downloads^5 Source2:topics^3

will it always be treated as an AND query? Some of our local results suggest
otherwise.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Boost-Query-bq-syntax-usage-tp4161989p4161994.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Boost Query (bq) syntax/usage

2014-09-30 Thread Jack Krupansky
The "+" signs in the parsed boost query indicated the terms were ANDed 
together, but maybe you can use the q.op and mm parameters to change the 
default operator (I forget!).


-- Jack Krupansky
-Original Message- 
From: shamik

Sent: Tuesday, September 30, 2014 7:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Boost Query (bq) syntax/usage

Thanks a lot Jack, makes sense. Just curious, if we used the following bq
entry in solrconfig.xml

Source2:sfdc^6 Source2:downloads^5 Source2:topics^3

will it always be treated as an AND query? Some of our local results suggest
otherwise.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-Query-bq-syntax-usage-tp4161989p4161994.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Boost Query (bq) syntax/usage

2014-09-30 Thread Jack Krupansky
A boost is basically an "OR" operation - it doesn't select any more or fewer 
documents. So, three separate bq's are three OR terms. But your first bq is 
a single query that ANDs three terms, and that AND-ed query is OR-ed with 
the original query, so it only boosts documents that contain all three of 
the terms rather than any of the three terms.


-- Jack Krupansky

-Original Message- 
From: shamik

Sent: Tuesday, September 30, 2014 5:38 PM
To: solr-user@lucene.apache.org
Subject: Boost Query (bq) syntax/usage

Hi,

 I'm a little confused about the right syntax for defining boost queries. If I
use them in the following way:

http://localhost:8983/solr/testhandler?q=Application+Manager&bq=(Source2:sfdc^6
Source2:downloads^5 Source2:topics^3)&debugQuery=true

it gets translated to -->


  
  +Source2:sfdc^6.0 +Source2:downloads^5.0 +Source2:topics^3.0
  


Now, if I use the following query:

http://localhost:8983/solr/testhandler?q=Application+Manager&bq=Source2:sfdc^6&bq=Source2:downloads^5&bq=Source2:topics^3&debugQuery=true

gets translated as -->


   Source2:sfdc^6.0
   Source2:downloads^5.0
   Source2:topics^3.0


Both queries generate different results in terms of relevancy. Just wondering,
what is the right way of using bq?

-Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-Query-bq-syntax-usage-tp4161988.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Search multiple values with wildcards

2014-09-30 Thread Jack Krupansky
The special characters (the colons) are treated as term delimiters for a text
field. How do you really intend to query this "string"? You could make it
simply a "string" field.
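
For example (a sketch - the _s field name is just an assumption), declare the
field with the string type:

  <field name="proprietaryMessage_s" type="string" indexed="true" stored="true"/>

and then query it with the colons escaped and leading/trailing wildcards (note
that leading wildcards can be slow on large indexes):

  q=proprietaryMessage_s:*\:25\:234* AND proprietaryMessage_s:*\:32A\:1302*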


-- Jack Krupansky

-Original Message- 
From: J'roo

Sent: Tuesday, September 30, 2014 11:08 AM
To: solr-user@lucene.apache.org
Subject: Search multiple values with wildcards

Hi,

I am using Solr 3.5.0 with JavaClient SolrJ which I cannot change.

I have following type of docs:


:20:13-900-C05-P001:21:REF12349:25:23456789:32A:130202USD100,00:52A:/123456


I want to be able to find docs containing :25:234* AND :32A:1302* using
wildcards, which I thought to do like:

&q=proprietaryMessage_tis:(\:25\:23456*+\:32A\:130202US*)

But this doesn't work. Have tried many variations, anyone got a good tip for
me?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-multiple-values-with-wildcards-tp4161916.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: How to query certain fields filtered by a condition

2014-09-29 Thread Jack Krupansky
You can perform boolean operations using parentheses. So you can OR a 
sequence of sub-queries, and each sub-query can be an AND of the desired 
search term and the constraining values for other fields.
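
For example, something along these lines (a sketch using your field names) lets
the country field match only for Robert's documents:

  q=title:usa^10.0 OR text:usa^0.5 OR (+author:Robert +country:usa)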


-- Jack Krupansky

-Original Message- 
From: Shamik Bandopadhyay

Sent: Monday, September 29, 2014 6:29 PM
To: solr-user@lucene.apache.org
Subject: How to query certain fields filtered by a condition

Hi,

 Just wanted to understand if it's possible to limit a searchable field
only to specific documents during query time. Following are my searchable
fields.

text^0.5 title^10.0 country^1.0

What I want is to make country a searchable field only for documents which
contain "author:Robert". For remaining documents, "country" should not be
considered as a searchable field, only text and title will come into play.
So If I search for "usa", it should bring result from documents where
author=Robert (by matching country field), but not for remaining authors
even if they've a country field with value "usa".

I don't know how it can be done at query time, or if it's possible at all
through some function queries. The other option is to add the country value
as part of title or text for documents containing author:Robert at
index time. But I would like to know if it's possible at query time.

Appreciate your feedback.

-Thanks,
Shamik 



Re: multiple terms order in query - eDismax

2014-09-29 Thread Jack Krupansky
That's called a phrase query - selecting documents based on the order of the
terms. Just enclose the terms in quotes.
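
For example, q=text:"home garden sky" matches only documents where those terms
appear adjacent and in that order; adding a slop, as in "home garden sky"~1,
relaxes the adjacency slightly.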


-- Jack Krupansky

-Original Message- 
From: Tomer Levi

Sent: Monday, September 29, 2014 2:41 AM
To: solr-user@lucene.apache.org
Subject: RE: multiple terms order in query - eDismax

Thanks Jack!
Do you have any idea how can I select documents according to the appearance 
order of the terms?


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Sunday, September 28, 2014 1:27 PM
To: solr-user@lucene.apache.org
Subject: Re: multiple terms order in query - eDismax

pf and ps merely control boosting of documents, not selection of documents.

mm controls selection of documents.

So, hopefully at least doc3 is returned before doc2.

-- Jack Krupansky

From: Tomer Levi
Sent: Sunday, September 28, 2014 5:39 AM
To: solr-user@lucene.apache.org
Subject: multiple terms order in query - eDismax

Hi,

We have an index with 3 documents, each document contains a single field let’s 
call it ‘text’ (except the id) as below:


- Doc1: text:home garden sky sea wolf
- Doc2: text:home wolf sea garden sky
- Doc3: text:wolf sea home garden sky



When executing the query: home garden apple,

Using eDismax params:

· pf=text

· ps=1

· mm=2

We would like to get Doc1 and Doc3, in other words all the documents having 
at least 2 terms in close proximity (only 1 term off).




The problem is that we get all 3 documents; it looks like the ‘ps’ parameter
isn't being honored.


Why is Doc2 included in the results? We expected that Solr would omit it, since
the ‘ps’ needed is larger than 1 => we have home wolf sea garden (ps=2?)








 Tomer Levi

 Software Engineer

 Big Data Group

 Product & Technology Unit

 (T) +972 (9) 775-2693



 tomer.l...@nice.com

 www.nice.com


Re: multiple terms order in query - eDismax

2014-09-28 Thread Jack Krupansky
pf and ps merely control boosting of documents, not selection of documents.

mm controls selection of documents.

So, hopefully at least doc3 is returned before doc2.

-- Jack Krupansky

From: Tomer Levi 
Sent: Sunday, September 28, 2014 5:39 AM
To: solr-user@lucene.apache.org 
Subject: multiple terms order in query - eDismax

Hi,

We have an index with 3 documents, each document contains a single field let’s 
call it ‘text’ (except the id) as below:

- Doc1: text:home garden sky sea wolf
- Doc2: text:home wolf sea garden sky
- Doc3: text:wolf sea home garden sky

 

When executing the query: home garden apple, 

Using eDismax params:

- pf=text
- ps=1
- mm=2

We would like to get Doc1 and Doc3, in other words all the documents having at 
least 2 terms in close proximity (only 1 term off).

 

The problem is that we get all 3 documents; it looks like the ‘ps’ parameter
isn't being honored.

Why is Doc2 included in the results? We expected that Solr would omit it, since
the ‘ps’ needed is larger than 1 => we have home wolf sea garden (ps=2?)

 

 

 

  Tomer Levi
 
  Software Engineer  

  Big Data Group
 
  Product & Technology Unit
 
  (T) +972 (9) 775-2693
 
   
 
  tomer.l...@nice.com 
 
  www.nice.com
 

Re: demo app explaining solr features

2014-09-28 Thread Jack Krupansky
And you can also check out the tutorials in any of the Solr books, including 
my Solr Deep Dive e-book:


http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-Original Message- 
From: Mikhail Khludnev

Sent: Sunday, September 28, 2014 1:35 AM
To: solr-user
Subject: Re: demo app explaining solr features

On Sat, Sep 27, 2014 at 12:26 PM, Anurag Sharma  wrote:


I am wondering if there is any demo app that can demonstrate all the
features/capabilities of solr. My intention is to understand, use and play
around all the features supported by solr.



https://lucene.apache.org/solr/4_10_0/tutorial.html


--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 



Re: java.lang.NumberFormatException: For input string: "string;#-6.872515521, 53.28853084"

2014-09-27 Thread Jack Krupansky
And how is the schema field declared? It seems like it's a TrieDoubleField,
which expects a simple floating point value. You should be using one of the
spatial field types.
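
For example (a sketch - the field name comes from your error message, the rest
follows the Solr 4.x example schema):

  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
  <field name="gis_coordinate" type="location" indexed="true" stored="true"/>

The value sent to Solr should then be just "53.28853084,-6.872515521" (lat,lon),
without the "string;#" prefix that your input is currently adding.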


-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Friday, September 26, 2014 12:20 PM
To: solr-user@lucene.apache.org
Subject: Re: java.lang.NumberFormatException: For input string: 
"string;#-6.872515521, 53.28853084"


It looks like the data is, literally,
string;#-6.872515521, 53.28853084

or maybe
#-6.872515521, 53.28853084

either way the data isn't in anything like the format expected.
Of course I may be mis-reading this, but it looks like your
input process isn't doing what you expect.

How are you sending the data to Solr?

Best,
Erick

On Fri, Sep 26, 2014 at 7:00 AM, lalitjangra 
wrote:


Hi,

I am trying to index latitude and longitude data into solr but getting
error
as below.

ERROR - 2014-09-26 13:44:16.503; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: ERROR:
[doc=http://testirishwaterportal/sites/am/ass/asi/agg/ami/Lists/Waste
Waste
Water Pumping Station/DispForm.aspx?ID=841] Error adding field
'gis_x0020_coordinate'='string;#-6.872515521, 53.28853084' msg=For input
string: "string;#-6.872515521, 53.28853084"
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:167)
at

org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:77)
at

org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:215)
at

org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at

org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at

org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:569)
at

org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:705)
at

org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
at

org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at

org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
at

org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
at

org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
at

org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at

org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at

org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
at
org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
at

org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
at

org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
at

org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
at

org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at

org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at

org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at

org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at

org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at

org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at

org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at

org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at

org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at

org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at

org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at

org.ec

Re: Changed behavior in solr 4 ??

2014-09-25 Thread Jack Krupansky
I am not aware of any such feature! That doesn't mean it doesn't exist, but 
I don't recall seeing it in the Solr source code.


-- Jack Krupansky

-Original Message- 
From: Jorge Luis Betancourt Gonzalez

Sent: Wednesday, September 24, 2014 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Changed behavior in solr 4 ??

Hi Jack:

Thanks for the response. Yes, I know it works the way you describe, and that is
how I got it to work, but then what does the snippet of the documentation about
overriding the default components shipped with Solr mean? Even in the book
Solr in Action, in chapter 7 listing 7.3, I saw
something similar to what I wanted to do:



 
   25
   content_field
 
 
   *:*
   true
   explicit
 

Because each default search component exists by default even if it’s not 
defined explicitly in the solrconfig.xml file, defining them explicitly as 
in the previous listing will replace the default configuration.


The previous snippet is from the quoted book Solr in Action. I understand
that in each SearchHandler I could define these parameters, but if they are
defined in the searchComponent (as the book says), wouldn't this configuration
apply to all my request handlers, eliminating the need to replicate the same
parameter in several parts of my solrconfig.xml (i.e. all the request
handlers)?



Regards,
On Sep 23, 2014, at 11:53 PM, Jack Krupansky  
wrote:



You set the defaults on the "search handler", not the "search component". 
See solrconfig.xml:




 
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
  </lst>
...

-- Jack Krupansky

-Original Message- From: Jorge Luis Betancourt Gonzalez
Sent: Tuesday, September 23, 2014 11:02 AM
To: solr-user@lucene.apache.org
Subject: Changed behavior in solr 4 ??

Hi:

I’m trying to change the default configuration for the query component of
a SearchHandler. Basically, I want to set a default value for the rows
parameter and have that value shared by all my SearchHandlers. As stated
in the solrconfig.xml comments, this could be accomplished by redeclaring the
query search component; however, this is not working on Solr 4.9.0, which is
the version I’m using. This is my configuration:


  
  
  <searchComponent name="query" class="solr.QueryComponent">
    <lst name="defaults">
      <int name="rows">1</int>
    </lst>
  </searchComponent>

The relevant portion of the solrconfig.xml comment is: "If you register a
searchComponent to one of the standard names, that will be used instead of the
default." So is this a new desired behavior? Also, just for testing, I
redefined the components of the request handler to only use the query
component and not all the default components; this is how it looks:



  <arr name="components">
    <str>query</str>
  </arr>



Everything works OK, but the rows parameter is not used, even though I'm
not specifying the rows parameter on the URL.


Regards,







Re: Scoring with wild cars

2014-09-25 Thread Jack Krupansky
The wildcard query is “constant score” to make it faster, so unfortunately that 
means there is no score differentiation between the wildcard matches.

You can simply add the wildcard prefix as a separate query term and boost it:

q=text:carre* text:carre^1.5

-- Jack Krupansky

From: Pigeyre Romain 
Sent: Wednesday, September 24, 2014 2:12 PM
To: solr-user@lucene.apache.org 
Cc: Pigeyre Romain 
Subject: Scoring with wild cars

Hi,

 

I have two records with a name_fra field

One with name_fra=”un test CARREAU”

And another one with name_fra=”un test CARRE”

 

{

"codeBarre": "1",

"name_FRA": "un test CARREAU"

  }

{

"codeBarre": "2",

"name_FRA": "un test CARRE"

  }

 

Configuration of these fields are :

 

When I’m using this query :

http://localhost:8983/solr/cdv_product/select?q=text%3Acarre*&fl=score%2C+*&wt=json&indent=true&debugQuery=true

The result is :

{

  "responseHeader":{

"status":0,

"QTime":2,

"params":{

  "debugQuery":"true",

  "fl":"score, *",

  "indent":"true",

  "q":"text:carre*",

  "wt":"json"}},

  "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[

  {

   "codeBarre":"1",

"name_FRA":"un test CARREAU",

"_version_":1480150860842401792,

"score":1.0},

  {

"codeBarre":"2",

"name_FRA":"un test CARRE",

"_version_":1480150875738472448,

"score":1.0}]

  },

  "debug":{

"rawquerystring":"text:carre*",

"querystring":"text:carre*",

"parsedquery":"text:carre*",

"parsedquery_toString":"text:carre*",

"explain":{

  "1":"\n1.0 = (MATCH) ConstantScore(text:carre*), product of:\n  1.0 = 
boost\n  1.0 = queryNorm\n",

  "2":"\n1.0 = (MATCH) ConstantScore(text:carre*), product of:\n  1.0 = 
boost\n  1.0 = queryNorm\n"},

"QParser":"LuceneQParser",

"timing":{

  "time":2.0,

  "prepare":{

"time":1.0,

"query":{

  "time":1.0},

"facet":{

  "time":0.0},

"mlt":{

  "time":0.0},

"highlight":{

  "time":0.0},

"stats":{

  "time":0.0},

"expand":{

  "time":0.0},

"debug":{

  "time":0.0}},

  "process":{

"time":1.0,

"query":{

  "time":0.0},

"facet":{

  "time":0.0},

"mlt":{

  "time":0.0},

"highlight":{

  "time":0.0},

"stats":{

  "time":0.0},

"expand":{

  "time":0.0},

"debug":{

  "time":1.0}

 

The score is the same for both records. The CARREAU record is first and CARRE is
next. I want to place CARRE before the CARREAU result because CARRE is an exact
match. Is that possible?

 

NB: scoring for this query only uses queryNorm and boosts

 

In this test :

http://localhost:8983/solr/cdv_product/select?q=text%3Acarre&fl=score%2C*&wt=json&indent=true&debugQuery=true

 

I have only one record found but the scoring is more complex. Why?

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "debugQuery":"true",
      "fl":"score,*",
      "indent":"true",
      "q":"text:carre",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"maxScore":0.53033006,"docs":[
      {
        "codeBarre":"2",
        "name_FRA":"un test CARRE",
        "_version_":1480150875738472448,
        "score":0.53033006}]
  },
  "debug":{
    "rawquerystring":"text:carre",
    "querystring":"text:carre",
    "parsedquery":"text:carre",
    "parsedquery_toString":

Re: Changed behavior in solr 4 ??

2014-09-23 Thread Jack Krupansky
You set the defaults on the "search handler", not the "search component". 
See solrconfig.xml:



 
  
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
  </lst>
...

-- Jack Krupansky

-Original Message- 
From: Jorge Luis Betancourt Gonzalez

Sent: Tuesday, September 23, 2014 11:02 AM
To: solr-user@lucene.apache.org
Subject: Changed behavior in solr 4 ??

Hi:

I’m trying to change the default configuration for the query component of a
SearchHandler. Basically, I want to set a default value for the rows
parameter and have that value shared by all my SearchHandlers. As stated
in the solrconfig.xml comments, this could be accomplished by redeclaring the
query search component; however, this is not working on Solr 4.9.0, which is
the version I’m using. This is my configuration:


   
   
   <searchComponent name="query" class="solr.QueryComponent">
     <lst name="defaults">
       <int name="rows">1</int>
     </lst>
   </searchComponent>

The relevant portion of the solrconfig.xml comment is: "If you register a
searchComponent to one of the standard names, that will be used instead of the
default." So is this a new desired behavior? Also, just for testing, I
redefined the components of the request handler to only use the query
component and not all the default components; this is how it looks:



   <arr name="components">
     <str>query</str>
   </arr>



Everything works OK, but the rows parameter is not used, even though I’m not
specifying the rows parameter on the URL.


Regards,



Re: query for space character in text field ...

2014-09-23 Thread Jack Krupansky

Or simply enclosed the full term in quotes:

q=path:"my path"

Which is more properly encoded as:

q=path:%22my+path%22

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Tuesday, September 23, 2014 11:02 PM
To: solr-user@lucene.apache.org
Subject: Re: query for space character in text field ...

You should be able to escape it with a backslash, as
search\ with\ spaces

Best,
Erick

On Tue, Sep 23, 2014 at 3:18 PM, Samuel Smith  wrote:


Should I be able to search a text field in my index for any value that
contains white space?

The value in my “path” field contains an untokenized string (“that
contains spaces”).

I can do single character searches for other single special characters no
problem (q=path:*!*, or q=path:*-*), but no representation of space (%20, 
“

“, or \s etc.) seem to work.

Thanks in advance for any suggestions! Sam

--
Samuel Smith






Re: [ANN] Lucidworks Fusion 1.0.0

2014-09-23 Thread Jack Krupansky

You simply download it yourself and give yourself a demo!!

http://lucidworks.com/product/fusion/

-- Jack Krupansky

-Original Message- 
From: Thomas Egense

Sent: Tuesday, September 23, 2014 2:00 AM
To: solr-user@lucene.apache.org
Subject: Re: [ANN] Lucidworks Fusion 1.0.0

Hi Grant.
Will there be a Fusion demonstration/presentation at Lucene/Solr Revolution
DC? (Not listed in the program yet).


Thomas Egense

On Mon, Sep 22, 2014 at 3:45 PM, Grant Ingersoll 
wrote:


Hi All,

We at Lucidworks are pleased to announce the release of Lucidworks Fusion
1.0.   Fusion is built to overlay on top of Solr (in fact, you can manage
multiple Solr clusters -- think QA, staging and production -- all from our
Admin).In other words, if you already have Solr, simply point Fusion 
at

your instance and get all kinds of goodies like Banana (
https://github.com/LucidWorks/Banana -- our port of Kibana to Solr + a
number of extensions that Kibana doesn't have), collaborative filtering
style recommendations (without the need for Hadoop or Mahout!), a modern
signal capture framework, analytics, NLP integration, Boosting/Blocking 
and
other relevance tools, flexible index and query time pipelines as well as 
a

myriad of connectors ranging from Twitter to web crawling to Sharepoint.
The best part of all this?  It all leverages the infrastructure that you
know and love: Solr.  Want recommendations?  Deploy more Solr.  Want log
analytics?  Deploy more Solr.  Want to track important system metrics?
Deploy more Solr.

Fusion represents our commitment as a company to continue to contribute a
large quantity of enhancements to the core of Solr while complementing and
extending those capabilities with value adds that integrate a number of 
3rd

party (e.g connectors) and home grown capabilities like an all new,
responsive UI built in AngularJS.  Fusion is not a fork of Solr.  We do 
not
hide Solr in any way.  In fact, our goal is that your existing 
applications
will work out of the box with Fusion, allowing you to take advantage of 
new

capabilities w/o overhauling your existing application.

If you want to learn more, please feel free to join our technical webinar
on October 2: http://lucidworks.com/blog/say-hello-to-lucidworks-fusion/.
If you'd like to download: http://lucidworks.com/product/fusion/.

Cheers,
Grant Ingersoll


Grant Ingersoll | CTO
gr...@lucidworks.com | @gsingers
http://www.lucidworks.com






Re: How to summarize a String Field ?

2014-09-18 Thread Jack Krupansky

Do a copyField to a numeric field.
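Something along these lines in schema.xml would do it (a sketch, assuming the AMOUNT values are plain digits that parse as numbers):

<field name="AMOUNT" type="string" indexed="true" stored="true"/>
<field name="AMOUNT_num" type="tlong" indexed="true" stored="true"/>
<copyField source="AMOUNT" dest="AMOUNT_num"/>

Then point the stats component at the numeric copy, e.g. &stats=true&stats.field=AMOUNT_num, and the response should include the sum.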

-- Jack Krupansky

-Original Message- 
From: Erick Erickson 
Sent: Thursday, September 18, 2014 11:35 AM 
To: solr-user@lucene.apache.org 
Subject: Re: How to summarize a String Field ? 


You cannot do this as far as I know, it must be a numeric field
(float/int/tint/tfloat whatever).

Best
Erick

On Thu, Sep 18, 2014 at 12:46 AM, YouPeng Yang
 wrote:

Hi

   One of my fields, called AMOUNT, is a String, and I want to calculate the
sum of this field.
I have tried it with the stats component; it only gives out the stats
information, without a sum item, as follows:


 
 5000
 24230
 26362
  


   Is there any way to achieve this objective?

Regards


Re: Mongo DB Users

2014-09-15 Thread Jack Krupansky

>Waiting for a positive response!


-1

-- Jack Krupansky

-Original Message- 
From: Rakesh Varna

Sent: Monday, September 15, 2014 10:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Mongo DB Users

Remove

Regards,
Rakesh Varna


On Mon, Sep 15, 2014 at 9:29 AM, Ed Smiley  wrote:


Remove

On 9/15/14, 8:35 AM, "Aaron Susan"  wrote:

>Hi,
>
>I am here to inform you that we are having a contact list of *Mongo DB
>Users *would you be interested in it?
>
>Data Field's Consist Of: Name, Job Title, Verified Phone Number, Verified
>Email Address, Company Name & Address Employee Size, Revenue size, SIC
>Code, Industry Type etc.,
>
>We also provide other technology users as well depends on your
>requirement.
>
>For Example:
>
>
>*Red Hat *
>
>*Terra data *
>
>*Net-app *
>
>*NuoDB*
>
>*MongoHQ ** and many more*
>
>
>We also provide IT Decision Makers, Sales and Marketing Decision Makers,
>C-level Titles and other titles as per your requirement.
>
>Please review and let me know your interest if you are looking for above
>mentioned users list or other contacts list for your campaigns.
>
>Waiting for a positive response!
>
>Thanks
>
>*Aaron Susan*
>Data Specialist
>
>If you are not the right person, feel free to forward this email to the
>right person in your organization. To opt out response Remove






Re: Solr Exceptions -- "immense terms"

2014-09-15 Thread Jack Krupansky
I knew it was in there somewhere! But... that truncates the full field 
value, as opposed to an individual term for a text field. It depends on 
whether the immediate issue was for a text field or for a string field. The 
underlying issue may be that it rarely makes sense to "index" a full wiki 
page as a string field.
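If capping the full field value is good enough, a chain along these lines in solrconfig.xml would do it (the 32000 limit is just an illustrative value under the 32766-byte cap):

<updateRequestProcessorChain name="truncate-content">
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldName">content</str>
    <int name="maxLength">32000</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

and then select it with update.chain=truncate-content on the update request (or in the update handler defaults).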


-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Monday, September 15, 2014 8:39 AM
To: solr-user
Subject: Re: Solr Exceptions -- "immense terms"

May not need a script for that:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html

Regards,
  Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 15 September 2014 11:05, Jack Krupansky  wrote:

You can use an update request processor to filter the input for large
values. You could write a script with the stateless script processor which
ignores or trims large input values.

-- Jack Krupansky

-Original Message- From: Christopher Gross
Sent: Monday, September 15, 2014 7:58 AM
To: solr-user
Subject: Re: Solr Exceptions -- "immense terms"


Yeah -- for this part I'm just trying to store it to show it later.

There was a change in Lucene 4.8.x.  Before then, the exception was just
being eaten...now they throw it up and don't index that document.

Can't push the whole schema up -- but I do copy the content field into a
"text" field (text_en_splitting) that gets used for a full text search
(along w/ some other fields).  But then I would think I'd see the error 
for

that field instead of "content."  I may try that to figure out where the
problem is, but I do want to have the content available for doing the
search...

It's big.

I'm probably going to have to tweak the schema some (probably wise 
anyway),
but I'm not sure what to do about this large text.  I'm loading the 
content

in via some Java code so I could trim it down, but I'd rather not exclude
content from the page just because it's large.  I was hoping that someone
would have a better field type to use, or an idea of what to do to
configure it.

Thanks Michael.


-- Chris

On Mon, Sep 15, 2014 at 10:38 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

I just came back to this because I figured out you're trying to just 
store

this text. Now I'm baffled. How big is it? :)

Not sure why an analyzer is running if you're just storing the content.
Maybe you should post your whole schema.xml... there could be a copyfield
that's dumping the text into a different field that has the keyword
tokenizer?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<

https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 10:37 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> If you're using a String fieldtype, you're not indexing it so much as
> dumping the whole content blob in there as a single term for exact
> matching.
>
> You probably want to look at one of the text field types for textual
> content.
>
> That doesn't explain the difference in behavior between Solr versions,
but
> my hunch is that you'll be happier in general with the behavior of a
field
> type that does tokenizing and stemming for plain text search anyway.
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions
> <

https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
>
> w: appinions.com <http://www.appinions.com/>
>
> On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross 
> wrote:
>
>> Solr 4.9.0
>> Java 1.7.0_49
>>
>> I'm indexing an internal Wiki site.  I was running on an older version
of
>> Solr (4.1) and wasn't having any trouble indexing the content, but now
I'm
>> getting errors:
>>
>> SCHEMA:
>> > required="true"/>
>>
>> LOGS:
>> Caused by: java.lang.IllegalArgumentException: Document contains at
least
>> one immense term in field="content" (whose UTF8 encoding is longer 
>>

Re: Solr Exceptions -- "immense terms"

2014-09-15 Thread Jack Krupansky
You can use an update request processor to filter the input for large 
values. You could write a script with the stateless script processor which 
ignores or trims large input values.
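A rough sketch of that wiring in solrconfig.xml (the script file name is hypothetical; you would write the trimming logic yourself in trim-large.js under conf/):

<updateRequestProcessorChain name="trim-large">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">trim-large.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>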


-- Jack Krupansky

-Original Message- 
From: Christopher Gross

Sent: Monday, September 15, 2014 7:58 AM
To: solr-user
Subject: Re: Solr Exceptions -- "immense terms"

Yeah -- for this part I'm just trying to store it to show it later.

There was a change in Lucene 4.8.x.  Before then, the exception was just
being eaten...now they throw it up and don't index that document.

Can't push the whole schema up -- but I do copy the content field into a
"text" field (text_en_splitting) that gets used for a full text search
(along w/ some other fields).  But then I would think I'd see the error for
that field instead of "content."  I may try that to figure out where the
problem is, but I do want to have the content available for doing the
search...

It's big.

I'm probably going to have to tweak the schema some (probably wise anyway),
but I'm not sure what to do about this large text.  I'm loading the content
in via some Java code so I could trim it down, but I'd rather not exclude
content from the page just because it's large.  I was hoping that someone
would have a better field type to use, or an idea of what to do to
configure it.

Thanks Michael.


-- Chris

On Mon, Sep 15, 2014 at 10:38 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:


I just came back to this because I figured out you're trying to just store
this text. Now I'm baffled. How big is it? :)

Not sure why an analyzer is running if you're just storing the content.
Maybe you should post your whole schema.xml... there could be a copyfield
that's dumping the text into a different field that has the keyword
tokenizer?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
>
w: appinions.com <http://www.appinions.com/>

On Mon, Sep 15, 2014 at 10:37 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> If you're using a String fieldtype, you're not indexing it so much as
> dumping the whole content blob in there as a single term for exact
> matching.
>
> You probably want to look at one of the text field types for textual
> content.
>
> That doesn't explain the difference in behavior between Solr versions,
but
> my hunch is that you'll be happier in general with the behavior of a
field
> type that does tokenizing and stemming for plain text search anyway.
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+:
> plus.google.com/appinions
> <
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
>
> w: appinions.com <http://www.appinions.com/>
>
> On Mon, Sep 15, 2014 at 10:06 AM, Christopher Gross 
> wrote:
>
>> Solr 4.9.0
>> Java 1.7.0_49
>>
>> I'm indexing an internal Wiki site.  I was running on an older version
of
>> Solr (4.1) and wasn't having any trouble indexing the content, but now
I'm
>> getting errors:
>>
>> SCHEMA:
>> > required="true"/>
>>
>> LOGS:
>> Caused by: java.lang.IllegalArgumentException: Document contains at
least
>> one immense term in field="content" (whose UTF8 encoding is longer than
>> the
>> max length 32766), all of which were skipped.  Please correct the
analyzer
>> to not produce such terms.  The prefix of the first immense term is:
'[60,
>> 33, 45, 45, 32, 98, 111, 100, 121, 67, 111, 110, 116, 101, 110, 116, 
>> 32,

>> 45, 45, 62, 10, 9, 9, 9, 60, 100, 105, 118, 32, 115]...', original
>> message:
>> bytes can be at most 32766 in length; got 183250
>> 
>> Caused by:
>> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
bytes
>> can be at most 32766 in length; got 183250
>>
>> I was indexing it, but I switched that off (as you can see above) but 
>> it

>> still is having problems.  Is there a different type I should use, or a
>> different analyzer?  I imagine that there is a way to index very large
>> documents in Solr.  Any recommendations would be helpful.  Thanks!
>>
>> -- Chris
>>
>
>





Re: Tricky exact match, unwanted search results

2014-09-14 Thread Jack Krupansky
I keep asking people this eternal question: What training or doc are you 
reading that is using this term "exact match"? Clearly the term is being 
used by a lot of people in a lot of ambiguous ways, when "exact" should 
be... "exact".


I think we need to start using the term "exact match" ONLY for string field 
queries, and that don't use wildcard, fuzzy, or range queries. And maybe 
also keyword tokenizer text fields that don't have any filters, which might 
as well be string fields.
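For reference, such a keyword-tokenized type is just (a minimal sketch):

<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

which, for matching purposes, behaves essentially like a string field.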


-- Jack Krupansky

-Original Message- 
From: FiMka

Sent: Sunday, September 14, 2014 9:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr: Tricky exact match, unwanted search results

*Erick*, thank you for help!
For exact match I still want:
- to use stemming (e.g. for "sleep" I want the word forms "slept", "sleeping",
  "sleeps" also to be used in searching)
- to disregard case sensitivity
- to disregard prepositions, conjunctions and other function words
- to match only docs having all of the query words and in the given order
  (except function words)
- to match only docs if there are no other words in the doc field besides the
  words in the query
- to use synonyms (e.g. "GB" == "gigabyte", "Television" == "TV")

Erick Erickson wrote

The easiest way to make your examples work would be to use a copyField to
an "exact match" field that uses the KeywordTokenizer


The KeywordTokenizer treats the entire field as a single token, regardless
of its content. So this does not fit my requirements.

Erick Erickson wrote

You'll have to be a little careful to escape spaces for multi-term bits,
like exact_field:pussy\ cat.


Hmm... I don't care about quoting right now at all. But should I?
Erick Erickson wrote

As far as your question about "if" and "in", what you're probably getting
here is stopword removal, but that's a guess.


I have the following document: [...] After I disabled solr.StopFilterFactory for
analyzer type="query", Solr stopped returning this document for the query:
http://localhost:8983/solr/lexikos/select?q=phraseExact%3A%22on+a+case-by-case%22
Can I somehow implement the desired "exact match" behavior?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Tricky-exact-match-unwanted-search-results-tp4158652p4158745.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr multiple sources configuration

2014-09-09 Thread Jack Krupansky
It is mostly a matter of how you expect to query that data - do you need 
different queries for different sources, or do you have a common conceptual 
model that covers all sources with a common set of queries?


-- Jack Krupansky

-Original Message- 
From: vineet yadav

Sent: Tuesday, September 9, 2014 6:40 PM
To: solr-user@lucene.apache.org
Subject: Solr multiple sources configuration

Hi,
I am using Solr to store data from multiple sources like social media,
news, journals, etc. So I am using a crawler, multiple scrapers and APIs to
gather data. I want to know the best way to configure Solr so
that I can store data which comes from multiple sources.

Thanks
Vineet Yadav 



Re: How to implement multilingual word components fields schema?

2014-09-08 Thread Jack Krupansky
You also need to take a stance as to whether you wish to auto-detect the 
language at query time vs. have a UI selection of language vs. attempt to 
perform the same query for each available language and then "determine" 
which has the best "relevancy". The latter two options are very sensitive to 
short queries. Keep in mind that auto-detection for indexing full documents 
is a different problem than auto-detection for very short queries.


-- Jack Krupansky

-Original Message- 
From: Ilia Sretenskii

Sent: Sunday, September 7, 2014 10:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to implement multilingual word components fields schema?

Thank you for the replies, guys!

Using a field-per-language approach for multilingual content is the last
thing I would try, since my actual task is to implement search
functionality which provides roughly the same capabilities for
every known world language.
The closest references are the popular web search engines; they seem to
serve worldwide users with their different languages and even
cross-language queries as well.
Thus, a field-per-language approach would be a sure waste of storage
resources due to the high number of duplicates, since there are over 200
known languages.
I really would like to keep a single field for cross-language searchable text
content, without splitting it into specific language fields or specific
language cores.

So my current choice will be to stay with just the ICUTokenizer and
ICUFoldingFilter as they are without any language specific
stemmers/lemmatizers yet at all.
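For reference, that minimal field type would be roughly the following, assuming the analysis-extras ICU jars are on the classpath:

<fieldType name="text_icu" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>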

Probably I will put the most popular languages stop words filters and
stemmers into the same one searchable text field to give it a try and see
if it works correctly in a stack.
Does stacking language-specific filters work correctly in one field?

Further development will most likely involve some advanced custom analyzers
like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
ScriptAttribute.
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java

So I would like to know more about those "academic papers on this issue of
how best to deal with mixed language/mixed script queries and documents".
Tom, could you please share them? 



Re: Is there any sentence tokenizers in sold 4.9.0?

2014-09-08 Thread Jack Krupansky
Out of curiosity, what would be an example query for your application that 
would depend on sentence tokenization, as opposed to simple term 
tokenization? I mean, there are no sentence-based query operators in the 
Solr query parsers.


-- Jack Krupansky

-Original Message- 
From: Sandeep B A

Sent: Monday, September 8, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any sentence tokenizers in sold 4.9.0?

Hi Susheel ,
Thanks for the information.
I have crawled a few websites and all I need is a sentence tokenizer for the
data I have collected.
These websites are English only.

Well, I don't have experience in writing custom sentence tokenizers for
Solr. Is there any tutorial link which tells how to do it?

Is it possible to integrate NLTK with Solr? If yes, how to do it? Because I
found sentence tokenizers for English in NLTK.

Thanks,
Sandeep
On Sep 5, 2014 8:10 PM, "Sandeep B A"  wrote:


Sorry for typo it is solr 4.9.0 instead of sold 4.9.0
 On Sep 5, 2014 7:48 PM, "Sandeep B A"  wrote:


Hi,

I was looking at the options for a default sentence tokenizer in Solr but
could not find one. Has anyone used one, or integrated a tokenizer from
another language ecosystem (Python, for example) into Solr? Please let me know.


Thanks and regards,
Sandeep







Re: How to solve?

2014-09-06 Thread Jack Krupansky
Payloads really don't have first-class support in Solr. It's a solid feature 
of Lucene, but never expressed well in Solr. Any thoughts or proposals are 
welcome!


(Hmmm... I wonder what the good folks at Heliosearch have up their sleeves 
in this area?!)


-- Jack Krupansky

-Original Message- 
From: William Bell

Sent: Friday, September 5, 2014 10:03 PM
To: solr-user@lucene.apache.org
Subject: How to solve?

We have a core with each document as a person.

We want to boost based on the sweater color, but if the person has sweaters
in their closet which are from the same manufacturer, we want to boost even more
by adding them together.

Peter Smit - Sweater: Blue = 1 : Nike, Sweater: Red = 2: Nike, Sweater:
Blue=1 : Polo
Tony S - Sweater: Red =2: Nike
Bill O - Sweater:Red = 2: Polo, Blue=1: Polo

Scores:

Peter Smit - 1+2 = 3.
Tony S - 2
Bill O - 2 + 1

I thought about using payloads.

sweaters_payload
Blue: Nike: 1
Red: Nike: 2
Blue: Polo: 1

How do I query this?

http://localhost:8983/solr/persons?q=*:*&sort=??

Ideas?




--
Bill Bell
billnb...@gmail.com
cell 720-256-8076 



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jack Krupansky
Sounds like a great feature to add to Solr, especially if it would facilitate 
more automatic relevancy enhancement. LucidWorks Search has a feature called 
"unsupervised feedback" that does that but something like a docvector might 
make it a more realistic default.


-- Jack Krupansky

-Original Message- 
From: "Jürgen Wagner (DVT)"

Sent: Friday, September 5, 2014 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights, to approximate the effect of FAST docvectors, e.g., by
normalizing them to [0...1). Processing performance would still be
different from the classical FAST docvectors. The space consumption may
become ugly for a 200+ GB range shard, however, FAST has also been quite
generous with disk space, anyway.

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or if something the like is planned for 5.0+.

Best regards,
--Jürgen

On 05.09.2014 16:02, Jack Krupansky wrote:

For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar 
items. This is a similarity vector representation that is returned for 
each item in the query result in the docvector managed property.


The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain 
a string parameter with the value of the docvector managed property of the 
item that is to be used as the similarity reference. The similarity vector 
consists of a set of "term,weight" expressions, indicating the most 
important terms or concepts in the item and the corresponding perceived 
importance (weight). Terms can be single words or phrases.


The weight is a float value between 0 and 1, where 1 indicates the highest 
relevance.


The similarity vector is created during item processing and indicates the 
most important terms or concepts in the item and the corresponding 
 weight.”


See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky




Re: How to implement multilingual word components fields schema?

2014-09-05 Thread Jack Krupansky
It comes down to how you personally want to value compromises between 
conflicting requirements, such as relative weighting of false positives and 
false negatives. Provide a few use cases that illustrate the boundary cases 
that you care most about. For example field values that have snippets in one 
language embedded within larger values in a different language. And, whether 
your fields are always long or sometimes short - the former can work well 
for language detection, but not the latter, unless all fields of a given 
document are always in the same language.


Otherwise simply index the same source text in multiple fields, one for each 
language. You can then do a dismax query on that set of fields.
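A sketch of what that could look like (the text_src source field and the choice of languages are just illustrative; text_en/text_fr/text_de are the language field types from the example schema):

<field name="text_src" type="string" indexed="false" stored="true"/>
<field name="text_en" type="text_en" indexed="true" stored="false"/>
<field name="text_fr" type="text_fr" indexed="true" stored="false"/>
<field name="text_de" type="text_de" indexed="true" stored="false"/>
<copyField source="text_src" dest="text_en"/>
<copyField source="text_src" dest="text_fr"/>
<copyField source="text_src" dest="text_de"/>

and then query with something like defType=edismax&qf=text_en text_fr text_de.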


-- Jack Krupansky

-Original Message- 
From: Ilia Sretenskii

Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?

Hello.
We have documents with multilingual words which consist of parts in different
languages, and search queries of the same complexity. It is a
worldwide-used online application, so users generate content in all the
possible world languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.

But then it requires stemming and lemmatization.

How to implement a schema with universal stemming/lemmatization which would
probably utilize the ICU generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the Basistech schema of their
commercial plugins and it defines tokenizer/filter language per field type,
which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii. 



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jack Krupansky
For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar 
items. This is a similarity vector representation that is returned for each 
item in the query result in the docvector managed property.

The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a 
string parameter with the value of the docvector managed property of the item 
that is to be used as the similarity reference. The similarity vector consists 
of a set of "term,weight" expressions, indicating the most important terms or 
concepts in the item and the corresponding perceived importance (weight). Terms 
can be single words or phrases.

The weight is a float value between 0 and 1, where 1 indicates the highest 
relevance.

The similarity vector is created during item processing and indicates the most 
important terms or concepts in the item and the corresponding weight.”

See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky

From: "Jürgen Wagner (DVT)" 
Sent: Friday, September 5, 2014 7:03 AM
To: solr-user@lucene.apache.org 
Subject: Re: FAST-like document vector data structures in Solr?

Hello Jim,
  yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently 
mapping docvectors to these mechanisms and create term vectors myself from 
third-party text mining components.

However, it's not quite like the FAST docvectors. Particularly, the 
performance of MoreLikeThis queries based on TermVectors is suboptimal on large 
document sets, so a more efficient support of such retrievals in the Lucene 
kernel would be preferred.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:

Hi,
Something like ?:
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
And just to show some impressive search functionality of the wiki: ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim


2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)" 

Re: looking for a solr/search expert in Paris

2014-09-03 Thread Jack Krupansky
Don't forget to check out the Solr Support wiki where consultants advertise 
their services:

http://wiki.apache.org/solr/Support

And any Solr or Lucene consultants on this mailing list should be sure that 
they are "registered" on that support wiki. Hey, it's free! And be sure to 
keep your listing up to date, including regional availability and any 
specialties.


-- Jack Krupansky

-Original Message- 
From: elisabeth benoit

Sent: Wednesday, September 3, 2014 4:02 AM
To: solr-user@lucene.apache.org
Subject: looking for a solr/search expert in Paris

Hello,


We are looking for a solr consultant to help us with our devs using solr.
We've been working on this for a little while, and we feel we need an
expert point of view on what we're doing, who could give us insights about
our solr conf, performance issues, error handling issues (big thing). Well
everything.

The entreprise is in the Paris (France) area. Any suggestion is welcomed.

Thanks,
Elisabeth 



Re: Indexing & search list of Key/Value pairs

2014-09-01 Thread Jack Krupansky
You can certainly have a separate multivalued text field, like "skills" that 
can have arbitrary text values like "PHP", "Ruby, "Software Development", 
"Agile Methodology", "Agile Development", "Cat Herding", etc., that are 
analyzed, lower cased, stemmed, etc.


As far as the dynamic field names go, technically they can have spaces and 
special characters and be case sensitive, but I would suggest that they be 
"normalized" as lower case, with underscores for special characters and 
spaces, such as:


skills:agile software_development:[10 TO *]

That would match somebody with "Agile Methodology" or "Agile Development" 
AND 10 or more years of "Software Development".


-- Jack Krupansky

-Original Message- 
From: amid

Sent: Monday, September 1, 2014 12:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing & search list of Key/Value pairs

Hi Jack,

Thanks for the fast response.
I assume that using this technique will have the following limitations:
1) Skill characters will be limited
2) Field names are not analyzed and will not get the full search
stack (synonyms, analyzers...)

Am I right?

If so, do you know of other techniques? (I don't have a problem with
customizing the implementation of parsers, scoring, etc.)

Many thanks,
Ami



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-search-list-of-Key-Value-pairs-tp4156206p4156219.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Indexing & search list of Key/Value pairs

2014-09-01 Thread Jack Krupansky
Solr supports multivalued fields, but really only for scalar, not structured 
values. And trying to manage two or more multivalued fields in parallel is 
also problematic. Better to simply use dynamic fields, such as name the 
field "xyz_skill" and the value is the number of years. Then you can simply 
query:


php_skill:[5 TO *] AND ruby_skill:[2 TO *]
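With a dynamic field rule along these lines in schema.xml (names and values are illustrative):

<dynamicField name="*_skill" type="tint" indexed="true" stored="true"/>

each candidate document just carries whatever skill fields apply, e.g.:

<doc>
  <field name="id">candidate-1</field>
  <field name="php_skill">5</field>
  <field name="ruby_skill">2</field>
</doc>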

-- Jack Krupansky

-Original Message- 
From: amid

Sent: Monday, September 1, 2014 12:24 PM
To: solr-user@lucene.apache.org
Subject: Indexing & search list of Key/Value pairs

Hi,

I'm using solr and trying to index a list of key/value pairs, the key
contains a string with a skill and the value is the years of experience
(i.e. someone with 5 years of php and 2 years of ruby).

I want to be able to create a query which return all document with a
specific skill and range of years,
i.e. php with 2-4 years

Is there a good way to index the list of skills pair so we can query it
easily?

Thanks,
Ami



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-search-list-of-Key-Value-pairs-tp4156206.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: external indexer for Solr Cloud

2014-09-01 Thread Jack Krupansky
Packaging SolrCell in the same manner, with parallel threads and able to 
talk to multiple SolrCloud servers in parallel would have a lot of the same 
benefits as well.


And maybe there could be some more generic Java framework for indexing as 
well, that "external indexers" in general could use.


-- Jack Krupansky

-Original Message- 
From: Shawn Heisey

Sent: Monday, September 1, 2014 11:42 AM
To: solr-user@lucene.apache.org
Subject: Re: external indexer for Solr Cloud

On 9/1/2014 7:19 AM, Jack Krupansky wrote:

It would be great to have a "standalone DIH" that runs as a separate
server and then sends standard Solr update requests to a Solr cluster.


This has been discussed, and I thought we had an issue in Jira, but I
can't find it.

A completely standalone DIH app would be REALLY nice.  I already know
that the JDBC ResultSet is not the bottleneck for indexing, at least for
me.  I once built a simple single-threaded SolrJ application that pulls
data from JDBC and indexes it in Solr.  It works in batches, typically
500 or 1000 docs at a time.  When I comment out the "solr.add(docs)"
line (so input object manipulation, casting, and building of the
SolrInputDocument objects is still happening), it can read and
manipulate our entire database (99.8 million documents) in about 20
minutes, but if I leave that in, it takes many hours.

The bottleneck is that each DIH has only a single thread indexing to
Solr.  I've theorized that it should be *relatively* easy for me to
write an application that pulls records off the JDBC ResultSet with
multiple threads (say 10-20), have each thread figure out which shard
its document lands on, and send it there with SolrJ.  It might even be
possible for the threads to collect several documents for each shard
before indexing them in the same request.

As with most multithreaded apps, the hard part is figuring out all the
thread synchronization, making absolutely certain that thread timing is
perfect without unnecessary delays.  If I can figure out a generic
approach (with a few configurable bells and whistles available), it
might be something suitable for inclusion in the project, followed with
improvements by all the smart people in our community.

Thanks,
Shawn 



Re: external indexer for Solr Cloud

2014-09-01 Thread Jack Krupansky
Okay, but please clarify further - do you simply wish to run DIH externally, 
but still sending each document to SolrCloud for indexing, or... are you 
expecting to generate the index completely external to the cluster and then 
somehow "merge" that DIH "index" into the SolrCloud index?


It would be great to have a "standalone DIH" that runs as a separate server 
and then sends standard Solr update requests to a Solr cluster.


-- Jack Krupansky

-Original Message- 
From: Lee Chunki

Sent: Sunday, August 31, 2014 8:55 PM
To: solr-user@lucene.apache.org
Subject: Re: external indexer for Solr Cloud

Hi Shawn and Jack,

Thank you for your reply.

Yes, I want to run the data import handler independently and sync it to Solr 
Cloud,
because currently my DIH node does not only DB fetch & join but also a lot of 
preprocessing.


Thanks,
Chunki.


On Aug 30, 2014, at 1:34 AM, Jack Krupansky  wrote:

My other thought was that maybe he wants to do index updates outside of 
the cluster that is handling queries, and then copy in the completed 
index. Or... maybe take replicas out of the query rotation while they are 
updated. Or... maybe this is yet another X-Y problem!


-- Jack Krupansky

-Original Message- From: Shawn Heisey
Sent: Friday, August 29, 2014 11:19 AM
To: solr-user@lucene.apache.org
Subject: Re: external indexer for Solr Cloud

On 8/29/2014 5:21 AM, Lee Chunki wrote:

Is there any way to run an external indexer for SolrCloud?


Jack asked an excellent question.  What do you mean by this?  Unless
you're using the dataimport handler, all indexing is external to Solr.


my situation is :

* running two indexer ( for fail over ) and two searcher.
* just use two searcher for service.
* have plan to move on Solr Cloud

however, I wonder whether, if I run the indexing job on one of the SolrCloud 
servers, that server's load would be higher than the other nodes'.

so, I want to build the index outside of SolrCloud but....


In SolrCloud, every shard replica will be indexing -- it's not like
old-style replication, where the master indexes everything and the
slaves copy the completed index.  The leader of each shard will be
working slightly harder than the other replicas, but you really don't
need to worry too much about sending all your updates to one server --
those requests get duplicated to the other servers and they all index
them, almost in parallel.

For my setup (non-cloud, but sharded), I use Pacemaker to ensure that
only one of my servers is running my indexing program and haproxy (plus
its shared IP address).

Thanks,
Shawn




Re: AW: Scaling to large Number of Collections

2014-09-01 Thread Jack Krupansky
And I would add another suggested requirement - "dormant collections" - 
collections which may once have been active, but have not seen any recent 
activity and can hence be "suspended" or "swapped out" until such time as 
activity resumes and they can then be "reactivated" or "reloaded". That 
inactivity threshold might be something like an hour, but should be 
configurable globally and per-collection. The alternative is an application 
server which maintains that activity state and starts up and shuts down 
discrete Solr server instances for each tenant's collection(s).


This raises the question: How many of your collections need to be 
simultaneously active? Say, in a one-hour period, how many of them will be 
updating and serving queries, and what query load per-collection and total 
query load do you need to design for?


-- Jack Krupansky
-Original Message- 
From: Christoph Schmidt

Sent: Monday, September 1, 2014 3:50 AM
To: solr-user@lucene.apache.org
Subject: AW: Scaling to large Number of Collections

Yes, this would help us in our scenario.

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Sunday, August 31, 2014 18:10
To: solr-user@lucene.apache.org
Subject: Re: Scaling to large Number of Collections

We should also consider "lightly-sharded" collections. IOW, even if a 
cluster has dozens or a hundred nodes or more, the goal may not be to shard 
all collections across all shards, which is fine for the really large 
collections, but to also support collections which may only need to be 
sharded for a few shards or even just a single shard, and to instead focus 
the attention on large number of collections rather than heavily-sharded 
collections.


-- Jack Krupansky

-Original Message-
From: Erick Erickson
Sent: Sunday, August 31, 2014 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Scaling to large Number of Collections

What is your access pattern? By that I mean do all the cores need to be 
searched at the same time or is it reasonable for them to be loaded on 
demand? This latter would impose the penalty of the first time a collection 
was accessed there would be a delay while the core loaded. I suppose I'm 
asking "how many customers are using the system simultaneously?". One way 
around that is to fire a dummy query behind the scenes when a user logs on 
but before she actually executes a search.


Why I'm asking:

See this page: http://wiki.apache.org/solr/LotsOfCores. It was intended for 
the multi-tenancy case in which you could count on a subset of users being 
logged on at once.


WARNING! LotsOfCores is NOT supported in SolrCloud at this point! There has 
been some talk of extending support for SolrCloud, but no action as it's one 
of those cases that has lots of implications particularly around ZooKeeper 
knowing the state of all the cores, cores going into recovery in a cascading 
fashion, etc. It's not at all clear that it _can_ be extended to SolrCloud for 
that matter without doing great violence to the code.


With the LotsOfCores approach (and assuming somebody volunteers to code it 
up), the number of cores hosted on a particular node can be many thousands.
The limits will come from how many of them have to be up and running 
simultaneously. The limits would come from two places:

1> The time it takes to recursively walk your SOLR_HOME directory and
discover the cores (I see about 1,000 cores/second discovered on my laptop, 
admittedly an SSD, and there has been no optimization done to this process).

2> having to keep a table of all the cores and their information (home
directory and the like) in memory, but practically I don't think this is a 
problem. I haven't actually measured, but the size of each entry is almost 
certainly less than 1K and probably closer to 0.5K.


But it really does bring us back to the question of whether all these cores 
are necessary or not. The "usual" technique for handling this with the 
LotsOfCores option is to combine the records into a number of smaller cores. 
Without knowing your requirements in detail, something like a customers core 
and a products core where, say, each product has a field with tokens 
indicating what users had access or vice versa, and (possibly) using pseudo 
joins. In one view, this is an ACL problem which has several solutions, each 
with drawbacks of course.


Or just de-normalizing your data entirely and just have a core per customer 
with _all_ the products indexed in to it.


Like I said, I don't know enough details to have a clue whether the data 
would explode unacceptably.


Anyway, enough on a Sunday morning!

Best,
Erick


On Sun, Aug 31, 2014 at 8:18 AM, Shawn Heisey  wrote:


On 8/31/2014 8:58 AM, Joseph Obernberger wrote:
> Could you add another field(s) to your application and use that
> instead
of
> 

Re: Specify Analyzer per field

2014-09-01 Thread Jack Krupansky
Thanks for finally specifying the feature so concisely. IOW, you want the ES 
feature of being able to specify the analyzer for the field as opposed to 
the field type.


See:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping-intro.html

"For analyzed string fields, use the analyzer attribute to specify which 
analyzer to apply both at search time and at index time. By default, 
Elasticsearch uses the standard analyzer, but you can change this by 
specifying one of the built-in analyzers, such as whitespace, simple, or 
english... In Custom analyzers we will show you how to define and use custom 
analyzers as well."


No, Solr does not have that feature per se - you have to specify a custom 
field TYPE to specify the analyzer.
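So for the X/Y example below, the Solr equivalent would be two field types, roughly:

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_std" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="X" type="text_ws" indexed="true" stored="true"/>
<field name="Y" type="text_std" indexed="true" stored="true"/>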


-- Jack Krupansky

-Original Message- 
From: Ankit Jain

Sent: Monday, September 1, 2014 2:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Specify Analyzer per field

Thanks for the response guys..

Let's say I have two fields X and Y, and the field type of both fields is
*text*. Now, I want to use the whitespace analyzer for field X and the standard
analyzer for field Y.

In Elasticsearch, we can specify a different analyzer for the same field
type. Is this feature available in Solr?

I want to use the schemaless feature of Solr because the schema is created at
runtime as per user input.


Regards,
Ankit Jain




On Sat, Aug 30, 2014 at 4:53 AM, Walter Underwood 
wrote:


Then don’t use schemaless.

We need a LOT more info about the application.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Aug 29, 2014, at 4:11 PM, Erick Erickson 
wrote:

> bq: Can't you just use old fashion dynamic fields and use suffixes to
mark
> the
> type you want?
>
> Not with "schemaless" I don't think, since you don't quite know what the
> names of the fields are in the first place. It's unlikely that the input
> format has field names like "age_t" that would map to the dynamic
field
>
>
> On Fri, Aug 29, 2014 at 8:55 AM, Alexandre Rafalovitch <
arafa...@gmail.com>
> wrote:
>
>> Can't you just use old fashion dynamic fields and use suffixes to mark
the
>> type you want?
>> On 29/08/2014 8:17 am, "Ankit Jain"  wrote:
>>
>>> Hi All,
>>>
>>> I would like to use schema less feature of Solr and also want to
specify
>>> the analyzer of each field at runtime(specify analyzer at the time of
>>> adding new field into solr).
>>>
>>> Also, I want to use the different analyzer for same field type.
>>>
>>> --
>>> Thanks,
>>> Ankit Jain
>>>
>>





--
Thanks,
Ankit Jain 



Re: Scaling to large Number of Collections

2014-08-31 Thread Jack Krupansky
We should also consider "lightly-sharded" collections. IOW, even if a 
cluster has dozens or a hundred nodes or more, the goal may not be to shard 
all collections across all shards, which is fine for the really large 
collections, but to also support collections which may only need to be 
sharded for a few shards or even just a single shard, and to instead focus 
the attention on large number of collections rather than heavily-sharded 
collections.


-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Sunday, August 31, 2014 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Scaling to large Number of Collections

What is your access pattern? By that I mean do all the cores need to be
searched at the same time or is it reasonable for them to be loaded on
demand? This latter would impose the penalty of the first time a collection
was accessed there would be a delay while the core loaded. I suppose I'm
asking "how many customers are using the system simultaneously?". One way
around that is to fire a dummy query behind the scenes when a user logs on
but before she actually executes a search.

Why I'm asking:

See this page: http://wiki.apache.org/solr/LotsOfCores. It was intended for
the multi-tenancy case in which you could count on a subset of users being
logged on at once.
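For reference, the knobs involved are roughly the following (illustrative values; see the warning below before relying on this in SolrCloud):

# core.properties, per core
transient=true
loadOnStartup=false

<!-- solr.xml -->
<int name="transientCacheSize">100</int>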

WARNING! LotsOfCores is NOT supported in SolrCloud at this point! There has
been some talk of extending support for SolrCloud, but no action as it's
one of those cases that has lots of implications particularly around
ZooKeeper knowing the state of all the cores, cores going into recovery in
a cascading fashion, etc. It's not at all clear that it _can_ be extended to
SolrCloud for that matter without doing great violence to the code.

With the LotsOfCores approach (and assuming somebody volunteers to code it
up), the number of cores hosted on a particular node can be many thousands.
The limits will come from how many of them have to be up and running
simultaneously. The limits would come from two places:
1> The time it takes to recursively walk your SOLR_HOME directory and
discover the cores (I see about 1,000 cores/second discovered on my laptop,
admittedly an SSD, and there has been no optimization done to this process).
2> having to keep a table of all the cores and their information (home
directory and the like) in memory, but practically I don't think this is a
problem. I haven't actually measured, but the size of each entry is almost
certainly less than 1K and probably closer to 0.5K.

But it really does bring us back to the question of whether all these cores
are necessary or not. The "usual" technique for handling this with the
LotsOfCores option is to combine the records into a number of smaller
cores. Without knowing your requirements in detail, something like a
customers core and a products core where, say, each product has a field
with tokens indicating what users had access or vice versa, and (possibly)
using pseudo joins. In one view, this is an ACL problem which has several
solutions, each with drawbacks of course.

Or just de-normalizing your data entirely and just have a core per customer
with _all_ the products indexed in to it.

Like I said, I don't know enough details to have a clue whether the data
would explode unacceptably.

Anyway, enough on a Sunday morning!

Best,
Erick


On Sun, Aug 31, 2014 at 8:18 AM, Shawn Heisey  wrote:


On 8/31/2014 8:58 AM, Joseph Obernberger wrote:
> Could you add another field(s) to your application and use that instead
of
> creating collections/cores?  When you execute a search, instead of
picking
> a core, just search a single large core but add in a field which 
> contains

> some core ID.

This is a nice idea.  Have one big collection in your cloud and use an
additional field in your queries to filter down to a specific user's data.

It'd be really nice to write a custom search component that ensures
there is a filter query for that specific field, and if it's not
present, change the search results to include a document that informs
the caller that they're not doing it right.

http://www.portal2sounds.com/1780

(That URL probably won't work correctly on mobile browsers)

Thanks,
Shawn






Re: AW: Scaling to large Number of Collections

2014-08-31 Thread Jack Krupansky

You close with two great questions for the community!

We have a similar issue over in Apache Cassandra database land (thousands of 
tables).


There is no immediate, easy, great answer. Other than the kinds of 
"workarounds" being suggested.


-- Jack Krupansky

-Original Message- 
From: Christoph Schmidt

Sent: Sunday, August 31, 2014 11:44 AM
To: solr-user@lucene.apache.org
Subject: AW: Scaling to large Number of Collections

One collection has 2 replicas, no sharding, the collections are not that 
big.


No, they are unfortunately not independent. There are collections with 
customer documents (some thousand customers) and product collections. One 
customer has at least one customer collection and 1 to some hundred products.
The combination of these collections is used to drive the search of a 
Liferay portal. Each customer has its own Liferay portal.


We could split up the cluster into several clusters by customer, but then we 
would have to duplicate the product collections in each Solr cluster.


Will Solr go in the direction of "large number of collections"? And the 
question is, what is a "large number"?


Best
Christoph

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Sunday, August 31, 2014 14:09
To: solr-user@lucene.apache.org
Subject: Re: Scaling to large Number of Collections

How are the 5 servers arranged in terms of shards and replicas? 5 shards 
with 1 replica each, 1 shard with 5 replicas, 2 shards with 2 and 3 
replicas, or... what?


How big is each collection? The key strength of SolrCloud is scaling large 
collections via shards, NOT scaling large numbers of collections. If you 
have large numbers of collections, maybe they should be divided into 
separate clusters, especially if they are independent.


Is this a multi-tenancy situation or a single humongous app?

In any case, "large numbers of collections in a single SolrCloud cluster" is 
not a supported scenario at this time. Certainly suggestions for future 
enhancement can be made though.


-- Jack Krupansky

-Original Message-
From: Christoph Schmidt
Sent: Sunday, August 31, 2014 4:04 AM
To: solr-user@lucene.apache.org
Subject: Scaling to large Number of Collections

We see at least two problems when scaling to a large number of collections. I 
would like to ask the community if they are known and maybe already 
addressed in development:

We have a SolrCloud running with the following numbers:
-  5 Servers (each 24 CPUs, 128 RAM)
-  13.000 Collections with 25.000 SolrCores in the Cloud
The Cloud is working fine, but we see two problems, if we like to scale 
further

1.   Resource consumption of native system threads
We see that each collection opens at least two threads: one for the 
zookeeper (coreZkRegister-1-thread-5154) and one for the searcher

(searcherExecutor-28357-thread-1)
We will run in "OutOfMemoryError: unable to create new native thread". Maybe 
the architecture could be changed here to use thread pools?

2.   The shutdown and the startup of one server in the SolrCloud takes 2
hours. So a rolling start is about 10h. For me the problem seems to be that 
leader election is "linear". The Overseer works core by core. The 
organisation of the cloud is not done in parallel or distributed. Is this 
already addressed by https://issues.apache.org/jira/browse/SOLR-5473 or is 
there more needed?


Thanks for discussion and help
Christoph
___

Dr. Christoph Schmidt | Geschäftsführer

P +49-89-523041-72
M +49-171-1419367
Skype: cs_moresophy
christoph.schm...@moresophy.de<mailto:heiko.be...@moresophy.de>
www.moresophy.com<http://www.moresophy.com/>
moresophy GmbH | Fraunhoferstrasse 15 | 82152 München-Martinsried 



Re: Scaling to large Number of Collections

2014-08-31 Thread Jack Krupansky
How are the 5 servers arranged in terms of shards and replicas? 5 shards 
with 1 replica each, 1 shard with 5 replicas, 2 shards with 2 and 3 
replicas, or... what?


How big is each collection? The key strength of SolrCloud is scaling large 
collections via shards, NOT scaling large numbers of collections. If you 
have large numbers of collections, maybe they should be divided into 
separate clusters, especially if they are independent.


Is this a multi-tenancy situation or a single humongous app?

In any case, "large numbers of collections in a single SolrCloud cluster" is 
not a supported scenario at this time. Certainly suggestions for future 
enhancement can be made though.


-- Jack Krupansky

-Original Message- 
From: Christoph Schmidt

Sent: Sunday, August 31, 2014 4:04 AM
To: solr-user@lucene.apache.org
Subject: Scaling to large Number of Collections

We see at least two problems when scaling to a large number of collections. I 
would like to ask the community if they are known and maybe already 
addressed in development:

We have a SolrCloud running with the following numbers:
-  5 Servers (each 24 CPUs, 128 RAM)
-  13.000 Collections with 25.000 SolrCores in the Cloud
The Cloud is working fine, but we see two problems, if we like to scale 
further

1.   Resource consumption of native system threads
We see that each collection opens at least two threads: one for the 
zookeeper (coreZkRegister-1-thread-5154) and one for the searcher 
(searcherExecutor-28357-thread-1)
We will run in "OutOfMemoryError: unable to create new native thread". Maybe 
the architecture could be changed here to use thread pools?
2.   The shutdown and the startup of one server in the SolrCloud takes 2 
hours. So a rolling start is about 10h. For me the problem seems to be that 
leader election is "linear". The Overseer works core by core. The 
organisation of the cloud is not done in parallel or distributed. Is this 
already addressed by https://issues.apache.org/jira/browse/SOLR-5473 or is 
there more needed?


Thanks for discussion and help
Christoph
___

Dr. Christoph Schmidt | Geschäftsführer

P +49-89-523041-72
M +49-171-1419367
Skype: cs_moresophy
christoph.schm...@moresophy.de<mailto:heiko.be...@moresophy.de>
www.moresophy.com<http://www.moresophy.com/>
moresophy GmbH | Fraunhoferstrasse 15 | 82152 München-Martinsried



Re: solr result handler??

2014-08-30 Thread Jack Krupansky

You can specify a filter query that has "must not" terms. For example:

fq=*:* field1:(-shoot -darn -rats) field2:(-shoot -darn -rats)

or

fq=*:* field1:(-shoot -darn -rats)
fq=*:* field2:(-shoot -darn -rats)

You could specify edismax for the filter query parser and list the fields in 
the qf parameter, BUT... the qf parameter would then be shared between the 
main query and the filter query.


You could also include that filter query in the "invariants" or "appends" 
section of the query request handler configuration in solrconfig to assure 
that no query could override that filter. Or, do an application layer that 
forces that filter to be added.
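For example, roughly (reusing the placeholder blacklist terms from above):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="appends">
    <str name="fq">*:* field1:(-shoot -darn -rats) field2:(-shoot -darn -rats)</str>
  </lst>
</requestHandler>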


-- Jack Krupansky

-Original Message- 
From: cmd.ares

Sent: Saturday, August 30, 2014 2:10 AM
To: solr-user@lucene.apache.org
Subject: solr result handler??

I have a blacklist that holds some keywords, and the query results need to
exclude the blacklist. If any field value contains a keyword, the row
should be removed.
I think there are two ways:
1. Modify the Solr resultset handler - which class can be modified?
2. Can I implement or extend some class to filter the query result?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-result-handler-tp4155940.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: external indexer for Solr Cloud

2014-08-29 Thread Jack Krupansky
My other thought was that maybe he wants to do index updates outside of the 
cluster that is handling queries, and then copy in the completed index. 
Or... maybe take replicas out of the query rotation while they are updated. 
Or... maybe this is yet another X-Y problem!


-- Jack Krupansky

-Original Message- 
From: Shawn Heisey

Sent: Friday, August 29, 2014 11:19 AM
To: solr-user@lucene.apache.org
Subject: Re: external indexer for Solr Cloud

On 8/29/2014 5:21 AM, Lee Chunki wrote:

Is there any way to run external indexer for solar cloud?


Jack asked an excellent question.  What do you mean by this?  Unless
you're using the dataimport handler, all indexing is external to Solr.


my situation is :

* running two indexer ( for fail over ) and two searcher.
* just use two searcher for service.
* have plan to move on Solr Cloud

however I wonder whether, if I run the indexing job on one of the SolrCloud 
servers, that server's load would be higher than on the other nodes.

so, I want to build the index outside of SolrCloud but….


In SolrCloud, every shard replica will be indexing -- it's not like
old-style replication, where the master indexes everything and the
slaves copy the completed index.  The leader of each shard will be
working slightly harder than the other replicas, but you really don't
need to worry too much about sending all your updates to one server --
those requests get duplicated to the other servers and they all index
them, almost in parallel.

For my setup (non-cloud, but sharded), I use Pacemaker to ensure that
only one of my servers is running my indexing program and haproxy (plus
its shared IP address).

Thanks,
Shawn 



Re: Specify Analyzer per field

2014-08-29 Thread Jack Krupansky

But that doesn't let him change or override the analyzer for the field type.

-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch 
Sent: Friday, August 29, 2014 11:55 AM 
To: solr-user 
Subject: Re: Specify Analyzer per field 


Can't you just use old-fashioned dynamic fields and use suffixes to mark the
type you want?
On 29/08/2014 8:17 am, "Ankit Jain"  wrote:


Hi All,

I would like to use schema less feature of Solr and also want to specify
the analyzer of each field at runtime(specify analyzer at the time of
adding new field into solr).

Also, I want to use the different analyzer for same field type.

--
Thanks,
Ankit Jain



Re: Specify Analyzer per field

2014-08-29 Thread Jack Krupansky

Different field TYPES, not different fields.

-- Jack Krupansky

-Original Message- 
From: Ahmet Arslan

Sent: Friday, August 29, 2014 8:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Specify Analyzer per field

Hi,

I think he wants to change query analyzer dynamically, where index analyzer 
remains same.

I needed that functionality in the past.

Creating additional field would waste resources, if the difference is in the 
query analyzer only.


Ahmet



On Friday, August 29, 2014 3:39 PM, Jack Krupansky  
wrote:

Each field type specifies a single analyzer (although query, index, and
multi-term are separate analyzers). If you want to have multiple analyzers
for a given field type, then you need to have a separate field type for
each.

If you expect to have that fine control over field type issues, then I would
suggest that "schemaless" is not an appropriate choice. Maybe you simply
wish to add field types dynamically. There is an open Jira for adding that
feature to Solr:

SOLR-5098 - Add REST support for adding field types to the schema
https://issues.apache.org/jira/browse/SOLR-5098

That said, maybe you could provide a couple of examples of exactly what you
want to do.

-- Jack Krupansky




-Original Message- 
From: Ankit Jain

Sent: Friday, August 29, 2014 8:16 AM
To: solr-user@lucene.apache.org
Subject: Specify Analyzer per field

Hi All,

I would like to use schema less feature of Solr and also want to specify
the analyzer of each field at runtime(specify analyzer at the time of
adding new field into solr).

Also, I want to use the different analyzer for same field type.

--
Thanks,
Ankit Jain 



Re: Specify Analyzer per field

2014-08-29 Thread Jack Krupansky
Each field type specifies a single analyzer (although query, index, and 
multi-term are separate analyzers). If you want to have multiple analyzers 
for a given field type, then you need to have a separate field type for 
each.
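
For example, a sketch (the type names and analyzers here are illustrative, 
not taken from the question): each field type carries its own analyzers, so a 
type that differs only in its query-time analysis still has to be declared 
separately.

  <fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
  </fieldType>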


If you expect to have that fine control over field type issues, then I would 
suggest that "schemaless" is not an appropriate choice. Maybe you simply 
wish to add field types dynamically. There is an open Jira for adding that 
feature to Solr:


SOLR-5098 - Add REST support for adding field types to the schema
https://issues.apache.org/jira/browse/SOLR-5098

That said, maybe you could provide a couple of examples of exactly what you 
want to do.


-- Jack Krupansky

-Original Message- 
From: Ankit Jain

Sent: Friday, August 29, 2014 8:16 AM
To: solr-user@lucene.apache.org
Subject: Specify Analyzer per field

Hi All,

I would like to use schema less feature of Solr and also want to specify
the analyzer of each field at runtime(specify analyzer at the time of
adding new field into solr).

Also, I want to use the different analyzer for same field type.

--
Thanks,
Ankit Jain 



Re: external indexer for Solr Cloud

2014-08-29 Thread Jack Krupansky

What exactly are you referring to by the term "external indexer"?

-- Jack Krupansky

-Original Message- 
From: Lee Chunki

Sent: Friday, August 29, 2014 7:21 AM
To: solr-user@lucene.apache.org
Subject: external indexer for Solr Cloud

Hi,

Is there any way to run external indexer for solar cloud?


my situation is :

* running two indexer ( for fail over ) and two searcher.
* just use two searcher for service.
* have plan to move on Solr Cloud

however I wonder whether, if I run the indexing job on one of the SolrCloud 
servers, that server's load would be higher than on the other nodes.

so, I want to build the index outside of SolrCloud but….

Please tell me your case or experience.

Thanks,
Chunki.= 



Re: Query regarding URL Analysers

2014-08-28 Thread Jack Krupansky
Sorry for the delay... take a look at the URL Classify update processor, 
which parses a URL and distributes the components to various fields:

http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessorFactory.html
http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html
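
For illustration, a rough sketch of how it is wired up as an update request 
processor chain in solrconfig.xml (the chain name is made up, and the 
factory's field-name options are the parameters described in the Javadoc 
above - this skeleton just shows where the processor goes):

  <updateRequestProcessorChain name="url_classify">
    <processor class="solr.URLClassifyProcessorFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

The chain is then selected per update request with update.chain=url_classify, 
or made the default on the update handler.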

The official doc is... pitiful, but I have doc and examples in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-Original Message- 
From: Sathyam

Sent: Thursday, August 28, 2014 6:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Query regarding URL Analysers

Gentle Reminder


On 21 August 2014 18:05, Sathyam  wrote:


Hi,

I needed to generate tokens out of a URL such that I am able to get
hierarchical units of the URL as well as each individual entity as tokens.
For example:
*Given a URL : *

http://www.google.com/abcd/efgh/ijkl/mnop.php?a=10&b=20&c=30#xyz

The tokens that I need are :

*Hierarchical subsets of the URL*

1 http://

2 http://www.google.com/

3 http://www.google.com/abcd/

 4 http://www.google.com/abcd/efgh/

5 http://www.google.com/abcd/efgh/ijkl/

 6 http://www.google.com/abcd/efgh/ijkl/mnop.php

*Individual elements in the path to the resource*

7 abcd

8 efgh

9 ijkl

10 mnop.php

*Query Terms*

11 a=10

12 b=20

13 c=30

*Fragment*
14 xyz

This comes to a total of 14 tokens for the given URL.
Basically a URL analyzer that creates tokens based on the categories
mentioned in bold. Also a separate token for port(if mentioned).

I would like to know how this can be achieved by using a single analyzer
that uses a combination of the tokenizers and filters provided by solr.
Also curious to know why there is a restriction of only *one* tokenizer
to be used in an analyzer.
Looking forward to a response from your side telling the best possible way
to achieve the closest to what I need.

Thanks.
--
Sathyam Doraswamy







--
Sathyam Doraswamy 



Re: Solr CPU Usage

2014-08-27 Thread Jack Krupansky
Is the high usage just suddenly happening after a long period of up-time 
without it, or is this on a server restart? The latter can happen if you 
have a large commit log to replay because you haven't done hard commits.


-- Jack Krupansky

-Original Message- 
From: Shawn Heisey

Sent: Wednesday, August 27, 2014 9:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr CPU Usage

On 8/27/2014 4:16 AM, hendra_budiawan wrote:

I'm having high cpu usage on my server, detailed on picture below
<http://lucene.472066.n3.nabble.com/file/n4155370/htop-server.png>

Using default config for solrconfig.xml & schema.xml, can anyone help me 
to

identified why the cpu so high on solr process?


A standard "top" screenshot would be a lot more useful than htop -- it
includes information about memory sizes and utilization.

The most common reason for performance issues is not enough RAM, either
heap or OS disk cache, maybe both.  Let's start with a standard "top"
screenshot, then additional questions may be required from there.  Some
light reading in the meantime:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn 



Re: Solr content limits?

2014-08-27 Thread Jack Krupansky
There are no such "limits" in Solr. Rather, it is up to you to configure as 
much hardware as you need.


From a practical perspective, I would say that you should try to limit 
machines to 100 million documents per node, and maybe 100 nodes maximum in a 
cluster. Those are not hard limits in any way, but beyond that, you will 
need to configure and tune much more carefully. To put it another way, to go 
beyond that, you should expect to hire an "expert" to do so.


The more proper answer to your question is to do a "proof of concept" 
implementation in which you load a range of documents on your chosen 
hardware, both a single machine and a small cluster, and measure how much 
load it can handle and how it performs. And then scale your cluster based on 
that application-specific performance data.


-- Jack Krupansky

-Original Message- 
From: lalitjangra

Sent: Tuesday, August 26, 2014 11:36 PM
To: solr-user@lucene.apache.org
Subject: Solr content limits?

Hi,

I am using Solr 4.6.0 with a single collection/core and want to know details
about the following.

1. What is the maximum number of documents which can be uploaded in a single
collection/core?
2. What is the maximum size of a document I can upload in Solr without
failing?
3. Is there any way to update these limits, if possible?

Regards.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-content-limits-tp4155317.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr range query issue

2014-08-27 Thread Jack Krupansky
The "AND" and "-" operators are being parsed at the same level - no 
parentheses are involved, so they generate a single, flat Boolean query.


So it really is equivalent to:

-name:[A TO Z] -name:[a TO z]

That is a purely negative query, so Solr should automatically supply a *:* 
terms so that it is equivalent to:


*:* -name:[A TO Z] -name:[a TO z]

Now, on to the real problem... "Zareena", 
"Zhariman","Zarimanabibi","Zarnabanu" etc all lexically FOLLOW "Z", so they 
are NOT excluded from the results. You need an end point for the range that 
is greater than or equal to all terms you want to match. That could be 
something like "ZZ" and "zz".


But, that won't work either since the second character could be lower case, 
so maybe it needs to be:


-name:[A TO zz]

Which covers both upper and lower case, but also includes the special 
characters between the two alpha ranges, including underscore. Is underscore 
an issue here?


Maybe the following pattern will cover your cases but keep underscore names:

-name:[A TO Zz]

Is name a "string" or "text" field (which)? If a text field, does it have a 
lower case filter, in which case you don't need lower case.


Worst case, you could use a regex query term, but better to avoid that if at 
all possible.
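
(For completeness, a hedged sketch of the regex alternative: on a string 
field, where the whole value is indexed as a single term, something like

  name:/[A-Za-z]+/

matches only values made up entirely of English letters - but regular 
expression terms have to scan the term dictionary, so expect them to be slow 
on a large index.)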


-- Jack Krupansky

-Original Message- 
From: nutchsolruser

Sent: Wednesday, August 27, 2014 12:21 AM
To: solr-user@lucene.apache.org
Subject: Solr range query issue

Hi ,

I am using Solr 4.6.1. I have a name field in my schema and I am sending the
following query from the Solr admin UI to Solr, which should find names
containing characters other than English alphabet letters.

-name:[A TO Z] AND -name:[a TO z]

In my opinion it should return documents which do not contain name in range
between A TO Z, but in my case Solr is also returning names starting with
letter Z. e.g.  "Zareena", "Zhariman","Zarimanabibi","Zarnabanu" etc

Is this correct behaviour? If yes, then what would be the correct query to
find user names which contain only English alphabet characters?

Following is my debug output:
 "debug": {
   "rawquerystring": "-name:[A TO Z] AND -name:[a TO z]",
   "querystring": "-name:[A TO Z] AND -name:[a TO z]",
   "parsedquery": "-name:[a TO z] -name:[a TO z]",
   "parsedquery_toString": "-name:[a TO z] -name:[a TO z]",
   "QParser": "LuceneQParser",







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-range-query-issue-tp4155327.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Help with StopFilterFactory

2014-08-26 Thread Jack Krupansky
I agree that it's a bad situation, and wasn't handled well by the Lucene 
guys. They may have had good reasons, but they didn't execute a decent plan 
for how to migrate existing behavior.


-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Tuesday, August 26, 2014 6:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with StopFilterFactory

So it sounds like a bug to me, doesn't it? The Internet is full of complaints
about this issue, and why should we all suffer because of someone who didn't
know when and how to use this feature and as a result got wrong data indexed?
Who cares about it??? And why remove an option that is so useful for
many people who do know how to use it?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4155162.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Help with StopFilterFactory

2014-08-26 Thread Jack Krupansky

Sigh. Maybe I vaguely recall some vague discussion of this.

Okay, so you can get the "old" behavior, either by globally setting the 
"lucene match version" in solrconfig:

  <luceneMatchVersion>4.3</luceneMatchVersion>

Or, probably best, just set the lucene match version for that specific token 
filter by adding this attribute:


luceneMatchVersion="4.3"
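
For example, something like this on the field type's stop filter (a sketch - 
the stop words file is the one from the earlier schema snippet in this 
thread, and the per-filter attribute is what keeps the deprecated option 
accepted):

  <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
          luceneMatchVersion="4.3" enablePositionIncrements="false"/>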

But... the old behavior is now "deprecated", so it most likely will not be 
in Solr 5.0.


I'll think about this some more as to whether there might be some workaround 
or alternative.


-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Tuesday, August 26, 2014 6:02 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with StopFilterFactory

Hi, just tried your suggestion but get this error:


And then I found the next:
http://stackoverflow.com/questions/18668376/solr-4-4-stopfilterfactory-and-enablepositionincrements.

I don't really know why they did so, the reason that "it can create broken
token streams" doesn't fit in my mind. Perhaps those who made this decision
do not use Solr so they simply don't care, that's the only explanation I can
find.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4155157.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: embedded documents

2014-08-25 Thread Jack Krupansky
And a comparison to Elasticsearch would be helpful, since ES gets a lot of 
mileage from their super-easy JSON support. IOW, how much of the ES 
"advantage" is eliminated.


-- Jack Krupansky

-Original Message- 
From: Noble Paul

Sent: Monday, August 25, 2014 1:59 PM
To: solr-user@lucene.apache.org
Subject: Re: embedded documents

The simplest use case is to dump the entire JSON using split=/&f=/**. I am
planning to add an alias for the same (SOLR-6343).

Support for nested docs is missing now and we will need to add it. A ticket needs
to be opened.
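
(For illustration of the split=/&f=/** usage above - a rough sketch, assuming 
the /update/json/docs endpoint added for SOLR-6304 in 4.10; the collection 
name and input file are placeholders:

  curl 'http://localhost:8983/solr/collection1/update/json/docs?split=/&f=/**' \
       -H 'Content-Type: application/json' --data-binary @input.json

split=/ keeps the whole object as one document and f=/** maps every leaf 
value to a field named after its JSON path.)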


On Mon, Aug 25, 2014 at 6:45 AM, Jack Krupansky 
wrote:


Thanks, Erik, but... I've read that Jira several times over the past
month, it is far too cryptic for me to make any sense out of what it is
really trying to do. A simpler approach is clearly needed.

My perception of SOLR-6304 is not that it indexes a single JSON object as
a single Solr document, but that it generates a collection of separate
documents, somewhat analogous to Lucene block/child documents, but... not
quite.

I understood the request on this message thread to be the flattening of a
single nested JSON object to a single Solr document.

IMHO, we need to be trying to make Solr more automatic and more
approachable, not an even more complicated "toolkit".

-- Jack Krupansky

-Original Message- From: Erik Hatcher
Sent: Monday, August 25, 2014 9:32 AM

To: solr-user@lucene.apache.org
Subject: Re: embedded documents

Jack et al - there’s now this, which is available in the any-minute
release of Solr 4.10: https://issues.apache.org/jira/browse/SOLR-6304

Erik

On Aug 25, 2014, at 5:01 AM, Jack Krupansky 
wrote:

 That's a completely different concept, I think - the ability to return a
single field value as a structured JSON object in the "writer", rather 
than
simply "loading" from a nested JSON object and distributing the key 
values

to normal Solr fields.

-- Jack Krupansky

-Original Message- From: Bill Bell
Sent: Sunday, August 24, 2014 7:30 PM
To: solr-user@lucene.apache.org
Subject: Re: embedded documents

See my Jira. It supports it via json.fsuffix=_json&wt=json

http://mail-archives.apache.org/mod_mbox/lucene-dev/
201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E

Bill Bell
Sent from mobile


 On Aug 24, 2014, at 6:43 AM, "Jack Krupansky" 

wrote:

Indexing and query of raw JSON would be a valuable addition to Solr, so
maybe you could simply explain more precisely your data model and
transformation rules. For example, when multi-level nesting occurs, what
does your loader do?

Maybe if the field names were derived by concatenating the full path of
JSON key names, like titles_json.FR, field-name nesting could be handled
in a fully automated manner.

I had been thinking of filing a Jira proposing exactly that, so that
even the most deeply nested JSON maps could be supported, although
combinations of arrays and maps would be problematic.

-- Jack Krupansky

-Original Message- From: Michael Pitsounis
Sent: Wednesday, August 20, 2014 7:14 PM
To: solr-user@lucene.apache.org
Subject: embedded documents

Hello everybody,

I had a requirement to store complicated json documents in solr.

i have modified the JsonLoader to accept complicated json documents with
arrays/objects as values.

It stores the object/array and then flatten it and  indexes the fields.

e.g  basic example document

{
 "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN
title"} ,
 "id": 103,
 "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
}

It will store titles_json:{"FR":"This is the FR title" , "EN":"This is
the
EN title"}
and then index fields

titles.FR:"This is the FR title"
titles.EN:"This is the EN title"


Do you see any problems with this approach?



Regards,
Michael Pitsounis







--
-
Noble Paul 



Re: embedded documents

2014-08-25 Thread Jack Krupansky
Thanks, Erik, but... I've read that Jira several times over the past month, 
it is far too cryptic for me to make any sense out of what it is really 
trying to do. A simpler approach is clearly needed.


My perception of SOLR-6304 is not that it indexes a single JSON object as a 
single Solr document, but that it generates a collection of separate 
documents, somewhat analogous to Lucene block/child documents, but... not 
quite.


I understood the request on this message thread to be the flattening of a 
single nested JSON object to a single Solr document.


IMHO, we need to be trying to make Solr more automatic and more 
approachable, not an even more complicated "toolkit".


-- Jack Krupansky

-Original Message- 
From: Erik Hatcher

Sent: Monday, August 25, 2014 9:32 AM
To: solr-user@lucene.apache.org
Subject: Re: embedded documents

Jack et al - there’s now this, which is available in the any-minute release 
of Solr 4.10: https://issues.apache.org/jira/browse/SOLR-6304


Erik

On Aug 25, 2014, at 5:01 AM, Jack Krupansky  wrote:

That's a completely different concept, I think - the ability to return a 
single field value as a structured JSON object in the "writer", rather 
than simply "loading" from a nested JSON object and distributing the key 
values to normal Solr fields.


-- Jack Krupansky

-Original Message- From: Bill Bell
Sent: Sunday, August 24, 2014 7:30 PM
To: solr-user@lucene.apache.org
Subject: Re: embedded documents

See my Jira. It supports it via json.fsuffix=_json&wt=json

http://mail-archives.apache.org/mod_mbox/lucene-dev/201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E

Bill Bell
Sent from mobile


On Aug 24, 2014, at 6:43 AM, "Jack Krupansky"  
wrote:


Indexing and query of raw JSON would be a valuable addition to Solr, so 
maybe you could simply explain more precisely your data model and 
transformation rules. For example, when multi-level nesting occurs, what 
does your loader do?


Maybe if the field names were derived by concatenating the full path of 
JSON key names, like titles_json.FR, field-name nesting could be 
handled in a fully automated manner.


I had been thinking of filing a Jira proposing exactly that, so that even 
the most deeply nested JSON maps could be supported, although 
combinations of arrays and maps would be problematic.


-- Jack Krupansky

-Original Message- From: Michael Pitsounis
Sent: Wednesday, August 20, 2014 7:14 PM
To: solr-user@lucene.apache.org
Subject: embedded documents

Hello everybody,

I had a requirement to store complicated json documents in solr.

i have modified the JsonLoader to accept complicated json documents with
arrays/objects as values.

It stores the object/array and then flatten it and  indexes the fields.

e.g  basic example document

{
 "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN
title"} ,
 "id": 103,
 "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
}

It will store titles_json:{"FR":"This is the FR title" , "EN":"This is 
the

EN title"}
and then index fields

titles.FR:"This is the FR title"
titles.EN:"This is the EN title"


Do you see any problems with this approach?



Regards,
Michael Pitsounis




Re: Help with StopFilterFactory

2014-08-25 Thread Jack Krupansky
Interesting. First, an apology for an error in my e-book - it says that the 
enablePositionIncrements parameter for the stop filter defaults to "false", 
but it actually defaults to "true". The question mark represents a "position 
increment". In your case you don't want position increments, so add the 
enablePositionIncrements="false" parameter to the stop filter, and be sure 
to reindex your data. The position increment leaves a "hole" where each stop 
word was removed. The question mark represents the hole. All bets are off as 
to what phrase query does when the phrase starts with a hole. I think the 
basic idea is that there must be some term in the index at that position 
that can be "skipped".


This is actually a change in behavior, which occurred as a side effect of 
LUCENE-4963 in 4.4. The default for enablePositionIncrements was false, but 
that release changed it to true.


I suspect that I wrote that section of my e-book before 4.4 came out. 
Unfortunately, the change is not well documented - nothing in the Javadoc, 
and this is another example of where an underlying change in Lucene that 
impacts Solr users is not well highlighted for Solr users. Sorry about that.


In any case, try adding enablePositionIncrements="false", reindex, and see 
what happens.


-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Monday, August 25, 2014 3:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with StopFilterFactory

A valid search:
http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
An Invalid search:
http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww

What weird I found is that the valid query has:
"parsedquery_toString": "+(url_words_ngram:\"twitter com zer0sleep\")"
And the invalid one has:
"parsedquery_toString": "+(url_words_ngram:\"? twitter com zer0sleep\")"

So "https" part was replaced with a "?".



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154957.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Exact search with special characters

2014-08-25 Thread Jack Krupansky
To be honest, I'm not precisely sure what Google is really doing under the 
hood since there is no detailed spec publically available. We know that 
quotes do force a phrase searchin Google, but do they disable stemming or 
preserve case and special characters? Unknown. Although, my PERCEPTION of 
Google is that it does disable stemming but continues to be case insensitive 
and ignore special characters in quoted phrases, but I don't see that 
behavior documented for search help in Google. IOW, trying to fall back on a 
precise definition from Google won't help us here. IOW, we don't have a 
clear view of "Exact search with special characters" for Google itself.


Bottom line: If you want to search both with and without special characters, 
that will have to be done with separate fields with separate analyzers.


You could use the combination of the keyword tokenizer and the ngram filter 
(at index time only) to support what YOU SEEM to be calling "exact match", 
but then you will need to specify that separate field name in addition to 
quoting the phrase. Or, just use a string field and then do wildcard or 
regex queries on that field for whatever degree of "exactness" you require.
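
As a rough sketch of the keyword-tokenizer-plus-ngram idea (the type name and 
gram sizes are made up, and the ngram filter will fatten the index 
noticeably):

  <fieldType name="text_exactish" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="50"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

A query on such a field matches whenever the lower-cased query string is a 
substring of the indexed value (up to the maximum gram size), so "test host" 
would match "Test host" but not "Test_host".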


-- Jack Krupansky

-Original Message- 
From: Shay Sofer

Sent: Monday, August 25, 2014 8:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Exact search with special characters

Hi,

Thanks for your reply.

I thought that Google search works the same (quotes stand for exact match).

Example for my demands:
Objects:
- test host
- test_host
- test $host
- test-host

When I search for test host I'll get all of the above results.

When I search for "test host" I'll get only test host.

Also, when searching for a partial string like test / host I'll get all of the above 
results.


Thanks.

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Sunday, August 24, 2014 3:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact search with special characters

What precisely do you mean by the term "exact search". I mean, Solr (and
Lucene) do not have that concept for tokenized text fields.

Or did you simply mean "quoted phrase". In which case, you need to be aware 
that all the quotes do is assure that the terms occur in that order or in 
close proximity according to the default or specified "phrase slop"
distance. But each term is still analyzed according to the analyzer for the 
field.


Technically, Lucene will in fact analyze the full quoted phrase as one 
stream, which for non-tokenized fields will be one term, but for any 
tokenized fields which split on white space, the phrase will be broken into 
separate tokens and special characters will tend to be removed as well. The 
keyword tokenizer will indeed treat the entire phrase as a single token, and 
the white space tokenizer will preserve special characters, but the standard 
tokenizer will not preserve either white space or special characters.


Nominally, the keyword tokenizer does generate a single term at least at the 
tokenization stage, but the word delimiter filter then splits individual 
terms into multiple terms, thus guaranteeing that a phrase with white space 
will be multiple terms and special characters are removed as well.


The other technicality is that quoting a phrase does prevent the phrase from 
being interpreted as query parser syntax, such as AND and OR operators or 
treating special characters as query parser operators.


But, the fact remains that a quoted phrase is not treated as an "exact"
string literal for any normal tokenized fields.

Out of curiosity, what references have led you to believe that a quoted 
phrase is an "exact match"?


Use a "string" (not "tokenized text") field if you wish to make an "exact 
match" on a literal string, but the concept of "exact match" is not 
supported for tokenized and filtered text fields.


So, please describe, in plain English, plus examples, exactly what you 
expect your analyzer to do, both in terms of how it treats text to be 
indexed and how you expect to be able to query that text.


-- Jack Krupansky

-Original Message-
From: Shay Sofer
Sent: Sunday, August 24, 2014 5:58 AM
To: solr-user@lucene.apache.org
Subject: Exact search with special characters

Hi all,

I have a docs that's indexed by text field with mention schema.

I have those docs names:

-  Test host

-  Test_host

-  Test-host

-  Test $host

When I'm trying to do exact search like: "test host"
All the results from above are shown as a results.

How can I use exact match so I'll will get only one result?

I prefer to do my changes in search time but if I need to change my schema 
please offer that.


Thanks,
Shay.


This is my schema:
   
   
   
   
   
   
   
   
   
   
   
   



Email secured by Check Point 



Re: embedded documents

2014-08-25 Thread Jack Krupansky
That's a completely different concept, I think - the ability to return a 
single field value as a structured JSON object in the "writer", rather than 
simply "loading" from a nested JSON object and distributing the key values 
to normal Solr fields.


-- Jack Krupansky

-Original Message- 
From: Bill Bell

Sent: Sunday, August 24, 2014 7:30 PM
To: solr-user@lucene.apache.org
Subject: Re: embedded documents

See my Jira. It supports it via json.fsuffix=_json&wt=json

http://mail-archives.apache.org/mod_mbox/lucene-dev/201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E

Bill Bell
Sent from mobile


On Aug 24, 2014, at 6:43 AM, "Jack Krupansky"  
wrote:


Indexing and query of raw JSON would be a valuable addition to Solr, so 
maybe you could simply explain more precisely your data model and 
transformation rules. For example, when multi-level nesting occurs, what 
does your loader do?


Maybe if the fielld names were derived by concatenating the full path of 
JSON key names, like titles_json.FR, field_naming nesting could be handled 
in a fully automated manner.


I had been thinking of filing a Jira proposing exactly that, so that even 
the most deeply nested JSON maps could be supported, although combinations 
of arrays and maps would be problematic.


-- Jack Krupansky

-Original Message- From: Michael Pitsounis
Sent: Wednesday, August 20, 2014 7:14 PM
To: solr-user@lucene.apache.org
Subject: embedded documents

Hello everybody,

I had a requirement to store complicated json documents in solr.

i have modified the JsonLoader to accept complicated json documents with
arrays/objects as values.

It stores the object/array and then flatten it and  indexes the fields.

e.g  basic example document

{
  "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN
title"} ,
  "id": 103,
  "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
 }

It will store titles_json:{"FR":"This is the FR title" , "EN":"This is the
EN title"}
and then index fields

titles.FR:"This is the FR title"
titles.EN:"This is the EN title"


Do you see any problems with this approach?



Regards,
Michael Pitsounis 




Re: Help with StopFilterFactory

2014-08-24 Thread Jack Krupansky
Just to confirm, the phrase query is generated using the analyzed 
terms, so if the stop filter is removing the terms, they won't appear in the 
generated query. It will be interesting to see what does get generated.


-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Sunday, August 24, 2014 12:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Help with StopFilterFactory

The problem is in #4:

4. if I index twitter.com/testuser and search for
https://twitter.com/testuser I am getting 0 matches even though "https"
should be filtered out by the StopFilterFactory.


When I said that the stop filter factory "doesn't work" I mentioned that
blacklisted words still somehow affect the search. My guess is that when
autoGeneratePhraseQueries is set to true Solr generates phases before
blacklisted words were removed. That's how it feels looking at search
results (see the first post).

My first post still describes the problem completely, what we can add to it
now is that schema version is 1.5 and autoGeneratePhraseQueries is set to
true.

I remember about the debug output, will be able to add it tomorrow morning.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154822.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Help with StopFilterFactory

2014-08-24 Thread Jack Krupansky
If autoGeneratePhraseQueries="true" (which I endorse) is working, then 
what's the problem?


I mean, the only problem you mention is with 
autoGeneratePhraseQueries="false", which is clearly NOT what you want.


Once again, I have to reiterate that the situation here remains very 
confused, mostly from poor use of language.


It only adds to the confusion when you say things like "doesn't work", 
rather than taking a constructive attitude of telling us the expected 
results vs. the actual results.


And I think I did request that you add the debug=true query parameter and 
post the parsed query so that we can see what was really generated for the 
query.


-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Sunday, August 24, 2014 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Help with StopFilterFactory

I don't see any confusions, the problem is clearly explained in the first
post. The one confusion I had was with the autoGeneratePhraseQueries and my
schema version, I didn't know about that attribute and that its behavior
could differ per schema version. I think we now figured that out and I am
using the most recent 1.5 schema version with
autoGeneratePhraseQueries="true" (so the behavior should be exactly the same
as for schema version 1 that I had before).

With autoGeneratePhraseQueries="false" I get unexpected results, e.g. all
those that match only partially, like only by "twitter" and/or "com".

Following your steps:
1. Schema version is 1.5
2. autoGeneratePhraseQueries is set to true.
3. It seems it does, but that doesn't work as expected and those words still
affect the search.
4. if I index twitter.com/testuser and search for
https://twitter.com/testuser I am getting 0 matches even though "https"
should be filtered out by the StopFilterFactory.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154804.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Help with StopFilterFactory

2014-08-24 Thread Jack Krupansky
I think somehow the discussion has gotten confused, so we really need to 
start over.


1. Make sure you're using the most current schema version.
2. Make sure autoGeneratePhraseQueries is set explicitly the way you want 
it, based on #1 above.
3. Yes, stop filter should remove stop words. No question. If it isn't, let's 
track down and see why and report a bug if necessary.
4. Restate the problem, very clearly, in plain English (after performing 
steps #1 and #2). Please reread your reply carefully before clicking the 
send button and make sure you are using negatives properly - you've confused 
the discussion here by failing to do so on at least one occasion, and 
possibly in this latest response although I can't tell for sure.
5. We'll confirm either any mistakes you've made, recommendations, and 
whether there are any bugs.


Fair enough?

-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Sunday, August 24, 2014 11:02 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with StopFilterFactory

Unfortunately I can't change the operator, and a phrase query for
"https://twitter.com/testuser" doesn't work either.

It does work for "twitter.com/testuser", but that makes no sense since I could then
simply use the old schema version or autoGeneratePhraseQueries=true and ask
users to remove http/www from urls manually. But then I have a reasonable
question: what is the StopFilterFactory supposed to do if users still
have to remove blacklisted keywords? It sounds like a bug to me because the stop
filter factory only prevents words from being added to the index, but they
still affect search.

It should generate phrases after solr.StopFilterFactory (if one is defined
for a field). Or there should be another mechanism to remove blacklisted
words as if there were no such words at all, so they simply disappear.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839p4154795.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: embedded documents

2014-08-24 Thread Jack Krupansky
Indexing and query of raw JSON would be a valuable addition to Solr, so 
maybe you could simply explain more precisely your data model and 
transformation rules. For example, when multi-level nesting occurs, what 
does your loader do?


Maybe if the field names were derived by concatenating the full path of 
JSON key names, like titles_json.FR, field-name nesting could be handled 
in a fully automated manner.


I had been thinking of filing a Jira proposing exactly that, so that even 
the most deeply nested JSON maps could be supported, although combinations 
of arrays and maps would be problematic.


-- Jack Krupansky

-Original Message- 
From: Michael Pitsounis

Sent: Wednesday, August 20, 2014 7:14 PM
To: solr-user@lucene.apache.org
Subject: embedded documents

Hello everybody,

I had a requirement to store complicated json documents in solr.

i have modified the JsonLoader to accept complicated json documents with
arrays/objects as values.

It stores the object/array and then flatten it and  indexes the fields.

e.g  basic example document

 {
   "titles_json":{"FR":"This is the FR title" , "EN":"This is the EN
title"} ,
   "id": 103,
   "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
  }

It will store titles_json:{"FR":"This is the FR title" , "EN":"This is the
EN title"}
and then index fields

titles.FR:"This is the FR title"
titles.EN:"This is the EN title"


Do you see any problems with this approach?



Regards,
Michael Pitsounis 



Re: Exact search with special characters

2014-08-24 Thread Jack Krupansky
What precisely do you mean by the term "exact search". I mean, Solr (and 
Lucene) do not have that concept for tokenized text fields.


Or did you simply mean "quoted phrase". In which case, you need to be aware 
that all the quotes do is assure that the terms occur in that order or in 
close proximity according to the default or specified "phrase slop" 
distance. But each term is still analyzed according to the analyzer for the 
field.


Technically, Lucene will in fact analyze the full quoted phrase as one 
stream, which for non-tokenized fields will be one term, but for any 
tokenized fields which split on white space, the phrase will be broken into 
separate tokens and special characters will tend to be removed as well. The 
keyword tokenizer will indeed treat the entire phrase as a single token, and 
the white space tokenizer will preserve special characters, but the standard 
tokenizer will not preserve either white space or special characters.


Nominally, the keyword tokenizer does generate a single term at least at the 
tokenization stage, but the word delimiter filter then splits individual 
terms into multiple terms, thus guaranteeing that a phrase with white space 
will be multiple terms and special characters are removed as well.


The other technicality is that quoting a phrase does prevent the phrase from 
being interpreted as query parser syntax, such as AND and OR operators or 
treating special characters as query parser operators.


But, the fact remains that a quoted phrase is not treated as an "exact" 
string literal for any normal tokenized fields.


Out of curiosity, what references have led you to believe that a quoted 
phrase is an "exact match"?


Use a "string" (not "tokenized text") field if you wish to make an "exact 
match" on a literal string, but the concept of "exact match" is not 
supported for tokenized and filtered text fields.


So, please describe, in plain English, plus examples, exactly what you 
expect your analyzer to do, both in terms of how it treats text to be 
indexed and how you expect to be able to query that text.


-- Jack Krupansky

-Original Message- 
From: Shay Sofer

Sent: Sunday, August 24, 2014 5:58 AM
To: solr-user@lucene.apache.org
Subject: Exact search with special characters

Hi all,

I have a docs that's indexed by text field with mention schema.

I have those docs names:

-  Test host

-  Test_host

-  Test-host

-  Test $host

When I'm trying to do exact search like: "test host"
All the results from above are shown as a results.

How can I use exact match so I'll will get only one result?

I prefer to do my changes in search time but if I need to change my schema 
please offer that.


Thanks,
Shay.


This is my schema:
   positionIncrementGap="100">

   
   
   splitOnNumerics="0" splitOnCaseChange="0"

   preserveOriginal="1"/>
   
   
   
   
   splitOnNumerics="0" splitOnCaseChange="0"

   preserveOriginal="1"/>
   
   
   




Re: Integrating DictionaryAnnotator and Solr

2014-08-23 Thread Jack Krupansky
Uhhh... UIMA... and parameter checking... NOT. You're probably missing 
something, but there is so much stuff.


I have some examples in my e-book that show various errors you can get for 
missing/incorrect parameters for UIMA:

http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

I never actually connected to a UIMA service, but at least got through the 
parameter stuff.


-- Jack Krupansky

-Original Message- 
From: mkhordad

Sent: Friday, August 22, 2014 9:21 PM
To: solr-user@lucene.apache.org
Subject: Integrating DictionaryAnnotator and Solr

Hi,

I am trying to integrate DictionaryAnnotator of UIMA to Solr 4.9.0 to find
gene names from a dictionary. So I made the following changes.

1. I Modified OverridingParamsExtServicesAE.xml file as follow:

:
...

 

...
:

2. Modified the  sections for adding DictionaryAnnotator
node:
 
   AggregateSentenceAE
   OpenCalaisAnnotator
   TextKeywordExtractionAEDescriptor
   TextLanguageDetectionAEDescriptor
   TextCategorizationAEDescriptor
   TextConceptTaggingAEDescriptor
   TextRankedEntityExtractionAEDescriptor
   DictionaryAnnotator
 


3. Added org/apache/uima/desc/DictionaryAnnotator.xml
DictionaryAnnotator.xml

4. Added my dictionary to org/apache/uima/desc/dictionary.xml

5.- Generated the file apache-solr-uima-4.9.jar

6. Added Gene to schema.xml.

7.Added the following lines in solrconfig.xml:



 

 



 DictionaryAnnotator

 

 

 

 

 

 





  

 uima

 

 

 

 

 DictionaryAnnotator



false



  

 false

 

 text

 

 

 

 

 org.apache.uima.DictionaryEntry

 

 gene

 gene

 

 

 

 

 

 

 

 


But I get the following error message when I am trying to import my
documents:

3639 [qtp1023134153-14] ERROR org.apache.solr.core.SolrCore  –
java.lang.NullPointerException

at
org.apache.solr.uima.processor.SolrUIMAConfigurationReader.readAEOverridingParameters(SolrUIMAConfigurationReader.java:101)

at
org.apache.solr.uima.processor.SolrUIMAConfigurationReader.readSolrUIMAConfiguration(SolrUIMAConfigurationReader.java:42)

at
org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory.getInstance(UIMAUpdateRequestProcessorFactory.java:53)

at
org.apache.solr.update.processor.UpdateRequestProcessorChain.createProcessor(UpdateRequestProcessorChain.java:204)

at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:178)

at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1962)

at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)

at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)

at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)

at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)

at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)

at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)

at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)

at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)

at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)

at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235

Re: Minimum Match with filters that add tokens

2014-08-23 Thread Jack Krupansky
Use a percentage rather than an absolute token number, like 50% or 25% or 
maybe 33%. You can also specify different percentages based on different 
ranges of term counts.


Be aware that although it is tempting to think of MM from the user 
perspective of how many terms are written in the original query, the 
implementation (BooleanQuery) uses the terms generated by the analysis 
process, which can break up source terms into multiple terms and generate 
extra terms as well. Any MM number or percentage will count the terms output 
by analysis, not the source terms.
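
For example (a sketch - the thresholds are arbitrary, using the standard 
dismax/edismax mm syntax):

  mm=75%            three quarters of the clauses must match
  mm=2<75% 9<50%    1-2 clauses: all required; 3-9: 75% required; 10+: 50%

Remember that those counts apply to the clauses left after analysis 
(including the extra phonetic tokens), not to the words the user typed.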


-- Jack Krupansky

-Original Message- 
From: Schmidt, Matthew

Sent: Thursday, August 21, 2014 3:59 PM
To: solr-user@lucene.apache.org
Subject: Minimum Match with filters that add tokens

Is there a good way of handling a minimum match value greater than 1 with 
token filters that add tokens to the stream?


Say you have field with the DoubleMetaphone filter for phonetic matching:

maxCodeLength="6"/>


This would add two tokens to the stream, one for the primary phonetic code, 
one for the secondary.  If I have the min match set to 2 (mm=2) and my query 
only has a single token in it, then I only get results where at least 2 of 
the tokens match.  This means that documents that only match on a phonetic 
token aren't included.


Example:

Field:

 
   
   
   maxCodeLength="6"/>

   
 


Document:
{ id: 1, lastName: "meneghini" } (This generates {meneghini, MNKN} for  the 
index token stream for the lastName field)


Searching (using edismax) with q=meneghini&mm=2 returns document 1, as 
expected, but searching q=menegini&mm=2 does not.  However q=menegini&mm=1 
does.  The reason the first query worked as expected is that after the 
phonetic filter the query token stream has 2 tokens (meneghini, MNKN), and 
both of them match the index tokens, satisfying the mm parameter.  With the 
phonetic misspelling (menegini, {menegini, MNJN, MNKN}), only one of the 
tokens out of the 3 matches, so it is below the mm threshold.  The third 
query only needs one match, which it gets on the phonetic code MNKN.


This seems like counter-intuitive behavior for mm (at least for my use 
case), since I'm only interested in the original query terms being subject 
to the mm limitation, not the expanded token set.  I would imagine this 
would be an issue with synonym expansion and any other filter that might add 
tokens at query time as well.


Possible solutions I've thought of:


-  Just use the regular PhoneticFilterFactory with inject="false" in 
a separate copy field since it will only emit one token per input token.  :(


-  Subclass the DoubleMetaphoneFilterFactory to add a parameter to 
specify if only the primary or secondary token should be emitted.  Then have 
a separate field type and copy field for each and search the original field, 
the primary phonetic token field, and the secondary token field with each 
query.  This only solves for this specific case with the double metaphone 
filter, since it will add at most 2 tokens.  Other filters like 
BeiderMorseFilterFactory or SynonymFilterFactory might add an arbitrary 
number.


-  Change {lots of things} to allow filters to set a flag on a token 
that the query parser can use to determine that it should not count it 
against the minimum match requirement.


-  ?

Any thoughts?

Matt 



Re: Strange Behavior

2014-08-23 Thread Jack Krupansky
It sounds as if you are trying to treat hyphen as a digit so that negative 
numbers are discrete terms. But... that conflicts with the use of hyphen as 
a word separator. Sorry, but WDF does not support both. Pick one or the 
other, you can't have both.
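
To illustrate the choice (assuming I remember the types-file syntax 
correctly, each line maps a character to a token class, and the hyphen can 
only be given one class):

  - => DIGIT    treats "-" like a digit, e.g. so "-5" can stay with the number
  - => ALPHA    treats "-" like a letter, e.g. so "Wi-Fi" is not split on the hyphen

Whichever single mapping you pick applies everywhere the filter sees a hyphen.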


But first, please explain your intended use case clearly - there may be some 
better way to try to achieve it.


Use the analysis page of the Solr Admin UI to see the detailed query and 
index analysis of your terms. You'll be surprised.


-- Jack Krupansky

-Original Message- 
From: EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)

Sent: Thursday, August 21, 2014 2:31 PM
To: solr-user@lucene.apache.org
Subject: Strange Behavior

Hi, I have a field type text_general where, for the query-time word delimiter 
filter, I am using the type mapping below; wddftype.txt contains "- DIGIT".



When I do a query I am not getting the right results. E.g. Name:"Wi-Fi" 
gets results but Name:"Wi-Fi Devices Make" does not get any results,

but if I change it to Name:"Wi-Fi Devices Make"~3 it works.

Can someone explain what is happening in the current situation? FYI I 
have types="wdfftypes.txt" in the query analyzer.



My Fieldtype

positionIncrementGap="100">

 

   
   

   words="stopwords.txt" />


   
   

   generateWordParts="1" generateNumberParts="0" splitOnCaseChange="0"
splitOnNumerics="0" stemEnglishPossessive="0" 
catenateWords="1" catenateNumbers="1"

catenateAll="1" preserveOriginal="1" />

   synonyms="synonyms.txt" ignoreCase="true" expand="true"/>



 

   
   

   words="stopwords.txt" />


   
   

generateWordParts="1" generateNumberParts="0" splitOnCaseChange="0"
splitOnNumerics="0" stemEnglishPossessive="0" 
catenateWords="1" catenateNumbers="1"
catenateAll="1" preserveOriginal="1" 
types="wdfftypes.txt" />
   synonyms="synonyms.txt" ignoreCase="true" expand="true"/>



   





Re: Substring and Case In sensitive Search

2014-08-21 Thread Jack Krupansky
Yes, wildcards can be slow. That's why I suggested that the use cases be 
reviewed more carefully.


But... using the reversed wildcard filter doesn't do any good for 
the substring case where there is a wildcard on both ends.


A prefix wildcard query should actually deliver decent performance, as long 
as the prefix isn't too short (e.g., "cat*"). See PrefixQuery:

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/PrefixQuery.html

ngram filters can also be used, but... that can make the index rather large.

-- Jack Krupansky

-Original Message- 
From: Umesh Prasad

Sent: Wednesday, August 20, 2014 8:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Substring and Case In sensitive Search

The performance of wildcard queries, and especially prefix wildcard queries,
can be quite slow.

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/WildcardQuery.html

Also, you won't be able to time them out.

Take a look at ReversedWildcardFilter

http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html

The blog post describes it nicely ..

http://solr.pl/en/2011/10/10/%E2%80%9Ccar-sale-application%E2%80%9D-%E2%80%93-solr-reversedwildcardfilter-%E2%80%93-lets-optimize-wildcard-queries-part-8/



On 19 August 2014 22:19, Jack Krupansky  wrote:


Substring search a string field using wildcard, "*", at beginning and end
of query term.

Case-insensitive match on string field is not supported.

Instead, copy the string field to a text field, use the keyword tokenizer,
and then apply the lower case filter.

But... review your use case to confirm whether you really need to use
"string" as opposed to "text" field.

-- Jack Krupansky

-Original Message- From: Nishanth S
Sent: Tuesday, August 19, 2014 12:03 PM
To: solr-user@lucene.apache.org
Subject: Substring and Case In sensitive Search


Hi,

I am very new to Solr. How can I allow Solr search on a string field to be
case-insensitive and substring?

Thanks,
Nishanth





--
Thanks & Regards
Umesh Prasad
Search l...@flipkart.com

in.linkedin.com/pub/umesh-prasad/6/5bb/580/ 



Re: Help with StopFilterFactory

2014-08-21 Thread Jack Krupansky
For the sake of completeness, please post the parsed query that you get when 
you add the debug=true parameter. IOW, how Solr/Lucene actually interprets 
the query itself.


-- Jack Krupansky

-Original Message- 
From: Shawn Heisey

Sent: Thursday, August 21, 2014 10:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with StopFilterFactory

On 8/21/2014 7:25 AM, heaven wrote:

Any ideas? Doesn't that seems like a bug?


I think it should have worked even with autoGeneratePhraseQueries
enabled by the older schema version.  The relative positions are the
same  -- it's 1,2,3 in the index and 2,3,4 in the query.  Absolute
positions don't matter, only relative.  I ran into the same behavior on
Solr 4.9.0 ... with a 1.5 schema version and your example, everything
works, but if I enable autoGeneratePhraseQueries, it stops working.

This probably needs to be filed in Jira, but let's wait for someone with
more experience to weigh in before taking that step.

Thanks,
Shawn 



Re: Help with StopFilterFactory

2014-08-19 Thread Jack Krupansky

What release of Solr?

Do you have autoGeneratePhraseQueries="true" on the field?

And when you said "But any of these does", did you mean "But NONE of these 
does"?


-- Jack Krupansky

-Original Message- 
From: heaven

Sent: Tuesday, August 19, 2014 2:34 PM
To: solr-user@lucene.apache.org
Subject: Help with StopFilterFactory

Hi, I have the next text field:


 
   
   
   
 


url_stopwords.txt looks like:
http
https
ftp
www

So very simple. In index I have:
* twitter.com/testuser

All these queries do match:
* twitter.com/testuser
* com/testuser
* testuser

But any of these does:
* https://twitter.com/testuser
* https://www.twitter.com/testuser
* www.twitter.com/testuser

What do I do wrong? Analysis makes me think something is wrong with token
positions:
<http://lucene.472066.n3.nabble.com/file/n4153839/oi7o69.jpg>
but I was thinking StopFilterFactory is supposed to remove
https/http/ftw/www keywords. Why do they figure there at all? That doesn't
make much sense.

Regards,
Alexander



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Performance of Boolean query with hundreds of OR clauses.

2014-08-19 Thread Jack Krupansky
A large number of query terms is definitely an anti-pattern and not a 
recommended use case for Solr, but I'm a little surprised that it takes 
minutes, as opposed to 10 to 20 seconds.


Does your index fit entirely in the OS system memory available for file 
caching?


IOW, are those "few minutes" CPU-bound or I/O-bound?

-- Jack Krupansky

-Original Message- 
From: SolrUser1543

Sent: Tuesday, August 19, 2014 2:57 PM
To: solr-user@lucene.apache.org
Subject: Performance of Boolean query with hundreds of OR clauses.

I am using Solr to search for similar pictures.

For this purpose, every image is indexed as a set of descriptors (a descriptor
is a string of 6 chars).
The number of descriptors per image may vary (from a few to many thousands).

When I want to search for a similar image, I extract the descriptors
from it and create a query like:
MyImage:( desc1 desc2 ...  desc n )

The number of descriptors in the query may also vary. Usually it is about 1000.

Of course the performance of this query is very bad, and it may take a few
minutes to return.

Any ideas for performance improvement?

P.S. I also tried to use LIRE, but it does not fit my use case.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performance-of-Boolean-query-with-hundreds-of-OR-clauses-tp4153844.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Substring and Case In sensitive Search

2014-08-19 Thread Jack Krupansky
Substring search a string field using wildcard, "*", at beginning and end of 
query term.


Case-insensitive match on string field is not supported.

Instead, copy the string field to a text field, use the keyword tokenizer, 
and then apply the lower case filter.
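
A minimal schema.xml sketch of that approach (field names are made up for illustration):

<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the whole value as a single token, just lowercased -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="code" type="string" indexed="true" stored="true"/>
<field name="code_ci" type="string_ci" indexed="true" stored="false"/>
<copyField source="code" dest="code_ci"/>

A case-insensitive substring search is then q=code_ci:*abc* - just remember to
lowercase the wildcard term in the client, since wildcard terms bypass the analyzer.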


But... review your use case to confirm whether you really need to use 
"string" as opposed to "text" field.


-- Jack Krupansky

-Original Message- 
From: Nishanth S

Sent: Tuesday, August 19, 2014 12:03 PM
To: solr-user@lucene.apache.org
Subject: Substring and Case In sensitive Search

Hi,

I am very new to Solr. How can I allow Solr search on a string field to be
case-insensitive and match substrings?

Thanks,
Nishanth 



Re: explaination of query processing in SOLR

2014-08-17 Thread Jack Krupansky
In any case, besides the raw code and the similarity Javadoc, Lucene does 
have Javadoc for "file formats":

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/codecs/lucene49/package-summary.html

-- Jack Krupansky

-Original Message- 
From: Aman Tandon

Sent: Sunday, August 17, 2014 6:25 AM
To: solr-user@lucene.apache.org
Subject: Re: explaination of query processing in SOLR

I think you are confused about the file extensions created for the Lucene
index. Those files do not play a crucial role in search.
I would suggest you first set up Solr and index some files, then try the
various features like faceting, etc.
Then you will also understand the significance of schema.xml and
solrconfig.xml; these files have some great comments that might help.

Then you can look into Solr's default similarity algorithm.
On Aug 8, 2014 5:30 PM, "abhi Abhishek"  wrote:


Hello,
I am fairly new to Solr. Can someone please help me understand how a
query is processed in Solr? What I want to understand is, from the time a query
hits Solr, which files it refers to in order to process it, i.e., the order in
which the .tvx, .tvd and other files are accessed. Basically I would like to
understand the code path of the search functionality, and also the significance
of the various files in the Solr index directory, such as .tvx, .tcd, .frq, etc.


Regards,
Abhishek Das





Re: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Jack Krupansky
You're using the term "cloud" again. Maybe that's the cause of your 
misunderstanding - SolrCloud probably should have been named SolrCluster 
since that's what it really is, a cluster rather than a "cloud". The term 
"cloud" conjures up images of vast, unlimited numbers of nodes, thousands, 
tens of thousands of machines, but SolrCloud is much more modest than that.


Again, start with a model of 100 million documents on a fairly commodity box 
(say, 32GB as opposed to expensive 16-core 256GB machines). So, 1 billion 
docs means 10 servers, times replication - I assume you want to serve a 
healthy query load. So, 5 billion docs needs 50 servers, times replication. 
100 billion docs would require 1,000 servers. 500 billion documents would 
require 5,000 servers, times replication. Not quite Google class, but not a 
typical SolrCloud "cluster" either. You will have to test for yourself 
whether that 100 million number is achievable for your particular hardware 
and data. Maybe you can double it... or maybe only half of that.


And, once again, make sure your index for each node fits in the OS system 
memory available for file caching.


I haven't heard of any specific experiences of SolrCloud beyond dozens of 
nodes, but 64 nodes is probably a reasonable expectation for a SolrCloud 
cluster. How much bigger than that a SolrCloud cluster could grow is 
unknown. Whatever the actual practical limit, based on your own hardware, 
I/O, and network, and your own data schema and data patterns, which you will 
have to test for yourself, you will probably need to use an application 
layer to "shard" your 100s of billions to specific SolrCloud clusters.


-- Jack Krupansky

-Original Message- 
From: Wilburn, Scott

Sent: Thursday, August 14, 2014 11:05 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr cloud performance degradation with billions of documents

Erick,
Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking 
into that now. I agree what I am trying to do is a tall order, and the more 
I hear from all of your comments, the more I am convinced that lack of 
memory is my biggest problem. I'm going to work on increasing the memory 
now, but was wondering if there are any configuration or other techniques 
that could also increase ingest performance? Does anyone know if a cloud of 
this size( hundreds of billions ) with an ingest rate of 5 billion new each 
day, has ever been attempted before?


Thanks,
Scott


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, August 13, 2014 4:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Several points:

1> Have you considered using the MapReduceIndexerTool for your ingestion?
Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread 
your indexing across as many nodes as you have in your cluster. That said, 
it's not entirely clear that you'll gain throughput since you have as many 
nodes as you do.


2> Um, fitting this many documents into 6G of memory is ambitious. Very
ambitious. Actually it's impossible. By my calculations:
bq: 4 separate and individual clouds of 32 shards each so 128 shards in 
aggregate


bq:  inserting into these clouds per day is 5 Billion each in two clouds, 3 
Billion into the third, and 2 Billion into the fourth so we're talking 15B 
docs/day


bq: the plan is to keep up to 60 days...
So were talking 900B documents.

It just won't work. 900B/128 docs/shard is over 7B documents/shard on 
average. Your two larger collections will have more than that, the two 
smaller ones less. But it doesn't matter because:

1: Lucene has a limit of 2B docs per core(shard), positive signed int.
2: It ain't gonna fit in 6G of memory even without this limit I'm pretty 
sure.
3: I've rarely heard of a single shard coping with over 300M docs without 
performance issues. I usually start getting nervous around 100M and insist 
on stress testing. Of course it depends lots on your query profile.


So you're going to need a LOT more shards. You might be able to squeeze some 
more from your hardware by hosting multiple shards on for each collection on 
each machine, but I'm pretty sure your present setup is inadequate for your 
projected load.


Of course I may be misinterpreting what you're saying hugely, but from what 
I understand this system just won't work.


Best,
Erick




On Wed, Aug 13, 2014 at 2:39 PM, Markus Jelsma 
wrote:


Hi - You are running mapred jobs on the same nodes as Solr runs right?
The first thing i would think of is that your OS file buffer cache is 
abused.

The mappers read all data, presumably residing on the same node. The
mapper output and shuffling part would take place on the same node,
only the reducer output is se

Re: Question

2014-08-14 Thread Jack Krupansky
1. Better to target a max of 100 million docs per node, unless you do a POC 
showing that more docs really do work well for you.
2. Sounds like you don't have enough memory, either heap or system memory. 
Increase your heap first. Then more system memory.
3. Document examples of a simple query, facet query, and pivot query, with 
QTime, and debug=true "timing" to show which search components are consuming 
the time.
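
For example, requests along these lines (handler path and field names are only
illustrative, based on the fields listed below):

/select?q=user:"John Smith"&debug=true
/select?q=*:*&facet=true&facet.field=serverIP&debug=true
/select?q=*:*&facet=true&facet.pivot=user,client&debug=true

The "timing" section of the debug output breaks the QTime down per search
component (query, facet, etc.), which is what we would need to see.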


-- Jack Krupansky

-Original Message- 
From: Oded Sofer

Sent: Thursday, August 14, 2014 6:29 AM
To: solr-user@lucene.apache.org
Subject: Question

Hello


We are implementing SolrCloud; we expect around ~200 million documents per
node and 160-200 nodes. I looked at other references; it seems we are
not the first to work with such a volume.

The indexing itself will be done locally (no distribution, each
node-server indexes its own)
The search is distributed. The search includes simple search, facet and
pivot.

The end-user may search specific field or free-text-search.


We are indexing a kind of event log (user, client, serverIP, time, object,
etc. - around 14 fields);
We would like to enable specific field search (e.g., user=John Smith) and
also free text search (e.g., John Smith with no restriction to specific
field).

We've tried to index each field separately and the whole string together
(all fields together) in another field to allow free-text.

With 1 million documents, where a document represents one event (pretty
short), the performance is poor (seconds; we expect ms).

- The field search is fast but when searching the full string field
(free-text-search) it is pretty slow (seconds).
- We've implemented SolrCloud; when we try two machines with 1 million
documents, the pivot search is very, very slow.

In the past we did it with pure Lucene (local only) and it was pretty
cool; 160 million documents were pretty fast for free-text search.


Thanks
Oded




Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Jack Krupansky
Be careful when you say "instance" - that usually refers to a single Solr 
node. Anyway...


32 shards - with a replication factor of 1?

So, given your worst case here, 5 billion documents in a 32-node cluster, 
that's 156 million documents per node. What is the index size on a typical 
node? And how much system memory is available for caching of file reads? 
Generally, you want to have enough system memory to cache the full index. Or 
do you have SSD?


But please clarify what you mean by "about 80-100 Billion documents per 
cloud". Is it really 5 billion total, refreshed every day, or 5 billion 
added per day and lots of days stored?


If you start seeing indexing rate drop off, that could be caused by not 
having enough RAM system memory to cache the full index. In particular, 
Lucene will occasionally be performing index merges, which would otherwise 
be I/O-intensive.


I would start with a rule of thumb of 100 million documents per node (and 
that is million, not billion.) That could be a lot higher - or a lot lower - 
based on your actual schema and data value distribution.


-- Jack Krupansky

-Original Message- 
From: Wilburn, Scott

Sent: Wednesday, August 13, 2014 5:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr cloud performance degradation with billions of documents

Thanks for replying Jack. I have 4 SolrCloud instances( or clusters ), each 
consisting of 32 shards. The clusters do not have any interaction with each 
other.


Thanks,
Scott


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, August 13, 2014 2:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Could you clarify what you mean with the term "cloud", as in "per cloud" and 
"individual clouds"? That's not a proper Solr or SolrCloud concept per se.
SolrCloud works with a single "cluster" of nodes. And there is no 
interaction between separate SolrCloud clusters.


-- Jack Krupansky

-Original Message-
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents

Hello everyone,
I am trying to use SolrCloud to index a very large number of simple 
documents and have run into some performance and scalability limitations and 
was wondering what can be done about it.


Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
Solr shards and each node has 128GB of memory. The current SolrCloud setup 
is split into 4 separate and individual clouds of 32 shards each thereby 
giving four running shards per cloud or one cloud per eight nodes. Each 
shard is currently assigned a 6GB heap size. I’d prefer to avoid increasing 
heap memory for Solr shards to have enough to run other MapReduce jobs on 
the cluster.


The rate of documents that I am currently inserting into these clouds per 
day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion 
into the fourth ; however to account for capacity, the aim is to scale the 
solution to support double that amount of documents. To index these 
documents, there are MapReduce jobs that run that generate the Solr XML 
documents and will then submit these documents via SolrJ's CloudSolrServer 
interface. In testing, I have found that limiting the number of active 
parallel inserts to 80 per cloud gave the best performance as anything 
higher gave diminishing returns, most likely due to the constant shuffling 
of documents internally to SolrCloud. From an index perspective, dated 
collections are being created to hold an entire day's of documents and 
generally the inserting happens primarily on the current day (the previous 
days are only to allow for searching) and the plan is to keep up to 60 days 
(or collections) in each cloud. A single shard index in one collection in 
the busiest cloud currently takes up 30G disk space or 960G for the entire 
collection. The documents are being auto committed with a hard commit time 
of 4 minutes (opensearcher = false) and soft commit time of 8 minutes.


From a search perspective, the use case is fairly generic and simple 
searches of the type :, so there is no need to tune the system to use any of 
the more advanced querying features. Therefore, the most important thing for 
me is to have the indexing performance be able to keep up with the rate of 
input.


In the initial load testing, I was able to achieve a projected indexing rate 
of 10 Billion documents per cloud per day for a grand total of 40 Billion 
per day. However, the initial load testing was done on fairly empty clouds 
with just a few small collections. Now that there have been several days of 
documents being indexed, I am starting to see a fairly steep drop-off in 
indexing performance once the clouds reached about 15 full collections (or

Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Jack Krupansky
Could you clarify what you mean with the term "cloud", as in "per cloud" and 
"individual clouds"? That's not a proper Solr or SolrCloud concept per se. 
SolrCloud works with a single "cluster" of nodes. And there is no 
interaction between separate SolrCloud clusters.


-- Jack Krupansky

-Original Message- 
From: Wilburn, Scott

Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents

Hello everyone,
I am trying to use SolrCloud to index a very large number of simple 
documents and have run into some performance and scalability limitations and 
was wondering what can be done about it.


Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
Solr shards and each node has 128GB of memory. The current SolrCloud setup 
is split into 4 separate and individual clouds of 32 shards each thereby 
giving four running shards per cloud or one cloud per eight nodes. Each 
shard is currently assigned a 6GB heap size. I’d prefer to avoid increasing 
heap memory for Solr shards to have enough to run other MapReduce jobs on 
the cluster.


The rate of documents that I am currently inserting into these clouds per 
day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion 
into the fourth ; however to account for capacity, the aim is to scale the 
solution to support double that amount of documents. To index these 
documents, there are MapReduce jobs that run that generate the Solr XML 
documents and will then submit these documents via SolrJ's CloudSolrServer 
interface. In testing, I have found that limiting the number of active 
parallel inserts to 80 per cloud gave the best performance as anything 
higher gave diminishing returns, most likely due to the constant shuffling 
of documents internally to SolrCloud. From an index perspective, dated 
collections are being created to hold an entire day's of documents and 
generally the inserting happens primarily on the current day (the previous 
days are only to allow for searching) and the plan is to keep up to 60 days 
(or collections) in each cloud. A single shard index in one collection in 
the busiest cloud currently takes up 30G disk space or 960G for the entire 
collection. The documents are being auto committed with a hard commit time 
of 4 minutes (opensearcher = false) and soft commit time of 8 minutes.


From a search perspective, the use case is fairly generic and simple 
searches of the type :, so there is no need to tune the system to use any of 
the more advanced querying features. Therefore, the most important thing for 
me is to have the indexing performance be able to keep up with the rate of 
input.


In the initial load testing, I was able to achieve a projected indexing rate 
of 10 Billion documents per cloud per day for a grand total of 40 Billion 
per day. However, the initial load testing was done on fairly empty clouds 
with just a few small collections. Now that there have been several days of 
documents being indexed, I am starting to see a fairly steep drop-off in 
indexing performance once the clouds reached about 15 full collections (or 
about 80-100 Billion documents per cloud) in the two biggest clouds. Based 
on current application logging I’m seeing a 40% drop off in indexing 
performance. Because of this, I have concerns on how performance will hold 
as more collections are added.


My question to the community is if anyone else has had any experience in 
using Solr at this scale (hundreds of Billions) and if anyone has observed 
such a decline in indexing performance as the number of collections 
increases. My understanding is that each collection is a separate index and 
therefore the inserting rate should remain constant. Aside from that, what 
other tweaks or changes can be done in the SolrCloud configuration to 
increase the rate of indexing performance? Am I hitting a hard limitation of 
what Solr can handle?


Thanks,
Scott



Re: explaination of query processing in SOLR

2014-08-13 Thread Jack Krupansky
Why? The semantics are defined by the code and similarity matching 
algorithm, not... files.


-- Jack Krupansky

-Original Message- 
From: abhi Abhishek

Sent: Wednesday, August 13, 2014 2:40 AM
To: solr-user@lucene.apache.org
Subject: Re: explaination of query processing in SOLR

Thanks Alex and Jack for the direction. Actually, what I was trying to
understand was how the various files have an effect on the search.

Thanks,
Abhishek


On Fri, Aug 8, 2014 at 6:35 PM, Alexandre Rafalovitch 
wrote:


Abhishek,

Your first part of the question is interesting, but your specific
details are probably the wrong level for you to concentrate on. The
issues you will be facing are not about which file does what. That's
more performance and inner details. I feel you should worry more about
the fields, default search fields, multiterms, whitespaces, etc.

One way to do that is to enable debug and see if you actually
understand what those different debug entries do. And don't use string
or basic tokenizer. Pick something that has complex analyzer chain and
see how that affects debug.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Fri, Aug 8, 2014 at 1:59 PM, abhi Abhishek  wrote:
> Hello,
> I am fairly new to Solr. Can someone please help me understand how a
> query is processed in Solr? What I want to understand is, from the time a
> query hits Solr, which files it refers to in order to process it, i.e., the
> order in which the .tvx, .tvd and other files are accessed. Basically I would
> like to understand the code path of the search functionality, and also the
> significance of the various files in the Solr index directory, such as .tvx,
> .tcd, .frq, etc.
>
>
> Regards,
> Abhishek Das





Re: Modifying date format when using TrieDateField.

2014-08-12 Thread Jack Krupansky

Use the parse date update request processor:

http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html
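
For example, a rough solrconfig.xml sketch (the chain name is arbitrary; adjust
the format list to whatever your source actually sends):

<updateRequestProcessorChain name="parse-date" default="true">
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="defaultTimeZone">UTC</str>
    <arr name="format">
      <str>yyyy-MM-dd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Note that the field will still be stored and returned in the canonical form
(1972-07-03T00:00:00Z); with UTC as the default time zone the day itself is no
longer shifted. If you need the bare yyyy-MM-dd form back, reformat it in the
client or keep a separate string copy of the field.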

Additional examples are in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-Original Message- 
From: Modassar Ather

Sent: Tuesday, August 12, 2014 7:24 AM
To: solr-user@lucene.apache.org
Subject: Modifying date format when using TrieDateField.

Hi,

I have a TrieDateField where I want to store a date in "yyyy-MM-dd" format,
as my source contains the date in the same format.
As I understand, TrieDateField stores dates in "yyyy-MM-dd'T'HH:mm:ss" format,
hence the date is getting formatted to the same.

Kindly let me know:
How can I change the date format during indexing when using
TrieDateField?
How can I stop the date modification due to the time zone? E.g. my
1972-07-03 date is getting changed to 1972-07-03T18:30:00Z when using
TrieDateField.

Thanks,
Modassar 



Re: Solr search \ special cases

2014-08-11 Thread Jack Krupansky
The use of a wildcard suppresses analysis of the query term, so the special 
characters remain, but... they were removed when the terms were indexed, so 
no match. You must manually emulate the index term analysis in order to use 
wildcards.
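
Roughly, assuming the index analyzer strips that punctuation (as a
StandardTokenizer or WordDelimiterFilter typically would), and with a
hypothetical field name:

indexed:  "rule #22"  ->  terms: rule, 22            (the "#" never reaches the index)
query:    name:#22    ->  analyzed, "#" stripped     ->  matches the term 22
query:    name:#22*   ->  wildcard term, not analyzed -> looks for terms starting with "#22" -> no match
query:    name:22*    ->  matches

The +33* case probably only "works" because the leading + is taken by the query
parser as the MUST operator, leaving the term 33*, which does match.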


-- Jack Krupansky

-Original Message- 
From: Shay Sofer

Sent: Monday, August 11, 2014 6:34 AM
To: solr-user@lucene.apache.org
Subject: Solr search \ special cases

Hi,

I have some strange cases while searching with Solr.

I have docs with names like: rule #22, rule +33, rule %44.

When searching for #22 or %55 or +33, Solr brings back, as expected: rule #22 and
rule +33 and rule %44.


But when appending a star (*) to each search (#22*, +33*, %55*), only the one
with the + sign brings back rule +33; all the others return nothing.


Can someone explain?

Thanks,
Shay. 



Re: How can I request a big list of values ?

2014-08-10 Thread Jack Krupansky
The issue is not whether or how to do a massive request, but to recognize 
that a single massive request across the network is very clearly an 
anti-pattern for modern distributed systems.


Instead of searching for ways to do something "bad", it is better to figure 
out how to exploit the positive potential of a system, which in this case is 
parallel execution of distributed components.


-- Jack Krupansky

-Original Message- 
From: Bruno Mannina

Sent: Sunday, August 10, 2014 6:01 PM
To: solr-user@lucene.apache.org
Subject: Re: How can I request a big list of values ?

Hi Jack,

OK, but for 2000 values it means that I must do 40 requests if I choose
to have 50 values per request :'(
and in my case, a user can choose about 8 topics, so it can generate 8
times 40 requests... hmm...

is it not possible to send a text, json or xml file?

On 10/08/2014 17:38, Jack Krupansky wrote:
Generally, "large requests" are an anti-pattern in modern distributed 
systems. Better to have a number of smaller requests executing in parallel 
and then merge the results in the application layer.


-- Jack Krupansky

-Original Message- From: Bruno Mannina
Sent: Saturday, August 9, 2014 7:18 PM
To: solr-user@lucene.apache.org
Subject: How can I request a big list of values ?

Hi All,

I'm using actually SOLR 3.6 and I have around 91 000 000 docs inside.

All work fine, it's great :)

But now, I would like to request a list of values in the same field
(more than 2000 values)

I know I can use |?q=x:(AAA BBB CCC ...) (my default operator is OR)

but I have a list of 2000 values ! I think it's not the good idea to use
this method.

Can someone help me to find the good solution ?
Can I use a json structure by using a POST method ?

Thanks a lot,
Bruno
|










Re: How can I request a big list of values ?

2014-08-10 Thread Jack Krupansky

Not safe? In what way?

It might be nice to have a specialized SolrJ API for this particular kind of 
request, so the API can do the merge. Maybe do it as a class so that you 
could have a method that gets invoked as documents trickle back from the 
various requests, again so that it is not a massive, blocking request.


-- Jack Krupansky

-Original Message- 
From: Bruno Mannina

Sent: Sunday, August 10, 2014 6:04 PM
To: solr-user@lucene.apache.org
Subject: Re: How can I request a big list of values ?

Hi Anshum,

I can do it with the 3.6 release, no?

My main problem is that I have around 2000 values, so I can't use one
request with all these values; it's too wide. :'(

I will take a look at generating (as Jack proposes) several requests,
but even in this case it seems not to be safe...

On 10/08/2014 19:45, Anshum Gupta wrote:

Hi Bruno,

If you would have been on a more recent release,
https://issues.apache.org/jira/browse/SOLR-6318 would have come in
handy perhaps.
You might want to look at patching your version with this though (as a
work around).

On Sat, Aug 9, 2014 at 4:18 PM, Bruno Mannina  wrote:

Hi All,

I'm using actually SOLR 3.6 and I have around 91 000 000 docs inside.

All work fine, it's great :)

But now, I would like to request a list of values in the same field (more
than 2000 values)

I know I can use |?q=x:(AAA BBB CCC ...) (my default operator is OR)

but I have a list of 2000 values ! I think it's not the good idea to use
this method.

Can someone help me to find the good solution ?
Can I use a json structure by using a POST method ?

Thanks a lot,
Bruno
|











Re: How can I request a big list of values ?

2014-08-10 Thread Jack Krupansky
Generally, "large requests" are an anti-pattern in modern distributed 
systems. Better to have a number of smaller requests executing in parallel 
and then merge the results in the application layer.


-- Jack Krupansky

-Original Message- 
From: Bruno Mannina

Sent: Saturday, August 9, 2014 7:18 PM
To: solr-user@lucene.apache.org
Subject: How can I request a big list of values ?

Hi All,

I'm using actually SOLR 3.6 and I have around 91 000 000 docs inside.

All work fine, it's great :)

But now, I would like to request a list of values in the same field
(more than 2000 values)

I know I can use |?q=x:(AAA BBB CCC ...) (my default operator is OR)

but I have a list of 2000 values ! I think it's not the good idea to use
this method.

Can someone help me to find the good solution ?
Can I use a json structure by using a POST method ?

Thanks a lot,
Bruno
|





Re: WordDelimiter

2014-08-08 Thread Jack Krupansky
The word delimiter filter is actually combining "100-001" into "100001". You 
have BOTH catenateNumbers AND catenateAll, so "100-R8989" should generate 
THREE tokens: the concatenated numbers "100", the concatenated words "R8989", 
and both numbers and words concatenated, "100R8989".


-- Jack Krupansky

-Original Message- 
From: EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)

Sent: Friday, August 8, 2014 3:27 PM
To: solr-user@lucene.apache.org
Subject: WordDelimiter

Hi, I have a situation where I don't want to split the words. I am using the
WordDelimiterFilter, where it works well.


For e.g., if I send 100-001 to the analyzer, it is not splitting the
keyword, but if I send 100-R8989 then the word delimiter filter splits it to
100 | R8989. Below is the field analyzer and filter. The same thing is used for
query time.


Let me know if I am missing something here.



<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
          generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0"
          stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1"
          catenateAll="1" preserveOriginal="0"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>



Re: Is it OK to have very big number of fields in solr/lucene ?

2014-08-08 Thread Jack Krupansky
Solr scales based on number of documents, not fields or collections. Dozens 
of fields or collections is perfectly fine. Hundreds of fields or 
collections CAN work, but you have to be extra diligent and use more 
powerful hardware. Millions and even billions of DOCUMENTS is fine - that's 
the primary way that Solr scales (that and shards.)


Dynamic fields are fine too, but dozens or hundreds for a single document 
are recommended limits. Different documents can have different dynamic 
fields, so the total field count could be thousands, although, again, you 
may have to be extra diligent and use more powerful hardware.
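
If you do go the many-fields route, dynamic fields keep the schema itself small -
a sketch with made-up names:

<dynamicField name="read_by_*"      type="boolean" indexed="true" stored="false"/>
<dynamicField name="important_to_*" type="boolean" indexed="true" stored="false"/>

A per-user filter is then simply fq=read_by_user123:true, but the field-count
caveats above still apply to how many such fields any one document carries.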


Architect your application and model your data around the strengths of Solr 
(and Lucene.) And also look at your queries first, to make sure they will 
make sense.


-- Jack Krupansky

-Original Message- 
From: Lisheng Zhang

Sent: Friday, August 8, 2014 5:25 PM
To: solr-user@lucene.apache.org
Subject: Is it OK to have very big number of fields in solr/lucene ?

In our application there are many complicated filter conditions; very often
those conditions are specific to each user (like whether or not a doc is
important or already read by a user). There are two possible solutions to
implement those filters in Lucene:

1/ create many fields
2/ create many collections (for each user, for example)

Here the number could be as big as 10G. I would prefer the 1st solution
(many fields), but if Lucene caches all existing field info, memory could be
a problem?

Thanks very much for your help, Lisheng



Re: Help Required

2014-08-08 Thread Jack Krupansky
And the Solr Support list is where people register their available 
consulting services:

http://wiki.apache.org/solr/Support

-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Friday, August 8, 2014 9:12 AM
To: solr-user
Subject: Re: Help Required

We don't mediate jobs offers/positions on this list. We help people to
learn how to make these kinds of things yourself. If you are a
developer, you may find that it would take only several days to get a
strong feel for Solr. Especially, if you start from tutorials/right
books.

To find developers, using the normal job boards would probably be more
efficient. That way you can list location, salary, timelines, etc.

Regards,
  Alex.
P.s. CityPantry does not actually seem to do what you are asking. They
are starting from postcode, though possibly use the geodistance
sorting afterwards.
P.p.s. Yes, Solr can help with distance-based sorting.
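
For what it's worth, the Solr side of the basic "vendors within X km of the
user's postcode" case is fairly small - a sketch with made-up field names and
coordinates:

<!-- schema.xml -->
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<field name="store" type="location" indexed="true" stored="true"/>

<!-- query: vendors within 10 km of the point, nearest first -->
/select?q=*:*&sfield=store&pt=51.5074,-0.1278&fq={!geofilt d=10}&sort=geodist() asc

The "vendor defines the radius they cover" variant (what citypantry seems to
describe) can be layered on top with a function range query against a
per-document radius field, e.g. fq={!frange l=0}sub(delivery_radius_km,geodist()).
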
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Fri, Aug 8, 2014 at 11:36 AM, INGRID MARSH
 wrote:

Dear Sirs,

I wonder if you can help me?

I'm looking for a developer who uses Solr to build for me a faceted search 
facility using location. In a nutshell, I need this functionality, as in 
here:


www.citypantry.com
wwwdinein.

Here the vendor, via Google Maps, enters the area/radius they cover, which 
enables the user to enter their postcode and be presented with the users 
who serve/cover their area. Is this what Solr does?


can you put me in touch with small developers who can help?

Thanks so much.


Ingrid Marsh 




Re: explaination of query processing in SOLR

2014-08-08 Thread Jack Krupansky
That would be more of a question for the Lucene dev list, but... the 
standard answer there would be for you to become familiar with the Lucene 
source code and trace through it yourself.


It's a "Lucene directory", not a "Solr directory" - Solr is a server built 
on top of the Lucene search library.


The starting point would be the IndexSearcher class:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/IndexSearcher.html

And the IndexReader class:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/IndexReader.html

And the DirectoryReader class:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DirectoryReader.html

And its open method:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DirectoryReader.html#open(org.apache.lucene.index.IndexWriter,+boolean)

There is a lot of processing that occurs for queries in Solr as well (Search 
Components), but none of it is down at that Lucene file level.


-- Jack Krupansky

-Original Message- 
From: abhi Abhishek

Sent: Friday, August 8, 2014 7:59 AM
To: solr-user@lucene.apache.org
Subject: explaination of query processing in SOLR

Hello,
   I am fairly new to Solr. Can someone please help me understand how a
query is processed in Solr? What I want to understand is, from the time a query
hits Solr, which files it refers to in order to process it, i.e., the order in
which the .tvx, .tvd and other files are accessed. Basically I would like to
understand the code path of the search functionality, and also the significance
of the various files in the Solr index directory, such as .tvx, .tcd, .frq, etc.


Regards,
Abhishek Das 



Re: how to change field value during index time?

2014-08-06 Thread Jack Krupansky
An update request processor could do the trick. You can use the stateless 
script update processor to code a JavaScript snippet to do whatever logic 
you want. Plenty of examples in my e-book:


http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You can check the list of update processors - there might be one that can be 
used to just mutate a specific input value.


Wait... check out the parse boolean processor - you can specify values that 
you want turned into particular boolean values:

http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/ParseBooleanFieldUpdateProcessorFactory.html

But my book has examples for all these processors, and configuration info as 
well.
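
For the specific f1/f2 case below, a rough sketch of such a chain in
solrconfig.xml (everything here - the chain name, the 'T'/'F' values, f2 being a
boolean field - is an assumption taken from the question):

<updateRequestProcessorChain name="f1-to-f2" default="true">
  <!-- copy f1 into f2, then turn f2's "T"/"F" strings into a real boolean -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">f1</str>
    <str name="dest">f2</str>
  </processor>
  <processor class="solr.ParseBooleanFieldUpdateProcessorFactory">
    <str name="caseSensitive">false</str>
    <str name="trueValue">T</str>
    <str name="falseValue">F</str>
    <arr name="fieldName">
      <str>f2</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain also needs to be referenced by (or made the default for) the /update
handler that the CSV loader posts to.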


-- Jack Krupansky

-Original Message- 
From: abhayd

Sent: Wednesday, August 6, 2014 7:55 PM
To: solr-user@lucene.apache.org
Subject: how to change field value during index time?

hi

I am indexing a CSV file using the CSV handler. I have two fields, f1 and f2.
Based on the value of f1 I want to set the value of f2, e.g. if (f1 == 'T') then
f2 = True;

Is this something I can do during index time? I was reading about JavaScript
transformers, but those only seem to work with DIH.

Any help?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-change-field-value-during-index-time-tp4151568.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: indexing comments with Apache Solr

2014-08-06 Thread Jack Krupansky
Nested documents and block join MAY work, but... I'm not so sure that nutch 
will be able to send the data in the structure that Solr and Lucene would 
expect. You may have to do some sort of custom connector between nutch and 
Solr to do that. I mean, normally the output of nutch is simply a stream of 
flat documents.
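
For reference, the shape Solr 4.x expects for block indexing is roughly the
following (all field names made up), and the children must always be sent
together with their parent:

<add>
  <doc>
    <field name="id">page-1</field>
    <field name="doc_type">page</field>
    <field name="title">Some news article</field>
    <doc>
      <field name="id">page-1-comment-1</field>
      <field name="doc_type">comment</field>
      <field name="author">John Smith</field>
      <field name="comment_date">2014-07-01T12:00:00Z</field>
      <field name="comment_text">Nice article!</field>
    </doc>
  </doc>
</add>

Comments can then be queried directly
(q=doc_type:comment AND comment_date:[2014-06-24T00:00:00Z TO 2014-07-24T00:00:00Z]),
or the matching parent pages retrieved with
q={!parent which="doc_type:page"}doc_type:comment - which is exactly the
structure a flat nutch document stream won't give you without that custom connector.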


-- Jack Krupansky

-Original Message- 
From: Ali Nazemian

Sent: Wednesday, August 6, 2014 9:35 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing comments with Apache Solr

Dear Alexandre,
Hi,
Thank you very much. I think nested documents are what I need. Do you have
more information about how I can define such a thing in the Solr schema? The
blog post you mentioned was all about retrieving nested docs.
Best regards.


On Wed, Aug 6, 2014 at 5:16 PM, Alexandre Rafalovitch 
wrote:


You can index comments as child records. The structure of the Solr
document should be able to incorporate both parents and children
fields and you need to index them all together. Then, just search for
JOIN syntax for nested documents. Also, latest Solr (4.9) has some
extra functionality that allows you to find all parent pages and then
expand children pages to match.

E.g.: http://heliosearch.org/expand-block-join/ seems relevant

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Wed, Aug 6, 2014 at 11:18 AM, Ali Nazemian 
wrote:
> Dear Gora,
> I think you misunderstood my problem. Actually I used nutch for crawling
> websites and my problem is in index side and not crawl side. Suppose 
> page

> is fetch and parsed by Nutch and all comments and the date and source of
> comments are identified by parsing. Now what can I do for indexing these
> comments? What is the document granularity?
> Best regards.
>
>
> On Wed, Aug 6, 2014 at 1:29 PM, Gora Mohanty  wrote:
>
>> On 6 August 2014 14:13, Ali Nazemian  wrote:
>> >
>> > Dear all,
>> > Hi,
>> > I was wondering how can I mange to index comments in solr? suppose I
am
>> > going to index a web page that has a content of news and some 
>> > comments

>> that
>> > are presented by people at the end of this page. How can I index 
>> > these

>> > comments in solr? consider the fact that I am going to do some
analysis
>> on
>> > these comments. For example I want to have such query flexibility for
>> > retrieving all comments that are presented between 24 June 2014 to 24
>> July
>> > 2014! or all the comments that are presented by specific person.
>> Therefore
>> > defining these comment as multi-value field would not be the solution
>> since
>> > in this case such query flexibility is not feasible. So what is you
>> > suggestion about document granularity in this case? Can I consider
all of
>> > these comments as a new document inside main document (tree based
>> > structure). What is your suggestion for this case? I think it is a
common
>> > case of indexing webpages these days so probably I am not the only 
>> > one

>> > thinking about this situation. Please share you though and perhaps
your
>> > experiences in this condition with me. Thank you very much.
>>
>> Parsing a web page, and breaking up parts up for indexing into 
>> different

>> fields
>> is out of the scope of Solr. You might want to look at Apache Nutch
which
>> can index into Solr, and/or other web crawlers/scrapers.
>>
>> Regards,
>> Gora
>>
>
>
>
> --
> A.Nazemian





--
A.Nazemian 


