Re: multivalue faceting term optimization

2020-03-09 Thread Nicolas Paris
https://lucene.apache.org/solr/guide/8_4/the-stats-component.html#local-parameters-with-the-stats-component is about hll and facets, but I am not sure it really meets the use case. I also have to admit that part is quite cryptic to me. -- nicolas paris

Re: multivalue faceting term optimization

2020-03-09 Thread Nicolas Paris
bsetting with extra random fields -- nicolas paris

multivalue faceting term optimization

2020-03-09 Thread Nicolas Paris
faster performances for the brute task, I guess I could artificially limit the FQ to under 2M documents for all queries by taking a sample (I don't really need more than 2M documents to build the word cloud). I am wondering how I could filter the documents to get approximate facets? Thanks! -- nicolas paris

Re: Storage/Volume type for Kubernetes Solr POD?

2020-02-07 Thread Nicolas PARIS
nsmitted with it are confidential and > may be legally privileged, and intended solely for the use of the individual > or entity to whom they are addressed. If you have received this email in > error please notify the sender. This email message has been swept for the > presence of computer viruses. -- nicolas paris

Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-16 Thread Nicolas Paris
> We have implemented the content ingestion and processing pipelines already > in python and SPARK, so most of the data will be pushed in using APIs. I use the spark-solr library in production and have looked at the ES equivalent and the solr connector looks much more advanced for both loading

Re: does copyFields increase indexe size ?

2019-12-28 Thread Nicolas Paris
havior is perfect for my needs. On Fri, Dec 27, 2019 at 05:28:25PM -0700, Shawn Heisey wrote: > On 12/26/2019 1:21 PM, Nicolas Paris wrote: > > Below a part of the managed-schema. There is 1k section* fields. The > > second experience, I removed the copyField, droped the collecti

Re: does copyFields increase indexe size ?

2019-12-26 Thread Nicolas Paris
parate part of the relevant files (.tim, .pos, > etc). Term frequencies are kept on a _per field_ basis for instance. > > So this pretty much has to be small sample size or other measurement error. > > Best, > Erick > > > On Dec 26, 2019, at 9:27 AM, Nicolas Paris wr

Re: does copyFields increase indexe size ?

2019-12-26 Thread Nicolas Paris
Anyway, the good news is that copy fields do not increase index size under some circumstances: - the copied fields and the target field share the same datatype - the target field is not stored. This was tested on text fields. On Wed, Dec 25, 2019 at 11:42:23AM +0100, Nicolas Paris wrote: > > On We

Re: does copyFields increase indexe size ?

2019-12-25 Thread Nicolas Paris
ase, with/without the _text_ field > > > On Dec 25, 2019, at 3:07 AM, Nicolas Paris wrote: > > > >  > >> > >> If you are redoing the indexing after changing the schema and > >> reloading/restarting, then you can ignore me. > > > > I am s

Re: does copyFields increase indexe size ?

2019-12-25 Thread Nicolas Paris
e same ! (while the _text_ field is working correctly) On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote: > On 12/24/2019 5:11 PM, Nicolas Paris wrote: > > Do you mean "copy fields" is only an action of changing the schema ? > > I was thinking it was adding a

Re: does copyFields increase indexe size ?

2019-12-24 Thread Nicolas Paris
On Tue, Dec 24, 2019 at 10:59:03AM -0700, Shawn Heisey wrote: > On 12/24/2019 10:45 AM, Nicolas Paris wrote: > > From my understanding, copy fields creates an new indexes from the > > copied fields. > > From my tests, I copied 1k textual fields into _text_ with copyFields.

does copyFields increase indexe size ?

2019-12-24 Thread Nicolas Paris
Hi, From my understanding, copy fields create a new index from the copied fields. From my tests, I copied 1k textual fields into _text_ with copyFields. As a result there is no increase in the size of the collection. All the source fields are indexed and stored. The _text_ field is indexed

Re: A Last Message to the Solr Users

2019-12-01 Thread Nicolas Paris
Hi Mark, Have you shared with the community all the weaknesses of solrcloud you have in mind and the advice to overcome them? Apparently you wrote most of that code and your feedback would be helpful for the community. Regards On Sat, Nov 30, 2019 at 09:31:34PM -0600, Mark Miller wrote: > I’d

Re: CloudSolrClient - basic auth - multi shard collection

2019-11-20 Thread Nicolas Paris
h of those > bugs are fixed in that version. > > Hope that helps, > > Jason > > > On Mon, Nov 18, 2019 at 8:26 AM Nicolas Paris > wrote: > > > > Hello, > > > > I am having trouble with basic auth on a solrcloud instance. When the > > collecti

CloudSolrClient - basic auth - multi shard collection

2019-11-18 Thread Nicolas Paris
Hello, I am having trouble with basic auth on a solrcloud instance. When the collection has only one shard, there is no problem. When the collection has multiple shards, there is no problem until I send multiple queries concurrently: I get a 401 error asking for credentials on the concurrent queries. I

Re: Solr Ref Guide Changes - now HTML only

2019-10-28 Thread Nicolas Paris
> If you are someone who wishes the PDF would continue, please share your > feedback. I have not particularly explored the documentation format, only the content. However, here are my thoughts on this: the PDF version of the solr documentation has two advantages: 1. readable offline 2. makes searching easier

Re: POS Tagger

2019-10-25 Thread Nicolas Paris
solr/guide/7_3/language-analysis.html#opennlp-part-of-speech-filter On Fri, Oct 25, 2019 at 06:25:36PM +0200, Nicolas Paris wrote: > > Do you use the POS tagger at query time, or just at index time? > > I have the POS tagger pipeline ready but nothing done yet on the solr > part. Rig

Re: POS Tagger

2019-10-25 Thread Nicolas Paris
ide POS tagger (must be fast). > > -- > Audrey Lorberfeld > Data Scientist, w3 Search > IBM > audrey.lorberf...@ibm.com > > > On 10/25/19, 11:57 AM, "Nicolas Paris" wrote: > > Also we are using stanford POS tagger for french. The processing time is &g

Re: POS Tagger

2019-10-25 Thread Nicolas Paris
Also we are using the stanford POS tagger for french. The processing time is mitigated by the spark-corenlp package, which distributes the processing over multiple nodes. Also I am interested in the way you use POS information within solr queries, or solr fields. Thanks, On Fri, Oct 25, 2019 at

Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
what is your current performance? > > Once this is clear further architecture aspects can be derived, such as > number of spark executors, number of Solr instances, sharding, replication, > commit timing etc. > > > Am 19.10.2019 um 21:52 schrieb Nicolas Paris :

Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
mic updates might be faster. The documents are stored within parquet files without any processing needed. In this case, the atomic update is not likely to speed things up. Thanks On Wed, Oct 23, 2019 at 07:49:44AM -0600, Shawn Heisey wrote: > On 10/22/2019 1:12 PM, Nicolas Paris wrote: > > &

Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
t converges to 60 million unique > documents after atomic updates (full indexing). > > > > > Would you say atomical update is faster than regular replacement of > > documents? > > > No, I don't say that. Either of the two configs (autoCommit, Merge Policy) > will impac

Re: Document Update performances Improvement

2019-10-22 Thread Nicolas Paris
dDocs, Merge Policies)? We, at > Auto-Suggest, also do atomic updates daily and specifically changing merge > factor gave us a boost of ~4x during indexing. At current configuration, > our core atomically updates ~423 documents per second. > > On Sun, 20 Oct 2019 at 02:07, Nicolas Paris >

Re: Document Update performances Improvement

2019-10-19 Thread Nicolas Paris
instances, sharding, replication, > commit timing etc. > > > Am 19.10.2019 um 21:52 schrieb Nicolas Paris : > > > > Hi community, > > > > Any advice to speed-up updates ? > > Is there any advice on commit, memory, docvalues, stored or any tips to >

Re: Document Update performances Improvement

2019-10-19 Thread Nicolas Paris
Hi community, Any advice to speed up updates? Is there any advice on commit, memory, docvalues, stored or any tips to make things faster? Thanks On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote: > Hi > > I am looking for a way to faster the update of documents. > >

Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Nicolas Paris
ing score=none as a local param. Turns another algorithm dragging > by from side join. > > On Wed, Oct 16, 2019 at 11:37 AM Nicolas Paris > wrote: > > > Sadly, the join performances are poor. > > The joined collection is 12M documents, and the performances are

Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Nicolas Paris
is 12M or 1 document in size. So the performance of the join looks correlated with the size of the joined collection and not with the kind of filter applied to it. I will explore the streaming expressions. On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote: > > You can certainly replicate the

Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Nicolas Paris
a of that shard must be co-located with every > replica of the “to” collection. > > Have you looked at streaming and “streaming expressions"? It does not have > the same problem, although it does have its own limitations. > > Best, > Erick > > > On Oct 15, 2019,

Solr-Cloud, join and collection collocation

2019-10-15 Thread Nicolas Paris
Hi I have several large collections that cannot fit in a standalone solr instance. They are split over multiple shards in solr-cloud mode. Those collections are supposed to be joined to another collection to retrieve a subset. Because I am using distributed collections, I am not able to use the

Document Update performances Improvement

2019-10-15 Thread Nicolas Paris
Hi I am looking for a way to speed up the update of documents. In my context, the update replaces one of the many existing indexed fields, and keeps the others as is. Right now, I am building the whole document, and replacing the existing one by id. I am wondering if the **atomic update feature**
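For readers unfamiliar with the atomic update feature mentioned above, here is a minimal sketch of what such a request body looks like. It only builds the JSON payload and sends nothing; the document ids and the field name `title_s` are hypothetical. The `{"set": value}` modifier is Solr's documented syntax for replacing a single field, and it assumes the other fields are stored or have docValues so that Solr can rebuild the rest of the document.

```python
import json


def atomic_set(doc_id, field, value):
    """Build an atomic-update document that replaces one field only.

    Only the id and the changed field are sent; the "set" modifier tells
    Solr to replace that field and keep the other fields as they are.
    """
    return {"id": doc_id, field: {"set": value}}


# Hypothetical ids and field name, for illustration only.
updates = [
    atomic_set("doc-1", "title_s", "new title"),
    atomic_set("doc-2", "title_s", "other title"),
]

# JSON body that would be posted to the collection's update handler.
payload = json.dumps(updates)
print(payload)
```

Note that Solr still re-indexes the whole document internally, so any gain comes mostly from not having to rebuild and transfer the full document on the client side.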

Re: Status of solR / HDFS-v3 compatibility

2019-05-02 Thread Nicolas Paris
> Kevin Risden > > > On Thu, May 2, 2019 at 9:32 AM Nicolas Paris > wrote: > > > Hi > > > > solr doc [1] says it's only compatible with hdfs 2.x > > is that true ? > > > > > > [1]: http://lucene.apache.org/solr/guide/7_7/running-solr-on-hdfs.html > > > > -- > > nicolas > > -- nicolas

Status of solR / HDFS-v3 compatibility

2019-05-02 Thread Nicolas Paris
Hi solr doc [1] says it's only compatible with hdfs 2.x is that true ? [1]: http://lucene.apache.org/solr/guide/7_7/running-solr-on-hdfs.html -- nicolas

Re: edismax: sorting on numeric fields

2019-02-17 Thread Nicolas Paris
ith, including myself > > > >> On Feb 16, 2019, at 10:10 AM, Nicolas Paris > >> wrote: > >> > >> Hi > >> > >> Thanks. > >> To clarify, I don't want to sort by numeric fields, instead, I d'like to > >> get sort by

Re: edismax: sorting on numeric fields

2019-02-16 Thread Nicolas Paris
if your query is > > q=kind:animal weight:50 you will get no results, as nothing matches > > (assuming a q.op of AND) > > > > > > On Thu, Feb 14, 2019 at 4:06 PM Nicolas Paris > > wrote: > > > > > Hi > > > > > > I have a numeric f

edismax: sorting on numeric fields

2019-02-14 Thread Nicolas Paris
Hi I have a numeric field (say "weight") and I'd like to be able to get results sorted. q=kind:animal weight:50 pf=kind^2 weight^3 would return: name=dog, kind=animal, weight=51 name=tiger, kind=animal, weight=150 name=elephant, kind=animal, weight=2000 In other words, how to deal with numeric
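One reading of the expected ordering above is "closest weight to 50 first". The sketch below reproduces that ordering in plain Python on the three example documents, then builds a Solr request that expresses the same idea as a function sort. Using `abs(sub(weight,50))` as the sort expression is an assumption about what the poster wants, not something stated in the thread.

```python
from urllib.parse import urlencode

# The three example documents from the message, deliberately shuffled.
docs = [
    {"name": "elephant", "kind": "animal", "weight": 2000},
    {"name": "dog", "kind": "animal", "weight": 51},
    {"name": "tiger", "kind": "animal", "weight": 150},
]


def by_closeness(documents, field, target):
    """Sort documents by the absolute distance of a numeric field to target."""
    return sorted(documents, key=lambda d: abs(d[field] - target))


ranked = [d["name"] for d in by_closeness(docs, "weight", 50)]
print(ranked)

# Assumed Solr equivalent: filter on kind, sort on the distance to 50.
params = urlencode({"q": "kind:animal", "sort": "abs(sub(weight,50)) asc"})
print(params)
```

Sorting on a function keeps the numeric comparison out of the relevance score, which avoids treating `weight:50` as an exact-match clause.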

Re: MoreLikeThis & Synonyms

2018-12-27 Thread Nicolas Paris
On Wed, Dec 26, 2018 at 09:09:02PM -0800, Erick Erickson wrote: > bq. However multiword synonyms are only compatible with queryTime synonyms > expansion. > > Why do you say this? What version of Solr? Query-time mult-word > synonyms were _added_, but AFAIK the capability of multi-word synonyms >

MoreLikeThis & Synonyms

2018-12-26 Thread Nicolas Paris
Hi It turns out that MoreLikeThis handler does not use queryTime synonyms expansion. It is only compatible with indexTime synonyms. However multiword synonyms are only compatible with queryTime synonyms expansion. For this reason this does not allow the use of multiword synonyms within

Re: Search only for single value of Solr multivalue field (part 2)

2018-12-17 Thread Nicolas Paris
On Sun, Dec 16, 2018 at 05:44:30PM -0800, Erick Erickson wrote: > No, the idea is that you have N single valued fields, one for each of > the MV entries you have. The copyField dest would be MV, and only used > in those cases you wanted to match across values. Not saying that's a > great solution,

Re: Search only for single value of Solr multivalue field (part 2)

2018-12-16 Thread Nicolas Paris
On Sun, Dec 16, 2018 at 09:30:33AM -0800, Erick Erickson wrote: > Have you looked at ComplexPhraseQueryParser here? > https://lucene.apache.org/solr/guide/6_6/other-parsers.html Sure. However, I am using multi-word synonyms and so far the complexphrase does not handle them. (maybe soon ?) >

Search only for single value of Solr multivalue field (part 2)

2018-12-16 Thread Nicolas Paris
Hi This question is highly related to a previous one found in the mailing-list archive [1]. I have this document: "content_txt":["001 first","002 second"] I'd like the query below to return nothing: > q=content_txt:(first AND second) The method proposed by Erick ([1]) works ok to look for a

Highlighting Parent Documents

2018-12-09 Thread Nicolas Paris
Hi I have read here [1] and here [2] that it is possible to highlight only parent documents in block join queries. But I haven't succeeded yet. So here is my nested document example: [ { "id": "2", "type_s": "parent", "content_txt": ["apache"],

Pagination Graph/SQL

2018-12-02 Thread Nicolas Paris
Hi SolR pagination is incredible: you can provide the end user a small set of results together with the total number of documents found (numFound). I am wondering if both the "parallel SQL" and "Graph Traversal" features also provide a pagination mechanism. As an example, the SQL below: SELECT id

Synonyms relationships

2018-10-31 Thread Nicolas Paris
Hi Does SolR provide a way to describe synonym relationships such as "equivalent to", "narrower than", "broader than"? It turns out both postgres and oracle do, but I can't find any related information in the documentation. This is useful to allow generalizing the search terms or not.

Re: Is anybody using UIMA with Solr?

2018-06-19 Thread Nicolas Paris
Sorry, I thought I was on the UIMA mailing list. That being said, my position is the same: let UIMA folks load data into SolR in the most optimized way. (What would be the best way? Loading JSON?) 2018-06-19 22:48 GMT+02:00 Nicolas Paris : > Hi > > Not realy a direct answer - N

Re: Is anybody using UIMA with Solr?

2018-06-19 Thread Nicolas Paris
Hi Not really a direct answer. I have never used it; however, this feature was attractive to me when first looking at uima. Right now, I would say UIMA connectors in general are by design a pain to maintain. Source and target often have optimised ways to bulk export/import data. For example,

Re: query bag of word with negation

2018-04-22 Thread Nicolas Paris
Hello Markus Thanks ! The ComplexPhraseQueryParser syntax: q={!complexphrase inOrder=false}collector:"wonderful pizza -peperoni"~5 answers my needs. BTW, apparently it accepts both leading and trailing wildcards, which looks like a powerful feature. Any chance it would support "sow=false" in order to
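The query quoted above can be assembled programmatically. The following is a small sketch that builds the same `q` value and URL-encodes it; the helper name `complexphrase_query` is hypothetical, and only the query syntax itself comes from the message.

```python
from urllib.parse import urlencode, parse_qs


def complexphrase_query(field, include, exclude, slop, in_order=False):
    """Build a complexphrase query string with negated terms and a slop."""
    terms = list(include) + ["-" + term for term in exclude]
    phrase = " ".join(terms)
    flag = "true" if in_order else "false"
    return '{!complexphrase inOrder=%s}%s:"%s"~%d' % (flag, field, phrase, slop)


# Reproduces the query from the message.
q = complexphrase_query("collector", ["wonderful", "pizza"], ["peperoni"], 5)
print(q)

# URL-encoded form, ready to append to a select request.
query_string = urlencode({"q": q})
print(query_string)
```

Encoding through `urlencode` matters here because the braces, quotes and `~` in the local-params syntax are easy to get wrong when the URL is written by hand.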

Re: query bag of word with negation

2018-04-22 Thread Nicolas Paris
1. Query terms containing other than just letters or digits may be placed >> within double quotes so that those other characters do not separate a term >> into many terms. A dot (period) and white space are neither letter nor >> digit. Examples: "Now is the time for all good men" (spaces,

query bag of word with negation

2018-04-22 Thread Nicolas Paris
Hello I wonder if there is a plain text query syntax to say: give me all documents that match: wonderful pizza NOT peperoni, all within a 5-word distance bag. Then: "pizza are wonderful" -> would match; "I made a wonderful pasta and pizza" -> would match; "Peperoni pizza are so wonderful" -> would not
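To make the requested semantics concrete, here is a plain-Python sketch of one possible interpretation, checked against the three example sentences. It is not how Solr evaluates such a query. The assumptions are: whitespace tokenisation, lowercase matching, all wanted words within `distance` positions of each other, and a match rejected when an unwanted word falls inside any window of that size covering the wanted words.

```python
from itertools import product


def bag_match(text, include, exclude, distance):
    """Return True if all include words occur close together with no
    exclude word nearby (assumed semantics, see the paragraph above)."""
    tokens = text.lower().split()
    positions = {
        word: [i for i, tok in enumerate(tokens) if tok == word]
        for word in include
    }
    # Every wanted word must appear at least once.
    if any(not found for found in positions.values()):
        return False
    banned = [i for i, tok in enumerate(tokens) if tok in exclude]
    # Try every combination of one position per wanted word.
    for combo in product(*positions.values()):
        lo, hi = min(combo), max(combo)
        if hi - lo > distance:
            continue
        # Reject if an unwanted word sits in a window covering the match.
        if not any(hi - distance <= b <= lo + distance for b in banned):
            return True
    return False


wanted, unwanted = ["wonderful", "pizza"], ["peperoni"]
for sentence in (
    "pizza are wonderful",
    "I made a wonderful pasta and pizza",
    "Peperoni pizza are so wonderful",
):
    print(sentence, "->", bag_match(sentence, wanted, unwanted, 5))
```

The combination search is exponential in the number of wanted words, which is fine for a two-word illustration but not for real query evaluation.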