Re: storing large text fields in a database? (instead of inside index)

2018-02-21 Thread Roman Chyla
Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 20 Feb 2018, at 20:39, Roman Chyla wrote: > > > > Say there is a high load and I'd like to bring a new machine and let it > > replicate the index, if 100gb and more can be shaved, i

Re: storing large text fields in a database? (instead of inside index)

2018-02-20 Thread Roman Chyla
well at least. > > On Tue, Feb 20, 2018 at 10:27 AM, Roman Chyla > wrote: > > > Hello, > > > > We have a use case of a very large index (slave-master; for unrelated > > reasons the search cannot work in the cloud mode) - one of the fields is > a >

storing large text fields in a database? (instead of inside index)

2018-02-20 Thread Roman Chyla
Hello, We have a use case of a very large index (slave-master; for unrelated reasons the search cannot work in the cloud mode) - one of the fields is a very large text, stored mostly for highlighting. To cut down the index size (for purposes of replication/scaling) I thought I could try to save it

Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
&& !(i < liveDocs.length() && liveDocs.get(i))) { i++; continue; } transformer.process(docBase, i); i++; } } } } On Wed, Aug 17, 2016 at 1:22 PM, Roman Chyla wrote: > Joel, thanks, but which of them? I'v

Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
values are available. --roman On Tue, Aug 16, 2016 at 9:54 PM, Joel Bernstein wrote: > You'll want to use org.apache.lucene.index.DocValues. The DocValues api has > replaced the field cache. > > > > > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On

The most efficient way to get un-inverted view of the index?

2016-08-16 Thread Roman Chyla
I need to read data from the index in order to build a special cache. Previously, in SOLR4, this was accomplished with FieldCache or DocTermOrds. Now, I'm struggling to see what API to use, there are many of them: on the lucene level: UninvertingReader.getNumericDocValues (and others) .getNumericValue
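
A minimal sketch of the per-segment DocValues route suggested in the replies, against the Lucene 5/6-era random-access API (in Lucene 7+ DocValues became iterators). It assumes the field was indexed with docValues enabled; for an indexed-only field the leaf reader would first need to be wrapped with UninvertingReader, as discussed in this thread. Class and method names are illustrative:

    import java.io.IOException;
    import org.apache.lucene.index.*;
    import org.apache.lucene.util.Bits;

    class UninvertedWalk {
      static void walkNumericField(IndexReader top, String field) throws IOException {
        for (LeafReaderContext ctx : top.leaves()) {
          LeafReader leaf = ctx.reader();
          Bits liveDocs = leaf.getLiveDocs();                     // null = no deletions in this segment
          NumericDocValues values = DocValues.getNumeric(leaf, field);
          for (int doc = 0; doc < leaf.maxDoc(); doc++) {
            if (liveDocs != null && !liveDocs.get(doc)) continue; // skip deleted docs
            long value = values.get(doc);
            // feed (ctx.docBase + doc, value) into the custom cache here
          }
        }
      }
    }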

Jetty refuses connections

2016-05-16 Thread Roman Chyla
Hi, I'm hoping someone has seen/encountered a similar problem. We have solr instances with all Jetty threads in BLOCKED state. The application does not respond to any http requests. It is SOLR 4.9 running inside docker on Amazon EC2. Jetty is 8.1 and there is an nginx proxy in front of it (with p

Re: Forking Solr

2015-10-17 Thread Roman Chyla
I've taken the route of extending solr, the repo checks out solr and builds on top of that. The hard part was to figure out how to use solr test classes and the default location for integration tests, but once there, it is relatively easy. Google for montysolr, the repo is on github. Roman On Oct 1

Re: Scramble data

2015-10-08 Thread Roman Chyla
Or you could also apply XSL to returned records: https://wiki.apache.org/solr/XsltResponseWriter On Thu, Oct 8, 2015 at 5:06 PM, Uwe Reh wrote: > Hi, > > my suggestions are probably to simple, because they are not a real > protection of privacy. But maybe one fits to your needs. > > Most simple:
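
For reference, usage of the XsltResponseWriter linked above is roughly as follows (the stylesheet name is illustrative): the writer is registered as a queryResponseWriter in solrconfig.xml, the .xsl files live under conf/xslt/, and a request picks one with the wt and tr parameters:

    /select?q=*:*&wt=xslt&tr=scramble.xsl

where scramble.xsl would be a stylesheet that drops or masks the sensitive fields before the response leaves Solr.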

Re: Reverse query?

2015-10-02 Thread Roman Chyla
I'd like to offer another option: you say you want to match a long query into a document - but maybe you won't know whether to pick "Mad Max" or "Max is" (not to mention the performance hit of a "*mad max*" search - or is that not the case anymore?). Take a look at the NGram tokenizer (say size of 2; or
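
A sketch of the bigram field type being suggested (the name and the exact analysis chain are illustrative, not taken from the thread):

    <fieldType name="text_bigram" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With the same analysis at index and query time, a long query string and a short stored title share plain bigram terms, so the match works without a leading-wildcard search.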

Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Roman Chyla
Hi, inStockSkusBitSet.get(currentChildDocNumber) - is that child a lucene id? If yes, does it include the offset? Every index segment starts at a different point, but docs are numbered from zero. So to check them against the full index bitset, I'd be doing Bitset.exists(indexBase + docid). Just one thin
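
A minimal sketch of the docBase arithmetic being discussed, in the shape of a Solr post filter (Solr 5-era API assumed; the bitset, class and field names are illustrative):

    import java.io.IOException;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.util.FixedBitSet;
    import org.apache.solr.search.DelegatingCollector;
    import org.apache.solr.search.ExtendedQueryBase;
    import org.apache.solr.search.PostFilter;

    public class InStockPostFilter extends ExtendedQueryBase implements PostFilter {
      private final FixedBitSet inStock;       // bits addressed by top-level (global) docid

      public InStockPostFilter(FixedBitSet inStock) {
        this.inStock = inStock;
        setCache(false);                       // post filters must not be cached
        setCost(100);                          // cost >= 100 makes Solr run this as a post filter
      }

      @Override
      public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
          private int docBase;

          @Override
          protected void doSetNextReader(LeafReaderContext context) throws IOException {
            this.docBase = context.docBase;    // each segment starts at a different offset
            super.doSetNextReader(context);
          }

          @Override
          public void collect(int doc) throws IOException {
            if (inStock.get(docBase + doc)) {  // per-segment docid -> global docid
              super.collect(doc);
            }
          }
        };
      }
    }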

Re: Injecting synonymns into Solr

2015-05-04 Thread Roman Chyla
It shouldn't matter. Btw try a url instead of a file path. I think the underlying loading mechanism uses java File , it could work. On May 4, 2015 2:07 AM, "Zheng Lin Edwin Yeo" wrote: > Would like to check, will this method of splitting the synonyms into > multiple files use up a lot of memory?

Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
ose to the solution. > Any thoughts there? > > I appreciate your help on this matter. > > Thank you, > > Kaushik > > > > On Wed, Apr 29, 2015 at 5:48 PM, Roman Chyla > wrote: > > > Hi Kaushik, I meant to compare tween 20 against "tween 20

Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
t; "parsedquery": "name:tweenx20", > "parsedquery_toString": "name:tweenx20", > "explain": {}, > > Thank you, > > Kaushik > > > On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla > wrote: > > > Pls post o

Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
TE 20 [MART.],SORBIMACROGOL LAURATE > 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20 > [WHO-DD],POLYSORBATE 20 [VANDF] > > *Autophrase.txt...* > > Has all the above phrases in one column > > *Indexed document....* > > > 31 > Poly

Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
I'm not sure I understand - the autophrasing filter will allow the parser to see all the tokens, so that they can be parsed (and multi-token synonyms) identified. So if you are using the same analyzer at query and index time, they should be able to see the same stuff. are you using multi-token syn

Re: New UI for SOLR-based projects

2015-01-30 Thread Roman Chyla
hanks, Roman On 30 Jan 2015 21:51, "Shawn Heisey" wrote: > On 1/30/2015 1:07 PM, Roman Chyla wrote: > > There exists a new open-source implementation of a search interface for > > SOLR. It is written in Javascript (using Backbone), currently in version > > v1.0.19 - bu

New UI for SOLR-based projects

2015-01-30 Thread Roman Chyla
Hi everybody, There exists a new open-source implementation of a search interface for SOLR. It is written in Javascript (using Backbone), currently in version v1.0.19 - but new features are constantly coming. Rather than describing it in words, please see it in action for yourself at http://ui.ads

Re: shards per disk

2015-01-20 Thread Roman Chyla
I think this makes sense too (ie. the setup): since the search is getting 1K documents each time (for textual analysis, ie. they are probably large docs), and uses Solr as a storage (which is totally fine), the parallel multiple drive i/o shards speed things up. The index is probably large, so it

Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
, but that was one year ago... On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop wrote: > Thanks Roman... I will check it... Maybe it's off topic but how about > Angular... > On Jan 6, 2015 5:17 PM, "Roman Chyla" wrote: > > > Hi Vishal, Alexandre, > > &

Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
Hi Vishal, Alexandre, Here is another one, using Backbone, just released v1.0.16 https://github.com/adsabs/bumblebee you can see it in action: http://ui.adslabs.org/ While it primarily serves our own needs, I tried to architect it to be extendible (within reasonable limits of code, man power)

Re: Queries not supported by Lucene Query Parser syntax

2015-01-01 Thread Roman Chyla
Hi Leonid, I didn't look into the solr qparser for a long time, but I think you should be able to combine different query parsers in one query. Look at the SolrQueryParser code; maybe now you can specify a custom query parser for every clause (?), st like: foo AND {!lucene}bar. I don't know, but worth e
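
For the record, the documented way to mix parsers inside one query is the _query_ pseudo-field with local-params syntax, e.g. (field names are illustrative):

    q=foo AND _query_:"{!dismax qf=title}bar"

i.e. the outer query goes through the default lucene parser while the nested clause is handed to whatever parser the local params name.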

Re: Anti-Pattern in lucent-join jar?

2014-12-05 Thread Roman Chyla
parser or parser plugin? > > I might not have followed you, this discussing challenges my understanding > of Lucene and SOLR. > > Darin > > > > > On Dec 5, 2014, at 12:47 PM, Roman Chyla wrote: > > > > Hi Mikhail, I think you are right, it won't be pro

Re: Anti-Pattern in lucent-join jar?

2014-12-05 Thread Roman Chyla
> onto segment keys, hence it exclude such leakage across different > searchers. > > On Fri, Dec 5, 2014 at 6:43 AM, Roman Chyla wrote: > > > +1, additionally (as it follows from your observation) the query can get > > out of sync with the index, if eg it was saved for

Re: Anti-Pattern in lucent-join jar?

2014-12-04 Thread Roman Chyla
+1, additionally (as it follows from your observation) the query can get out of sync with the index, if eg it was saved for later use and ran against newly opened searcher Roman On 4 Dec 2014 10:51, "Darin Amos" wrote: > Hello All, > > I have been doing a lot of research in building some custom

Re: What is the usage of solr.NumericPayloadTokenFilterFactory

2014-05-17 Thread Roman Chyla
Hi, What will replace spans, if spans are nuked ? Roman On 17 May 2014 09:15, "Ahmet Arslan" wrote: > Hi, > > > Payloads are used to store arbitrary data along with terms. You can > influence score with these arbitrary data. > See : > http://sujitpal.blogspot.com.tr/2013/07/porting-payloads-to-so

Re: w/10 ? [was: Partial Counts in SOLR]

2014-03-24 Thread Roman Chyla
perhaps useful, here is an open source implementation with near[digit] support, incl analysis of proximity tokens. When days become longer maybe it will be packaged into a nice lib... :-) https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/grammars/ADS.g On 25 Mar 2014 00:14, "Salman

Re: filtering/faceting by a big list of IDs

2014-02-13 Thread Roman Chyla
Hi Tri, Look at this: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3CCAEN8dyX_Am_v4f=5614eu35fnhb5h7dzkmkzdfwvrrm1xpq...@mail.gmail.com%3E Roman On 13 Feb 2014 03:39, "Tri Cao" wrote: > Hi Joel, > > Thanks a lot for the suggestion. > > After thinking more about this, I t

Re: APACHE SOLR: Pass a file as query parameter and then parse each line to form a criteria

2014-02-13 Thread Roman Chyla
Hi Rajeev, You can take this: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3CCAEN8dyX_Am_v4f=5614eu35fnhb5h7dzkmkzdfwvrrm1xpq...@mail.gmail.com%3E I haven't created the jira yet, but I have improved the plugin. Recently, I have seen a use case of passing 90K identifiers /
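
As a stock alternative to the plugin referenced above (not part of it): later Solr releases ship a terms query parser aimed exactly at long ID lists - if memory serves it arrived around Solr 4.10, so treat the version as an assumption:

    fq={!terms f=id}12,35,5677,9001

It builds a constant-score filter from the list, skipping scoring and per-clause query parsing, which is what makes very long lists tolerable.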

Re: Solr4 performance

2014-02-12 Thread Roman Chyla
And perhaps one other, but very pertinent, recommendation is: allocate only as little heap as is necessary. By allocating more, you are working against the OS caching. To know how much is enough is a bit tricky, though. Best, roman On Wed, Feb 12, 2014 at 2:56 PM, Shawn Heisey wrote: > On 2/1

Re: Commit Issue in Solr 3.4

2014-02-08 Thread Roman Chyla
objects with holding to some big object etc/. Btw if i study the graph, i see that there *are* warning signs. That's the point of testing/measuring after all, IMHO. --roman On 8 Feb 2014 13:51, "Shawn Heisey" wrote: > On 2/8/2014 11:02 AM, Roman Chyla wrote: > > I would be c

Re: Commit Issue in Solr 3.4

2014-02-08 Thread Roman Chyla
I would be curious what the cause is. Samarth says that it worked for over a year /and supposedly docs were being added all the time/. Did the index grow considerably in the last period? Perhaps he could attach visualvm while it is in the 'black hole' state to see what is actually going on. I don't

Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Roman Chyla
Isaac, is there an easy way to recognize this problem? We also index synonym tokens in the same position (like you do, and I'm sure that our positions are set correctly). I could test whether the default similarity factory in solrconfig.xml had any effect (before/after reindexing). --roman On Mo

Re: Caches contain deleted docs (?)

2013-11-27 Thread Roman Chyla
nts are write-once. It's been > a long standing design that deleted data will be > reclaimed on segment merge, but not before. It's > pretty expensive to change the terms loaded on the > fly to respect deleted document's removed data. > > Best, > Erick > >

Caches contain deleted docs (?)

2013-11-27 Thread Roman Chyla
Hi, I'd like to check - there is something I don't understand about the cache - and I don't know if it is a bug, or a feature. The following calls return a cache: FieldCache.DEFAULT.getTerms(reader, idField); FieldCache.DEFAULT.getInts(reader, idField, false); the resulting arrays *will* contain entrie
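
A short sketch of the consequence, against the Lucene 4.x FieldCache API used above: the un-inverted arrays are per-segment and still hold values for deleted documents, so the caller has to consult liveDocs (illustrative code, not from the thread):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.util.Bits;

    class CacheWalk {
      static void readInts(AtomicReader reader, String idField) throws IOException {
        Bits liveDocs = reader.getLiveDocs();                     // null = segment has no deletions
        FieldCache.Ints ids = FieldCache.DEFAULT.getInts(reader, idField, false);
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
          if (liveDocs != null && !liveDocs.get(doc)) continue;   // cached, but deleted
          int value = ids.get(doc);
          // only now is 'value' safe to use when building a cache
        }
      }
    }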

Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
roman On Mon, Nov 25, 2013 at 7:54 PM, Roman Chyla wrote: > > > > On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev < > mkhlud...@griddynamics.com> wrote: > >> Roman, >> >> I don't fully understand your question. After segment is flushed it

Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
n't know if they are in the middle of some regeneration or not, and they should not keep a state (of previous index) - as they can be shared by threads that build the cache Best, roman > > > On Sat, Nov 23, 2013 at 9:40 AM, Roman Chyla > wrote: > > > Hi, > > doc

Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
e different > than it was in segment1 or 2. > > I think you're reading too much into LUCENE-2897. I'm pretty sure the > segment in question is not available to you anyway before this rewrite is > done, > but freely admit I don't know much about it. > > Yo

Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
ch seemed to explain that behaviour. > > You're probably going to get into the whole PerSegment family of > operations, > which is something I'm not all that familiar with so I'll leave > explanations > to others. > Thank you, it is useful to get insights from various si

Re: building custom cache - using lucene docids

2013-11-23 Thread Roman Chyla
> > As long as a searcher is open, it's guaranteed that nothing is changing. > Hard commits with openSearcher=false don't open new searchers, which > is why changes aren't visible until a softCommit or a hard commit with > openSearcher=true despite the fact that the segm

building custom cache - using lucene docids

2013-11-22 Thread Roman Chyla
Hi, docids are 'ephemeral', but i'd still like to build a search cache with them (they allow for the fastest joins). i'm seeing docids keep changing with updates (especially, in the last index segment) - as per https://issues.apache.org/jira/browse/LUCENE-2897 That would be fine, because i could

Re: Inconsistent number of hits returned by two solr instances (from the same index!)

2013-11-07 Thread Roman Chyla
> 18 East 41st Street > > New York, NY 10017 > > t: @appinions <https://twitter.com/Appinions> | g+: > plus.google.com/appinions< > https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts > > > w: appinions.com <http://www.appinions.com/>

Re: Inconsistent number of hits returned by two solr instances (from the same index!)

2013-11-06 Thread Roman Chyla
ast 41st Street > > New York, NY 10017 > > t: @appinions <https://twitter.com/Appinions> | g+: > plus.google.com/appinions< > https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts > > > w: appinions.com <http://www.appinions.com/>

Inconsistent number of hits returned by two solr instances (from the same index!)

2013-11-06 Thread Roman Chyla
Hello, We have two solr searchers/instances (read-only). They read the same index, but they did not return the same #hits for a particular query. The log is below, but to summarize: the first server always returns 576 hits, the second server returns: 440, 440, 576, 576... These are just a few seconds apart

Re: Recherche avec et sans espaces

2013-11-04 Thread Roman Chyla
Hi Antoine, I'll permit myself to respond in English, cause my written French is slower ;-) Your problem is well known amongst Solr users: the query parser splits tokens by empty space, so the analyser never sees the input 'la redoutte' but receives 'la' 'redoutte'. You can of course enclose your se

Re: Compound words

2013-10-28 Thread Roman Chyla
Hi Parvesh, I think you should check the following jira https://issues.apache.org/jira/browse/SOLR-5379. You will find there links to other possible solutions/problems:-) Roman On 28 Oct 2013 09:06, "Erick Erickson" wrote: > Consider setting expand=true at index time. That > puts all the tokens i

Re: Complex Queries in solr

2013-10-20 Thread Roman Chyla
i just tested whether our 'beautiful' parser supports it, and funnily enough, it does :-) https://github.com/romanchyla/montysolr/commit/f88577345c6d3a2dbefc0161f6bb07a549bc6b15 but i've (kinda) given up hope that people need powerful query parsers in the lucene world, the LUCENE-5014 is there s

Re: Solr's Filtering approaches

2013-10-12 Thread Roman Chyla
David, We have a similar query in astrophysics: a user can select an area of the sky - many stars out there. I am long overdue in creating a Jira issue, but here you have another efficient mechanism for searching a large number of ids https://github.com/romanchyla/montysolr/blob/master/contrib

Web App Engineer at Harvard-Smithsonian Astrophysical Observatory, full time, indefinite contract

2013-10-07 Thread Roman Chyla
sting online at: http://www.cfa.harvard.edu/hr/postings/13-32.html Thank you, Roman -- Dr. Roman Chyla ADS, Harvard-Smithsonian Center for Astrophysics roman.ch...@gmail.com

Re: Dynamic Query Analyzer

2013-09-03 Thread Roman Chyla
You don't need to index fields several times, you can index it just into one field, and use the different query analyzers just to build the query. We're doing this for authors, for example - if the query language says "=author:einstein", the query parser knows this field should be analyzed differently

Re: Measuring SOLR performance

2013-09-03 Thread Roman Chyla
niversalRunner.buildUpdatedClassPath(UniversalRunner.java:109) > at kg.apc.cmd.UniversalRunner.<init>(UniversalRunner.java:55) > > at > kg.apc.cmd.UniversalRunner.buildUpdatedClassPath(UniversalRunner.java:109) > at kg.apc.cmd.UniversalRunner.<init>(UniversalRunner.java:55) > > >

Re: Measuring SOLR performance

2013-09-02 Thread Roman Chyla
solr/statements/admin/system > > > > > > > > But I can access http://localhost:8983/solr/admin/cores, only when > > with > > > > adminPath="/admin/cores" (which suggests that this is the right value > > to > > > be > > > > used for cores), a

Re: Measuring SOLR performance

2013-08-22 Thread Roman Chyla
son > > Regards, > > Dmitry > > > > On Wed, Aug 14, 2013 at 2:03 PM, Dmitry Kan wrote: > > > Hi Roman, > > > > This looks much better, thanks! The ordinary non-comarison mode works. > > I'll post here, if there are other findings. > > > &

Re: Measuring SOLR performance

2013-08-13 Thread Roman Chyla
ges/simplejson/encoder.py", line 202, > in default > raise TypeError(repr(o) + " is not JSON serializable") > TypeError: <__main__.ForgivingValue object at 0x7fc6d4040fd0> is not JSON > serializable > > > Regards, > > D. > > > On Tue, Aug 13, 2013 at

Re: Measuring SOLR performance

2013-08-12 Thread Roman Chyla
d' > Thanks for letting me know, that info is probably not available in this situation - i've cooked st quick to fix it, please try the latest commit (hope it doesn't do more harm, i should get some sleep ..;)) roman > > In case it matters: Python 2.7.3, ubuntu, so

Re: Percolate feature?

2013-08-09 Thread Roman Chyla
On Fri, Aug 9, 2013 at 2:56 PM, Chris Hostetter wrote: > > : I'll look into this. Thanks for the concrete example as I don't even > : know which classes to start to look at to implement such a feature. > > Either roman isn't understanding what you are aksing for, or i'm not -- > but i don't think

Re: Percolate feature?

2013-08-09 Thread Roman Chyla
On Fri, Aug 9, 2013 at 11:29 AM, Mark wrote: > > *All* of the terms in the field must be matched by the querynot > vice-versa. > > Exactly. This is why I was trying to explain it as a reverse search. > > I just realized I describe it as a *large list of known keywords when > really its small;

Re: Measuring SOLR performance

2013-08-07 Thread Roman Chyla
ly on the shard server in > > background mode. > > > > my test run was: > > > > python solrjmeter.py -a -x ./jmx/SolrQueryTest.jmx -q > > ./queries/demo/demo.queries -s localhost -p 8983 -a --durationInSecs 60 > -R > > foo -t /solr/statements -e statements

Re: Measuring SOLR performance

2013-08-06 Thread Roman Chyla
34 PM, Shawn Heisey wrote: > > > On 8/6/2013 6:17 AM, Dmitry Kan wrote: > > > Of three URLs you asked for, only the 3rd one gave response: > > > > > The rest report 404. > > > > > > On Mon, Aug 5, 2013 at 8:38 PM, Roman Chyla > > wrote: > &

Re: Measuring SOLR performance

2013-08-05 Thread Roman Chyla
o JSON object could be decoded: line 1 > column 0 (char 0) > > > The README.md on the github is somehow outdated, it suggests using -q > ./demo/queries/demo.queries, but there is no such path in the fresh > checkout. > > Nice to have the -t param. > > Dmitry > > >

Re: Measuring SOLR performance

2013-08-02 Thread Roman Chyla
x27; % > (options.serverName, options.serverPort) > > jmx_options = [] > for k, v in options.__dict__.items(): > > > > Dmitry > > > On Thu, Aug 1, 2013 at 6:41 PM, Roman Chyla wrote: > > > Dmitry, > > Can you post the entire invocation line?

Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
On Thu, Aug 1, 2013 at 6:11 PM, Shawn Heisey wrote: > On 8/1/2013 2:08 PM, Roman Chyla wrote: > >> Hi, here is a short post describing the results of the yesterday run with >> added parameters as per Shawn's recommendation, have fun getting confused >> ;) >> &

Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
Hi, here is a short post describing the results of yesterday's run with the added parameters as per Shawn's recommendation, have fun getting confused ;) http://29min.wordpress.com/2013/08/01/measuring-solr-performance-ii/ roman On Wed, Jul 31, 2013 at 12:32 PM, Roman Chyla wrote: > I

Re: How to uncache a query to debug?

2013-08-01 Thread Roman Chyla
When you set your cache (solrconfig.xml) to size=0, you are not using a cache, so you can debug more easily. roman On Thu, Aug 1, 2013 at 1:12 PM, jimtronic wrote: > I have a query that runs slow occasionally. I'm having trouble debugging it > because once it's cached, it runs fast -- under 10
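
Concretely (a hedged sketch; element and field names are from a stock setup): in solrconfig.xml set size="0" and autowarmCount="0" on the filterCache, queryResultCache and documentCache entries; for a single filter query the cache can also be bypassed per request with a local param:

    fq={!cache=false}inStock:true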

Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
storting your measurements. > > > Bernd > > > Am 31.07.2013 05:01, schrieb Shawn Heisey: > > On 7/30/2013 6:59 PM, Roman Chyla wrote: > >> I have been wanting some tools for measuring performance of SOLR, > similar > >> to Mike McCandles' lucene benchmark.

Re: Measuring SOLR performance

2013-08-01 Thread Roman Chyla
ib/python2.7/contextlib.py", line 17, in __enter__ > return self.gen.next() > File "solrjmeter.py", line 229, in changed_dir > os.chdir(new) > OSError: [Errno 20] Not a directory: > '/home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries' >

Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
y be random. So, yes, now I am sure what to > > think of default G1 as 'bad', and that these G1 parameters, even if they > > don't seem G1 specific, have real effect. > > Thanks, > > > > roman > > > > > > On Tue, Jul 30, 2013 at 1

Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
o think of default G1 as 'bad', and that these G1 parameters, even if they don't seem G1 specific, have real effect. Thanks, roman On Tue, Jul 30, 2013 at 11:01 PM, Shawn Heisey wrote: > On 7/30/2013 6:59 PM, Roman Chyla wrote: > > I have been wanting some tools for meas

Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
es(options) >> File "solrjmeter.py", line 351, in check_prerequisities >> error('Cannot contact: %s' % options.query_endpoint) >> File "solrjmeter.py", line 66, in error >> traceback.print_stack() >> Cannot contact: http://localhost:8983

Measuring SOLR performance

2013-07-30 Thread Roman Chyla
Hello, I have been wanting some tools for measuring performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/ I tested it on the problem of garbage collectors (see

Re: Solr-4663 - Alternatives to use same data dir in different cores for optimal cache performance

2013-07-28 Thread Roman Chyla
Hi, Yes, it can be done. If you search the mailing list for 'two solr instances same datadir', you will find a post where i am describing our setup - it works well even with automated deployments. How do you measure performance? I am asking because one reason for us having the same setup is sharing the O

Re: processing documents in solr

2013-07-27 Thread Roman Chyla
On Sat, Jul 27, 2013 at 4:17 PM, Shawn Heisey wrote: > On 7/27/2013 11:38 AM, Joe Zhang wrote: > > I have a constantly growing index, so not updating the index can't be > > practical... > > > > Going back to the beginning of this thread: when we use the vanilla > > "*:*"+pagination approach, woul

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
.com/m-khl/solr-patches/compare/streaming#L15R115 > > all other code purposed for distributed search. > > > > On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla > wrote: > > > Mikhail, > > If your solution gives lazy loading of solr docs /and thus streaming of > > hu

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Mikhail, If your solution gives lazy loading of solr docs /and thus streaming of huge result lists/ it should be big YES! Roman On 27 Jul 2013 07:55, "Mikhail Khludnev" wrote: > Otis, > You gave links to 'deep paging' when I asked about response streaming. > Let me understand. From my POV, deep p

Re: processing documents in solr

2013-07-27 Thread Roman Chyla
Dear list, I've written a special processor exactly for this kind of operation https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch This is how we use it http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch It is capable of

Re: Using Solr to search between two Strings without using index

2013-07-25 Thread Roman Chyla
Hi, I think you are pushing it too far - there is no 'string search' without an index. And besides, these things are just better done by a few lines of code - and if your array is too big, then you should create the index... roman On Thu, Jul 25, 2013 at 9:06 AM, Rohit Kumar wrote: > Hi, > >

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-24 Thread Roman Chyla
This paper contains an excellent algorithm for plagiarism detection, but beware the published version had a mistake in the algorithm - look for corrections - I can't find them now, but I know they have been published (perhaps by one of the co-authors). You could do it with solr, to create an index

Re: How to debug an OutOfMemoryError?

2013-07-24 Thread Roman Chyla
_One_ idea would be to configure your java to dump core on the OOM error - you can then load the dump into some analyzers, eg. Eclipse, and that may give you the desired answers (I unfortunately don't remember off the top of my head how to activate the dump, but google will give you the answer) r
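
For completeness, the HotSpot flags in question are:

    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps

the resulting .hprof file can then be opened in Eclipse MAT, VisualVM or a similar analyzer.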

Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
performances acceptable (~ within minutes) ? > > Thanks, > Matt > > On 7/23/13 6:57 PM, "Roman Chyla" wrote: > > >Hello Matt, > > > >You can consider writing a batch processing handler, which receives a > >query > >and instead of sending res

Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
you disclosure how that streaming writer works? What does it stream > docList or docSet? > > Thanks > > > On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla > wrote: > > > Hello Matt, > > > > You can consider writing a batch processing handler, which recei

Re: Processing a lot of results in Solr

2013-07-23 Thread Roman Chyla
Hello Matt, You can consider writing a batch processing handler, which receives a query and instead of sending results back, it writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in few minutes - your query + streaming write
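
A stripped-down sketch of that idea - not the actual montysolr handler - assuming a Solr 5/6-era API; the class name, the output location and the response shape are made up for illustration:

    import java.io.File;
    import java.io.PrintWriter;
    import java.util.UUID;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.SimpleCollector;
    import org.apache.solr.common.params.CommonParams;
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.search.QParser;

    public class BatchDumpHandler extends RequestHandlerBase {
      @Override
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        Query query = QParser.getParser(req.getParams().get(CommonParams.Q), null, req).getQuery();
        String jobId = UUID.randomUUID().toString();
        File out = new File(System.getProperty("java.io.tmpdir"), jobId + ".docids");
        try (PrintWriter w = new PrintWriter(out, "UTF-8")) {
          req.getSearcher().search(query, new SimpleCollector() {
            private int docBase;
            @Override protected void doSetNextReader(LeafReaderContext ctx) { docBase = ctx.docBase; }
            @Override public boolean needsScores() { return false; }
            // stream matching (global) docids; a real handler would fetch stored fields instead
            @Override public void collect(int doc) { w.println(docBase + doc); }
          });
        }
        rsp.add("jobid", jobId);   // the client later downloads/streams the file using this handle
      }

      @Override
      public String getDescription() { return "dump matching docids to a file, return a job id"; }

      // abstract in some Solr versions, harmless extra method otherwise
      public String getSource() { return null; }
    }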

Re: Performance of cross join vs block join

2013-07-22 Thread Roman Chyla
the query, so in that sense, it is not different from pre-computing the citation cache - but it happens for every query/request, and so for 0.5M of edges it must take some time. But I guess I should measure it. I haven't made notes so now I am having hard time backtracking :) roman > It

Re: short-circuit OR operator in lucene/solr

2013-07-22 Thread Roman Chyla
Deepak, I think your goal is to gain something in speed, but most likely the function query will be slower than the query without score computation (the filter query) - this stems from how the query is executed, but I may, of course, be wrong. Would you mind sharing the measurements you make?

Re: Getting a large number of documents by id

2013-07-18 Thread Roman Chyla
Look at the speed of reading the data - likely, it takes a long time to assemble a big response, especially if there are many long fields - you may want to try SSD disks, if you have that option. Also, to gain better understanding: Start your solr, start jvisualvm and attach to your running solr. Start

Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-17 Thread Roman Chyla
> field in a Solr doc with the value 6 in it. I can then > > form a query like > > {!bitwise field=myfield op=AND source=2} > > and it would match. > > > > You're talking about a much different operation as I > > understand it. > > > > In which ca

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
Hi Dave, On Wed, Jul 17, 2013 at 2:03 PM, dmarini wrote: > Roman, > > As a developer, I understand where you are coming from. My issue is that I > specialize in .NET, haven't done java dev in over 10 years. As an > organization we're new to solr (coming from endeca) and we're looking to > use

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
rch for query-time phrase > synonyms, off-the-shelf, today, no patches required.) > > > -- Jack Krupansky > > -Original Message- From: Roman Chyla > Sent: Wednesday, July 17, 2013 11:44 AM > > To: solr-user@lucene.apache.org > Subject: Re: Searching w/expli

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
implementation, but again, this is all a > longer-term future, not a "here and now". Maybe in the 5.0 timeframe? > > I don't want anyone to get the impression that there are off-the-shelf > patches that completely solve the synonym phrase problem. Yes, progress is > be

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
Hi all, What I find very 'sad' is that Lucene/SOLR contain all the necessary components for handling multi-token synonyms; the Finite State Automaton works perfectly for matching these items; the biggest problem is IMO the old query parser which splits things on spaces and doesn't know to be smarte

Re: Range query on a substring.

2013-07-16 Thread Roman Chyla
on of different fields is allowed). I'd like to spend > some time on ANTLR and the new way of parsing you mentioned. I will let you > know if it was useful for me. Thanks. > > Kind regards. > > > On 16 July 2013 20:07, Roman Chyla wrote: > > > Well, I think this is

Re: Range query on a substring.

2013-07-16 Thread Roman Chyla
Well, I think this is slightly too categorical - a range query on a substring can be thought of as a simple range query. So, for example, the following query: "lucene 1*" becomes behind the scenes: "lucene (10|11|12|13|14|1abcd)". The issue there is that it is a string range, but it is a range que
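
A tiny Lucene illustration of the equivalence described above (the field name is illustrative): over plain string terms, a prefix query on "1" selects the same terms as a string range from "1" inclusive to "2" exclusive:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermRangeQuery;

    class PrefixVsRange {
      static Query prefixForm() { return new PrefixQuery(new Term("title", "1")); }                       // title:1*
      static Query rangeForm()  { return TermRangeQuery.newStringRange("title", "1", "2", true, false); } // title:[1 TO 2}
    }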

Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-16 Thread Roman Chyla
JIRA? Somehow I missed it if it did, and this > > would > > be pretty cool > > > > Erick > > > > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla > > wrote: > > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca > > wrote: > > > > > >> He

Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-15 Thread Roman Chyla
On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca wrote: > Hello Erick, > > > Join performance is most sensitive to the number of values > > in the field being joined on. So if you have lots and lots of > > distinct values in the corpus, join performance will be affected. > Yep, we have a list of uni

Re: Performance of cross join vs block join

2013-07-12 Thread Roman Chyla
Hi Mikhail, I have commented on your blog, but it seems I have done st wrong, as the comment is not there. Would it be possible to share the test setup (script)? I have found out that the crucial thing with joins is the number of 'joins' [hits returned] and it seems that the experiments I have see

Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Roman Chyla
On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle wrote: > Hello, > > I have asked a question recently about solr limitations and some about > joins. It comes that this question is about both at the same time. > I am trying to figure how to denormalize my data so I will need just 1

Re: Best way to call asynchronously - Custom data import handler

2013-07-09 Thread Roman Chyla
Other than using futures and callables? Runnables ;-) Other than that you will need an async request (ie. client). But in case sb else is looking for an easy recipe for the server-side async: public void handleRequestBody(...) { if (isBusy()) { rsp.add("message", "Batch processing is already r
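
To round out the snippet above, a minimal version of that server-side pattern (a hedged sketch: an AtomicBoolean stands in for the isBusy() check, the executor and helper names are illustrative, and Java 8 lambda syntax is used for brevity). Anything the background job needs must be copied out of the request before returning, since Solr closes the SolrQueryRequest once the handler method exits:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    class AsyncBatchSupport {
      private final ExecutorService pool = Executors.newSingleThreadExecutor();
      private final AtomicBoolean busy = new AtomicBoolean(false);

      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
        if (!busy.compareAndSet(false, true)) {
          rsp.add("message", "Batch processing is already running");
          return;                                  // answer immediately, never block the request thread
        }
        final String q = req.getParams().get("q"); // copy params out; req is closed after we return
        pool.submit(() -> {
          try {
            runBatch(q);                           // the long-running work (hypothetical helper)
          } finally {
            busy.set(false);
          }
        });
        rsp.add("message", "Batch job started");
      }

      private void runBatch(String q) { /* ... */ }
    }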

Re: Solr large boolean filter

2013-07-08 Thread Roman Chyla
"server-side named filters". It > matches the feature described at > http://www.elasticsearch.org/blog/terms-filter-lookup/ > > Would be a cool addition, IMHO. > > Otis > -- > Solr & ElasticSearch Support -- http://sematext.com/ > Performance Monitoring -- http

Re: solr way to exclude terms

2013-07-08 Thread Roman Chyla
One of the approaches is to create, at index time, a new field based on the stopwords (ie. accept only stopwords :)) - ie. if the document contains them, you index 1 - and use a q=apple&fq=bad_apple:0 This has many limitations (in terms of flexibility), but it will be superfast roman On Mon, Jul 8, 2013 a

Re: joins in solr cloud - good or bad idea?

2013-07-08 Thread Roman Chyla
Hello, The joins are not the only idea, you may want to write your own function (ValueSource) that can implement your logic. However, I think you should not throw away the regex idea (as being slow), before trying it out - because it can be faster than the joins. Your problem is that the number of

Re: What are the options for obtaining IDF at interactive speeds?

2013-07-08 Thread Roman Chyla
> would never have occurred to me. Thank you too! > > Best, > Katie > > > On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla > wrote: > > > Hi Kathryn, > > I wonder if you could index all your terms as separate documents and then > > construct a new query
