customize posting size (or block size) ...
Hi, our search patterns mostly use SpanNearQuery combined with PrefixQuery, and a single search query often includes many PrefixQuery clauses. We place no constraints on PrefixQuery usage, so JVM heap memory usage is often high, and while this happens other queries hang as well. I'd like to know of any solution for lowering memory usage, even if search time becomes somewhat longer. My guess is that the posting size (or block size) read at a time could be reduced when scoring a SpanNearQuery. If that is possible, would it reduce the memory used to process SpanNearQuery, and how do I customize it? Examples would be really helpful. * I'm using Solr 4.2.1. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/customize-posting-size-or-block-size-tp4305878.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr shards: very sensitive to swap space usage!?
Thanks everyone! The discussion is really helpful. Hi Toke, can you explain exactly what you mean by "the aggressive IO for the memory mapping caused the kernel to start swapping parts of the JVM heap to get better caching of storage data"? Which JVM are you talking about? Solr shard? I have other services running on the same host as well. Thanks! On Fri, Nov 11, 2016 at 7:32 AM, Shawn Heisey wrote: > On 11/11/2016 6:46 AM, Toke Eskildsen wrote: > > but on two occasions I have > > experienced heavy swapping with multiple gigabytes free for disk > > cache. In both cases, the cache-to-index size was fairly low (let's > > say < 10%). My guess (I don't know the intrinsics of memory mapping > > vs. swapping) is that the aggressive IO for the memory mapping caused > > the kernel to start swapping parts of the JVM heap to get better > > caching of storage data. Yes, with terrible performance as a result. > > That's really weird, and sounds like a broken operating system. I've > had other issues with swap, but in those cases, free memory was actually > near zero, and it sounds like your situation was not the same. So the > OP here might be having similar problems even if nothing's > misconfigured. If so, your solution will probably help them. > > > No matter the cause, the swapping problems were "solved" by > > effectively disabling the swap (swappiness 0). > > Solr certainly doesn't need (or even want) swap, if the machine is sized > right. I've read some things saying that Linux doesn't behave correctly > if you completely get rid of all swap, but setting swappiness to zero > sounds like a good option. The OS would still utilize swap if it > actually ran out of physical memory, so you don't lose the safety valve > that swap normally provides. > > Thanks, > Shawn > >
Re: Parallelize Cursor approach
I got it when you said form N queries. Just wanted to try the "get all cursorMark first" approach but just realized it would be very inefficient as you said since cursor mark is serialized version of the last sorted value you received and hence still you are reading the results from solr although your "fl" -> null. Just wanted to try this approach as I need everything sorted. In submitting N queries, I will have to merge sort the results of N queries. But that should be way better than the first approach I tried. Thanks! On Mon, Nov 14, 2016 at 3:58 PM, Erick Erickson wrote: > You're executing all the queries to parallelize before even starting. > Seems very inefficient. My suggestion doesn't require this first step. > Perhaps it was confusing because I mentioned "your own cursorMark". > Really I meant bypass that entirely, just form N queries that were > restricted to N disjoint subsets of the data and process them all in > parallel, either with /export or /select. > > Best, > Erick > > On Mon, Nov 14, 2016 at 3:53 PM, Chetas Joshi > wrote: > > Thanks Joel for the explanation. > > > > Hi Erick, > > > > One of the ways I am trying to parallelize the cursor approach is by > > iterating the result set twice. > > (1) Once just to get all the cursor marks > > > > val q: SolrQuery = new solrj.SolrQuery() > > q.set("q", query) > > q.add("fq", query) > > q.add("rows", batchSize.toString) > > q.add("collection", collection) > > q.add("fl", "null") > > q.add("sort", "id asc") > > > > Here I am not asking for any field values ( "fl" -> null ) > > > > (2) Once I get all the cursor marks, I can start parallel threads to get > > the results in parallel. > > > > However, the first step in fact takes a lot of time. Even more than when > I > > would actually iterate through the results with "fl" -> field1, field2, > > field3 > > > > Why is this happening? > > > > Thanks! 
> > > > > > On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein > wrote: > > > >> Solr 5 was very early days for Streaming Expressions. Streaming > Expressions > >> and SQL use Java 8 so development switched to the 6.0 branch five months > >> before the 6.0 release. So there was a very large jump in features and > bug > >> fixes from Solr 5 to Solr 6 in Streaming Expressions. > >> > >> Joel Bernstein > >> http://joelsolr.blogspot.com/ > >> > >> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein > >> wrote: > >> > >> > In Solr 5 the /export handler wasn't escaping json text fields, which > >> > would produce json parse exceptions. This was fixed in Solr 6.0. > >> > > >> > Joel Bernstein > >> > http://joelsolr.blogspot.com/ > >> > > >> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson < > erickerick...@gmail.com> > >> > wrote: > >> > > >> >> Hmm, that should work fine. Let us know what the logs show if > anything > >> >> because this is weird. > >> >> > >> >> Best, > >> >> Erick > >> >> > >> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi > > >> >> wrote: > >> >> > Hi Erick, > >> >> > > >> >> > This is how I use the streaming approach. > >> >> > > >> >> > Here is the solrconfig block. > >> >> > > >> >> > > >> >> > > >> >> > {!xport} > >> >> > xsort > >> >> > false > >> >> > > >> >> > > >> >> > query > >> >> > > >> >> > > >> >> > > >> >> > And here is the code in which SolrJ is being used. > >> >> > > >> >> > String zkHost = args[0]; > >> >> > String collection = args[1]; > >> >> > > >> >> > Map props = new HashMap(); > >> >> > props.put("q", "*:*"); > >> >> > props.put("qt", "/export"); > >> >> > props.put("sort", "fieldA asc"); > >> >> > props.put("fl", "fieldA,fieldB,fieldC"); > >> >> > > >> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collect > >> >> ion,props); > >> >> > > >> >> > And then I iterate through the cloud stream (TupleStream). > >> >> > So I am using streaming expressions (SolrJ). 
> >> >> > > >> >> > I have not looked at the solr logs while I started getting the JSON > >> >> parsing > >> >> > exceptions. But I will let you know what I see the next time I run > >> into > >> >> the > >> >> > same exceptions. > >> >> > > >> >> > Thanks > >> >> > > >> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson < > >> erickerick...@gmail.com > >> >> > > >> >> > wrote: > >> >> > > >> >> >> Hmmm, export is supposed to handle 10s of million result sets. I > know > >> >> >> of a situation where the Streaming Aggregation functionality back > >> >> >> ported to Solr 4.10 processes on that scale. So do you have any > clue > >> >> >> what exactly is failing? Is there anything in the Solr logs? > >> >> >> > >> >> >> _How_ are you using /export, through Streaming Aggregation > (SolrJ) or > >> >> >> just the raw xport handler? It might be worth trying to do this > from > >> >> >> SolrJ if you're not, it should be a very quick program to write, > just > >> >> >> to test we're talking 100 lines max. > >> >> >> > >> >> >> You could always roll your own cursor mark stuff by partitioning > the > >>
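Chetas's plan of merge-sorting the results of N parallel queries can be sketched without any SolrJ at all. The following is an illustrative, self-contained k-way merge over a PriorityQueue; the Long values stand in for whatever sort key the N queries share (ids, in the example above), and each partition's list is assumed to arrive already sorted:

```java
import java.util.*;

public class MergeSortedPartitions {
    // Merge N lists that are each already sorted ascending, as the
    // per-partition query results would be. Runs in O(total * log N).
    public static List<Long> merge(List<List<Long>> partitions) {
        // Each heap entry: {value, partitionIndex, offsetWithinPartition}
        PriorityQueue<long[]> heap =
                new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[0]));
        for (int p = 0; p < partitions.size(); p++) {
            if (!partitions.get(p).isEmpty()) {
                heap.add(new long[]{partitions.get(p).get(0), p, 0});
            }
        }
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] head = heap.poll();
            out.add(head[0]);
            int p = (int) head[1];
            int next = (int) head[2] + 1;
            List<Long> part = partitions.get(p);
            if (next < part.size()) {
                heap.add(new long[]{part.get(next), p, next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Long>> parts = Arrays.asList(
                Arrays.asList(1L, 4L, 9L),
                Arrays.asList(2L, 3L, 10L),
                Arrays.asList(5L));
        System.out.println(merge(parts)); // [1, 2, 3, 4, 5, 9, 10]
    }
}
```

In a real client you would replace each list with an iterator over that partition's streamed tuples, so only N heads are held in memory at once.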
Re: Parallelize Cursor approach
You're executing all the queries to parallelize before even starting. Seems very inefficient. My suggestion doesn't require this first step. Perhaps it was confusing because I mentioned "your own cursorMark". Really I meant bypass that entirely, just form N queries that were restricted to N disjoint subsets of the data and process them all in parallel, either with /export or /select. Best, Erick On Mon, Nov 14, 2016 at 3:53 PM, Chetas Joshi wrote: > Thanks Joel for the explanation. > > Hi Erick, > > One of the ways I am trying to parallelize the cursor approach is by > iterating the result set twice. > (1) Once just to get all the cursor marks > > val q: SolrQuery = new solrj.SolrQuery() > q.set("q", query) > q.add("fq", query) > q.add("rows", batchSize.toString) > q.add("collection", collection) > q.add("fl", "null") > q.add("sort", "id asc") > > Here I am not asking for any field values ( "fl" -> null ) > > (2) Once I get all the cursor marks, I can start parallel threads to get > the results in parallel. > > However, the first step in fact takes a lot of time. Even more than when I > would actually iterate through the results with "fl" -> field1, field2, > field3 > > Why is this happening? > > Thanks! > > > On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein wrote: > >> Solr 5 was very early days for Streaming Expressions. Streaming Expressions >> and SQL use Java 8 so development switched to the 6.0 branch five months >> before the 6.0 release. So there was a very large jump in features and bug >> fixes from Solr 5 to Solr 6 in Streaming Expressions. >> >> Joel Bernstein >> http://joelsolr.blogspot.com/ >> >> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein >> wrote: >> >> > In Solr 5 the /export handler wasn't escaping json text fields, which >> > would produce json parse exceptions. This was fixed in Solr 6.0. 
>> > >> > Joel Bernstein >> > http://joelsolr.blogspot.com/ >> > >> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson >> > wrote: >> > >> >> Hmm, that should work fine. Let us know what the logs show if anything >> >> because this is weird. >> >> >> >> Best, >> >> Erick >> >> >> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi >> >> wrote: >> >> > Hi Erick, >> >> > >> >> > This is how I use the streaming approach. >> >> > >> >> > Here is the solrconfig block. >> >> > >> >> > >> >> > >> >> > {!xport} >> >> > xsort >> >> > false >> >> > >> >> > >> >> > query >> >> > >> >> > >> >> > >> >> > And here is the code in which SolrJ is being used. >> >> > >> >> > String zkHost = args[0]; >> >> > String collection = args[1]; >> >> > >> >> > Map props = new HashMap(); >> >> > props.put("q", "*:*"); >> >> > props.put("qt", "/export"); >> >> > props.put("sort", "fieldA asc"); >> >> > props.put("fl", "fieldA,fieldB,fieldC"); >> >> > >> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collect >> >> ion,props); >> >> > >> >> > And then I iterate through the cloud stream (TupleStream). >> >> > So I am using streaming expressions (SolrJ). >> >> > >> >> > I have not looked at the solr logs while I started getting the JSON >> >> parsing >> >> > exceptions. But I will let you know what I see the next time I run >> into >> >> the >> >> > same exceptions. >> >> > >> >> > Thanks >> >> > >> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson < >> erickerick...@gmail.com >> >> > >> >> > wrote: >> >> > >> >> >> Hmmm, export is supposed to handle 10s of million result sets. I know >> >> >> of a situation where the Streaming Aggregation functionality back >> >> >> ported to Solr 4.10 processes on that scale. So do you have any clue >> >> >> what exactly is failing? Is there anything in the Solr logs? >> >> >> >> >> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or >> >> >> just the raw xport handler? 
It might be worth trying to do this from >> >> >> SolrJ if you're not, it should be a very quick program to write, just >> >> >> to test we're talking 100 lines max. >> >> >> >> >> >> You could always roll your own cursor mark stuff by partitioning the >> >> >> data amongst N threads/processes if you have any reasonable >> >> >> expectation that you could form filter queries that partition the >> >> >> result set anywhere near evenly. >> >> >> >> >> >> For example, let's say you have a field with random numbers between 0 >> >> >> and 100. You could spin off 10 cursorMark-aware processes each with >> >> >> its own fq clause like >> >> >> >> >> >> fq=partition_field:[0 TO 10} >> >> >> fq=[10 TO 20} >> >> >> >> >> >> fq=[90 TO 100] >> >> >> >> >> >> Note the use of inclusive/exclusive end points >> >> >> >> >> >> Each one would be totally independent of all others with no >> >> >> overlapping documents. And since the fq's would presumably be cached >> >> >> you should be able to go as fast as you can drive your cluster. Of >> >> >> course you lose query-wide sorting and the like, if that's important >> >> >> you'd need to figure something out there.
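Erick's disjoint-fq idea boils down to string construction. A minimal sketch that generates the clauses as in his example — `}` makes the upper bound exclusive so adjacent ranges never overlap, and only the last range closes with `]` to include the maximum. The field name and bounds come from the example, and the sketch assumes the range divides evenly by N:

```java
import java.util.*;

public class PartitionFilters {
    // Build N disjoint range filter queries over [min, max] using
    // Solr's mixed-bracket syntax: '[lo TO hi}' excludes hi, so the
    // partitions tile the range with no overlapping documents.
    public static List<String> build(String field, int min, int max, int n) {
        List<String> fqs = new ArrayList<>();
        int width = (max - min) / n;
        for (int i = 0; i < n; i++) {
            int lo = min + i * width;
            boolean last = (i == n - 1);
            int hi = last ? max : lo + width;
            fqs.add(field + ":[" + lo + " TO " + hi + (last ? "]" : "}"));
        }
        return fqs;
    }

    public static void main(String[] args) {
        for (String fq : build("partition_field", 0, 100, 10)) {
            System.out.println("fq=" + fq);
        }
        // first line:  fq=partition_field:[0 TO 10}
        // last line:   fq=partition_field:[90 TO 100]
    }
}
```

Each generated clause would go into its own query, one per thread or process.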
Re: Parallelize Cursor approach
Thanks Joel for the explanation. Hi Erick, One of the ways I am trying to parallelize the cursor approach is by iterating the result set twice. (1) Once just to get all the cursor marks val q: SolrQuery = new solrj.SolrQuery() q.set("q", query) q.add("fq", query) q.add("rows", batchSize.toString) q.add("collection", collection) q.add("fl", "null") q.add("sort", "id asc") Here I am not asking for any field values ( "fl" -> null ) (2) Once I get all the cursor marks, I can start parallel threads to get the results in parallel. However, the first step in fact takes a lot of time. Even more than when I would actually iterate through the results with "fl" -> field1, field2, field3 Why is this happening? Thanks! On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein wrote: > Solr 5 was very early days for Streaming Expressions. Streaming Expressions > and SQL use Java 8 so development switched to the 6.0 branch five months > before the 6.0 release. So there was a very large jump in features and bug > fixes from Solr 5 to Solr 6 in Streaming Expressions. > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein > wrote: > > > In Solr 5 the /export handler wasn't escaping json text fields, which > > would produce json parse exceptions. This was fixed in Solr 6.0. > > > > Joel Bernstein > > http://joelsolr.blogspot.com/ > > > > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson > > wrote: > > > >> Hmm, that should work fine. Let us know what the logs show if anything > >> because this is weird. > >> > >> Best, > >> Erick > >> > >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi > >> wrote: > >> > Hi Erick, > >> > > >> > This is how I use the streaming approach. > >> > > >> > Here is the solrconfig block. > >> > > >> > > >> > > >> > {!xport} > >> > xsort > >> > false > >> > > >> > > >> > query > >> > > >> > > >> > > >> > And here is the code in which SolrJ is being used. 
> >> > > >> > String zkHost = args[0]; > >> > String collection = args[1]; > >> > > >> > Map props = new HashMap(); > >> > props.put("q", "*:*"); > >> > props.put("qt", "/export"); > >> > props.put("sort", "fieldA asc"); > >> > props.put("fl", "fieldA,fieldB,fieldC"); > >> > > >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost,collect > >> ion,props); > >> > > >> > And then I iterate through the cloud stream (TupleStream). > >> > So I am using streaming expressions (SolrJ). > >> > > >> > I have not looked at the solr logs while I started getting the JSON > >> parsing > >> > exceptions. But I will let you know what I see the next time I run > into > >> the > >> > same exceptions. > >> > > >> > Thanks > >> > > >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson < > erickerick...@gmail.com > >> > > >> > wrote: > >> > > >> >> Hmmm, export is supposed to handle 10s of million result sets. I know > >> >> of a situation where the Streaming Aggregation functionality back > >> >> ported to Solr 4.10 processes on that scale. So do you have any clue > >> >> what exactly is failing? Is there anything in the Solr logs? > >> >> > >> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or > >> >> just the raw xport handler? It might be worth trying to do this from > >> >> SolrJ if you're not, it should be a very quick program to write, just > >> >> to test we're talking 100 lines max. > >> >> > >> >> You could always roll your own cursor mark stuff by partitioning the > >> >> data amongst N threads/processes if you have any reasonable > >> >> expectation that you could form filter queries that partition the > >> >> result set anywhere near evenly. > >> >> > >> >> For example, let's say you have a field with random numbers between 0 > >> >> and 100. 
You could spin off 10 cursorMark-aware processes each with > >> >> its own fq clause like > >> >> > >> >> fq=partition_field:[0 TO 10} > >> >> fq=[10 TO 20} > >> >> > >> >> fq=[90 TO 100] > >> >> > >> >> Note the use of inclusive/exclusive end points > >> >> > >> >> Each one would be totally independent of all others with no > >> >> overlapping documents. And since the fq's would presumably be cached > >> >> you should be able to go as fast as you can drive your cluster. Of > >> >> course you lose query-wide sorting and the like, if that's important > >> >> you'd need to figure something out there. > >> >> > >> >> Do be aware of a potential issue. When regular doc fields are > >> >> returned, for each document returned, a 16K block of data will be > >> >> decompressed to get the stored field data. Streaming Aggregation > >> >> (/xport) reads docValues entries which are held in MMapDirectory > space > >> >> so will be much, much faster. As of Solr 5.5. You can override the > >> >> decompression stuff, see: > >> >> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are > >> >> both stored and docvalues... > >> >> > >> >> Best, > >> >> Erick > >> >> > >> >> On Sat, Nov 5, 20
Re: Editing schema and solrconfig files
Oh, and of course there's the whole managed schema capabilities where you use API end points to modify the schema file and a similar for some parts of solrconfig.xml. That said, though, for any kind of serious installation I'd still be pulling the modified configs off of ZK and putting them in source code control. If you're not going to create a UI to manipulate the schema, I find just using the scripts or zkcli about as fast. Best, Erick On Mon, Nov 14, 2016 at 2:22 PM, Reth RM wrote: > There's a way to add/update/delete schema fields, this is helpful. > https://jpst.it/Pqqz > although no way to add field-Type > > On Wed, Nov 9, 2016 at 2:20 PM, Erick Erickson > wrote: > >> We had the bright idea of allowing editing of the config files through >> the UI... but the ability to upload arbitrary XML is a security >> vulnerability, so that idea was nixed. >> >> The solr/bin script has an upconfig and downconfig command that are (I >> hope) easier to use than zkcli, I think from 5.5. In Solr 6.2 the >> solr/bin script has been enhanced to allow other ZK operations. Not >> quite what you were looking for, but I thought I'd mention it. >> >> There are some ZK clients out there that'll let you edit files >> directly in ZK, and I know IntelliJ also has a plugin that'll allow >> you to do that from the IDE, don't know about Eclipse but I expect it >> does. >> >> I usually edit them locally and set up a shell script to push them up >> as necessary... >> >> FWIW, >> Erick >> >> On Wed, Nov 9, 2016 at 2:09 PM, John Bickerstaff >> wrote: >> > I never found a way to do it through the UI... and ended up using "nano" >> on >> > linux for simple things. >> > >> > For more complex stuff, I scp'd the file (or the whole conf directory) up >> > to my dev box (a Mac in my case) and edited in a decent UI tool, then >> scp'd >> > the whole thing back... I wrote a simple bash script to automate the scp >> > process on both ends once I got tired of typing it over and over... 
>> > >> > On Wed, Nov 9, 2016 at 3:05 PM, Reth RM wrote: >> > >> >> What are some easiest ways to edit/modify/add conf files, such as >> >> solrconfig.xml and schema.xml other than APIs end points or using zk >> >> commands to re-upload modified file? >> >> >> >> In other words, can we edit conf files through solr admin (GUI) >> >> interface(add new filed by click on button or add new request handler on >> >> click?) with feature of enabling/disabling same feature as required? >> >> >>
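For the managed-schema route Erick mentions, the Schema API takes JSON commands POSTed to /solr/<collection>/schema. A minimal sketch of assembling an add-field payload; the field name and type below are made up for illustration, and the HTTP call itself is omitted:

```java
public class SchemaAddField {
    // Build the JSON body for the managed Schema API's "add-field"
    // command; it gets POSTed to /solr/<collection>/schema.
    // The field name/type passed in are examples only.
    public static String addFieldJson(String name, String type, boolean stored) {
        return "{\"add-field\":{"
                + "\"name\":\"" + name + "\","
                + "\"type\":\"" + type + "\","
                + "\"stored\":" + stored + "}}";
    }

    public static void main(String[] args) {
        System.out.println(addFieldJson("price", "string", true));
        // {"add-field":{"name":"price","type":"string","stored":true}}
    }
}
```

As Erick notes, for a serious installation you would still pull the resulting managed-schema out of ZK and keep it in source control.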
Re: RTF Rich text format
The logical place to do that (if you cannot do it outside of Solr) would be in an UpdateRequestProcessor. Unfortunately, there is no TikaExtract URP, though other similar ones exist (e.g. for language guessing). The full list is here: http://www.solr-start.com/info/update-request-processors/ But you could write one. You'd have to be very careful about using Tika to not leak memory and to handle the failure states, but technically it should be possible. Regards, Alex. Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 15 November 2016 at 04:01, Sergio García Maroto wrote: > Thanks for the response. > > I am afraid I can't use the DataImportHandler. I do the indexation using an > Indexation Service joining data from several places. > > I have a final xml with plenty of data and one of them is the rtf field. > That's the xml I send to Solr using the /update. I am wondering if it would > be possible for Solr to do it with a tokenizer filter or something like that. > > On 14 November 2016 at 16:24, Alexandre Rafalovitch > wrote: > >> I think DataImportHandler with nested entity (JDBC, then Tika with >> FieldReaderDataSource) should do the trick. >> >> Have you tried that? >> >> Regards, >>Alex. >> >> Solr Example reading group is starting November 2016, join us at >> http://j.mp/SolrERG >> Newsletter and resources for Solr beginners and intermediates: >> http://www.solr-start.com/ >> >> >> On 15 November 2016 at 03:19, marotosg wrote: >> > Hi, >> > >> > I have a use case where I need to index information coming from a >> database >> > where there is a field which contains rich text format. I would like to >> > convert that text into simple plain text, same as Tika does when indexing >> > documents. >> > >> > Is there any way to achieve that by having a field where I simply send this >> rich >> > text and then Solr cleans that data? I can't find anything so far. 
>> > >> > Thanks >> > Sergio >> > >> > >> > >> > -- >> > View this message in context: http://lucene.472066.n3.nabble.com/RTF-Rich-text-format-tp4305778.html >> > Sent from the Solr - User mailing list archive at Nabble.com. >>
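A custom URP would do its extraction with Tika, as Alex says. As a rough, dependency-free stand-in for that extraction step, the JDK's own Swing RTF support can also strip RTF to plain text — it is far weaker than Tika, so treat this only as an illustration of what such a processor would do per field value:

```java
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class RtfToPlainText {
    // Convert an RTF string to plain text using the JDK's RTFEditorKit.
    // Tika handles far more of the RTF spec; this is only a sketch of
    // the extraction step a custom URP would perform.
    public static String strip(String rtf) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument doc = new DefaultStyledDocument();
        kit.read(new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII)), doc, 0);
        return doc.getText(0, doc.getLength()).trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(strip("{\\rtf1\\ansi Hello plain text}"));
    }
}
```

In a real URP you would run this (or Tika) inside processAdd(), replacing the raw RTF field value before the document reaches the index, and guard it carefully against parser failures, as Alex warns.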
Re: Editing schema and solrconfig files
There's a way to add/update/delete schema fields, this is helpful. https://jpst.it/Pqqz although no way to add field-Type On Wed, Nov 9, 2016 at 2:20 PM, Erick Erickson wrote: > We had the bright idea of allowing editing of the config files through > the UI... but the ability to upload arbitrary XML is a security > vulnerability, so that idea was nixed. > > The solr/bin script has an upconfig and downconfig command that are (I > hope) easier to use than zkcli, I think from 5.5. In Solr 6.2 the > solr/bin script has been enhanced to allow other ZK operations. Not > quite what you were looking for, but I thought I'd mention it. > > There are some ZK clients out there that'll let you edit files > directly in ZK, and I know IntelliJ also has a plugin that'll allow > you to do that from the IDE, don't know about Eclipse but I expect it > does. > > I usually edit them locally and set up a shell script to push them up > as necessary... > > FWIW, > Erick > > On Wed, Nov 9, 2016 at 2:09 PM, John Bickerstaff > wrote: > > I never found a way to do it through the UI... and ended up using "nano" > on > > linux for simple things. > > > > For more complex stuff, I scp'd the file (or the whole conf directory) up > > to my dev box (a Mac in my case) and edited in a decent UI tool, then > scp'd > > the whole thing back... I wrote a simple bash script to automate the scp > > process on both ends once I got tired of typing it over and over... > > > > On Wed, Nov 9, 2016 at 3:05 PM, Reth RM wrote: > > > >> What are some easiest ways to edit/modify/add conf files, such as > >> solrconfig.xml and schema.xml other than APIs end points or using zk > >> commands to re-upload modified file? > >> > >> In other words, can we edit conf files through solr admin (GUI) > >> interface(add new filed by click on button or add new request handler on > >> click?) with feature of enabling/disabling same feature as required? > >> >
how to tell SolrHttpServer client to accept/ignore all certs?
I'm using HttpSolrServer (in Solr 3.6) to connect to a Solr web service and perform a query. The certificate at the other end has expired and so connections now fail. It will take the IT team at the other end too many days to replace the cert (this is out of my control). How can I tell the HttpSolrServer to ignore bad certs when it does queries to the server? NOTE 1: I noticed that I can pass my own Apache HttpClient (we're currently using 4.3) into the HttpSolrServer constructor, but internally HttpSolrServer seems to do a lot of customizing/configuring of its own default HttpClient, so I didn't want to mess with that. NOTE 2: This is a 100% internal application, so there are no real security problems with this temporary workaround. Thanks!! rh
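At the plain-JDK level, ignoring bad certs means installing an all-trusting TrustManager into an SSLContext. A sketch of just that piece — only defensible for a strictly internal, temporary workaround like the one described:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

public class TrustAllContext {
    // Build an SSLContext that accepts any certificate chain,
    // including expired ones. ONLY for a temporary, internal workaround.
    public static SSLContext build() throws Exception {
        TrustManager trustAll = new X509TrustManager() {
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[]{trustAll}, new SecureRandom());
        return ctx;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build().getProtocol()); // TLS
    }
}
```

With HttpClient 4.3 you would then (as I understand that API) wrap this context in an SSLConnectionSocketFactory, possibly with a relaxed hostname verifier, build a client via HttpClients.custom().setSSLSocketFactory(...), and pass that client into the HttpSolrServer constructor — accepting the caveat in NOTE 1 that you then own the client configuration HttpSolrServer would otherwise do for you.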
Re: index and data directories
Theoretically, perhaps. And it's quite true that stored data for fields marked stored=true are just passed through verbatim and compressed on disk while the data associated with indexed=true fields go through an analysis chain and are stored in a much different format. However, these different data are simply stored in files with different suffixes in a segment. So you might have _0.fdx, _0.fdt, _0.tim, _0.tvx etc. that together form a single segment. This is done on a per-segment basis. So certain segment files, namely the *.fdt and *.fdx files, will contain the stored data while other extensions have the indexed data, see: "File naming" here for a somewhat out of date format, but close enough for this discussion: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html. And there's no option to store the *.fdt and *.fdx files independently from the rest of the segment files. This statement: "I mean documents which are to be indexed" really doesn't make sense. You send these things called Solr documents to be indexed, but they are just a set of fields with values handled as their definitions indicate (i.e. respecting stored=true|false, indexed=true|false, docValues=true|false). The Solr document sent by SolrJ is simply thrown away after processing into segment files. If you're sending semi-structured docs (say Word, PDF etc) to be indexed through Tika they are simply transformed into a Solr doc (set of field/value pairs) and the original document is thrown away as well. There's no option to store the original semi-structured doc either. Best, Erick On Mon, Nov 14, 2016 at 12:35 PM, Prateek Jain J wrote: > > By data, I mean documents which are to be indexed. Some fields can be > stored="true" but that doesn’t matter. > > For example: App1 creates an object (AppObj) to be indexed and sends it to > SOLR via solrj. Some of the attributes of this object can be declared to be > used for storage. 
> > Now, my understanding is data and indexes generated on data are two separate > things. In my particular example, all fields have stored="true" but only > selected fields have indexed="true". My expectation is, indexes are stored > separately from data because indexes can be generated by different > techniques/algorithms but data/documents remain unchanged. Please correct me > if my understanding is not correct. > > > Regards, > Prateek Jain > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: 14 November 2016 07:05 PM > To: solr-user > Subject: Re: index and data directories > > The question is pretty opaque. What do you mean by "data" as opposed to > "indexes"? Are you talking about where Lucene puts stored="true" > fields? If not, what do you mean by "data"? > > If you are talking about where Lucene puts the stored="true" bits the no, > there's no way to segregate that our from the other files that make up a > segment. > > Best, > Erick > > On Mon, Nov 14, 2016 at 7:58 AM, Prateek Jain J > wrote: >> >> Hi Alex, >> >> I am unable to get it correctly. Is it possible to store indexes and data >> separately? >> >> >> Regards, >> Prateek Jain >> >> -Original Message- >> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] >> Sent: 14 November 2016 03:53 PM >> To: solr-user >> Subject: Re: index and data directories >> >> solr.xml also has a bunch of properties under the core tag: >> >> >> >> >> >> >> >> You can get the Reference Guide for your specific version here: >> http://archive.apache.org/dist/lucene/solr/ref-guide/ >> >> Regards, >>Alex. 
>> >> Solr Example reading group is starting November 2016, join us at >> http://j.mp/SolrERG Newsletter and resources for Solr beginners and >> intermediates: >> http://www.solr-start.com/ >> >> >> On 15 November 2016 at 02:37, Prateek Jain J >> wrote: >>> >>> Hi All, >>> >>> We are using solr 4.8.1 and would like to know if it is possible to >>> store data and indexes in separate directories? I know following tag >>> exist in solrconfig.xml file >>> >>> >>> C:/del-it/solr/cm_events_nbi/data >>> >>> >>> >>> Regards, >>> Prateek Jain
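Erick's point — stored and indexed data live in differently-suffixed files of the same segment, so they cannot be split across directories — can be illustrated by classifying a segment listing by suffix. The suffix-to-content mapping follows the Lucene 4.0 codec docs he links (*.fdt/*.fdx hold stored fields); this is illustrative only, not an exhaustive codec map:

```java
import java.util.*;

public class SegmentFileKinds {
    // Per the Lucene 4.0 codec docs, *.fdt/*.fdx hold the stored-field
    // data; most other suffixes belong to the inverted index (term
    // dictionary, postings, term vectors, ...). All are part of one
    // segment, which is why the two kinds can't live in separate dirs.
    private static final Set<String> STORED_FIELD_SUFFIXES =
            new HashSet<>(Arrays.asList("fdt", "fdx"));

    public static Map<String, List<String>> classify(List<String> segmentFiles) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("stored", new ArrayList<>());
        out.put("other", new ArrayList<>());
        for (String f : segmentFiles) {
            String suffix = f.substring(f.lastIndexOf('.') + 1);
            out.get(STORED_FIELD_SUFFIXES.contains(suffix) ? "stored" : "other").add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList("_0.fdt", "_0.fdx", "_0.tim", "_0.tvx")));
        // {stored=[_0.fdt, _0.fdx], other=[_0.tim, _0.tvx]}
    }
}
```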
RE: index and data directories
By data, I mean documents which are to be indexed. Some fields can be stored="true" but that doesn’t matter. For example: App1 creates an object (AppObj) to be indexed and sends it to SOLR via solrj. Some of the attributes of this object can be declared to be used for storage. Now, my understanding is data and indexes generated on data are two separate things. In my particular example, all fields have stored="true" but only selected fields have indexed="true". My expectation is, indexes are stored separately from data because indexes can be generated by different techniques/algorithms but data/documents remain unchanged. Please correct me if my understanding is not correct. Regards, Prateek Jain -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: 14 November 2016 07:05 PM To: solr-user Subject: Re: index and data directories The question is pretty opaque. What do you mean by "data" as opposed to "indexes"? Are you talking about where Lucene puts stored="true" fields? If not, what do you mean by "data"? If you are talking about where Lucene puts the stored="true" bits the no, there's no way to segregate that our from the other files that make up a segment. Best, Erick On Mon, Nov 14, 2016 at 7:58 AM, Prateek Jain J wrote: > > Hi Alex, > > I am unable to get it correctly. Is it possible to store indexes and data > separately? > > > Regards, > Prateek Jain > > -Original Message- > From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] > Sent: 14 November 2016 03:53 PM > To: solr-user > Subject: Re: index and data directories > > solr.xml also has a bunch of properties under the core tag: > > > > > > > > You can get the Reference Guide for your specific version here: > http://archive.apache.org/dist/lucene/solr/ref-guide/ > > Regards, >Alex. 
> > Solr Example reading group is starting November 2016, join us at > http://j.mp/SolrERG Newsletter and resources for Solr beginners and > intermediates: > http://www.solr-start.com/ > > > On 15 November 2016 at 02:37, Prateek Jain J > wrote: >> >> Hi All, >> >> We are using solr 4.8.1 and would like to know if it is possible to >> store data and indexes in separate directories? I know following tag >> exist in solrconfig.xml file >> >> >> C:/del-it/solr/cm_events_nbi/data >> >> >> >> Regards, >> Prateek Jain
Re: index and data directories
The question is pretty opaque. What do you mean by "data" as opposed to "indexes"? Are you talking about where Lucene puts stored="true" fields? If not, what do you mean by "data"? If you are talking about where Lucene puts the stored="true" bits, then no, there's no way to segregate that out from the other files that make up a segment. Best, Erick On Mon, Nov 14, 2016 at 7:58 AM, Prateek Jain J wrote: > > Hi Alex, > > I am unable to get it correctly. Is it possible to store indexes and data > separately? > > > Regards, > Prateek Jain > > -Original Message- > From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] > Sent: 14 November 2016 03:53 PM > To: solr-user > Subject: Re: index and data directories > > solr.xml also has a bunch of properties under the core tag: > > > > > > > > You can get the Reference Guide for your specific version here: > http://archive.apache.org/dist/lucene/solr/ref-guide/ > > Regards, >Alex. > > Solr Example reading group is starting November 2016, join us at > http://j.mp/SolrERG Newsletter and resources for Solr beginners and > intermediates: > http://www.solr-start.com/ > > > On 15 November 2016 at 02:37, Prateek Jain J > wrote: >> >> Hi All, >> >> We are using solr 4.8.1 and would like to know if it is possible to >> store data and indexes in separate directories? I know following tag >> exist in solrconfig.xml file >> >> >> C:/del-it/solr/cm_events_nbi/data >> >> >> >> Regards, >> Prateek Jain
Re: Filtering a field when some of the documents don't have the value
You want something like: name:X&fq=population:[10 TO *] OR (*:* -population:*) Best, Erick On Mon, Nov 14, 2016 at 10:29 AM, Gintautas Sulskus wrote: > Hi, > > I have an index with two fields "name" and "population". Some of the > documents have the "population" field empty. > > I would like to search for a value X in field "name" with the following > condition: > 1. if the field is empty - return results for > name:X > 2. else set the minimum value for the "population" field to 10: > name:X AND population: [10 TO *] > The population field should not influence the score. > > Could you please help me out with the query construction? > I have tried conditional statements with exists(), but it seems it does not > suit the case. > > Thanks, > Gin
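Spelled out as full request parameters, that suggestion might look like the sketch below (host, core name, and the literal X are placeholders; the "field is empty" clause is written as a pure negation wrapped in *:* so it is a valid standalone Lucene query):

```
http://localhost:8983/solr/collection1/select
    ?q=name:X
    &fq=population:[10 TO *] OR (*:* -population:*)
```

Putting the population condition into fq rather than q also guarantees it cannot influence the score, which was one of the stated requirements, and lets the filter be cached independently of the main query.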
Filtering a field when some of the documents don't have the value
Hi, I have an index with two fields "name" and "population". Some of the documents have the "population" field empty. I would like to search for a value X in field "name" with the following condition: 1. if the field is empty - return results for name:X 2. else set the minimum value for the "population" field to 10: name:X AND population: [10 TO *] The population field should not influence the score. Could you please help me out with the query construction? I have tried conditional statements with exists(), but it seems it does not suit the case. Thanks, Gin
Re: RTF Rich text format
Thanks for the response. I am afraid I can't use the DataImportHandler. I do the indexation using an Indexation Service joining data from several places. I have a final xml with plenty of data and one of them is the rtf field. That's the xml I send to Solr using the /update. I am wondering whether it would be possible for Solr to do it with a tokenizer, filter, or something like that. On 14 November 2016 at 16:24, Alexandre Rafalovitch wrote: > I think DataImportHandler with nested entity (JDBC, then Tika with > FieldReaderDataSource) should do the trick. > > Have you tried that? > > Regards, >Alex. > > Solr Example reading group is starting November 2016, join us at > http://j.mp/SolrERG > Newsletter and resources for Solr beginners and intermediates: > http://www.solr-start.com/ > > > On 15 November 2016 at 03:19, marotosg wrote: > > Hi, > > > > I have a use case where I need to index information coming from a > database > > where there is a field which contains rich text format. I would like to > > convert that text into simple plain text, same as tika does when indexing > > documents. > > > > Is there any way to achieve that having a field only where i sent this > rich > > text and then Solr cleans that data? I can't find anything so far. > > > > Thanks > > Sergio > > > > > > > > -- > > View this message in context: http://lucene.472066.n3. > nabble.com/RTF-Rich-text-format-tp4305778.html > > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: RTF Rich text format
I think DataImportHandler with nested entity (JDBC, then Tika with FieldReaderDataSource) should do the trick. Have you tried that? Regards, Alex. Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 15 November 2016 at 03:19, marotosg wrote: > Hi, > > I have a use case where I need to index information coming from a database > where there is a field which contains rich text format. I would like to > convert that text into simple plain text, same as tika does when indexing > documents. > > Is there any way to achieve that having a field only where i sent this rich > text and then Solr cleans that data? I can't find anything so far. > > Thanks > Sergio > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/RTF-Rich-text-format-tp4305778.html > Sent from the Solr - User mailing list archive at Nabble.com.
RTF Rich text format
Hi, I have a use case where I need to index information coming from a database where there is a field which contains rich text format. I would like to convert that text into simple plain text, same as Tika does when indexing documents. Is there any way to achieve that, having a field only where I send this rich text and then Solr cleans that data? I can't find anything so far. Thanks Sergio -- View this message in context: http://lucene.472066.n3.nabble.com/RTF-Rich-text-format-tp4305778.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: sorting by date not working on dates earlier than EPOCH
Hi there. I have found a possible solution for this issue. -- View this message in context: http://lucene.472066.n3.nabble.com/sorting-by-date-not-working-on-dates-earlier-than-EPOCH-tp4303456p4305770.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: index and data directories
Hi Alex, I am unable to get it correctly. Is it possible to store indexes and data separately? Regards, Prateek Jain -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: 14 November 2016 03:53 PM To: solr-user Subject: Re: index and data directories solr.xml also has a bunch of properties under the core tag: You can get the Reference Guide for your specific version here: http://archive.apache.org/dist/lucene/solr/ref-guide/ Regards, Alex. Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 15 November 2016 at 02:37, Prateek Jain J wrote: > > Hi All, > > We are using solr 4.8.1 and would like to know if it is possible to > store data and indexes in separate directories? I know following tag > exist in solrconfig.xml file > > > C:/del-it/solr/cm_events_nbi/data > > > > Regards, > Prateek Jain
Re: DIH problem with multiple (types of) resources
On 15 November 2016 at 02:19, Peter Blokland wrote: > > Attribute names are case sensitive as far as I remember. Try 'dataSource' for the second definition. Regards, Alex. Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/
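A minimal sketch of what the corrected data-config might look like (the entity name, query, and connection attributes are hypothetical, reconstructed from the error output further down this thread; the key point is the camel-cased dataSource attribute):

```xml
<dataConfig>
  <dataSource name="db"  type="JdbcDataSource" driver="..." url="..."/>
  <dataSource name="web" type="BinURLDataSource"/>
  <document>
    <!-- dataSource (capital S) selects the named source by name; a
         lowercase "datasource" attribute is not recognized, so the
         entity falls back to another source - here apparently the
         BinURLDataSource, which explains the MalformedURLException -->
    <entity name="edition" dataSource="db"
            query="select edition from editions"/>
  </document>
</dataConfig>
```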
Re: index and data directories
solr.xml also has a bunch of properties under the core tag: You can get the Reference Guide for your specific version here: http://archive.apache.org/dist/lucene/solr/ref-guide/ Regards, Alex. Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 15 November 2016 at 02:37, Prateek Jain J wrote: > > Hi All, > > We are using solr 4.8.1 and would like to know if it is possible to store > data and indexes in separate directories? I know following tag exist in > solrconfig.xml file > > > C:/del-it/solr/cm_events_nbi/data > > > > Regards, > Prateek Jain
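The attribute list in the message above appears to have been stripped by the archive. As a rough sketch, the legacy (pre-5.x) solr.xml format allows per-core properties along these lines (the paths and attribute values here are hypothetical; check the Reference Guide for your exact version):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- dataDir can point the core's index/data directory somewhere
         other than its instanceDir, e.g. a different disk -->
    <core name="cm_events_nbi" instanceDir="cm_events_nbi"
          dataDir="/fast-disk/solr/cm_events_nbi/data"/>
  </cores>
</solr>
```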
index and data directories
Hi All, We are using solr 4.8.1 and would like to know if it is possible to store data and indexes in separate directories? I know the following tag exists in the solrconfig.xml file: <dataDir>C:/del-it/solr/cm_events_nbi/data</dataDir> Regards, Prateek Jain
DIH problem with multiple (types of) resources
hi, I'm porting an old data-import configuration from 4.x to 6.3.0. a minimal config is this : http://site/nl/${page.pid}"; format="text"> when I try to do a full import with this, I get : 2016-11-14 12:31:52.173 INFO (Thread-68) [ x:meulboek] o.a.s.u.p.LogUpdateProcessorFactory [meulboek] webapp=/solr path=/dataimport params={core=meulboek&optimize=false&indent=on&commit=true&clean=true&wt=json&command=full-import&_=1479122291861&verbose=true} status=0 QTime=11{deleteByQuery=*:* (-1550976769832517632),add=[ed99517c-ece9-40c6-9682-c9ec74173241 (1550976769976172544), 9283532a-2395-43eb-bcb8-fd30c5ebfd08 (1550976770348417024), 87b75d5c-a12a-4538-bc29-ceb13d6a9d1c (1550976770455371776), 476b5da3-3752-4867-bdb3-4264403c5c2d (1550976770787770368), 71cdaadb-62ba-4753-ad1b-01ba7fd75bfa (1550976770875850752), 02f41269-4a28-4001-aab9-7b1feb51e332 (1550976770954493952), 6216ec48-2abd-465b-8d6b-60907c7f49db (1550976771047817216), 4317b308-dc88-47e1-9240-0d7d94646de6 (1550976771136946176), 159ee092-2f72-45f6-970e-9dfd6d635bdf (1550976771221880832), bdfa48c4-23e2-483f-9b63-e0c5753d60a5 (1550976771336175616)]} 0 1465 2016-11-14 12:31:52.173 ERROR (Thread-68) [ x:meulboek] o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11 at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:475) at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:458) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232) ... 4 more Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69) at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:89) at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38) at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414) ... 6 more Caused by: java.net.MalformedURLException: no protocol: nullselect edition from editions at java.net.URL.<init>(URL.java:593) at java.net.URL.<init>(URL.java:490) at java.net.URL.<init>(URL.java:439) at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:81) ... 12 more note that this failure occurs with the second entity, and judging from this line : Caused by: java.net.MalformedURLException: no protocol: nullselect edition from editions it seems solr tries to use the datasource named "web" (the BinURLDataSource) instead of the configured "db" datasource (the JdbcDataSource). am I doing something wrong, or is this a bug ? -- CUL8R, Peter. www.desk.nl Your excuse is: Communist revolutionaries taking over the server room and demanding all the computers in the building or they shoot the sysadmin. Poor misguided fools.
Suggestions
Is there a chance that suggestions could be generated at indexing time rather than afterwards from the indexed data? This would make it possible to suggest on fields which are not "stored". Or is there another way to make suggestion-like behavior possible? Thx! Arkadi
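One possible direction (a hedged sketch based on the Solr suggester documentation; the component name and field are hypothetical): a suggester whose dictionary is built from the indexed terms of a field, so the field does not have to be stored:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">termSuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <!-- HighFrequencyDictionaryFactory reads the *indexed* terms of
         the field rather than stored values -->
    <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
    <str name="field">title</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

With buildOnCommit the lookup structure is rebuilt as part of indexing, which is close to the "generated at indexing time" behavior asked about (at the cost of slower commits on large indexes).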
Re: price sort
Hi Midas, You can boost results by the reciprocal value of price, but that does not guarantee that an irrelevant result will not come first just because it is cheap. Emir On 14.11.2016 11:19, Midas A wrote: Thanks for replying , i want to maintain relevancy along with price sorting \ for example if i search "nike shoes" According to relevance "nike shoes" come first then tshirt (other product) from nike . and now if we sort the results tshirt from nike come on the top . this is some thing that is not users intent . In this situation we have to adopt mediocre approach that does not change users intent . On Mon, Nov 14, 2016 at 2:38 PM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: Hi Midas, Sorting by price means that score (~relevancy) is ignored/used as second sorting criteria. My assumption is that you have long tail of false positives causing sort by price to sort cheap, unrelated items first just because they matched by some stop word. Or I missed your question? Emir On 14.11.2016 06:39, Midas A wrote: Hi, we are in e-commerce business and we have to give price sort functionality . what logic should we use that does not break the relevance . please give the query for the same assuming dummy fields. -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/
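The reciprocal-price boost suggested above can be sketched with standard function queries (edismax's boost parameter and recip() are documented Solr features; the query text, field name, and constants are hypothetical):

```
q=nike shoes
&defType=edismax
&boost=recip(price,1,1000,1000)
&sort=score desc
```

Here recip(price,1,1000,1000) evaluates to 1000/(price+1000), so cheaper items get a modest multiplicative boost while relevance still dominates the ordering - unlike sort=price asc, which discards relevance entirely. The constants control how aggressively price matters and would need tuning against real data.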
Re: price sort
Thanks for replying. I want to maintain relevancy along with price sorting. For example, if I search "nike shoes", according to relevance "nike shoes" come first, then t-shirts (other products) from Nike; and now if we sort the results, t-shirts from Nike come to the top. This is something that is not the user's intent. In this situation we have to adopt a middle-ground approach that does not change the user's intent. On Mon, Nov 14, 2016 at 2:38 PM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: > Hi Midas, > > Sorting by price means that score (~relevancy) is ignored/used as second > sorting criteria. My assumption is that you have long tail of false > positives causing sort by price to sort cheap, unrelated items first just > because they matched by some stop word. > > Or I missed your question? > > Emir > > > > On 14.11.2016 06:39, Midas A wrote: > >> Hi, >> >> we are in e-commerce business and we have to give price sort >> functionality >> . >> what logic should we use that does not break the relevance . >> please give the query for the same assuming dummy fields. >> >> > -- > Monitoring * Alerting * Anomaly Detection * Centralized Log Management > Solr & Elasticsearch Support * http://sematext.com/ > >
Collection synchronization and AWS instance autoscale
Hi, I have a SolrCloud 4.9.1 setup with 4 nodes, 50 collections, 1 shard and 4 replicas per collection. Question one: What happens with collection data when I shut down one node? When I start this node again, will ZK update the collection data? Question two: If I set up an auto-scale, load-based instance on AWS, when a new node is started, what is the best way to add collection replicas to this new node? Running a script via OpsWorks? ZK conf? Thanks Iván --
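For question two, one common approach is to have the scale-up hook call the Collections API ADDREPLICA action for each collection once the new node has joined ZooKeeper. A hedged sketch (host, collection, shard, and node names are placeholders; check that ADDREPLICA is available in your exact 4.9 release, as it is a relatively late addition to the 4.x Collections API):

```
http://any-node:8983/solr/admin/collections?action=ADDREPLICA
    &collection=collection1
    &shard=shard1
    &node=new-node:8983_solr
```

The node parameter pins the new replica to the freshly started instance; the call would be repeated per collection (e.g. from an OpsWorks lifecycle script). The reverse problem - removing replicas when an instance is scaled down - needs DELETEREPLICA handling as well, or the cluster state accumulates dead replicas.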
Re: spell checking on query
Hi Midas, You can use Solr's spellcheck component: https://cwiki.apache.org/confluence/display/solr/Spell+Checking Emir On 14.11.2016 08:37, Midas A wrote: How can we do the query time spell checking with help of solr . -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/
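Following the referenced page, a minimal setup might look like the sketch below (the field name and spellchecker name are hypothetical; the component class and parameters are from the Solr spellcheck documentation):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- field whose indexed terms form the spelling dictionary -->
    <str name="field">text</str>
    <!-- DirectSolrSpellChecker works off the main index, no
         separate sidecar index to build -->
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```

The component then has to be listed in a request handler's last-components, and enabled per query with parameters like spellcheck=true&spellcheck.collate=true.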
Re: price sort
Hi Midas, Sorting by price means that score (~relevancy) is ignored/used as second sorting criteria. My assumption is that you have long tail of false positives causing sort by price to sort cheap, unrelated items first just because they matched by some stop word. Or I missed your question? Emir On 14.11.2016 06:39, Midas A wrote: Hi, we are in e-commerce business and we have to give price sort functionality . what logic should we use that does not break the relevance . please give the query for the same assuming dummy fields. -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/
Re: facet query performance
On Mon, 2016-11-14 at 11:36 +0530, Midas A wrote: > How to improve facet query performance 1) Don't shard unless you really need to. Replicas are fine. 2) If the problem is the first facet call, then enable DocValues and re-index. 3) Keep facet.limit <= 100, especially if you shard. And most important: 4) Describe in detail what you have, how you facet and what you expect. Give us something to work with. - Toke Eskildsen, State and University Library, Denmark
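Point 2 in schema terms - a hedged sketch of a facet field with docValues enabled (the field name and type are hypothetical; a full re-index is required after the change):

```xml
<field name="category" type="string" indexed="true" stored="false"
       docValues="true"/>
```

With docValues, the column-oriented structure used for faceting is written at index time and memory-mapped at search time, so the first facet request avoids the expensive on-heap un-inversion (FieldCache) that otherwise makes the initial call slow.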