Re: Performing DIH on predefined list of IDS
On 2/21/2015 6:33 PM, Walter Underwood wrote:
> Never do POST for a read-only request. Never. That only guarantees that you
> cannot reproduce the problem by looking at the logs.
>
> If your design requires extremely long GET requests, you may need to re-think
> your design.

I agree with those sentiments ... but those who consume the services we provide tend to push the envelope well beyond any reasonable limits.

My Solr install deals with some Solr queries where the GET request is pushing 2 characters. The queries and filters constructed by the website code for some of the more powerful users are really large. I had to configure haproxy and jetty to allow HTTP headers up to 32K. I'd like to tell development that we just can't handle it, but with the way the system is currently structured, there's no other way to get the results they need.

If I were to make it widely known internally that the Solr config is currently allowing POST requests up to 32 megabytes, I am really scared to find out what sort of queries development would try to do. I raised that particular configuration limit (which defaults to 2MB) for my own purposes, not for the development group.

Thanks,
Shawn
Re: Performing DIH on predefined list of IDS
Am I an expert? Not sure, but I worked on an enterprise search spider and search engine for about a decade (Ultraseek Server) and I’ve done customer-facing search for another 6+ years.

Let the server reject URLs it cannot handle. Great servers will return a 414, good servers will return a 400, broken servers will return a 500, and crapulous servers will hang. In nearly all cases, you’ll get a fast fail which won’t hurt other users of the site.

Manage your site for zero errors, so you can fix the queries that are too long.

At Chegg, we have people paste entire homework problems into the search for homework solutions, and, yes, we have a few queries longer than 8K. But we deal with it gracefully.

Never do POST for a read-only request. Never. That only guarantees that you cannot reproduce the problem by looking at the logs.

If your design requires extremely long GET requests, you may need to re-think your design.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 21, 2015, at 4:45 PM, Shawn Heisey wrote:

> On 2/21/2015 1:46 AM, steve wrote:
>> Careful with the GETs! There is a real, hard limit on the length of a GET
>> url (in the low hundreds of characters). That's why a POST is so much better
>> for complex queries; the limit is in the hundreds of MegaBytes.
>
> The limit on a GET command (including the GET itself and the protocol
> specifier (usually HTTP/1.1) is normally 8K, or 8192 bytes. That's the
> default value in Jetty, at least.
>
> A question for the experts: Would it be a good idea to force a POST
> request in SolrEntityProcessor? It may be dealing with parameters that
> have been sent via POST and may exceed the header size limit.
>
> Thanks,
> Shawn
Re: Performing DIH on predefined list of IDS
On 2/21/2015 1:46 AM, steve wrote:
> Careful with the GETs! There is a real, hard limit on the length of a GET url
> (in the low hundreds of characters). That's why a POST is so much better for
> complex queries; the limit is in the hundreds of MegaBytes.

The limit on a GET command (including the GET itself and the protocol specifier, usually HTTP/1.1) is normally 8K, or 8192 bytes. That's the default value in Jetty, at least.

A question for the experts: Would it be a good idea to force a POST request in SolrEntityProcessor? It may be dealing with parameters that have been sent via POST and may exceed the header size limit.

Thanks,
Shawn
RE: Performing DIH on predefined list of IDS
Thank you! Another 4xx error that makes sense.

Quoting from the Book of StackOverFlow
http://stackoverflow.com/questions/2659952/maximum-length-of-http-get-request

"Most webservers have a limit of 8192 bytes (8KB), which is usually configureable somewhere in the server configuration. As to the client side matter, the HTTP 1.1 specification even warns about this, here's an extract of chapter 3.2.1:

Note: Servers ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths.

The limit is in MSIE and Safari about 2KB, in Opera about 4KB and in Firefox about 8KB. We may thus assume that 8KB is the maximum possible length and that 2KB is a more affordable length to rely on at the server side and that 255 bytes is the safest length to assume that the entire URL will come in.

If the limit is exceeded in either the browser or the server, most will just truncate the characters outside the limit without any warning. Some servers however may send a HTTP 414 error. If you need to send large data, then better use POST instead of GET. Its limit is much higher, but more dependent on the server used than the client. Usually up to around 2GB is allowed by the average webserver. This is also configureable somewhere in the server settings. The average server will display a server-specific error/exception when the POST limit is exceeded, usually as HTTP 500 error."

> From: wun...@wunderwood.org
> Subject: Re: Performing DIH on predefined list of IDS
> Date: Sat, 21 Feb 2015 09:50:46 -0800
> To: solr-user@lucene.apache.org
>
> The HTTP protocol does not set a limit on GET URL size, but individual web
> servers usually do. You should get a response code of “414 Request-URI Too
> Long” when the URL is too long.
>
> This limit is usually configurable.
Re: Performing DIH on predefined list of IDS
The HTTP protocol does not set a limit on GET URL size, but individual web servers usually do. You should get a response code of “414 Request-URI Too Long” when the URL is too long.

This limit is usually configurable.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 21, 2015, at 12:46 AM, steve wrote:

> Careful with the GETs! There is a real, hard limit on the length of a GET url
> (in the low hundreds of characters). That's why a POST is so much better for
> complex queries; the limit is in the hundreds of MegaBytes.
>
>> Date: Sat, 21 Feb 2015 01:42:03 -0700
>> From: osta...@gmail.com
>> To: solr-user@lucene.apache.org
>> Subject: Re: Performing DIH on predefined list of IDS
>>
>> Yes, you right, I am not using a DB.
>> SolrEntityProcessor is using a GET method, so I will need to send
>> relatively big URL ( something like a hundreds of ids ) hope it will be
>> possible.
>>
>> Any way I think it is the only method to perform reindex if I want to
>> control it and be able to continue from any point in case of failure.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
RE: Performing DIH on predefined list of IDS
And I'm familiar with the setup and configuration using Python, JavaScript, and PHP; not at all with Java.

> Date: Sat, 21 Feb 2015 01:52:07 -0700
> From: osta...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: RE: Performing DIH on predefined list of IDS
>
> That's right, but I am not sure that if it is works with Get I will able to
> use Post without changing it.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187838.html
> Sent from the Solr - User mailing list archive at Nabble.com.
RE: Performing DIH on predefined list of IDS
That's right, but I am not sure that, if it works with GET, I will be able to use POST without changing it.

--
View this message in context: http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187838.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Performing DIH on predefined list of IDS
Careful with the GETs! There is a real, hard limit on the length of a GET url (in the low hundreds of characters). That's why a POST is so much better for complex queries; the limit is in the hundreds of MegaBytes.

> Date: Sat, 21 Feb 2015 01:42:03 -0700
> From: osta...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Performing DIH on predefined list of IDS
>
> Yes, you right, I am not using a DB.
> SolrEntityProcessor is using a GET method, so I will need to send
> relatively big URL ( something like a hundreds of ids ) hope it will be
> possible.
>
> Any way I think it is the only method to perform reindex if I want to
> control it and be able to continue from any point in case of failure.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performing DIH on predefined list of IDS
Yes, you are right, I am not using a DB.

SolrEntityProcessor uses the GET method, so I will need to send a relatively big URL (something like hundreds of IDs). I hope that will be possible.

Anyway, I think it is the only way to perform a reindex if I want to control it and be able to continue from any point in case of failure.

--
View this message in context: http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performing DIH on predefined list of IDS
On 2/20/2015 3:46 PM, Shawn Heisey wrote:
> If the URL parameter is "idlist" then you can use
> ${dih.request.idlist} in your SELECT statement.

I realized after I sent this that you are not using a database ... the list would simply go in the query you send to the other server. I don't know whether the request that the SolrEntityProcessor sends is a GET or a POST, so for a really large list of IDs, you might need to edit the container config on both servers.

Thanks,
Shawn
Re: Performing DIH on predefined list of IDS
On 2/20/2015 2:57 PM, SolrUser1543 wrote:
> That's the reason that I want to run on predefined list of IDs.
> In this case I will able to restart from any point and to know about filed
> IDs.

You can include information on a URL parameter and then use that URL parameter inside your dih config. If the URL parameter is "idlist" then you can use ${dih.request.idlist} in your SELECT statement.

Be aware that most servlet containers have a default header length limit of about 8192 characters, affecting the length of the URL that can be sent successfully. If the list of IDs is going to get huge, you will either need to switch from a GET to a POST request where the parameter is in the post body, or increase the header length limit in the servlet container that is running Solr.

Thanks,
Shawn
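A minimal sketch of the length check described above, assuming the 8192-byte default and a hypothetical handler path `/solr/collection1/dataimport`. The helper names are made up for illustration. Only the request line is measured here, while other headers also count toward the container's limit, so a real check should leave some margin.

```python
from urllib.parse import urlencode

# Assumption: the 8192-byte default header limit mentioned in the thread.
HEADER_LIMIT = 8192


def request_line(path: str, params: dict) -> str:
    """Return the HTTP request line a GET with these parameters would produce."""
    return f"GET {path}?{urlencode(params)} HTTP/1.1"


def fits_in_get(path: str, params: dict, limit: int = HEADER_LIMIT) -> bool:
    """True if the request line alone stays within `limit` bytes."""
    # urlencode percent-encodes its output, so the result is plain ASCII.
    return len(request_line(path, params).encode("ascii")) <= limit


# Hypothetical core name; "idlist" is the parameter name used in the thread.
DIH_PATH = "/solr/collection1/dataimport"
small = {"command": "full-import", "idlist": " ".join(str(i) for i in range(100))}
large = {"command": "full-import", "idlist": " ".join(str(i) for i in range(5000))}

small_ok = fits_in_get(DIH_PATH, small)
large_ok = fits_in_get(DIH_PATH, large)
```

With these inputs the 100-ID list fits comfortably, while the 5000-ID list does not, which is the point at which the container limit or the request method would have to change.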
Re: Performing DIH on predefined list of IDS
My index has about 110 million documents. The index is split over several shards. Maybe the number is not so big, but each document is relatively large.

The reason to perform the reindex is something like adding new fields, or adding some update processor which can extract something from one field and put it in another, etc.

Each time I need to reindex data, I create a new collection and start importing data from the old one. This gives the update processors the opportunity to act.

The DIH runs with a *:* query and takes some number of items each time. In case of an exception, the process stops in the middle and I can't restart from that point.

That's the reason that I want to run on a predefined list of IDs. In this case I will be able to restart from any point and to know about failed IDs.

--
View this message in context: http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187753.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performing DIH on predefined list of IDS
It's a little bit hard to get the overall context, e.g. why you live with OOME as usual, what the reasoning is for pulling from one index to another, and what's added during this process.

Make sure that you are aware of http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor, which queries another Solr, and http://wiki.apache.org/solr/DataImportHandler#LogTransformer, which you can use to log recently imported IDs, to be able to restart indexing from that point.

You can drop me more details in your native language if you wish.

On Fri, Feb 20, 2015 at 1:32 PM, SolrUser1543 wrote:

> Relatively frequently (about a once a month) we need to reindex the data, by
> using DIH and copying the data from one index to another.
> Because of the fact that we have a large index, it could take from 12 to 24
> hours to complete. At the same time the old index is being queried by users.
> Sometimes DIH could be interrupted at the middle, because of some unexpected
> exception caused by OutOfMemory or something else (many times it failed when
> more than 90 % was completed).
> More than this, almost every time, some items are missing at new the index.
> It is very complicated to find them.
> At this stage I can't be sure about what documents exactly were missed and I
> have to do it again and waiting for many hours. At the same time the old
> index constantly receives new items.
>
> I want to suggest the following way to solve the problem:
> • Get list of all item ids ( call LUCINE API , like CLUE does for example )
> • Start DIH, which will iterate over those ids and each time make a
> query for n items.
> 1. Of course original DIH class should be changed to support it.
> • This will give the following advantages :
> 1. I will know exactly what items were failed.
> 2. I can restart the process from any point and in case of DIH failure
> restart it from the point of failure.
--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
<http://www.griddynamics.com>
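The restart idea mentioned above (log the imported IDs, then resume from the last one) can be sketched roughly as follows. This assumes the IDs are processed in a fixed, known order and that the last successfully imported ID has already been extracted from the log. The function name is hypothetical, and nothing here touches LogTransformer itself.

```python
from typing import List, Optional, Sequence


def remaining_ids(ordered_ids: Sequence[str], last_logged: Optional[str]) -> List[str]:
    """Return the IDs still to import, given the last ID seen in the import log."""
    if last_logged is None:
        # Nothing was logged yet, so the whole list is still pending.
        return list(ordered_ids)
    try:
        position = ordered_ids.index(last_logged)
    except ValueError:
        # The logged ID is not in the list; starting over is the safe choice.
        return list(ordered_ids)
    return list(ordered_ids[position + 1:])


# Example: the log shows that "b" was the last document imported.
todo = remaining_ids(["a", "b", "c", "d"], "b")
```

The fallback to a full restart when the logged ID is unknown is a deliberately conservative choice: reimporting a document is cheaper than silently skipping one.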
Re: Performing DIH on predefined list of IDS
Personally, I much prefer indexing from an independent SolrJ client to using DIH when I have to take explicit control of errors & etc. Here's an example: https://lucidworks.com/blog/indexing-with-solrj/

In your example, you seem to be assuming that the Lucene IDs (and here I'm assuming you're not talking about the internal Lucene ID) correspond to some kind of primary key in your database table. But the correspondence isn't necessarily straightforward. How would it handle composite keys?

I'll leave actual comments on DIH's internals to people who, you know, actually understand the code ;)...

Erick

On Fri, Feb 20, 2015 at 2:32 AM, SolrUser1543 wrote:

> Relatively frequently (about a once a month) we need to reindex the data, by
> using DIH and copying the data from one index to another.
> Because of the fact that we have a large index, it could take from 12 to 24
> hours to complete. At the same time the old index is being queried by users.
> Sometimes DIH could be interrupted at the middle, because of some unexpected
> exception caused by OutOfMemory or something else (many times it failed when
> more than 90 % was completed).
> More than this, almost every time, some items are missing at new the index.
> It is very complicated to find them.
> At this stage I can't be sure about what documents exactly were missed and I
> have to do it again and waiting for many hours. At the same time the old
> index constantly receives new items.
>
> I want to suggest the following way to solve the problem:
> • Get list of all item ids ( call LUCINE API , like CLUE does for
> example )
> • Start DIH, which will iterate over those ids and each time make a
> query for n items.
> 1. Of course original DIH class should be changed to support it.
> • This will give the following advantages :
> 1. I will know exactly what items were failed.
> 2. I can restart the process from any point and in case of DIH failure
> restart it from the point of failure.
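What "explicit control of errors" in an independent client could look like, as a rough sketch. This is not SolrJ and makes no Solr calls: the copy step is a hypothetical callable supplied by the caller, and the demo uses a fake one that fails on a chosen batch.

```python
from typing import Callable, List, Sequence, Tuple


def reindex(
    batches: Sequence[Sequence[str]],
    copy_batch: Callable[[Sequence[str]], None],
) -> Tuple[int, List[Tuple[int, str]]]:
    """Copy every batch, recording failures instead of aborting the run.

    Returns (number of batches copied, list of (batch index, error text)).
    """
    copied = 0
    failed: List[Tuple[int, str]] = []
    for index, batch in enumerate(batches):
        try:
            copy_batch(batch)
            copied += 1
        except Exception as exc:
            # Record the failure and keep going, so one bad batch
            # does not cost the whole run.
            failed.append((index, str(exc)))
    return copied, failed


def fake_copy(batch: Sequence[str]) -> None:
    """Stand-in for the real copy step; fails whenever the batch contains ID 13."""
    if "13" in batch:
        raise RuntimeError("simulated failure")


copied, failed = reindex([["1", "2"], ["13", "14"], ["20"]], fake_copy)
```

Because the failed batch indexes are returned rather than thrown, the caller can retry exactly those batches later instead of repeating a 12 to 24 hour import.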
Performing DIH on predefined list of IDS
Relatively frequently (about once a month) we need to reindex the data, using DIH to copy the data from one index to another. Because we have a large index, it can take from 12 to 24 hours to complete. At the same time the old index is being queried by users.

Sometimes DIH is interrupted in the middle because of some unexpected exception caused by OutOfMemory or something else (many times it failed when more than 90% was completed). More than this, almost every time, some items are missing from the new index. It is very complicated to find them. At this stage I can't be sure exactly which documents were missed, and I have to do it again and wait for many hours. At the same time the old index constantly receives new items.

I want to suggest the following way to solve the problem:
• Get the list of all item IDs (call the Lucene API, like CLUE does, for example).
• Start DIH, which will iterate over those IDs and each time make a query for n items.
  1. Of course, the original DIH class would have to be changed to support it.
• This will give the following advantages:
  1. I will know exactly which items failed.
  2. I can restart the process from any point and, in case of DIH failure, restart it from the point of failure.

So the main difference is that DIH currently runs on a *:* query, and I suggest running it on a list of IDs.

For example, if I have 1000 docs and want this new DIH to take 100 docs each time, it will do 10 queries, each one with 100 IDs (like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.).

The question is: what do you think about it? Or could all of this be done another way, and I am trying to reinvent the wheel?

--
View this message in context: http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589.html
Sent from the Solr - User mailing list archive at Nabble.com.
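The batching arithmetic in the example above (1000 docs, 100 per query, 10 queries) can be sketched like this. The helper names are hypothetical, and the sketch assumes every ID is a plain token with no spaces or query-syntax characters; real IDs may need escaping or quoting.

```python
from typing import Iterator, List, Sequence


def batches(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def id_query(batch: Sequence[str], field: str = "id") -> str:
    """Build a query of the form id:(1 2 3 ...) for one batch.

    Assumes plain-token IDs; nothing is escaped here.
    """
    return f"{field}:({' '.join(batch)})"


all_ids = [str(i) for i in range(1, 1001)]              # 1000 documents
queries = [id_query(b) for b in batches(all_ids, 100)]  # 10 queries of 100 IDs
```

Each entry in `queries` can be tracked independently, which is what makes it possible to know which batch failed and to resume from it.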