Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Shawn Heisey
On 2/21/2015 6:33 PM, Walter Underwood wrote:
> Never do POST for a read-only request. Never. That only guarantees that you 
> cannot reproduce the problem by looking at the logs.
> 
> If your design requires extremely long GET requests, you may need to re-think 
> your design.

I agree with those sentiments ... but those who consume the services we
provide tend to push the envelope well beyond any reasonable limits.

My Solr install deals with some Solr queries where the GET request is
pushing tens of thousands of characters.  The queries and filters constructed by the
website code for some of the more powerful users are really large.  I
had to configure haproxy and jetty to allow HTTP headers up to 32K.  I'd
like to tell development that we just can't handle it, but with the way
the system is currently structured, there's no other way to get the
results they need.
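
For the record, the two knobs involved look roughly like this (directive
and element names vary across haproxy and Jetty versions, so treat this as
a sketch rather than exact config):

    # haproxy.cfg: the full request line and headers must fit in the
    # connection buffer, so it needs headroom beyond 32K
    global
        tune.bufsize 65536

    <!-- jetty.xml (Jetty 9 style): raise the request header limit -->
    <New id="httpConfig" class="org.eclipse.jetty.server.HttpConfiguration">
      <Set name="requestHeaderSize">32768</Set>
    </New>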

If I were to make it widely known internally that the Solr config is
currently allowing POST requests up to 32 megabytes, I am really scared
to find out what sort of queries development would try to do.  I raised
that particular configuration limit (which defaults to 2MB) for my own
purposes, not for the development group.
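
For anyone looking for that knob: it is the formdataUploadLimitInKB
attribute on <requestParsers> in solrconfig.xml. A sketch of raising the
2MB default to 32MB:

    <requestDispatcher>
      <!-- value is in kilobytes: 32768 KB = 32 MB -->
      <requestParsers enableRemoteStreaming="false"
                      formdataUploadLimitInKB="32768"/>
    </requestDispatcher>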

Thanks,
Shawn



Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Walter Underwood
Am I an expert? Not sure, but I worked on an enterprise search spider and search 
engine for about a decade (Ultraseek Server) and I’ve done customer-facing 
search for another 6+ years.

Let the server reject URLs it cannot handle. Great servers will return a 414, 
good servers will return a 400, broken servers will return a 500, and crapulous 
servers will hang. In nearly all cases, you’ll get a fast fail which won’t hurt 
other users of the site.

Manage your site for zero errors, so you can fix the queries that are too long.

At Chegg, we have people paste entire homework problems into the search for 
homework solutions, and, yes, we have a few queries longer than 8K. But we deal 
with it gracefully.

Never do POST for a read-only request. Never. That only guarantees that you 
cannot reproduce the problem by looking at the logs.

If your design requires extremely long GET requests, you may need to re-think 
your design.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 21, 2015, at 4:45 PM, Shawn Heisey  wrote:

> On 2/21/2015 1:46 AM, steve wrote:
>> Careful with the GETs! There is a real, hard limit on the length of a GET 
>> url (in the low hundreds of characters). That's why a POST is so much better 
>> for complex queries; the limit is in the hundreds of MegaBytes.
> 
> The limit on a GET command (including the GET itself and the protocol
> specifier, usually HTTP/1.1) is normally 8K, or 8192 bytes.  That's the
> default value in Jetty, at least.
> 
> A question for the experts:  Would it be a good idea to force a POST
> request in SolrEntityProcessor?  It may be dealing with parameters that
> have been sent via POST and may exceed the header size limit.
> 
> Thanks,
> Shawn
> 



Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Shawn Heisey
On 2/21/2015 1:46 AM, steve wrote:
> Careful with the GETs! There is a real, hard limit on the length of a GET url 
> (in the low hundreds of characters). That's why a POST is so much better for 
> complex queries; the limit is in the hundreds of MegaBytes.

The limit on a GET command (including the GET itself and the protocol
specifier, usually HTTP/1.1) is normally 8K, or 8192 bytes.  That's the
default value in Jetty, at least.

A question for the experts:  Would it be a good idea to force a POST
request in SolrEntityProcessor?  It may be dealing with parameters that
have been sent via POST and may exceed the header size limit.
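
For what it's worth, SolrJ clients can at least choose POST explicitly; a
minimal sketch, assuming a SolrJ 5.x-era HttpSolrClient and a placeholder
core URL:

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class PostQuery {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point at your own core.
            HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/collection1");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("q", "id:(1 2 3)");
            // METHOD.POST moves the parameters into the request body,
            // so the container's header-size limit no longer applies.
            QueryResponse rsp = client.query(params, SolrRequest.METHOD.POST);
            System.out.println("numFound: " + rsp.getResults().getNumFound());
            client.close();
        }
    }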

Thanks,
Shawn



RE: Performing DIH on predefined list of IDS

2015-02-21 Thread steve
Thank you! Another 4xx error that makes sense. Quoting from the Book of
StackOverFlow
(http://stackoverflow.com/questions/2659952/maximum-length-of-http-get-request):

"Most webservers have a limit of 8192 bytes (8KB), which is usually
configurable somewhere in the server configuration. As to the client side
matter, the HTTP 1.1 specification even warns about this; here's an
extract of chapter 3.2.1:

Note: Servers ought to be cautious about depending on URI lengths above
255 bytes, because some older client or proxy implementations might not
properly support these lengths.

The limit is in MSIE and Safari about 2KB, in Opera about 4KB and in
Firefox about 8KB. We may thus assume that 8KB is the maximum possible
length and that 2KB is a more affordable length to rely on at the server
side and that 255 bytes is the safest length to assume that the entire
URL will come in.

If the limit is exceeded in either the browser or the server, most will
just truncate the characters outside the limit without any warning. Some
servers however may send an HTTP 414 error. If you need to send large
data, then better use POST instead of GET. Its limit is much higher, but
more dependent on the server used than the client. Usually up to around
2GB is allowed by the average webserver. This is also configurable
somewhere in the server settings. The average server will display a
server-specific error/exception when the POST limit is exceeded, usually
as HTTP 500 error."
> From: wun...@wunderwood.org
> Subject: Re: Performing DIH on predefined list of IDS
> Date: Sat, 21 Feb 2015 09:50:46 -0800
> To: solr-user@lucene.apache.org
> 
> The HTTP protocol does not set a limit on GET URL size, but individual web 
> servers usually do. You should get a response code of “414 Request-URI Too 
> Long” when the URL is too long.
> 
> This limit is usually configurable.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> On Feb 21, 2015, at 12:46 AM, steve  wrote:
> 
> > Careful with the GETs! There is a real, hard limit on the length of a GET 
> > url (in the low hundreds of characters). That's why a POST is so much 
> > better for complex queries; the limit is in the hundreds of MegaBytes.
> > 
> >> Date: Sat, 21 Feb 2015 01:42:03 -0700
> >> From: osta...@gmail.com
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Performing DIH on predefined list of IDS
> >> 
> >> Yes, you're right, I am not using a DB.
> >> SolrEntityProcessor is using a GET method, so I will need to send a
> >> relatively big URL (something like hundreds of ids); I hope it will be
> >> possible.
> >> 
> >> Anyway, I think it is the only method to perform a reindex if I want to
> >> control it and be able to continue from any point in case of failure.
> >> 
> >> 
> >> 
> >> --
> >> View this message in context: 
> >> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >   
> 
  

Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Walter Underwood
The HTTP protocol does not set a limit on GET URL size, but individual web 
servers usually do. You should get a response code of “414 Request-URI Too 
Long” when the URL is too long.

This limit is usually configurable.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Feb 21, 2015, at 12:46 AM, steve  wrote:

> Careful with the GETs! There is a real, hard limit on the length of a GET url 
> (in the low hundreds of characters). That's why a POST is so much better for 
> complex queries; the limit is in the hundreds of MegaBytes.
> 
>> Date: Sat, 21 Feb 2015 01:42:03 -0700
>> From: osta...@gmail.com
>> To: solr-user@lucene.apache.org
>> Subject: Re: Performing DIH on predefined list of IDS
>> 
> >> Yes, you're right, I am not using a DB.
> >> SolrEntityProcessor is using a GET method, so I will need to send a
> >> relatively big URL (something like hundreds of ids); I hope it will be
> >> possible.
> >> 
> >> Anyway, I think it is the only method to perform a reindex if I want to
> >> control it and be able to continue from any point in case of failure.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 



RE: Performing DIH on predefined list of IDS

2015-02-21 Thread steve
And I'm familiar with the setup and configuration using Python, JavaScript, and 
PHP; not at all with Java.

> Date: Sat, 21 Feb 2015 01:52:07 -0700
> From: osta...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: RE: Performing DIH on predefined list of IDS
> 
> That's right, but I am not sure that, if it works with GET, I will be
> able to use POST without changing it.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187838.html
> Sent from the Solr - User mailing list archive at Nabble.com.
  

RE: Performing DIH on predefined list of IDS

2015-02-21 Thread SolrUser1543
That's right, but I am not sure that, if it works with GET, I will be
able to use POST without changing it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187838.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Performing DIH on predefined list of IDS

2015-02-21 Thread steve
Careful with the GETs! There is a real, hard limit on the length of a GET url 
(in the low hundreds of characters). That's why a POST is so much better for 
complex queries; the limit is in the hundreds of MegaBytes.

> Date: Sat, 21 Feb 2015 01:42:03 -0700
> From: osta...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Performing DIH on predefined list of IDS
> 
> Yes, you're right, I am not using a DB.
> SolrEntityProcessor is using a GET method, so I will need to send a
> relatively big URL (something like hundreds of ids); I hope it will be
> possible.
> 
> Anyway, I think it is the only method to perform a reindex if I want to
> control it and be able to continue from any point in case of failure.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
> Sent from the Solr - User mailing list archive at Nabble.com.
  

Re: Performing DIH on predefined list of IDS

2015-02-21 Thread SolrUser1543
Yes, you're right, I am not using a DB.
SolrEntityProcessor is using a GET method, so I will need to send a
relatively big URL (something like hundreds of ids); I hope it will be
possible.

Anyway, I think it is the only method to perform a reindex if I want to
control it and be able to continue from any point in case of failure.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performing DIH on predefined list of IDS

2015-02-20 Thread Shawn Heisey
On 2/20/2015 3:46 PM, Shawn Heisey wrote:
> If the URL parameter is "idlist" then you can use
> ${dih.request.idlist} in your SELECT statement.

I realized after I sent this that you are not using a database ... the
list would simply go in the query you send to the other server.  I don't
know whether the request that the SolrEntityProcessor sends is a GET or
a POST, so for a really large list of IDs, you might need to edit the
container config on both servers.

Thanks,
Shawn



Re: Performing DIH on predefined list of IDS

2015-02-20 Thread Shawn Heisey
On 2/20/2015 2:57 PM, SolrUser1543 wrote:
> That's the reason that I want to run on a predefined list of IDs.
> In this case I will be able to restart from any point and to know about
> the failed IDs.

You can include information on a URL parameter and then use that URL
parameter inside your dih config.  If the URL parameter is "idlist" then
you can use ${dih.request.idlist} in your SELECT statement.
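
For illustration, a dih entity using that parameter might look like this
(the table, column, and core names here are made up):

    <entity name="item"
            query="SELECT id, name FROM items
                   WHERE id IN (${dih.request.idlist})"/>

invoked with a request along the lines of:

    http://localhost:8983/solr/collection1/dataimport?command=full-import&idlist=1,2,3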

Be aware that most servlet containers have a default header length limit
of about 8192 characters, affecting the length of the URL that can be
sent successfully.  If the list of IDs is going to get huge, you will
either need to switch from a GET to a POST request where the parameter
is in the post body, or increase the header length limit in the servlet
container that is running Solr.
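
Sketching the POST variant with curl (paths and parameter values are
placeholders), the id list then travels in the request body instead of
the URL:

    curl http://localhost:8983/solr/collection1/dataimport \
         --data-urlencode "command=full-import" \
         --data-urlencode "idlist=1,2,3,4,5"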

Thanks,
Shawn



Re: Performing DIH on predefined list of IDS

2015-02-20 Thread SolrUser1543
My index has about 110 million documents. The index is split over several
shards.
Maybe the number is not so big, but each document is relatively large.

The reason to perform the reindex is something like adding new fields, or
adding some update processor which can extract something from one field
and put it in another, etc.

Each time I need to reindex data, I create a new collection and start to
import data from the old one.
That gives the update processors an opportunity to act.

The DIH runs with a *:* query and takes some number of items each time.
In case of an exception, the process stops in the middle and I can't
restart from that point.

That's the reason that I want to run on a predefined list of IDs.
In this case I will be able to restart from any point and to know about
the failed IDs.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187753.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performing DIH on predefined list of IDS

2015-02-20 Thread Mikhail Khludnev
It's a little bit hard to get the overall context, e.g. why you live with
OOME as usual, what the reasoning is for pulling from one index to
another, and what's added during this process.

Make sure that you are aware of
http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor which
queries another Solr instance, and
http://wiki.apache.org/solr/DataImportHandler#LogTransformer which you can
use to log recently imported ids, to be able to restart indexing from that
point; a rough sketch combining the two is below.
You can drop me more details in your native language if you wish.
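
The sketch (host, core, and field names are placeholders):

    <dataConfig>
      <document>
        <!-- Pull documents from the old Solr and log each imported id,
             so a failed run can be restarted near where it stopped. -->
        <entity name="sep"
                processor="SolrEntityProcessor"
                url="http://oldhost:8983/solr/old_collection"
                query="*:*"
                rows="100"
                transformer="LogTransformer"
                logTemplate="imported id ${sep.id}"
                logLevel="info"/>
      </document>
    </dataConfig>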

On Fri, Feb 20, 2015 at 1:32 PM, SolrUser1543  wrote:

> Relatively frequently (about once a month) we need to reindex the data,
> by using DIH and copying the data from one index to another.
> Because of the fact that we have a large index, it could take from 12 to 24
> hours to complete. At the same time the old index is being queried by
> users.
> Sometimes DIH could be interrupted in the middle, because of some
> unexpected exception caused by OutOfMemory or something else (many times
> it failed when more than 90% was completed).
> More than this, almost every time, some items are missing from the new
> index.
> It is very complicated to find them.
> At this stage I can't be sure about what documents exactly were missed
> and I have to do it again and wait for many hours. At the same time the
> old index constantly receives new items.
>
> I want to suggest the following way to solve the problem:
> •   Get a list of all item ids (call the Lucene API, like CLUE does,
> for example).
> •   Start DIH, which will iterate over those ids and each time make a
> query for n items.
> 1.  Of course the original DIH class should be changed to support it.
> •   This will give the following advantages:
> 1.  I will know exactly what items failed.
> 2.  I can restart the process from any point and, in case of DIH
> failure, restart it from the point of failure.
>
>
> so the main difference will be that now DIH runs on a *:* query and I
> suggest to run it on a list of IDs
>
> for example, if I have 1000 docs and want this new DIH to take 100 docs
> each time, it will do 10 queries, each one with 100 IDs (like
> id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.)
>
> The question is: what do you think about it? Or could all of this be
> done another way and am I trying to reinvent the wheel?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Performing DIH on predefined list of IDS

2015-02-20 Thread Erick Erickson
Personally, I much prefer indexing from an independent SolrJ client
to using DIH when I have to take explicit control of errors, etc.
Here's an example:
https://lucidworks.com/blog/indexing-with-solrj/
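
A bare-bones sketch of that approach for the reindex case discussed here:
copy documents between two cores in id batches, logging any batch that
fails so the run can resume from that point. Core URLs, the id field, and
the id list itself are placeholders; in practice the ids would come from
something like CLUE or the terms component:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchReindex {
        public static void main(String[] args) throws Exception {
            HttpSolrClient source =
                new HttpSolrClient("http://oldhost:8983/solr/old_collection");
            HttpSolrClient target =
                new HttpSolrClient("http://newhost:8983/solr/new_collection");

            // Placeholder id list; load the real one from CLUE output etc.
            List<String> ids = Arrays.asList("1", "2", "3", "4");
            int batchSize = 2;

            for (int i = 0; i < ids.size(); i += batchSize) {
                List<String> slice =
                    ids.subList(i, Math.min(i + batchSize, ids.size()));
                SolrQuery q =
                    new SolrQuery("id:(" + String.join(" ", slice) + ")");
                q.setRows(slice.size());
                try {
                    List<SolrInputDocument> out = new ArrayList<>();
                    for (SolrDocument d : source.query(q).getResults()) {
                        SolrInputDocument in = new SolrInputDocument();
                        for (String f : d.getFieldNames()) {
                            // _version_ must not be copied verbatim
                            if (!"_version_".equals(f)) {
                                in.addField(f, d.getFieldValue(f));
                            }
                        }
                        out.add(in);
                    }
                    if (!out.isEmpty()) {
                        target.add(out);
                    }
                } catch (Exception e) {
                    // Record the failed slice; the run can restart here.
                    System.err.println("batch at offset " + i
                        + " failed: " + slice);
                }
            }
            target.commit();
            source.close();
            target.close();
        }
    }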

In your example, you seem to be assuming that the Lucene IDs
(and here I'm assuming you're not talking about the internal Lucene
ID) correspond to some kind of primary key in your database table.
But the correspondence isn't necessarily straightforward; how
would it handle composite keys?

I'll leave actual comments on DIH's internals to people who, you know,
actually understand the code ;)...

Erick



On Fri, Feb 20, 2015 at 2:32 AM, SolrUser1543  wrote:
> Relatively frequently (about once a month) we need to reindex the data, by
> using DIH and copying the data from one index to another.
> Because of the fact that we have a large index, it could take from 12 to 24
> hours to complete. At the same time the old index is being queried by users.
> Sometimes DIH could be interrupted in the middle, because of some unexpected
> exception caused by OutOfMemory or something else (many times it failed when
> more than 90% was completed).
> More than this, almost every time, some items are missing from the new index.
> It is very complicated to find them.
> At this stage I can't be sure about what documents exactly were missed and I
> have to do it again and wait for many hours. At the same time the old
> index constantly receives new items.
>
> I want to suggest the following way to solve the problem:
> •   Get a list of all item ids (call the Lucene API, like CLUE does,
> for example).
> •   Start DIH, which will iterate over those ids and each time make a
> query for n items.
> 1.  Of course the original DIH class should be changed to support it.
> •   This will give the following advantages:
> 1.  I will know exactly what items failed.
> 2.  I can restart the process from any point and, in case of DIH
> failure, restart it from the point of failure.
>
>
> so the main difference will be that now DIH runs on a *:* query and I
> suggest to run it on a list of IDs
>
> for example, if I have 1000 docs and want this new DIH to take 100 docs
> each time, it will do 10 queries, each one with 100 IDs (like
> id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.)
>
> The question is: what do you think about it? Or could all of this be
> done another way and am I trying to reinvent the wheel?
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589.html
> Sent from the Solr - User mailing list archive at Nabble.com.