Re: DataImportHandler scheduling
While it may be useful to have a scheduler for simple cases, I think there are too many variables to make it useful for everyone's case. For example, I recently wrote a script that uses the DataImportHandler API to get the status, kick off the import, etc. However, before allowing it to just kick off, I needed to query the database where the data was coming from to make sure it had finished its daily load. If it hadn't finished, the script waits for a while to see if it will, and only then does the load. After the load is finished, it does another check to ensure that the number of docs actually loaded by Solr matches what is expected based on the data in the database.

If a scheduler were built into Solr, it would probably only cover the simple case, and for production you'd probably need to write your own scripts and use your own scheduler anyway to ensure the loads are starting and completing as expected.

> On Sep 1, 2015, at 1:09 PM, William Bell wrote:
>
> We should add a simple scheduler in the UI. It is very useful. To schedule
> various actions:
>
> - Full index
> - Delta Index
> - Replicate
>
>> On Tue, Sep 1, 2015 at 12:41 PM, Shawn Heisey wrote:
>>
>>> On 9/1/2015 11:45 AM, Troy Edwards wrote:
>>> My initial thought was to use scheduling built with DIH:
>>> http://wiki.apache.org/solr/DataImportHandler#Scheduling
>>>
>>> But I think just a cron job should do the same for me.
>>
>> The dataimport scheduler does not exist in any Solr version. This is a
>> proposed feature, with the enhancement issue open for more than four years:
>>
>> https://issues.apache.org/jira/browse/SOLR-2305
>>
>> I have updated the wiki page to state the fact that the scheduler is a
>> proposed improvement, not a usable feature.
>>
>> Thanks,
>> Shawn
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
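The control flow described in this message (wait for the source database, start the import, wait for it to finish, then compare counts) could look roughly like the Python sketch below. It is not the script the poster wrote. Every callable passed in is a hypothetical hook that the caller would implement against their own database and DIH status endpoint, and the wait settings are arbitrary defaults.

```python
import time


def run_guarded_import(source_ready, start_import, import_busy,
                       expected_count, indexed_count,
                       max_wait_checks=6, wait_seconds=600, sleep=time.sleep):
    """Run one import only after the source is ready, then verify the result.

    All callables are hypothetical hooks supplied by the caller:
      source_ready()   -> True once the upstream database load has finished
      start_import()   -> asks DIH to start an import
      import_busy()    -> True while DIH still reports a running import
      expected_count() -> row count taken from the source database
      indexed_count()  -> document count reported by Solr after the import
    """
    # Wait a bounded amount of time for the upstream load to finish.
    for attempt in range(max_wait_checks):
        if source_ready():
            break
        if attempt < max_wait_checks - 1:
            sleep(wait_seconds)
    else:
        # The loop ran out of attempts without the source becoming ready.
        return "source-not-ready"

    start_import()

    # Poll until DIH reports that the import is no longer running.
    while import_busy():
        sleep(wait_seconds)

    # Compare what Solr indexed with what the database says it should hold.
    expected, indexed = expected_count(), indexed_count()
    if indexed != expected:
        return f"count-mismatch: expected {expected}, indexed {indexed}"
    return "ok"
```

Returning a status string rather than raising keeps the sketch easy to call from an external scheduler, which can map anything other than "ok" to an alert or a non-zero exit code.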
Re: DataImportHandler scheduling
We should add a simple scheduler in the UI. It is very useful. To schedule
various actions:

- Full index
- Delta Index
- Replicate

On Tue, Sep 1, 2015 at 12:41 PM, Shawn Heisey wrote:

> On 9/1/2015 11:45 AM, Troy Edwards wrote:
> > My initial thought was to use scheduling built with DIH:
> > http://wiki.apache.org/solr/DataImportHandler#Scheduling
> >
> > But I think just a cron job should do the same for me.
>
> The dataimport scheduler does not exist in any Solr version. This is a
> proposed feature, with the enhancement issue open for more than four years:
>
> https://issues.apache.org/jira/browse/SOLR-2305
>
> I have updated the wiki page to state the fact that the scheduler is a
> proposed improvement, not a usable feature.
>
> Thanks,
> Shawn

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: DataImportHandler scheduling
On 9/1/2015 11:45 AM, Troy Edwards wrote:
> My initial thought was to use scheduling built with DIH:
> http://wiki.apache.org/solr/DataImportHandler#Scheduling
>
> But I think just a cron job should do the same for me.

The dataimport scheduler does not exist in any Solr version. This is a proposed feature, with the enhancement issue open for more than four years:

https://issues.apache.org/jira/browse/SOLR-2305

I have updated the wiki page to state the fact that the scheduler is a proposed improvement, not a usable feature.

Thanks,
Shawn
Re: DataImportHandler scheduling
My initial thought was to use scheduling built with DIH:
http://wiki.apache.org/solr/DataImportHandler#Scheduling

But I think just a cron job should do the same for me.

Thanks

On Tue, Sep 1, 2015 at 8:51 AM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote:

> On 8/31/2015 11:26 AM, Troy Edwards wrote:
> > I am having a hard time finding documentation on DataImportHandler
> > scheduling in SolrCloud. Can someone please post a link to that? I
> > have a requirement that the DIH should be initiated at a specific time
> > Monday through Friday.
>
> Troy, is your question how to use scheduled tasks? Shawn pointed you in
> the right direction. I thought it more likely that you want to schedule a
> cron task to run on any of your servers running SolrCloud, and you want the
> job to run even if the cluster is degraded.
>
> Here's an idea - schedule your job Monday on node 1, Tuesday on node 2,
> etc. That way, if the cluster is degraded (a node is down),
> re-indexing/delta indexing still happens, it just happens slower. You
> can certainly write a zookeeper client to make each cron job compete to see
> who does the job - questions on how to do this should be directed to a
> zookeeper users' mailing list.
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Monday, August 31, 2015 7:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: DataImportHandler scheduling
>
> On 8/31/2015 11:26 AM, Troy Edwards wrote:
> > I am having a hard time finding documentation on DataImportHandler
> > scheduling in SolrCloud. Can someone please post a link to that? I
> > have a requirement that the DIH should be initiated at a specific time
> > Monday through Friday.
>
> Every modern operating system (and most of the previous versions of every
> modern OS) has a built-in task scheduling system. For Windows, it's
> literally called Task Scheduler. For most other operating systems, it's
> called cron.
>
> Including dataimport scheduling capability in Solr has been discussed, and
> I think someone even wrote a working version ... but since every OS already
> has scheduling capability that has had years of time to mature, why should
> Solr reinvent the wheel and take the risk that the implementation will have
> bugs?
>
> Currently virtually all updates to Solr's index must be initiated outside
> of Solr, and there is good reason to make sure that Solr doesn't ever
> modify the index without outside input. The only thing I know of right now
> that can update the index automatically is Document Expiration, but the
> expiration time is decided when the document is indexed, and the original
> indexing action is external to Solr.
>
> https://lucidworks.com/blog/document-expiration/
>
> Thanks,
> Shawn
RE: DataImportHandler scheduling
On 8/31/2015 11:26 AM, Troy Edwards wrote:
> I am having a hard time finding documentation on DataImportHandler
> scheduling in SolrCloud. Can someone please post a link to that? I
> have a requirement that the DIH should be initiated at a specific time
> Monday through Friday.

Troy, is your question how to use scheduled tasks? Shawn pointed you in the right direction. I thought it more likely that you want to schedule a cron task to run on any of your servers running SolrCloud, and you want the job to run even if the cluster is degraded.

Here's an idea - schedule your job Monday on node 1, Tuesday on node 2, etc. That way, if the cluster is degraded (a node is down), re-indexing/delta indexing still happens, it just happens slower. You can certainly write a zookeeper client to make each cron job compete to see who does the job - questions on how to do this should be directed to a zookeeper users' mailing list.

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Monday, August 31, 2015 7:50 PM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler scheduling

On 8/31/2015 11:26 AM, Troy Edwards wrote:
> I am having a hard time finding documentation on DataImportHandler
> scheduling in SolrCloud. Can someone please post a link to that? I
> have a requirement that the DIH should be initiated at a specific time
> Monday through Friday.

Every modern operating system (and most of the previous versions of every modern OS) has a built-in task scheduling system. For Windows, it's literally called Task Scheduler. For most other operating systems, it's called cron.

Including dataimport scheduling capability in Solr has been discussed, and I think someone even wrote a working version ... but since every OS already has scheduling capability that has had years of time to mature, why should Solr reinvent the wheel and take the risk that the implementation will have bugs?

Currently virtually all updates to Solr's index must be initiated outside of Solr, and there is good reason to make sure that Solr doesn't ever modify the index without outside input. The only thing I know of right now that can update the index automatically is Document Expiration, but the expiration time is decided when the document is indexed, and the original indexing action is external to Solr.

https://lucidworks.com/blog/document-expiration/

Thanks,
Shawn
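One way to read the "Monday on node 1, Tuesday on node 2" idea is a small guard at the top of the script that every node's cron job runs at the same time. The Python sketch below is only an illustration of that reading. The node names and the function names are made up, and the weekday-to-node mapping simply wraps around when there are fewer than five nodes.

```python
import datetime


def node_for_day(day, nodes):
    """Return the node that should run the job on `day`, or None on weekends."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday > 4:
        return None
    # Wrap around so that a cluster with fewer than five nodes still
    # covers every working day.
    return nodes[weekday % len(nodes)]


def should_run_here(this_node, nodes, day=None):
    """True when the node named `this_node` owns the job for `day` (default: today)."""
    day = day or datetime.date.today()
    return node_for_day(day, nodes) == this_node
```

With this arrangement a node that is down only costs the runs assigned to it, and the remaining days still run. That is the degraded-but-working behaviour the message describes, without any coordination between nodes.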
Re: DataImportHandler scheduling
On 8/31/2015 11:26 AM, Troy Edwards wrote:
> I am having a hard time finding documentation on DataImportHandler
> scheduling in SolrCloud. Can someone please post a link to that? I have a
> requirement that the DIH should be initiated at a specific time Monday
> through Friday.

Every modern operating system (and most of the previous versions of every modern OS) has a built-in task scheduling system. For Windows, it's literally called Task Scheduler. For most other operating systems, it's called cron.

Including dataimport scheduling capability in Solr has been discussed, and I think someone even wrote a working version ... but since every OS already has scheduling capability that has had years of time to mature, why should Solr reinvent the wheel and take the risk that the implementation will have bugs?

Currently virtually all updates to Solr's index must be initiated outside of Solr, and there is good reason to make sure that Solr doesn't ever modify the index without outside input. The only thing I know of right now that can update the index automatically is Document Expiration, but the expiration time is decided when the document is indexed, and the original indexing action is external to Solr.

https://lucidworks.com/blog/document-expiration/

Thanks,
Shawn
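For the Monday through Friday requirement, the OS scheduler only needs something small to run. The sketch below shows one such script in Python. It assumes the handler is registered at the conventional /dataimport path in solrconfig.xml, and the base URL and core name in the example are placeholders. A crontab entry with 1-5 in the day-of-week field (for example "0 2 * * 1-5", where 02:00 is an arbitrary time) would then run it on weekdays only.

```python
import urllib.parse
import urllib.request


def dih_url(base_url, core, command, **params):
    """Build a DataImportHandler request URL for one core.

    Assumes the handler is registered at /dataimport, which depends on
    the core's solrconfig.xml.
    """
    query = urllib.parse.urlencode({"command": command, "wt": "json", **params})
    return f"{base_url.rstrip('/')}/{core}/dataimport?{query}"


def trigger_import(base_url, core, command="full-import", timeout=30):
    """Send the command to Solr and return the raw response body as text."""
    url = dih_url(base_url, core, command)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call from the cron-driven script (placeholder host and core name):
#   trigger_import("http://localhost:8983/solr", "mycore", "delta-import")
```

The request returns as soon as Solr accepts the command, so a script that needs to know when the import has finished would follow up with requests using command=status.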
RE: DataImportHandler scheduling
So, I think "cron jobs" is not a utility but a pattern: you have cron run curl to invoke something on your web application on localhost (and elsewhere), and it runs the job if the job needs running, so the webapp keeps the state.

There's a utility, cronlock (https://github.com/kvz/cronlock), that runs on top of Redis. I was thinking that a common pattern would be something similar written in Python, using the kazoo module to talk to ZooKeeper. No point writing much Java for a cron job, but Python should be OK. What I don't like about cronlock is that it isn't "run once"; it only avoids overlap, so there's good reason to write something specific to that case.

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: Monday, August 31, 2015 1:35 PM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler scheduling

Hi Troy,

I think folks use cron jobs (with the curl utility) provided by the operating system.

Ahmet

On Monday, August 31, 2015 8:26 PM, Troy Edwards wrote:

I am having a hard time finding documentation on DataImportHandler scheduling in SolrCloud. Can someone please post a link to that? I have a requirement that the DIH should be initiated at a specific time Monday through Friday.

Thanks!
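A "run once" guard of the kind this message asks for could be sketched in Python as below. The idea is that every node's cron job tries to create the same per-day znode and only the node whose create succeeds runs the job. This is an illustration under stated assumptions, not an existing tool: the /cron-claims path layout and the function names are invented here, and the ZooKeeper client and its "node exists" exception are passed in so that the logic does not depend on a live ensemble.

```python
import datetime


def claim_path(job_name, day):
    """Path of the znode that marks `job_name` as claimed for `day` (made-up layout)."""
    return f"/cron-claims/{job_name}/{day.isoformat()}"


def try_claim(zk, job_name, day, already_exists_error):
    """Try to create the claim znode; return True only for the first caller.

    `zk` is any client with a kazoo-style create(path, value, makepath=...)
    method, and `already_exists_error` is the exception it raises when the
    node is already there.
    """
    try:
        zk.create(claim_path(job_name, day), b"", makepath=True)
    except already_exists_error:
        return False
    return True


def run_once(zk, job_name, job, already_exists_error, day=None):
    """Run `job` only if this process wins today's claim; return whether it ran."""
    day = day or datetime.date.today()
    if not try_claim(zk, job_name, day, already_exists_error):
        return False
    job()
    return True


# With kazoo this would be wired up roughly as follows (hosts are placeholders):
#   from kazoo.client import KazooClient
#   from kazoo.exceptions import NodeExistsError
#   zk = KazooClient(hosts="zk1:2181")
#   zk.start()
#   run_once(zk, "dih-full-import", start_the_import, NodeExistsError)
#   zk.stop()
```

The claim znodes are persistent, so they accumulate one per day and need occasional cleanup. A node that wins the claim and then crashes mid-job is not retried that day, which is the trade-off of "run once" as opposed to the overlap avoidance cronlock provides.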
Re: DataImportHandler scheduling
Hi Troy,

I think folks use cron jobs (with the curl utility) provided by the operating system.

Ahmet

On Monday, August 31, 2015 8:26 PM, Troy Edwards wrote:

I am having a hard time finding documentation on DataImportHandler scheduling in SolrCloud. Can someone please post a link to that? I have a requirement that the DIH should be initiated at a specific time Monday through Friday.

Thanks!
DataImportHandler scheduling
I am having a hard time finding documentation on DataImportHandler scheduling in SolrCloud. Can someone please post a link to that? I have a requirement that the DIH should be initiated at a specific time Monday through Friday. Thanks!
Re: Weird memory leak problem with dataimporthandler scheduling
OK. Just typing out the question fixed it. Changing from post to get:

    GetMethod method = new GetMethod(completeUrl);

removed the errors. The reason, I cannot explain...

On Tue, Apr 3, 2012 at 6:46 PM, janne mattila wrote:
> I have implemented dataimporthandler scheduling based on
> http://wiki.apache.org/solr/DataImportHandler#Scheduling. It
> periodically triggers full and delta updates. I'm unpacking the
> original solr.war, adding a few scheduling-related classes such as
> ApplicationListener etc (I have modified the example a lot) and
> repacking the web application.
>
> The scheduling works fine, but when I undeploy solr web application,
> Tomcat gives errors about ThreadLocals that were not cleared:
>
> SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.handler.dataimport.DataImporter$2] (value [org.apache.solr.handler.dataimport.DataImporter$2@b0e2096]) and a value of type [java.util.concurrent.atomic.AtomicLong] (value [2]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
> 3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
> SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.handler.dataimport.DataImporter$3] (value [org.apache.solr.handler.dataimport.DataImporter$3@4c7d5d85]) and a value of type [java.text.SimpleDateFormat] (value [java.text.SimpleDateFormat@4f76f1a0]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
Weird memory leak problem with dataimporthandler scheduling
I have implemented dataimporthandler scheduling based on http://wiki.apache.org/solr/DataImportHandler#Scheduling. It periodically triggers full and delta updates. I'm unpacking the original solr.war, adding a few scheduling-related classes such as ApplicationListener etc (I have modified the example a lot) and repacking the web application.

The scheduling works fine, but when I undeploy the Solr web application, Tomcat gives errors about ThreadLocals that were not cleared:

SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.handler.dataimport.DataImporter$2] (value [org.apache.solr.handler.dataimport.DataImporter$2@b0e2096]) and a value of type [java.util.concurrent.atomic.AtomicLong] (value [2]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.handler.dataimport.DataImporter$3] (value [org.apache.solr.handler.dataimport.DataImporter$3@4c7d5d85]) and a value of type [java.text.SimpleDateFormat] (value [java.text.SimpleDateFormat@4f76f1a0]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@3a86edfe]) and a value of type [org.apache.solr.handler.dataimport.ContextImpl] (value [org.apache.solr.handler.dataimport.ContextImpl@7072dcb6]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value [org.apache.solr.schema.DateField$ThreadLocalDateFormat@4f86a67]) and a value of type [org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat] (value [org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat@6b2ed43a]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.handler.dataimport.DataImporter$2] (value [org.apache.solr.handler.dataimport.DataImporter$2@b0e2096]) and a value of type [java.util.concurrent.atomic.AtomicLong] (value [2]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@3a86edfe]) and a value of type [org.apache.solr.handler.dataimport.ContextImpl] (value [org.apache.solr.handler.dataimport.ContextImpl@511192bd]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.handler.dataimport.DataImporter$3] (value [org.apache.solr.handler.dataimport.DataImporter$3@4c7d5d85]) and a value of type [java.text.SimpleDateFormat] (value [java.text.SimpleDateFormat@4f76f1a0]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
3.4.2012 18:36:49 org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/my-solr] created a ThreadLocal with key of type [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value [org.apache.solr.schema.DateField$ThreadLocalDateFormat@4f86a67]) and a value of type [org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat] (value [org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat@6b2ed43a]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

I have rechecked my code to make sure it should not have any memory leaks. I have traced the cause to this method:

    private void sendHttpPost(String completeUrl, String coreName) {
        HttpClient client = new HttpClient();
        PostMethod method = new PostM