[jira] Updated: (SOLR-1330) the details command shows current replication status when no replication is going on

2009-08-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-1330:
-

Attachment: SOLR-1330.patch

> the details command shows current replication status when no replication is 
> going on
> 
>
> Key: SOLR-1330
> URL: https://issues.apache.org/jira/browse/SOLR-1330
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-1330.patch, SOLR-1330.patch
>
>
> The details of current replication should be shown only when a replication is 
> going on

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1330) the details command shows current replication status when no replication is going on

2009-08-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-1330:
-

Attachment: (was: SOLR-1330.patch)

> the details command shows current replication status when no replication is 
> going on
> 
>
> Key: SOLR-1330
> URL: https://issues.apache.org/jira/browse/SOLR-1330
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-1330.patch, SOLR-1330.patch
>
>
> The details of current replication should be shown only when a replication is 
> going on

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1330) the details command shows current replication status when no replication is going on

2009-08-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-1330:
-

Attachment: SOLR-1330.patch

also keep at least the 10 latest replication timestamps
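The note above suggests retaining a bounded history of recent replication times. A minimal sketch of such a bounded history follows; the class and method names are hypothetical and are not taken from the attached patch:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: keeps only the N most recent replication timestamps,
// newest first, dropping the oldest once the cap is exceeded.
public class ReplicationTimeline {
    private final int max;
    private final Deque<Long> times = new ArrayDeque<>();

    public ReplicationTimeline(int max) { this.max = max; }

    public synchronized void record(long timestampMillis) {
        times.addFirst(timestampMillis);               // newest entry goes first
        while (times.size() > max) times.removeLast(); // evict beyond the cap
    }

    public synchronized List<Long> latest() { return new ArrayList<>(times); }

    public static void main(String[] args) {
        ReplicationTimeline t = new ReplicationTimeline(10);
        for (long i = 1; i <= 15; i++) t.record(i);
        System.out.println(t.latest()); // newest-first, capped at 10 entries
    }
}
```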

> the details command shows current replication status when no replication is 
> going on
> 
>
> Key: SOLR-1330
> URL: https://issues.apache.org/jira/browse/SOLR-1330
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-1330.patch, SOLR-1330.patch
>
>
> The details of current replication should be shown only when a replication is 
> going on

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1330) the details command shows current replication status when no replication is going on

2009-08-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-1330:
-

Attachment: SOLR-1330.patch

> the details command shows current replication status when no replication is 
> going on
> 
>
> Key: SOLR-1330
> URL: https://issues.apache.org/jira/browse/SOLR-1330
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-1330.patch
>
>
> The details of current replication should be shown only when a replication is 
> going on

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-1330) the details command shows current replication status when no replication is going on

2009-08-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul reassigned SOLR-1330:


Assignee: Noble Paul

> the details command shows current replication status when no replication is 
> going on
> 
>
> Key: SOLR-1330
> URL: https://issues.apache.org/jira/browse/SOLR-1330
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-1330.patch
>
>
> The details of current replication should be shown only when a replication is 
> going on

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1330) the details command shows current replication status when no replication is going on

2009-08-02 Thread Noble Paul (JIRA)
the details command shows current replication status when no replication is 
going on


 Key: SOLR-1330
 URL: https://issues.apache.org/jira/browse/SOLR-1330
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Reporter: Noble Paul
Priority: Minor
 Fix For: 1.4


The details of current replication should be shown only when a replication is 
going on

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: trie fields default in example schema

2009-08-02 Thread Yonik Seeley
On Sun, Aug 2, 2009 at 10:41 PM, Otis Gospodnetic wrote:
> Would it make sense to instead add a new tint(eger) type rather than renaming
> integer to pinteger? (thinking about people upgrading to Solr 1.4).

People upgrading can always use their existing schemas - remember that
the naming is local to the schema... the class always defines the real
behavior and the java class names won't be changing.

 -Yonik
http://www.lucidimagination.com


Re: trie fields default in example schema

2009-08-02 Thread Otis Gospodnetic
Would it make sense to instead add a new tint(eger) type rather than renaming
integer to pinteger? (thinking about people upgrading to Solr 1.4).

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR




- Original Message 
> From: Yonik Seeley 
> To: solr-dev@lucene.apache.org
> Sent: Sunday, August 2, 2009 3:01:09 PM
> Subject: trie fields default in example schema
> 
> I'm working on a jumbo trie patch (just many smaller trie related
> issues at once) - SOLR-1288.
> 
> Anyway, I think support will be good enough for 1.4 that we should
> make types like "integer" in the example schema be based on the trie
> fields.  The current "integer" fields should be renamed to "pinteger"
> (for plain integer), and have a recommended use only for compatibility
> with other/older indexes.  People have mistakenly used the plain
> integer in the past based on the name, so I think we should fix the
> naming.
> 
> The trie based fields should have lower memory footprint in the
> fieldcache and are faster for a lookup (the only reason to use plain
> ints in the past)... sint uses StringIndex for historical reasons - we
> had no other option... we could upgrade the existing sint fields, but
> it wouldn't be quite 100% compatible and there's little reason since
> we have the trie fields now.
> 
> -Yonik
> http://www.lucidimagination.com



[jira] Created: (SOLR-1329) StatsComponent needs trie support

2009-08-02 Thread Yonik Seeley (JIRA)
StatsComponent needs trie support
-

 Key: SOLR-1329
 URL: https://issues.apache.org/jira/browse/SOLR-1329
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Yonik Seeley


Currently, the stats component uses FieldCache.StringIndex - won't work for 
trie fields.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-1089) do write to Solr in a separate thread

2009-08-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul reassigned SOLR-1089:


Assignee: Noble Paul

> do write to Solr in a separate thread
> -
>
> Key: SOLR-1089
> URL: https://issues.apache.org/jira/browse/SOLR-1089
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Reporter: Noble Paul
>Assignee: Noble Paul
> Attachments: SOLR-1089.patch, SOLR-1089.patch, SOLR-1089.patch
>
>
> import can be made faster if the write is done in a different thread

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



trie fields default in example schema

2009-08-02 Thread Yonik Seeley
I'm working on a jumbo trie patch (just many smaller trie related
issues at once) - SOLR-1288.

Anyway, I think support will be good enough for 1.4 that we should
make types like "integer" in the example schema be based on the trie
fields.  The current "integer" fields should be renamed to "pinteger"
(for plain integer), and have a recommended use only for compatibility
with other/older indexes.  People have mistakenly used the plain
integer in the past based on the name, so I think we should fix the
naming.

The trie based fields should have lower memory footprint in the
fieldcache and are faster for a lookup (the only reason to use plain
ints in the past)... sint uses StringIndex for historical reasons - we
had no other option... we could upgrade the existing sint fields, but
it wouldn't be quite 100% compatible and there's little reason since
we have the trie fields now.

-Yonik
http://www.lucidimagination.com
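As a rough illustration of the rename Yonik proposes, the example schema could define the trie-based type alongside a plain type kept only for compatibility. This fragment is hypothetical (the final field type names and attributes in the shipped Solr 1.4 example schema may differ):

```xml
<!-- Hypothetical schema.xml fragment illustrating the proposal; the names
     and attribute values here are examples, not the committed schema. -->
<types>
  <!-- trie-based int: lower FieldCache footprint, fast range queries -->
  <fieldType name="integer" class="solr.TrieIntField"
             precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
  <!-- plain int, recommended only for compatibility with older indexes -->
  <fieldType name="pinteger" class="solr.IntField" omitNorms="true"/>
</types>
```

Because the naming is local to the schema, existing schemas that map "integer" to solr.IntField keep working unchanged; only the example schema's defaults would move to the trie classes.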


[jira] Updated: (SOLR-1089) do write to Solr in a separate thread

2009-08-02 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-1089:
-

Attachment: SOLR-1089.patch

> do write to Solr in a separate thread
> -
>
> Key: SOLR-1089
> URL: https://issues.apache.org/jira/browse/SOLR-1089
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Reporter: Noble Paul
> Attachments: SOLR-1089.patch, SOLR-1089.patch, SOLR-1089.patch
>
>
> import can be made faster if the write is done in a different thread

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1328) implement date faceting for trie date

2009-08-02 Thread Yonik Seeley (JIRA)
implement date faceting for trie date
-

 Key: SOLR-1328
 URL: https://issues.apache.org/jira/browse/SOLR-1328
 Project: Solr
  Issue Type: Sub-task
Affects Versions: 1.4
Reporter: Yonik Seeley
Assignee: Yonik Seeley
 Fix For: 1.4


implement date faceting for trie date

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Queries regarding a "ParallelDataImportHandler"

2009-08-02 Thread Avlesh Singh
>
> There can be a batch command (which) will take in multiple commands in one
> http request.
>
You seem to be obsessed with this approach, Noble.
SOLR-1093 also echoes the same sentiments :)
I personally find this approach a bit restrictive and difficult to adapt to.
IMHO, it is better handled as configuration, i.e. the user tells us how the
single task can be "batched" (or 'sliced', as you call it) while configuring
the Parallel(or, MultiThreaded) DIH inside solrconfig.

As an example, for non-jdbc data sources where batching might be difficult
to achieve in an abstract way, the user might choose to configure different
data-config.xml's (for different DIH instances) altogether.

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् 

> On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh wrote:
> > I have one more question w.r.t the MultiThreaded DIH - What would be the
> > logic behind distributing tasks to threads?
> >
> > I am sorry to have not mentioned this earlier - In my case, I take a
> "count
> > query" parameter as a configuration element. Based on this count and the
> > maxNumberOfDIHInstances, task assignment scheduling is done by
> "injecting"
> > limit and offset values in the import query for each DIH instance.
> > And this is, one of the reasons, why I call it a
> ParallelDataImportHandler.
> There can be a batch command that will take in multiple commands in one
> http request. so it will be like invoking multiple DIH instances and
> the user will have to find ways to split up the whole task into
> multiple 'slices'. DIH in turn would fire up multiple threads and once
> all the threads are returned it should issue a commit
>
> this is a very dumb implementation but it is a very easy path.
> >
> > Cheers
> > Avlesh
> >
> > On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh  wrote:
> >
> >> run the add() calls to Solr in a dedicated thread
> >>
> >> Makes absolute sense. This would actually mean, DIH sits on top of all
> the
> >> add/update operations making it easier to implement a multi-threaded
> DIH.
> >>
> >> I would create a JIRA issue, right away.
> >> However, I would still love to see responses to my problems due to
> >> limitations in 1.3
> >>
> >> Cheers
> >> Avlesh
> >>
> >> 2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् 
> >>
> >>> a multithreaded DIH is in my top priority list. There are multiple
> >>> approaches
> >>>
> >>> 1) create multiple instances of dataImporter instances in the same DIH
> >>> instance and run them in parallel and commit when all of them are done
> >>> 2) run the add() calls to Solr in a dedicated thread
> >>> 3) make DIH automatically multithreaded . This is much harder to
> >>> implement.
> >>>
> >>> but #1 and #2 can be implemented with ease. It does not have to
> >>> be another implementation called ParallelDataImportHandler. I believe
> >>> it can be done in DIH itself
> >>>
> >>> you may not need to create a project in google code. you can open a
> >>> JIRA issue and start posting patches and we can put it back into Solr.
> >>>
> >>> .
> >>>
> >>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh wrote:
> >>> > In my quest to improve indexing time (in a multi-core environment), I
> >>> tried
> >>> > writing a Solr RequestHandler called ParallelDataImportHandler.
> >>> > I had a few lame questions to begin with, which Noble and Shalin
> >>> answered
> >>> > here -
> >>> >
> >>>
> http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >>> >
> >>> > As the name suggests, the handler, when invoked, tries to execute
> >>> multiple
> >>> > DIH instances on the same core in parallel. Of course the catch here is
> >>> > that, only those data-sources that can be batched can benefit from this
> >>> > handler. In my case, I am writing this for import from a MySQL
> database.
> >>> So,
> >>> > I have a single data-config.xml, in which the query has to add
> >>> placeholders
> >>> > for "limit" and "offset". Each DIH instance uses the same data-config
> >>> file,
> >>> > and replaces its own values for the limit and offset (which is in
> fact
> >>> > supplied by the parent ParallelDataImportHandler).
> >>> >
> >>> > I am achieving this by making my handler SolrCoreAware, and creating
> >>> > maxNumberOfDIHInstances (configurable) in the inform method. These
> >>> instances
> >>> > are then initialized and  registered with the core. Whenever a
> request
> >>> comes
> >>> > in, the ParallelDataImportHandler delegates the task to these
> instances,
> >>> > schedules the remainder and aggregates responses from each of these
> >>> > instances to return back to the user.
> >>> >
> >>> > Thankfully, all of these worked, and preliminary benchmarking with
> >>> 5million
> >>> > records indicated 50% decrease in re-indexing time. Moreover, all my
> >>> cores
> >>> > (Solr in my case is hosted on a quad-core machine), indicated above
> 70%
> >>> CPU
> >>> > utilization. All that I could have asked for!
> >>> >
> >>>

Re: Queries regarding a "ParallelDataImportHandler"

2009-08-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh wrote:
> I have one more question w.r.t the MultiThreaded DIH - What would be the
> logic behind distributing tasks to threads?
>
> I am sorry to have not mentioned this earlier - In my case, I take a "count
> query" parameter as a configuration element. Based on this count and the
> maxNumberOfDIHInstances, task assignment scheduling is done by "injecting"
> limit and offset values in the import query for each DIH instance.
> And this is, one of the reasons, why I call it a ParallelDataImportHandler.
There can be a batch command that will take in multiple commands in one
http request. so it will be like invoking multiple DIH instances and
the user will have to find ways to split up the whole task into
multiple 'slices'. DIH in turn would fire up multiple threads and once
all the threads are returned it should issue a commit

this is a very dumb implementation but it is a very easy path.
>
> Cheers
> Avlesh
>
> On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh  wrote:
>
>> run the add() calls to Solr in a dedicated thread
>>
>> Makes absolute sense. This would actually mean, DIH sits on top of all the
>> add/update operations making it easier to implement a multi-threaded DIH.
>>
>> I would create a JIRA issue, right away.
>> However, I would still love to see responses to my problems due to
>> limitations in 1.3
>>
>> Cheers
>> Avlesh
>>
>> 2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् 
>>
>>> a multithreaded DIH is in my top priority list. There are multiple
>>> approaches
>>>
>>> 1) create multiple instances of dataImporter instances in the same DIH
>>> instance and run them in parallel and commit when all of them are done
>>> 2) run the add() calls to Solr in a dedicated thread
>>> 3) make DIH automatically multithreaded . This is much harder to
>>> implement.
>>>
>>> but #1 and #2 can be implemented with ease. It does not have to
>>> be another implementation called ParallelDataImportHandler. I believe
>>> it can be done in DIH itself
>>>
>>> you may not need to create a project in google code. you can open a
>>> JIRA issue and start posting patches and we can put it back into Solr.
>>>
>>> .
>>>
>>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh wrote:
>>> > In my quest to improve indexing time (in a multi-core environment), I
>>> tried
>>> > writing a Solr RequestHandler called ParallelDataImportHandler.
>>> > I had a few lame questions to begin with, which Noble and Shalin
>>> answered
>>> > here -
>>> >
>>> http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
>>> >
>>> > As the name suggests, the handler, when invoked, tries to execute
>>> multiple
>>> > DIH instances on the same core in parallel. Of course the catch here is
>>> > that, only those data-sources that can be batched can benefit from this
>>> > handler. In my case, I am writing this for import from a MySQL database.
>>> So,
>>> > I have a single data-config.xml, in which the query has to add
>>> placeholders
>>> > for "limit" and "offset". Each DIH instance uses the same data-config
>>> file,
>>> > and replaces its own values for the limit and offset (which is in fact
>>> > supplied by the parent ParallelDataImportHandler).
>>> >
>>> > I am achieving this by making my handler SolrCoreAware, and creating
>>> > maxNumberOfDIHInstances (configurable) in the inform method. These
>>> instances
>>> > are then initialized and  registered with the core. Whenever a request
>>> comes
>>> > in, the ParallelDataImportHandler delegates the task to these instances,
>>> > schedules the remainder and aggregates responses from each of these
>>> > instances to return back to the user.
>>> >
>>> > Thankfully, all of these worked, and preliminary benchmarking with
>>> 5million
>>> > records indicated 50% decrease in re-indexing time. Moreover, all my
>>> cores
>>> > (Solr in my case is hosted on a quad-core machine), indicated above 70%
>>> CPU
>>> > utilization. All that I could have asked for!
>>> >
>>> > With respect to this whole thing, I have a few questions -
>>> >
>>> >   1. Is something similar available out of the box?
>>> >   2. Is the idea flawed? Is the approach fundamentally correct?
>>> >   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
>>> >   age. I need to know, if a DIH instance is done with its task (mostly
>>> the
>>> >   "commit") operation. I could not figure a clean way out. As a hack, I
>>> keep
>>> >   pinging the DIH instances with command=status at regular intervals (in
>>> a
> >>> >   separate thread), to figure out if it is free to be assigned some
> >>> >   task. It works, but obviously with an overhead of unnecessary wasted
> >>> >   CPU cycles. Is there a better approach?
>>> >   4. I can better the time taken, even further if there was a way for me
>>> to
>>> >   tell a DIH instance not to open a new IndexSearcher. In the current
>>> scheme
>>> >   of things, as soon as one DIH instance is done committing, a new
>>> searcher is

[jira] Created: (SOLR-1327) Allow special Filters to access, modify, and/or add Fields to/on a Solr Document

2009-08-02 Thread Mark Miller (JIRA)
Allow special Filters to access, modify, and/or add Fields to/on a Solr Document


 Key: SOLR-1327
 URL: https://issues.apache.org/jira/browse/SOLR-1327
 Project: Solr
  Issue Type: New Feature
Reporter: Mark Miller
Priority: Minor
 Fix For: 1.5


Add a special Filter type that causes the field its in to be pre-analyzed - 
when this happens, the Filter can work with the Solr Document and modify it 
based on the tokens it sees.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Queries regarding a "ParallelDataImportHandler"

2009-08-02 Thread Avlesh Singh
I have one more question w.r.t the MultiThreaded DIH - What would be the
logic behind distributing tasks to threads?

I am sorry to have not mentioned this earlier - In my case, I take a "count
query" parameter as a configuration element. Based on this count and the
maxNumberOfDIHInstances, task assignment scheduling is done by "injecting"
limit and offset values in the import query for each DIH instance.
And this is, one of the reasons, why I call it a ParallelDataImportHandler.
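For illustration only (none of these names come from the actual handler), the scheduling described above, deriving limit/offset pairs from a row count and maxNumberOfDIHInstances, could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how a ParallelDataImportHandler could split a
// "count query" result into per-instance [offset, limit] slices to inject
// into each DIH instance's import query.
public class SliceDemo {
    /** Splits totalRows into at most maxInstances {offset, limit} pairs. */
    public static long[][] slices(long totalRows, int maxInstances) {
        long per = (totalRows + maxInstances - 1) / maxInstances; // ceiling division
        List<long[]> out = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += per) {
            out.add(new long[] { offset, Math.min(per, totalRows - offset) });
        }
        return out.toArray(new long[0][]);
    }

    public static void main(String[] args) {
        for (long[] s : slices(10, 3)) {
            System.out.println("LIMIT " + s[1] + " OFFSET " + s[0]);
        }
    }
}
```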

Cheers
Avlesh

On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh  wrote:

> run the add() calls to Solr in a dedicated thread
>
> Makes absolute sense. This would actually mean, DIH sits on top of all the
> add/update operations making it easier to implement a multi-threaded DIH.
>
> I would create a JIRA issue, right away.
> However, I would still love to see responses to my problems due to
> limitations in 1.3
>
> Cheers
> Avlesh
>
> 2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् 
>
>> a multithreaded DIH is in my top priority list. There are multiple
>> approaches
>>
>> 1) create multiple instances of dataImporter instances in the same DIH
>> instance and run them in parallel and commit when all of them are done
>> 2) run the add() calls to Solr in a dedicated thread
>> 3) make DIH automatically multithreaded . This is much harder to
>> implement.
>>
>> but #1 and #2 can be implemented with ease. It does not have to
>> be another implementation called ParallelDataImportHandler. I believe
>> it can be done in DIH itself
>>
>> you may not need to create a project in google code. you can open a
>> JIRA issue and start posting patches and we can put it back into Solr.
>>
>> .
>>
>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh wrote:
>> > In my quest to improve indexing time (in a multi-core environment), I
>> tried
>> > writing a Solr RequestHandler called ParallelDataImportHandler.
>> > I had a few lame questions to begin with, which Noble and Shalin
>> answered
>> > here -
>> >
>> http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
>> >
>> > As the name suggests, the handler, when invoked, tries to execute
>> multiple
>> > DIH instances on the same core in parallel. Of course the catch here is
>> > that, only those data-sources that can be batched can benefit from this
>> > handler. In my case, I am writing this for import from a MySQL database.
>> So,
>> > I have a single data-config.xml, in which the query has to add
>> placeholders
>> > for "limit" and "offset". Each DIH instance uses the same data-config
>> file,
>> > and replaces its own values for the limit and offset (which is in fact
>> > supplied by the parent ParallelDataImportHandler).
>> >
>> > I am achieving this by making my handler SolrCoreAware, and creating
>> > maxNumberOfDIHInstances (configurable) in the inform method. These
>> instances
>> > are then initialized and  registered with the core. Whenever a request
>> comes
>> > in, the ParallelDataImportHandler delegates the task to these instances,
>> > schedules the remainder and aggregates responses from each of these
>> > instances to return back to the user.
>> >
>> > Thankfully, all of these worked, and preliminary benchmarking with
>> 5million
>> > records indicated 50% decrease in re-indexing time. Moreover, all my
>> cores
>> > (Solr in my case is hosted on a quad-core machine), indicated above 70%
>> CPU
>> > utilization. All that I could have asked for!
>> >
>> > With respect to this whole thing, I have a few questions -
>> >
>> >   1. Is something similar available out of the box?
>> >   2. Is the idea flawed? Is the approach fundamentally correct?
>> >   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
>> >   age. I need to know, if a DIH instance is done with its task (mostly
>> the
>> >   "commit") operation. I could not figure a clean way out. As a hack, I
>> keep
>> >   pinging the DIH instances with command=status at regular intervals (in
>> a
>> >   separate thread), to figure out if it is free to be assigned some
>> >   task. It works, but obviously with an overhead of unnecessary wasted
>> >   CPU cycles. Is there a better approach?
>> >   4. I can better the time taken, even further if there was a way for me
>> to
>> >   tell a DIH instance not to open a new IndexSearcher. In the current
>> scheme
>> >   of things, as soon as one DIH instance is done committing, a new
>> searcher is
>> >   opened. This is blocking for other DIH instances (which were active)
>> and
>> >   they cannot continue without the searcher being initialized. Is there
>> a way
>> >   I can implement, single commit once all these DIH instances are done
>> with
>> >   their tasks? I tried each DIH instance with a commit=false without
>> luck.
>> >   5. Can this implementation be extended to support other data-sources
>> >   supported in DIH (HTTP, File, URL etc)?
>> >   6. If the utility is worth it, can I host this on Google code as an
>> open
>> >   source 

Re: Queries regarding a "ParallelDataImportHandler"

2009-08-02 Thread Avlesh Singh
>
> run the add() calls to Solr in a dedicated thread

Makes absolute sense. This would actually mean, DIH sits on top of all the
add/update operations making it easier to implement a multi-threaded DIH.

I would create a JIRA issue, right away.
However, I would still love to see responses to my problems due to
limitations in 1.3

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् 

> a multithreaded DIH is in my top priority list. There are multiple
> approaches
>
> 1) create multiple instances of dataImporter instances in the same DIH
> instance and run them in parallel and commit when all of them are done
> 2) run the add() calls to Solr in a dedicated thread
> 3) make DIH automatically multithreaded . This is much harder to implement.
>
> but #1 and #2 can be implemented with ease. It does not have to
> be another implementation called ParallelDataImportHandler. I believe
> it can be done in DIH itself
>
> you may not need to create a project in google code. you can open a
> JIRA issue and start posting patches and we can put it back into Solr.
>
> .
>
> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh wrote:
> > In my quest to improve indexing time (in a multi-core environment), I
> tried
> > writing a Solr RequestHandler called ParallelDataImportHandler.
> > I had a few lame questions to begin with, which Noble and Shalin answered
> > here -
> >
> http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >
> > As the name suggests, the handler, when invoked, tries to execute
> multiple
> > DIH instances on the same core in parallel. Of course the catch here is
> > that, only those data-sources that can be batched can benefit from this
> > handler. In my case, I am writing this for import from a MySQL database.
> So,
> > I have a single data-config.xml, in which the query has to add
> placeholders
> > for "limit" and "offset". Each DIH instance uses the same data-config
> file,
> > and replaces its own values for the limit and offset (which is in fact
> > supplied by the parent ParallelDataImportHandler).
> >
> > I am achieving this by making my handler SolrCoreAware, and creating
> > maxNumberOfDIHInstances (configurable) in the inform method. These
> instances
> > are then initialized and  registered with the core. Whenever a request
> comes
> > in, the ParallelDataImportHandler delegates the task to these instances,
> > schedules the remainder and aggregates responses from each of these
> > instances to return back to the user.
> >
> > Thankfully, all of these worked, and preliminary benchmarking with
> 5million
> > records indicated 50% decrease in re-indexing time. Moreover, all my
> cores
> > (Solr in my case is hosted on a quad-core machine), indicated above 70%
> CPU
> > utilization. All that I could have asked for!
> >
> > With respect to this whole thing, I have a few questions -
> >
> >   1. Is something similar available out of the box?
> >   2. Is the idea flawed? Is the approach fundamentally correct?
> >   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
> >   age. I need to know, if a DIH instance is done with its task (mostly
> the
> >   "commit") operation. I could not figure a clean way out. As a hack, I
> keep
> >   pinging the DIH instances with command=status at regular intervals (in
> a
> >   separate thread), to figure out if it is free to be assigned some task.
> >   It works, but obviously with an overhead of unnecessary wasted CPU cycles.
> >   Is there a better approach?
> >   4. I can better the time taken, even further if there was a way for me
> to
> >   tell a DIH instance not to open a new IndexSearcher. In the current
> scheme
> >   of things, as soon as one DIH instance is done committing, a new
> searcher is
> >   opened. This is blocking for other DIH instances (which were active)
> and
> >   they cannot continue without the searcher being initialized. Is there a
> way
> >   I can implement, single commit once all these DIH instances are done
> with
> >   their tasks? I tried each DIH instance with a commit=false without
> luck.
> >   5. Can this implementation be extended to support other data-sources
> >   supported in DIH (HTTP, File, URL etc)?
> >   6. If the utility is worth it, can I host this on Google code as an
> open
> >   source contrib?
> >
> > Any help will be deeply acknowledged and appreciated. While suggesting,
> > please don't forget that I am using Solr 1.3. If it all goes well, I
> don't
> > mind writing one for Solr 1.4.
> >
> > Cheers
> > Avlesh
> >
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>


Re: Queries regarding a "ParallelDataImportHandler"

2009-08-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
a multithreaded DIH is in my top priority list. There are multiple approaches

1) create multiple instances of dataImporter instances in the same DIH
instance and run them in parallel and commit when all of them are done
2) run the add() calls to Solr in a dedicated thread
3) make DIH automatically multithreaded. This is much harder to implement.

but #1 and #2 can be implemented with ease. It does not have to be
another implementation called ParallelDataImportHandler; I believe
it can be done in DIH itself.
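Approach #2 above can be sketched as a simple producer-consumer setup: the
importer threads enqueue documents while one dedicated thread performs the
add() calls. This is only an illustration; the queue of strings and the
in-memory "index" list are hypothetical stand-ins for real documents and the
real Solr add call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of approach #2: importer code enqueues documents while a single
// dedicated thread drains the queue and performs the (stand-in) add() calls.
public class DedicatedAddThread {

    private static final String POISON = "__END_OF_IMPORT__";

    public static List<String> run(List<String> docs) throws InterruptedException {
        List<String> indexed = new ArrayList<>();           // stands in for the index
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        Thread adder = new Thread(() -> {
            try {
                String doc;
                while (!(doc = queue.take()).equals(POISON)) {
                    indexed.add(doc);                       // the dedicated add() call
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        adder.start();

        for (String d : docs) {
            queue.put(d);                                   // producer (import) side
        }
        queue.put(POISON);                                  // signal end of import
        adder.join();                                       // wait before committing
        return indexed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(List.of("doc-1", "doc-2", "doc-3")));
        // prints [doc-1, doc-2, doc-3]
    }
}
```

Because a single consumer drains a FIFO queue, documents reach the add() call
in submission order, and the importer threads never block on the indexer
except for queue hand-off.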

You may not need to create a project on Google Code. You can open a
JIRA issue and start posting patches, and we can put it back into Solr.

On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh wrote:
> In my quest to improve indexing time (in a multi-core environment), I tried
> writing a Solr RequestHandler called ParallelDataImportHandler.
> I had a few lame questions to begin with, which Noble and Shalin answered
> here -
> http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
>
> As the name suggests, the handler, when invoked, tries to execute multiple
> DIH instances on the same core in parallel. Of course, the catch here is
> that only those data-sources that can be batched can benefit from this
> handler. In my case, I am writing this for import from a MySQL database. So,
> I have a single data-config.xml, in which the query has to add placeholders
> for "limit" and "offset". Each DIH instance uses the same data-config file,
> and replaces its own values for the limit and offset (which is in fact
> supplied by the parent ParallelDataImportHandler).
>
> I am achieving this by making my handler SolrCoreAware, and creating
> maxNumberOfDIHInstances (configurable) in the inform method. These instances
> are then initialized and registered with the core. Whenever a request comes
> in, the ParallelDataImportHandler delegates the task to these instances,
> schedules the remainder and aggregates responses from each of these
> instances to return back to the user.
>
> Thankfully, all of this worked, and preliminary benchmarking with 5 million
> records indicated a 50% decrease in re-indexing time. Moreover, all my cores
> (Solr in my case is hosted on a quad-core machine) indicated above 70% CPU
> utilization. All that I could have asked for!
>
> With respect to this whole thing, I have a few questions -
>
>   1. Is something similar available out of the box?
>   2. Is the idea flawed? Is the approach fundamentally correct?
>   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
>   age. I need to know if a DIH instance is done with its task (mostly the
>   "commit" operation). I could not figure a clean way out. As a hack, I keep
>   pinging the DIH instances with command=status at regular intervals (in a
>   separate thread), to figure out if it is free to be assigned some task. This
>   works, but obviously with the overhead of unnecessary wasted CPU cycles. Is
>   there a better approach?
>   4. I can improve the time taken even further if there was a way for me to
>   tell a DIH instance not to open a new IndexSearcher. In the current scheme
>   of things, as soon as one DIH instance is done committing, a new searcher is
>   opened. This is blocking for other DIH instances (which were active) and
>   they cannot continue without the searcher being initialized. Is there a way
>   I can implement a single commit once all these DIH instances are done with
>   their tasks? I tried each DIH instance with commit=false without luck.
>   5. Can this implementation be extended to support other data-sources
>   supported in DIH (HTTP, File, URL etc)?
>   6. If the utility is worth it, can I host this on Google code as an open
>   source contrib?
>
> Any help will be deeply acknowledged and appreciated. While suggesting,
> please don't forget that I am using Solr 1.3. If it all goes well, I don't
> mind writing one for Solr 1.4.
>
> Cheers
> Avlesh
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Queries regarding a "ParallelDataImportHandler"

2009-08-02 Thread Avlesh Singh
In my quest to improve indexing time (in a multi-core environment), I tried
writing a Solr RequestHandler called ParallelDataImportHandler.
I had a few lame questions to begin with, which Noble and Shalin answered
here -
http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing

As the name suggests, the handler, when invoked, tries to execute multiple
DIH instances on the same core in parallel. Of course, the catch here is
that only those data-sources that can be batched can benefit from this
handler. In my case, I am writing this for import from a MySQL database. So,
I have a single data-config.xml, in which the query has to add placeholders
for "limit" and "offset". Each DIH instance uses the same data-config file,
and replaces its own values for the limit and offset (which is in fact
supplied by the parent ParallelDataImportHandler).
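The limit/offset batching described above can be sketched roughly as follows.
The query template, the `${limit}`/`${offset}` placeholder syntax, and the
method names are illustrative assumptions, not the handler's actual
data-config.xml format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one shared query template, with each DIH instance
// receiving its own limit/offset slice from the parent handler.
public class BatchQueryBuilder {

    // Shared template with the two placeholders each instance fills in.
    static final String TEMPLATE =
        "SELECT id, name FROM items LIMIT ${limit} OFFSET ${offset}";

    // One query per DIH instance, each covering a disjoint slice of rows.
    public static List<String> queriesFor(int totalRows, int instances) {
        int batch = (totalRows + instances - 1) / instances; // ceiling division
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            queries.add(TEMPLATE
                    .replace("${limit}", String.valueOf(batch))
                    .replace("${offset}", String.valueOf(i * batch)));
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : queriesFor(100, 4)) {
            System.out.println(q);
        }
    }
}
```

For 100 rows and 4 instances this produces four queries with `LIMIT 25` and
offsets 0, 25, 50, and 75, so the instances never index overlapping rows.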

I am achieving this by making my handler SolrCoreAware, and creating
maxNumberOfDIHInstances (configurable) in the inform method. These instances
are then initialized and registered with the core. Whenever a request comes
in, the ParallelDataImportHandler delegates the task to these instances,
schedules the remainder and aggregates responses from each of these
instances to return back to the user.

Thankfully, all of this worked, and preliminary benchmarking with 5 million
records indicated a 50% decrease in re-indexing time. Moreover, all my cores
(Solr in my case is hosted on a quad-core machine) indicated above 70% CPU
utilization. All that I could have asked for!

With respect to this whole thing, I have a few questions -

   1. Is something similar available out of the box?
   2. Is the idea flawed? Is the approach fundamentally correct?
   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
   age. I need to know if a DIH instance is done with its task (mostly the
   "commit" operation). I could not figure a clean way out. As a hack, I keep
   pinging the DIH instances with command=status at regular intervals (in a
   separate thread), to figure out if it is free to be assigned some task. This
   works, but obviously with the overhead of unnecessary wasted CPU cycles. Is
   there a better approach?
   4. I can improve the time taken even further if there was a way for me to
   tell a DIH instance not to open a new IndexSearcher. In the current scheme
   of things, as soon as one DIH instance is done committing, a new searcher is
   opened. This is blocking for other DIH instances (which were active) and
   they cannot continue without the searcher being initialized. Is there a way
   I can implement a single commit once all these DIH instances are done with
   their tasks? I tried each DIH instance with commit=false without luck.
   5. Can this implementation be extended to support other data-sources
   supported in DIH (HTTP, File, URL etc)?
   6. If the utility is worth it, can I host this on Google code as an open
   source contrib?
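The status-polling hack in question 3 exists because DIH in Solr 1.3 offers
no completion callback. As an illustration of the coordination a listener
would enable, here is a CountDownLatch sketch in which plain threads are
hypothetical stand-ins for DIH instances reporting completion:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Instead of polling command=status, a coordinator can block on a latch
// that each worker counts down when its (stand-in) import finishes.
public class CompletionLatchDemo {

    public static int runWorkers(int n) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(n);
        AtomicInteger finished = new AtomicInteger();
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                finished.incrementAndGet(); // stands in for the import work
                done.countDown();           // worker reports completion
            }).start();
        }
        done.await();                       // no busy-wait, no status pings
        return finished.get();              // safe: await orders the reads
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWorkers(4)); // prints 4
    }
}
```

The single commit asked about in question 4 would naturally go right after
`done.await()`, once every instance has reported in.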

Any help will be deeply acknowledged and appreciated. While suggesting,
please don't forget that I am using Solr 1.3. If it all goes well, I don't
mind writing one for Solr 1.4.

Cheers
Avlesh


[jira] Created: (SOLR-1326) New interface PluginInfoInitialized

2009-08-02 Thread Noble Paul (JIRA)
New interface PluginInfoInitialized
---

 Key: SOLR-1326
 URL: https://issues.apache.org/jira/browse/SOLR-1326
 Project: Solr
  Issue Type: Improvement
Reporter: Noble Paul


There is no way for a plugin to know the attributes mentioned on its own 
tag (such as name). We should have a new interface that initializes the 
plugin with its PluginInfo, such as 
{code:java}
public interface PluginInfoInitialized {
  public void init(PluginInfo pluginInfo);
}
{code}
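A minimal sketch of how a plugin might implement the proposed interface. The
PluginInfo stand-in here carries only a name field, whereas the real class
would expose everything declared on the plugin's tag in solrconfig.xml; the
handler name is likewise made up for the example.

```java
// Illustrative only: PluginInfo is a minimal stand-in, not Solr's class.
public class PluginInfoDemo {

    static class PluginInfo {
        final String name;
        PluginInfo(String name) { this.name = name; }
    }

    // The interface proposed in this issue.
    interface PluginInfoInitialized {
        void init(PluginInfo pluginInfo);
    }

    // A plugin that learns its own tag attributes at init time.
    static class MyRequestHandler implements PluginInfoInitialized {
        String registeredName;
        public void init(PluginInfo info) {
            this.registeredName = info.name;
        }
    }

    public static void main(String[] args) {
        MyRequestHandler handler = new MyRequestHandler();
        handler.init(new PluginInfo("/dataimport"));
        System.out.println(handler.registeredName); // prints /dataimport
    }
}
```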

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.