[
https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228549#comment-13228549
]
James Dyer commented on SOLR-3011:
----------------------------------
I'm just starting to look at the DIH threaded code and the history behind it
(SOLR-1352, etc). It seems like a lot of work has been put into this by Noble,
Shalin, you and others. Yet, I can't help but wonder if we've gone down a bad
path with this one. That is, DIH is essentially a collection of
loosely-coupled components: DataSource, Entity, Transformer, etc. So it seems
that for this to work, not only does the core (DocBuilder etc) need to be
thread-safe, but every component in a given DIH configuration needs to be also.
There also is quite a bit of code duplication in DocBuilder and classes like
ThreadedEntityProcessorWrapper for threaded configurations. In the past, this
seems to have caused threaded-only problems like SOLR-2668. Also, the
DocBuilder duplication only covers full-imports. The delta-import doesn't look
like it supports threading at all? Finally, users can get confused because
specifying threads="1" actually does something: it puts the whole import
through the threaded code branches instead of the single-thread code branches.
Then there is the issue of tests. Mikhail, you've just noticed that
MockDataSource was not designed to test a multi-threaded scenario in a valid
fashion. But I'm not sure even the tests that are just for threads are all
that valid. Take a look at TestDocBuilderThreaded. Most of the configurations
have "threads=1". And what about this comment:
<code>
/*
* This test fails in TestEnviroment, but works in real Live
*/
@Test
public void testEvaluator() throws Exception
<code>
I'm starting to worry we're playing wack-a-mole with this, and maybe we need a
different approach. What if we tried this as a path forward:
1. Keep 3.x as-is, and make any quick fixes to threads for common use-cases
there, as possible.
2. In 4.0 (or a separate branch), remove threading from DIH.
3. Refactor code and improve tests.
4. Make DocBuilder, etc threadsafe.
5. Create a marker interface or annotation that can be put on DataSources,
Entities, Transformers, SolrWriters, etc signifying they can be used safely
with threads.
6. Re-implement the "threads" parameter. Maybe be a bit more thoughtful about
how it will work & document it. Do we have it be a thread-per-document (like
we have now) or a thread-per-entity (run child threads in parallel, but the
root entity is single-threaded)? Can we design it so that both can be
supported?
7. One-by-one, make the DataSources, Entities, etc threadsafe and implement the
marker interface, the new annotation, etc.
I realize that #1-2 present a problem with what has been done here already.
The SOLR-3011 patches work on 4.x and it would be a lot of work to backport
3.x. Yet I am proposing removing the current threading entirely from 4.x and
"fixing" only 3.x. But I can probably help with porting (some of?) this patch
back to 3.x.
We recently had this comment come from one of our PMC members: "If we would
have a better alternative without locale, threading and unicode bugs, I would
svn rm." But what I'm seeing is that while Lucene and Solr-core have had a lot
of hands in refactoring and improving the code, DIH has only had features piled
up onto it. It was mostly written by 1 or 2 people, with limited community
support from a code-maintenance perspective. In short, it hasn't gotten the
TLC it needs to thrive long-term. Maybe now's the time.
Comments? Does this seem like a viable way forward?
> DIH MultiThreaded bug
> ---------------------
>
> Key: SOLR-3011
> URL: https://issues.apache.org/jira/browse/SOLR-3011
> Project: Solr
> Issue Type: Sub-task
> Components: contrib - DataImportHandler
> Affects Versions: 3.5, 4.0
> Reporter: Mikhail Khludnev
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3011.patch, SOLR-3011.patch,
> patch-3011-EntityProcessorBase-iterator.patch,
> patch-3011-EntityProcessorBase-iterator.patch
>
>
> current DIH design is not thread safe. see last comments at SOLR-2382 and
> SOLR-2947. I'm going to provide the patch makes DIH core threadsafe. Mostly
> it's a SOLR-2947 patch from 28th Dec.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]