[jira] [Commented] (SOLR-3366) Restart of Solr during data import causes an empty index to be generated on restart

2012-04-17 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255827#comment-13255827
 ] 

James Dyer commented on SOLR-3366:
--

I don't see how this would be related to DIH.  Even if you had "clean=true", it 
doesn't commit the deletes until the entire update is complete.  So, like you 
say, we should expect to only lose the changes from the current import, not the 
entire index.
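
For context, the two modes being discussed differ only in the "clean" 
parameter; the request URLs look roughly like this (host and handler path are 
the defaults, adjust to your deployment):

```
# delta via full-import, existing documents untouched
http://localhost:8983/solr/dataimport?command=full-import&clean=false

# clean=true issues a delete-all first, but that delete is only
# committed once the whole import completes
http://localhost:8983/solr/dataimport?command=full-import&clean=true
```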

I wonder if this is a side-effect of using replication.  Sometimes, 
replication copies an entire new index to the slaves in a new directory, then 
writes this new directory name to "index.properties".  On restart, Solr looks 
for "index.properties" to find the appropriate index directory.  If this file 
had been touched or removed, Solr possibly restarted without finding the 
correct directory and created a new index.  Of course, this would have 
affected the slaves only.
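
For reference, the file in question is a one-line properties file in the data 
directory; a sketch of what a slave ends up with after a full copy (the 
directory timestamp is illustrative):

```
# <dataDir>/index.properties, written by replication after copying
# the new index into a timestamped directory
index=index.20120416053000
```

If this file is missing, Solr falls back to the plain "index" directory, 
which may no longer hold the current index after such a copy.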

I vaguely remember there being a bug some releases back where index corruption 
could occur if the system is ungracefully shut down, and I see you're on 3.4.  
But then again, maybe my memory is failing me because I didn't see this in the 
release notes.

> Restart of Solr during data import causes an empty index to be generated on 
> restart
> ---
>
> Key: SOLR-3366
> URL: https://issues.apache.org/jira/browse/SOLR-3366
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler, replication (java)
>Affects Versions: 3.4
>Reporter: Kevin Osborn
>
> We use the DataImportHandler and Java replication in a fairly simple setup of 
> a single master and 4 slaves. We had an operating index of about 16,000 
> documents. The DataImportHandler is invoked periodically by an external 
> service using the "command=full-import&clean=false" command for a delta 
> import.
> While processing one of these commands, we did a deployment which required us 
> to restart the application server (Tomcat 7). So, the import was interrupted. 
> Prior to this deployment, the full index of 16,000 documents had been 
> replicated to all slaves and was working correctly.
> Upon restart, the master restarted with an empty index and then this empty 
> index was replicated across all slaves. So, our search index was now empty.
> My expected behavior was to lose any changes in the delta import (basically 
> prior to the commit). However, I was not expecting to lose all data. Perhaps 
> this is due to the fact that I am using the full-import method, even though 
> it is really a delta, for performance reasons? Or does the data import just 
> put the index in some sort of invalid state?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3033) "numberToKeep" on replication handler does not work with "backupAfter"

2012-04-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255068#comment-13255068
 ] 

James Dyer commented on SOLR-3033:
--

Tomas,

You're correct this is a bug.  I opened SOLR-3361 for this and attached a fix.  
Thank you for reporting this.

> "numberToKeep" on replication handler does not work with "backupAfter"
> --
>
> Key: SOLR-3033
> URL: https://issues.apache.org/jira/browse/SOLR-3033
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 3.5
> Environment: openjdk 1.6, linux 3.x
>Reporter: Torsten Krah
>Assignee: James Dyer
> Fix For: 3.6
>
> Attachments: SOLR-3033-failingtest.patch, SOLR-3033.patch, 
> SOLR-3033.patch
>
>
> Configured my replication handler like this:
>
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="replicateAfter">startup</str>
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">optimize</str>
>     <str 
> name="confFiles">elevate.xml,schema.xml,spellings.txt,stopwords.txt,stopwords_de.txt,stopwords_en.txt,synonyms_de.txt,synonyms.txt</str>
>     <str name="backupAfter">optimize</str>
>     <str name="numberToKeep">1</str>
>   </lst>
> </requestHandler>
>
> So after optimize a snapshot should be taken, and this works. But 
> numberToKeep is ignored: snapshots increase with each call to optimize and 
> are kept forever. Seems these settings have no effect.




[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded

2012-04-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254940#comment-13254940
 ] 

James Dyer commented on SOLR-3360:
--

I think we need to verify whether it is adding the same 1000 documents 10x, 
or just counting each document 10x.  The fact that the successful 10-thread 
3.5 run took 1:12 but that same 10-thread run on 3.6 took 14:15 makes me 
wonder if each thread is actually duplicating the work and not just doing 
extra counting.

But then again the successful ONE-thread 3.6 run took 1:12 also... hmm...

Probably we need a unit test that does a simple SQL import with 2 threads and 
counts how many times SolrWriter#upload got called, then compares it both with 
the # of docs sent and the # docs reported to the user.  Then we'd know what is 
actually broken.  It'd be interesting to see what that same test against 3.5 
does (if it can be made to run to completion).  Possibly this is broken in 3.5 
too (except the counters) but nobody noticed because they always got 
synchronization problems and gave up??
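
The counting idea can be sketched outside DIH with a stand-in writer (class 
and method names here are hypothetical, not actual DIH code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the proposed test: count how many times the writer's
// upload() is actually invoked across threads, then compare with the
// number of documents we meant to send.
public class UploadCountSketch {
    static final AtomicInteger uploads = new AtomicInteger();

    static void upload(String doc) {
        uploads.incrementAndGet(); // one increment per document handed to the writer
    }

    public static void main(String[] args) throws InterruptedException {
        final int threads = 2, docsPerThread = 500;
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) upload("doc-" + i);
            });
            pool[t].start();
        }
        for (Thread t : pool) t.join();
        // A duplication bug would show up here as a multiple of the expected count.
        System.out.println("uploads=" + uploads.get()); // prints uploads=1000
    }
}
```

In the real test, the wrapper would sit around SolrWriter#upload and the 
total would be compared against both the documents sent and the counts DIH 
reports to the user.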

> Problem with DataImportHandler multi-threaded
> -
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
>Reporter: Claudio R
>
> Hi,
> If I use dataimport with 1 thread, I got:
> 
> <str name="Total Requests made to DataSource">5001</str>
> <str name="Total Rows Fetched">1000</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:21:57</str>
> <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 
> documents.</str>
> <str name="Committed">2012-04-16 11:23:19</str>
> <str name="Total Documents Processed">1000</str>
> <str name="Time taken ">0:1:22.390</str>
> 
> If I use dataimport with 10 threads, I got:
> 
> <str name="Total Requests made to DataSource">0</str>
> <str name="Total Rows Fetched">1</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2012-04-16 11:31:43</str>
> <str name="">Indexing completed. Added/Updated: 1 documents. Deleted 0 
> documents.</str>
> <str name="Committed">2012-04-16 11:41:50</str>
> <str name="Total Documents Processed">1</str>
> <str name="Time taken ">0:10:7.586</str>
> 
> The configuration with 10 threads took 10 times longer than the 
> configuration with 1 thread.
> I have 1000 records in the database.
> My db-data-config.xml is shown below:
> <dataConfig>
>   <dataSource url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test" 
>       user="user" password="pass"/>
>   <document>
>     <entity name="indice" transformer="RegexTransformer,TemplateTransformer" 
>         query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'" 
>         deltaImportQuery="i.id_indice, i.a, i.b from indice i where id_indice in 
>             ('${dataimporter.delta.id_indice}')" 
>         deltaQuery="select id_indice from indice where status='I' and 
>             data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)" 
>         deletedPkQuery="select id_indice from indice where status='D' and 
>             data_hora_modificacao >= convert(datetime, '${dataimporter.last_index_time}', 120)">
>       <entity name="filtro" transformer="RegexTransformer,TemplateTransformer" 
>           query="select categoria, sub_categoria from filtro where 
>               indice_id_indice = '${indice.id_indice}'">
>         <field column="..." template="${filtro.categoria}|${filtro.sub_categoria}"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
> Thanks.




[jira] [Commented] (SOLR-3360) Problem with DataImportHandler multi-threaded

2012-04-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254765#comment-13254765
 ] 

James Dyer commented on SOLR-3360:
--

Claudio,

Thanks for reporting this.  Was this working prior with 3.5?  (We did some work 
with the "threads" feature in 3.6, so it'd be helpful to know if this is a new 
bug).  

Also, can you try it (1) without any transformers and (2) with just the parent 
entity (take out the sub-entities)?  Do you get 10,000 or 1,000?  This might 
help in diagnosing and maybe solving this problem.
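
For the record, a stripped-down config for that experiment might look like 
this (a sketch reusing the queries from the report; no transformers, no 
sub-entities):

```xml
<dataConfig>
  <dataSource url="jdbc:sqlserver://200.XXX.XXX.XXX:1433;databaseName=test"
              user="user" password="pass"/>
  <document>
    <!-- parent entity only -->
    <entity name="indice"
            query="select top 1000 i.id_indice, i.a, i.b from indice i where i.status = 'I'"/>
  </document>
</dataConfig>
```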

Finally, you may want to be aware that 3.6 is the last release that will 
support the DIH "threads" feature.  It simply had too many bugs and was too 
difficult to maintain to keep it in.  But we did try and fix as many bugs for 
3.6 as we could.  Possibly in "fixing" what we could, we introduced this as a 
new problem?

> Problem with DataImportHandler multi-threaded
> -
>
> Key: SOLR-3360
> URL: https://issues.apache.org/jira/browse/SOLR-3360
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 3.6
> Environment: Solr 3.6.0, Apache Tomcat 6.0.20, jdk1.6.0_15, Windows XP
>Reporter: Claudio R
>




[jira] [Commented] (SOLR-2729) DIH status: successful zero-document delta-import missing "" field

2012-04-04 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246587#comment-13246587
 ] 

James Dyer commented on SOLR-2729:
--

I agree the status messages should be better (Fix typos, no blank names, etc).  
I don't think we should worry too much about breaking people's code (mine 
included).  Really, there should be a better way for automated schedulers to be 
able to check DIH status (JMX maybe?).  This is probably more of a long-term 
wish though.  In any case, I think the focus on the existing status page should 
be human-readability.

> DIH status: successful zero-document delta-import missing "" field
> --
>
> Key: SOLR-2729
> URL: https://issues.apache.org/jira/browse/SOLR-2729
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.2
> Environment: Linux idxst0-a 2.6.18-238.12.1.el5.centos.plusxen #1 SMP 
> Wed Jun 1 11:57:54 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
> java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>Reporter: Shawn Heisey
>Priority: Minor
> Fix For: 4.0
>
>
> If you have a successful delta-import that happens to process zero documents, 
> the <str name=""> field is not present in the status.  I've run into this 
> situation when the SQL query results in an empty set.  A workaround for the 
> problem is to instead look for the "Time taken " field ... but if you don't 
> happen to notice that this field has an extraneous space in the name, that 
> won't work either.
> A full-import that processes zero documents has the field present as expected:
> <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 
> documents.</str>




[jira] [Commented] (SOLR-3011) DIH MultiThreaded bug

2012-04-02 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244160#comment-13244160
 ] 

James Dyer commented on SOLR-3011:
--

If the changes in 3.6 break FileListEntityProcessor then we should try and fix 
it.  A failing unit test would help a lot.  As a workaround, you should always 
be able to use the 3.5 jar with 3.6.

4.0 is only going to support single-threaded DIH configurations.  I understand 
that some users have gotten performance gains using "threads" and haven't had 
problems.  I suspect these were mostly cases like yours where you're processing 
text documents and have a somewhat simple configuration.  But looking at the 
code, I don't think we can guarantee DIH using the "threads" parameter will 
never encounter a race condition, etc, and that some configurations (especially 
using SQL, caching, etc) were not working at all (which SOLR-3011 at least 
mostly fixes).  It was also getting hard to add new features because all bets 
were pretty much off as to whether or not any changes would work with "threads".

Long term, I would like to see some type of multi-threading added back in.  But 
we do need to refactor the code.  I am now looking at consolidating some of the 
objects that DIH passes around, reducing member visibility, making things 
immutable, etc.  Some of the classes need to be made simpler (DocBuilder comes 
to mind).  Hopefully we can have a code base that can be more easily made 
threadsafe in the future.

> DIH MultiThreaded bug
> -
>
> Key: SOLR-3011
> URL: https://issues.apache.org/jira/browse/SOLR-3011
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5
>Reporter: Mikhail Khludnev
>Assignee: James Dyer
>Priority: Minor
> Fix For: 3.6
>
> Attachments: SOLR-3011.patch, SOLR-3011.patch, SOLR-3011.patch, 
> SOLR-3011.patch, SOLR-3011.patch, 
> patch-3011-EntityProcessorBase-iterator.patch, 
> patch-3011-EntityProcessorBase-iterator.patch
>
>
> current DIH design is not thread safe. See last comments at SOLR-2382 and 
> SOLR-2947. I'm going to provide a patch that makes the DIH core threadsafe. 
> Mostly it's the SOLR-2947 patch from 28th Dec. 




[jira] [Commented] (SOLR-3011) DIH MultiThreaded bug

2012-03-30 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242370#comment-13242370
 ] 

James Dyer commented on SOLR-3011:
--

Bernd,

Do you think this is a new bug since SOLR-3011 was applied to the 3.6 branch, 
or does this also fail this way for you with 3.5?  Also, is there any way you 
can provide a failing unit test?  If so, open a new issue and attach your unit 
test in a patch.  As you might be aware, "threads" is removed from 4.0, mostly 
because of bugs like this one.  I'd be interested in getting this fixed in the 
3.x branch especially if this is newly-caused by SOLR-3011.

You're right that the random test is redundant for the 1-thread case.  No harm 
in what's there, but it would be better if the random test did 2-10, not 1-10.
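
In code that is just a change to the random bound (a sketch, not the actual 
test class):

```java
import java.util.Random;

public class ThreadCountPick {
    public static void main(String[] args) {
        Random random = new Random();
        // 2..10 inclusive: nextInt(9) yields 0..8, and the +2 skips the
        // redundant 1-thread case
        int threads = 2 + random.nextInt(9);
        System.out.println(threads >= 2 && threads <= 10); // prints true
    }
}
```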

> DIH MultiThreaded bug
> -
>
> Key: SOLR-3011
> URL: https://issues.apache.org/jira/browse/SOLR-3011
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5
>Reporter: Mikhail Khludnev
>Assignee: James Dyer
>Priority: Minor
> Fix For: 3.6
>
> Attachments: SOLR-3011.patch, SOLR-3011.patch, SOLR-3011.patch, 
> SOLR-3011.patch, SOLR-3011.patch, 
> patch-3011-EntityProcessorBase-iterator.patch, 
> patch-3011-EntityProcessorBase-iterator.patch
>
>
> current DIH design is not thread safe. See last comments at SOLR-2382 and 
> SOLR-2947. I'm going to provide a patch that makes the DIH core threadsafe. 
> Mostly it's the SOLR-2947 patch from 28th Dec. 




[jira] [Commented] (SOLR-3029) Poor json formatting of spelling collation info

2012-03-29 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241289#comment-13241289
 ] 

James Dyer commented on SOLR-3029:
--

Yonik,

I can answer some of your questions.  I do agree the spellcheck response format 
leaves something to be desired, and maybe 4.0 is a good time to break backwards 
compatibility and improve it.

{quote}
Unless order is really important, "suggestions" should be a map
{quote}
I don't see why order would matter here, although some users might like to see 
the corrections listed in the order they appeared in the query.

{quote}
same for "collation"
{quote}
The collations are ranked, so order is important.

{quote}
and "misspellingsAndCorrections"
{quote}
The order shouldn't matter unless users are picky about the corrections being 
presented in the order they occur in the query.

{quote}
why is "collation" inside "suggestions" along with other words? should this be 
one level higher?
{quote}
This always confused me too.  I agree it should be one level higher.

{quote}
why isn't this giving me multiple collations
{quote}
This is a bug.  See SOLR-2853.

{quote}
why aren't multiple suggestions returned in misspellingsAndCorrections? (and 
what's the purpose ...?)
{quote}
This is nested within each collation and gives details on which misspelled 
word got which replacement for that particular collation.  This makes it easy 
for clients to generate messages like "no results found for abcdefgq ...  
Showing abcdefgx instead!"  You can suppress this information by not 
specifying "spellcheck.collateExtendedResults=true".  For users (like me) who 
are interested only in the collations and don't care about individual-word 
corrections, it would be nice if we could suppress the first section of the 
response entirely.
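
The parameters involved, for reference (a typical request sketch):

```
spellcheck=true
spellcheck.collate=true
spellcheck.maxCollations=5
# omit the per-word breakdown nested in each collation:
spellcheck.collateExtendedResults=false
```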

{quote}
I briefly tried distributed search...
{quote}
DistributedSpellCheckComponentTest is supposed to detect problems like this but 
maybe something is going on and there is a bug this test isn't catching?

For what it's worth, you had voiced some misgivings about the JSON format when 
the multiple-collations feature was added.  At that time I supplied a quick 
patch to address your concerns.  I'm not sure if that patch fixes the problem 
described here.  See SOLR-2010 and your comment from Oct 16, 2010 and the (now 
outdated, never committed) patch I supplied on Oct 20.  

The patch on this issue causes multiple test failures although I didn't look 
into them.





> Poor json formatting of spelling collation info
> ---
>
> Key: SOLR-3029
> URL: https://issues.apache.org/jira/browse/SOLR-3029
> Project: Solr
>  Issue Type: Bug
>  Components: spellchecker
>Affects Versions: 4.0
>Reporter: Antony Stubbs
>Priority: Blocker
> Attachments: SOLR-3029.patch
>
>
> {noformat}
> "spellcheck": {
>     "suggestions": [
>         "dalllas",
>         {
>             "suggestion": [
>                 {
>                     "word": "canallas",
>                     "freq": 1
>                 }
>             ]
>         },
>         "correctlySpelled",
>         false,
>         "collation",
>         "dallas"
>     ]
> }
> {noformat}
> The correctlySpelled and collation key/values are stored as consecutive 
> elements in an array - quite odd. Is there a reason it's not a key/value map 
> like most things?




[jira] [Commented] (SOLR-3262) Remove "threads" from DIH (Trunk only)

2012-03-23 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237024#comment-13237024
 ] 

James Dyer commented on SOLR-3262:
--

Patch to remove "threads" from DIH.  I would like to commit this in a few days.

> Remove "threads" from DIH (Trunk only)
> --
>
> Key: SOLR-3262
> URL: https://issues.apache.org/jira/browse/SOLR-3262
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
>Reporter: James Dyer
>Assignee: James Dyer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3262.patch
>
>
> SOLR-1352 introduced a multi-threading feature for DataImportHandler.  
> Historically, this feature only seemed to work in a limited set of cases and 
> I don't think we can guarantee users that using "threads" will behave 
> consistently.  Also, the multi-threaded option adds considerable complexity 
> making code refactoring difficult. 
> I propose removing "threads" from Trunk.  (But keep it in 3.x, applying any 
> bug fixes for it there.)  This can be a first step in improving the DIH code 
> base.  
> Eventually we can possibly add a carefully thought-out "threads" 
> implementation back in.




[jira] [Commented] (SOLR-2724) Deprecate defaultSearchField and defaultOperator defined in schema.xml

2012-03-23 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236878#comment-13236878
 ] 

James Dyer commented on SOLR-2724:
--

We have apps that always use "AND" (or mm=100%).  So for us, the global default 
is nice to have.  I can see why it belongs in solrconfig.xml and not schema.xml 
though.

> Deprecate defaultSearchField and defaultOperator defined in schema.xml
> --
>
> Key: SOLR-2724
> URL: https://issues.apache.org/jira/browse/SOLR-2724
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis, search
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Fix For: 3.6, 4.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> I've always been surprised to see the <defaultSearchField> element and the 
> <solrQueryParser> defaultOperator attribute defined in the schema.xml file 
> since the first time I saw them.  They just seem out of place to me since 
> they are more query parser related than schema related. But not only are they 
> misplaced, I feel they shouldn't exist. For query parsers, we already have a 
> "df" parameter that works just fine, and explicit field references. And the 
> default lucene query operator should stay at OR -- if a particular query 
> wants different behavior then use q.op or simply use "OR".
> <similarity> seems like something better placed in solrconfig.xml than in 
> the schema. 
> In my opinion, defaultSearchField and defaultOperator configuration elements 
> should be deprecated in Solr 3.x and removed in Solr 4.  And <similarity> 
> should move to solrconfig.xml. I am willing to do it, provided there is 
> consensus on it of course.




[jira] [Commented] (SOLR-2961) DIH with threads and TikaEntityProcessor JDBC ISsue

2012-03-22 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235842#comment-13235842
 ] 

James Dyer commented on SOLR-2961:
--

{quote}
Mikhail Khludnev commented on SOLR-3011:


bq.  Is SOLR-2961 just for Tika?

yep. it seems so. Why do you ask, we don't need to support it further?
{quote}

I don't think we have to support _threads_ with everything.  (This is one 
reason why I want to remove "threads" on Trunk.  It's going to be very difficult 
to support every use-case.)  On the other hand, if you or someone else puts up 
a good patch in the very near term, I will try to get it into 3.6.

> DIH with threads and TikaEntityProcessor JDBC ISsue
> ---
>
> Key: SOLR-2961
> URL: https://issues.apache.org/jira/browse/SOLR-2961
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.4, 3.5
> Environment: Windows Server 2008, Apache Tomcat 6, Oracle 11g, ojdbc 
> 11.2.0.1
>Reporter: David Webb
>  Labels: dih, tika
> Attachments: SOLR-2961.patch, data-config.xml
>
>
> I have a DIH Configuration that works great when I dont specify threads="X" 
> in the root entity.  As soon as I give a value for threads, I get the 
> following error messages in the stacktrace.  Please advise.  
> SEVERE: JdbcDataSource was not closed prior to finalize(), indicates a bug -- 
> POSSIBLE RESOURCE LEAK!!!
> Dec 10, 2011 1:18:33 PM org.apache.solr.handler.dataimport.JdbcDataSource 
> closeConnection
> SEVERE: Ignoring Error when closing connection
> java.sql.SQLRecoverableException: IO Error: Socket closed
>   at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:511)
>   at 
> oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:3931)
>   at 
> org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:401)
>   at 
> org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:392)
>   at 
> org.apache.solr.handler.dataimport.JdbcDataSource.finalize(JdbcDataSource.java:380)
>   at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
>   at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
>   at java.lang.ref.Finalizer.access$100(Unknown Source)
>   at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
> Caused by: java.net.SocketException: Socket closed
>   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>   at java.net.SocketOutputStream.write(Unknown Source)
>   at oracle.net.ns.DataPacket.send(DataPacket.java:199)
>   at oracle.net.ns.NetOutputStream.flush(NetOutputStream.java:211)
>   at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:227)
>   at oracle.net.ns.NetInputStream.read(NetInputStream.java:175)
>   at oracle.net.ns.NetInputStream.read(NetInputStream.java:100)
>   at oracle.net.ns.NetInputStream.read(NetInputStream.java:85)
>   at 
> oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
>   at 
> oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
>   at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1122)
>   at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:1099)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:288)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
>   at oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:61)
>   at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:498)
>   ... 8 more
> Dec 10, 2011 1:18:34 PM 
> org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
> SEVERE: Exception in entity : null
> org.apache.solr.handler.dataimport.DataImportHandlerException: Failed to 
> initialize DataSource: f2
>   at 
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>   at 
> org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:333)
>   at 
> org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:99)
>   at 
> org.apache.solr.handler.dataimport.ThreadedContext.getDataSource(ThreadedContext.java:66)
>   at 
> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:101)
>   at 
> org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:446)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:399

[jira] [Commented] (SOLR-3011) DIH MultiThreaded bug

2012-03-22 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235824#comment-13235824
 ] 

James Dyer commented on SOLR-3011:
--

That would be great if you can.  Lucene/Solr 3.6 is going to be the last 3.x 
release and it is closing for new functionality soon.  SOLR-2804 for sure looks 
like something that should be there.  Is SOLR-2961 just for Tika?

> DIH MultiThreaded bug
> -
>
> Key: SOLR-3011
> URL: https://issues.apache.org/jira/browse/SOLR-3011
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5
>Reporter: Mikhail Khludnev
>Priority: Minor
> Fix For: 3.6
>
> Attachments: SOLR-3011.patch, SOLR-3011.patch, SOLR-3011.patch, 
> SOLR-3011.patch, SOLR-3011.patch, 
> patch-3011-EntityProcessorBase-iterator.patch, 
> patch-3011-EntityProcessorBase-iterator.patch
>
>
> The current DIH design is not thread safe; see the last comments at SOLR-2382 
> and SOLR-2947.  I'm going to provide a patch that makes the DIH core 
> thread-safe.  Mostly it's the SOLR-2947 patch from 28th Dec. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3260) Improve exception handling / logging for ScriptTransformer.init()

2012-03-22 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235609#comment-13235609
 ] 

James Dyer commented on SOLR-3260:
--

I missed one.  Sorry about that.  Should be fixed now.

> Improve exception handling / logging for ScriptTransformer.init()
> -
>
> Key: SOLR-3260
> URL: https://issues.apache.org/jira/browse/SOLR-3260
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5, 4.0
>Reporter: James Dyer
>Assignee: James Dyer
>Priority: Trivial
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-3260.patch
>
>
> This came up on the user-list.  ScriptTransformer logs the same "need a >=1.6 
> jre" message for several problems, making debugging difficult for users.




[jira] [Commented] (SOLR-3260) Improve exception handling / logging for ScriptTransformer.init()

2012-03-20 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234036#comment-13234036
 ] 

James Dyer commented on SOLR-3260:
--

Thanks, Steven!  I'll look a little bit more at this tomorrow.  Sorry to break 
the build.

> Improve exception handling / logging for ScriptTransformer.init()
> -
>
> Key: SOLR-3260
> URL: https://issues.apache.org/jira/browse/SOLR-3260
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5, 4.0
>Reporter: James Dyer
>Assignee: James Dyer
>Priority: Trivial
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-3260.patch
>
>
> This came up on the user-list.  ScriptTransformer logs the same "need a >=1.6 
> jre" message for several problems, making debugging difficult for users.




[jira] [Commented] (SOLR-3260) Improve exception handling / logging for ScriptTransformer.init()

2012-03-20 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233860#comment-13233860
 ] 

James Dyer commented on SOLR-3260:
--

That is probably the right thing to do.  If it's using a JRE that supports the 
class "javax.script.ScriptEngineManager" but there are no ScriptEngines 
installed, you'll get the "Cannot load Script Engine..." message.  I am a bit 
confused why it is giving the "eval failed..." message though.

Why don't you make the fix you've proposed and I'll look later to see if there 
is something less-blunt we can do.  Sound ok?

> Improve exception handling / logging for ScriptTransformer.init()
> -
>
> Key: SOLR-3260
> URL: https://issues.apache.org/jira/browse/SOLR-3260
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5, 4.0
>Reporter: James Dyer
>Assignee: James Dyer
>Priority: Trivial
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-3260.patch
>
>
> This came up on the user-list.  ScriptTransformer logs the same "need a >=1.6 
> jre" message for several problems, making debugging difficult for users.




[jira] [Commented] (SOLR-3011) DIH MultiThreaded bug

2012-03-19 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232650#comment-13232650
 ] 

James Dyer commented on SOLR-3011:
--

{quote}
what is the importance of delta import scenario?  I see it can be done via full 
import instead
{quote}
This just provides two ways you can set up your deltas.  Personally I like to 
use "full-import w/ clean=false" to do deltas.  It's much more flexible.  With 
that said, people use the other way and it works.  The bad thing, I think, is 
that it is not documented that "threads" doesn't work with it.  Also, surely 
the code can be cleaned up to not be an entirely different branch from 
"full-import".
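A rough sketch of the two delta styles in data-config.xml terms (table, column, 
and entity names here are invented for illustration, following common DIH 
examples):

```xml
<!-- Style 1: the dedicated delta-import path (command=delta-import). -->
<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE updated &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item
                          WHERE id = '${dih.delta.id}'"/>

<!-- Style 2: one parameterized query, run as
     command=full-import&clean=false so existing documents are kept. -->
<entity name="item"
        query="SELECT * FROM item
               WHERE updated &gt; '${dataimporter.last_index_time}'"/>
```

Style 2 needs only one query, which is part of what makes it the more flexible 
option.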

{quote}
It's hard to maintain code ever. 
{quote}
This is why I think removing "threads" in trunk is still the way to go.  This 
will give us a good starting point for improving the code.  We can add a 
better-designed version of "threads" later.

With that said, I think you've probably fixed "threads" for a lot of use-cases. 
 Why can't we back-port this to 3.x and document for the users what works and 
doesn't work with "threads", with warnings to test thoroughly before going to 
production?  If it means back-porting the cache refactorings first, so be it.  

I know you were thinking we really should start over with "Ultimate DIH".  
That's fine if people want to do that.  But I'm using the existing DIH for some 
pretty complex things and it works great.  My issue with DIH is not that it 
isn't a good product.  Its just that it needs some work internally if we're 
going to be able to continue to improve it from here.

> DIH MultiThreaded bug
> -
>
> Key: SOLR-3011
> URL: https://issues.apache.org/jira/browse/SOLR-3011
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5, 4.0
>Reporter: Mikhail Khludnev
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3011.patch, SOLR-3011.patch, SOLR-3011.patch, 
> SOLR-3011.patch, patch-3011-EntityProcessorBase-iterator.patch, 
> patch-3011-EntityProcessorBase-iterator.patch
>
>
> The current DIH design is not thread safe; see the last comments at SOLR-2382 
> and SOLR-2947.  I'm going to provide a patch that makes the DIH core 
> thread-safe.  Mostly it's the SOLR-2947 patch from 28th Dec. 




[jira] [Commented] (SOLR-3011) DIH MultiThreaded bug

2012-03-13 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228549#comment-13228549
 ] 

James Dyer commented on SOLR-3011:
--

I'm just starting to look at the DIH threaded code and the history behind it 
(SOLR-1352, etc).  It seems like a lot of work has been put into this by Noble, 
Shalin, you and others.  Yet, I can't help but wonder if we've gone down a bad 
path with this one.  That is, DIH is essentially a collection of 
loosely-coupled components:  DataSource, Entity, Transformer, etc.  So it seems 
that for this to work, not only does the core (DocBuilder etc) need to be 
thread-safe, but every component in a given DIH configuration needs to be also.

There also is quite a bit of code duplication in DocBuilder and classes like 
ThreadedEntityProcessorWrapper for threaded configurations.  In the past, this 
seems to have caused threaded-only problems like SOLR-2668.  Also, the 
DocBuilder duplication only covers full-imports.  The delta-import doesn't look 
like it supports threading at all?  Finally, users can get confused because 
specifying threads="1" actually does something:  it puts the whole import 
through the threaded code branches instead of the single-thread code branches.
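The threads="1" quirk described above boils down to dispatch logic like the 
following (an illustrative sketch only, not DocBuilder itself):

```java
// Merely *specifying* the "threads" parameter selects the threaded code
// branch, even when the requested thread count is 1.
public class ThreadsDispatch {
    static String chooseBranch(Integer threadsParam) {
        // null means the parameter was absent from the request
        return (threadsParam != null) ? "threaded" : "single-thread";
    }

    public static void main(String[] args) {
        System.out.println(chooseBranch(null)); // single-thread
        System.out.println(chooseBranch(1));    // threaded -- the surprise
    }
}
```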

Then there is the issue of tests.  Mikhail, you've just noticed that 
MockDataSource was not designed to test a multi-threaded scenario in a valid 
fashion.  But I'm not sure even the tests that are just for threads are all 
that valid.  Take a look at TestDocBuilderThreaded.  Most of the configurations 
have "threads=1".  And what about this comment:


/*
  * This test fails in TestEnviroment, but works in real Live
  */
  @Test
  public void testEvaluator() throws Exception


I'm starting to worry we're playing whack-a-mole with this, and maybe we need a 
different approach.  What if we tried this as a path forward:

1. Keep 3.x as-is, and make any quick fixes to threads for common use-cases 
there, as possible.
2. In 4.0 (or a separate branch), remove threading from DIH.
3. Refactor code and improve tests.
4. Make DocBuilder, etc threadsafe.
5. Create a marker interface or annotation that can be put on DataSources, 
Entities, Transformers, SolrWriters, etc signifying they can be used safely 
with threads.
6. Re-implement the "threads" parameter.  Maybe be a bit more thoughtful about 
how it will work & document it.  Do we have it be a thread-per-document (like 
we have now) or a thread-per-entity (run child threads in parallel, but the 
root entity is single-threaded)?  Can we design it so that both can be 
supported?
7. One-by-one, make the DataSources, Entities, etc threadsafe and implement the 
marker interface, the new annotation, etc.
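Step 5 could look roughly like this (the interface and component names are 
hypothetical; nothing like this exists in DIH today):

```java
// Hypothetical marker interface that DataSources, Entities, Transformers,
// etc. would implement to declare themselves thread-safe.
interface ThreadSafeDIHComponent {}

// Two stand-in components for illustration.
class CachingTransformer implements ThreadSafeDIHComponent {}
class LegacyTransformer {}

public class ThreadSafety {
    // A threaded import would be allowed only if every configured
    // component passes this check.
    static boolean safeForThreads(Object component) {
        return component instanceof ThreadSafeDIHComponent;
    }

    public static void main(String[] args) {
        System.out.println(safeForThreads(new CachingTransformer())); // true
        System.out.println(safeForThreads(new LegacyTransformer()));  // false
    }
}
```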

I realize that #1-2 present a problem with what has been done here already.  
The SOLR-3011 patches work on 4.x and it would be a lot of work to backport 
3.x.  Yet I am proposing removing the current threading entirely from 4.x and 
"fixing" only 3.x.  But I can probably help with porting (some of?) this patch 
back to 3.x.

We recently had this comment come from one of our PMC members:  "If we would 
have a better alternative without locale, threading and unicode bugs, I would 
svn rm."  But what I'm seeing is that while Lucene and Solr-core have had a lot 
of hands in refactoring and improving the code, DIH has only had features piled 
up onto it.  It was mostly written by 1 or 2 people, with limited community 
support from a code-maintenance perspective.  In short, it hasn't gotten the 
TLC it needs to thrive long-term.  Maybe now's the time.

Comments?  Does this seem like a viable way forward?

> DIH MultiThreaded bug
> -
>
> Key: SOLR-3011
> URL: https://issues.apache.org/jira/browse/SOLR-3011
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5, 4.0
>Reporter: Mikhail Khludnev
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3011.patch, SOLR-3011.patch, 
> patch-3011-EntityProcessorBase-iterator.patch, 
> patch-3011-EntityProcessorBase-iterator.patch
>
>
> The current DIH design is not thread safe; see the last comments at SOLR-2382 
> and SOLR-2947.  I'm going to provide a patch that makes the DIH core 
> thread-safe.  Mostly it's the SOLR-2947 patch from 28th Dec. 




[jira] [Commented] (SOLR-3240) add spellcheck 'approximate collation count' mode

2012-03-13 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228492#comment-13228492
 ] 

James Dyer commented on SOLR-3240:
--

collation.hits is just metadata for the user, so I think what you want to do 
would be entirely valid.  

The estimates would only be good if the hits are somewhat evenly distributed 
across the index, right?  For instance, if you're indexing something by topic 
and then a bunch of new docs get added on the same topic around the same time, 
you'd get a cluster of hits in one place.  

Even so, like you say, many (most) people would rather improve performance than 
have an accurate (any) hit count returned.

Beyond this, there are also some dead-simple optimizations we can make by 
simply removing any sorting & boosting parameters from the query before testing 
the collation.
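The docid-space extrapolation can be sketched as follows (an illustrative 
calculation, not Lucene API): if the first hit lands at docid d in a 
maxDoc-sized index, there is roughly one hit per (d + 1) docs, which only 
holds under the even-distribution assumption above.

```java
// Estimate total hits from the position of the first collected hit.
public class ApproxHits {
    static long estimateHits(int firstHitDocId, int maxDoc) {
        return Math.round((double) maxDoc / (firstHitDocId + 1));
    }

    public static void main(String[] args) {
        // 1,000,000-doc index, first hit at docid 99 -> ~10,000 hits
        System.out.println(estimateHits(99, 1_000_000)); // 10000
    }
}
```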

> add spellcheck 'approximate collation count' mode
> -
>
> Key: SOLR-3240
> URL: https://issues.apache.org/jira/browse/SOLR-3240
> Project: Solr
>  Issue Type: Improvement
>  Components: spellchecker
>Reporter: Robert Muir
>
> SpellCheck's Collation in Solr is a way to ensure spellcheck/suggestions
> will actually net results (taking into account context like filtering).
> In order to do this (from my understanding), it generates candidate queries,
> executes them, and saves the total hit count: collation.setHits(hits).
> For a large index it seems this might be doing too much work: in particular
> I'm interested in ensuring this feature can work fast enough/well for 
> autosuggesters.
> So I think we should offer an 'approximate' mode that uses an 
> early-terminating
> Collector, collect()ing only N docs (e.g. n=1), and we approximate this result
> count based on docid space. 
> I'm not sure what needs to happen on the solr side (possibly support for 
> custom collectors?),
> but I think this could help and should possibly be the default.




[jira] [Commented] (SOLR-3240) add spellcheck 'approximate collation count' mode

2012-03-13 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228463#comment-13228463
 ] 

James Dyer commented on SOLR-3240:
--

Are you saying that if a user only cares that a collation will yield some hits, 
but doesn't care how many, then we can short-circuit these queries to quit once 
one document is collected?  (alternatively, quit after n docs are collected if 
the user doesn't care whether it is "greater than n"?)
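The short-circuit idea reads roughly like this (plain arrays stand in for a 
Lucene Collector; this is a sketch of the logic, not the real API):

```java
// Stop counting once n matches are seen, so the caller only learns
// "at least n hits exist" instead of paying for an exact count.
public class EarlyTerminate {
    static int countUpTo(int[] matchingDocIds, int n) {
        int seen = 0;
        for (int docId : matchingDocIds) {
            seen++;
            if (seen >= n) {
                break; // quit early: we don't care beyond n
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] hits = {3, 17, 42, 99, 150};
        System.out.println(countUpTo(hits, 1));  // 1 -> collation yields hits
        System.out.println(countUpTo(hits, 10)); // 5 -> fewer than n exist
    }
}
```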

> add spellcheck 'approximate collation count' mode
> -
>
> Key: SOLR-3240
> URL: https://issues.apache.org/jira/browse/SOLR-3240
> Project: Solr
>  Issue Type: Improvement
>  Components: spellchecker
>Reporter: Robert Muir
>
> SpellCheck's Collation in Solr is a way to ensure spellcheck/suggestions
> will actually net results (taking into account context like filtering).
> In order to do this (from my understanding), it generates candidate queries,
> executes them, and saves the total hit count: collation.setHits(hits).
> For a large index it seems this might be doing too much work: in particular
> I'm interested in ensuring this feature can work fast enough/well for 
> autosuggesters.
> So I think we should offer an 'approximate' mode that uses an 
> early-terminating
> Collector, collect()ing only N docs (e.g. n=1), and we approximate this result
> count based on docid space. 
> I'm not sure what needs to happen on the solr side (possibly support for 
> custom collectors?),
> but I think this could help and should possibly be the default.




[jira] [Commented] (SOLR-2124) SEVERE exceptions are being logged for expected PingRequestHandler SERVICE_UNAVAILABLE exceptions

2012-03-07 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224608#comment-13224608
 ] 

James Dyer commented on SOLR-2124:
--

I think this is still a problem.  I have a snapshot from 1/31 (post-2191 and 
its children) and it still logs a whole stack trace every time the load 
balancer pings it, if the "service-enabled" file is missing.  This is a pretty 
big annoyance for me because I often use a "live", load-balanced dev 
environment to test new versions with the testing node taken "out" using this 
ping feature.  If nobody else does anything, I'll likely be annoyed enough to 
fix it eventually.
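For context, a minimal sketch of the ping setup being described, as I recall 
the 3.x-era solrconfig.xml syntax (the healthcheck filename is illustrative):

```xml
<!-- ping handler hit by the load balancer -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler"/>

<!-- healthcheck file: removing it takes the node "out" of rotation,
     which is exactly when the SEVERE stack traces get logged -->
<admin>
  <healthcheck type="file">server-enabled</healthcheck>
</admin>
```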

> SEVERE exceptions are being logged for expected PingRequestHandler 
> SERVICE_UNAVAILABLE exceptions
> -
>
> Key: SOLR-2124
> URL: https://issues.apache.org/jira/browse/SOLR-2124
> Project: Solr
>  Issue Type: Bug
>Reporter: Hoss Man
>Assignee: Erick Erickson
> Fix For: 3.6, 4.0
>
>
> As reported by a user, if you use the PingRequestHandler and the 
> corresponding healthcheck file doesn't exist (an expected situation when a 
> server is out of rotation), Solr is logging a SEVERE error...
> {noformat}
> SEVERE: org.apache.solr.common.SolrException: Service disabled
>   at 
> org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:48)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1324)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
>   at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>   at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
>   at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>   at org.mortbay.jetty.Server.handle(Server.java:326)
>   at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>   at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>   at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>   at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> {noformat}
> This is in spite of the fact that PingRequestHandler explicitly sets the 
> "alreadyLogged" boolean to true in the SolrException constructor.




[jira] [Commented] (SOLR-2578) ReplicationHandler Backups -- clean up old backups

2012-02-27 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217602#comment-13217602
 ] 

James Dyer commented on SOLR-2578:
--

Thanks for pointing this out, Neil.  I opened SOLR-3168 for this bug and will 
commit the fix shortly.

> ReplicationHandler Backups -- clean up old backups
> --
>
> Key: SOLR-2578
> URL: https://issues.apache.org/jira/browse/SOLR-2578
> Project: Solr
>  Issue Type: Improvement
>  Components: replication (java)
>Affects Versions: 3.2, 4.0
>Reporter: James Dyer
>Assignee: James Dyer
>Priority: Minor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-2578.patch, SOLR-2578.patch, SOLR-2578_3x.patch
>
>
> It would be nice when performing backups if there was an easy way to tell 
> ReplicationHandler to only keep so many and then delete the older ones.




[jira] [Commented] (SOLR-3033) "numberToKeep" on replication handler does not work with "backupAfter"

2012-02-27 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217584#comment-13217584
 ] 

James Dyer commented on SOLR-3033:
--

New patch incorporates Hoss's suggestions from 1-12-2012.

- "numberToKeep" stays as a request param

- "maxNumberOfBackups" is introduced as an init param.  This is top-level and 
cannot be specified separately for masters & slaves.

- Trying to use both params results in a "BAD_REQUEST" (400) SolrException.

I will commit this shortly to Trunk & 3.x along with Wiki changes.
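In solrconfig.xml terms, the new init param reads roughly like this (the value 
and master section are illustrative):

```xml
<!-- "maxNumberOfBackups" as a top-level init param on the handler -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <str name="maxNumberOfBackups">1</str>
  <lst name="master">
    <str name="backupAfter">optimize</str>
  </lst>
</requestHandler>
```

The older style instead passes numberToKeep per request, e.g. 
/replication?command=backup&numberToKeep=1; supplying both now yields the 
400 described above.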

> "numberToKeep" on replication handler does not work with "backupAfter"
> --
>
> Key: SOLR-3033
> URL: https://issues.apache.org/jira/browse/SOLR-3033
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 3.5
> Environment: openjdk 1.6, linux 3.x
>Reporter: Torsten Krah
>Assignee: James Dyer
> Attachments: SOLR-3033.patch, SOLR-3033.patch
>
>
> Configured my replication handler like this:
> 
>  <requestHandler name="/replication" class="solr.ReplicationHandler">
>    <lst name="master">
>      <str name="replicateAfter">startup</str>
>      <str name="replicateAfter">commit</str>
>      <str name="replicateAfter">optimize</str>
>      <str 
> name="confFiles">elevate.xml,schema.xml,spellings.txt,stopwords.txt,stopwords_de.txt,stopwords_en.txt,synonyms_de.txt,synonyms.txt</str>
>      <str name="backupAfter">optimize</str>
>      <str name="numberToKeep">1</str>
>    </lst>
>  </requestHandler>
> 
> So after optimize a snapshot should be taken, and this works.  But 
> numberToKeep is ignored: snapshots increase with each call to optimize and 
> are kept forever.  This setting seems to have no effect.




[jira] [Commented] (SOLR-3161) Use of 'qt' should be restricted to searching and should not start with a '/'

2012-02-27 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217409#comment-13217409
 ] 

James Dyer commented on SOLR-3161:
--

I think it's nice that you can currently set up various handlers with all of 
the different parameters, etc. set up in your config, and then clients don't 
have to worry about setting them.  (i.e., what is the secret sauce for 
relevance, anyway, and which spelling dictionary goes with which "qf" list?, 
etc.)  It's just easier to have this all in the configuration.  This is the 
beauty of "qt", so whatever solution we find here, I'd really like it if this 
beauty doesn't get spoiled.

By the way, when we were converting an app from Endeca, we used "qt" to roughly 
emulate Endeca's "search interface" concept, which is basically like a dismax 
request handler that behaves as if it were a field.  Imagine having multiple 
"qt"s (Request Handlers) set up, each with its own "qf", spelling config, 
highlighter config, etc, and then being able to do something like this: 
q=Handler1:(+this +that) AND Handler2:(something else) .  Someday I would love 
to see this kind of enhancement (best I could tell you can't do anything like 
this even with local params).  But if we lock down qt too much or eliminate it 
altogether, we might make it harder to have this kind of possibility in the 
future.
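The "multiple handlers, each with its own secret sauce" setup reads roughly 
like this (field, handler, and dictionary names are invented for illustration):

```xml
<!-- each handler bundles its own relevance config, so clients
     only pick a handler name -->
<requestHandler name="/products" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">name^3 description</str>
    <str name="spellcheck.dictionary">products</str>
  </lst>
</requestHandler>
<requestHandler name="/authors" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">author_name^5 bio</str>
    <str name="spellcheck.dictionary">authors</str>
  </lst>
</requestHandler>
```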

> Use of 'qt' should be restricted to searching and should not start with a '/'
> -
>
> Key: SOLR-3161
> URL: https://issues.apache.org/jira/browse/SOLR-3161
> Project: Solr
>  Issue Type: Improvement
>  Components: search, web gui
>Reporter: David Smiley
>Assignee: David Smiley
> Fix For: 3.6, 4.0
>
>
> I haven't yet looked at the code involved for suggestions here; I'm speaking 
> based on how I think things should work and not work, based on intuitiveness 
> and security. In general I feel it is best practice to use '/' leading 
> request handler names and not use "qt", but I don't hate it enough when used 
> in limited (search-only) circumstances to propose its demise. But if someone 
> proposes its deprecation then I am +1 for that.
> Here is my proposal:
> Solr should error if the parameter "qt" is supplied with a leading '/'. 
> (trunk only)
> Solr should only honor "qt" if the target request handler extends 
> solr.SearchHandler.
> The new admin UI should only use 'qt' when it has to. For the query screen, 
> it could present a little pop-up menu of handlers to choose from, including 
> "/select?qt=mycustom" for handlers that aren't named with a leading '/'. This 
> choice should be positioned at the top.
> And before I forget, me or someone should investigate if there are any 
> similar security problems with the shards.qt parameter. Perhaps shards.qt can 
> abide by the same rules outlined above.
> Does anyone foresee any problems with this proposal?
> On a related subject, I think the notion of a default request handler is bad 
> - the default="true" thing. Honestly I'm not sure what it does, since I 
> noticed Solr trunk redirects '/solr/' to the new admin UI at '/solr/#/'. 
> Assuming it doesn't do anything useful anymore, I think it would be clearer 
> to use  instead of 
> what's there now. The delta is to put the leading '/' on this request handler 
> name, and remove the "default" attribute.




[jira] [Commented] (SOLR-3011) DIH MultiThreaded bug

2012-02-26 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217044#comment-13217044
 ] 

James Dyer commented on SOLR-3011:
--

Mikhail,

I am very interested in getting the DIH threading bugs fixed.  However, it 
might be a few weeks until I'll have time to give this issue&patch the time it 
deserves.  Unless someone beats me to this, I will gladly work with you on 
getting these fixes committed.

> DIH MultiThreaded bug
> -
>
> Key: SOLR-3011
> URL: https://issues.apache.org/jira/browse/SOLR-3011
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 3.5, 4.0
>Reporter: Mikhail Khludnev
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3011.patch, SOLR-3011.patch
>
>
> The current DIH design is not thread safe; see the last comments at SOLR-2382 
> and SOLR-2947.  I'm going to provide a patch that makes the DIH core 
> thread-safe.  Mostly it's the SOLR-2947 patch from 28th Dec. 




[jira] [Commented] (SOLR-2947) DIH caching bug - EntityRunner destroys child entity processor

2012-02-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209489#comment-13209489
 ] 

James Dyer commented on SOLR-2947:
--

This bug was caused by SOLR-2382 which is trunk-only.  Do we need a CHANGES.txt 
entry for that?

> DIH caching bug - EntityRunner destroys child entity processor
> --
>
> Key: SOLR-2947
> URL: https://issues.apache.org/jira/browse/SOLR-2947
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
>Reporter: Mikhail Khludnev
>Assignee: James Dyer
>  Labels: noob
> Fix For: 4.0
>
> Attachments: SOLR-2947.patch, SOLR-2947.patch, SOLR-2947.patch, 
> SOLR-2947.patch, SOLR-2947.patch, SOLR-2947.patch, 
> dih-cache-destroy-on-threads-fix.patch, dih-cache-threads-enabling-bug.patch
>
>
> My intention is to fix multithreaded import with the SQL cache.  Here is the 
> 2nd stage.  If I enable the DocBuilder.EntityRunner flow even for a single 
> thread, it breaks pretty basic functionality: the parent-child join.
> The reason is that [line 473 
> entityProcessor.destroy();|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java?revision=1201659&view=markup]
>  breaks the child entityProcessors.
> See the attachment comments for more details. 




[jira] [Commented] (SOLR-2933) DIHCacheSupport ignores left side of where="xid=x.id" attribute

2012-02-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209462#comment-13209462
 ] 

James Dyer commented on SOLR-2933:
--

I will commit this one shortly.

> DIHCacheSupport ignores left side of where="xid=x.id" attribute
> ---
>
> Key: SOLR-2933
> URL: https://issues.apache.org/jira/browse/SOLR-2933
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
>Reporter: Mikhail Khludnev
>Assignee: James Dyer
>Priority: Minor
>  Labels: noob, random
> Fix For: 3.6, 4.0
>
> Attachments: 
> AbstractDataImportHandlerTestCase.java-choose-map-randomly.patch, 
> SOLR-2933.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> DIHCacheSupport, introduced in SOLR-2382, uses the new config attributes cachePk 
> and cacheLookup. But support for the old where="xid=x.id" attribute is broken by 
> [DIHCacheSupport|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHCacheSupport.java?view=markup]:
>  it never puts the where="" sides into the context. This is masked by 
> [SortedMapBackedCache|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup],
>  which takes just the first column as the primary key. That's why all the tests are 
> green.
> To reproduce the issue, I just need to reorder the entry at [line 
> 219|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestCachedSqlEntityProcessor.java?revision=1201659&view=markup]
>  so that "desc" comes first and is picked up as the primary key. 
> To do that, I propose choosing a concrete map class randomly for all DIH test 
> cases in 
> [createMap()|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/AbstractDataImportHandlerTestCase.java?revision=1149600&view=markup].
>  
> I'm attaching a test-breaking patch and seed.




[jira] [Commented] (SOLR-2947) DIH caching bug - EntityRunner destroys child entity processor

2012-02-10 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205644#comment-13205644
 ] 

James Dyer commented on SOLR-2947:
--

I'm not set up yet and I'm not sure how long it'll be.  But I do want to get the DIH 
bugs taken care of ASAP.

> DIH caching bug - EntityRunner destroys child entity processor
> --
>
> Key: SOLR-2947
> URL: https://issues.apache.org/jira/browse/SOLR-2947
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
>Reporter: Mikhail Khludnev
>  Labels: noob
> Fix For: 4.0
>
> Attachments: SOLR-2947.patch, SOLR-2947.patch, SOLR-2947.patch, 
> SOLR-2947.patch, SOLR-2947.patch, dih-cache-destroy-on-threads-fix.patch, 
> dih-cache-threads-enabling-bug.patch
>
>
> My intention is to fix multithreaded import with the SQL cache. Here is the 2nd 
> stage. If I enable the DocBuilder.EntityRunner flow even for a single thread, it 
> breaks pretty basic functionality: the parent-child join.
> The reason is that [line 473 
> entityProcessor.destroy();|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java?revision=1201659&view=markup]
>  breaks the children's entityProcessor.
> See the attachment comments for more details. 




[jira] [Commented] (SOLR-2191) Change SolrException cstrs that take Throwable to default to alreadyLogged=false

2012-02-09 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204672#comment-13204672
 ] 

James Dyer commented on SOLR-2191:
--

With SOLR-3022 and SOLR-3032 complete, should this one be closed also?

> Change SolrException cstrs that take Throwable to default to 
> alreadyLogged=false
> 
>
> Key: SOLR-2191
> URL: https://issues.apache.org/jira/browse/SOLR-2191
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-2191.patch
>
>
> Because of misuse, many exceptions are now not logged at all - can be painful 
> when doing dev. I think we should flip this setting and work at removing any 
> double logging - losing logging is worse (and it almost looks like we lose 
> more logging than we would get in double logging) - and bad 
> solrexception/logging patterns are proliferating.




[jira] [Commented] (SOLR-2649) MM ignored in edismax queries with operators

2012-02-02 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199286#comment-13199286
 ] 

James Dyer commented on SOLR-2649:
--

It seems it would be simpler to implement and understand if we just counted up 
the optional words in the query and applied "mm" to those.  I suppose you could 
create a subtle rule that naked terms count toward "mm" but OR-ed terms do not.  
This might be functionality someone wants, but then again it might confuse 
others who would expect "x OR y" to mean the same as "x y".  

Counting multiple terms as 1 because they are in parentheses together doesn't 
seem like a good idea to me.  But then again, maybe someone out there would 
appreciate all the subtle things you could do with this?

I guess whatever is decided just needs to be well-documented so when/if someone 
is surprised by the functionality they can look it up and see what's going on.  
Whatever is done, it will be a nice improvement over the current behavior.

> MM ignored in edismax queries with operators
> 
>
> Key: SOLR-2649
> URL: https://issues.apache.org/jira/browse/SOLR-2649
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.3
>Reporter: Magnus Bergmark
>Priority: Minor
>
> Hypothetical scenario:
>   1. User searches for "stocks oil gold" with MM set to "50%"
>   2. User adds "-stockings" to the query: "stocks oil gold -stockings"
>   3. User gets no hits since MM was ignored and all terms were AND-ed 
> together
> The behavior seems to be intentional, although the reason why is never 
> explained:
>   // For correct lucene queries, turn off mm processing if there
>   // were explicit operators (except for AND).
>   boolean doMinMatched = (numOR + numNOT + numPluses + numMinuses) == 0; 
> (lines 232-234 taken from 
> tags/lucene_solr_3_3/solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java)
> This makes edismax unsuitable as a replacement for dismax; mm is one of the 
> primary features of dismax.




[jira] [Commented] (SOLR-2649) MM ignored in edismax queries with operators

2012-02-02 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199194#comment-13199194
 ] 

James Dyer commented on SOLR-2649:
--

Maybe a simple answer is to have it make "mm" apply to all optional terms and 
ignore the rest.  So for...
{noformat}
q=word1 AND word2 word3&mm=50%
{noformat}
..."word3" is the only optional term, so mm=50% only applies to "word3".

And for...
{noformat}
q=word1 OR word2 word3 word4 word5&mm=50%
{noformat}
...Everything here is optional, so "mm" applies to all the terms.  Otherwise, 
you'd be in a situation where "OR" takes on a meaning that is different from 
"optional" and I'm not sure you want to introduce a 4th concept here beyond 
what we already have: required/optional/prohibited.

The semantics of "mm" would then become "the minimum number of optional terms 
that need to match".
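The proposed semantics can be sketched as follows — a minimal illustration, assuming the usual round-down behavior of a percentage "mm"; the class and method names are invented, not Solr's API:

```java
// Sketch of the proposed rule: "mm" counts only the optional clauses.
// This illustrates the idea above; it is not ExtendedDismaxQParser code.
public class MinShouldMatchSketch {
    // How many optional clauses must match for a given mm percentage,
    // assuming percentages round down as in min-should-match handling.
    static int minOptionalMatches(int optionalClauseCount, int mmPercent) {
        return (optionalClauseCount * mmPercent) / 100;
    }

    public static void main(String[] args) {
        // q=word1 AND word2 word3 -> only word3 is optional
        System.out.println(minOptionalMatches(1, 50)); // 0
        // q=word1 OR word2 word3 word4 word5 -> all five are optional
        System.out.println(minOptionalMatches(5, 50)); // 2
    }
}
```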

> MM ignored in edismax queries with operators
> 
>
> Key: SOLR-2649
> URL: https://issues.apache.org/jira/browse/SOLR-2649
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 3.3
>Reporter: Magnus Bergmark
>Priority: Minor
>
> Hypothetical scenario:
>   1. User searches for "stocks oil gold" with MM set to "50%"
>   2. User adds "-stockings" to the query: "stocks oil gold -stockings"
>   3. User gets no hits since MM was ignored and all terms were AND-ed 
> together
> The behavior seems to be intentional, although the reason why is never 
> explained:
>   // For correct lucene queries, turn off mm processing if there
>   // were explicit operators (except for AND).
>   boolean doMinMatched = (numOR + numNOT + numPluses + numMinuses) == 0; 
> (lines 232-234 taken from 
> tags/lucene_solr_3_3/solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java)
> This makes edismax unsuitable as a replacement for dismax; mm is one of the 
> primary features of dismax.




[jira] [Commented] (SOLR-3033) "numberToKeep" on replication handler does not work with "backupAfter"

2012-02-01 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198114#comment-13198114
 ] 

James Dyer commented on SOLR-3033:
--

I agree we should have it as a request param for backwards compatibility, and 
allow it as a better-named initParam.  Clear documentation in the wiki would be 
in order.  Two things though:

1. Maybe we should have the request param override the init param rather than 
generate an error.  This is consistent with how handler params work in general.  
As it was lost on me, many people won't appreciate the subtle difference 
between an init-param and a request-param in this case and will just want it to 
behave like any other handler.  (This is a moot point if we are just removing the 
init-param from 4.x and keeping it, deprecated, in 3.x.)
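Point 1 amounts to a simple precedence rule, sketched here with invented names (not ReplicationHandler's actual API):

```java
import java.util.Map;

// Sketch of option 1 above: a per-request parameter, when present,
// overrides the configured init-param default instead of raising an
// error. Names and the lookup style are illustrative only.
public class ParamPrecedenceSketch {
    static int resolveNumberToKeep(Map<String, String> requestParams, int initDefault) {
        String v = requestParams.get("numberToKeep");
        return (v != null) ? Integer.parseInt(v) : initDefault; // request param wins
    }

    public static void main(String[] args) {
        System.out.println(resolveNumberToKeep(Map.of("numberToKeep", "3"), 1)); // 3
        System.out.println(resolveNumberToKeep(Map.of(), 1)); // 1
    }
}
```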

2. Are you saying that we should require this param to be outside <master> and 
<slave>, thus avoiding the conflict if a node is a repeater?  We could allow 
it inside <master> and <slave> and document that, in the case of a repeater, 
the value in <master> takes precedence over the value in <slave>.  This is 
more confusing for the repeater case, but simpler in that it seems every other 
init parameter gets specified separately for slaves and master.

Once again I don't have a strong preference here but the thoughts occurred...

> "numberToKeep" on replication handler does not work with "backupAfter"
> --
>
> Key: SOLR-3033
> URL: https://issues.apache.org/jira/browse/SOLR-3033
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 3.5
> Environment: openjdk 1.6, linux 3.x
>Reporter: Torsten Krah
> Attachments: SOLR-3033.patch
>
>
> Configured my replication handler like this:
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="replicateAfter">startup</str>
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">optimize</str>
>     <str 
> name="confFiles">elevate.xml,schema.xml,spellings.txt,stopwords.txt,stopwords_de.txt,stopwords_en.txt,synonyms_de.txt,synonyms.txt</str>
>     <str name="backupAfter">optimize</str>
>     <str name="numberToKeep">1</str>
>   </lst>
> </requestHandler>
> So after optimize a snapshot should be taken, this works. But numberToKeep is 
> ignored, snapshots are increasing with each call to optimize and are kept 
> forever. Seems these settings have no effect.




[jira] [Commented] (SOLR-3033) "numberToKeep" on replication handler does not work with "backupAfter"

2012-02-01 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197948#comment-13197948
 ] 

James Dyer commented on SOLR-3033:
--

{quote}
So instead of adding an init param version of numberToKeep, perhaps it would be 
better if the "backupAfter" codepath followed the same code path as 
handleRequest as much as possible?
{quote}
It wasn't so much my intention to add an init-param but to make a way to give a 
default value for this in solrconfig.xml as you can for other handlers.  
Without a way to declare a default in solrconfig.xml, the user has no way to 
use this parameter should a backup be triggered by "backupAfter".  

When I looked at this, it didn't seem that ReplicationHandler follows the 
normal rules.  We don't have a <defaults> section for request parameters, do 
we?  And looking at the available request parameters, we probably wouldn't want 
defaults for any of them (see 
http://wiki.apache.org/solr/SolrReplication#HTTP_API).  

This makes me wonder if my first try was a mistake.  Possibly this should 
_only_ be an init-param.  This would let users configure how many to keep on 
the Master, and how many to keep on the Slave.  We don't let users change poll 
intervals with request params, so why let them change the archive policy with 
request params?

If we kept it as a request-param only, but then let the user specify defaults, 
would that create a legal <defaults> and <invariants> section nested within 
<master> and <slave>, so users can specify defaults for each?

I don't have a strong feeling on this and would change the patch to work any 
way you thought best.  Somehow it seems that "numberToKeep" needs to have a 
default setting somewhere, somehow, so it will work with "backupAfter".

> "numberToKeep" on replication handler does not work with "backupAfter"
> --
>
> Key: SOLR-3033
> URL: https://issues.apache.org/jira/browse/SOLR-3033
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 3.5
> Environment: openjdk 1.6, linux 3.x
>Reporter: Torsten Krah
> Attachments: SOLR-3033.patch
>
>
> Configured my replication handler like this:
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="replicateAfter">startup</str>
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">optimize</str>
>     <str 
> name="confFiles">elevate.xml,schema.xml,spellings.txt,stopwords.txt,stopwords_de.txt,stopwords_en.txt,synonyms_de.txt,synonyms.txt</str>
>     <str name="backupAfter">optimize</str>
>     <str name="numberToKeep">1</str>
>   </lst>
> </requestHandler>
> So after optimize a snapshot should be taken, this works. But numberToKeep is 
> ignored, snapshots are increasing with each call to optimize and are kept 
> forever. Seems these settings have no effect.




[jira] [Commented] (SOLR-2947) DIH caching bug - EntityRunner destroys child entity processor

2012-01-17 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187889#comment-13187889
 ] 

James Dyer commented on SOLR-2947:
--

I think this patch does the right thing here, calling "destroy()" down the 
hierarchy of EntityProcessors, but waiting until after doc-building is 
complete.  While I had it this way for the single-threaded code, I punted on 
the multi-threaded case, simply hoping that because the unit tests were passing, 
everything would be alright :) .  I appreciate the effort to improve the 
DIH multithreaded code.  We really need to get rid of bugs like this, and 
long-term it would pay off if we could make the code more maintainable, get 
better test coverage, etc.  

An example is the new "children()" method: using just the first 
ThreadedEntityProcessorWrapper from the list is, I think, valid because the 
"children" will be the same on all the threads.  But then again, looking at how 
this all gets populated in the ThreadedEntityProcessorWrapper constructor, the 
answer (to me) isn't obvious.  The best I can say is that this is probably correct 
and certainly a vast improvement over what is currently in trunk.  

A small point, but I prefer the TestEphemeralCache changes I made in the Dec 
11, 2011 patch version.  I switched to building the config file on-the-fly, and 
testMultiThreaded() uses a random number of threads instead of always using 10. 
 Of course, if we go with this, then we'd need to add "@Ignore" to 
testMultiThreaded() until SOLR-3011 can be committed.

> DIH caching bug - EntityRunner destroys child entity processor
> --
>
> Key: SOLR-2947
> URL: https://issues.apache.org/jira/browse/SOLR-2947
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
>Reporter: Mikhail Khludnev
>  Labels: noob
> Fix For: 4.0
>
> Attachments: SOLR-2947.patch, SOLR-2947.patch, SOLR-2947.patch, 
> SOLR-2947.patch, SOLR-2947.patch, dih-cache-destroy-on-threads-fix.patch, 
> dih-cache-threads-enabling-bug.patch
>
>
> My intention is to fix multithreaded import with the SQL cache. Here is the 2nd 
> stage. If I enable the DocBuilder.EntityRunner flow even for a single thread, it 
> breaks pretty basic functionality: the parent-child join.
> The reason is that [line 473 
> entityProcessor.destroy();|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java?revision=1201659&view=markup]
>  breaks the children's entityProcessor.
> See the attachment comments for more details. 




[jira] [Commented] (SOLR-3033) numberToKeep on replication handler does not work - snapshots are increasing beyond configured maximum

2012-01-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187268#comment-13187268
 ] 

James Dyer commented on SOLR-3033:
--

I see what you're saying.  I wasn't thinking about using the "backupAfter" 
parameter at all.  It seems reasonable then that, because of "backupAfter", we 
would have to support having "numberToKeep" in the configuration. 

As a workaround, you can execute backups by calling the command url.  The 
syntax is:
{noformat}
http://master_host:port/solr/replication?command=backup&numberToKeep=1
{noformat}

See http://wiki.apache.org/solr/SolrReplication#HTTP_API for more information.

> numberToKeep on replication handler does not work - snapshots are increasing 
> beyond configured maximum
> --
>
> Key: SOLR-3033
> URL: https://issues.apache.org/jira/browse/SOLR-3033
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 3.5
> Environment: openjdk 1.6, linux 3.x
>Reporter: Torsten Krah
>
> Configured my replication handler like this:
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="replicateAfter">startup</str>
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">optimize</str>
>     <str 
> name="confFiles">elevate.xml,schema.xml,spellings.txt,stopwords.txt,stopwords_de.txt,stopwords_en.txt,synonyms_de.txt,synonyms.txt</str>
>     <str name="backupAfter">optimize</str>
>     <str name="numberToKeep">1</str>
>   </lst>
> </requestHandler>
> So after optimize a snapshot should be taken, this works. But numberToKeep is 
> ignored, snapshots are increasing with each call to optimize and are kept 
> forever. Seems these settings have no effect.




[jira] [Commented] (SOLR-3033) numberToKeep on replication handler does not work - snapshots are increasing beyond configured maximum

2012-01-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187047#comment-13187047
 ] 

James Dyer commented on SOLR-3033:
--

Torsten,

I think for this to work you need to put the "numberToKeep" parameter in the 
request URL, not in the configuration file.  I realize this is counterintuitive 
because most request handlers let you specify parameters either way.  The reason 
it doesn't work in the config file might be because of how parameters have to be 
nested within the replication handler's configuration element.  So try putting 
this in the URL and see if that works for you.

In any case, I'm thinking the only thing to do for this is perhaps to clarify 
this point in the wiki.  Even if we could fix this, it wouldn't be appropriate 
to put this parameter in every replication request, as typically you'd use the 
same handler for both replication and backups, and this one applies to backups 
only.  Does anyone have any thoughts about this?

> numberToKeep on replication handler does not work - snapshots are increasing 
> beyond configured maximum
> --
>
> Key: SOLR-3033
> URL: https://issues.apache.org/jira/browse/SOLR-3033
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 3.5
> Environment: openjdk 1.6, linux 3.x
>Reporter: Torsten Krah
>
> Configured my replication handler like this:
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="replicateAfter">startup</str>
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">optimize</str>
>     <str 
> name="confFiles">elevate.xml,schema.xml,spellings.txt,stopwords.txt,stopwords_de.txt,stopwords_en.txt,synonyms_de.txt,synonyms.txt</str>
>     <str name="backupAfter">optimize</str>
>     <str name="numberToKeep">1</str>
>   </lst>
> </requestHandler>
> So after optimize a snapshot should be taken, this works. But numberToKeep is 
> ignored, snapshots are increasing with each call to optimize and are kept 
> forever. Seems these settings have no effect.




[jira] [Commented] (SOLR-3025) spellcheck.onlyMorePopular=true/false incorrectly shows correctlySpelled = false/true

2012-01-11 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184128#comment-13184128
 ] 

James Dyer commented on SOLR-3025:
--

This appears to be a duplicate of SOLR-2555.  Also see SOLR-2585 which tries to 
address this with alternative functionality.

> spellcheck.onlyMorePopular=true/false incorrectly shows correctlySpelled = 
> false/true
> -
>
> Key: SOLR-3025
> URL: https://issues.apache.org/jira/browse/SOLR-3025
> Project: Solr
>  Issue Type: Bug
>  Components: spellchecker
>Affects Versions: 4.0
> Environment: apache-solr-4.0-2012-01-02_08-53-41.zip
>Reporter: Antony Stubbs
>
> Setting spellcheck.onlyMorePopular to true, causes my correctlySpelled output 
> to read false. Switching onlyMorePopular back to false, makes 
> correctlySpelled = true.
> Either it's a bug, or the meaning of correctlySpelled is convoluted, and 
> needs augmenting/ correcting.




[jira] [Commented] (SOLR-2993) Integrate WordBreakSpellChecker with Solr

2012-01-10 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183339#comment-13183339
 ] 

James Dyer commented on SOLR-2993:
--

{quote}
So should it be possible to get the suggestion "spellcheck" from "spell check", 
or not?  Note: I do get suggestions for terms that are in the index.
{quote}

When combining words, it requires that _at least one_ of the original terms 
not be in the index.  

So to use your example, WordBreakSpellChecker will combine "spell check" to 
"spellcheck" provided that:
1. "spellcheck" is in the index.
2. either:
 - "spell" is NOT in the index.
   -OR-
 - "check" is NOT in the index.
   -OR-
 - both "spell" and "check" are NOT in the index.

But if both "spell" and "check" are in the index, then you won't get 
"spellcheck" as a suggestion.  You can override this behavior if:
1. You specify "onlyMorePopular".  This works if "spellcheck" has a document 
frequency that is greater than or equal to the higher document frequency of 
"spell" and "check".
2. You apply SOLR-2585 (theoretically...not possible yet) and set 
"spellcheck.alternativeTermCount" greater than zero.  This would tell it to 
generate alternative term suggestions for indexed terms.
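The base combining rule above can be sketched as follows; the class and method names are invented for illustration and are not WordBreakSpellChecker's actual API:

```java
import java.util.Set;

// Illustrative sketch of the combining rule described above: a combined
// suggestion is offered only when the joined word is indexed AND at least
// one of the original fragments is not.
public class CombineRuleSketch {
    static boolean suggestCombination(Set<String> indexedTerms, String left, String right) {
        String combined = left + right;
        if (!indexedTerms.contains(combined)) {
            return false; // the combined word must be in the index
        }
        // ...and at least one original fragment must NOT be in the index
        return !indexedTerms.contains(left) || !indexedTerms.contains(right);
    }

    public static void main(String[] args) {
        // "spell" is not indexed -> "spell check" combines to "spellcheck"
        System.out.println(suggestCombination(Set.of("spellcheck", "check"), "spell", "check")); // true
        // both fragments indexed -> no combination (absent an override)
        System.out.println(suggestCombination(Set.of("spellcheck", "spell", "check"), "spell", "check")); // false
    }
}
```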

If this is not consistent with what you're experiencing then there is a 
possible bug in the WordBreakSpellChecker.  In that case, please provide as 
many details as possible (or write a failing unit test) and I can look into it 
further.

> Integrate WordBreakSpellChecker with Solr
> -
>
> Key: SOLR-2993
> URL: https://issues.apache.org/jira/browse/SOLR-2993
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud, spellchecker
>Affects Versions: 4.0
>Reporter: James Dyer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-2993.patch
>
>
> A SpellCheckComponent enhancement, leveraging the WordBreakSpellChecker from 
> LUCENE-3523:
> - Detect spelling errors resulting from misplaced whitespace without the use 
> of shingle-based dictionaries.  
> - Seamlessly integrate word-break suggestions with single-word spelling 
> corrections from the existing FileBased-, IndexBased- or Direct- spell 
> checkers.  
> - Provide collation support for word-break errors including cases where the 
> user has a mix of single-word spelling errors and word-break errors in the 
> same query.  
> - Provide shard support.




[jira] [Commented] (SOLR-2993) Integrate WordBreakSpellChecker with Solr

2012-01-09 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182790#comment-13182790
 ] 

James Dyer commented on SOLR-2993:
--

{quote}
So "spa llcheck" is preferred over "spellcheck" if spa has more hits than 
spellcheck.
{quote}
I honestly didn't try this much with queries having all optional terms.  I see 
what you mean, though: you might prefer that it just leave the misspelled word 
in there if it's an optional term anyhow.  But wouldn't the query, in addition 
to giving spelling suggestions, also return some results, because it would 
ignore the optional, misspelled query terms?  If that's the case, your app can 
look at the results you got back, compare them to the collation options, and 
determine what to do from there.

{quote}
no suggestions were given when both word fragments were spelled correctly
{quote}
As discussed in SOLR-2585, you can't get suggestions for terms that are in the 
index, unless you specify "spellcheck.onlyMorePopular=true".  Of course 
"onlyMorePopular" can have its own unintended consequences.  Hopefully someday 
in the not too distant future we'll be in a state where we can have both this 
issue and SOLR-2585 working together.

> Integrate WordBreakSpellChecker with Solr
> -
>
> Key: SOLR-2993
> URL: https://issues.apache.org/jira/browse/SOLR-2993
> Project: Solr
>  Issue Type: Improvement
>  Components: SolrCloud, spellchecker
>Affects Versions: 4.0
>Reporter: James Dyer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-2993.patch
>
>
> A SpellCheckComponent enhancement, leveraging the WordBreakSpellChecker from 
> LUCENE-3523:
> - Detect spelling errors resulting from misplaced whitespace without the use 
> of shingle-based dictionaries.  
> - Seamlessly integrate word-break suggestions with single-word spelling 
> corrections from the existing FileBased-, IndexBased- or Direct- spell 
> checkers.  
> - Provide collation support for word-break errors including cases where the 
> user has a mix of single-word spelling errors and word-break errors in the 
> same query.  
> - Provide shard support.




[jira] [Commented] (SOLR-2993) Integrate WordBreakSpellChecker with Solr

2012-01-09 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182577#comment-13182577
 ] 

James Dyer commented on SOLR-2993:
--

Okke,

Thanks for looking at this patch.  Here are a few comments:

{quote}
if both word parts resulted in suggestions, the collation made no sense.
{quote}
This is a problem with collations in general:  By default, it simply mashes the 
top corrections together, often resulting in nonsense.  The solution is to set 
"spellcheck.maxCollationTries" to a non-zero value.  Doing so will cause the 
spellchecker to vet the collation possibilities against the index, resulting in 
collations that are guaranteed to generate hits.
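The vetting idea can be sketched roughly like this (hypothetical names, not Solr's internal code — the real implementation re-runs the query against the index; here a simple hit-test predicate stands in for that):

```java
import java.util.Arrays;
import java.util.List;

public class CollationVetting {
    // Stand-in for "run this query against the index and see if it matches".
    interface HitTest { boolean producesHits(String query); }

    // Try up to maxTries candidate collations in rank order, returning the
    // first one the hit test confirms would produce results, else null.
    static String vet(List<String> candidates, HitTest index, int maxTries) {
        int tries = 0;
        for (String candidate : candidates) {
            if (tries++ >= maxTries) break;
            if (index.producesHits(candidate)) return candidate;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("spa spellcheck", "spell check");
        // Toy index: only queries mentioning "spell check" match any document.
        HitTest index = q -> q.contains("spell check");
        System.out.println(vet(candidates, index, 10)); // prints spell check
    }
}
```

With maxTries=0 no vetting happens and the top corrections are mashed together unchecked, which is the nonsense-collation behavior described above.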

{quote}
"spe llcheck" would give suggestions "spa" and "spellcheck" and collate this 
into "spa spellcheck"
{quote}
This is surprising to me and might indicate a bug.  This patch is designed to 
carefully ensure that when building collations, the corrections do not overlap 
one another.  For instance if "q=spe llcheck" and it gives corrections of 
"spe>spa" and "spe llcheck>spellcheck", it should not collate these to "q=spa 
spellcheck" because "spe" overlaps with "spe llcheck".  So if you can describe 
in detail what you're indexing and querying (maybe paste the resulting xml), it 
would help me figure out what's going on.  Better yet, if you can write a 
failing unit test and post a patch...
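The no-overlap rule described above amounts to a span-intersection check. A minimal sketch (illustrative names only, not the patch's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapCheck {
    // A correction replaces the original query text between [start, end).
    static class Correction {
        final int start, end;
        final String replacement;
        Correction(int start, int end, String replacement) {
            this.start = start; this.end = end; this.replacement = replacement;
        }
    }

    // Two corrections clash when their original-text spans intersect.
    static boolean overlaps(Correction a, Correction b) {
        return a.start < b.end && b.start < a.end;
    }

    // Greedily accept corrections, rejecting any that overlap one already taken.
    static List<Correction> buildCollation(List<Correction> candidates) {
        List<Correction> accepted = new ArrayList<>();
        for (Correction c : candidates) {
            boolean clash = false;
            for (Correction a : accepted) {
                if (overlaps(a, c)) { clash = true; break; }
            }
            if (!clash) accepted.add(c);
        }
        return accepted;
    }

    public static void main(String[] args) {
        // q = "spe llcheck": "spe">"spa" covers [0,3); "spe llcheck">"spellcheck"
        // covers [0,11).  The spans intersect, so only one correction applies.
        List<Correction> candidates = new ArrayList<>();
        candidates.add(new Correction(0, 3, "spa"));
        candidates.add(new Correction(0, 11, "spellcheck"));
        System.out.println(buildCollation(candidates).size()); // prints 1
    }
}
```

If Okke's "spa spellcheck" collation really did get built, both overlapping corrections must have been accepted, which is what would indicate a bug.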

{quote}
I never got any results back when one of the parts had a typo. So "spe llchek" 
would not give any suggestions.
{quote}
This patch does not have the ability to first correct a word fragment and then 
combine it with another fragment to make a corrected word.  Possibly this would 
be a good next step after what we've got here already gets worked out.

{quote}
it would also be handy if "spell check" would result in the suggestion 
"spellcheck".  Or is this already possible?
{quote}
This is the core of what this issue (really LUCENE-3523) is all about, provided 
that "spellcheck" is in the dictionary and index you're using.



[jira] [Commented] (SOLR-2999) spellcheck-index is rebuilt on commit if optimized

2012-01-05 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180469#comment-13180469
 ] 

James Dyer commented on SOLR-2999:
--

Another reason removing this functionality might make sense: users can just 
put a build command in the firstSearcher/newSearcher warming queries, ensuring 
a build whenever a new searcher is opened.  I've found this a better option 
than using buildOnCommit/buildOnOptimize.  We have one index that does 
incremental updates and another that rebuilds nightly, and it's a good technique 
in both situations.
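A minimal sketch of that approach in solrconfig.xml (the listener event and QuerySenderListener class are standard Solr config; the query text and dictionary name here are illustrative):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- Any cheap query works; the point is the spellcheck.build flag. -->
      <str name="q">solr</str>
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck.build">true</str>
    </lst>
  </arr>
</listener>
```

Repeating the same listener for event="newSearcher" rebuilds the dictionary after each commit that opens a new searcher.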

>  spellcheck-index is rebuilt on commit if optimized
> ---
>
> Key: SOLR-2999
> URL: https://issues.apache.org/jira/browse/SOLR-2999
> Project: Solr
>  Issue Type: Bug
>  Components: spellchecker
>Affects Versions: 3.1, 3.2, 3.3, 3.4, 3.5, 4.0
>Reporter: Oliver Schihin
>Priority: Minor
> Fix For: 3.6, 4.0
>
>
> If an empty commit (i.e. without having posted new documents) is issued on an 
> optimized index, the spellcheck-index is rebuilt even though solrconfig 
> defines buildOnOptimize=true, not buildOnCommit=true.
> The problem was discovered on solr 4.0 but seems to happen on 3.x, too. 
> Discussion and further information can be found on the list 
> (http://lucene.472066.n3.nabble.com/spellcheck-index-is-rebuilt-on-commit-tp3626492p3626492.html)



[jira] [Commented] (SOLR-2993) Integrate WordBreakSpellChecker with Solr

2012-01-03 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178775#comment-13178775
 ] 

James Dyer commented on SOLR-2993:
--

Okke,

Thanks for your interest.  For now you may need to evaluate the features 
separately.  Possibly you could vote for your favorite one.  Should either 
issue get committed, I will sync the other issue to the updated state of Trunk. 
 Then we can have both at the same time.  If there isn't any movement on these 
2 for a long time maybe I'd consider merging the patches but that seems like an 
unnecessary step.  It would be nice if one of the first 4.x releases included 
both of these features... 




[jira] [Commented] (SOLR-2993) Integrate WordBreakSpellChecker with Solr

2011-12-29 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177295#comment-13177295
 ] 

James Dyer commented on SOLR-2993:
--

Also included with the patch are several new unit tests, including one 
distributed/shard test scenario.




[jira] [Commented] (SOLR-2549) DIH LineEntityProcessor support for delimited & fixed-width files

2011-12-13 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168660#comment-13168660
 ] 

James Dyer commented on SOLR-2549:
--

The dependency here on SOLR-2943 is only for the "DIHCacheTypes" enum, which 
defines data types for each flat file column of data.  This is particularly 
helpful when joining to SQL data sources as DIH requires the join keys be the 
same type.  It might be beneficial to rename the enum to "DIHType" or something 
more generic, should either issue become a candidate for commit.

> DIH LineEntityProcessor support for delimited & fixed-width files
> -
>
> Key: SOLR-2549
> URL: https://issues.apache.org/jira/browse/SOLR-2549
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2549.patch, SOLR-2549.patch, SOLR-2549.patch
>
>
> Provides support for Fixed Width and Delimited Files without needing to write 
> a Transformer. 
> The following xml properties are supported with this version of 
> LineEntityProcessor:
> For fixed width files:
>  - colDef[#]
> For Delimited files:
>  - fieldDelimiterRegex
>  - firstLineHasFieldnames
>  - delimitedFieldNames
>  - delimitedFieldTypes
> These properties are described in the api documentation.  See patch.
> When combined with the cache improvements from SOLR-2382 this allows you to 
> join a flat file entity with other entities (sql, etc).



[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB

2011-12-07 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164531#comment-13164531
 ] 

James Dyer commented on LUCENE-3298:


Carlos,

I'm not sure how much help this is, but you might be able to eke out a little 
performance if you can tighten RewritablePagedBytes.copyBytes().  You'll 
note it currently moves the From-Bytes into a temp array, then writes that back 
to the FST at the To-Bytes location.  Note also that in the one place this gets 
called, it used to be a simple "System.arraycopy".  So if you can make it copy 
in-place, that might claw back the performance loss a little.  Beyond this, a 
different pair of eyes might find more ways to optimize.  In the end, though, you 
will likely never make it perform quite as well as the simple array.
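One hedged note on the in-place idea: for a single contiguous array, System.arraycopy is specified to behave as if it copies through a temp array, so overlapping source and destination ranges in the same array are safe and no explicit temp buffer is needed at that level. The harder part in the patch is that the paged structure is not one contiguous array. A minimal illustration of the overlap semantics:

```java
import java.util.Arrays;

public class InPlaceCopy {
    public static void main(String[] args) {
        byte[] buf = {1, 2, 3, 4, 5, 6, 7, 8};
        // Copy bytes [0,5) onto [2,7) within the same array.  arraycopy acts
        // as if it first copied the source range to a temp array, so the
        // overlapping ranges do not corrupt each other.
        System.arraycopy(buf, 0, buf, 2, 5);
        System.out.println(Arrays.toString(buf)); // [1, 2, 1, 2, 3, 4, 5, 8]
    }
}
```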

Also, it sounds as if you've maybe done work to sync this with the current 
trunk.  If so, would you mind uploading the updated patch?

Also, if you end up using this, be sure to test thoroughly.  I implemented this 
one just to gain a little familiarity with the code and I do not claim any sort 
of expertise in this area, so beware!  But all of the regular unit tests did 
pass for me.  I was meaning to run test2bpostings against this but 
wasn't able to get it set up.  If I remember correctly, this issue came up 
originally because someone wanted to run test2bpostings with MemoryCodec and it 
was going past the limit.

> FST has hard limit max size of 2.1 GB
> -
>
> Key: LUCENE-3298
> URL: https://issues.apache.org/jira/browse/LUCENE-3298
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/FSTs
>Reporter: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in java is 
> indexed by int so we cannot grow this over Integer.MAX_VALUE.  It also 
> internally encodes references to this array as vInt.
> We could switch this to a paged byte[] and make the far larger.
> But I think this is low priority... I'm not going to work on it any time soon.



[jira] [Commented] (SOLR-2509) spellcheck: StringIndexOutOfBoundsException: String index out of range: -1

2011-12-06 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163695#comment-13163695
 ] 

James Dyer commented on SOLR-2509:
--

Ah, I see in r1022768 you combined "testCollate()" with "testCollate2()", 
where this test scenario was originally.  Thanks for the clarification (and 
sorry!).  So in fact this was added with SOLR-1630 (r987509); the comments 
therein are not very reassuring that it was a "correct" or "final" fix.

> spellcheck: StringIndexOutOfBoundsException: String index out of range: -1
> --
>
> Key: SOLR-2509
> URL: https://issues.apache.org/jira/browse/SOLR-2509
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 3.1
> Environment: Debian Lenny
> JAVA Version "1.6.0_20"
>Reporter: Thomas Gambier
>Assignee: Erick Erickson
>Priority: Blocker
> Attachments: SOLR-2509.patch, SOLR-2509.patch, document.xml, 
> schema.xml, solrconfig.xml
>
>
> Hi,
> I'm a french user of SOLR and i've encountered a problem since i've installed 
> SOLR 3.1.
> I've got an error with this query : 
> cle_frbr:"LYSROUGE1149-73190"
> *SEE COMMENTS BELOW*
> I've tested to escape the minus char and the query worked :
> cle_frbr:"LYSROUGE1149(BACKSLASH)-73190"
> But, strange fact, if i change one letter in my query it works :
> cle_frbr:"LASROUGE1149-73190"
> I've tested the same query on SOLR 1.4 and it works !
> Can someone test the query on next line on a 3.1 SOLR version and tell me if 
> he have the same problem ? 
> yourfield:"LYSROUGE1149-73190"
> Where do the problem come from ?
> Thank you by advance for your help.
> Tom



[jira] [Commented] (SOLR-2509) spellcheck: StringIndexOutOfBoundsException: String index out of range: -1

2011-12-05 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163103#comment-13163103
 ] 

James Dyer commented on SOLR-2509:
--

Steffen's changes are most certainly correct.  The index contains "pixmaa" and 
we are querying on "pixma-a-b-c-d-e-f-g".  The spelling index is using analyzer 
"lowerpunctfilt" (solrconfig-spellcheckcomponent.xml, line 44) which includes 
WordDelimiterFilter and "generateWordParts=1".  So we would expect this query 
to tokenize down to "pixma" "a" "b" "c" "d" "e" "f" "g".  As the Collate 
feature is only supposed to replace the misspelled token with the new one, I 
wonder why this test scenario would expect all 8 tokens to be replaced by 1 
token (!).

Indeed, this test scenario was added during a refactoring (r1022768) with no 
JIRA # or bug mentioned at all in the comments.  So we can't know for sure why 
it was added.  I'm thinking this is invalid.  I would expect the correct 
collation to be "pixma-a-b-c-d-e-f-g".  

Just for grins, I put a "println" in SpellingQueryConverter to show the start & 
end offsets for each token before and after the patch.  In both cases, we get 
the same token texts, but prior to the patch the offset values are clearly 
wrong.

--before:
TOKEN: pixma so=0 eo=19
TOKEN: a so=0 eo=19
TOKEN: b so=0 eo=19
TOKEN: c so=0 eo=19
TOKEN: d so=0 eo=19
TOKEN: e so=0 eo=19
TOKEN: f so=0 eo=19
TOKEN: g so=0 eo=19
TOKEN: pixmaabcdefg so=0 eo=19

--after:
TOKEN: pixma so=0 eo=5
TOKEN: a so=6 eo=7
TOKEN: b so=8 eo=9
TOKEN: c so=10 eo=11
TOKEN: d so=12 eo=13
TOKEN: e so=14 eo=15
TOKEN: f so=16 eo=17
TOKEN: g so=18 eo=19
TOKEN: pixmaabcdefg so=0 eo=19
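The corrected offsets above are what straightforward position tracking during tokenization produces (a simplified sketch, not the actual SpellingQueryConverter code):

```java
public class OffsetSketch {
    public static void main(String[] args) {
        String q = "pixma-a-b-c-d-e-f-g";
        // Split on the delimiter while tracking each token's character span,
        // mirroring the "after" offsets shown above.
        int start = 0;
        for (String tok : q.split("-")) {
            int end = start + tok.length();
            System.out.println("TOKEN: " + tok + " so=" + start + " eo=" + end);
            start = end + 1; // skip past the delimiter
        }
    }
}
```

Running this prints "TOKEN: pixma so=0 eo=5" through "TOKEN: g so=18 eo=19", matching the patched behavior; the pre-patch output collapsed every token to the whole-string span 0-19.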

 




[jira] [Commented] (SOLR-2933) DIHCacheSupport ignores left side of where="xid=x.id" attribute

2011-12-02 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161861#comment-13161861
 ] 

James Dyer commented on SOLR-2933:
--

{quote}
Why not add randomization for choosing map class
{quote}
createMap() is but a trifle to facilitate testing.  Personally I think 
randomizing on various Map impls here is confusing and it would not be 
intuitive why it was done.  What is important is the order of the fields being 
sent to the MockDataSource.  In this case, there are 2 fields:  "id" and 
"desc".  Having some rows put "id" first and the others put "desc" first is 
adequate to test all the possible variants.  I also think it is a good idea to 
always use a Map that preserves order because when writing these tests, it is 
not intuitive (for me anyhow) that the Map you're creating is going to iterate 
in an order different from the one you specify.  That is why I changed the 
Abstract class to always use a LinkedHashMap.
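The ordering point is easy to demonstrate with plain JDK behavior, nothing DIH-specific: HashMap iterates in an unspecified order, while LinkedHashMap iterates in insertion order, which is what you want when a test depends on which field the cache sees first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapOrder {
    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so "desc" reliably comes
        // out first here -- a plain HashMap makes no such guarantee.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("desc", "one");
        row.put("id", 1);
        System.out.println(row.keySet().iterator().next()); // prints desc
    }
}
```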

{quote}
"assuming that first column is a primary key" is a really user-"hostile" 
feature.
{quote}
I didn't investigate whether this is just a function of the test or if DIH "in 
real life" behaves this way.  In any case, if this is truly a DIH feature, I 
wouldn't consider it hostile: the primary key comes first most of the time.  
Regardless, I wanted to solve the bug that this JIRA issue addresses and no more.  
If we have questions about default behaviors and want to change how the 
features work, that is probably left to a separate non-"bug" issue.

{quote}
test methods names withWhereClause() and withKeyAndLookup() at 
TestCachedSqlEntityProcessor should be swapped each other.
{quote}
I think you're right, but then again I didn't write these tests, so I'm not 
sure why they were named this way.  Also, changing the names is not directly 
related to fixing the bug here.  Personally, I would love to see the DIH tests 
get revamped someday so you could just glance through them and understand what 
they do and how.

In any case, now that I understand the bug and how to fix it, I would rather 
create a lean patch that someone can quickly evaluate and commit.  If we add 
bloat or make it hard for a committer to understand it might just languish and 
remain unfixed for a long time.

> DIHCacheSupport ignores left side of where="xid=x.id" attribute
> ---
>
> Key: SOLR-2933
> URL: https://issues.apache.org/jira/browse/SOLR-2933
> Project: Solr
>  Issue Type: Sub-task
>  Components: contrib - DataImportHandler
>Affects Versions: 4.0
>Reporter: Mikhail Khludnev
>Priority: Minor
>  Labels: noob, random
> Attachments: 
> AbstractDataImportHandlerTestCase.java-choose-map-randomly.patch, 
> SOLR-2933.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> DIHCacheSupport introduced at SOLR-2382 uses new config attributes cachePk 
> and cacheLookup. But support old one where="xid=x.id" is broken by 
> [DIHCacheSupport.|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHCacheSupport.java?view=markup]
>  - it never put where="" sides into the context, but it revealed by 
> [SortedMapBackedCache.|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup],
>  which takes just first column as a primary key. That's why all tests are 
> green.
> To reproduce the issue I need just reorder entry at [line 
> 219|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestCachedSqlEntityProcessor.java?revision=1201659&view=markup]
>  and make desc first and picked up as a primary key. 
> To do that I propose to chose concrete map class randomly for all DIH test 
> cases at 
> [createMap()|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/AbstractDataImportHandlerTestCase.java?revision=1149600&view=markup].
>  
> I'm attaching test breaking patch and seed.



[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-12-02 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161726#comment-13161726
 ] 

James Dyer commented on SOLR-2382:
--

Noble,

I have attached a patch with a corrected unit test & fix on SOLR-2933, to fix 
one of the problems Mikhail described.  Indeed the "where" parameter was broken 
by our last commit and TestCachedSqlEntityProcessor would mask the failure and 
pass anyway.  Would you mind looking at my patch and committing it?  Thanks.

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter_standalone.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> TestCachedSqlEntityProcessor.java-break-where-clause.patch, 
> TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
>  
> TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch,
>  TestThreaded.java.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
> - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to Generic Usage.
>- NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>- SolrWriter (refactored)

[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158797#comment-13158797
 ] 

James Dyer commented on SOLR-2382:
--

Mikhail,

I looked at "TestCachedSqlEntityProcessor.java-break-where-clause.patch".  It 
looks like a bug is being introduced into the test because we have a 
Map with some Strings and some Integers in it.  You modified the test 
case to store these in a TreeMap.  But when the TreeMap's comparator tries to 
get these in order, it cannot cast the String to Integer and/or vice versa.  I 
don't mean to say the bug you describe doesn't exist, but that this new test 
case doesn't seem to properly test for it.  Could you improve this test case?
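The cast failure is ordinary TreeMap behavior (a minimal reproduction, unrelated to DIH itself): with natural ordering, the first comparison between an Integer key and an existing String key throws ClassCastException.

```java
import java.util.TreeMap;

public class MixedKeys {
    public static void main(String[] args) {
        TreeMap<Object, Object> m = new TreeMap<>();
        m.put("id", 1);
        try {
            // Natural ordering compares the new key against the existing one:
            // Integer.compareTo sees a String and throws ClassCastException.
            m.put(42, "desc");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException as expected");
        }
    }
}
```

So a test that mixes String and Integer keys in a TreeMap fails on the map itself, before it ever exercises the DIH bug under discussion.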



> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter_standalone.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> TestCachedSqlEntityProcessor.java-break-where-clause.patch, 
> TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
>  TestThreaded.java.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
> - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to Generic Usage.
>- NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>- SolrWriter (refactored)
>- DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of 
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a 
> disk-backed cache) once the entity processor is completed.

[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158512#comment-13158512
 ] 

James Dyer commented on SOLR-2382:
--

Mikhail,

Thank you for testing this and providing information and a patch.  I have a few 
questions.

{quote}
this code, which cleans up the cache, makes sense, but for parent entities only, 
and causes a failure for the child entities enumeration
{quote}
Currently none of the existing unit tests fail with this change and I'm not 
sure exactly how to reproduce the problem you're describing.  Could you create 
a failing unit test for this to clarify what you're experiencing?  

{quote}
looks like where="xid=x.id" is not supported by new code, which relies on 
cachePk="xid" and cacheLookup="x.id"
{quote}
Do you mean this breaks CachedSqlEntityProcessor?  Looking at 
TestCachedSQLEntityProcessor, it seems like the case you describe is adequately 
tested and this test still passes.  Possibly you mean something different?  
Once again, a failing unit test would be helpful in knowing how to reproduce 
the specific problem you've found.

{quote}
the most interesting problem is failure of testCachedMultiThread_FullImport(). 
At 3.4 it's caused by concurrent access...
{quote}
This sounds like a valid issue, and just looking at it, it seems like you've got a 
good unit test for it.  But if it's happening in 3.4 then it's not related to 
SOLR-2382, which is in Trunk/4.0 only.  Would you mind opening a new "bug" 
issue for this?

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor

[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-11-21 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154257#comment-13154257
 ] 

James Dyer commented on SOLR-2382:
--

Noble,

I can't speak for every use case, but these were necessary for one of our 
applications.  The whole idea is it lets you load your caches in advance of 
indexing (DIHCacheWriter), then read back your caches at a later time when 
you're ready to index (DIHCacheProcessor).

- This is especially helpful if you have a lot of different data sources that 
each contribute a few data elements in each Solr record.  (we have at least 40 
data sources.)  

- If you have slow data sources, you can run multiple DIH scripts at the same 
time and build your caches simultaneously (My app builds 12 DIH Caches at a 
time as we have some slow legacy databases to contend with).  

- If you have some data sources that change infrequently and others that are 
changing all the time, you can build caches for the infrequently-changing data 
sources, making it unnecessary to re-acquire this data every time you do a 
delta update (this is actually a very common case.  Imagine having Solr loaded 
with Product metadata.  Most of the data would seldom change but things like 
prices, availability flags, stock numbers, etc, might change all the time.)

- The fact that you can do delta imports on caches allows users to optimize the 
indexing process further.  If you have multiple child-entity caches with data 
that mostly stays the same, but each has churn on a small percentage of the 
data, being able to just go in and delta update the cache lets you only 
re-acquire what changed.  Otherwise, you have to take every record that had a 
change in even 1 data source and re-acquire all of the data sources for every 
record.

- These last two points relate to the fact that Lucene cannot do an "update" 
but only a "replace".  Being able to store your system-of-record data in caches 
alleviates the need to re-acquire all of your data sources every time you need 
to do an "update" on a few fields.

- Some systems do not have a separate system-of-record as the data being 
indexed to Solr is ephemeral or changes frequently.  Having the data in caches 
gives you the freedom to delta update the information or easily re-index all 
data at system upgrades, etc.  I could see for some users these caches 
factoring into their disaster recovery strategy.

- There is also a feature to partition the data into multiple caches, which 
would make it easier to subsequently index the data to separate shards.  We use 
this feature to index the data in parallel to the same core (we're using Solr 
1.4, which did not have a "threads" parameter), but this would apply to using 
multiple shards also.

Is this convincing enough to go ahead and work towards commit?

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor

[jira] [Commented] (SOLR-2578) ReplicationHandler Backups -- clean up old backups

2011-11-17 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152085#comment-13152085
 ] 

James Dyer commented on SOLR-2578:
--

Thanks for committing this.  I have updated the wiki.

> ReplicationHandler Backups -- clean up old backups
> --
>
> Key: SOLR-2578
> URL: https://issues.apache.org/jira/browse/SOLR-2578
> Project: Solr
>  Issue Type: Improvement
>  Components: replication (java)
>Affects Versions: 3.2, 4.0
>Reporter: James Dyer
>Assignee: James Dyer
>Priority: Minor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-2578.patch, SOLR-2578.patch, SOLR-2578_3x.patch
>
>
> It would be nice when performing backups if there was an easy way to tell 
> ReplicationHandler to only keep so many and then delete the older ones.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2902) List of collations are wrong parsed in SpellCheckResponse

2011-11-16 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151256#comment-13151256
 ] 

James Dyer commented on SOLR-2902:
--

This same bug was previously mentioned on the User List.  

See thread:  
http://lucene.472066.n3.nabble.com/SpellCheck-Print-Multiple-Collations-td3358391.html

Thank you Bastiaan for opening this issue and providing a patch.  Your fix is 
indeed correct.  In case you haven't noticed, you can work around this issue 
for now by specifying spellcheck.collateExtendedResults=true, as the separate 
branch in SpellCheckResponse for extended results does not have the bug.  While 
I hope a committer will take this one up sometime soon for both 4.x and 3.x, I 
wouldn't classify this as "Major" in priority as there is a good workaround.

> List of collations are wrong parsed in SpellCheckResponse
> -
>
> Key: SOLR-2902
> URL: https://issues.apache.org/jira/browse/SOLR-2902
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java
>Affects Versions: 3.4
> Environment: windows xp.
>Reporter: Bastiaan Verhoef
> Attachments: SpellCheckResponse.java.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When I do a search query with {{spellcheck=on}}, I get more than one 
> collation in the Solr response:
>   {{kaart}}
>   {{maart}}
>   {{vaart}}
>   {{staart}}
>   {{baart}}
>   {{komkaart}}
>   {{dagvaart}}
> The SpellCheckResponse gives me only the collation 'dagvaart':
> {{getCollatedResults()}} gives a list of 7 items that contains only Collation 
> objects with 'dagvaart'.
> {{getCollatedResult()}} gives a string with the value 'dagvaart'.




[jira] [Commented] (SOLR-2848) DirectSolrSpellChecker fails in distributed environment

2011-11-09 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147083#comment-13147083
 ] 

James Dyer commented on SOLR-2848:
--

I would really like to get this issue resolved if possible.  Here are 3 
possible solutions:

1. The Nov 1 patch "SOLR-2848.patch" increases test coverage and makes the 
minimal changes to fix the distributed bug with DirectSolrSpellChecker.

2. The Nov 1 patch "SOLR-2848-refactoring.patch" also refactors the code, 
breaking the finishStage() method up and also moving the final merge into 
SolrSpellChecker.  This allows us to theoretically have different spell 
checkers choose to merge differently.  In practice, all of our spell checkers 
currently would use the same default version of "merge()".

3. We could dial back the changes in "SOLR-2848-refactoring.patch" to keep 
merge() as a method in SpellCheckComponent as all spell checkers use the same 
algorithm anyhow.  But we could keep the changes to make finishStage() more 
readable and, more importantly, keep the "getStringDistance()" and 
"getAccuracy()" methods in SolrSpellChecker.  This at least eliminates the need 
for "instanceof" checks, making Distributed Spell Check less brittle as new 
spell checkers are added.

Please advise how we should move forward.  (I like option #3 the best.  I can 
create a patch for this if desired.)  Thanks.
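To illustrate option #3 (an illustrative sketch only; the class and method bodies here are made up, not the actual Solr source): once "getAccuracy()" and the like live on the SolrSpellChecker base class, the merging code in SpellCheckComponent can treat every checker uniformly instead of branching on concrete types.

```java
public class MergeAccessorSketch {

    // Stand-in for SolrSpellChecker with an accessor pulled up to the base
    // class, so callers never need instanceof checks.
    abstract static class SpellChecker {
        abstract float getAccuracy();
        abstract String getName();
    }

    static class IndexBased extends SpellChecker {
        float getAccuracy() { return 0.5f; }
        String getName() { return "IndexBased"; }
    }

    static class Direct extends SpellChecker {
        float getAccuracy() { return 0.8f; }
        String getName() { return "Direct"; }
    }

    // The merging code stays generic: adding a new spell checker subclass
    // requires no changes here.
    static String describe(SpellChecker sc) {
        return sc.getName() + " accuracy=" + sc.getAccuracy();
    }

    public static void main(String[] args) {
        System.out.println(describe(new IndexBased()));
        System.out.println(describe(new Direct()));
    }
}
```

The point is only the shape: the distributed merge never has to know which concrete checker produced the suggestions.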




> DirectSolrSpellChecker fails in distributed environment
> ---
>
> Key: SOLR-2848
> URL: https://issues.apache.org/jira/browse/SOLR-2848
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud, spellchecker
>Affects Versions: 4.0
>Reporter: James Dyer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-2848-refactoring.patch, SOLR-2848.patch, 
> SOLR-2848.patch
>
>
> While working on SOLR-2585, it was brought to my attention that 
> DirectSolrSpellChecker has no test coverage involving a distributed 
> environment.  Here I am adding a random element to 
> DistributedSpellCheckComponentTest to alternate between the "IndexBased" and 
> "Direct" spell checkers.  Doing so revealed bugs in using 
> DirectSolrSpellChecker in a distributed environment.  The fixes here roughly 
> mirror those made to the "IndexBased" spell checker with SOLR-2083.




[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-11-09 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147072#comment-13147072
 ] 

James Dyer commented on SOLR-2382:
--

Noble,

Is there anything else you need from me to help move this forward?

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor

[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-10-27 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137184#comment-13137184
 ] 

James Dyer commented on SOLR-2382:
--

{quote}
The EntityProcessorBase.transformers field is not used in the latest patch. How 
does transformation work?
{quote}
This hasn't been in use since SOLR-1120 was applied back in 2009 (r766608).  
Since SOLR-1120, Transformers are applied in class EntityProcessorWrapper.  We 
could remove this variable from the class now as it does nothing and is not 
used by this class or any prepackaged subclasses.  Agree?

{quote}
With some clean up. 
{quote}
What other kinds of things do you have in mind?

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor

[jira] [Commented] (SOLR-2853) SpellCheckCollator.collate method creates the a PossibilityIterator with maxTries instead of maxCollations

2011-10-26 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136014#comment-13136014
 ] 

James Dyer commented on SOLR-2853:
--

Matt,

Actually, if you wouldn't mind, could you re-open this one (I can't), or 
otherwise I can open another issue.  Taking a closer look after reading your 
last comment, there is a bug in the case the user sets "maxCollationTries" to 
0.  In SpellCheckCollator, we have:

{code}
if (maxTries < 1) {
  maxTries = 1;
  verifyCandidateWithQuery = false;
}
{code}

But I think in this case we need it to set "maxTries" to the same value as 
"maxCollations".  The current code will, as you point out, only return 1 
collation no matter how many the user specified, unless "maxTries" > 0.  I would 
think most users who want multiple collations would also want them verified, so 
this is probably not something that would get easily caught.  An extra test 
case might be prudent as well.
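To make the failure mode concrete, here is a minimal sketch (not the actual SpellCheckCollator source; the helper names are invented) contrasting the current cap with the proposed one:

```java
public class CollateCapSketch {
    // Current behavior: maxCollationTries=0 is coerced to 1, so the
    // PossibilityIterator can yield at most one candidate collation.
    static int currentCap(int maxTries, int maxCollations) {
        return (maxTries < 1) ? 1 : maxTries;
    }

    // Proposed behavior: with verification disabled, generate at least
    // maxCollations possibilities so all requested collations come back.
    static int proposedCap(int maxTries, int maxCollations) {
        return (maxTries < 1) ? maxCollations : maxTries;
    }

    public static void main(String[] args) {
        int maxCollations = 5;  // user asked for 5 collations
        int userMaxTries = 0;   // user set maxCollationTries=0
        System.out.println(currentCap(userMaxTries, maxCollations));   // 1
        System.out.println(proposedCap(userMaxTries, maxCollations));  // 5
    }
}
```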



> SpellCheckCollator.collate method creates the a PossibilityIterator with 
> maxTries instead of maxCollations
> --
>
> Key: SOLR-2853
> URL: https://issues.apache.org/jira/browse/SOLR-2853
> Project: Solr
>  Issue Type: Bug
>  Components: spellchecker
>Affects Versions: 3.3, 4.0
>Reporter: Matt Traynham
>Priority: Minor
>
> Class SpellCheckCollator creates a new possibility iterator based on the 
> spellcheck results.  The iterator is created with: 
> PossibilityIterator possibilityIter = new 
> PossibilityIterator(result.getSuggestions(), maxTries, maxEvaluations);
> The issue is that maxTries should be maxCollations.  Correct me if I'm wrong.




[jira] [Commented] (SOLR-2853) SpellCheckCollator.collate method creates the a PossibilityIterator with maxTries instead of maxCollations

2011-10-25 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135468#comment-13135468
 ] 

James Dyer commented on SOLR-2853:
--

Matt,

Thanks for taking such a deep dive into this code.  It's great to see people 
checking for things like this.  I think the current code is correct, however.  
What PossibilityIterator is returning is a set of word combinations that 
SpellCheckCollator then needs to test against the index.  So 
PossibilityIterator will return up to "maxTries" word combinations.  But some 
of these possibilities could be nonsense and will return 0 hits when queried 
for against the index. SpellCheckCollator will throw these 0-hit possibilities 
out, trying each possibility until it has as many good ones as requested by 
"maxCollations", or until it has exhausted the list.  (If the user sets 
maxCollationTries to zero, SpellCheckCollator won't test any and in this case 
will just return the first "maxCollations" possibilities back to the user.)  
Make sense? 
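In code terms, the loop works roughly like this (a simplified sketch, not the actual SpellCheckCollator source; the hit-count function stands in for the real test query against the index):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

public class CollationLoopSketch {
    // Walk the up-to-maxTries possibilities, keeping only those that
    // return hits, until maxCollations good ones are found or the list
    // of possibilities is exhausted.
    static List<String> collate(List<String> possibilities,
                                int maxCollations,
                                boolean verify,
                                ToIntFunction<String> hitCount) {
        List<String> collations = new ArrayList<>();
        for (String candidate : possibilities) {
            if (collations.size() >= maxCollations) break;
            if (!verify || hitCount.applyAsInt(candidate) > 0) {
                collations.add(candidate);  // 0-hit candidates are thrown out
            }
        }
        return collations;
    }

    public static void main(String[] args) {
        List<String> possibilities = List.of("solr index", "solr nonsense", "solr query");
        // Pretend "solr nonsense" returns 0 hits when queried for.
        ToIntFunction<String> hits = s -> s.contains("nonsense") ? 0 : 3;
        System.out.println(collate(possibilities, 2, true, hits));
        // -> [solr index, solr query]
    }
}
```

With verify=false (the maxCollationTries=0 case), the first maxCollations possibilities are returned untested, nonsense or not.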

> SpellCheckCollator.collate method creates the a PossibilityIterator with 
> maxTries instead of maxCollations
> --
>
> Key: SOLR-2853
> URL: https://issues.apache.org/jira/browse/SOLR-2853
> Project: Solr
>  Issue Type: Bug
>  Components: spellchecker
>Affects Versions: 3.3, 4.0
>Reporter: Matt Traynham
>Priority: Minor
>




[jira] [Commented] (SOLR-2848) DirectSolrSpellChecker fails in distributed environment

2011-10-24 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134182#comment-13134182
 ] 

James Dyer commented on SOLR-2848:
--

{quote}
OK, Lets do this, such that the distance impl is a "real" one computing 
levenshtein like Lucene does
{quote}
I opened LUCENE-3527.

{quote}
Rather than instanceof/StringDistance maybe we could add a merge() method that 
would be more general?
{quote}
Are you thinking each *SolrSpellChecker should have a merge() that 
finishStage() calls?  This sounds reasonable to me.

> DirectSolrSpellChecker fails in distributed environment
> ---
>
> Key: SOLR-2848
> URL: https://issues.apache.org/jira/browse/SOLR-2848
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud, spellchecker
>Affects Versions: 4.0
>Reporter: James Dyer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-2848.patch
>
>
> While working on SOLR-2585, it was brought to my attention that 
> DirectSolrSpellChecker has no test coverage involving a distributed 
> environment.  Here I am adding a random element to 
> DistributedSpellCheckComponentTest to alternate between the "IndexBased" and 
> "Direct" spell checkers.  Doing so revealed bugs in using 
> DirectSolrSpellChecker in a distributed environment.  The fixes here roughly 
> mirror those made to the "IndexBased" spell checker with SOLR-2083.




[jira] [Commented] (SOLR-2848) DirectSolrSpellChecker fails in distributed environment

2011-10-24 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134142#comment-13134142
 ] 

James Dyer commented on SOLR-2848:
--

finishStage() is run on the master shard.  It receives spelling results from 
all of the shards and then has to integrate them.  Solr doesn't return the 
scores with spelling suggestions back to the client.  I suppose the authors 
of SOLR-785 could have modified the response Solr sends back to its clients.  
However, it probably seemed inexpensive enough to just re-compute the string 
distance after the fact (indeed, Lucene in Action, 2nd ed., section 8.5.3 
mentions doing the same thing, so I take it this is a common thing to do).  
The problem we now have is that we've got a spellchecker that doesn't fully 
implement a StringDistance all the time.  I'd imagine the best bet is to try 
and change that.  Possibly the slight discrepancies our current patch leaves 
are not serious enough to fix?  If neither option is good, then we'd probably 
have to modify the Solr response to include scores.
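As a rough illustration of that after-the-fact re-computation (a hedged sketch, not Solr's actual finishStage() code; the class and method names here are hypothetical, and the merge details are simplified), suggestions merged from several shards can be re-ordered by recomputing an edit distance against the original term:

```java
import java.util.*;

public class MergeRankSketch {
    // Plain Levenshtein distance: the classic dynamic program over two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    // Shards return suggestions without scores, so the merging node
    // re-computes a distance for each suggestion and sorts best-first.
    static List<String> rankMerged(String original, List<String> merged) {
        List<String> ranked = new ArrayList<>(merged);
        ranked.sort(Comparator.comparingInt(s -> levenshtein(original, s)));
        return ranked;
    }

    public static void main(String[] args) {
        if (levenshtein("kitten", "sitting") != 3) throw new AssertionError();
        List<String> ranked = rankMerged("solr", List.of("bolt", "solar", "lucene"));
        if (!ranked.get(0).equals("solar")) throw new AssertionError();
        System.out.println(ranked);
    }
}
```

The discrepancy discussed in this thread is exactly that the distance used for this re-ranking may not be the same implementation the shard used to score its suggestions in the first stage.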





[jira] [Commented] (SOLR-2848) DirectSolrSpellChecker fails in distributed environment

2011-10-24 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134121#comment-13134121
 ] 

James Dyer commented on SOLR-2848:
--

Robert,

I think your first suggestion (moving configuration and response formatting 
out of the *SolrSpellChecker classes) is desirable and doable, but I wanted 
to keep this issue focused on increasing test coverage and on making 
DirectSolrSpellChecker mirror what AbstractLuceneSpellChecker already does so 
that it can pass.

Obviously, if every spellchecker plug-in implemented or extended something 
that had a "getStringDistance" or "getAccuracy" method, then we wouldn't need 
to do instanceof checks and casts.  Once again, a big structural change like 
this seems inappropriate in a bug fix, especially as we are not introducing 
these checks for the first time.  This is a long-standing problem.

It looks to me like "internal levenshtein" is just a dummy class designed to 
technically meet the API requirements while not actually doing anything.  But 
SpellCheckComponent.finishStage() needs to be able to get the StringDistance 
impl that was used to generate suggestions during the first stage, then 
re-compute distances using its getDistance() method.  This is how it knows 
how to order the varying suggestions from multiple shards after the fact.  I 
see from the notes in DirectSpellChecker that using the "internal" 
StringDistance yields performance improvements over using a pluggable one.  I 
did not look closely enough to determine whether "internal levenshtein" could 
be modified to re-compute these internally generated distance calculations 
and be usable externally, without sacrificing the performance gain.  If 
possible, this would probably be our best bet, eliminating the Exception hack 
and any discrepancies that using two different StringDistance classes would 
cause.  Do you agree?





[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-10-20 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131674#comment-13131674
 ] 

James Dyer commented on SOLR-2382:
--

{quote}
The entities are reused . But it has always been like that. Why do you need 
that initialized flag? What is initialized? 
{quote}

Perhaps "initialized" is the wrong name for this flag, but let me explain how 
it's used.  In "Implementation Details" #6 in this issue's description, I 
mentioned the need to change the semantics of "entity.destroy()".  Prior to 
this patch, for child entities, both "entity.destroy()" and "entity.init()" 
get called once per parent row.  So throughout the course of a DIH import, 
child entities constantly get their "init" and "destroy" methods called over 
and over again.  But what if we have "init" and "destroy" operations that are 
meant to be executed only once?  "init" copes with this by setting a 
"firstInit" flag on each entity and guarding any one-time init steps with 
this flag.

But there was no such coping mechanism built into "destroy".  There was never 
a need, because in actuality only one of our prepackaged entities implements 
"destroy()".  But entities that use persistent caching require a way to clean 
up any unneeded caches at the end.  Because "destroy()" was largely unused, I 
decided to change its semantics to handle this end-of-lifecycle cleanup 
operation.  (The one entity that already implements "destroy" is 
LineEntityProcessor, but prior to this patch we cannot use 
LineEntityProcessor as a child entity and do joins, so the semantic change 
here doesn't matter.)  

Thus the "entityWrapper.initialized" flag gets set (DocBuilder lines 637-640) 
the first time a particular entity is encountered.  The flag ensures that the 
entity gets added to the "Destroy-List" only once.  When any entity is done 
being used (its parent is finished), the appropriate "Destroy-List" is looped 
through, the children are destroyed, and their initialized flags get set back 
to "false" (DocBuilder lines 617-621).  "resetEntity()" resets the flag, 
existing in its own method so that it may be done recursively.

I apologize for this very long explanation, but I hope it is helpful.  
Obviously I've made design decisions here that you may (or perhaps not) 
differ on.  Basically I need an "entity.destroy()" that is guaranteed to get 
called only once, at the time the entity is done executing.  If you would 
like this done differently, let me know what you have in mind and I can try 
to change it.

Do you now understand why I am using an "initialized" flag?  Is this OK 
as-is, or if not, how would you like the design changed?
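A minimal sketch of that guard may make the lifecycle clearer. This is illustrative only; the real DocBuilder/EntityProcessor wiring is more involved, and the class and method names here are hypothetical:

```java
import java.util.*;

public class DestroyOnceSketch {
    // Hypothetical minimal entity with an init/destroy lifecycle, loosely
    // modeled on the semantics described for DIH child entities.
    static class Entity {
        boolean initialized = false;  // guards one-time destroy registration
        int initCalls = 0, destroyCalls = 0;

        void init(List<Entity> destroyList) {
            initCalls++;              // called once per parent row
            if (!initialized) {       // register for end-of-life cleanup once
                initialized = true;
                destroyList.add(this);
            }
        }
        void destroy() { destroyCalls++; }
    }

    // When the parent finishes, destroy each registered child exactly once
    // and reset the flag so the entity could safely be used again.
    static void destroyAll(List<Entity> destroyList) {
        for (Entity e : destroyList) {
            e.destroy();
            e.initialized = false;    // analogous to resetEntity()
        }
        destroyList.clear();
    }

    public static void main(String[] args) {
        Entity child = new Entity();
        List<Entity> destroyList = new ArrayList<>();
        for (int row = 0; row < 5; row++) child.init(destroyList); // 5 parent rows
        destroyAll(destroyList);
        if (child.initCalls != 5) throw new AssertionError();
        if (child.destroyCalls != 1) throw new AssertionError(); // destroyed once
        System.out.println("init=" + child.initCalls + " destroy=" + child.destroyCalls);
    }
}
```

The point of the flag is visible in the assertions: "init" runs once per parent row, while "destroy" runs exactly once at end of life.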


[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-10-18 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129778#comment-13129778
 ] 

James Dyer commented on SOLR-2382:
--

Noble,

Thanks for the comments.  Let me see if I can answer some of your questions, 
and perhaps you can give further guidance as to how we can address your 
concerns.  In the meantime, I can try to get a patch together that 
incorporates as much as possible of what you suggest.

{quote}
SolrWriter.setDeltaKeys();
it is not implemented and I am not even clear why it is there
{quote}

This implements the new DIHWriter interface, but I see I failed to put the 
proper annotations in, hence the confusion.  This method is required by any 
DIHWriter that supports both delta updates and duplicate keys (e.g. the 
DIHCacheWriter, in the next patch).  SolrWriter does not implement this 
because Solr does not support duplicate keys.  That is, in the case of a 
delta update to a Solr index, a repeat key is definitely an Update, and 
cannot be an Add.  Caches that support duplicate keys, however, need to know 
up front whether a duplicate key is an Add or an Update.  In the next patch, 
I will put all the proper annotations in place.  Will this satisfy your 
concern?
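To illustrate why a duplicate-key cache needs the delta keys up front (a toy sketch under stated assumptions: the class, the `deltaWrite` helper, and the in-memory map are all hypothetical, not the patch's actual DIHCacheWriter API):

```java
import java.util.*;

public class DupKeyCacheSketch {
    // Toy cache permitting multiple rows per key (unlike a Solr index,
    // where re-adding a key simply replaces the existing document).
    static Map<String, List<String>> cache = new HashMap<>();

    // During a delta run, the first write for a changed key must purge the
    // old rows (Update); later writes for the same key accumulate (Add).
    // Without the deltaKeys hint, stale and fresh rows would pile up.
    static void deltaWrite(String key, String row, Set<String> deltaKeys) {
        if (deltaKeys.contains(key)) {
            cache.remove(key);       // treat as Update: drop prior rows once
            deltaKeys.remove(key);   // subsequent repeats are Adds this run
        }
        cache.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    public static void main(String[] args) {
        // Initial load: key "1" legitimately has two rows.
        deltaWrite("1", "rowA", new HashSet<>());
        deltaWrite("1", "rowB", new HashSet<>());
        // Delta run: key "1" changed; old rows go, new ones accumulate.
        Set<String> deltaKeys = new HashSet<>(Set.of("1"));
        deltaWrite("1", "rowC", deltaKeys);
        deltaWrite("1", "rowD", deltaKeys);
        if (!cache.get("1").equals(List.of("rowC", "rowD"))) throw new AssertionError();
        System.out.println(cache);
    }
}
```

With Solr as the target, none of this bookkeeping is needed, which is why SolrWriter can leave the method unimplemented.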

{quote}
Now that we have a concept of DIHCache, move all cache related logic from 
EntityprocessorBase to another class. Probably a baseDIHCache.
{quote}
I realized back when this was first developed that this would be a good 
future refactoring, but this is a pretty big project already and I was trying 
to minimize the changes.  But I can do this in the next patch version if 
you'd like it done now.  Sound good? 

{quote}
remove the DIHCacheProperties class and inline the constants. That is the way 
it done everywhere else
{quote}
It made more sense back when I was developing this to have the constants 
centralized, because many of them are used by more than one class.  But for 
consistency I can inline them somewhere for the next patch version.  Agree?

{quote}
I don't understand the need for DocBuilder.resetEntity() According to me the 
DataCOnfig state must not be changed between runs. 
{quote}
All this does is recursively set the "initialized" flag on an entity and its 
children back to "false".  (The "initialized" flag ensures that "destroy" is 
only called on an entity once; see "Implementation Details" #6 in my original 
description for this issue.)  I think I added "resetEntity" as a safety 
measure, because I don't know enough about DIH to guarantee that these entity 
objects never get used again.  If you're pretty sure it's impossible for the 
same entity objects to be used again, we can remove "resetEntity".  In the 
meantime, let me see if all the unit tests pass with it removed.  If you're 
sure, and if all unit tests pass without it, then I'd agree we should remove 
it.  Sound like a plan on this one?



[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-10-13 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126751#comment-13126751
 ] 

James Dyer commented on SOLR-2382:
--

Pulkit,

I take it you have a root entity (possibly from a SQL database) and a child 
entity from a flat text file, and you need to join the two data sources.  
There are two ways to do this with this caching.  In either case you'll need 
both patches ("entities" & "dihwriter").  Also, if an in-memory cache is not 
adequate, you will need BerkleyBackedCache from SOLR-2613 (required by the 
second, two-handler approach).  

The simple way uses a temporary (or ephemeral) cache.  To do this, create a 
single DIH request handler and add your cached child entity to 
data-config.xml.  DIH will load and cache "child_entity" every time you do an 
import.  When the import is finished, the cache is deleted.  This lets you do 
joins on flat files, which would not otherwise be possible.  The downside is 
that if the flat file changes infrequently, or if you're doing delta updates 
on your index, it is inefficient to load and cache a large flat file every 
time.  Here's a sample data-config.xml:

{noformat}

 
 
 
   

   
 

{noformat}

The second approach is to create a second DIH request handler in your 
solrconfig.xml for the child entity.  This request handler has its own 
data-config.xml (named dih-flatfile.xml).  You would run this second request 
handler to build a persistent cache for the flat file, prior to running the 
main DIH request handler.  Here's an example of this second DIH request 
handler configured in solrconfig.xml:

{noformat}

 
  dih-flatfile.xml
  true
  root_id,  flatfiledata1, etc
  BIGDECIMAL, STRING, etc
  org.apache.solr.handler.dataimport.DIHCacheWriter
  BerkleyBackedCache
  location_of_persistent_caches
  flatfile_cache_name
  root_id
 

{noformat}

And here is what "dih-flatfile.xml" would look like:

{noformat}

 
 
   
 

{noformat}

Your main "dataconfig-xml" would look like this:

{noformat}

 
 
 
   

   
 

{noformat}

This second approach offers more flexibility (you can load the persistent 
cache off-hours, re-use it, do delta updates on it, etc.), but it is 
significantly more complex.  The hardest part is creating a scheduler that 
will run the child entity's DIH request handler, wait until it finishes, then 
run the main DIH request handler.  But this is moot if you only need to load 
the child once, or once in a great while.

Should this all get committed, I will eventually put something on the wiki.  
In the meantime, I hope you find all of this helpful.  For more examples, see 
the xml files these patches add to the 
"solr/contrib/dataimporthandler/src/test-files/dih/solr/conf" folder, and 
also the new unit tests that use them.


[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-10-11 Thread James Dyer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125230#comment-13125230
 ] 

James Dyer commented on SOLR-2382:
--

Noble,

I've been meaning to ask you if there is anything I can do to help move this 
along.  But then I got very busy and needed to wait until I could genuinely 
offer assistance!  Things are slowing down a bit for me now, but I realize 
you are no doubt busy too.  

In any case, do you have any lingering concerns you'd like to see addressed?  
Like I mentioned before, we wouldn't be in production with Solr without this 
functionality.  I imagine many other users would like this as well. 

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-properties.patch, 
> SOLR-2382-properties.patch, SOLR-2382-solrwriter-verbose-fix.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
> - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to Generic Usage.
>- NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>- SolrWriter (refactored)
>- DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6.