[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-07-11 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063727#comment-13063727
 ] 

Noble Paul commented on SOLR-2382:
--

My apologies for the delay.

The problem with the patch is its size and scope. You may not need to open other 
issues, but changes like abstracting DIHWriter, DIHPropertiesWriter, etc. can be 
submitted as separate patches within this issue, and I can commit them straight 
away. Although the issue is about cache improvements, the patch goes far beyond 
that scope, and committing it as a whole is difficult.





> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (e.g. flat files, XML, etc.).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
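[Editor's illustration] Use case 1 above refers to the classic "n+1 select" problem: one child query is issued per parent row. The sketch below (hypothetical names, not the patch's API) shows the fix the child-entity cache enables — load all child rows once, index them by foreign key, and join in memory:

```java
import java.util.*;

// Illustrative sketch of avoiding the "n+1 select" problem: instead of
// one child query per parent, load child rows once, index them by
// foreign key, and join each parent against the map. All names here
// are hypothetical, not the actual DIH classes.
public class ChildCacheSketch {
    public static void main(String[] args) {
        List<Map<String, String>> parents = List.of(
            Map.of("id", "1", "name", "book"),
            Map.of("id", "2", "name", "cd"));
        // One bulk "query" for all children (instead of one per parent).
        List<Map<String, String>> children = List.of(
            Map.of("parentId", "1", "tag", "fiction"),
            Map.of("parentId", "1", "tag", "hardcover"),
            Map.of("parentId", "2", "tag", "audio"));

        // Build the cache: foreign key -> child rows.
        Map<String, List<Map<String, String>>> cache = new HashMap<>();
        for (Map<String, String> c : children) {
            cache.computeIfAbsent(c.get("parentId"), k -> new ArrayList<>()).add(c);
        }

        // Join in memory: one cache lookup per parent, zero extra queries.
        for (Map<String, String> p : parents) {
            List<Map<String, String>> kids =
                cache.getOrDefault(p.get("id"), List.of());
            System.out.println(p.get("name") + " -> " + kids.size() + " children");
        }
    }
}
```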
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
> - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to its use of generics.
>- NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>- SolrWriter (refactored)
>- DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6. Change the semantics of entity.destroy()
>   - Previously, it was called on each iteration of 
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a 
> disk-backed cache) once the entity processor completes.
>   - The only out-of-the-box entity processor that previously implemented 
> destroy() was LineEntityProcessor, so this is not a very invasive change.
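[Editor's illustration] A rough sketch of the decoupling described in items 1-2 above — hypothetical signatures, not the patch's actual interface — a pluggable DIH cache might look like:

```java
import java.util.*;

// Hypothetical shape of a pluggable DIH cache; the actual DIHCache
// interface in the patch may differ. Keys map to one or more rows
// (each row a field->value map), mirroring CachedSqlEntityProcessor.
interface SimpleDIHCache {
    void add(Object key, Map<String, Object> row);
    List<Map<String, Object>> lookup(Object key);
    void close();   // one-time cleanup, cf. the new entity.destroy() semantics
}

// In-memory implementation, analogous in spirit to SortedMapBackedCache.
class MapBackedCache implements SimpleDIHCache {
    private final SortedMap<Object, List<Map<String, Object>>> data = new TreeMap<>();

    public void add(Object key, Map<String, Object> row) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }
    public List<Map<String, Object>> lookup(Object key) {
        return data.getOrDefault(key, Collections.emptyList());
    }
    public void close() { data.clear(); }
}

public class CacheSketch {
    public static void main(String[] args) {
        SimpleDIHCache cache = new MapBackedCache();
        cache.add("42", Map.of("title", "foo"));
        cache.add("42", Map.of("title", "bar"));
        System.out.println(cache.lookup("42").size()); // prints 2
        cache.close();
    }
}
```

A disk-backed implementation (like the BerkleyBackedCache in the patch) would implement the same interface over a persistent store.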
> General Notes:
> We are near completion in conve

[jira] [Commented] (SOLR-2641) Auto Facet Selection component

2011-07-11 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063722#comment-13063722
 ] 

Toke Eskildsen commented on SOLR-2641:
--

This looks like a variant of hierarchical faceting. With popularity count as 
the selector, paths like "color/green" and "memory_size/4GB" would produce the 
desired result.

> Auto Facet Selection component
> --
>
> Key: SOLR-2641
> URL: https://issues.apache.org/jira/browse/SOLR-2641
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Erik Hatcher
>Assignee: Erik Hatcher
>Priority: Minor
> Attachments: SOLR_2641.patch
>
>
> It sure would be nice if you could have Solr automatically select field(s) 
> for faceting based dynamically off the profile of the results.  For example, 
> you're indexing disparate types of products, all with varying attributes 
> (color, size - like for apparel, memory_size - for electronics, subject - for 
> books, etc), and a user searches for "ipod" where most products match 
> products with color and memory_size attributes... let's automatically facet 
> on those fields.
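[Editor's illustration] One plausible heuristic for this — purely illustrative, not the attached patch — is to scan which fields the top-N matching documents carry and facet on the fields with the highest coverage:

```java
import java.util.*;

// Illustrative heuristic for auto facet selection (not the attached
// patch): count how many of the top results carry each attribute
// field, then facet on the most-covered fields.
public class AutoFacetSketch {
    static List<String> pickFacetFields(List<Set<String>> docFields, int max) {
        Map<String, Integer> coverage = new HashMap<>();
        for (Set<String> fields : docFields)
            for (String f : fields)
                coverage.merge(f, 1, Integer::sum);
        return coverage.entrySet().stream()
            .sorted((a, b) -> b.getValue() - a.getValue())
            .limit(max)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        // Results for "ipod": mostly electronics with color/memory_size.
        List<Set<String>> results = List.of(
            Set.of("color", "memory_size"),
            Set.of("color", "memory_size"),
            Set.of("color", "subject"));
        System.out.println(pickFacetFields(results, 2));
    }
}
```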

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063689#comment-13063689
 ] 

Mike Sokolov commented on LUCENE-2878:
--

I hope you all will review the patch and see what you think.  My gut at the 
moment tells me we can have it both ways with a bit more tinkering.  I think 
that as it stands now, if you ask for positions you get them in more or less 
the most efficient way we know how. At the moment there is some performance hit 
when you don't want positions, but I think we can deal with that. Simon had the 
idea we could rely on the JIT compiler to optimize away the test we have if we 
set it up as a final false boolean (totally do-able if we set up the state 
during Scorer construction), which would be great and convenient.  I'm no 
compiler expert, so not sure how reliable that is - is it?  But we could also 
totally separate the two cases (say with a wrapping Scorer? - no need for 
compiler tricks) while still allowing us to retrieve positions while querying, 
collecting docs, and scoring.
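[Editor's illustration] The "final false boolean" idea can be sketched as follows; whether HotSpot actually elides the branch would need benchmarking, and the names here are illustrative, not the real Scorer API:

```java
// Sketch of the pattern Simon suggested: fix the flag as a final field
// at construction so the JIT can treat the branch as effectively
// constant per scorer instance. Whether HotSpot reliably elides it
// needs measuring; FakeScorer is a stand-in, not the real API.
public class FlagSketch {
    static final class FakeScorer {
        private final boolean needsPositions; // decided once, at construction

        FakeScorer(boolean needsPositions) { this.needsPositions = needsPositions; }

        int score(int freq) {
            int s = freq;
            if (needsPositions) {     // candidate for JIT constant-folding
                s += 1;               // stand-in for position bookkeeping
            }
            return s;
        }
    }

    public static void main(String[] args) {
        System.out.println(new FakeScorer(false).score(3)); // 3
        System.out.println(new FakeScorer(true).score(3));  // 4
    }
}
```

The alternative mentioned above, a wrapping Scorer, would split the two cases into separate classes instead of relying on the optimizer.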

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: those which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while, 
> so I started working on this using the bulk postings API. I would have done 
> the first cut on trunk, but TermScorer there works on a BlockReader that does 
> not expose positions, while the one in this branch does. I started adding a new 
> Positions class which users can pull from a scorer; to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorer#needsPayloads to create the corresponding enum on demand. Yet 
> currently only TermQuery / TermScorer implements this API; the others simply 
> return null instead. 
> To show that the API really works, and that our bulk postings work fine with 
> positions too, I cut TermSpanQuery over to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the position 
> bulk-reading implementation got some exercise, and it now :) works with 
> positions, while payloads for bulk reading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet, since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut SpanQuery.getSpans(IR) over to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
> look into the MemoryIndex BulkPostings API yet).




Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Robert Muir
here's a change that makes the test reproducible (run it a few times
and eventually you get a problematic locale/tz, then the seed will
reproduce the problem):

Index: lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java
===================================================================
--- lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java	(revision 1145431)
+++ lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestNumericQueryParser.java	(working copy)
@@ -63,26 +63,30 @@

   final private static int PRECISION_STEP = 8;
   final private static String FIELD_NAME = "field";
-  final private static Locale LOCALE = randomLocale(random);
-  final private static TimeZone TIMEZONE = randomTimeZone(random);
-  final private static Map RANDOM_NUMBER_MAP;
+  private static Locale LOCALE;
+  private static TimeZone TIMEZONE;
+  private static Map RANDOM_NUMBER_MAP;
   final private static EscapeQuerySyntax ESCAPER = new EscapeQuerySyntaxImpl();
   final private static String DATE_FIELD_NAME = "date";
-  final private static int DATE_STYLE = randomDateStyle(random);
-  final private static int TIME_STYLE = randomDateStyle(random);
+  private static int DATE_STYLE;
+  private static int TIME_STYLE;
+  private static Analyzer ANALYZER;

-  final private static Analyzer ANALYZER = new MockAnalyzer(random);
+  private static NumberFormat NUMBER_FORMAT;

-  final private static NumberFormat NUMBER_FORMAT = NumberFormat
-  .getNumberInstance(LOCALE);
+  private static StandardQueryParser qp;

-  final private static StandardQueryParser qp = new StandardQueryParser(
-  ANALYZER);
+  private static NumberDateFormat DATE_FORMAT;

-  final private static NumberDateFormat DATE_FORMAT;
-
-  static {
+  static void initFormats() {
 try {
+  LOCALE = randomLocale(random);
+  TIMEZONE = randomTimeZone(random);
+  DATE_STYLE = randomDateStyle(random);
+  TIME_STYLE = randomDateStyle(random);
+  ANALYZER = new MockAnalyzer(random);
+  NUMBER_FORMAT = NumberFormat.getNumberInstance(LOCALE);
+  qp = new StandardQueryParser(ANALYZER);
   NUMBER_FORMAT.setMaximumFractionDigits((random.nextInt() & 20) + 1);
   NUMBER_FORMAT.setMinimumFractionDigits((random.nextInt() & 20) + 1);
   NUMBER_FORMAT.setMaximumIntegerDigits((random.nextInt() & 20) + 1);
@@ -145,6 +149,7 @@

   @BeforeClass
   public static void beforeClass() throws Exception {
+initFormats();
 directory = newDirectory();
 RandomIndexWriter writer = new RandomIndexWriter(random, directory,
 newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))


On Mon, Jul 11, 2011 at 10:30 PM, Robert Muir  wrote:
> I think this test has incorrect randomization, because it initializes
> its random locale and timezone statically (not in @beforeclass).
>
> You can see this by running the test, it has the same timezone every time.
>
> On Mon, Jul 11, 2011 at 10:18 PM, Chris Male  wrote:
>> I'm seeing this locally as well.
>>
>> On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server
>>  wrote:
>>>
>>> Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/
>>>
>>> 2 tests failed.
>>> REGRESSION:
>>>  org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange
>>>
>>> Error Message:
>>> null
>>>
>>> Stack Trace:
>>> java.lang.NullPointerException
>>>        at
>>> org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
>>>        at java.text.NumberFormat.parse(NumberFormat.java:348)
>>>        at
>>> org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
>>>        at
>>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
>>>        at
>>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>>>        at
>>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>>>        at
>>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>>>        at
>>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>>>        at
>>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
>>>        at
>>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
>>>        at
>>> org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
>>>        at
>>> org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
>>>        at
>>> 

Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Robert Muir
I think this test has incorrect randomization, because it initializes
its random locale and timezone statically (not in @BeforeClass).

You can see this by running the test, it has the same timezone every time.
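[Editor's illustration] The ordering problem can be demonstrated without JUnit. In this minimal sketch (illustrative names), static initializers run at class load, before any per-run setup hook gets a chance to pick values:

```java
import java.util.*;

// Minimal illustration of why static initialization breaks the test's
// randomization: static fields are computed once at class load, before
// any @BeforeClass-style setup runs, so a "random" static value is
// fixed before per-run setup can pick it.
public class InitOrderSketch {
    static final List<String> ORDER = new ArrayList<>();

    static class TestClass {
        static final int STATIC_PICK = pick("static-init");

        static int pick(String phase) {
            ORDER.add(phase);
            return new Random().nextInt();
        }

        static void beforeClass() {          // stand-in for @BeforeClass
            pick("before-class");
        }
    }

    public static void main(String[] args) {
        int unused = TestClass.STATIC_PICK;  // forces class load
        TestClass.beforeClass();
        System.out.println(ORDER);           // [static-init, before-class]
    }
}
```

Moving the random picks into the @BeforeClass method, as the attached change does, puts them after the framework's per-run setup.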

On Mon, Jul 11, 2011 at 10:18 PM, Chris Male  wrote:
> I'm seeing this locally as well.
>
> On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server
>  wrote:
>>
>> Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/
>>
>> 2 tests failed.
>> REGRESSION:
>>  org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange
>>
>> Error Message:
>> null
>>
>> Stack Trace:
>> java.lang.NullPointerException
>>        at
>> org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
>>        at java.text.NumberFormat.parse(NumberFormat.java:348)
>>        at
>> org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
>>        at
>> org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
>>        at
>> org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
>>        at
>> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
>>        at
>> org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
>>        at
>> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange(TestNumericQueryParser.java:282)
>>        at
>> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
>>        at
>> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
>>
>>
>> REGRESSION:
>>  org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange
>>
>> Error Message:
>> null
>>
>> Stack Trace:
>> java.lang.NullPointerException
>>        at
>> org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
>>        at java.text.NumberFormat.parse(NumberFormat.java:348)
>>        at
>> org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
>>        at
>> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
>>        at
>> org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
>>        at
>> org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
>>        at
>> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
>>        at
>> org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
>>        at
>> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange(TestNumericQueryParser.java:311)
>>        at
>> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
>>        at
>> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.jav

[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063672#comment-13063672
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

Yeah, I agree.  I don't want to block the primary use case, I'm just really 
hoping we can have a solution for the second one that elegantly falls out of 
the primary one and doesn't require a two pass solution.  You are correct on 
the Term Vec access, but for large enough sets, the second search isn't 
trivial, even if it is dwarfed.  Although, I think it may be possible to at 
least access them in document order.

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>




[jira] [Commented] (LUCENE-3285) Move QueryParsers from contrib/queryparser to queryparser module

2011-07-11 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063671#comment-13063671
 ] 

Chris Male commented on LUCENE-3285:


Committed revision 1145430.

Now moving onto flexible QP.

> Move QueryParsers from contrib/queryparser to queryparser module
> 
>
> Key: LUCENE-3285
> URL: https://issues.apache.org/jira/browse/LUCENE-3285
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: modules/queryparser
>Reporter: Chris Male
> Attachments: LUCENE-3285.patch
>
>
> Each of the QueryParsers will be ported across.
> Those which use the flexible parsing framework will be placed under the 
> package flexible.  The StandardQueryParser will be renamed to 
> FlexibleQueryParser and surround.QueryParser will be renamed to 
> SurroundQueryParser.




Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Chris Male
I'm seeing this locally as well.

On Tue, Jul 12, 2011 at 1:55 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/
>
> 2 tests failed.
> REGRESSION:
>  
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange
>
> Error Message:
> null
>
> Stack Trace:
> java.lang.NullPointerException
>at
> org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
>at java.text.NumberFormat.parse(NumberFormat.java:348)
>at
> org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
>at
> org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
>at
> org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
>at
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
>at
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
>at
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange(TestNumericQueryParser.java:282)
>at
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
>at
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
>
>
> REGRESSION:
>  
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange
>
> Error Message:
> null
>
> Stack Trace:
> java.lang.NullPointerException
>at
> org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
>at java.text.NumberFormat.parse(NumberFormat.java:348)
>at
> org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
>at
> org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
>at
> org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
>at
> org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
>at
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
>at
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
>at
> org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange(TestNumericQueryParser.java:311)
>at
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
>at
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
>
>
>
>
> Build Log (for compile errors):
> [...truncated 3344 lines...]
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


-- 
Chris Male | Software Developer | JT

[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063667#comment-13063667
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
Isn't that what highlighting does? This is just highlighting on a much bigger 
set of documents. I don't see why we should prevent users from doing it just 
b/c you don't see the use case. 
{quote}

Well, it is different: I'm not saying we should prevent users from doing it, but 
we shouldn't slow down normal use cases either. I think it's fine for this to be 
a 2-pass operation, because any performance differences from it being 2-pass 
across many documents are going to be completely dwarfed by the term vector 
access!


> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: those which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack quite an interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while 
> now, so I started working on this using the bulkpostings API. I would have done 
> that first cut on trunk, but TermScorer there works on a BlockReader that does not 
> expose positions, while the one in this branch does. I started adding a new 
> Positions class which users can pull from a scorer; to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API, and others simply 
> return null instead. 
> To show that the API really works, and that our BulkPostings work fine with 
> positions too, I cut TermSpanQuery over to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise, which now all works with 
> positions :), while payloads for bulk reading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after the pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
> look into the MemoryIndex BulkPostings API yet).
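For readers following along, the shape of the proposed API boils down to pulling a positions enum from a scorer only when the consumer asked for one. The toy classes below are purely illustrative (they are not Lucene code; only the needsPositions idea and the return-null-when-not-requested behavior come from the description above):

```java
// Illustrative sketch of the "pull positions from a scorer on demand" idea.
// None of these types exist in Lucene; names are invented for the example.
import java.util.Arrays;

public class PositionsSketch {

    /** A scorer may expose its match positions on demand. */
    interface Positions {
        int NO_MORE_POSITIONS = Integer.MAX_VALUE;
        int nextPosition();
    }

    /** Toy "scorer" that only materializes a positions enum when asked for one. */
    static class ToyScorer {
        private final int[] positions;
        ToyScorer(int... positions) { this.positions = positions; }

        /** Returns null when the caller did not request positions, mirroring
         *  how non-TermQuery scorers behave in the patch as described. */
        Positions positions(boolean needsPositions) {
            if (!needsPositions) return null;
            return new Positions() {
                private int upto = 0;
                public int nextPosition() {
                    return upto < positions.length ? positions[upto++] : NO_MORE_POSITIONS;
                }
            };
        }
    }

    /** Drains an enum into an array (small fixed buffer is fine for the demo). */
    static int[] drain(Positions p) {
        int[] buf = new int[16];
        int n = 0;
        for (int pos = p.nextPosition(); pos != Positions.NO_MORE_POSITIONS; pos = p.nextPosition()) {
            buf[n++] = pos;
        }
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        ToyScorer scorer = new ToyScorer(3, 17, 42);
        assert scorer.positions(false) == null;  // no enum unless requested
        System.out.println(Arrays.toString(drain(scorer.positions(true)))); // [3, 17, 42]
    }
}
```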

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063664#comment-13063664
 ] 

Bill Bell commented on SOLR-2644:
-

New patch; you forgot 
solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java 
in the previous one.

Also, I would rather change the messages to DEBUG and leave it at that.

> DIH handler - when using threads=2 the default logging is set too high
> --
>
> Key: SOLR-2644
> URL: https://issues.apache.org/jira/browse/SOLR-2644
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.3
>Reporter: Bill Bell
>Assignee: Shalin Shekhar Mangar
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2644-2.patch, SOLR-2644.patch
>
>
> Setting the threads parameter in the DIH handler causes every add to be 
> logged at INFO level.
> The only current workaround is to set the following in log4j.properties:
> log4j.rootCategory=INFO, logfile
> log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
> log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL
> These two log messages need to be changed to DEBUG.
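A minimal illustration of the requested change (the method name below is hypothetical; the real fix touches DocBuilder and ThreadedEntityProcessorWrapper, which use SLF4J rather than java.util.logging): the per-add message moves from INFO to a DEBUG-equivalent level, here FINE, so it disappears at the default verbosity.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DihLoggingSketch {
    private static final Logger LOG = Logger.getLogger(DihLoggingSketch.class.getName());

    /** Hypothetical per-document hook: before the fix this logged at INFO,
     *  flooding the log once per added document when threads > 1. */
    static void onDocumentAdded(String id) {
        // After the fix: DEBUG-equivalent level, invisible at the default INFO level.
        LOG.log(Level.FINE, "added document {0}", id);
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);
        // At the default level the per-add message is suppressed entirely.
        System.out.println(LOG.isLoggable(Level.FINE));  // false
        onDocumentAdded("doc-1");                        // no output
    }
}
```

With the change in place, the log4j.properties workaround quoted above becomes unnecessary.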




[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Attachment: SOLR-2644-2.patch





[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Attachment: (was: SOLR-2644-2.patch)





[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Attachment: SOLR-2644-2.patch





[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063662#comment-13063662
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

bq. Really, just because its complicated? Accessing other terms 'around the 
position' seems like accessing the document in a non-inverted way.

Isn't that what highlighting does?  This is just highlighting on a much bigger 
set of documents.  I don't see why we should prevent users from doing it just 
b/c you don't see the use case.  

bq. Really? If this is so common, why do the spans get so little attention? if 
the queries are so complex, how is this even possible now given that spans have 
so many problems, even basic ones (e.g. discarding boosts)

Isn't that the point of this whole patch?  To bring "spans" into the fold and 
treat them as first-class citizens? I didn't say it happened all the time.  I just 
said it happens often enough that I think it warrants being covered before one 
"nukes spans".

bq. If performance here is so important towards looking at these 'windows 
around a match' (which is gonna be slow as shit via term vectors),
why don't I see codecs that e.g. deduplicate terms and store pointers to the 
term windows around themselves in payloads, and things like that
for this use case?

Um, b/c it's open source and not everything gets implemented the minute you 
think of it?

bq. I don't think we need to lock ourselves into a particular solution (such as 
per-position callback API) for something that sounds like its really slow 
already.

Never said we did.








[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated SOLR-2644:


Description: 
Setting threads parameter in DIH handler, every add outputs to the log in INFO 
level.
The only current solution is to set the following in log4j.properties:

log4j.rootCategory=INFO, logfile
log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL

These 2 log messages need to be changed to DEBUG.


  was:
Setting threads parameter in DIH handler, every add outputs to the log in INFO 
level.
The only current solution is to set the following in log4j.properties:

log4j.rootCategory=INFO, logfile
log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL

These 2 log messages need to be changed to > INFO.







[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9507 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9507/

2 tests failed.
REGRESSION:  
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange

Error Message:
null

Stack Trace:
java.lang.NullPointerException
    at org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
    at java.text.NumberFormat.parse(NumberFormat.java:348)
    at org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
    at org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
    at org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
    at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
    at org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
    at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testInclusiveNumericRange(TestNumericQueryParser.java:282)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)


REGRESSION:  
org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange

Error Message:
null

Stack Trace:
java.lang.NullPointerException
    at org.apache.lucene.queryParser.standard.config.NumberDateFormat.parse(NumberDateFormat.java:50)
    at java.text.NumberFormat.parse(NumberFormat.java:348)
    at org.apache.lucene.queryParser.standard.processors.NumericRangeQueryNodeProcessor.postProcessNode(NumericRangeQueryNodeProcessor.java:72)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:98)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processChildren(QueryNodeProcessorImpl.java:124)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.processIteration(QueryNodeProcessorImpl.java:96)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorImpl.process(QueryNodeProcessorImpl.java:89)
    at org.apache.lucene.queryParser.core.processors.QueryNodeProcessorPipeline.process(QueryNodeProcessorPipeline.java:88)
    at org.apache.lucene.queryParser.core.QueryParserHelper.parse(QueryParserHelper.java:254)
    at org.apache.lucene.queryParser.standard.StandardQueryParser.parse(StandardQueryParser.java:166)
    at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testQuery(TestNumericQueryParser.java:385)
    at org.apache.lucene.queryParser.standard.TestNumericQueryParser.assertRangeQuery(TestNumericQueryParser.java:356)
    at org.apache.lucene.queryParser.standard.TestNumericQueryParser.testExclusiveNumericRange(TestNumericQueryParser.java:311)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)




Build Log (for compile errors):
[...truncated 3344 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063657#comment-13063657
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
The key is you usually have a fairly complex Query to begin with, so I do think 
it is legitimate and it is the right data structure.
{quote}

Really, just because it's complicated? Accessing other terms 'around the 
position' seems like accessing the document in a non-inverted way.

{quote}
I've seen this use case multiple times, where multiple is more than 10, so I am 
pretty convinced it is beyond just me.
{quote}

Really? If this is so common, why do the spans get so little attention? If the 
queries are so complex, how is this even possible now, given that spans have so 
many problems, even basic ones (e.g. discarding boosts)?

If performance here is so important towards looking at these 'windows around a 
match' (which is gonna be slow as shit via term vectors),
why don't I see codecs that e.g. deduplicate terms and store pointers to the 
term windows around themselves in payloads, and things like that
for this use case?

I don't think we need to lock ourselves into a particular solution (such as 
per-position callback API) for something that sounds like its really slow 
already.






[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063644#comment-13063644
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

bq. I'm not sure this sounds like an inverted index is even the best data 
structure for what you describe

The key is you usually have a fairly complex Query to begin with, so I do think 
it is legitimate and it is the right data structure.  It is always driven by 
the search results.  I've seen this use case multiple times, where multiple is 
more than 10, so I am pretty convinced it is beyond just me.  I think if you 
are taking away the ability to create windows around a match (if you read my 
early comments on this issue I brought it up from the beginning), that is a 
pretty big loss.  I don't think the two things are mutually exclusive.  As long 
as I have a way to get at the positions for all matches, I don't care how it is 
done.  A "collector" type callback interface, or a way for one to iterate all 
positions for a given match, should be sufficient.

That being said, if Mike's comments about a collector like API are how it is 
implemented, I think it should work.  In reality, I think one would just need a 
way to, for whatever number of results, be told about positions as they happen. 
 Naturally, the default should be to only do this after the top X are 
retrieved, when X is small, but I could see implementing it in the scoring loop 
on certain occasions (and I'm not saying Lucene need have first order support 
for that).  As long as you don't preclude me from doing that, it should be fine.

I'll try to find time to review the patch in more depth in the coming day or so.
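The "collector type callback" idea can be sketched as follows (all names are hypothetical; nothing like this exists in the patch as posted, and the toy matcher merely stands in for a scorer that reports positions as it matches):

```java
public class PositionCollectorSketch {

    /** Callback invoked once per matching position, in doc order. */
    interface PositionCollector {
        void collect(int doc, int position);
    }

    /** Toy matcher that reports every position of every match to the callback,
     *  standing in for a scorer that exposes positions as it scores. */
    static void matchAll(int[][] positionsByDoc, PositionCollector collector) {
        for (int doc = 0; doc < positionsByDoc.length; doc++) {
            for (int position : positionsByDoc[doc]) {
                collector.collect(doc, position);
            }
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        matchAll(new int[][] { {2, 9}, {}, {4} },
                 (doc, pos) -> sb.append(doc).append(':').append(pos).append(' '));
        System.out.println(sb.toString().trim()); // 0:2 0:9 1:4
    }
}
```

The point of the callback shape is exactly what the comment above asks for: the caller decides whether to be told about positions during the scoring loop or only after the top X documents are retrieved, without the engine precluding either.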



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063635#comment-13063635
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
In those cases, some people care about the score, some do not. For instance, if 
one is analyzing all the words around the name of a company, you search term 
would be the company name and you want to iterate over all the positions where 
it matched, looking for other words near it 
{quote}

Grant, I'm not sure this sounds like an inverted index is even the best data 
structure for what you describe.

I just don't want us to confuse the issue with the nuking of spans/speeding up 
highlighting/enabling positional scoring use cases which are core to search.

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: the ones which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while, 
> so I started working on this using the bulk postings API. I would have done 
> the first cut on trunk, but TermScorer there works on a BlockReader that does 
> not expose positions, while the one in this branch does. I started adding a new 
> Positions class which users can pull from a scorer; to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API; the others simply 
> return null instead. 
> To show that the API really works, and that our BulkPostings work fine with 
> positions too, I cut TermSpanQuery over to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise, and it now all works with 
> positions :), while payloads for bulk reading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut SpanQuery.getSpans(IR) over to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
> look into the MemoryIndex BulkPostings API yet).
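As a rough, non-runnable sketch of the usage the description proposes (the only names taken from the text are ScorerContext#needsPositions and the Positions class pulled from a scorer; everything else is guessed and will differ from the actual patch):

```java
// Hypothetical sketch only -- not valid against any released Lucene version.
ScorerContext ctx = ScorerContext.def().needsPositions(true);
Scorer scorer = weight.scorer(readerContext, ctx);
int doc;
while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  // per the description, scorers that don't implement the API return null
  Positions positions = scorer.positions();
  // iterate match positions for this doc, e.g. for proximity scoring or
  // highlighting, in the same pass that produced the score
}
```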

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063630#comment-13063630
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

In the cases where I've both done this and seen it done, you often have an 
arbitrary query that matches X docs.  You then want to know where exactly the 
matches occur and then you often want to do something in a window around those 
matches.  Right now, w/ Spans, you have to run the query once to get the scores 
and then run a second time to get the windows.  The times I've seen it, the 
result is most often given to some downstream process that does deeper analysis 
of the window, so in these cases X can be quite large (1000's if not more).  In 
those cases, some people care about the score, some do not.  For instance, if 
one is analyzing all the words around the name of a company, your search term 
would be the company name and you want to iterate over all the positions where 
it matched, looking for other words near it (perhaps sentiment words or other 
things).




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063626#comment-13063626
 ] 

Robert Muir commented on LUCENE-2878:
-

I don't understand the exact use case... it still sounds like the wrong query? 
What "operating" over the positions do you need to do?




[jira] [Updated] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2048:


Fix Version/s: 3.4

> Omit positions but keep termFreq
> 
>
> Key: LUCENE-2048
> URL: https://issues.apache.org/jira/browse/LUCENE-2048
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.1
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-2048.patch, LUCENE-2048.patch
>
>
> it would be useful to have an option to discard positional information but 
> still keep the term frequency - currently setOmitTermFreqAndPositions 
> discards both. Even though position-dependent queries wouldn't work in such a 
> case, any other queries would still work fine and we would get the right 
> scoring.
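Under the existing 3.x API the limitation looks like this (a sketch; setOmitTermFreqAndPositions is the real 3.x method, while the freq-only option sketched in the comments is what this issue proposes and its final shape is not settled here):

```java
Field body = new Field("body", text, Field.Store.NO, Field.Index.ANALYZED);
// Today's all-or-nothing switch: drops BOTH term freqs and positions,
// so scoring degrades even when you never run positional queries.
body.setOmitTermFreqAndPositions(true);
// What this issue adds: an in-between option that keeps freqs (for correct
// scoring) but omits positions -- hypothetically something like an
// index-options setting with a "docs and freqs" level between
// "docs only" and "docs, freqs and positions".
```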




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063625#comment-13063625
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

bq. I don't think this issue should try to solve that problem: if you are doing 
that, it sounds like you are using the wrong Query!

It's basically a boolean match on any arbitrary Query where you care about the 
positions.  Pretty common in e-discovery and other areas.  You have a query 
that tells you all the matches and you want to operate over the positions.  
Right now, it's a pain as you have to execute the query twice: once to get the 
scores and once to get the positions/spans.  If you have a callback mechanism, 
one can do both at once.
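As a toy, Lucene-free sketch of that callback idea (all names here are illustrative, not Lucene API), a scorer that reports score and positions together in a single pass might look like:

```java
import java.util.*;

public class CallbackDemo {
    // Callback invoked once per matching doc: score and positions arrive
    // together, avoiding the second query pass described above.
    interface MatchCallback {
        void onMatch(int doc, float score, int[] positions);
    }

    // postings: doc -> term positions in that doc (stand-in for a postings list)
    static List<String> collect(SortedMap<Integer, int[]> postings) {
        List<String> out = new ArrayList<>();
        MatchCallback cb = (doc, score, pos) ->
            out.add(doc + ":" + Arrays.toString(pos) + ":" + score);
        for (Map.Entry<Integer, int[]> e : postings.entrySet()) {
            int[] pos = e.getValue();
            float score = pos.length;           // toy tf-only "score"
            cb.onMatch(e.getKey(), score, pos); // one pass: score AND positions
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<Integer, int[]> postings = new TreeMap<>();
        postings.put(3, new int[]{5, 17});
        postings.put(9, new int[]{2});
        System.out.println(collect(postings));
        // prints [3:[5, 17]:2.0, 9:[2]:1.0]
    }
}
```

A downstream window-analysis consumer would simply be another MatchCallback, so the index is walked once regardless of how many consumers care about the positions.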




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063622#comment-13063622
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
FWIW, I do think there are use cases where one wants positions over all hits 
(or most such that you might as well do all), so if it doesn't cause problems 
for the main use case, it would be nice to support it. In fact, in these 
scenarios, you usually care less about the PQ and more about the positions. 
{quote}

I don't think this issue should try to solve that problem: if you are doing 
that, it sounds like you are using the wrong Query!





[jira] [Updated] (LUCENE-3282) BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction

2011-07-11 Thread Shay Banon (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Banon updated LUCENE-3282:
---

Attachment: LUCENE-3282.patch

New version, with CollectorProvider.

> BlockJoinQuery: Allow to add a custom child collector, and customize the 
> parent bitset extraction
> -
>
> Key: LUCENE-3282
> URL: https://issues.apache.org/jira/browse/LUCENE-3282
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 3.4, 4.0
>Reporter: Shay Banon
> Attachments: LUCENE-3282.patch, LUCENE-3282.patch
>
>
> It would be nice to allow adding a custom child collector to the 
> BlockJoinQuery, to be called on every matching doc (so we can do things with 
> it, like counts and such). Also, allow extending BlockJoinQuery with custom 
> code that converts the filter bitset to an OpenBitSet.




[jira] [Commented] (LUCENE-3282) BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction

2011-07-11 Thread Shay Banon (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063619#comment-13063619
 ] 

Shay Banon commented on LUCENE-3282:


Heya,

   In my app I have a wrapper around OBS with a common interface that 
allows accessing bits by index (similar to Bits in trunk), so I need to extract 
the OBS from it.

   Regarding the Collector, I will work on a CollectorProvider interface. I liked 
the NoOpCollector option since then you don't have to check for nulls each 
time...
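The NoOpCollector idea mentioned here is easy to sketch against the Lucene 3.x Collector API (compile-only illustration; the class name and the idea that BlockJoinQuery would default to it instead of null-checking are assumptions from this thread, not committed code):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Do-nothing default child collector: BlockJoinQuery could call it
// unconditionally on every matching child doc instead of checking for null.
public final class NoOpCollector extends Collector {
  @Override public void setScorer(Scorer scorer) throws IOException {}
  @Override public void collect(int doc) throws IOException {}
  @Override public void setNextReader(IndexReader reader, int docBase)
      throws IOException {}
  @Override public boolean acceptsDocsOutOfOrder() { return true; }
}
```

The trade-off versus a null check is a virtual call per child hit, which the JIT typically inlines away for an empty final class.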




[jira] [Commented] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063618#comment-13063618
 ] 

Michael McCandless commented on LUCENE-2308:


Nikola tracked this down -- it's because we're not reading numeric fields back 
properly from stored fields.

> Separately specify a field's type
> -
>
> Key: LUCENE-2308
> URL: https://issues.apache.org/jira/browse/LUCENE-2308
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
>
> Attachments: LUCENE-2308-10.patch, LUCENE-2308-2.patch, 
> LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-4.patch, 
> LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, 
> LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, 
> LUCENE-2308.patch, LUCENE-2308.patch
>
>
> This came up from discussions on IRC.  I'm summarizing here...
> Today when you make a Field to add to a document you can set things like
> indexed or not, stored or not, analyzed or not, details like omitTfAP,
> omitNorms, index term vectors (separately controlling
> offsets/positions), etc.
> I think we should factor these out into a new class (FieldType?).
> Then you could re-use this FieldType instance across multiple fields.
> The Field instance would still hold the actual value.
> We could then do per-field analyzers by adding a setAnalyzer on the
> FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
> for per-field codecs (with flex), where we now have
> PerFieldCodecWrapper).
> This would NOT be a schema!  It's just refactoring what we already
> specify today.  E.g. it's not serialized into the index.
> This has been discussed before, and I know Michael Busch opened a more
> ambitious (I think?) issue.  I think this is a good first baby step.  We could
> consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
> off on that for starters...




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063614#comment-13063614
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

FWIW, I do think there are use cases where one wants positions over all hits 
(or most such that you might as well do all), so if it doesn't cause problems 
for the main use case, it would be nice to support it.  In fact, in these 
scenarios, you usually care less about the PQ and more about the positions. 




[jira] [Commented] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063612#comment-13063612
 ] 

Michael McCandless commented on LUCENE-2048:


Looks great!  +1 to commit.




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063610#comment-13063610
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
But that would force you actually into a third pass (I think), if you wanted 
positional scoring too, wouldn't it?
{quote}

I think that's OK, because the two things are different: 
* in general I think you want to rerank more than just page 1 with scoring, 
e.g. maybe 100 or even 1000 documents versus the 20 that highlighting needs.
* for scoring, we need to adjust our PQ, resulting in a (possibly) different 
set of page 1 documents for the highlighting process, so if we are doing both 
algorithms, we still don't yet know what to highlight anyway.
* if we assume we are going to add offsets (optionally) to our postings lists 
in parallel to the positions, that's another difference: scoring doesn't care 
about offsets, but highlighting needs them.


> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: the ones which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while, 
> so I started working on that using the bulkpostings API. I would have done 
> that first cut on trunk, but TermScorer there works on BlockReaders that do 
> not expose positions, while the one in this branch does. I started adding a 
> new Positions class which users can pull from a scorer; to prevent 
> unnecessary positions enums I added ScorerContext#needsPositions and 
> eventually Scorer#needsPayloads to create the corresponding enum on demand. 
> Yet currently only TermQuery / TermScorer implements this API; others simply 
> return null instead. 
> To show that the API really works, and that our BulkPostings work fine with 
> positions too, I cut TermSpanQuery over to use a TermScorer under the hood 
> and nuked TermSpans entirely. A nice side effect of this was that the 
> Position BulkReading implementations got some exercise, which now all work 
> with positions :), while Payloads for bulk reading are kind of experimental 
> in the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet, since I want to get feedback on the API 
> and on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
> look into the MemoryIndex BulkPostings API yet).




[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser

2011-07-11 Thread Vinicius Barros (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063607#comment-13063607
 ] 

Vinicius Barros commented on LUCENE-1768:
-

Thanks for committing the patch Uwe!

I will review the code again, looking for switch statements without a default 
case, and fix them.

I have never done anything with javacc; I just quickly looked at the code and 
it does not seem complicated. However, I have no idea how complex it is to run 
javacc and regenerate the Java files. Does the Lucene ant script do that 
automatically?

I can try to fix open range queries in the contrib query parser, add "<="-like 
operators, or backport the numeric support to 3.x. Just let me know the 
priorities and I will work on them. My suggestion is that the bug in open range 
queries is the most critical now, so I could start working on that. Your call, 
Uwe.

> NumericRange support for new query parser
> -
>
> Key: LUCENE-1768
> URL: https://issues.apache.org/jira/browse/LUCENE-1768
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: core/queryparser
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>  Labels: contrib, gsoc, gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
>
> Attachments: week-7.patch, week1.patch, week2.patch, week3.patch, 
> week4.patch, week5-6.patch
>
>
> It would be good to specify some type of "schema" for the query parser in the 
> future, to automatically create a NumericRangeQuery for the different numeric 
> types. It would then be possible to index a numeric value 
> (double, float, long, int) using NumericField, and the query parser would 
> know which type of field this is and correctly create a NumericRangeQuery 
> for strings like "[1.567..*]" or "(1.787..19.5]".
> There is currently no way to tell from the index whether a field is numeric, 
> so the user will have to configure the FieldConfig objects in the 
> ConfigHandler. But once this is done, it will not be that difficult to 
> implement the rest.
> The only difference from the current handling of RangeQuery is then the 
> instantiation of the correct Query type and conversion of the entered numeric 
> values (a simple Number.valueOf(...) conversion of the user-entered numbers). 
> Everything else is identical; NumericRangeQuery also supports the MTQ 
> rewrite modes (as it is an MTQ).
> Another thing is a change in Date semantics. There are some strange flags in 
> the current parser that tell it how to handle dates.




[jira] [Updated] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Nikola Tankovic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Tankovic updated LUCENE-2308:


Attachment: LUCENE-2308-10.patch

Solr cutover to FieldType. I keep hitting similar repeated errors in the tests 
and am trying to debug. Help is appreciated :)

> Separately specify a field's type
> -
>
> Key: LUCENE-2308
> URL: https://issues.apache.org/jira/browse/LUCENE-2308
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
>
> Attachments: LUCENE-2308-10.patch, LUCENE-2308-2.patch, 
> LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-4.patch, 
> LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, 
> LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, 
> LUCENE-2308.patch, LUCENE-2308.patch
>
>
> This came up from discussions on IRC.  I'm summarizing here...
> Today when you make a Field to add to a document you can set things like
> indexed or not, stored or not, analyzed or not, and details like omitTfAP,
> omitNorms, index term vectors (separately controlling
> offsets/positions), etc.
> I think we should factor these out into a new class (FieldType?).
> Then you could re-use this FieldType instance across multiple fields.
> The Field instance would still hold the actual value.
> We could then do per-field analyzers by adding a setAnalyzer on the
> FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
> for per-field codecs (with flex), where we now have
> PerFieldCodecWrapper).
> This would NOT be a schema!  It's just refactoring what we already
> specify today.  E.g. it's not serialized into the index.
> This has been discussed before, and I know Michael Busch opened a more
> ambitious (I think?) issue.  I think this is a good first baby step.  We could
> consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
> off on that for starters...




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063595#comment-13063595
 ] 

Mike Sokolov commented on LUCENE-2878:
--

bq. But I think this is what we want? If there are 10 million documents 
that match a query, but our priority queue size is 20 (1 page), we only want to 
do the expensive highlighting on those 20 documents. 

Yes - the comments may be getting lost in the weeds a bit here; sorry.  I've 
been assuming you'd search once to collect documents and then search again with 
the same query plus a constraint limiting it to the gathered docids, with an 
indication that positions are required; this pushes you towards some sort of 
collector-style callback API. Maybe life would be simpler if instead you could 
just say getPositionIterator(docid, query). But that would force you 
into a third pass (I think), if you wanted positional scoring too, wouldn't it?
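
The two API shapes Mike contrasts could look roughly like this. These are 
hypothetical interfaces and names for illustration, not the actual Lucene API:

```java
import java.util.*;
import java.util.function.IntFunction;

/** Sketch of the two styles discussed: a push-style collector that receives
 *  positions during the constrained second search, versus a pull-style
 *  getPositionIterator(docid, query). All names are illustrative. */
public class PositionApiSketch {
    /** Push style: the search loop calls back with each doc's positions. */
    interface PositionCollector {
        void collect(int doc, int[] positions);
    }

    /** Second pass constrained to the previously gathered doc ids. */
    static void searchWithPositions(SortedSet<Integer> gatheredDocs,
                                    IntFunction<int[]> positionsSource,
                                    PositionCollector collector) {
        for (int doc : gatheredDocs) {               // only revisit page-1 docs
            collector.collect(doc, positionsSource.apply(doc));
        }
    }

    /** Pull style: ask for one doc's positions on demand. Simpler to call,
     *  but implies a third pass if positional scoring is also needed. */
    static int[] getPositionIterator(int doc, IntFunction<int[]> positionsSource) {
        return positionsSource.apply(doc);
    }

    public static void main(String[] args) {
        IntFunction<int[]> fakePositions = doc -> new int[] { doc, doc + 7 };
        SortedSet<Integer> page1 = new TreeSet<>(List.of(3, 11, 42));
        searchWithPositions(page1, fakePositions,
            (doc, pos) -> System.out.println(doc + " -> " + Arrays.toString(pos)));
        System.out.println(Arrays.toString(getPositionIterator(42, fakePositions)));
    }
}
```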





[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063582#comment-13063582
 ] 

Robert Muir commented on LUCENE-2878:
-

{quote}
But if I'm right, then it means there has to be some sort of callback mechanism 
in place during scoring, or else we have to resign ourselves to scoring first, 
and then re-setting and iterating positions in a second pass.
{quote}

But I think this is what we want? If there are 10 million documents 
that match a query, but our priority queue size is 20 (1 page), we only want to 
do the expensive highlighting on those 20 documents. 

It's the same for positional scoring: it's too expensive to look at positions 
for all documents, so you re-order maybe the top 100 or so.

Or maybe I'm totally confused by the comments!





[jira] [Updated] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2048:


Attachment: LUCENE-2048.patch

OK, here's an updated patch. I think it's ready to commit!

> Omit positions but keep termFreq
> 
>
> Key: LUCENE-2048
> URL: https://issues.apache.org/jira/browse/LUCENE-2048
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.1
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2048.patch, LUCENE-2048.patch
>
>
> It would be useful to have an option to discard positional information but 
> still keep the term frequency; currently setOmitTermFreqAndPositions 
> discards both. Even though position-dependent queries wouldn't work in that 
> case, all other queries would still work fine and we would get the right 
> scoring.




[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser

2011-07-11 Thread Adriano Crestani (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063570#comment-13063570
 ] 

Adriano Crestani commented on LUCENE-1768:
--

{quote}
I am not sure if this is really a break when you change inheritance. If the 
code still compiles, it's no break; if classes were renamed, it's more serious. 
I am not sure if implementation classes (and names) should be covered by 
backwards compatibility. In my opinion, mainly the configuration and interfaces 
of the QP must be covered by the backwards policy.
{quote}

I didn't see any class renaming; I need to double-check Vinicius's patches. But 
he did change the query node inheritance, which may affect how processors and 
builders (especially QueryNodeTreeBuilder) work. I am not saying it is not 
possible to implement his approach on 3.x, but he will need to deal differently 
with the query node classes he created. As I said before, what he did is good 
and clean, and I like the way it is, but it will break someone's code if pushed 
to 3.x. So if you ask me whether to push it to 3.x, I say YES; just make sure 
not to break the query node structure that people may be relying on.

{quote}
As we are now at mid-time, it would be a good idea to maybe add some extra 
syntax support for numerics, like "<" and ">". We should also add tests/support 
for half-open ranges, so syntax like "[* TO 1.0]" should also be supported (I 
am not sure if TermRangeQueryNode supports this, but numerics should do this 
in all cases); the above syntax is also printed by 
NumericRangeQuery.toString() if one of the bounds is null. The latter could be 
easily implemented by checking for "*" as input to the range bounds and mapping 
those special "values" to null. Adding support for "<" and ">" (also "<=", 
">=") requires knowledge of the JavaCC parser language. Vinicius, have you ever 
worked with JavaCC, and do you think you will be able to extend the syntax?
{quote}

I still need to investigate the bugs Vinicius reported (a JIRA should have 
been created for that already); I never really tried open ranges in the contrib 
QP. And if Vinicius thinks he will have the time and skills to make the JavaCC 
changes to support those new operators, go for it! And remember, Vinicius: you 
don't need to do everything during GSoC, you are always welcome to contribute 
code whenever you want :)





[jira] [Closed] (SOLR-2645) NullPointerException in StopFilter

2011-07-11 Thread Gabriele Kahlout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriele Kahlout closed SOLR-2645.
--

Resolution: Invalid

> NullPointerException in StopFilter
> --
>
> Key: SOLR-2645
> URL: https://issues.apache.org/jira/browse/SOLR-2645
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 3.3
>Reporter: Gabriele Kahlout
>Priority: Minor
> Fix For: 3.4
>
> Attachments: SolrAddBug.zip
>
>
> To reproduce:
> Download and unzip attached and from the extracted directory run: {code} mvn 
> test {code}
> {code}
> java.lang.NullPointerException
>   at org.apache.lucene.analysis.CharArraySet.<init>(CharArraySet.java:91)
>   at org.apache.lucene.analysis.StopFilter.<init>(StopFilter.java:101)
>   at org.apache.lucene.analysis.StopFilter.<init>(StopFilter.java:93)
>   at 
> org.apache.solr.analysis.StopFilterFactory.create(StopFilterFactory.java:84)
>   at 
> org.apache.solr.analysis.StopFilterFactory.create(StopFilterFactory.java:43)
>   at 
> org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:71)
>   at 
> org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:75)
>   at 
> org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:385)
>   at 
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:126)
>   at 
> org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)
>   at 
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
>   at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2056)
>   at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2030)
>   at com.mysimpatico.solraddbug.SolrTest.test(SolrTest.java:45)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
>   at 
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1277)
>   at 
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1195)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>   at 
> org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
>   at $Proxy0.invoke(Unknown Source)
>   at 
> org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
>   at 
> org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
>   at 
> org.apache.maven.surefire.booter.ForkedBo

[jira] [Commented] (LUCENE-1768) NumericRange support for new query parser

2011-07-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063555#comment-13063555
 ] 

Uwe Schindler commented on LUCENE-1768:
---

Vinicius, do you have any plans for backporting the stuff to Lucene 3.x? It 
should not be that hard :-)

bq. I am not sure about numeric support, Vinicius changed TermRangeQueryNode 
inheritance, which breaks the backwards compatibility. I am not saying the 
change is bad, I agree with the new structure, however Vinicius will need to 
find another solution before backporting it to 3.x.

I am not sure if this is really a break when you change inheritance. If the 
code still compiles, it's no break; if classes were renamed, it's more serious. 
I am not sure if implementation classes (and names) should be covered by 
backwards compatibility. In my opinion, mainly the configuration and interfaces 
of the QP must be covered by the backwards policy.

As we are now at mid-time, it would be a good idea to maybe add some extra 
syntax support for numerics, like "<" and ">". We should also add tests/support 
for half-open ranges, so syntax like "[* TO 1.0]" should also be supported (I 
am not sure if TermRangeQueryNode supports this, but numerics should do this 
in all cases); the above syntax is also printed by 
NumericRangeQuery.toString() if one of the bounds is null. The latter could be 
easily implemented by checking for "*" as input to the range bounds and mapping 
those special "values" to null. Adding support for "<" and ">" (also "<=", 
">=") requires knowledge of the JavaCC parser language. Vinicius, have you ever 
worked with JavaCC, and do you think you will be able to extend the syntax?
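
Mapping the "*" wildcard to a null (open) bound, as suggested, could look 
roughly like this. This is a hypothetical standalone helper, not the contrib 
query parser code; the parser integration is omitted:

```java
/** Sketch of treating "*" range bounds as open (null) ends, as suggested
 *  above for syntax like "[* TO 1.0]". Names here are illustrative. */
public class OpenRangeSketch {
    /** Returns null for the "*" wildcard, otherwise the parsed number. */
    static Double parseBound(String text) {
        return "*".equals(text) ? null : Double.valueOf(text);
    }

    /** A null bound means that end of the range is unconstrained. */
    static boolean inRange(double v, Double lower, Double upper,
                           boolean includeLower, boolean includeUpper) {
        if (lower != null && (includeLower ? v < lower : v <= lower)) return false;
        if (upper != null && (includeUpper ? v > upper : v >= upper)) return false;
        return true;
    }

    public static void main(String[] args) {
        Double lower = parseBound("*");    // open lower bound
        Double upper = parseBound("1.0");
        System.out.println(inRange(-100.0, lower, upper, true, true)); // true
        System.out.println(inRange(2.0, lower, upper, true, true));    // false
    }
}
```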






[jira] [Updated] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2308:
---

Attachment: LUCENE-2308-ltc.patch

Small patch to fix LTC.newField to again randomly add in term vectors when they 
are disabled.

> Separately specify a field's type
> -
>
> Key: LUCENE-2308
> URL: https://issues.apache.org/jira/browse/LUCENE-2308
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
>
> Attachments: LUCENE-2308-2.patch, LUCENE-2308-3.patch, 
> LUCENE-2308-4.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, 
> LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, 
> LUCENE-2308-9.patch, LUCENE-2308-ltc.patch, LUCENE-2308.patch, 
> LUCENE-2308.patch
>
>
> This came up from dicussions on IRC.  I'm summarizing here...
> Today when you make a Field to add to a document you can set things
> index or not, stored or not, analyzed or not, details like omitTfAP,
> omitNorms, index term vectors (separately controlling
> offsets/positions), etc.
> I think we should factor these out into a new class (FieldType?).
> Then you could re-use this FieldType instance across multiple fields.
> The Field instance would still hold the actual value.
> We could then do per-field analyzers by adding a setAnalyzer on the
> FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
> for per-field codecs (with flex), where we now have
> PerFieldCodecWrapper).
> This would NOT be a schema!  It's just refactoring what we already
> specify today.  EG it's not serialized into the index.
> This has been discussed before, and I know Michael Busch opened a more
> ambitious (I think?) issue.  I think this is a good first baby step.  We could
> consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
> off on that for starters...
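The refactoring the description proposes can be sketched in plain Java. All class and field names below are illustrative only - this is not the API that LUCENE-2308 eventually settled on, just the shape of the idea: index-time options move out of Field into an immutable, reusable FieldType, while each Field keeps only its name and value.

```java
// Hypothetical sketch of the proposed FieldType refactoring (names are
// illustrative, not Lucene's actual API).
public class FieldTypeSketch {

    // Immutable bundle of per-field indexing options, shareable across fields.
    static final class FieldType {
        final boolean indexed, stored, tokenized, omitNorms, storeTermVectors;
        FieldType(boolean indexed, boolean stored, boolean tokenized,
                  boolean omitNorms, boolean storeTermVectors) {
            this.indexed = indexed; this.stored = stored; this.tokenized = tokenized;
            this.omitNorms = omitNorms; this.storeTermVectors = storeTermVectors;
        }
    }

    // The Field keeps only its name and value; the options live in the type.
    static final class Field {
        final String name; final String value; final FieldType type;
        Field(String name, String value, FieldType type) {
            this.name = name; this.value = value; this.type = type;
        }
    }

    public static void main(String[] args) {
        // One FieldType instance reused across many Field instances.
        FieldType titleType = new FieldType(true, true, true, false, true);
        Field a = new Field("title", "Lucene in Action", titleType);
        Field b = new Field("title", "Managing Gigabytes", titleType);
        assert a.type == b.type : "both fields share the same FieldType instance";
        System.out.println("shared type: " + (a.type == b.type));
    }
}
```

Note this is exactly "not a schema" in the sense of the description: the FieldType object is a convenience for callers and nothing here would be serialized into the index.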




[jira] [Updated] (LUCENE-3287) Allow ability to set maxDocCharsToAnalyze in WeightedSpanTermExtractor

2011-07-11 Thread Jahangir Anwari (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jahangir Anwari updated LUCENE-3287:


Description: 
Spinoff from this thread:

http://www.gossamer-threads.com/lists/lucene/java-user/129668

In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. This 
keeps us from getting the weighted span terms in any custom code (e.g. the 
attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. Currently 
the setMaxDocCharsToAnalyze() method is protected, which prevents us from 
setting maxDocCharsToAnalyze to a value greater than 0. Changing the method to 
public would give us the ability to set maxDocCharsToAnalyze.


  was:
In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. This 
inhibits us from getting the weighted span terms in any custom code(e.g 
attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. Currently 
the setMaxDocCharsToAnalyze() method is protected, which prevents us from 
setting  maxDocCharsToAnalyze to a value greater than 0. Changing the method to 
public would give us the ability to set the maxDocCharsToAnalyze.



> Allow ability to set maxDocCharsToAnalyze in WeightedSpanTermExtractor
> --
>
> Key: LUCENE-3287
> URL: https://issues.apache.org/jira/browse/LUCENE-3287
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 3.3
>Reporter: Jahangir Anwari
>Priority: Trivial
> Attachments: CustomHighlighter.java, WeightedSpanTermExtractor.patch
>
>
> Spinoff from this thread:
> http://www.gossamer-threads.com/lists/lucene/java-user/129668
> In WeightedSpanTermExtractor the default maxDocCharsToAnalyze value is 0. 
> This keeps us from getting the weighted span terms in any custom code (e.g. 
> the attached CustomHighlighter.java) that uses WeightedSpanTermExtractor. 
> Currently the setMaxDocCharsToAnalyze() method is protected, which prevents 
> us from setting maxDocCharsToAnalyze to a value greater than 0. Changing the 
> method to public would give us the ability to set maxDocCharsToAnalyze.
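The visibility problem can be illustrated with a stripped-down mock. This is not the real WeightedSpanTermExtractor - just a sketch of what the proposed protected-to-public change buys calling code:

```java
// Stripped-down mock illustrating the visibility issue (not the real
// WeightedSpanTermExtractor). With a protected setter, only subclasses can
// raise the limit; once public, any custom caller can set it directly.
public class ExtractorVisibilityDemo {

    static class MockExtractor {
        private int maxDocCharsToAnalyze = 0;  // default 0, as in the report

        // The patch proposes changing this modifier from protected to public.
        public void setMaxDocCharsToAnalyze(int maxDocCharsToAnalyze) {
            this.maxDocCharsToAnalyze = maxDocCharsToAnalyze;
        }

        public int getMaxDocCharsToAnalyze() { return maxDocCharsToAnalyze; }
    }

    public static void main(String[] args) {
        MockExtractor extractor = new MockExtractor();
        // With a public setter this call is possible from arbitrary custom code:
        extractor.setMaxDocCharsToAnalyze(51200);
        assert extractor.getMaxDocCharsToAnalyze() == 51200;
        System.out.println("limit = " + extractor.getMaxDocCharsToAnalyze());
    }
}
```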




[jira] [Created] (LUCENE-3304) Allow WeightedSpanTermExtractor to collect positions for TermQuerys

2011-07-11 Thread Jahangir Anwari (JIRA)
Allow WeightedSpanTermExtractor to collect positions for TermQuerys
---

 Key: LUCENE-3304
 URL: https://issues.apache.org/jira/browse/LUCENE-3304
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.3
Reporter: Jahangir Anwari
Priority: Trivial


Spinoff from this thread:

http://www.gossamer-threads.com/lists/lucene/java-user/129668

Currently WeightedSpanTermExtractor only collects positions for "position 
sensitive" queries. Allowing WeightedSpanTermExtractor to store positions for 
TermQuery as well would enable it to be used outside the highlighter, in 
custom plugins, to get position information.




[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063511#comment-13063511
 ] 

Steven Rowe commented on SOLR-2551:
---

The [Lucene-Solr-tests-only-trunk Jenkins 
job|https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/] has run only 
once since the DIH tests were made to run sequentially 
(https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9500/), so I'll 
delay closing this issue until it's successfully run 15 or 20 more times, which 
should take less than one day.

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or specifically the 
> dataimport.properties file) is not writable by Solr. It would be great if 
> that were detected when starting a dataimport job. 
> Currently an import might grind away for days and then fail because it can't 
> write its timestamp to the dataimport.properties file.
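A pre-flight check along these lines would fail fast instead of after days of importing. The method name and placement here are illustrative; this is not the code in the attached patch, just the kind of test it needs to perform:

```java
import java.io.File;
import java.io.IOException;

// Illustrative pre-flight check: verify the properties file (or its parent
// directory, if the file does not exist yet) is writable before importing.
public class WriteAccessCheck {

    public static boolean canPersist(File propsFile) {
        if (propsFile.exists()) {
            return propsFile.canWrite();
        }
        // File will be created later; the directory must be writable instead.
        File dir = propsFile.getAbsoluteFile().getParentFile();
        return dir != null && dir.canWrite();
    }

    public static void main(String[] args) throws IOException {
        File props = File.createTempFile("dataimport", ".properties");
        props.deleteOnExit();
        if (!canPersist(props)) {
            // Aborting up front beats failing after a multi-day import.
            throw new IllegalStateException(
                "dataimport.properties is not writable; aborting import");
        }
        System.out.println("writable: " + canPersist(props));
    }
}
```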




RE: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still Failing

2011-07-11 Thread Steven A Rowe
More @Override annotations - I've again committed the fix (removing the 
annotations):

[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:274:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:284:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/analysis/TestSynonymMap.java:289:
 method does not override a method from its superclass
[javac]   @Override
[javac]^

> -Original Message-
> From: Apache Jenkins Server [mailto:jenk...@builds.apache.org]
> Sent: Monday, July 11, 2011 4:11 PM
> To: dev@lucene.apache.org
> Subject: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still
> Failing
> 
> Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9512/
> 
> No tests ran.
> 
> Build Log (for compile errors):
> [...truncated 3549 lines...]
> [javac] NamedList fieldTypes =
> result.get("field_types");
> [javac] ^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:133: warning: [unchecked] unchecked conversion
> [javac] found   : org.apache.solr.common.util.NamedList
> [javac] required:
> org.apache.solr.common.util.NamedList st>
> [javac] NamedList textType = fieldTypes.get("text");
> [javac]   ^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:136: warning: [unchecked] unchecked conversion
> [javac] found   : org.apache.solr.common.util.NamedList
> [javac] required:
> org.apache.solr.common.util.NamedList on.util.NamedList>>
> [javac] NamedList> indexPart =
> textType.get("index");
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:201: warning: [unchecked] unchecked conversion
> [javac] found   : org.apache.solr.common.util.NamedList
> [javac] required:
> org.apache.solr.common.util.NamedList on.util.NamedList>>
> [javac] NamedList> queryPart =
> textType.get("query");
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:230: warning: [unchecked] unchecked conversion
> [javac] found   : org.apache.solr.common.util.NamedList
> [javac] required:
> org.apache.solr.common.util.NamedList st>
> [javac] NamedList nameTextType =
> fieldTypes.get("nametext");
> [javac]   ^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:233: warning: [unchecked] unchecked conversion
> [javac] found   : org.apache.solr.common.util.NamedList
> [javac] required:
> org.apache.solr.common.util.NamedList on.util.NamedList>>
> [javac] indexPart = nameTextType.get("index");
> [javac] ^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:250: warning: [unchecked] unchecked conversion
> [javac] found   : org.apache.solr.common.util.NamedList
> [javac] required:
> org.apache.solr.common.util.NamedList on.util.NamedList>>
> [javac] queryPart = nameTextType.get("query");
> [javac] ^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:256: warning: [unchecked] unchecked conversion
> [javac] found   : org.apache.solr.common.util.NamedList
> [javac] required:
> org.apache.solr.common.util.NamedList st>
> [javac] NamedList fieldNames =
> result.get("field_names");
> [javac] ^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHa
> ndlerTest.java:259: warning: [un

[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9512 - Still Failing

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9512/

No tests ran.

Build Log (for compile errors):
[...truncated 3549 lines...]
[javac] NamedList fieldTypes = result.get("field_types");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:133:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList
[javac] NamedList textType = fieldTypes.get("text");
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:136:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList>
[javac] NamedList> indexPart = textType.get("index");
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:201:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList>
[javac] NamedList> queryPart = textType.get("query");
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:230:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList
[javac] NamedList nameTextType = fieldTypes.get("nametext");
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:233:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList>
[javac] indexPart = nameTextType.get("index");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:250:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList>
[javac] queryPart = nameTextType.get("query");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:256:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList
[javac] NamedList fieldNames = result.get("field_names");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:259:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList
[javac] NamedList whitetok = fieldNames.get("whitetok");
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:262:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList>
[javac] indexPart = whitetok.get("index");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:279:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.common.util.NamedList
[javac] required: 
org.apache.solr.common.util.NamedList>
[javac] queryPart = whitetok.get("query");
[javac] ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/test/org/apache/solr/handler/FieldAnalysisRequestHandlerTest.java:288:
 warning: [unchecked] unchecked conversion
[ja

[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063499#comment-13063499
 ] 

Mike Sokolov commented on LUCENE-2878:
--

OK, I think I brushed past some of your comments, Simon, in my hasty response - 
sorry.  Here's a little more thought, I hope:

bq. So bottom line here is that we need an API that is capable of collecting 
fine-grained parts of the scorer tree. The only way I see of doing this is to 1. have 
a subscribe/register method and 2. do this subscription during scorer 
creation. Once we have this we can implement very simple collect methods that 
only collect positions for the current match, like in a near query: while the 
current matching document is collected, all contributing TermScorers have their 
position interval ready for collection. The collect method can then be called 
from the consumer instead of in the loop; this way we only get the positions we 
need, since we know the document we are collecting.

I *think* it's necessary to have both a callback from within the scoring loop, 
and a mechanism for iterating over the current state of the iterator.  For 
boolean queries, the positions will never be iterated in the scoring loop (all 
you care about is the frequencies; positions are ignored), so some new process - 
either the position collector (the highlighter, say), or a loop in the scorer 
that knows positions are being consumed (needsPositions==true) - has to cause 
the iteration to be performed.  But for position-aware queries (like phrases), 
the scorer *will* iterate over positions, and in order to score properly, I 
think the Scorer has to drive the iteration.  I tried a few different 
approaches to this before deciding to just push the iteration into the Scorer, 
but none of them really worked properly.

Let's say, for example, that a document is collected.  Then the position 
consumer comes in to find out what positions were matched - it may already be 
too late, because during scoring, some of the positions may have been consumed 
(e.g. to score phrases)?  It's possible I may be suffering from some delusion, 
though :)  But if I'm right, then it means there has to be some sort of 
callback mechanism in place *during scoring*, or else we have to resign 
ourselves to scoring first, and then re-setting and iterating positions in a 
second pass.

I actually think that if we follow through with the 
registration-during-construction idea, we can have the tests done in an 
efficient way during scoring (with final boolean properties of the scorers), 
and it can be OK to have them in the scoring loop.
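The register-during-construction idea can be sketched with hypothetical types. None of these names exist in Lucene - this is only meant to show a collector, registered at scorer-creation time, being notified from inside the scoring loop so positions are observed before the scorer consumes them:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a position collector registered when the scorer is
// constructed. The scorer reports each position as it consumes it, so a
// consumer (e.g. a highlighter) still sees positions for queries whose
// scoring itself iterates them (like phrases).
public class PositionCallbackSketch {

    interface PositionCollector {
        void collect(int doc, int position);
    }

    // Stand-in for a phrase-like scorer that must iterate positions to score.
    static class MockPhraseScorer {
        private final PositionCollector collector;  // registered at construction

        MockPhraseScorer(PositionCollector collector) {
            this.collector = collector;
        }

        // Scoring consumes the positions, but notifies the collector first,
        // so nothing is lost to the second-pass problem described above.
        float scoreDoc(int doc, int[] positions) {
            for (int pos : positions) {
                if (collector != null) collector.collect(doc, pos);
            }
            return positions.length;  // toy score: just the match count
        }
    }

    public static void main(String[] args) {
        List<Integer> seen = new ArrayList<Integer>();
        MockPhraseScorer scorer = new MockPhraseScorer((doc, pos) -> seen.add(pos));
        float score = scorer.scoreDoc(0, new int[] {3, 7, 11});
        assert seen.size() == 3 : "collector saw every position during scoring";
        System.out.println("score=" + score + " positions=" + seen);
    }
}
```

When needsPositions is false the collector reference would simply be null, keeping the per-position branch cheap in the scoring loop.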

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries, the one which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do and at the end 
> of the day they are duplicating lot of code all over lucene. Span*Queries are 
> also limited to other Span*Query instances such that you can not use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while 
> now, so I started working on that using the bulkpostings API. I would have done 
> that first cut on trunk but TermScorer is working on BlockReader that do not 
> expose positions while the one in this branch does. I started adding a new 
> Positions class which users can pull from a scorer, to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorer#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API and others simply 
> return null instead. 
> To show that the API really works and our BulkPostings work fine too with 
> positions I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some 

RE: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure

2011-07-11 Thread Steven A Rowe
This compilation failure is down to @Override annotations - I've committed the 
fix (removing the annotations):

[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/FSTSynonymFilterFactory.java:57:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/FSTSynonymFilterFactory.java:62:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/SynonymFilterFactory.java:43:
 method does not override a method from its superclass
[javac]   @Override
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/analysis/SynonymFilterFactory.java:49:
 method does not override a method from its superclass
[javac]   @Override
[javac]^


> -Original Message-
> From: Apache Jenkins Server [mailto:jenk...@builds.apache.org]
> Sent: Monday, July 11, 2011 3:43 PM
> To: dev@lucene.apache.org
> Subject: [JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure
> 
> Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9511/
> 
> No tests ran.
> 
> Build Log (for compile errors):
> [...truncated 3060 lines...]
> [javac] found   : java.util.Collection
> [javac] required: java.util.Collection
> [javac] public Collection getFileNames() throws IOException {
> [javac]   ^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/core/IndexDeletionPolicyWrappe
> r.java:211: warning: getUserData() in
> org.apache.solr.core.IndexDeletionPolicyWrapper.IndexCommitWrapper
> overrides getUserData() in org.apache.lucene.index.IndexCommit; return
> type requires unchecked conversion
> [javac] found   : java.util.Map
> [javac] required: java.util.Map
> [javac] public Map getUserData() throws IOException {
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.jav
> a:173: warning: [unchecked] unchecked call to add(java.lang.String,T) as
> a member of the raw type org.apache.solr.common.util.NamedList
> [javac] lst.add("handlerStart",handlerStart);
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.jav
> a:174: warning: [unchecked] unchecked call to add(java.lang.String,T) as
> a member of the raw type org.apache.solr.common.util.NamedList
> [javac] lst.add("requests", numRequests);
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.jav
> a:175: warning: [unchecked] unchecked call to add(java.lang.String,T) as
> a member of the raw type org.apache.solr.common.util.NamedList
> [javac] lst.add("errors", numErrors);
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.jav
> a:176: warning: [unchecked] unchecked call to add(java.lang.String,T) as
> a member of the raw type org.apache.solr.common.util.NamedList
> [javac] lst.add("timeouts", numTimeouts);
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.jav
> a:177: warning: [unchecked] unchecked call to add(java.lang.String,T) as
> a member of the raw type org.apache.solr.common.util.NamedList
> [javac] lst.add("totalTime",totalTime);
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.jav
> a:178: warning: [unchecked] unchecked call to add(java.lang.String,T) as
> a member of the raw type org.apache.solr.common.util.NamedList
> [javac] lst.add("avgTimePerRequest", (float) totalTime / (float)
> this.numRequests);
> [javac]^
> [javac] /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-
> only-
> 3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.jav
> a:179: warning: [unchecked] unchecked call to add(java.lang.String,T) as
> a member of the raw type org.apache.solr.common.util.NamedList
> [javac] lst.add("avgRequestsPerSecond", (float) numRequests*1000
> / (float)(System.currentTimeMillis()-handlerStar

[jira] [Resolved] (LUCENE-3289) FST should allow controlling how hard builder tries to share suffixes

2011-07-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3289.


Resolution: Fixed

> FST should allow controlling how hard builder tries to share suffixes
> -
>
> Key: LUCENE-3289
> URL: https://issues.apache.org/jira/browse/LUCENE-3289
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-3289.patch, LUCENE-3289.patch
>
>
> Today we have a boolean option to the FST builder telling it whether
> it should share suffixes.
> If you turn this off, building is much faster, uses much less RAM, and
> the resulting FST is a prefix trie.  But, the FST is larger than it
> needs to be.  When it's on, the builder maintains a node hash holding
> every node seen so far in the FST -- this uses up RAM and slows things
> down.
> On a dataset that Elmer (see java-user thread "Autocompletion on large
> index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
> avg 67.3 chars per title, building with suffix sharing on took 22.5
> seconds, required 1.25 GB heap, and produced 91.6 MB FST.  With suffix
> sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.
> I think we should allow this boolean to be a shade of gray instead:
> usually, how well suffixes can share is a function of how far they are
> from the end of the string, so, by adding a tunable N to only share
> when suffix length < N, we can let the caller make reasonable tradeoffs. 
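A toy measurement (not the FST builder itself) illustrates why a length-limited suffix hash can capture most of the sharing: for short suffix lengths, the number of distinct suffixes is far smaller than the number of suffix occurrences, and that gap is exactly the node redundancy the builder's hash would collapse.

```java
import java.util.HashSet;
import java.util.Set;

// Toy measurement, independent of Lucene's FST classes: for suffixes up to
// length n, compare total suffix occurrences against distinct suffixes.
// The gap is the sharing a "suffix length < N" node hash could capture
// without hashing every node in the automaton.
public class SuffixSharingDemo {

    // Returns {totalSuffixOccurrences, distinctSuffixes} for lengths 1..n.
    public static long[] suffixStats(String[] words, int n) {
        long total = 0;
        Set<String> distinct = new HashSet<String>();
        for (String w : words) {
            for (int len = 1; len <= Math.min(n, w.length()); len++) {
                total++;
                distinct.add(w.substring(w.length() - len));
            }
        }
        return new long[] { total, distinct.size() };
    }

    public static void main(String[] args) {
        String[] titles = { "walking", "talking", "baking", "making", "thing" };
        long[] stats = suffixStats(titles, 4);
        // 5 words x 4 suffix lengths = 20 occurrences, but only 5 distinct
        // suffixes (g, ng, ing, king, hing) - heavy sharing near string ends.
        System.out.println("occurrences=" + stats[0] + " distinct=" + stats[1]);
    }
}
```

With a larger N the distinct count grows toward the occurrence count, which mirrors the RAM/size tradeoff in the issue: more sharing, bigger node hash.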




[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9511 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9511/

No tests ran.

Build Log (for compile errors):
[...truncated 3060 lines...]
[javac] found   : java.util.Collection
[javac] required: java.util.Collection
[javac] public Collection getFileNames() throws IOException {
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/core/IndexDeletionPolicyWrapper.java:211:
 warning: getUserData() in 
org.apache.solr.core.IndexDeletionPolicyWrapper.IndexCommitWrapper overrides 
getUserData() in org.apache.lucene.index.IndexCommit; return type requires 
unchecked conversion
[javac] found   : java.util.Map
[javac] required: java.util.Map
[javac] public Map getUserData() throws IOException {
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:173:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("handlerStart",handlerStart);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:174:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("requests", numRequests);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:175:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("errors", numErrors);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:176:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("timeouts", numTimeouts);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:177:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("totalTime",totalTime);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:178:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgTimePerRequest", (float) totalTime / (float) 
this.numRequests);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/RequestHandlerBase.java:179:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgRequestsPerSecond", (float) numRequests*1000 / 
(float)(System.currentTimeMillis()-handlerStart));   
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/admin/CoreAdminHandler.java:213:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.util.RefCounted[]
[javac] required: 
org.apache.solr.util.RefCounted[]
[javac]   searchers = new RefCounted[sourceCores.length];
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/component/ResponseBuilder.java:291:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac]   rsp.getResponseHeader().add( "partialResults", Boolean.TRUE );
[javac]  ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/search/FunctionQParser.java:254:
 warning: [unchecked] unchecked conversion
[javac] found   : java.util.HashMap
[javac] required: java.util.Map
[javac]   int end = QueryParsing.parseLocalParams(qs, start, 
nestedLocalParams, getParams());
[javac]  ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/src/java/org/apache/solr/handler/component/FacetComponent.java:405:
 warning: [unchecked] unche

[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063488#comment-13063488
 ] 

Steven Rowe commented on SOLR-2452:
---

If there are no objections, I plan on committing the patch hacking script to 
{{dev-tools/scripts/}} later today.

> rewrite solr build system
> -
>
> Key: SOLR-2452
> URL: https://issues.apache.org/jira/browse/SOLR-2452
> Project: Solr
>  Issue Type: Task
>  Components: Build
>Reporter: Robert Muir
>Assignee: Steven Rowe
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2452-post-reshuffling.patch, 
> SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
> SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
> SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl, SOLR-2452.patch.hack.pl
>
>
> As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
> think we should rewrite the solr build system.
> It's slow, cumbersome, and messy, and makes it hard for us to improve things.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3293) Use IOContext.READONCE in VarGapTermsIndexReader to load FST

2011-07-11 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated LUCENE-3293:
--

Attachment: LUCENE-3293.patch

Also edited SegmentReader#loadLiveDocs.

> Use IOContext.READONCE in VarGapTermsIndexReader to load FST
> 
>
> Key: LUCENE-3293
> URL: https://issues.apache.org/jira/browse/LUCENE-3293
> Project: Lucene - Java
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Varun Thacker
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-3293.patch
>
>
> VarGapTermsIndexReader should pass the READONCE context down when it
> opens/reads the FST. Yet it shouldn't just replace the ctx passed in, i.e. if we 
> are merging vs. reading we want to differentiate.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063473#comment-13063473
 ] 

Michael McCandless commented on LUCENE-2308:


Great!  Tests pass and it looks like the accidental changes are gone.  I'll 
commit to the branch...

> Separately specify a field's type
> -
>
> Key: LUCENE-2308
> URL: https://issues.apache.org/jira/browse/LUCENE-2308
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
>
> Attachments: LUCENE-2308-2.patch, LUCENE-2308-3.patch, 
> LUCENE-2308-4.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, 
> LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, 
> LUCENE-2308-9.patch, LUCENE-2308.patch, LUCENE-2308.patch
>
>
> This came up from discussions on IRC.  I'm summarizing here...
> Today when you make a Field to add to a document you can set things
> index or not, stored or not, analyzed or not, details like omitTfAP,
> omitNorms, index term vectors (separately controlling
> offsets/positions), etc.
> I think we should factor these out into a new class (FieldType?).
> Then you could re-use this FieldType instance across multiple fields.
> The Field instance would still hold the actual value.
> We could then do per-field analyzers by adding a setAnalyzer on the
> FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise 
> for per-field codecs (with flex), where we now have
> PerFieldCodecWrapper).
> This would NOT be a schema!  It's just refactoring what we already
> specify today.  EG it's not serialized into the index.
> This has been discussed before, and I know Michael Busch opened a more
> ambitious (I think?) issue.  I think this is a good first baby step.  We could
> consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
> off on that for starters...




[jira] [Updated] (LUCENE-2308) Separately specify a field's type

2011-07-11 Thread Nikola Tankovic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Tankovic updated LUCENE-2308:


Attachment: LUCENE-2308-9.patch

Patch 9: a few more compilation errors fixed.

> Separately specify a field's type
> -
>
> Key: LUCENE-2308
> URL: https://issues.apache.org/jira/browse/LUCENE-2308
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
>
> Attachments: LUCENE-2308-2.patch, LUCENE-2308-3.patch, 
> LUCENE-2308-4.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, 
> LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, 
> LUCENE-2308-9.patch, LUCENE-2308.patch, LUCENE-2308.patch
>
>
> This came up from discussions on IRC.  I'm summarizing here...
> Today when you make a Field to add to a document you can set things
> index or not, stored or not, analyzed or not, details like omitTfAP,
> omitNorms, index term vectors (separately controlling
> offsets/positions), etc.
> I think we should factor these out into a new class (FieldType?).
> Then you could re-use this FieldType instance across multiple fields.
> The Field instance would still hold the actual value.
> We could then do per-field analyzers by adding a setAnalyzer on the
> FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise 
> for per-field codecs (with flex), where we now have
> PerFieldCodecWrapper).
> This would NOT be a schema!  It's just refactoring what we already
> specify today.  EG it's not serialized into the index.
> This has been discussed before, and I know Michael Busch opened a more
> ambitious (I think?) issue.  I think this is a good first baby step.  We could
> consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
> off on that for starters...




[jira] [Updated] (SOLR-1085) SolrJ client java does not support moreLikeThis querys and results

2011-07-11 Thread Juan Grande (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Grande updated SOLR-1085:
--

Attachment: SOLR-1085.patch

Attached is a patch that adds MoreLikeThis support to SolrJ. The patch is for 
the trunk, but if necessary I can backport it to the 3x branch.

> SolrJ client java does not support moreLikeThis querys and results
> --
>
> Key: SOLR-1085
> URL: https://issues.apache.org/jira/browse/SOLR-1085
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - java
> Environment: SolrJ java client
>Reporter: Maurice Jumelet
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-1085.patch, solrj-java-morelikethis.patch
>
>
> Although Solr supports MoreLikeThis queries (see 
> http://wiki.apache.org/solr/MoreLikeThis), this type of query is currently 
> not supported by the Solr Java client.




[jira] [Commented] (LUCENE-2048) Omit positions but keep termFreq

2011-07-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063455#comment-13063455
 ] 

Robert Muir commented on LUCENE-2048:
-

I created a throwaway branch, branches/omitp, to hopefully sucker Mike into 
helping me with some random test failures (pulsing is always involved!).

In general the pulsing cutover was tricky for me.


> Omit positions but keep termFreq
> 
>
> Key: LUCENE-2048
> URL: https://issues.apache.org/jira/browse/LUCENE-2048
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.1
>Reporter: Andrzej Bialecki 
>Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2048.patch
>
>
> It would be useful to have an option to discard positional information but 
> still keep the term frequency; currently setOmitTermFreqAndPositions 
> discards both. Even though position-dependent queries wouldn't work in such 
> a case, all other queries would still work fine and we would get the right 
> scoring.




[jira] [Resolved] (LUCENE-3280) Add new bit set impl for caching filters

2011-07-11 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-3280.


Resolution: Fixed

> Add new bit set impl for caching filters
> 
>
> Key: LUCENE-3280
> URL: https://issues.apache.org/jira/browse/LUCENE-3280
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-3280.patch, LUCENE-3280.patch
>
>
> I think OpenBitSet is trying to satisfy too many audiences, and it's
> confusing/error-prone as a result.  It has int/long variants of many
> methods.  Some methods require in-bound access, others don't; of those
> others, some methods auto-grow the bits, some don't.  OpenBitSet
> doesn't always know its numBits.
> I'd like to factor out a more "focused" bit set impl whose primary
> target usage is a cached Lucene Filter, ie a bit set indexed by docID
> (int, not long) whose size is known and fixed up front (backed by
> final long[]) and is always accessed in-bounds.




[jira] [Created] (SOLR-2648) improve interaction of synonymsfilterfactory with analysis chain

2011-07-11 Thread Robert Muir (JIRA)
improve interaction of synonymsfilterfactory with analysis chain


 Key: SOLR-2648
 URL: https://issues.apache.org/jira/browse/SOLR-2648
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.4, 4.0
Reporter: Robert Muir


Spinoff of LUCENE-3233 (there is a TODO here), this was also mentioned by Otis 
on the mailing list: 
http://www.lucidimagination.com/search/document/8e91f858314562e/automatic_synonyms_for_multiple_variations_of_a_word#76c3d09f95f7a58f

As of LUCENE-3233, the builder for the synonyms structure uses an Analyzer 
behind the scenes to actually tokenize the synonyms in your synonyms file.
Currently the Solr factory uses a WhitespaceTokenizer, unless you supply the 
tokenizerchain parameter, which lets you specify a tokenizer.

If there were some way to instead specify a chain to this factory (e.g. 
charfilters, tokenizer, tokenfilters such as stemmers) versus just a 
tokenizer factory, it would be a lot more flexible (e.g. it would stem your 
synonyms for you), and would solve this use case.

Personally I think it would be most ideal if this just automatically worked, 
e.g. if you have a chain of A, B, SynonymsFilter, C, D, then in my opinion the 
synonyms should be analyzed with an analysis chain of A, B. This way the 
injected synonyms are processed as if they were in the token stream to begin 
with.
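The idea of analyzing synonym-file entries with the sub-chain that precedes the synonym filter can be sketched with a toy chain (the filter functions below are hypothetical stand-ins, not Solr factories):

```python
def lowercase(tokens):
    return [t.lower() for t in tokens]

def strip_plural_s(tokens):
    # Crude stand-in for a stemming token filter.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

def analyze(text, chain):
    """Run whitespace-split tokens through a list of token filters."""
    tokens = text.split()
    for token_filter in chain:
        tokens = token_filter(tokens)
    return tokens

# A and B: the filters that sit before the SynonymsFilter in the chain.
prefix_chain = [lowercase, strip_plural_s]

# Analyzing the synonyms-file entry with that same prefix chain means the
# entry matches tokens as they actually arrive at the synonym filter.
assert analyze("Dogs", prefix_chain) == analyze("dog", prefix_chain) == ["dog"]
```

With a plain whitespace tokenizer instead, the entry "Dogs" would never match the stemmed, lowercased tokens flowing through the chain.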

Note: there are some limitations on what the chain can do, e.g. you can't put 
WDF before synonyms or other things that muck with positions, and you can't 
have a synonym that analyzes to nothing at all. But the parser checks for all 
these conditions and throws a syntax error, so it would be clear to the user 
that they put the SynonymsFilter in the "wrong place" in their chain.





Re: [Lucene.Net] Incubator Status Page

2011-07-11 Thread digy digy
On Sun, Jul 10, 2011 at 6:24 PM, Stefan Bodewig  wrote:

> Hi all,
>
>  contains quite a
> few blanks that I think we could easily fill.  I intend to either add
> some N/A or real dates where I can during the coming week.
>
> On the IP issues part (copyright and distribution rights) I trust the
> Lucene PMC has been taking care of this before Lucene.NET headed back to
> the Incubator and after that all contributions have come either directly
> by people with a CLA on file or as patches via JIRA where the "ASF may
> use this" checkbox has been checked - is this correct?
>

Absolutely.


>
> For the project specific tasks I'd ask all of you to fill in whatever
> you feel like adding.  All Lucene.NET committers should be able to
> modify the status page.
>
> Stefan
>

DIGY


[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063424#comment-13063424
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

Thanks Steven!

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (respectively the 
> dataimport.properties file) is not writable by Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and then fail if it can't write 
> its timestamp to the dataimport.properties file.




[jira] [Updated] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated SOLR-2452:
--

Attachment: SOLR-2452.patch.hack.pl

This version of the patch hacking script is fixed so that all paths are 
modified, instead of just the ones on 'Index:' lines.

> rewrite solr build system
> -
>
> Key: SOLR-2452
> URL: https://issues.apache.org/jira/browse/SOLR-2452
> Project: Solr
>  Issue Type: Task
>  Components: Build
>Reporter: Robert Muir
>Assignee: Steven Rowe
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2452-post-reshuffling.patch, 
> SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
> SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
> SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl, SOLR-2452.patch.hack.pl
>
>
> As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
> think we should rewrite the solr build system.
> It's slow, cumbersome, and messy, and makes it hard for us to improve things.




[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063419#comment-13063419
 ] 

Steven Rowe commented on SOLR-2452:
---

Thanks Yonik - I'll fix it


> rewrite solr build system
> -
>
> Key: SOLR-2452
> URL: https://issues.apache.org/jira/browse/SOLR-2452
> Project: Solr
>  Issue Type: Task
>  Components: Build
>Reporter: Robert Muir
>Assignee: Steven Rowe
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2452-post-reshuffling.patch, 
> SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
> SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
> SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl
>
>
> As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
> think we should rewrite the solr build system.
> It's slow, cumbersome, and messy, and makes it hard for us to improve things.




[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063418#comment-13063418
 ] 

Yonik Seeley commented on SOLR-2452:


The script produced output like this:

{code}
Index: solr/core/src/java/org/apache/solr/core/SolrCore.java
===
--- solr/src/java/org/apache/solr/core/SolrCore.java(revision 
80231429dc9c7680375a0a21b1886e59b194)
+++ solr/src/java/org/apache/solr/core/SolrCore.java(revision )
{code}

Notice that "core" wasn't substituted on the lines starting with --- and +++.

Trying to use the resulting patch file, I got:
{code}
/opt/code/lusolr$ patch -p0 < tt.patch
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|Index: solr/core/src/java/org/apache/solr/core/SolrCore.java
|===
|--- solr/src/java/org/apache/solr/core/SolrCore.java   (revision 
80231429dc9c7680375a0a21b1886e59b194)
|+++ solr/src/java/org/apache/solr/core/SolrCore.java   (revision )
--
{code}
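The failure above comes down to rewriting the old-layout path on all three header lines that `patch` cares about: the `Index:` line and the `---`/`+++` lines. A minimal sketch of that rewrite (the prefix map below is a simplified assumption; the real SOLR-2452.patch.hack.pl covers the full set of reshuffled directories):

```python
# Hypothetical old-layout -> new-layout prefix map for illustration only.
PATH_MAP = {"solr/src/": "solr/core/src/"}

def fix_patch_paths(patch_text):
    """Rewrite old-layout paths on 'Index:', '---', and '+++' lines alike,
    since `patch` reads the target file name from the ---/+++ headers."""
    out = []
    for line in patch_text.splitlines():
        if line.startswith(("Index: ", "--- ", "+++ ")):
            for old, new in PATH_MAP.items():
                line = line.replace(old, new, 1)
        out.append(line)
    return "\n".join(out)
```

Fixing only the `Index:` line produces exactly the "can't find file to patch" symptom shown above, because `patch -p0` resolves the file from the `---`/`+++` headers.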

> rewrite solr build system
> -
>
> Key: SOLR-2452
> URL: https://issues.apache.org/jira/browse/SOLR-2452
> Project: Solr
>  Issue Type: Task
>  Components: Build
>Reporter: Robert Muir
>Assignee: Steven Rowe
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2452-post-reshuffling.patch, 
> SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
> SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
> SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl
>
>
> As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
> think we should rewrite the solr build system.
> It's slow, cumbersome, and messy, and makes it hard for us to improve things.




[jira] [Updated] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated SOLR-2452:
--

Attachment: SOLR-2452.patch.hack.pl

This script, given a patch created with 'svn diff' against trunk pre-SOLR-2452, 
will write a new patch on the standard output stream with paths fixed up to 
reflect the post-SOLR-2452 structure.

Usage: {{perl SOLR-2452.patch.hack.pl < old.patch > new.patch}}

I've tested it on a couple of patches I had lying around, and it seems to work.

Yonik, could you test it on the original patch you said you manually hacked?


> rewrite solr build system
> -
>
> Key: SOLR-2452
> URL: https://issues.apache.org/jira/browse/SOLR-2452
> Project: Solr
>  Issue Type: Task
>  Components: Build
>Reporter: Robert Muir
>Assignee: Steven Rowe
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2452-post-reshuffling.patch, 
> SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
> SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
> SOLR-2452.dir.reshuffle.sh, SOLR-2452.patch.hack.pl
>
>
> As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
> think we should rewrite the solr build system.
> It's slow, cumbersome, and messy, and makes it hard for us to improve things.




heads up: reindex trunk indexes

2011-07-11 Thread Robert Muir
I just committed https://issues.apache.org/jira/browse/LUCENE-3233,
which includes improvements that change the format of the terms index.

You should reindex.

-- 
lucidimagination.com




[jira] [Commented] (SOLR-2452) rewrite solr build system

2011-07-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063394#comment-13063394
 ] 

Yonik Seeley commented on SOLR-2452:


bq. What's the right thing to do here in terms of a patch against the old file 
structure? Is it reasonable to check out fresh code, hack the patch file to 
reflect the new paths and apply it to the new structure or must I re-edit the 
source?

That's what I did.

bq. And is SVN merge smart enough to deal when merging from trunk to 3x when 3x 
hasn't been changed, or is it better to just wait on it all until the back-port 
is done?

Apply the changes in 3x however you can (i.e. patch, etc) and then use "svn 
merge --record-only".  http://wiki.apache.org/lucene-java/SvnMerge

> rewrite solr build system
> -
>
> Key: SOLR-2452
> URL: https://issues.apache.org/jira/browse/SOLR-2452
> Project: Solr
>  Issue Type: Task
>  Components: Build
>Reporter: Robert Muir
>Assignee: Steven Rowe
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2452-post-reshuffling.patch, 
> SOLR-2452-post-reshuffling.patch, SOLR-2452-post-reshuffling.patch, 
> SOLR-2452.diffSource.py.patch.zip, SOLR-2452.dir.reshuffle.sh, 
> SOLR-2452.dir.reshuffle.sh
>
>
> As discussed some in SOLR-2002 (but that issue is long and hard to follow), I 
> think we should rewrite the solr build system.
> It's slow, cumbersome, and messy, and makes it hard for us to improve things.




[jira] [Resolved] (SOLR-2615) Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE level

2011-07-11 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-2615.


Resolution: Fixed

> Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE 
> level
> ---
>
> Key: SOLR-2615
> URL: https://issues.apache.org/jira/browse/SOLR-2615
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: David Smiley
>Assignee: Yonik Seeley
>Priority: Minor
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2615_LogUpdateProcessor_debug_logging.patch, 
> SOLR-2615_LogUpdateProcessor_debug_logging.patch
>
>
> It would be great if the LogUpdateProcessor logged each command (add, delete, 
> ...) at debug ("Fine") level. Presently it only logs a summary of 8 commands 
> and it does so at the very end.
> The attached patch implements this.
> * I moved the LogUpdateProcessor ahead of RunUpdateProcessor so that the 
> debug level log happens before Solr does anything with it. It should not 
> affect the ordering of the existing summary log which happens at finish(). 
> * I changed UpdateRequestProcessor's static log variable to be an instance 
> variable that uses the current class name. I think this makes much more sense 
> since I want to be able to alter logging levels for a specific processor 
> without doing it for all of them. This change did require me to tweak the 
> factory's detection of the log level which avoids creating the 
> LogUpdateProcessor.
> * There was an NPE bug in AddUpdateCommand.getPrintableId() in the event 
> there is no schema unique field. I fixed that.
> You may notice I use SLF4J's nifty log.debug("message blah {} blah", var) 
> syntax, which is both performant and concise: there's no point in guarding 
> the debug message with isDebugEnabled(), since debug() will internally check 
> this anyway, and there is no string concatenation if debug isn't enabled.
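The same guard-free pattern exists in Python's stdlib logging, shown here as a rough analogue of the SLF4J `{}` style (not the Solr code itself): arguments are passed separately, and the message is only formatted if the level check passes.

```python
import io
import logging

log = logging.getLogger("demo")
log.setLevel(logging.INFO)                 # DEBUG disabled
buf = io.StringIO()
log.addHandler(logging.StreamHandler(buf))

doc_id = "doc42"
# No isDebugEnabled()-style guard and no string concatenation: formatting
# of the %s placeholder only happens if the record is actually emitted.
log.debug("adding %s", doc_id)             # dropped before formatting
log.info("committed %s", doc_id)           # formatted and written

print(buf.getvalue().strip())              # -> committed doc42
```

This is why the guard is redundant: the cost of a suppressed call is just the level check, in SLF4J and here alike.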




[jira] [Commented] (SOLR-2615) Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE level

2011-07-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063382#comment-13063382
 ] 

Yonik Seeley commented on SOLR-2615:


Thanks for the tips, David. Hopefully when SOLR-2616 is committed we can fix 
this oddity.
I've committed this patch to trunk and will backport.

> Have LogUpdateProcessor log each command (add, delete, ...) at debug/FINE 
> level
> ---
>
> Key: SOLR-2615
> URL: https://issues.apache.org/jira/browse/SOLR-2615
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: David Smiley
>Assignee: Yonik Seeley
>Priority: Minor
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2615_LogUpdateProcessor_debug_logging.patch, 
> SOLR-2615_LogUpdateProcessor_debug_logging.patch
>
>
> It would be great if the LogUpdateProcessor logged each command (add, delete, 
> ...) at debug ("Fine") level. Presently it only logs a summary of 8 commands 
> and it does so at the very end.
> The attached patch implements this.
> * I moved the LogUpdateProcessor ahead of RunUpdateProcessor so that the 
> debug level log happens before Solr does anything with it. It should not 
> affect the ordering of the existing summary log which happens at finish(). 
> * I changed UpdateRequestProcessor's static log variable to be an instance 
> variable that uses the current class name. I think this makes much more sense 
> since I want to be able to alter logging levels for a specific processor 
> without doing it for all of them. This change did require me to tweak the 
> factory's detection of the log level which avoids creating the 
> LogUpdateProcessor.
> * There was an NPE bug in AddUpdateCommand.getPrintableId() in the event 
> there is no schema unique field. I fixed that.
> You may notice I use SLF4J's nifty log.debug("message blah {} blah", var) 
> syntax, which is both performant and concise: there's no point in guarding 
> the debug message with isDebugEnabled(), since debug() will internally check 
> this anyway, and there is no string concatenation if debug isn't enabled.




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063379#comment-13063379
 ] 

Mike Sokolov commented on LUCENE-2878:
--

bq. if(foo==null) is a problem as long as foo is not final so maybe we should 
try to make them final and make the pos collector part of the scorer setup 
(just a thought), we could do that using a ScorerContext for instance.

Yes, agreed.  I just wanted to implement something simple first.  I think we 
can fix the setup problem separately from the actual collection/reporting of 
intervals. We can eventually undo the changes to DisjunctionSumScorer, and get 
rid of those extra if()s.  Also, as you say, the other tests can be made final 
if we do them during setup.

bq. While I talk about this I start realizing that it might even be easier than 
this if we walk the PositionIterator tree rather than the scorer tree and 
collect the position iterators from there.

Did you look at SubTermPositionIterator (I think that's what I called it) and 
getTermPositions() yet?  They are supposed to be providing pretty much just 
that capability.

bq. Oh man this is all kind of tricky ey

I tore my hair out all weekend and lost sleep!  But I think it actually is 
where we want it now, aside from the registration piece.

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries, the one which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do and at the end 
> of the day they are duplicating a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while 
> now, so I started working on this using the bulk postings API. I would have 
> done this first cut on trunk, but TermScorer there works on a BlockReader 
> that does not expose positions, while the one in this branch does. I started 
> adding a new 
> Positions class which users can pull from a scorer, to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorere#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API and other simply 
> return null instead. 
> To show that the API really works and our BulkPostings work fine too with 
> positions I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise and now :) all works with 
> positions, while Payloads for bulk reading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go one with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)
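
The on-demand positions idea above can be sketched with stdlib-only stand-ins. All names below are hypothetical simplifications; the actual patch hangs this off ScorerContext#needsPositions and TermScorer, and returns a richer positions enum than a plain iterator.

```java
import java.util.Arrays;
import java.util.Iterator;

// Simplified stand-in for the proposed API: a scorer only materializes a
// positions enum when the consumer declared up front that it needs one, so
// queries that score without proximity pay nothing extra.
class SimpleTermScorer {
    private final int[] positions;
    private final boolean needsPositions; // analogous to ScorerContext#needsPositions

    SimpleTermScorer(boolean needsPositions, int... positions) {
        this.needsPositions = needsPositions;
        this.positions = positions;
    }

    // Returns null when positions were not requested, mirroring the patch's
    // behavior for scorers that do not implement the positions API.
    Iterator<Integer> positions() {
        if (!needsPositions) {
            return null;
        }
        return Arrays.stream(positions).iterator();
    }
}

public class PositionsSketch {
    public static void main(String[] args) {
        SimpleTermScorer scoringOnly = new SimpleTermScorer(false, 3, 9);
        SimpleTermScorer proximity = new SimpleTermScorer(true, 3, 9);
        System.out.println(scoringOnly.positions() == null); // prints true
        System.out.println(proximity.positions().next());    // prints 3
    }
}
```

The design point is that the "needs positions" decision happens at scorer-creation time, not per hit, which is why a null return for non-implementing scorers is tolerable.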

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-07-11 Thread James Dyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063372#comment-13063372
 ] 

James Dyer commented on SOLR-2382:
--

Noble,

Are you still able to work with me on this issue?  Is there anything else you 
are waiting for from me?  The patch I submitted on June 24 passes parameters 
via the Context object, as you requested.  Also, I previously separated 
"BerkleyBackedCache" out into a separate issue (SOLR-2613) so we won't run 
into licensing issues here.  Let me know what else you think we need to do.  
Thanks.

> DIH Cache Improvements
> --
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - DataImportHandler
>Reporter: James Dyer
>Priority: Minor
> Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
> - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>- NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to generics usage.
>- NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>- SolrWriter (refactored)
>- DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of 
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a 
> disk-backed cache) once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented 
> destroy() was LineEntityProcessor, so this is not a very invasive change.
> General Notes:
> We are near completion
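
As a rough illustration of the pluggable-cache shape described in the implementation notes above: the interface and class names below are hypothetical; the real patch defines DIHCache, SortedMapBackedCache, and BerkleyBackedCache with a richer API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a pluggable DIH cache: entity rows keyed by a join
// key, replayable via an iterator so a later pass (or a parent entity) can
// consume them without re-querying the source.
interface EntityCache {
    void add(Object key, Map<String, Object> row);
    Iterator<Map<String, Object>> lookup(Object key);
    void destroy(); // one-time cleanup, per the new entity.destroy() semantics
}

// In-memory implementation, analogous in spirit to SortedMapBackedCache; a
// disk-backed implementation would satisfy the same contract.
class MapBackedEntityCache implements EntityCache {
    private final Map<Object, List<Map<String, Object>>> data = new HashMap<>();

    public void add(Object key, Map<String, Object> row) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    public Iterator<Map<String, Object>> lookup(Object key) {
        return data.getOrDefault(key, List.of()).iterator();
    }

    public void destroy() {
        data.clear();
    }
}

public class CacheSketch {
    public static void main(String[] args) {
        EntityCache cache = new MapBackedEntityCache();
        cache.add(1, Map.of("childField", "a"));
        cache.add(1, Map.of("childField", "b"));
        int n = 0;
        for (Iterator<Map<String, Object>> it = cache.lookup(1); it.hasNext(); it.next()) {
            n++;
        }
        System.out.println(n); // prints 2
    }
}
```

Because the entity processor only depends on the interface, swapping the in-memory map for a disk-backed store is a configuration change ("cacheImpl" parameter) rather than a new entity processor.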

[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063369#comment-13063369
 ] 

Steven Rowe commented on SOLR-2551:
---

Committed the patch to run DIH tests sequentially:
- r1145194: trunk
- r1145196: branch_3x

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.
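
The requested pre-flight check could amount to probing writability before any rows are fetched. A minimal sketch, not the committed patch; the method name and exception message are placeholders.

```java
import java.io.File;
import java.io.IOException;

// Sketch of a pre-import writability check for dataimport.properties.
// Failing fast here avoids running an import for days only to lose the
// timestamp write at the very end.
public class PersistFileCheck {
    static void checkWritable(File confDir) throws IOException {
        File props = new File(confDir, "dataimport.properties");
        // Either an existing read-only file, or a conf dir we cannot create
        // the file in, should abort the import before it starts.
        boolean ok = props.exists() ? props.canWrite() : confDir.canWrite();
        if (!ok) {
            throw new IOException("dataimport.properties is not writable: " + props);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        checkWritable(tmp); // no exception on a writable directory
        System.out.println("writable");
    }
}
```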




[jira] [Issue Comment Edited] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063366#comment-13063366
 ] 

Steven Rowe edited comment on SOLR-2551 at 7/11/11 2:39 PM:


I'll switch the DIH tests to run sequentially.  The benchmark module does this 
by setting the {{tests.threadspercpu}} property to zero.

Here's the patch:

{noformat}
Index: solr/contrib/dataimporthandler/build.xml
===
--- solr/contrib/dataimporthandler/build.xml(revision 1145189)
+++ solr/contrib/dataimporthandler/build.xml(working copy)
@@ -23,6 +23,9 @@
 Data Import Handler
   
 
+  
+  
+
   
 
 
{noformat}

Committing shortly.

  was (Author: steve_rowe):
I'll switch the DIH tests to run sequentially.  The benchmark module does 
this by setting the {{tests.threadspercpu}} property to zero.

Here's the patch:

{{
Index: solr/contrib/dataimporthandler/build.xml
===
--- solr/contrib/dataimporthandler/build.xml(revision 1145189)
+++ solr/contrib/dataimporthandler/build.xml(working copy)
@@ -23,6 +23,9 @@
 Data Import Handler
   
 
+  
+  
+
   
 
 
}}

Committing shortly.
  
> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063366#comment-13063366
 ] 

Steven Rowe commented on SOLR-2551:
---

I'll switch the DIH tests to run sequentially.  The benchmark module does this 
by setting the {{tests.threadspercpu}} property to zero.

Here's the patch:

{{
Index: solr/contrib/dataimporthandler/build.xml
===
--- solr/contrib/dataimporthandler/build.xml(revision 1145189)
+++ solr/contrib/dataimporthandler/build.xml(working copy)
@@ -23,6 +23,9 @@
 Data Import Handler
   
 
+  
+  
+
   
 
 
}}

Committing shortly.

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-07-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063362#comment-13063362
 ] 

Simon Willnauer commented on LUCENE-2878:
-

{quote} We want to highlight positions that explain why the document matches 
the query. Not all terms that match the term queries will count - some of them 
should be "filtered out" by near-conditions; ie in a PhraseQuery, matching 
terms not in the phrase should not be highlighted. I think if I just register a 
callback with the sub-scorers (scoring terms), I would see all the terms, 
right? {quote}

This is why I think we should add a dedicated collector API (i.e. not part of 
Collector; maybe an interface?). The current API gives you a "view" for each 
match, meaning that once you advance the iterator you get the positions for 
the "current" positional match. I think the caller should also drive the 
collection of intermediate positions / intervals. The big challenge here is to 
collect the positions you are interested in efficiently. I agree that the 
if (foo == null) is a problem as long as foo is not final, so maybe we should 
try to make the fields final and make the position collector part of the 
scorer setup (just a thought); we could do that using a ScorerContext, for 
instance.

{quote}
To make further progress, I think we need to resolve the position API. The 
testMultipleDocumentsOr test case illustrates the problem with the approach I 
was trying: walking the PositionIterator tree when collecting documents. 
Something like the PositionCollector API could work, but I think we still need 
to solve the problem Mike M alluded to back at the beginning:
{quote} 
Agreed, we should work on the API. I looked at your patch and some changes 
appear unnecessary IMO; e.g. the problems in testMultipleDocumentsOr are not 
actually a problem if we sketch this out properly. As I said above, if the 
collector is part of the initialization we can simply pass it to the leaves 
or intermediate scorers and collect safely even if scorers are advanced, 
since during document collection the view should be stable, right? 
So the bottom line here is that we need an API that is capable of collecting 
fine-grained parts of the scorer tree. The only way I see to do this is 1. 
have a subscribe / register method and 2. do this subscription during scorer 
creation. Once we have this we can implement very simple collect methods that 
only collect positions for the current match, as in a near query: while the 
current matching document is collected, all contributing TermScorers have 
their position interval ready for collection. The collect method can then be 
called from the consumer instead of in the loop; this way we only get the 
positions we need, since we know the document we are collecting. 

bq. The core problem solved here is how to report positions that are not 
consumed during scoring, and also those that are,
This can be solved by my comment above, no?

{quote} The interesting case is PositionFilterScorer, which filters its child 
Scorers. I added PositionIntervalIterator.getTermPositions() to enable this; 
this walks the tree of position iterators and returns a snapshot of their 
current state (as another iterator) so the consumer can retrieve all the term 
positions as filtered by intermediate iterators without advancing them.
{quote}
This would work the same way, right? We register during setup, something like 
{code}void PositionCollector#registerScorer(Scorer){code} and then we can 
decide whether we need that scorer, or rather its positions, for collection. 
The entire iteration should only be driven by the top-level consumer; if you 
advance an intermediate iterator you might break some higher-level algorithms 
like conjunction / disjunction, though. So let's drive this further: say we 
have all the collectors we are interested in; when should we collect 
positions? I think the top-level consumer should 1. advance the positions and 
2. call collect on the scorers we are interested in.
While I talk about this I start realizing that it might be even easier than 
this if we walk the PositionIterator tree rather than the scorer tree and 
collect the position iterators from there. This is already possible with the 
subs() call, right? What we essentially need is a method that returns the 
current interval for each of the iterators. It still might be needed to have a 
collect method on the iterator so that something like Conjunctions can call 
collect on the subs if needed.

Oh man this is all kind of tricky ey :)

bq. There are a few (11) failing tests with this branch+patch (ran lucene tests 
only), but they seem unrelated (TestFlushByRamOrCountsPolicy has 5, eg) I am 
ignoring?

I don't see anything failing... can you attach a file with the failures?


> Allow Scorer to expose positions and payloads aka. nuke spans 
> -

[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063355#comment-13063355
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

Yes, let's disable this test for now. I don't think it is even worth testing. 
I guess I just had too much time that day :)

Another option could be to run the DIH tests sequentially.

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063354#comment-13063354
 ] 

Chris Male commented on SOLR-2551:
--

or alternatively, we could make the DIH tests run sequentially, so we don't hit 
this problem.

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063352#comment-13063352
 ] 

Chris Male commented on SOLR-2551:
--

Okay, so, given that basically every test writes to this file, what are our 
options? 

To me it seems that since the file is getting written to (whether we rely on 
the contents or not), this could get in the way of another test. So perhaps we 
need to pull the checkWritablePersistFile method out for a while and re-assess 
how to achieve the same functionality in a way the tests can handle? 

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Resolved] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-3233.
-

   Resolution: Fixed
Fix Version/s: 4.0
   3.4

> HuperDuperSynonymsFilter™
> -
>
> Key: LUCENE-3233
> URL: https://issues.apache.org/jira/browse/LUCENE-3233
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonyms filter uses a lot of RAM and CPU, especially at build 
> time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and 
> outputs.
> And we should be more efficient with the tokenStream API, e.g. using 
> save/restoreState instead of cloneAttributes().




[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063346#comment-13063346
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

bq. Doesn't every test of delta functionality write to the 
dataimport.properties file?

Yes, it does, but I don't think any of our tests rely on the contents of the 
properties file.

Ironically, the fact that the tests failed is proof that this feature works :)

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0

2011-07-11 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063342#comment-13063342
 ] 

Martijn van Groningen commented on SOLR-2564:
-

I've fixed the issue and added a test that triggered the exception.
Fixed in trunk in revision 1145173
Fixed in 3x branch in revision 1145176

> Integrating grouping module into Solr 4.0
> -
>
> Key: SOLR-2564
> URL: https://issues.apache.org/jira/browse/SOLR-2564
> Project: Solr
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Assignee: Martijn van Groningen
>Priority: Blocker
> Fix For: 4.0
>
> Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch
>
>
> Since work on the grouping module is going well, I think it is time to wire 
> this up in Solr.
> Besides the current grouping features Solr provides, Solr will then also 
> support second-pass caching and total count based on groups.




[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063339#comment-13063339
 ] 

Chris Male commented on SOLR-2551:
--

Amazingly, I came to the exact same conclusion at the same time.  Doesn't every 
test of delta functionality write to the dataimport.properties file?

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf (respectively the dataimport.properties) 
> file is not writable for solr. It would be great if that were detected on 
> starting a dataimport job. 
> Currently and import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-2644:


Fix Version/s: 4.0
   3.4
 Assignee: Shalin Shekhar Mangar

> DIH handler - when using threads=2 the default logging is set too high
> --
>
> Key: SOLR-2644
> URL: https://issues.apache.org/jira/browse/SOLR-2644
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.3
>Reporter: Bill Bell
>Assignee: Shalin Shekhar Mangar
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2644.patch
>
>
> When the threads parameter is set in the DIH handler, every add is logged at 
> INFO level.
> The only current workaround is to set the following in log4j.properties:
> log4j.rootCategory=INFO, logfile
> log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
> log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL
> These 2 log messages need to be changed to a level above INFO.




[jira] [Updated] (SOLR-2644) DIH handler - when using threads=2 the default logging is set too high

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-2644:


Attachment: SOLR-2644.patch

This was probably added for debugging. Attached patch to remove the extra 
logging.

I'll commit shortly.

> DIH handler - when using threads=2 the default logging is set too high
> --
>
> Key: SOLR-2644
> URL: https://issues.apache.org/jira/browse/SOLR-2644
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - DataImportHandler
>Affects Versions: 3.3
>Reporter: Bill Bell
> Fix For: 3.4, 4.0
>
> Attachments: SOLR-2644.patch
>
>
> When the threads parameter is set in the DIH handler, every add is logged at 
> INFO level.
> The only current workaround is to set the following in log4j.properties:
> log4j.rootCategory=INFO, logfile
> log4j.logger.org.apache.solr.handler.dataimport.DocBuilder=FATAL
> log4j.logger.org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper=FATAL
> These 2 log messages need to be changed to a level above INFO.




[jira] [Commented] (SOLR-2551) Check dataimport.properties for write access before starting import

2011-07-11 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063335#comment-13063335
 ] 

Shalin Shekhar Mangar commented on SOLR-2551:
-

The tests run in parallel, so for the brief time that dataimport.properties 
is set to read-only, other tests can fail. There is no way to change the 
location of the properties file, so I don't see a way to fix this. Should we 
ignore this particular test?

> Check dataimport.properties for write access before starting import
> ---
>
> Key: SOLR-2551
> URL: https://issues.apache.org/jira/browse/SOLR-2551
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - DataImportHandler
>Affects Versions: 1.4.1, 3.1
>Reporter: C S
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: 3.3, 4.0
>
> Attachments: SOLR-2551.patch
>
>
> A common mistake is that the /conf directory (or, specifically, the 
> dataimport.properties file) is not writable for Solr. It would be great if 
> that were detected on starting a dataimport job. 
> Currently an import might grind away for days and fail if it can't write its 
> timestamp to the dataimport.properties file.




[jira] [Resolved] (SOLR-2647) DOMUtilTestBase should be abstract

2011-07-11 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male resolved SOLR-2647.
--

   Resolution: Fixed
Fix Version/s: 4.0
 Assignee: Chris Male

Committed revision 1145154.

> DOMUtilTestBase should be abstract
> --
>
> Key: SOLR-2647
> URL: https://issues.apache.org/jira/browse/SOLR-2647
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Male
>Assignee: Chris Male
>Priority: Trivial
> Fix For: 4.0
>
> Attachments: SOLR-2647.patch
>
>
> It serves as a base for other test classes that use the DOM.




[jira] [Commented] (LUCENE-3295) BitVector never skips fully populated bytes when writing ClearedDgaps

2011-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063312#comment-13063312
 ] 

Michael McCandless commented on LUCENE-3295:


Thank you for catching that something was amiss in the first place ;)  That's 
the hardest part.

> BitVector never skips fully populated bytes when writing ClearedDgaps
> -
>
> Key: LUCENE-3295
> URL: https://issues.apache.org/jira/browse/LUCENE-3295
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-3295.patch, LUCENE-3295.patch
>
>
> When writing cleared DGaps in BitVector we compare a byte against 0xFF (255), 
> yet the byte is cast to an int (-1), so the comparison can never 
> succeed. We should either mask the byte with 0xFF before comparing, or compare 
> against -1.
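
A minimal standalone sketch of the sign-extension pitfall described above 
(illustrative only, not the BitVector code itself):

```java
public class ByteMaskDemo {
    // Broken check: b is sign-extended to the int -1, while 0xFF is the
    // int 255, so this is never true even for a fully populated byte.
    static boolean isFullByteBroken(byte b) {
        return b == 0xFF;
    }

    // Fixed check: mask to the low 8 bits before comparing.
    static boolean isFullByteFixed(byte b) {
        return (b & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte full = (byte) 0xFF; // all 8 bits set
        System.out.println(isFullByteBroken(full)); // false: the bug
        System.out.println(isFullByteFixed(full));  // true
        System.out.println(full == -1);             // true: the other fix
    }
}
```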



[jira] [Commented] (SOLR-2647) DOMUtilTestBase should be abstract

2011-07-11 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063303#comment-13063303
 ] 

Steven Rowe commented on SOLR-2647:
---

+1

This is my mistake - thanks for fixing!

> DOMUtilTestBase should be abstract
> --
>
> Key: SOLR-2647
> URL: https://issues.apache.org/jira/browse/SOLR-2647
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Male
>Priority: Trivial
> Attachments: SOLR-2647.patch
>
>
> It serves as a base class for other test classes that use the DOM.



[jira] [Assigned] (LUCENE-3233) HuperDuperSynonymsFilter™

2011-07-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-3233:
---

Assignee: Robert Muir

> HuperDuperSynonymsFilter™
> -
>
> Key: LUCENE-3233
> URL: https://issues.apache.org/jira/browse/LUCENE-3233
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, 
> LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonyms filter uses a lot of RAM and CPU, especially at build 
> time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and 
> outputs.
> And we should be more efficient with the TokenStream API, e.g. using 
> save/restoreState instead of cloneAttributes().



[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9496 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9496/

2 tests failed.
REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_FullImport

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_FullImport(TestSqlEntityProcessor2.java:66)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was: 

000id:1standard202.2


request was:start=0&q=id:1&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)


REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_FullImport

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.add1document(TestSqlEntityProcessorDelta3.java:83)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_FullImport(TestSqlEntityProcessorDelta3.java:92)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was: 

010*:* OR add1documentstandard202.2


request 
was:start=0&q=*:*+OR+add1document&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)




Build Log (for compile errors):
[...truncated 12155 lines...]






[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x

2011-07-11 Thread Yuriy Akopov (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063284#comment-13063284
 ] 

Yuriy Akopov commented on SOLR-2524:


I suppose I'm late with these questions, but could you please confirm whether 
the following is correct:

1) The functionality from this patch was included in Solr 3.3, so there is no 
need to apply it to any version >= 3.3.

2) This patch (like the collapsing functionality in 3.3) doesn't allow 
facet counts to be calculated after collapsing. Faceting is still possible for 
collapsed results, but the numbers returned for facets are always calculated 
before the results are collapsed.

3) In order to calculate facets after collapsing, LUCENE-3097 must be applied 
to Solr 3.3.

Thanks.

> Adding grouping to Solr 3x
> --
>
> Key: SOLR-2524
> URL: https://issues.apache.org/jira/browse/SOLR-2524
> Project: Solr
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
>Assignee: Martijn van Groningen
> Fix For: 3.3
>
> Attachments: SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch, 
> SOLR-2524.patch, SOLR-2524.patch, SOLR-2524.patch
>
>
> Grouping was recently added to Lucene 3x. See LUCENE-1421 for more 
> information.
> I think it would be nice if we exposed this functionality to the Solr 
> users who are bound to a 3.x version.
> The grouping feature added to Lucene is currently a subset of the 
> functionality that Solr 4.0-trunk offers. Mainly, it doesn't support grouping 
> by function / query.
> The work involved in getting the grouping contrib to work on Solr 3x is 
> acceptable. I have it more or less running here. It supports the response 
> format and request parameters (except group.query and group.func) described 
> on the FieldCollapse page on the Solr wiki.
> I think it would be great if this were included in the Solr 3.2 release. Many 
> people are using grouping as a patch now and this would help them a lot. Any 
> thoughts?




[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0

2011-07-11 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063272#comment-13063272
 ] 

Martijn van Groningen commented on SOLR-2564:
-

Hi Matteo, I can also confirm the bug; it only happens when group.main=true. I 
also think this error occurs on the 3x code base. I'll provide a fix for this 
issue soon.

> Integrating grouping module into Solr 4.0
> -
>
> Key: SOLR-2564
> URL: https://issues.apache.org/jira/browse/SOLR-2564
> Project: Solr
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Assignee: Martijn van Groningen
>Priority: Blocker
> Fix For: 4.0
>
> Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch
>
>
> Since work on the grouping module is going well, I think it is time to wire 
> it up in Solr.
> Besides the current grouping features Solr provides, Solr will then also 
> support second-pass caching and total count based on groups.




[jira] [Updated] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3296:


Attachment: LUCENE-3296.patch

This patch also adds the Version argument to the MultiPassIndexSplitter ctor.

I am going to commit this and backport it to 3.x.

> Enable passing a config into PKIndexSplitter
> 
>
> Key: LUCENE-3296
> URL: https://issues.apache.org/jira/browse/LUCENE-3296
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 3.3, 4.0
>Reporter: Jason Rutherglen
>Assignee: Simon Willnauer
>Priority: Trivial
> Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch, 
> LUCENE-3296.patch
>
>
> I need to be able to pass the IndexWriterConfig into the IW used by 
> PKIndexSplitter.




[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063261#comment-13063261
 ] 

Uwe Schindler commented on LUCENE-3296:
---

bq. We should not randomly use the VERSION_CURRENT but rather be consistent 
when we use version.

I agree, but when you backport to 3.4, you have to keep backwards 
compatibility, so only deprecate the ctor there.

IndexUpgrader only uses LUCENE_CURRENT when you invoke it from the command line; 
in all other cases it is a required arg, so we are consistent here.

We should also look at the other IndexSplitters in this package!

> Enable passing a config into PKIndexSplitter
> 
>
> Key: LUCENE-3296
> URL: https://issues.apache.org/jira/browse/LUCENE-3296
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 3.3, 4.0
>Reporter: Jason Rutherglen
>Assignee: Simon Willnauer
>Priority: Trivial
> Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch
>
>
> I need to be able to pass the IndexWriterConfig into the IW used by 
> PKIndexSplitter.




[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063259#comment-13063259
 ] 

Uwe Schindler commented on LUCENE-3296:
---

bq. not entirely true, we use the index splitter in 3.x and if you upgrade from 
3.1 to 3.2 you get a new merge policy by default which doesn't merge in order. I 
think it's a problem that this version is not in 3.x yet so let's fix it 
properly and backport.

PKIndexSplitter is new in 3.3, so you could never have used it with older 
versions...

> Enable passing a config into PKIndexSplitter
> 
>
> Key: LUCENE-3296
> URL: https://issues.apache.org/jira/browse/LUCENE-3296
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 3.3, 4.0
>Reporter: Jason Rutherglen
>Assignee: Simon Willnauer
>Priority: Trivial
> Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch
>
>
> I need to be able to pass the IndexWriterConfig into the IW used by 
> PKIndexSplitter.




[jira] [Issue Comment Edited] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063249#comment-13063249
 ] 

Simon Willnauer edited comment on LUCENE-3296 at 7/11/11 9:54 AM:
--

here is a new patch. I added a second IWC since we cannot reuse IWC instances 
across IWs due to SetOnce restrictions. I also moved the VERSION_CURRENT out and 
made it a ctor argument. We should not randomly use VERSION_CURRENT but 
rather be consistent in how we use Version.

bq. Simon: The Version.LUCENE_CURRENT is not important here, for easier 
porting, the version should be LUCENE_CURRENT (and it was before Jason's 
patch). Else we will have to always upgrade it with every new release. The same 
applies to the IndexUpdater class in core, it also uses LUCENE_CURRENT when you 
not pass in anything (as the version is completely useless for simple merge 
operations - like here).

not entirely true, we use the index splitter in 3.x and if you upgrade from 3.1 
to 3.2 you get a new merge policy by default which doesn't merge in order. I 
think it's a problem that this version is not in 3.x yet so let's fix it 
properly and backport.

Simon

  was (Author: simonw):
here is a new patch. I added a second IWC since we can not reuse IWC 
instances across IW due to SetOnce restrictions. I also moved out the 
VERSION_CURRENT and made it a ctor argument. We should not randomly use the 
VERSION_CURRENT but rather be consistent when we use version.
  
> Enable passing a config into PKIndexSplitter
> 
>
> Key: LUCENE-3296
> URL: https://issues.apache.org/jira/browse/LUCENE-3296
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 3.3, 4.0
>Reporter: Jason Rutherglen
>Assignee: Simon Willnauer
>Priority: Trivial
> Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch
>
>
> I need to be able to pass the IndexWriterConfig into the IW used by 
> PKIndexSplitter.




[jira] [Updated] (LUCENE-3296) Enable passing a config into PKIndexSplitter

2011-07-11 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3296:


Attachment: LUCENE-3296.patch

here is a new patch. I added a second IWC since we cannot reuse IWC instances 
across IWs due to SetOnce restrictions. I also moved the VERSION_CURRENT out and 
made it a ctor argument. We should not randomly use VERSION_CURRENT but 
rather be consistent in how we use Version.
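
The SetOnce restriction mentioned above can be sketched in isolation. This is an 
illustrative stand-in, not the actual org.apache.lucene.util.SetOnce source: a 
holder whose value can be assigned exactly once, which is why an IWC that has 
already been handed to one IW cannot be reused for a second one.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SetOnceDemo<T> {
    private final AtomicBoolean alreadySet = new AtomicBoolean(false);
    private volatile T value;

    // Assigns the value; a second call throws instead of overwriting.
    public void set(T v) {
        if (!alreadySet.compareAndSet(false, true)) {
            throw new IllegalStateException("The object cannot be set twice!");
        }
        value = v;
    }

    public T get() { return value; }

    public static void main(String[] args) {
        SetOnceDemo<String> holder = new SetOnceDemo<>();
        holder.set("first IndexWriter");
        try {
            holder.set("second IndexWriter"); // analogous to reusing an IWC
        } catch (IllegalStateException expected) {
            System.out.println("second set() rejected");
        }
    }
}
```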

> Enable passing a config into PKIndexSplitter
> 
>
> Key: LUCENE-3296
> URL: https://issues.apache.org/jira/browse/LUCENE-3296
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 3.3, 4.0
>Reporter: Jason Rutherglen
>Assignee: Simon Willnauer
>Priority: Trivial
> Attachments: LUCENE-3296.patch, LUCENE-3296.patch, LUCENE-3296.patch
>
>
> I need to be able to pass the IndexWriterConfig into the IW used by 
> PKIndexSplitter.




[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9494 - Still Failing

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9494/

1 tests failed.
REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_DeltaImport

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessor2.testCompositePk_DeltaImport(TestSqlEntityProcessor2.java:129)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='1']
xml response was: 

010id:5standard202.2


request was:start=0&q=id:5&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)




Build Log (for compile errors):
[...truncated 12034 lines...]






[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9505 - Failure

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9505/

All tests passed

Build Log (for compile errors):
[...truncated 17572 lines...]






[jira] [Commented] (LUCENE-3295) BitVector never skips fully populated bytes when writing ClearedDgaps

2011-07-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063245#comment-13063245
 ] 

Simon Willnauer commented on LUCENE-3295:
-

Thanks for resolving this, Mike.

> BitVector never skips fully populated bytes when writing ClearedDgaps
> -
>
> Key: LUCENE-3295
> URL: https://issues.apache.org/jira/browse/LUCENE-3295
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-3295.patch, LUCENE-3295.patch
>
>
> When writing cleared DGaps in BitVector we compare a byte against 0xFF (255), 
> yet the byte is cast to an int (-1), so the comparison can never 
> succeed. We should either mask the byte with 0xFF before comparing, or compare 
> against -1.




[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0

2011-07-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063244#comment-13063244
 ] 

Simon Willnauer commented on SOLR-2564:
---

bq. Since Lucene is now also Java 6, we can just change the code in 
AbstractFirstPassGroupingCollector, and TermFirstPassGroupingCollectorJava6 
in grouping.java is no longer needed, right?

Yes, that's right.

> Integrating grouping module into Solr 4.0
> -
>
> Key: SOLR-2564
> URL: https://issues.apache.org/jira/browse/SOLR-2564
> Project: Solr
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Assignee: Martijn van Groningen
>Priority: Blocker
> Fix For: 4.0
>
> Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch
>
>
> Since work on the grouping module is going well, I think it is time to wire 
> it up in Solr.
> Besides the current grouping features Solr provides, Solr will then also 
> support second-pass caching and total count based on groups.




[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9493 - Still Failing

2011-07-11 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9493/

1 tests failed.
REGRESSION:  
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_DeltaImport_delete

Error Message:
Exception during query

Stack Trace:
java.lang.RuntimeException: Exception during query
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:405)
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:372)
at 
org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta3.testCompositePk_DeltaImport_delete(TestSqlEntityProcessorDelta3.java:111)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1464)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1382)
Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//*[@numFound='0']
xml response was: 

010*:* OR testCompositePk_DeltaImport_deletestandard202.2d122011-07-11T09:22:55.278Z


request 
was:start=0&q=*:*+OR+testCompositePk_DeltaImport_delete&qt=standard&rows=20&version=2.2
at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:398)




Build Log (for compile errors):
[...truncated 12030 lines...]






[jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0

2011-07-11 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063231#comment-13063231
 ] 

Martijn van Groningen commented on SOLR-2564:
-

Since Lucene is now also Java 6, we can just change the code in 
AbstractFirstPassGroupingCollector, and TermFirstPassGroupingCollectorJava6 
in grouping.java is no longer needed, right?

> Integrating grouping module into Solr 4.0
> -
>
> Key: SOLR-2564
> URL: https://issues.apache.org/jira/browse/SOLR-2564
> Project: Solr
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Assignee: Martijn van Groningen
>Priority: Blocker
> Fix For: 4.0
>
> Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, SOLR-2564.patch, 
> SOLR-2564.patch, SOLR-2564_performance_loss_fix.patch
>
>
> Since work on the grouping module is going well, I think it is time to wire 
> it up in Solr.
> Besides the current grouping features Solr provides, Solr will then also 
> support second-pass caching and total count based on groups.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


