Re: Distributed Indexing

2011-02-02 Thread Upayavira


On Tue, 01 Feb 2011 19:52 -0800, "Lance Norskog" 
wrote:
> Another use case is that N indexers operate independently, all pulling
> data from the  same database. Each has a separate query to get the
> documents in its policy.

But surely in this case you are externalising the policy, and Solr
doesn't need to know about it? I.e. your indexers are deciding what goes
in which shard, not Solr?

Upayavira
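For illustration, the deterministic routing this thread circles around - a document's uniqueKey hashed to a fixed shard, whether the policy lives in Solr or in external indexers - can be sketched as below. The class and method names are hypothetical, not Solr API:

```java
import java.nio.charset.StandardCharsets;

public class ShardRouter {
    // Deterministic routing: the same uniqueKey always lands on the same
    // shard, so you can work out at any point in time where a document lives.
    static int shardFor(String uniqueKey, int numShards) {
        int h = 0;
        for (byte b : uniqueKey.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff); // simple polynomial hash over the key bytes
        }
        return Math.floorMod(h, numShards); // floorMod keeps the result non-negative
    }
}
```

Hashing a per-document random value (doc name + current time) would distribute documents just as evenly, but it loses this determinism: you could no longer recompute which shard holds a given document.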

> On Tue, Feb 1, 2011 at 12:38 PM, Upayavira  wrote:
> >
> > On Tue, 01 Feb 2011 19:04 +, "Alex Cowell"  wrote:
> >
> > I noticed there is a comment in the
> > org.apache.solr.servlet.DirectSolrConnection class which reads, "//Find a
> > way to turn List into File/SolrDocument". Did anyone find a
> > way to do this?
> >
> > Turns out that comment was left over from some experimenting one of our team
> > was doing. But I suppose the question still stands.
> >
> > Addressing the "retrieve the unique ID from the document" issue, does it
> > matter if the unique ID you do the hash on is the actual uniqueKey of the
> > document? Surely as long as you generate some value unique for each document
> > to index (for example, the name of the doc/stream + the current time) it
> > would still distribute the documents as we expect?
> >
> >
> > Well, one requirement I've heard for this is for it to be deterministic.
> > That is, a document will always go to the same shard, and you can work out
> > at any point in time where a particular document is.
> >
> > Once you've parsed the document to a SolrInputDocument, surely you can get
> > the ID/uniqueKey out? I'll do some digging tomorrow AM.
> >
> > Upayavira
> >
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source





[jira] Updated: (LUCENE-2609) Generate jar containing test classes.

2011-02-02 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2609:
---

Attachment: LUCENE-2609.patch

Previous patch was hard to review IMO. This patch includes the changes to the 
.xml files only. If you run the 'svn mv' commands I pasted before and then 
apply this patch, all should work.

> Generate jar containing test classes.
> -
>
> Key: LUCENE-2609
> URL: https://issues.apache.org/jira/browse/LUCENE-2609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Drew Farris
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
> LUCENE-2609.patch, LUCENE-2609.patch
>
>
> The test classes are useful for writing unit tests for code external to the 
> Lucene project. It would be helpful to build a jar of these classes and 
> publish them as a maven dependency.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-02 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989565#comment-12989565
 ] 

Doron Cohen commented on LUCENE-1540:
-

Thanks for reviewing Shai!

bq. Maybe instead of moving the unzip method to LuceneTestCase, you can put it 
as a static method in _TestUtil? Also, _TestUtil already has a rmDir method, I 
think we should use it? I would also do the same for fullTempDir.
Good point, will do.

bq. The method pathType(File f) in TrecDocParser – maybe instead of walking up 
the path elements you can obtain its full absolute path (which is a String) and 
then do indexOf() checks for the 4 types? It will simplify matters IMO.
Not sure yet whether I prefer this file-separator-sensitive approach; I'll take 
a look.

bq. Typo in TDP: unmodofied --> unmodified.
Will fix.

bq. Maybe we can use String.replaceAll() which takes a regex? This is not 
critical ...
Right, much simpler this way, will do!

bq. Does stripTags strips off tags of the HTML content? Or is it used for the 
TREC types other than GOV2?
It strips any tags, but it is only used by the parsers that do not use the HTML 
parser; that is, the Gov2 parser does not use it.
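A tag stripper of the kind described for the non-HTML parsers can be as simple as a regex pass. This is an illustrative sketch, not the actual benchmark code:

```java
public class TagStripper {
    // Replaces anything that looks like an SGML/HTML tag with a space,
    // then collapses the leftover whitespace.
    static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```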

bq. In TrecContentSource, can you replace TrecParserByPath.pathType to 
TrecDocParser.pathType?
Good catch, this is part of older code, will do.

bq. Also, do we still need TrecParserByPath? I don't see that it's used.
Yes we do, this is an important addition in this patch - it allows you to index 
TREC docs of several formats. It is used, but dynamically, through the 
algorithm in TrecContentSourceTest.testTrecFeedDirAllTypes(). So removing it 
will not break compilation, but it will fail the tests.

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
> trecdocs.zip
>
>
> The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after  tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.







Re: GSoC

2011-02-02 Thread David Nemeskey
Hi guys,

Mark, Robert, Simon: thanks for the support! I really hope we can work 
together this summer (and before that, obviously).

According to http://www.google-
melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's 
still some time until the application period. So let me use this week to finish 
my PhD research plan, and get back to you next week.

I am not really familiar with how the program works, i.e. how detailed the 
application description should be, when mentorship is decided, etc. so I guess 
we will have a lot to talk about. :)

(Actually, should we move this discussion private?)

David

> Hi David, honestly this sounds fantastic.
> 
> It would be great to have someone to work with us on this issue!
> 
> To date, progress is pretty slow-going (minor improvements, cleanups,
> additional stats here and there)... but we really need all the help we
> can get, especially from people who have a really good understanding
> of the various models.
> 
> In case you are interested, here are some references to discussions
> about adding more flexibility (with some prototypes etc):
> http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps
> _towards_making_lucene_s_scoring_more_flexible
> https://issues.apache.org/jira/browse/LUCENE-2392

> On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
> 
>  wrote:
> > Hi all,
> > 
> > I have already sent this mail to Simon Willnauer, and he suggested me to
> > post it here for discussion.
> > 
> > I am David Nemeskey, a PhD student at the Eotvos Lorand University,
> > Budapest, Hungary. I am doing an IR-related research, and we have
> > considered using Lucene as our search engine. We were quite satisfied
> > with the speed and ease of use. However, we would like to experiment
> > with different ranking algorithms, and this is where problems arise.
> > Lucene only supports the VSM, and unfortunately the ranking architecture
> > seems to be tailored specifically to its needs.
> > 
> > I would be very much interested in revamping the ranking component as a
> > GSoC project. The following modifications should be doable in the
> > allocated time frame:
> > - a new ranking class hierarchy, which is generic enough to allow easy
> > implementation of new weighting schemes (at least bag-of-words ones),
> > - addition of state-of-the-art ranking methods, such as Okapi BM25,
> > proximity and DFR models,
> > - configuration for ranking selection, with the old method as default.
> > 
> > I believe all users of Lucene would profit from such a project. It would
> > provide the scientific community with an even more useful research aid,
> > while regular users could benefit from superior ranking results.
> > 
> > Please let me know your opinion about this proposal.




Re: GSoC

2011-02-02 Thread Simon Willnauer
Hey David,

I saw that you added a tiny line to the GSoC Lucene wiki - thanks for that.

On Wed, Feb 2, 2011 at 10:10 AM, David Nemeskey
 wrote:
> Hi guys,
>
> Mark, Robert, Simon: thanks for the support! I really hope we can work
> together this summer (and before that, obviously).
Same here!
>
> According to http://www.google-
> melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's
> still some time until the application period. So let me use this week to 
> finish
> my PhD research plan, and get back to you next week.
>
> I am not really familiar with how the program works, i.e. how detailed the
> application description should be, when mentorship is decided, etc. so I guess
> we will have a lot to talk about. :)

so from a bird's-eye view it works like this:

1. Write up a short proposal of what your idea is about.
2. Make it public! And publish an implementation plan - how you would
want to realize your proposal. If you don't follow it 100% in the
actual implementation, don't worry. It's just meant to give us an idea
that you know what you are doing and where you want to go - something
like a rough one-page (A4) design doc.
3. Give other people the chance to apply for the same suggestion (this
is how it works, though).
4. Let the ASF / us assign one or more possible mentors to it.
5. Let us apply for a slot in GSoC (those are limited for organizations).
6. Get accepted.
7. Rock it!

>
> (Actually, should we move this discussion private?)
no - we usually do everything in public, except for discussions within
the PMC that are meant to be private for legal reasons or similar
things. Let's stick to the mailing list for all communication unless
you have something that should clearly not be public. This also gives
other contributors a chance to help and get interested in your work!!

simon
>
> David
>
>> Hi David, honestly this sounds fantastic.
>>
>> It would be great to have someone to work with us on this issue!
>>
>> To date, progress is pretty slow-going (minor improvements, cleanups,
>> additional stats here and there)... but we really need all the help we
>> can get, especially from people who have a really good understanding
>> of the various models.
>>
>> In case you are interested, here are some references to discussions
>> about adding more flexibility (with some prototypes etc):
>> http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps
>> _towards_making_lucene_s_scoring_more_flexible
>> https://issues.apache.org/jira/browse/LUCENE-2392
>
>> On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
>>
>>  wrote:
>> > Hi all,
>> >
>> > I have already sent this mail to Simon Willnauer, and he suggested me to
>> > post it here for discussion.
>> >
>> > I am David Nemeskey, a PhD student at the Eotvos Lorand University,
>> > Budapest, Hungary. I am doing an IR-related research, and we have
>> > considered using Lucene as our search engine. We were quite satisfied
>> > with the speed and ease of use. However, we would like to experiment
>> > with different ranking algorithms, and this is where problems arise.
>> > Lucene only supports the VSM, and unfortunately the ranking architecture
>> > seems to be tailored specifically to its needs.
>> >
>> > I would be very much interested in revamping the ranking component as a
>> > GSoC project. The following modifications should be doable in the
>> > allocated time frame:
>> > - a new ranking class hierarchy, which is generic enough to allow easy
>> > implementation of new weighting schemes (at least bag-of-words ones),
>> > - addition of state-of-the-art ranking methods, such as Okapi BM25,
>> > proximity and DFR models,
>> > - configuration for ranking selection, with the old method as default.
>> >
>> > I believe all users of Lucene would profit from such a project. It would
>> > provide the scientific community with an even more useful research aid,
>> > while regular users could benefit from superior ranking results.
>> >
>> > Please let me know your opinion about this proposal.
>
>
>




[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-02 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989571#comment-12989571
 ] 

Bill Bell commented on SOLR-2155:
-

It might make sense to create a geohashdist() since geodist() only works on 
MultiValueSources not GeoHash. 



> Geospatial search using geohash prefixes
> 
>
> Key: SOLR-2155
> URL: https://issues.apache.org/jira/browse/SOLR-2155
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
> Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (i.e. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist in this purpose.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see if it matches.  I created an abstraction GeoShape 
> extended by subclasses named PointDistance... and CartesianBox to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.
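To make the prefix scheme above concrete, here is a minimal standard base-32 geohash encoder (a sketch, not the patch's code): each extra character halves the lat/lon ranges five more times, so a hash is always a prefix of every longer hash for the same point - which is what lets the filter seek by prefix.

```java
public class GeoHashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a point by bisecting lon/lat ranges alternately; every 5 bits
    // become one base-32 character. Longer hashes refine, never change,
    // earlier characters - the prefix property the filter relies on.
    static String encode(double lat, double lon, int precision) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // even bits split longitude, odd split latitude
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            double[] range = evenBit ? lonRange : latRange;
            double v = evenBit ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            ch <<= 1;
            if (v >= mid) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
            evenBit = !evenBit;
            if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
        }
        return hash.toString();
    }
}
```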







[jira] Commented: (LUCENE-2475) Incorrect Bounding Box calculation results in the exclusion of valid data locations

2011-02-02 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989589#comment-12989589
 ] 

Nicolas Helleringer commented on LUCENE-2475:
-

This implementation of geo search in Lucene has been deprecated and will not be 
fixed any further, nor backported. See LUCENE-1747.

> Incorrect Bounding Box calculation results in the exclusion of valid data 
> locations
> ---
>
> Key: LUCENE-2475
> URL: https://issues.apache.org/jira/browse/LUCENE-2475
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9.1, 3.0
>Reporter: Julian Atkinson
> Attachments: BoundingBoxCalucationIssueTest.java, test.html
>
>
> I have found a scenario where some of my location data is not being returned. 
>  The calculated distance between my search origin and the data is well within 
> my search radius but the data is not being returned. 
> I have traced this down to what I think is an error when calculating the 
> boundary box which is used to determine the Shape for the 
> CartesianShapeFilter in  CartesianPolyFilterBuilder.getBoxShape()
> The boundary box calculated by LLRect.createBox() is incorrect.  The box 
> returned is a box that fits WITHIN the search circle, where the four corners 
> of the box intersect the circle line. This creates 4 regions where data 
> points are not included - these are regions that are in the circle but 
> outside the box.
> What is required is a boundary box that fully CONTAINS the search circle.  
> As a side effect you would end up with 4 regions outside of the circle but 
> inside the box.  This would potentially return data that are not real hits 
> but these can be filtered out by a more precise distance comparison.
> I will attach a test class that covers the issue with more details and a 
> proposed fix - a one liner in LLRect.java
> I would appreciate if someone could verify my findings.  All my data tests 
> pass with this fix but there is one test case in Lucene 3.0.0 that fails and 
> I can't figure out why.  TestCartesian.testAntiM().
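The fix described above - a box that contains the circle rather than one inscribed in it - amounts to extending the box by the full radius in each direction (an inscribed box only extends radius/sqrt(2)). A rough sketch under a spherical-earth assumption; the names are hypothetical, not the LLRect API:

```java
public class ContainingBox {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Returns {minLat, minLon, maxLat, maxLon} of the smallest lat/lon box
    // that fully CONTAINS the search circle. Points inside the box but
    // outside the circle must later be removed by an exact distance check.
    static double[] containingBox(double lat, double lon, double radiusKm) {
        double dLat = Math.toDegrees(radiusKm / EARTH_RADIUS_KM);
        // Longitude degrees shrink with latitude, so widen by 1/cos(lat).
        double dLon = Math.toDegrees(radiusKm
                / (EARTH_RADIUS_KM * Math.cos(Math.toRadians(lat))));
        return new double[] {lat - dLat, lon - dLon, lat + dLat, lon + dLon};
    }
}
```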







[jira] Commented: (LUCENE-1391) Token type and flags values get lost when using ShingleMatrixFilter

2011-02-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989592#comment-12989592
 ] 

Uwe Schindler commented on LUCENE-1391:
---

I wanted to look into this this week!

> Token type and flags values get lost when using ShingleMatrixFilter
> ---
>
> Key: LUCENE-1391
> URL: https://issues.apache.org/jira/browse/LUCENE-1391
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Affects Versions: 2.4, 2.9, 3.0
>Reporter: Wouter Heijke
>Assignee: Uwe Schindler
> Fix For: 3.1, 4.0
>
>
> While using the new ShingleMatrixFilter I noticed that a token's type and 
> flags get lost while using this filter. ShingleFilter does respect these 
> values like the other filters I know.







Re: [HUDSON] Lucene-3.x - Build # 267 - Failure

2011-02-02 Thread Michael McCandless
Can we prefix the build failure emails sent by this?  Something like
[CLOVER MAY FAIL]...?

I wonder if we can somehow scp this Clover config file, instead of
relying on Hudson to move it (thus sometimes corrupting it)...

Mike

On Tue, Feb 1, 2011 at 7:06 PM, Apache Hudson Server
 wrote:
> Build: https://hudson.apache.org/hudson/job/Lucene-3.x/267/
>
> All tests passed
>
> Build Log (for compile errors):
> [...truncated 19105 lines...]
>
>
>
>
>




RE: [HUDSON] Lucene-3.x - Build # 267 - Failure

2011-02-02 Thread Uwe Schindler
NO.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Wednesday, February 02, 2011 12:42 PM
> To: dev@lucene.apache.org
> Subject: Re: [HUDSON] Lucene-3.x - Build # 267 - Failure
> 
> Can we prefix the build failure emails sent by this?  Something like
> [CLOVER MAY FAIL]...?
> 
> I wonder if we can somehow scp this Clover config file, instead of relying
> on Hudson to move it (thus sometimes corrupting it)...
> 
> Mike
> 
> On Tue, Feb 1, 2011 at 7:06 PM, Apache Hudson Server
>  wrote:
> > Build: https://hudson.apache.org/hudson/job/Lucene-3.x/267/
> >
> > All tests passed
> >
> > Build Log (for compile errors):
> > [...truncated 19105 lines...]
> >
> >
> >
> >
> >
> 






[jira] Updated: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-02-02 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2831:


Attachment: LUCENE-2831-recursion.patch

here is a slightly different patch that makes the dangerous ctor private and 
uses the leaf's reader as the IS reader. I also put an assert into 
getTopReaderContext to assert that nobody pulls a toplevel context from the 
schizo IS.

All tests pass with LUCENE-2751

> Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
> -
>
> Key: LUCENE-2831
> URL: https://issues.apache.org/jira/browse/LUCENE-2831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2831-nuke-SolrIndexReader.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch
>
>
> Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
> boolean, boolean) we should / could revise the API and pass in a struct that 
> has parent reader, sub reader, ord of that sub. The ord mapping plus the 
> context with its parent would make several issues way easier. See 
> LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] Commented: (SOLR-2343) Phonetic Filters should respect KeywordAttribute

2011-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989614#comment-12989614
 ] 

Robert Muir commented on SOLR-2343:
---

well we discussed before, about whether to expand keyword beyond stemmers.

Personally I prefer keeping it to stemmers only... otherwise it gets confusing
and inconsistent which filters respect it and which do not.

There are tons of things like LowerCaseFilter that would need to be changed
if we go down this path...
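The contract under discussion - a filter leaving tokens alone when KeywordAttribute is set - looks like this in a stripped-down stand-in (not the real Lucene TokenFilter API; the lowercasing here is just a placeholder for the phonetic encoding):

```java
import java.util.Locale;

public class KeywordAwareFilter {
    // Minimal stand-in for a token carrying a keyword flag.
    static final class Token {
        final String text;
        final boolean keyword;
        Token(String text, boolean keyword) { this.text = text; this.keyword = keyword; }
    }

    // Mirrors how a stemmer-style filter consults KeywordAttribute:
    // keyword-marked tokens pass through untouched.
    static String normalize(Token t) {
        if (t.keyword) return t.text;
        return t.text.toLowerCase(Locale.ROOT); // placeholder transform
    }
}
```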

> Phonetic Filters should respect KeywordAttribute
> 
>
> Key: SOLR-2343
> URL: https://issues.apache.org/jira/browse/SOLR-2343
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Priority: Minor
> Attachments: SOLR-2343-phonetic-keyword.patch
>
>
> The Phonetic filters should not transform tokens with the keyword attribute 
> set







[jira] Commented: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989641#comment-12989641
 ] 

Michael McCandless commented on LUCENE-2831:


Patch looks good!

But, can you change the new assert to say something like "cannot access top 
context when IS is a leaf reader"?  Right now if you trip that 
assert it's not clear what's gone wrong...

And I think either remove the jdoc on that method, or clarify that it's only 
sugar when IS is not based on a leaf reader?

I really don't like this schizo IS though ;)  Sometimes it's the top reader, 
sometimes it's a leaf reader.  But, the schizo IS's should never "escape" out of 
the top IS that has an ES.

> Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
> -
>
> Key: LUCENE-2831
> URL: https://issues.apache.org/jira/browse/LUCENE-2831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2831-nuke-SolrIndexReader.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch
>
>
> Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
> boolean, boolean) we should / could revise the API and pass in a struct that 
> has parent reader, sub reader, ord of that sub. The ord mapping plus the 
> context with its parent would make several issues way easier. See 
> LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.







[jira] Commented: (SOLR-1057) PathTokenizerFactory

2011-02-02 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989643#comment-12989643
 ] 

Koji Sekiguchi commented on SOLR-1057:
--

Can you use MappingCharFilter to normalize backslash to slash?

> PathTokenizerFactory
> 
>
> Key: SOLR-1057
> URL: https://issues.apache.org/jira/browse/SOLR-1057
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1057-PathTokenizerFactory.patch, 
> SOLR-1057-PathTokenizerFactory.patch, SOLR-1057.patch
>
>
> This is a Tokenizer that splits the input string into a series of paths.  For 
> example:
> {panel}
>  /aaa/bbb/ccc
> {panel}
> becomes:
> {panel}
>  /aaa/
>  /aaa/bbb/
>  /aaa/bbb/ccc
> {panel}
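The expansion in the description can be reproduced with a few lines of string handling (a sketch, not the factory's actual code). Koji's MappingCharFilter suggestion would normalize '\' to '/' before a tokenizer like this ever sees the input:

```java
import java.util.ArrayList;
import java.util.List;

public class PathPrefixes {
    // Emits every ancestor path prefix of an absolute path, then the
    // full path itself: /aaa/bbb/ccc -> /aaa/, /aaa/bbb/, /aaa/bbb/ccc
    static List<String> tokenize(String path) {
        List<String> tokens = new ArrayList<>();
        int from = 1; // skip the leading '/'
        int idx;
        while ((idx = path.indexOf('/', from)) != -1) {
            tokens.add(path.substring(0, idx + 1)); // include trailing '/'
            from = idx + 1;
        }
        tokens.add(path);
        return tokens;
    }
}
```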







[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

2011-02-02 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2886:


Attachment: LUCENE-2886_simple64.patch

I pulled out the simple64 implementation here and adapted it to the 
bulkpostings branch.

Thanks for uploading this code Renaud, it's great and easy to work 
with. I hope to get some more of the codecs you wrote into the branch for 
testing.

I changed a few things that helped in benchmarking:
* the decoder uses relative gets instead of absolute
* we write #longs in the block header instead of #bytes (as it's always 
long-aligned, and they are smaller numbers)

But mainly, for this one I think we should change it to be a VariableIntBlock 
codec... right now it packs 128 integers into as few longs as possible, but 
this is wasteful for two reasons: it has to write a per-block byte header, and 
it also wastes bits (e.g. in the case of a block of 128 1's).

With variableintblock, we could do this differently, e.g. read a fixed number 
of longs per-block (say 4 longs), and our block would then be variable between 
4 and 240 integers depending upon data.


> Adaptive Frame Of Reference 
> 
>
> Key: LUCENE-2886
> URL: https://issues.apache.org/jira/browse/LUCENE-2886
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Codecs
>Reporter: Renaud Delbru
> Fix For: 4.0
>
> Attachments: LUCENE-2886_simple64.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on 
> the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be 
> done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR 
> implementation, as well as the implementations of PFOR and of Simple64 
> (simple family codec working on 64bits word) that has been used in the 
> experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf







[jira] Commented: (SOLR-2343) Phonetic Filters should respect KeywordAttribute

2011-02-02 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989658#comment-12989658
 ] 

Ryan McKinley commented on SOLR-2343:
-

that makes sense -- I actually noticed it because I implemented the KStem as an 
encoder and used the PhoneticFilter class.  That does make me wonder if the 
name should be more general though:
 http://commons.apache.org/codec/apidocs/org/apache/commons/codec/Encoder.html
since this filter really just lets you replace tokens using an encoder.
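The generalization being discussed might look roughly like this (a Python sketch of the idea, not Solr code; the tuple-based token representation is an illustrative stand-in): a filter applies an arbitrary encoder to each token but leaves keyword-flagged tokens untouched, which is what this issue asks for:

```python
def encode_tokens(tokens, encoder):
    """Replace each token's text with encoder(text), skipping tokens whose
    'keyword' flag is set (mirroring Lucene's KeywordAttribute)."""
    out = []
    for text, is_keyword in tokens:
        out.append(text if is_keyword else encoder(text))
    return out

# A trivial stand-in encoder (a real one would be e.g. a phonetic encoder
# or a stemmer wrapped as an encoder):
upper = str.upper
tokens = [("solr", False), ("Lucene", True)]  # second token is keyword-protected
```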


> Phonetic Filters should respect KeywordAttribute
> 
>
> Key: SOLR-2343
> URL: https://issues.apache.org/jira/browse/SOLR-2343
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Priority: Minor
> Attachments: SOLR-2343-phonetic-keyword.patch
>
>
> The Phonetic filters should not transform tokens with the keyword attribute 
> set

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-02-02 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989662#comment-12989662
 ] 

Toke Eskildsen commented on LUCENE-2843:


I see that the VariableGapTermsIndexReader/Writer is now the default (or at 
least an experimental default) in trunk. This means that ord() and consequently 
seek() are not available. Are you, Michael, planning on adding these later on 
or are they gone for good?

If they are gone for good, it does represent a bit of a problem for me, as I 
use ord() and seek() for a memory-efficient hierarchical faceting system. Not 
having those in the default reader/writer means that most indexes "out there" 
will not support accessing terms by ordinals and that my code won't work on 
them unless they are re-built. Boo hoo for me, but not implementing the 
interface fully in the default implementation seems wrong. Or maybe the 
interface should be changed?

> Add variable-gap terms index impl.
> --
>
> Key: LUCENE-2843
> URL: https://issues.apache.org/jira/browse/LUCENE-2843
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989666#comment-12989666
 ] 

Robert Muir commented on LUCENE-2843:
-

bq. Or maybe the interface should be changed?

+1, ord is not part of the interface; it's an implementation detail specific
to only certain basic implementations and shouldn't be in TermsEnum.

I would much prefer if this were some attribute, or somehow exposed
via those implementations' TermStates... in my opinion it's really
an implementation detail of TermState, not even of terms.



-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989670#comment-12989670
 ] 

Michael McCandless commented on LUCENE-2843:


Toke, the FixedGapTermsIndexWriter/Reader supports ord, but requires more RAM 
for the terms index and may cause some queries to run slower.

Can you describe how your faceting system is using ord?


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989672#comment-12989672
 ] 

Robert Muir commented on LUCENE-2843:
-

Also, if faceting wants to exploit a codec-specific implementation detail,
it's far more interesting to evaluate things like changing VariableGapIndex's 
FST output to be a pair including max(docFreq) for the block it indexes.

Then, faceting that wants the top-10 terms by docFreq would instead walk an 
FSTEnum and only go to disk for the top-10 blocks... this would actually be 
a change in complexity order, no?



-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-02-02 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989679#comment-12989679
 ] 

Toke Eskildsen commented on LUCENE-2843:


Thank you. I will use the FixedGap-version myself, but that only works when I'm 
the one controlling the index build, right?

As for the faceting system, the principle is really simple: instead of holding 
terms (BytesRefs) in memory, I just hold their ordinals. As the terms 
themselves only need to be resolved when the final faceting result is to be 
returned, seeking for a few hundred or thousand terms by their ordinal has 
worked very well so far (no guarantees for old hardware such as spinning disks 
though).

The memory savings over holding BytesRefs in memory of course vary with term 
lengths. There are some numbers at 
https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/ if someone 
finds it interesting, and LUCENE-2369 has some measurements of the same 
principle applied to sorting.
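The scheme described above can be sketched like this (illustrative Python only; a sorted in-memory list stands in for the on-disk term dictionary, and an ordinal is simply a position in the sorted order):

```python
from collections import Counter

class ToyTermDict:
    """Stand-in for a terms index supporting ord() and seek(ord)."""
    def __init__(self, terms):
        self.terms = sorted(terms)       # terms are kept in sorted order
    def ord(self, term):
        return self.terms.index(term)    # term -> ordinal
    def seek(self, ordinal):
        return self.terms[ordinal]       # ordinal -> term ("going to disk")

def facet_counts(doc_term_ords, hits, dictionary, top_n=2):
    """Count facet values over hit docs using ints only; resolve the few
    winning ordinals back to terms at the very end."""
    counts = Counter()
    for doc in hits:
        counts.update(doc_term_ords[doc])
    return [(dictionary.seek(o), c) for o, c in counts.most_common(top_n)]
```

The counting loop never touches term bytes, so per-document memory cost is one int per term occurrence rather than one BytesRef.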


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GSoC

2011-02-02 Thread Grant Ingersoll

On Feb 2, 2011, at 4:10 AM, David Nemeskey wrote:

> Hi guys,
> 
> Mark, Robert, Simon: thanks for the support! I really hope we can work 
> together this summer (and before that, obviously).

Sounds like a great idea.  Looking forward to the proposal.

> 
> According to http://www.google-
> melange.com/document/show/gsoc_program/google/gsoc2011/timeline , there's 
> still some time until the application period. So let me use this week to 
> finish 
> my PhD research plan, and get back to you next week.
> 
> I am not really familiar with how the program works, i.e. how detailed the 
> application description should be, when mentorship is decided, etc. so I 
> guess 
> we will have a lot to talk about. :)

It's pretty competitive, especially since you are not only competing against 
others for Lucene slots, but you are competing against other ASF projects.  I 
highly recommend you, as well as interested mentors, look through Mahout's past 
GSOC projects: http://www.lucidimagination.com/search/?q=GSOC#/p:mahout and 
http://www.lucidimagination.com/search/document/2acd6fd380feec3/thoughts_on_gsoc
 and https://cwiki.apache.org/confluence/display/MAHOUT/GSOC

> 
> (Actually, should we move this discussion private?)

No, you shouldn't, and it would be to your detriment come the ranking process, 
since people won't have a track record of what you've done as it relates to 
your proposal.  The goal of GSOC is to learn how Open Source works.  Even 
though you have a mentor, that person is there to help you navigate the 
community, not to be a private tutor on technical details.  I routinely tell 
all my students that I will help them w/ personal issues (vacation, 
emergencies, etc.) but that all technical stuff must be done on list (JIRA, 
IRC, dev@, patches, etc.)

> 
> David
> 
>> Hi David, honestly this sounds fantastic.
>> 
>> It would be great to have someone to work with us on this issue!
>> 
>> To date, progress is pretty slow-going (minor improvements, cleanups,
>> additional stats here and there)... but we really need all the help we
>> can get, especially from people who have a really good understanding
>> of the various models.
>> 
>> In case you are interested, here are some references to discussions
>> about adding more flexibility (with some prototypes etc):
>> http://www.lucidimagination.com/search/document/72787e0e54f798e4/baby_steps
>> _towards_making_lucene_s_scoring_more_flexible
>> https://issues.apache.org/jira/browse/LUCENE-2392
> 
>> On Fri, Jan 28, 2011 at 11:32 AM, David Nemeskey
>> 
>>  wrote:
>>> Hi all,
>>> 
>>> I have already sent this mail to Simon Willnauer, and he suggested that I
>>> post it here for discussion.
>>> 
>>> I am David Nemeskey, a PhD student at the Eotvos Lorand University,
>>> Budapest, Hungary. I am doing IR-related research, and we have
>>> considered using Lucene as our search engine. We were quite satisfied
>>> with the speed and ease of use. However, we would like to experiment
>>> with different ranking algorithms, and this is where problems arise.
>>> Lucene only supports the VSM, and unfortunately the ranking architecture
>>> seems to be tailored specifically to its needs.
>>> 
>>> I would be very much interested in revamping the ranking component as a
>>> GSoC project. The following modifications should be doable in the
>>> allocated time frame:
>>> - a new ranking class hierarchy, which is generic enough to allow easy
>>> implementation of new weighting schemes (at least bag-of-words ones),
>>> - addition of state-of-the-art ranking methods, such as Okapi BM25,
>>> proximity and DFR models,
>>> - configuration for ranking selection, with the old method as default.
>>> 
>>> I believe all users of Lucene would profit from such a project. It would
>>> provide the scientific community with an even more useful research aid,
>>> while regular users could benefit from superior ranking results.
>>> 
>>> Please let me know your opinion about this proposal.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4419 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4419/

1 tests failed.
REGRESSION:  org.apache.solr.TestGroupingSearch.testRandomGrouping

Error Message:
mismatch: 'i'!='null' @ grouped/small_s/groups/[0]/groupValue

Stack Trace:
junit.framework.AssertionFailedError: mismatch: 'i'!='null' @ 
grouped/small_s/groups/[0]/groupValue
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1144)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1076)
at 
org.apache.solr.TestGroupingSearch.testRandomGrouping(TestGroupingSearch.java:500)




Build Log (for compile errors):
[...truncated 8550 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1057) PathTokenizerFactory

2011-02-02 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989687#comment-12989687
 ] 

Ryan McKinley commented on SOLR-1057:
-

that would work if this were a filter... but I would need to run the 
MappingCharFilter *before* the path tokenizer.

Perhaps we should change this to a Filter, and use the KeywordTokenizer to 
start?


> PathTokenizerFactory
> 
>
> Key: SOLR-1057
> URL: https://issues.apache.org/jira/browse/SOLR-1057
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1057-PathTokenizerFactory.patch, 
> SOLR-1057-PathTokenizerFactory.patch, SOLR-1057.patch
>
>
> This is a Tokenizer that splits the input string into a series of paths.  For 
> example:
> {panel}
>  /aaa/bbb/ccc
> {panel}
> becomes:
> {panel}
>  /aaa/
>  /aaa/bbb/
>  /aaa/bbb/ccc
> {panel}
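The behavior in the description above can be sketched in a few lines (Python for illustration; the actual patch is a Lucene Tokenizer/TokenizerFactory):

```python
def path_prefixes(path: str):
    """Expand /aaa/bbb/ccc into the chain of ancestor paths, keeping a
    trailing slash on every prefix except the full path itself."""
    parts = [p for p in path.split("/") if p]
    out = []
    for i in range(1, len(parts) + 1):
        prefix = "/" + "/".join(parts[:i])
        out.append(prefix if i == len(parts) else prefix + "/")
    return out
```

Indexing all prefixes like this is what makes "find every document under /aaa/bbb/" a single-term query.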

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-02-02 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989691#comment-12989691
 ] 

Toke Eskildsen commented on LUCENE-2843:


Robert, there is already OrdTermState to hold the ord, but the ordinal itself 
is only interesting if the corresponding term can be seeked from it. Upon 
further inspection I see that the method 
{code}TermsIndexReaderBase.supportsOrd(){code} is coupled logically to 
{code}seek(long ord){code} and {code}ord(){code} so support for ordinals does 
not seem like something one can expect.

As for the FSTEnum idea, I don't understand how it can work with faceting, 
where the terms to return are defined by the documents from a search. ...But 
maybe we should discuss that elsewhere.


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1057) PathTokenizerFactory

2011-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989693#comment-12989693
 ] 

Robert Muir commented on SOLR-1057:
---

I'm a little confused about the use of the tokenizer (I have no problems 
technically; it's maybe a naming issue?)

Is this intended for tokenizing file pathnames, as its name would suggest? In 
that case I think the path should have positions, e.g. /foo/bar/whatever.txt is 
foo(1), bar(1), whatever.txt(1)?

It seems instead that this one is intended for representing hierarchies, as it 
creates synonyms of /foo, /foo/bar, /foo/bar/whatever.txt... with position 
increments of zero.

I guess I'm just being picky about naming, but I think this hierarchical case 
is more specific than 'tokenizing file pathnames', and maybe a name like 
HierarchyTokenizer (this one too probably isn't the best!) would better 
represent what it does?


> PathTokenizerFactory
> 
>
> Key: SOLR-1057
> URL: https://issues.apache.org/jira/browse/SOLR-1057
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1057-PathTokenizerFactory.patch, 
> SOLR-1057-PathTokenizerFactory.patch, SOLR-1057.patch
>
>
> This is a Tokenizer that splits the input string into a series of paths.  For 
> example:
> {panel}
>  /aaa/bbb/ccc
> {panel}
> becomes:
> {panel}
>  /aaa/
>  /aaa/bbb/
>  /aaa/bbb/ccc
> {panel}

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989692#comment-12989692
 ] 

Paul Elschot commented on LUCENE-2903:
--

Just one nitpick about the codec name containing 'New'.
This will be out of date rather soon, so it may be better to simply use an 
incremental number.

> Improvement of PForDelta Codec
> --
>
> Key: LUCENE-2903
> URL: https://issues.apache.org/jira/browse/LUCENE-2903
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: hao yan
> Attachments: LUCENE_2903.patch
>
>
> There are 3 versions of PForDelta implementations in the Bulk Branch: 
> FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
> The FrameOfRef is a very basic one which is essentially a binary encoding 
> (may result in huge index size).
> The PatchedFrameOfRef is the implementation based on the original version of 
> PForDelta in the literature.
> The PatchedFrameOfRef2 is my previous implementation, which has been improved 
> this time. (The codec name is changed to NewPForDelta.)
> In particular, the changes are:
> 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
> old PForDelta did not support very large exceptions (since
> Simple16 does not support very large numbers). Now this has been fixed in 
> the new LCPForDelta.
> 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
> two PForDelta implementations in the bulk branch (FrameOfRef and 
> PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the 
> CodecProvider and PForDeltaFixedIntBlockCodec.
> 3. The performance test results are:
> 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef 
> for almost all kinds of queries, and slightly worse than BulkVInt.
> 2) My "NewPForDelta" codec results in the smallest index size among all 4 
> methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
> 3) All performance test results were achieved by running with "-server" 
> instead of "-client".
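For readers unfamiliar with the technique, the core of patched frame of reference can be sketched as follows (heavily simplified Python; the real codecs bit-pack the slots and encode the exception list compactly, e.g. with Simple16):

```python
def pfor_encode(values, width):
    """Encode most values in `width` bits; values that don't fit become
    'exceptions' stored separately and patched back in on decode."""
    limit = 1 << width
    packed, exceptions = [], []
    for i, v in enumerate(values):
        if v < limit:
            packed.append(v)
        else:
            packed.append(0)             # placeholder in the packed slot
            exceptions.append((i, v))    # patch list: (position, value)
    return packed, exceptions

def pfor_decode(packed, exceptions):
    out = list(packed)
    for i, v in exceptions:
        out[i] = v                       # apply the patches
    return out
```

The bug described above lives in the exception path: if the exception encoding cannot represent very large values, round-tripping fails precisely for the outliers the scheme exists to handle.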

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Boosting score based on a 0 to 5 star rating

2011-02-02 Thread seb835

Hi All,

I have a set of data where a user can rate each search result (a book in my
case) using 0 to 5 stars (5 stars being a highly recommended book).

I therefore want to boost the relevancy score of results that have been
given a star rating. Now, I can do this quite easily as follows with this
query:

"q=CookBook rating_value:1^0.1 rating_value:2^0.2" etc, etc...

The trouble with this query is as follows. If a book has a star rating of
anything over 0 (i.e. it actually has a rating), then it gets returned in
the search results - even if the word "CookBook" doesn't match within the
book title.

Is there any way to omit a result if there is no keyword match, even if it
has a star rating?
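One standard remedy (stated as a suggestion, untested against the poster's schema) is to mark the keyword clause as required with Lucene's `+` operator, e.g. `q=+CookBook rating_value:1^0.1 rating_value:2^0.2`, so the rating clauses only boost documents that already match. The required-vs-optional clause semantics, as a toy sketch:

```python
def score(doc_text, doc_rating, required_term, rating_boost=0.1):
    """A MUST clause gates the match; SHOULD clauses only add to the score."""
    if required_term.lower() not in doc_text.lower():
        return None  # no keyword match -> document is not returned at all
    return 1.0 + rating_boost * doc_rating
```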

Many thanks in advance,
Seb
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Boosting-score-based-on-a-0-to-5-star-rating-tp2405905p2405905.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989695#comment-12989695
 ] 

Robert Muir commented on LUCENE-2843:
-

bq. Robert, there is already OrdTermState to hold the ord, but the ordinal 
itself is only interesting if the corresponding term can be seeked from it. 

You can seek to any arbitrary TermState (even if it's not holding ord), but it 
might hold other things you don't care about.

bq. As for the FSTEnum-idea then I don't understand how it can work with 
faceting where the terms to return are defined by the documents from a search? 
...But maybe we should discuss that elsewhere.

In the general case, if you are using something like a priority queue to get 
the top-N terms (even if you are filtering by the documents from a search), 
this number would mean that once your priority queue is full, you can tell that 
an entire block of low-freq terms is not competitive to enter the PQ, without 
going to disk.
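The pruning being proposed can be sketched like this (illustrative Python; blocks annotated with their max docFreq are an assumption about the suggested FST output):

```python
import heapq

def top_terms(blocks, n):
    """blocks: list of (max_doc_freq, [(term, doc_freq), ...]).
    Keep a min-heap of the best n doc freqs; skip any block whose
    max_doc_freq can't beat the current worst entry -- those blocks
    never need to be read from disk."""
    heap, skipped = [], 0
    for max_df, terms in blocks:
        if len(heap) == n and max_df <= heap[0][0]:
            skipped += 1                 # whole block is non-competitive
            continue
        for term, df in terms:           # "going to disk" for this block
            if len(heap) < n:
                heapq.heappush(heap, (df, term))
            elif df > heap[0][0]:
                heapq.heapreplace(heap, (df, term))
    return sorted(heap, reverse=True), skipped
```

The per-block bound is what changes the cost from "visit every term" to "visit only blocks that can still compete."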



-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc

2011-02-02 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-2894:
---

Attachment: LUCENE-2894.patch

New patch that uses "-header" option.

> Use of google-code-prettify for Lucene/Solr Javadoc
> ---
>
> Key: LUCENE-2894
> URL: https://issues.apache.org/jira/browse/LUCENE-2894
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Javadocs
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2894.patch, LUCENE-2894.patch
>
>
> My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in 
> Javadoc for syntax highlighting:
> http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html
> I think we can use it for Lucene javadoc (java sample code in overview.html 
> etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our 
> life.




[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-02 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989702#comment-12989702
 ] 

David Smiley commented on SOLR-2155:


Hi Bill.  I'm at O'Reilly Strata 2011 this week and so I have limited ability 
to help you until next Monday.  My code so far is purely for filtering, not 
sorting/ranking.  That's a TODO item.  It wasn't a requirement for my 
geospatial app so far.  In the meantime, limit your use to a filter query 
using any of geofilt, bbox, or my query parser.

> Geospatial search using geohash prefixes
> 
>
> Key: SOLR-2155
> URL: https://issues.apache.org/jira/browse/SOLR-2155
> Project: Solr
>  Issue Type: Improvement
>Reporter: David Smiley
> Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (i.e. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist in this purpose.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see if it matches.  I created an abstraction GeoShape 
> extended by subclasses named PointDistance... and CartesianBox to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.
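For readers unfamiliar with the encoding, here is a from-scratch sketch (not Lucene's GeoHashUtils) of how a geohash is built by binary subdivision, which is what makes each character a strict refinement of the previous prefix and lets the filter treat prefix matching as spatial containment. Class and method names are illustrative:

```java
public class GeoHashSketch {
    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Halve the lon/lat ranges alternately (lon first), emitting one bit per
    // halving and one base-32 character per 5 bits.
    static String geohash(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder out = new StringBuilder();
        boolean lonTurn = true;
        int bits = 0, ch = 0;
        while (out.length() < precision) {
            if (lonTurn) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = ch << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = ch << 1;       latMax = mid; }
            }
            lonTurn = !lonTurn;
            if (++bits == 5) { out.append(BASE32.charAt(ch)); bits = 0; ch = 0; }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Each added character narrows the box; every hash is a prefix of the longer one.
        System.out.println(geohash(42.6, -5.6, 5));  // "ezs42"
        System.out.println(geohash(42.6, -5.6, 3));  // "ezs" -- the containing cell
    }
}
```

The 4x8 vs 8x4 grid in the description falls out of the alternation: odd characters carry 3 longitude and 2 latitude bits, even characters the reverse.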




[jira] Commented: (SOLR-1057) PathTokenizerFactory

2011-02-02 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989701#comment-12989701
 ] 

Ryan McKinley commented on SOLR-1057:
-

Maybe PathHierarchyTokenizer?

Yes, the point is to preserve the folder/path structure. 


> PathTokenizerFactory
> 
>
> Key: SOLR-1057
> URL: https://issues.apache.org/jira/browse/SOLR-1057
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1057-PathTokenizerFactory.patch, 
> SOLR-1057-PathTokenizerFactory.patch, SOLR-1057.patch
>
>
> This is a Tokenizer that splits the input string into a series of paths.  For 
> example:
> {panel}
>  /aaa/bbb/ccc
> {panel}
> becomes:
> {panel}
>  /aaa/
>  /aaa/bbb/
>  /aaa/bbb/ccc
> {panel}
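The transformation in the example above can be sketched outside Solr's analysis chain. This is an illustrative stand-in for the token stream the proposed tokenizer would emit, not the actual Tokenizer implementation, and it assumes paths with a leading '/' and no trailing '/':

```java
import java.util.ArrayList;
import java.util.List;

public class PathTokensSketch {
    // Emit one token per ancestor path: /aaa/bbb/ccc -> [/aaa/, /aaa/bbb/, /aaa/bbb/ccc]
    static List<String> tokenize(String path) {
        List<String> tokens = new ArrayList<>();
        int from = 1;  // skip the leading '/'
        int idx;
        while ((idx = path.indexOf('/', from)) != -1) {
            tokens.add(path.substring(0, idx + 1));  // keep the trailing '/'
            from = idx + 1;
        }
        tokens.add(path);  // the full path, without a trailing '/'
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/aaa/bbb/ccc"));
    }
}
```

Indexing these ancestor tokens is what preserves the folder/path structure: a query for "/aaa/bbb/" matches every document filed anywhere under that folder.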




[jira] Commented: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc

2011-02-02 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989704#comment-12989704
 ] 

Koji Sekiguchi commented on LUCENE-2894:


TODO:
- Support not only javadoc-core, but also javadoc-solrj, javadoc-all, 
javadoc-contrib. They should share a common prettify.
- Support Lucene javadocs.
- Support modules javadocs?


> Use of google-code-prettify for Lucene/Solr Javadoc
> ---
>
> Key: LUCENE-2894
> URL: https://issues.apache.org/jira/browse/LUCENE-2894
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Javadocs
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2894.patch, LUCENE-2894.patch
>
>
> My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in 
> Javadoc for syntax highlighting:
> http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html
> I think we can use it for Lucene javadoc (java sample code in overview.html 
> etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our 
> life.




[jira] Updated: (SOLR-64) strict hierarchical facets

2011-02-02 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-64:
---

Fix Version/s: (was: Next)
   3.1
 Assignee: Koji Sekiguchi

> strict hierarchical facets
> --
>
> Key: SOLR-64
> URL: https://issues.apache.org/jira/browse/SOLR-64
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Yonik Seeley
>Assignee: Koji Sekiguchi
> Fix For: 3.1
>
> Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch, 
> SOLR-64.patch
>
>
> Strict Facet Hierarchies... each tag has at most one parent (a tree).




Re: Boosting score based on a 0 to 5 star rating

2011-02-02 Thread Jens Wilmer

Hi All,

Am 02.02.2011 16:17, schrieb seb835:
> I therefore want to boost the relevancy score of results that have been
> given a star rating. Now, I can do this quite easily as follows with this
> query:
> 
> "q=CookBook rating_value:1^0.1 rating_value:2^0.2" etc, etc...
> 
> The trouble with this query is as follows. If a book has a star rating of
> anything over 0 (i.e. it actually has a rating), then it gets returned in
> the search results - even if the word "CookBook" doesn't match within the
> book title.
Have you already tried "q=+CookBook rating_value:1^0.1 ..."?


-- 
Kind regards,
Jens Wilmer
(Software Developer)






unsubscribe

2011-02-02 Thread Torsten Eberhardt
unsubscribe






[jira] Commented: (SOLR-64) strict hierarchical facets

2011-02-02 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989705#comment-12989705
 ] 

Koji Sekiguchi commented on SOLR-64:


Let's push this to 3.1 without distributed support.

> strict hierarchical facets
> --
>
> Key: SOLR-64
> URL: https://issues.apache.org/jira/browse/SOLR-64
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Yonik Seeley
>Assignee: Koji Sekiguchi
> Fix For: 3.1
>
> Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch, 
> SOLR-64.patch
>
>
> Strict Facet Hierarchies... each tag has at most one parent (a tree).




[jira] Commented: (SOLR-1057) PathTokenizerFactory

2011-02-02 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989706#comment-12989706
 ] 

Koji Sekiguchi commented on SOLR-1057:
--

bq. that would work if this were a filter... but I would need to run the 
MappingCharFilter before the path tokenizer.

CharFilters run before the Tokenizer.

bq. Maybe PathHierarchyTokenizer?

+1.

> PathTokenizerFactory
> 
>
> Key: SOLR-1057
> URL: https://issues.apache.org/jira/browse/SOLR-1057
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1057-PathTokenizerFactory.patch, 
> SOLR-1057-PathTokenizerFactory.patch, SOLR-1057.patch
>
>
> This is a Tokenizer that splits the input string into a series of paths.  For 
> example:
> {panel}
>  /aaa/bbb/ccc
> {panel}
> becomes:
> {panel}
>  /aaa/
>  /aaa/bbb/
>  /aaa/bbb/ccc
> {panel}




Re: unsubscribe

2011-02-02 Thread Simon Willnauer
Torsten,
mail to: dev-unsubscr...@lucene.apache.org
to unsubscribe!

simon

On Wed, Feb 2, 2011 at 5:43 PM, Torsten Eberhardt
 wrote:
> unsubscribe
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>




[jira] Created: (SOLR-2344) JMX not reliable - MBeans deleted

2011-02-02 Thread Matthew Sporleder (JIRA)
JMX not reliable - MBeans deleted
-

 Key: SOLR-2344
 URL: https://issues.apache.org/jira/browse/SOLR-2344
 Project: Solr
  Issue Type: Bug
 Environment: linux 64bit x86_64
java version "1.6.0_13"
Reporter: Matthew Sporleder


I am using JMX to monitor my replication status and am finding that my
MBeans are disappearing.  I turned on debugging for JMX and found that
Solr seems to be deleting the MBeans.

This appears to be a triggered event, but I am not finding any clues as to why 
this is happening.

Is this a bug?  Some trace info is below:

Here's me reading the MBean successfully:
Jan 27, 2011 5:00:02 PM ServerCommunicatorAdmin reqIncoming
FINER: Receive a new request.
Jan 27, 2011 5:00:02 PM DefaultMBeanServerInterceptor getAttribute
FINER: Attribute= indexReplicatedAt, obj=
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:00:02 PM Repository retrieve
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:00:02 PM ServerCommunicatorAdmin reqIncoming
FINER: Finish a request.


A little while later it removes the MBean from the Repository
(whatever that is) and then re-adds it:
FINER: Send create notification of object
solr/myapp-core:id=org.apache.solr.handler.component.SearchHandler,type=atlas
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered
solr/myapp-core:type=atlas,id=org.apache.solr.handler.component.SearchHandler
Jan 27, 2011 5:16:14 PM Repository contains
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository retrieve
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository remove
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor unregisterMBean
FINER: Send delete notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=/replication
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.unregistered
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName =
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository addMBean
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=/replication
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler


And after tons of messages, but still within the same second, it does:
Jan 27, 2011 5:16:14 PM Repository contains
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository retrieve
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository remove
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor unregisterMBean
FINER: Send delete notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.unregistered
solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName =
solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository addMBean
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered
solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler


And then, I don't know what this is about, but it removes the bean again:
Jan 27, 2011 5:16:15 PM Repository contains
FINER: 
name=solr/myapp-core:type=org.ap
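The register/unregister churn in the log above can be observed programmatically by subscribing to the MBean server delegate, which broadcasts the JMX.mbean.registered/unregistered notifications. A minimal self-contained sketch; the MBean and ObjectName here are illustrative stand-ins, not Solr code:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MBeanServerDelegate;
import javax.management.MBeanServerNotification;
import javax.management.ObjectName;

public class MBeanChurnWatch {
    public interface DemoMBean { int getValue(); }
    public static class Demo implements DemoMBean {
        public int getValue() { return 42; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // The delegate emits a notification for every MBean register/unregister.
        server.addNotificationListener(MBeanServerDelegate.DELEGATE_NAME,
                (notification, handback) -> {
                    MBeanServerNotification n = (MBeanServerNotification) notification;
                    System.out.println(n.getType() + " " + n.getMBeanName());
                }, null, null);

        ObjectName name = new ObjectName("solr/demo-core:type=/replication,id=Demo");
        server.registerMBean(new Demo(), name);
        server.unregisterMBean(name);           // the "delete" seen in the FINER log
        server.registerMBean(new Demo(), name); // ...immediately followed by a re-add
    }
}
```

A listener like this makes it easy to confirm whether an external monitor is simply reading during the brief unregistered window of such a delete/re-add cycle.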

[jira] Updated: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-02 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1540:


Attachment: LUCENE-1540.patch

Updated patch, plan to commit later today or tomorrow.

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
> LUCENE-1540.patch, trecdocs.zip
>
>
> The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after  tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.




[jira] Commented: (LUCENE-2609) Generate jar containing test classes.

2011-02-02 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989727#comment-12989727
 ] 

Shai Erera commented on LUCENE-2609:


I think we should compile test-framework under classes/test as well. It will 
only simplify the build.xml, and tests have to reference both test and 
test-framework in their classpath.

I forgot to modify jar-test-core, but I will do so in my next patch. I will 
rename it to jar-test-framework and remove dev-tools/testjar.

> Generate jar containing test classes.
> -
>
> Key: LUCENE-2609
> URL: https://issues.apache.org/jira/browse/LUCENE-2609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Drew Farris
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
> LUCENE-2609.patch, LUCENE-2609.patch
>
>
> The test classes are useful for writing unit tests for code external to the 
> Lucene project. It would be helpful to build a jar of these classes and 
> publish them as a maven dependency.




[jira] Assigned: (SOLR-874) Dismax parser exceptions on trailing OPERATOR

2011-02-02 Thread Erik Hatcher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Hatcher reassigned SOLR-874:
-

Assignee: Erik Hatcher

> Dismax parser exceptions on trailing OPERATOR
> -
>
> Key: SOLR-874
> URL: https://issues.apache.org/jira/browse/SOLR-874
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.3
>Reporter: Erik Hatcher
>Assignee: Erik Hatcher
> Fix For: Next
>
> Attachments: SOLR-874-1.3.patch, SOLR-874-1.4.1.patch, SOLR-874.patch
>
>
> Dismax is supposed to be immune to parse exceptions, but alas it's not:
> http://localhost:8983/solr/select?defType=dismax&qf=name&q=ipod+AND
> kaboom!
> Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse 'ipod 
> AND': Encountered "" at line 1, column 8.
> Was expecting one of:
>  ...
> "+" ...
> "-" ...
> "(" ...
> "*" ...
>  ...
>  ...
>  ...
>  ...
> "[" ...
> "{" ...
>  ...
>  ...
> "*" ...
> 
>   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:175)
>   at 
> org.apache.solr.search.DismaxQParser.parse(DisMaxQParserPlugin.java:138)
>   at org.apache.solr.search.QParser.getQuery(QParser.java:88)




[jira] Commented: (SOLR-874) Dismax parser exceptions on trailing OPERATOR

2011-02-02 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989744#comment-12989744
 ] 

Erik Hatcher commented on SOLR-874:
---

Johannes - thanks!  Test cases look thorough from a glance.  Kinda hairy stuff 
in there, so give me a few days to scratch my head and review this, but this is 
something worthwhile to finally get fixed.  

To the many other commenters on this issue: maybe we can get a few more folks 
to try this out and confirm it fixes their cases too.  



> Dismax parser exceptions on trailing OPERATOR
> -
>
> Key: SOLR-874
> URL: https://issues.apache.org/jira/browse/SOLR-874
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.3
>Reporter: Erik Hatcher
>Assignee: Erik Hatcher
> Fix For: Next
>
> Attachments: SOLR-874-1.3.patch, SOLR-874-1.4.1.patch, SOLR-874.patch
>
>
> Dismax is supposed to be immune to parse exceptions, but alas it's not:
> http://localhost:8983/solr/select?defType=dismax&qf=name&q=ipod+AND
> kaboom!
> Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse 'ipod 
> AND': Encountered "" at line 1, column 8.
> Was expecting one of:
>  ...
> "+" ...
> "-" ...
> "(" ...
> "*" ...
>  ...
>  ...
>  ...
>  ...
> "[" ...
> "{" ...
>  ...
>  ...
> "*" ...
> 
>   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:175)
>   at 
> org.apache.solr.search.DismaxQParser.parse(DisMaxQParserPlugin.java:138)
>   at org.apache.solr.search.QParser.getQuery(QParser.java:88)




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989754#comment-12989754
 ] 

hao yan commented on LUCENE-2903:
-

Hi Paul, thanks for the suggestions. I just uploaded a new patch that renames 
the codec to PatchedFrameOfRef3.

I actually have a question. The BulkVInt codec writes its compressed output as 
a chunk of bytes. However, in the PForDelta-related codecs the compressed 
results are ints, so I have to either write each int in a loop, or first 
convert the int array to a byte array and then call out.writeBytes(). Do you 
know any smarter way to write an int array to an IndexOutput?

Another thing I tried was to make PForDelta itself produce byte-wise compressed 
results. However, in my experiments it slows PForDelta down significantly. 
Also, I do not think the NIO buffer used in FrameOfRef and PatchedFrameOfRef 
helps, since it essentially amounts to converting the int array to a byte 
array first and then calling writeBytes().

Do you have any good suggestions? Thanks!
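One common way to do the int[]-to-bytes step without a per-int loop is to let java.nio bulk-copy through an IntBuffer view of a heap byte buffer. This is not an answer from the thread, just a sketch of that approach (assuming big-endian on-disk order, ByteBuffer's default; the class name is illustrative):

```java
import java.nio.ByteBuffer;

public class IntArrayBytes {
    // View an int[] as bytes via a single bulk copy into the buffer's backing array.
    static byte[] toBytes(int[] ints) {
        ByteBuffer buf = ByteBuffer.allocate(ints.length * 4);  // big-endian by default
        buf.asIntBuffer().put(ints);  // bulk transfer, no per-int loop in user code
        return buf.array();           // the heap array backing the buffer
    }

    public static void main(String[] args) {
        byte[] b = toBytes(new int[]{1, 256});
        System.out.println(b.length);           // 8
        System.out.println(b[3] + " " + b[6]);  // 1 1 (big-endian byte positions)
    }
}
```

The resulting byte[] can then be handed to a single writeBytes() call; whether this beats a plain loop depends on the JIT and array sizes, so it is worth benchmarking both.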

> Improvement of PForDelta Codec
> --
>
> Key: LUCENE-2903
> URL: https://issues.apache.org/jira/browse/LUCENE-2903
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: hao yan
> Attachments: LUCENE_2903.patch
>
>
> There are 3 versions of PForDelta implementations in the Bulk Branch: 
> FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
> The FrameOfRef is a very basic one which is essentially a binary encoding 
> (it may result in a huge index size).
> The PatchedFrameOfRef is the implementation based on the original version of 
> PForDelta in the literature.
> The PatchedFrameOfRef2 is my previous implementation, which is improved this 
> time. (The codec name is changed to NewPForDelta.)
> In particular, the changes are:
> 1. I fixed the bug in my previous version (in Lucene-1410.patch), where the 
> old PForDelta did not support very large exceptions (since
> the Simple16 does not support very large numbers). Now this has been fixed in 
> the new LCPForDelta.
> 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
> two PForDelta implementations in the bulk branch (FrameOfRef and 
> PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the 
> CodecProvider and PForDeltaFixedIntBlockCodec.
> 3. The performance test results are:
> 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef 
> for almost all kinds of queries, and slightly worse than BulkVInt.
> 2) My "NewPForDelta" codec results in the smallest index size among all 4 
> methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
> 3) All performance test results were achieved by running with "-server" 
> instead of "-client".




[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: LUCENE_2903.patch

This patch renames NewPForDeltaCodec to PatchedFrameOfRef3 to follow the 
naming tradition.

It also adds back the BulkVInt all-ones trick (I removed it accidentally in 
the last patch).

> Improvement of PForDelta Codec
> --
>
> Key: LUCENE-2903
> URL: https://issues.apache.org/jira/browse/LUCENE-2903
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: hao yan
> Attachments: LUCENE_2903.patch, LUCENE_2903.patch
>
>
> There are 3 versions of PForDelta implementations in the Bulk Branch: 
> FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
> The FrameOfRef is a very basic one which is essentially a binary encoding 
> (it may result in a huge index size).
> The PatchedFrameOfRef is the implementation based on the original version of 
> PForDelta in the literature.
> The PatchedFrameOfRef2 is my previous implementation, which is improved this 
> time. (The codec name is changed to NewPForDelta.)
> In particular, the changes are:
> 1. I fixed the bug in my previous version (in Lucene-1410.patch), where the 
> old PForDelta did not support very large exceptions (since
> the Simple16 does not support very large numbers). Now this has been fixed in 
> the new LCPForDelta.
> 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
> two PForDelta implementations in the bulk branch (FrameOfRef and 
> PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the 
> CodecProvider and PForDeltaFixedIntBlockCodec.
> 3. The performance test results are:
> 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef 
> for almost all kinds of queries, and slightly worse than BulkVInt.
> 2) My "NewPForDelta" codec results in the smallest index size among all 4 
> methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
> 3) All performance test results were achieved by running with "-server" 
> instead of "-client".




CustomScoreQueryWithSubqueries

2011-02-02 Thread Fernando Wasylyszyn
Hi everyone. My name is Fernando and I am a researcher and developer in the R+D 
lab at Snoop Consulting S.R.L. in Argentina.
Based on the patch suggested in LUCENE-1608 
(https://issues.apache.org/jira/browse/LUCENE-1608), and on the needs of one of 
our customers, for whom we are developing a customized search engine on top of 
Lucene and Solr, we have developed the class CustomScoreQueryWithSubqueries. It 
is a variation of CustomScoreQuery that allows the use of arbitrary Query 
objects besides instances of ValueSourceQuery, without wrapping those queries 
in the QueryValueSource proposed in Jira, which has the disadvantage of 
creating a new IndexSearcher on each invocation of getValues(IndexReader).
If you think this contribution could be useful to the Lucene community, please 
let me know the steps to contribute.
Thanks.
Regards.
Fernando.




Nightly Lucene and Solr Maven snapshot artifacts

2011-02-02 Thread Steven A Rowe
I have switched generation and publication of Lucene and Solr snapshot 
artifacts to two new nightly Hudson jobs: Lucene-Solr-Maven-3.x and 
Lucene-Solr-Maven-trunk.

Latest Lucene and Solr 3.x snapshots:

https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-3.x/lastSuccessfulBuild/artifact/maven_artifacts

Latest Lucene and Solr trunk snapshots:

https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts

Steve



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-02 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989766#comment-12989766
 ] 

Shai Erera commented on LUCENE-1540:


I see that we both missed the CHANGES entry? :)

Other than that, patch looks good. +1 to commit !

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
> LUCENE-1540.patch, trecdocs.zip
>
>
> The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after  tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: CustomScoreQueryWithSubqueries

2011-02-02 Thread Robert Muir
On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn
 wrote:
> Hi everyone. My name is Fernando and I am a researcher and developer in the
> R+D lab at Snoop Consulting S.R.L. in Argentina.
> Based on the patch suggested in LUCENE-1608
> (https://issues.apache.org/jira/browse/LUCENE-1608) and on the needs of one
> of our customers, for whom we are developing a customized search engine on
> top of Lucene and Solr, we have developed the class
> CustomScoreQueryWithSubqueries, a variation of CustomScoreQuery
> that allows the use of arbitrary Query objects besides instances of
> ValueSourceQuery, without the need to wrap the arbitrary query/queries
> in the QueryValueSource proposed in Jira, which has the disadvantage of
> creating an instance of an IndexSearcher in each invocation of the method
> getValues(IndexReader).
> If you think that this contribution can be useful for the Lucene community,
> please let me know the steps to contribute.

Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is
still an open issue.

If you have a better solution, please don't hesitate to upload a patch
file to the issue!
There are some more detailed instructions here:
http://wiki.apache.org/lucene-java/HowToContribute

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1065719 - in /lucene/dev/trunk/modules/benchmark/lib: xercesImpl-2.10.0.jar xercesImpl-2.9.1-patched-XERCESJ-1257.jar xml-apis-2.10.0.jar xml-apis-2.9.0.jar

2011-02-02 Thread Grant Ingersoll
It might be worth noting that Freebase publishes a text-only extract of 
Wikipedia: http://download.freebase.com/wex/latest/  We could take a snapshot 
of that and host it somewhere as the new standard for benchmarking.


On Jan 31, 2011, at 2:20 PM, mikemcc...@apache.org wrote:

> Author: mikemccand
> Date: Mon Jan 31 19:20:34 2011
> New Revision: 1065719
> 
> URL: http://svn.apache.org/viewvc?rev=1065719&view=rev
> Log:
> LUCENE-1591: rollback to old patched xercesImpl.jar to workaround 
> XERCESJ-1257, which we hit on current Wikipedia XML export 
> (enwiki-20110115-pages-articles.xml)
> 
> Added:
>
> lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar
>(with props)
>lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.9.0.jar   (with props)
> Removed:
>lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.10.0.jar
>lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.10.0.jar
> 
> Added: 
> lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar
> URL: 
> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar?rev=1065719&view=auto
> ==
> Binary file - no diff available.
> 
> Added: lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.9.0.jar
> URL: 
> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.9.0.jar?rev=1065719&view=auto
> ==
> Binary file - no diff available.
> 
> 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2609) Generate jar containing test classes.

2011-02-02 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2609:
---

Attachment: LUCENE-2609.patch

Patch fixes Solr build.xmls to depend only on lucene's test-framework, as well 
as removing dev-tools/testjar.

Again, I only included the .xml changes in the patch - the rest are just svn 
moves.

> Generate jar containing test classes.
> -
>
> Key: LUCENE-2609
> URL: https://issues.apache.org/jira/browse/LUCENE-2609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Drew Farris
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
> LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch
>
>
> The test classes are useful for writing unit tests for code external to the 
> Lucene project. It would be helpful to build a jar of these classes and 
> publish them as a maven dependency.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-02-02 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2831:


Attachment: LUCENE-2831-no_sub_searcher.patch

I think we can really get rid of the sub searchers and do it all on the top 
level searcher. I just sketched out how this should be done from my point of 
view. The executors should only specify the ARC slice they want to execute and 
use the top-level searcher to do the searches.

This patch is just for illustration purposes... Javadocs need to be fixed, etc. 

I think if we do it that way, the semantics are clear for IS.
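The fan-out described here (executors working on slices while one top-level searcher does the actual searches and merges the results) can be illustrated generically. This is not the actual Lucene API: `Slice`, `searchSlice`, and the merge step are stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SliceSearchSketch {
    // A slice is just a subset of document ids here; in Lucene it would
    // group leaf readers, but this sketch stays library-independent.
    static final class Slice {
        final int[] docs;
        Slice(int[] docs) { this.docs = docs; }
    }

    // "Search" one slice: count docs matching a predicate (stand-in for a query).
    static int searchSlice(Slice slice, int needle) {
        int hits = 0;
        for (int d : slice.docs) if (d % needle == 0) hits++;
        return hits;
    }

    // The top-level "searcher" fans slices out to the executor, then merges counts.
    static int search(List<Slice> slices, int needle, ExecutorService pool) throws Exception {
        List<Future<Integer>> futures = new ArrayList<>();
        for (Slice s : slices) {
            futures.add(pool.submit(() -> searchSlice(s, needle)));
        }
        int total = 0;
        for (Future<Integer> f : futures) total += f.get();
        return total;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Slice> slices = List.of(
                new Slice(new int[] {1, 2, 3, 4}),
                new Slice(new int[] {5, 6, 7, 8}));
        // Four even-numbered docs exist across both slices.
        System.out.println(search(slices, 2, pool)); // prints 4
        pool.shutdown();
    }
}
```

The point of the design is that each slice task is read-only and independent, so no per-slice searcher state is needed; only the merge step touches shared data.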

> Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
> -
>
> Key: LUCENE-2831
> URL: https://issues.apache.org/jira/browse/LUCENE-2831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2831-no_sub_searcher.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch
>
>
> Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
> boolean, boolean) we should / could revise the API and pass in a struct that 
> has parent reader, sub reader, ord of that sub. The ord mapping plus the 
> context with its parent would make several issues way easier. See 
> LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989800#comment-12989800
 ] 

Michael McCandless commented on LUCENE-2831:


Patch looks great!  No more schizo sub searchers!

+1 to commit

> Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
> -
>
> Key: LUCENE-2831
> URL: https://issues.apache.org/jira/browse/LUCENE-2831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2831-no_sub_searcher.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch
>
>
> Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
> boolean, boolean) we should / could revise the API and pass in a struct that 
> has parent reader, sub reader, ord of that sub. The ord mapping plus the 
> context with its parent would make several issues way easier. See 
> LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4430 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4430/

1 tests failed.
REGRESSION:  org.apache.solr.TestGroupingSearch.testRandomGrouping

Error Message:
mismatch: 'XATJ'!='MAGU' @ grouped/foo_s/groups/[0]/doclist/docs/[5]/id

Stack Trace:
junit.framework.AssertionFailedError: mismatch: 'XATJ'!='MAGU' @ 
grouped/foo_s/groups/[0]/doclist/docs/[5]/id
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1144)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1076)
at 
org.apache.solr.TestGroupingSearch.testRandomGrouping(TestGroupingSearch.java:500)




Build Log (for compile errors):
[...truncated 8327 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-02-02 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2831:


Attachment: LUCENE-2831-no_sub_searcher.patch

another iteration - fixed / added some javadocs and marked LeafSlice 
experimental. I will wait a bit and commit later. All tests pass, even with 
LUCENE-2751 

> Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
> -
>
> Key: LUCENE-2831
> URL: https://issues.apache.org/jira/browse/LUCENE-2831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2831-no_sub_searcher.patch, 
> LUCENE-2831-no_sub_searcher.patch, LUCENE-2831-nuke-SolrIndexReader.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch
>
>
> Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
> boolean, boolean) we should / could revise the API and pass in a struct that 
> has parent reader, sub reader, ord of that sub. The ord mapping plus the 
> context with its parent would make several issues way easier. See 
> LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2831) Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context

2011-02-02 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2831.
-

Resolution: Fixed

Committed revision 109.

> Revise Weight#scorer & Filter#getDocIdSet API to pass Readers context
> -
>
> Key: LUCENE-2831
> URL: https://issues.apache.org/jira/browse/LUCENE-2831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-2831-no_sub_searcher.patch, 
> LUCENE-2831-no_sub_searcher.patch, LUCENE-2831-nuke-SolrIndexReader.patch, 
> LUCENE-2831-nuke-SolrIndexReader.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831-recursion.patch, 
> LUCENE-2831-recursion.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, LUCENE-2831.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch, 
> LUCENE-2831_transition_to_atomicCtx.patch
>
>
> Spinoff from LUCENE-2694 - instead of passing a reader into Weight#scorer(IR, 
> boolean, boolean) we should / could revise the API and pass in a struct that 
> has parent reader, sub reader, ord of that sub. The ord mapping plus the 
> context with its parent would make several issues way easier. See 
> LUCENE-2694, LUCENE-2348 and LUCENE-2829 to name some.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2751) add LuceneTestCase.newSearcher()

2011-02-02 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989839#comment-12989839
 ] 

Simon Willnauer commented on LUCENE-2751:
-

I just fixed the issues you saw here over on LUCENE-2831; all tests pass with 
the latest patch here.

> add LuceneTestCase.newSearcher()
> 
>
> Key: LUCENE-2751
> URL: https://issues.apache.org/jira/browse/LUCENE-2751
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch, 
> LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch
>
>
> Most tests in the search package don't care about what kind of searcher they 
> use.
> we should randomly use MultiSearcher or ParallelMultiSearcher sometimes in 
> tests.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2751) add LuceneTestCase.newSearcher()

2011-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989846#comment-12989846
 ] 

Robert Muir commented on LUCENE-2751:
-

Thanks Simon! I'll commit this and take a look at backporting this monster to 
3.x


> add LuceneTestCase.newSearcher()
> 
>
> Key: LUCENE-2751
> URL: https://issues.apache.org/jira/browse/LUCENE-2751
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch, 
> LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch
>
>
> Most tests in the search package don't care about what kind of searcher they 
> use.
> we should randomly use MultiSearcher or ParallelMultiSearcher sometimes in 
> tests.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4433 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4433/

1 tests failed.
REGRESSION:  org.apache.solr.TestGroupingSearch.testRandomGrouping

Error Message:
mismatch: 'LRMT'!='CETY' @ grouped/id/groups/[0]/groupValue

Stack Trace:
junit.framework.AssertionFailedError: mismatch: 'LRMT'!='CETY' @ 
grouped/id/groups/[0]/groupValue
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1144)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1076)
at 
org.apache.solr.TestGroupingSearch.testRandomGrouping(TestGroupingSearch.java:500)




Build Log (for compile errors):
[...truncated 8301 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989856#comment-12989856
 ] 

Paul Elschot commented on LUCENE-2903:
--

One way to get an underlying byte array from an IntBuffer is to allocate the 
IntBuffer via ByteBuffer.asIntBuffer() on a ByteBuffer that wraps the byte 
array. Would that be possible here?
I remember using this for testing the original (P)FOR implementation with 
Lucene's IndexInput/IndexOutput. I did not look at any code to answer this 
though, so please holler if this is a dead end.
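In isolation, the suggestion above (an IntBuffer view obtained via ByteBuffer.asIntBuffer() over a ByteBuffer that wraps the byte array) behaves like this small self-contained sketch:

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class IntViewDemo {
    public static void main(String[] args) {
        // Back the IntBuffer with a plain byte[]: writes through the int view
        // land directly in the array, so no copy is needed to get bytes out.
        byte[] backing = new byte[4 * 4];
        ByteBuffer bytes = ByteBuffer.wrap(backing);
        IntBuffer ints = bytes.asIntBuffer();

        ints.put(new int[] {1, 2, 3, 258});

        // The same storage is visible as bytes (big-endian by default):
        // 258 == 0x00000102, so the last four bytes are 0, 0, 1, 2.
        System.out.println(backing[12] + "," + backing[13] + ","
                + backing[14] + "," + backing[15]); // prints 0,0,1,2
    }
}
```

Note that the view inherits the ByteBuffer's byte order (big-endian unless changed via `order()`), which matters if the bytes are later written to an index file.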


> Improvement of PForDelta Codec
> --
>
> Key: LUCENE-2903
> URL: https://issues.apache.org/jira/browse/LUCENE-2903
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: hao yan
> Attachments: LUCENE_2903.patch, LUCENE_2903.patch
>
>
> There are 3 versions of PForDelta implementations in the bulk branch: 
> FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
> The FrameOfRef is a very basic one which is essentially a binary encoding 
> (and may result in huge index size).
> The PatchedFrameOfRef is the implementation based on the original version of 
> PForDelta in the literature.
> The PatchedFrameOfRef2 is my previous implementation, which is improved this 
> time. (The codec name is changed to NewPForDelta.)
> In particular, the changes are:
> 1. I fixed the bug in my previous version (in Lucene-1410.patch), where the 
> old PForDelta did not support very large exceptions (since Simple16 does not 
> support very large numbers). This has now been fixed in the new LCPForDelta.
> 2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other 
> two PForDelta implementations in the bulk branch (FrameOfRef and 
> PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the 
> CodecProvider and PForDeltaFixedIntBlockCodec.
> 3. The performance test results are:
> 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef 
> for almost all kinds of queries, and slightly worse than BulkVInt.
> 2) My "NewPForDelta" codec results in the smallest index size among all 4 
> methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
> 3) All performance test results were achieved by running with "-server" 
> instead of "-client".
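As background for readers, plain frame-of-reference coding (the "essentially a binary encoding" the issue description mentions) stores each value as an offset from a per-block reference using a fixed bit width. A minimal, library-independent sketch follows; the class and method names are illustrative, real codecs additionally bit-pack the offsets, and the "patched" variants store large outliers as exceptions rather than widening the whole block.

```java
import java.util.Arrays;

public class ForSketch {
    // Bits needed to represent a non-negative value (at least 1).
    static int bitsNeeded(int v) {
        return v == 0 ? 1 : 32 - Integer.numberOfLeadingZeros(v);
    }

    // Encode a block as (reference, bitsPerValue, offsets...).
    // Offsets are left unpacked here for clarity; a real codec packs them.
    static int[] encode(int[] block) {
        int ref = Integer.MAX_VALUE;
        for (int v : block) ref = Math.min(ref, v);
        int maxOffset = 0;
        for (int v : block) maxOffset = Math.max(maxOffset, v - ref);
        int[] out = new int[block.length + 2];
        out[0] = ref;
        out[1] = bitsNeeded(maxOffset); // every offset fits in this many bits
        for (int i = 0; i < block.length; i++) out[i + 2] = block[i] - ref;
        return out;
    }

    static int[] decode(int[] enc) {
        int[] block = new int[enc.length - 2];
        for (int i = 0; i < block.length; i++) block[i] = enc[i + 2] + enc[0];
        return block;
    }

    public static void main(String[] args) {
        int[] enc = encode(new int[] {100, 103, 101, 107});
        // Reference 100; offsets 0..7 fit in 3 bits per value.
        System.out.println(enc[0] + " " + enc[1]); // prints 100 3
        System.out.println(Arrays.toString(decode(enc)));
    }
}
```

A single large value ("exception") forces a wide bit width on the whole block under plain FOR, which is exactly the problem the patched variants address.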

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989872#comment-12989872
 ] 

hao yan commented on LUCENE-2903:
-

Yes, using ByteBuffer.asIntBuffer() is the same as converting an int/byte array 
to a byte/int array; I don't think the underlying implementation of 
ByteBuffer.asIntBuffer() can avoid that. I also tried ByteBuffer/IntBuffer, 
though, and the result was worse, which makes sense since it may incur extra 
costs.

Where to holler? :) 

> Improvement of PForDelta Codec
> --
>
> Key: LUCENE-2903
> URL: https://issues.apache.org/jira/browse/LUCENE-2903
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: hao yan
> Attachments: LUCENE_2903.patch, LUCENE_2903.patch
>
>
> There are 3 versions of PForDelta implementations in the bulk branch: 
> FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
> The FrameOfRef is a very basic one which is essentially a binary encoding 
> (and may result in huge index size).
> The PatchedFrameOfRef is the implementation based on the original version of 
> PForDelta in the literature.
> The PatchedFrameOfRef2 is my previous implementation, which is improved this 
> time. (The codec name is changed to NewPForDelta.)
> In particular, the changes are:
> 1. I fixed the bug in my previous version (in Lucene-1410.patch), where the 
> old PForDelta did not support very large exceptions (since Simple16 does not 
> support very large numbers). This has now been fixed in the new LCPForDelta.
> 2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other 
> two PForDelta implementations in the bulk branch (FrameOfRef and 
> PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the 
> CodecProvider and PForDeltaFixedIntBlockCodec.
> 3. The performance test results are:
> 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef 
> for almost all kinds of queries, and slightly worse than BulkVInt.
> 2) My "NewPForDelta" codec results in the smallest index size among all 4 
> methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
> 3) All performance test results were achieved by running with "-server" 
> instead of "-client".

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2345) Extend geodist() to support MultiValuefield for sorting/scoring

2011-02-02 Thread Bill Bell (JIRA)
Extend geodist() to support MultiValuefield for sorting/scoring
---

 Key: SOLR-2345
 URL: https://issues.apache.org/jira/browse/SOLR-2345
 Project: Solr
  Issue Type: New Feature
Reporter: Bill Bell


Extend geodist() and potentially other functions to support MultiValue fields 
for sorting and scoring.

sort=geodist() asc

This should grab the closest point in the MultiValue list and return the 
distance so that it can be scored.

The problem is that I cannot find a way to get the MultiValue list.

In function: 
src/java/org/apache/solr/search/function/distance/HaversineConstFunction.java

VectorValueSource p2;
this.p2 = vs;
List<ValueSource> sources = p2.getSources();
ValueSource latSource = sources.get(0); 
ValueSource lonSource = sources.get(1); 
DocValues latVals = latSource.getValues(context1, readerContext1);
DocValues lonVals = lonSource.getValues(context1, readerContext1);
double latRad = latVals.doubleVal(doc) * DistanceUtils.DEGREES_TO_RADIANS;
double lonRad = lonVals.doubleVal(doc) * DistanceUtils.DEGREES_TO_RADIANS;
etc...

It would be good if I could loop through sources.get() but it only returns 2 
sources even when there are 2 pairs of lat/long.

sources:[double(store_0_coordinate), double(store_1_coordinate)]

How do I extend the sources?
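Whatever mechanism eventually exposes the extra coordinate pairs, the "closest point" semantics itself is just a minimum over per-point haversine distances. A standalone sketch, with the caveat that the point-list representation and the 6371 km earth radius are assumptions for illustration, not Solr's API:

```java
public class MinGeodist {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two lat/lon points, in kilometers.
    static double haversine(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Distance from a query point to the closest of a document's points,
    // given as {lat, lon} pairs: the value a multivalued geodist() would sort on.
    static double minDistance(double qLat, double qLon, double[][] points) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] p : points) {
            min = Math.min(min, haversine(qLat, qLon, p[0], p[1]));
        }
        return min;
    }

    public static void main(String[] args) {
        // Two hypothetical store locations: near New York and Los Angeles.
        double[][] storeLocations = { {40.7, -74.0}, {34.05, -118.24} };
        // Query from near New York: the first point should win (~14 km).
        System.out.printf("%.1f km%n", minDistance(40.8, -74.1, storeLocations));
    }
}
```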

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



starting with the demo

2011-02-02 Thread Mark David
Hey, Lucenesters,
I'm just getting started developing with Solr/Lucene.  I successfully checked 
out the code from SVN and built it using "ant".
However, the demo page (http://lucene.apache.org/java/3_0_3/demo.html) says 
that I should be able to run "ant war-demo", but there is no target with that 
name in the build.xml file, and no war files were generated by my regular build.

Are these instructions out-of-date?  Or am I doing something bone-headed?

MARK DAVID
Technical Consultant

SEARCH TECHNOLOGIES
THE EXPERT IN THE SEARCH SPACE
www.searchtechnologies.com



[jira] Issue Comment Edited: (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-02-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989917#comment-12989917
 ] 

Grant Ingersoll edited comment on LUCENE-2878 at 2/3/11 12:25 AM:
--

{quote} And it's finally single source... caller must say
up-front (when pulling the scorer) if it will want positions (and,
separately, also payloads – great).
{quote}
Does it make sense that we could just want AttributeSources as we go here?

  was (Author: gsingers):
bq. And it's finally single source... caller must say
up-front (when pulling the scorer) if it will want positions (and,
separately, also payloads – great).

Does it make sense that we could just want AttributeSources as we go here?
  
> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Attachments: LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch
>
>
> Currently we have two somewhat separate types of queries: the ones that can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while 
> now, so I started working on that using the bulk postings API. I would have 
> done that first cut on trunk, but TermScorer there works on a BlockReader 
> that does not expose positions, while the one in this branch does. I started 
> adding a new Positions class which users can pull from a scorer; to prevent 
> unnecessary positions enums I added ScorerContext#needsPositions and 
> eventually Scorer#needsPayloads to create the corresponding enum on demand. 
> Yet, currently only TermQuery / TermScorer implements this API, and others 
> simply return null instead. 
> To show that the API really works, and that our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood 
> and nuked TermSpans entirely. A nice side effect of this was that the 
> Position BulkReading implementation got some exercise, which now :) all 
> works with positions, while payloads for bulk reading are kind of 
> experimental in the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need payloads (StandardCodec ONLY)!! I didn't bother 
> to implement the other codecs yet, since I want to get feedback on the API 
> and on this first cut before I go on with it. I will upload the 
> corresponding patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after the pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I 
> didn't look into the MemoryIndex BulkPostings API yet).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

2011-02-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989917#comment-12989917
 ] 

Grant Ingersoll commented on LUCENE-2878:
-

bq. And it's finally single source... caller must say
up-front (when pulling the scorer) if it will want positions (and,
separately, also payloads – great).

Does it make sense that we could just want AttributeSources as we go here?

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --
>
> Key: LUCENE-2878
> URL: https://issues.apache.org/jira/browse/LUCENE-2878
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Bulk Postings branch
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Attachments: LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch
>
>
> Currently we have two somewhat separate types of queries: the ones that can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems bugged me for a while 
> now, so I started working on that using the bulk postings API. I would have 
> done that first cut on trunk, but TermScorer there works on a BlockReader 
> that does not expose positions, while the one in this branch does. I started 
> adding a new Positions class which users can pull from a scorer; to prevent 
> unnecessary positions enums I added ScorerContext#needsPositions and 
> eventually Scorer#needsPayloads to create the corresponding enum on demand. 
> Yet, currently only TermQuery / TermScorer implements this API, and others 
> simply return null instead. 
> To show that the API really works, and that our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood 
> and nuked TermSpans entirely. A nice side effect of this was that the 
> Position BulkReading implementation got some exercise, which now :) all 
> works with positions, while payloads for bulk reading are kind of 
> experimental in the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need payloads (StandardCodec ONLY)!! I didn't bother 
> to implement the other codecs yet, since I want to get feedback on the API 
> and on this first cut before I go on with it. I will upload the 
> corresponding patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after the pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I 
> didn't look into the MemoryIndex BulkPostings API yet).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2339) No error reported when sorting on a field Solr knows you shouldn't sort on.

2011-02-02 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-2339:
---

Attachment: SOLR-2339.patch

Updated patch that fixes the remaining sort related test failures.

TestDistributedSearch and BasicDistributedZkTest were fundamentally flawed in 
that they expected deterministic sort orderings on fields into which they were 
deliberately putting multiple values (ie: not just sorting on a multivalued 
field, but sorting on a multivalued field that was actually used to store 
multiple values for each doc).

I'm amazed they ever passed. 

> No error reported when sorting on a field Solr knows you shouldn't sort on.
> ---
>
> Key: SOLR-2339
> URL: https://issues.apache.org/jira/browse/SOLR-2339
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Reporter: Hoss Man
>Assignee: Hoss Man
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-2339.patch, SOLR-2339.patch
>
>
> In the past, Solr has relied on the underlying FieldCache to throw an error 
> in situations where sorting on a field was not possible.  However, LUCENE-2142 
> has changed this, so that FieldCache never throws an error.
> In order to maintain the functionality of past Solr releases (ie: error when 
> users attempt to sort on a field that we know will produce meaningless 
> results) we should add some sort of check at the Solr level.
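A self-contained sketch of the kind of check the description asks for, using a stand-in field descriptor (the real patch works against Solr's schema classes; the names below are illustrative assumptions, not the committed SOLR-2339 code):

```java
// Fail fast at the Solr layer before the FieldCache is ever consulted.
public class SortableCheckDemo {
    // Stand-in for the relevant bits of a schema field definition.
    public record FieldInfo(String name, boolean indexed, boolean multiValued) {}

    public static void checkSortable(FieldInfo f) {
        if (!f.indexed() || f.multiValued()) {
            throw new IllegalArgumentException(
                "can not sort on " + (f.multiValued() ? "multivalued" : "unindexed")
                + " field: " + f.name());
        }
    }

    public static void main(String[] args) {
        checkSortable(new FieldInfo("title_s", true, false)); // single-valued: OK
        try {
            checkSortable(new FieldInfo("tags_ss", true, true)); // multivalued: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```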




[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4437 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4437/

1 tests failed.
REGRESSION:  org.apache.solr.TestGroupingSearch.testRandomGrouping

Error Message:
mismatch: 'us'!='i' @ grouped/foo_s/groups/[0]/groupValue

Stack Trace:
junit.framework.AssertionFailedError: mismatch: 'us'!='i' @ 
grouped/foo_s/groups/[0]/groupValue
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)
at 
org.apache.solr.TestGroupingSearch.testRandomGrouping(TestGroupingSearch.java:500)




Build Log (for compile errors):
[...truncated 8338 lines...]






[jira] Updated: (LUCENE-2751) add LuceneTestCase.newSearcher()

2011-02-02 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2751:


Attachment: LUCENE-2751_branch3x.patch

It seems this uncovered some monsters in branch_3x (unless I severely screwed 
things up).

Attached is a patch of my merge. There are problems: some tests hang completely 
(e.g. TestLazyProxSkipping), and if TestSort gets a parallel executor, 
testNormalizedScores fails:
{noformat}
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestSort 
-Dtestmethod=testNormalizedScores 
-Dtests.seed=367124575585283127:1927326607030239825
[junit] NOTE: test params are: locale=zh_SG, timezone=America/Tijuana
[junit] NOTE: all tests run in this JVM:
[junit] [TestSort]
[junit] NOTE: Windows Vista 6.0 x86/Sun Microsystems Inc. 1.6.0_23 
(32-bit)/cpus=4,threads=1,free=12557672,total=16252928
[junit] -  ---
[junit] Testcase: testNormalizedScores(org.apache.lucene.search.TestSort):  
FAILED
[junit] expected:<1.6163856983184814> but was:
[junit] junit.framework.AssertionFailedError: expected:<1.6163856983184814> 
but was:
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977)
[junit] at 
org.apache.lucene.search.TestSort.assertSameValues(TestSort.java:1094)
[junit] at 
org.apache.lucene.search.TestSort.testNormalizedScores(TestSort.java:668)
[junit]
[junit]
[junit] Test org.apache.lucene.search.TestSort FAILED
{noformat}

Posting the patch of the merge so we can hopefully debug through these.


> add LuceneTestCase.newSearcher()
> 
>
> Key: LUCENE-2751
> URL: https://issues.apache.org/jira/browse/LUCENE-2751
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch, 
> LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch, 
> LUCENE-2751_branch3x.patch
>
>
> Most tests in the search package don't care about what kind of searcher they 
> use.
> we should randomly use MultiSearcher or ParallelMultiSearcher sometimes in 
> tests.




[jira] Commented: (LUCENE-2751) add LuceneTestCase.newSearcher()

2011-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989944#comment-12989944
 ] 

Robert Muir commented on LUCENE-2751:
-

I fixed the hangs; they were due to a recursion bug similar to the one trunk had 
(IndexSearcher setting itself as a sub).

I committed the patch, but I added an @Ignore to testNormalizedScores:
{noformat}
[junit] Testsuite: org.apache.lucene.search.TestSort
[junit] Tests run: 26, Failures: 0, Errors: 0, Time elapsed: 2.471 sec
[junit]
[junit] - Standard Error -
[junit] NOTE: Ignoring test method 'testNormalizedScores': Fix me! Fails if one 
of the subs is a threaded indexsearcher
[junit] -  ---
{noformat}

I think we should get to the bottom of why this one fails... I'll keep the 
issue open.



> add LuceneTestCase.newSearcher()
> 
>
> Key: LUCENE-2751
> URL: https://issues.apache.org/jira/browse/LUCENE-2751
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Build
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch, 
> LUCENE-2751.patch, LUCENE-2751.patch, LUCENE-2751.patch, 
> LUCENE-2751_branch3x.patch
>
>
> Most tests in the search package don't care about what kind of searcher they 
> use.
> we should randomly use MultiSearcher or ParallelMultiSearcher sometimes in 
> tests.




RE: starting with the demo

2011-02-02 Thread Mark David
Ah, I see that the target exists in the distribution download but not in the 
SVN checkout, so I suspect that the instructions for building the demo differ 
between the two cases and are not what the demo page mentioned below indicates.

From: Mark David [mailto:mda...@searchtechnologies.com]
Sent: Wednesday, February 02, 2011 3:43 PM
To: dev@lucene.apache.org
Subject: starting with the demo

Hey, Lucenesters,
I'm just getting started developing with Solr/Lucene.  I successfully checked 
out the code from SVN and built it using "ant".
However, the demo page (http://lucene.apache.org/java/3_0_3/demo.html) says 
that I should be able to run "ant war-demo", but there is no target with that 
name in the build.xml file, and no war files were generated by my regular build.

Are these instructions out-of-date?  Or am I doing something bone-headed?

MARK DAVID
Technical Consultant

SEARCH TECHNOLOGIES
THE EXPERT IN THE SEARCH SPACE
www.searchtechnologies.com





Re: starting with the demo

2011-02-02 Thread Robert Muir
On Wed, Feb 2, 2011 at 6:42 PM, Mark  David
 wrote:
> Hey, Lucenesters,
>
> I’m just getting started developing with Solr/Lucene.  I successfully
> checked out the code from SVN and built it using “ant”.
>
> However, the demo page (http://lucene.apache.org/java/3_0_3/demo.html) says
> that I should be able to run “ant war-demo”, but there is no target with
> that name in the build.xml file, and no war files were generated by my
> regular build.
>

Hi, are you checking out from our svn trunk?
(https://svn.apache.org/repos/asf/lucene/dev/trunk/) ?
In this case the demo has moved under lucene/contrib/demo

You will find the same ant targets work under that directory.




[HUDSON] Lucene-trunk - Build # 1454 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1454/

1 tests failed.
REGRESSION:  org.apache.lucene.search.TestWildcard.testParsingAndSearching

Error Message:
expected:<0> but was:<1>

Stack Trace:
junit.framework.AssertionFailedError: expected:<0> but was:<1>
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)
at 
org.apache.lucene.search.TestWildcard.testParsingAndSearching(TestWildcard.java:338)




Build Log (for compile errors):
[...truncated 9298 lines...]






Re: [HUDSON] Lucene-trunk - Build # 1454 - Failure

2011-02-02 Thread Robert Muir
This is a false failure; the test assumes an in-order merge policy.

I committed a fix in r1066727

On Wed, Feb 2, 2011 at 9:54 PM, Apache Hudson Server
 wrote:
> Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1454/
>
> 1 tests failed.
> REGRESSION:  org.apache.lucene.search.TestWildcard.testParsingAndSearching
>
> Error Message:
> expected:<0> but was:<1>
>
> Stack Trace:
> junit.framework.AssertionFailedError: expected:<0> but was:<1>
>        at 
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
>        at 
> org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)
>        at 
> org.apache.lucene.search.TestWildcard.testParsingAndSearching(TestWildcard.java:338)
>
>
>
>
> Build Log (for compile errors):
> [...truncated 9298 lines...]
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>




RE: starting with the demo

2011-02-02 Thread Mark David
Yes, thank you, that works.

There is no clear way to edit the documentation as there is on the wiki, or I 
would change it to reflect this information.

Mark

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Wednesday, February 02, 2011 6:41 PM
To: dev@lucene.apache.org
Subject: Re: starting with the demo

On Wed, Feb 2, 2011 at 6:42 PM, Mark  David
 wrote:
> Hey, Lucenesters,
>
> I’m just getting started developing with Solr/Lucene.  I successfully
> checked out the code from SVN and built it using “ant”.
>
> However, the demo page (http://lucene.apache.org/java/3_0_3/demo.html) says
> that I should be able to run “ant war-demo”, but there is no target with
> that name in the build.xml file, and no war files were generated by my
> regular build.
>

Hi, are you checking out from our svn trunk?
(https://svn.apache.org/repos/asf/lucene/dev/trunk/) ?
In this case the demo has moved under lucene/contrib/demo

You will find the same ant targets work under that directory.




[jira] Updated: (SOLR-1057) PathTokenizerFactory

2011-02-02 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated SOLR-1057:
-

Attachment: SOLR-1057.patch

New patch; renamed it to PathHierarchyTokenizer.

> PathTokenizerFactory
> 
>
> Key: SOLR-1057
> URL: https://issues.apache.org/jira/browse/SOLR-1057
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Ryan McKinley
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-1057-PathTokenizerFactory.patch, 
> SOLR-1057-PathTokenizerFactory.patch, SOLR-1057.patch, SOLR-1057.patch
>
>
> This is a Tokenizer that splits the input string into a series of paths.  For 
> example:
> {panel}
>  /aaa/bbb/ccc
> {panel}
> becomes:
> {panel}
>  /aaa/
>  /aaa/bbb/
>  /aaa/bbb/ccc
> {panel}
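The tokenization shown above can be sketched in a few lines of plain Java. This is a simplified stand-in for illustration, not the actual PathHierarchyTokenizer code from the patch:

```java
import java.util.ArrayList;
import java.util.List;

public class PathHierarchyDemo {
    // Emit one token per path prefix, ending with the full path,
    // mirroring the /aaa/bbb/ccc example in the issue description.
    public static List<String> hierarchyTokens(String path, char delimiter) {
        List<String> tokens = new ArrayList<>();
        for (int i = 1; i < path.length(); i++) {
            if (path.charAt(i) == delimiter) {
                tokens.add(path.substring(0, i + 1)); // prefix up to and including delimiter
            }
        }
        tokens.add(path); // the full path is the final token
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(hierarchyTokens("/aaa/bbb/ccc", '/'));
        // [/aaa/, /aaa/bbb/, /aaa/bbb/ccc]
    }
}
```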




[HUDSON] Lucene-Solr-tests-only-3.x - Build # 4426 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4426/

1 tests failed.
REGRESSION:  
org.apache.lucene.search.payloads.TestPayloadTermQuery.testIgnoreSpanScorer

Error Message:
MockDirectoryWrapper: cannot close: there are still open files: {_0.tis=1, 
_3.frq=1, _3.tvd=1, _1.frq=1, _3.tvf=1, _4.prx=1, _3.fdt=1, _4.fdx=1, _3.tvx=1, 
_4.frq=1, _4.tis=1, _0.prx=1, _4.tvx=1, _3.nrm=1, _0.nrm=1, _1.tvx=1, _1.tis=1, 
_0.tvd=1, _4.nrm=1, _0.tvf=1, _4.fdt=1, _3.prx=1, _2.cfs=1, _3.fdx=1, _1.prx=1, 
_1.fdx=1, _1.tvf=1, _4.tvd=1, _1.fdt=1, _0.tvx=1, _4.tvf=1, _0.frq=1, _0.fdx=1, 
_1.tvd=1, _0.fdt=1, _1.nrm=1, _3.tis=1}

Stack Trace:
java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 
open files: {_0.tis=1, _3.frq=1, _3.tvd=1, _1.frq=1, _3.tvf=1, _4.prx=1, 
_3.fdt=1, _4.fdx=1, _3.tvx=1, _4.frq=1, _4.tis=1, _0.prx=1, _4.tvx=1, _3.nrm=1, 
_0.nrm=1, _1.tvx=1, _1.tis=1, _0.tvd=1, _4.nrm=1, _0.tvf=1, _4.fdt=1, _3.prx=1, 
_2.cfs=1, _3.fdx=1, _1.prx=1, _1.fdx=1, _1.tvf=1, _4.tvd=1, _1.fdt=1, _0.tvx=1, 
_4.tvf=1, _0.frq=1, _0.fdx=1, _1.tvd=1, _0.fdt=1, _1.nrm=1, _3.tis=1}
at 
org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:418)
at 
org.apache.lucene.search.payloads.TestPayloadTermQuery.tearDown(TestPayloadTermQuery.java:135)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977)
Caused by: java.lang.RuntimeException: unclosed IndexInput
at 
org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:373)
at org.apache.lucene.store.Directory.openInput(Directory.java:139)
at 
org.apache.lucene.index.TermVectorsReader.(TermVectorsReader.java:81)
at 
org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:299)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:580)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:556)
at 
org.apache.lucene.index.DirectoryReader.(DirectoryReader.java:113)
at 
org.apache.lucene.index.ReadOnlyDirectoryReader.(ReadOnlyDirectoryReader.java:29)
at 
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:736)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:428)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:288)
at org.apache.lucene.search.IndexSearcher.(IndexSearcher.java:97)
at 
org.apache.lucene.search.payloads.TestPayloadTermQuery.testIgnoreSpanScorer(TestPayloadTermQuery.java:224)




Build Log (for compile errors):
[...truncated 4480 lines...]






[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4442 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4442/

1 tests failed.
REGRESSION:  org.apache.lucene.search.payloads.TestPayloadTermQuery.test

Error Message:
NaN does not equal: 1

Stack Trace:
junit.framework.AssertionFailedError: NaN does not equal: 1
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)
at 
org.apache.lucene.search.payloads.TestPayloadTermQuery.test(TestPayloadTermQuery.java:149)




Build Log (for compile errors):
[...truncated 2961 lines...]






[jira] Updated: (LUCENE-2208) Token div exceeds length of provided text sized 4114

2011-02-02 Thread Hsiu Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hsiu Wang updated LUCENE-2208:
--

Attachment: LUCENE-2208.patch

patch to fix org.apache.lucene.search.highlight.InvalidTokenOffsetsException

> Token div exceeds length of provided text sized 4114
> 
>
> Key: LUCENE-2208
> URL: https://issues.apache.org/jira/browse/LUCENE-2208
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 3.0
> Environment:  diagnostics = {os.version=5.1, os=Windows XP, 
> lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=x86, 
> java.version=1.6.0_12, java.vendor=Sun Microsystems Inc.}
>
>Reporter: Ramazan VARLIKLI
> Attachments: LUCENE-2208.patch, LUCENE-2208_test.patch
>
>
> I have a doc which contains html codes. I want to strip html tags and make 
> the test clear after then apply highlighter on the clear text . But 
> highlighter throws an exceptions if I strip out the html characters  , if i 
> don't strip out , it works fine. It just confuses me at the moment 
> I copy paste 3 thing here from the console as it may contain special 
> characters which might cause the problem.
> 1 -) Here is the html text 
>   Starter
>   
> 
> 
>  Learning path: History
>   Key question
>   Did transport fuel the industrial revolution?
>   Learning Objective
> 
>   To categorise points as for or against an argument
>   
> 
>   What to do?
>   
> Watch the clip: Transport fuelled the industrial 
> revolution.
>   
>   The clips claims that transport fuelled the industrial 
> revolution. Some historians argue that the industrial revolution only 
> happened because of developments in transport.
> 
>   Read the statements below and decide which 
> points are for and which points are against the argument 
> that industry expanded in the 18th and 19th centuries because of developments 
> in transport.
>   
>   
>   
>   Industry expanded because of inventions and 
> the discovery of steam power.
>   Improvements in transport allowed goods to 
> be sold all over the country and all over the world so there were more 
> customers to develop industry for.
>   Developments in transport allowed 
> resources, such as coal from mines and cotton from America to come together 
> to manufacture products.
>   Transport only developed because industry 
> needed it. It was slow to develop as money was spent on improving roads, then 
> building canals and the replacing them with railways in order to keep up with 
> industry.
>   
>   
>   Now try to think of 2 more statements of your 
> own.
>   
> 
> 
>   
>   Main activity
>   
> 
> Learning path: 
> History
>   Learning Objective
>   
> To select evidence to support points
>   
>   What to do?
>   
>   Choose the 4 points that you think are most important - 
> try to be balanced by having two for and two 
> against.
> Write one in each of the point boxes of the 
> paragraphs on the sheet  class="link-internal">Constructing a balanced argument. You 
> might like to re write the points in your own words and use connectives to 
> link the paragraphs.
>   
> In history and in any argument, you need evidence 
> to support your points.
> Find evidence from these sources and from 
> your own knowledge to support each of your points:
> 
>  href="../servlet/link?template=vid¯o=setResource&resourceID=2044" 
> class="link-internal">At a toll gate
>  href="../servlet/link?macro=setResource&template=vid&resourceID=2046" 
> class="link-internal">Canals
>  href="../servlet/link?macro=setResource&template=vid&resourceID=2043" 
> class="link-internal">Growing cities: traffic
>href="../servlet/link?macro=setResource&template=vid&resourceID=2047" 
> class="link-internal">Impact of the railway 
>href="../servlet/link?macro=setResource&template=vid&resourceID=2048" 
> class="link-internal">Sailing ships 
>href="

[jira] Created: (SOLR-2346) Non UTF-8 text files having other than English texts (Japanese/Hebrew) are not getting indexed correctly.

2011-02-02 Thread Prasad Deshpande (JIRA)
Non UTF-8 text files having other than English texts (Japanese/Hebrew) are not 
getting indexed correctly.
---

 Key: SOLR-2346
 URL: https://issues.apache.org/jira/browse/SOLR-2346
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4.1
 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
XP SP1, Machine is booted in Japanese Locale.
Reporter: Prasad Deshpande
Priority: Critical


I am able to successfully index/search non-English files (like Hebrew, 
Japanese) that were encoded in UTF-8. However, when I tried to index data that 
was encoded in a local encoding like Big5 for Japanese, I could not see the 
desired results. The contents after indexing looked garbled for the 
Big5-encoded document when I searched for all indexed documents. When I index 
the attached non-UTF-8 file, it is indexed in the following way

- 
- 
- 
  �� ��
  
- 
  Big5
  
- 
  zh
  
- 
  zh
  
- 
  17
  
- 
  text/plain
  
  doc2
  
  
  

Here you said it indexes the file in UTF-8; however, it seems that the 
non-UTF-8 file gets indexed in the Big5 encoding.
Here I tried fetching the indexed data stream as Big5 and converting it to 
UTF-8:

String id = (String) resulDocument.getFirstValue("attr_content");
byte[] bytearray = id.getBytes("Big5");
String utf8String = new String(bytearray, "UTF-8");
It does not give the expected results.

When I index UTF-8 file it indexes like following

- 
- 
  マイ ネットワーク
  
- 
  UTF-8
  
- 
  text/plain
  
- 
  sample_jap_unicode.txt
  
- 
  28
  
- 
  myfile
  
- 
  text/plain
  
  doc2
  

So, I can index and search UTF-8 data.






[jira] Issue Comment Edited: (LUCENE-2208) Token div exceeds length of provided text sized 4114

2011-02-02 Thread Hsiu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989965#comment-12989965
 ] 

Hsiu Wang edited comment on LUCENE-2208 at 2/3/11 5:49 AM:
---

Added a patch (LUCENE-2208.patch) to fix 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

The exception is caused by HTML escape sequences (e.g., &amp;), which are 
counted as 1 character by text.length() in Highlighter.getBestTextFragments, 
but in HTMLStripCharFilter they are counted as N characters (&amp; is counted 
as 5). 

In the patch, I commented out an incorrect test case in 
HTMLStripCharFilterTest.testOffset() ("X & X ( X < > X"). The commented-out 
test case is covered by Robert's test patch. 
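The length mismatch described here can be reproduced in a few lines (a minimal illustration of the offset arithmetic, not the Highlighter internals):

```java
// The escaped form of '&' is 5 characters in the raw HTML but 1 character
// once stripped, so end offsets computed against the stripped text can run
// past the end of the original fragment the highlighter was handed.
public class EscapeOffsetDemo {
    public static void main(String[] args) {
        String raw = "X &amp; X";   // raw HTML, as the highlighter receives it
        String stripped = "X & X";  // after HTMLStripCharFilter-style stripping
        System.out.println(raw.length());                     // 9
        System.out.println(stripped.length());                // 5
        System.out.println(raw.length() - stripped.length()); // 4 chars of drift
    }
}
```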


  was (Author: hwang):
patch to fix org.apache.lucene.search.highlight.InvalidTokenOffsetsException
  
> Token div exceeds length of provided text sized 4114
> 
>
> Key: LUCENE-2208
> URL: https://issues.apache.org/jira/browse/LUCENE-2208
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 3.0
> Environment:  diagnostics = {os.version=5.1, os=Windows XP, 
> lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=x86, 
> java.version=1.6.0_12, java.vendor=Sun Microsystems Inc.}
>
>Reporter: Ramazan VARLIKLI
> Attachments: LUCENE-2208.patch, LUCENE-2208_test.patch
>
>
> I have a doc which contains html codes. I want to strip html tags and make 
> the test clear after then apply highlighter on the clear text . But 
> highlighter throws an exceptions if I strip out the html characters  , if i 
> don't strip out , it works fine. It just confuses me at the moment 
> I copy paste 3 thing here from the console as it may contain special 
> characters which might cause the problem.
> 1 -) Here is the html text 
>   Starter
>   
> 
> 
>  Learning path: History
>   Key question
>   Did transport fuel the industrial revolution?
>   Learning Objective
> 
>   To categorise points as for or against an argument
>   
> 
>   What to do?
>   
> Watch the clip: Transport fuelled the industrial 
> revolution.
>   
>   The clips claims that transport fuelled the industrial 
> revolution. Some historians argue that the industrial revolution only 
> happened because of developments in transport.
> 
>   Read the statements below and decide which 
> points are for and which points are against the argument 
> that industry expanded in the 18th and 19th centuries because of developments 
> in transport.
>   
>   
>   
>   Industry expanded because of inventions and 
> the discovery of steam power.
>   Improvements in transport allowed goods to 
> be sold all over the country and all over the world so there were more 
> customers to develop industry for.
>   Developments in transport allowed 
> resources, such as coal from mines and cotton from America to come together 
> to manufacture products.
>   Transport only developed because industry 
> needed it. It was slow to develop as money was spent on improving roads, then 
> building canals and the replacing them with railways in order to keep up with 
> industry.
>   
>   
>   Now try to think of 2 more statements of your 
> own.
>   
> 
> 
>   
>   Main activity
>   
> 
> Learning path: 
> History
>   Learning Objective
>   
> To select evidence to support points
>   
>   What to do?
>   
>   Choose the 4 points that you think are most important - 
> try to be balanced by having two for and two 
> against.
> Write one in each of the point boxes of the 
> paragraphs on the sheet  class="link-internal">Constructing a balanced argument. You 
> might like to re write the points in your own words and use connectives to 
> link the paragraphs.
>   
> In history and in any argument, you need evidence 
> to support your points.
> Find evidence from these sources and from 
> your own knowledge to support each of your points:
> 
>  href="../servlet/link?template=vid¯o=setResource&resourceID=2044" 
> class="l

[jira] Updated: (SOLR-2346) Non UTF-8 text files having other than English texts (Japanese/Hebrew) are not getting indexed correctly.

2011-02-02 Thread Prasad Deshpande (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Deshpande updated SOLR-2346:
---

Attachment: sample_jap_non_UTF-8.txt
sample_jap_UTF-8.txt

I have verified the use case using the attached files.

> Non UTF-8 text files having other than English texts (Japanese/Hebrew) are not 
> getting indexed correctly.
> ---
>
> Key: SOLR-2346
> URL: https://issues.apache.org/jira/browse/SOLR-2346
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 1.4.1
> Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
> XP SP1, Machine is booted in Japanese Locale.
>Reporter: Prasad Deshpande
>Priority: Critical
> Attachments: sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt
>
>
> I am able to successfully index/search non-Engilsh files (like Hebrew, 
> Japanese) which was encoded in UTF-8. However, When I tried to index data 
> which was encoded in local encoding like Big5 for Japanese I could not see 
> the desired results. The contents after indexing looked garbled for Big5 
> encoded document when I searched for all indexed documents. When I index 
> attached non utf-8 file it indexes in following way
> - 
> - 
> - 
>   �� ��
>   
> - 
>   Big5
>   
> - 
>   zh
>   
> - 
>   zh
>   
> - 
>   17
>   
> - 
>   text/plain
>   
>   doc2
>   
>   
>   
> Here you said it index file in UTF8 however it seems that non UTF8 file gets 
> indexed in Big5 encoding.
> Here I tried fetching indexed data stream in Big5 and converted in UTF8.
> String id = (String) resulDocument.getFirstValue("attr_content");
> byte[] bytearray = id.getBytes("Big5");
> String utf8String = new String(bytearray, "UTF-8");
> It does not gives expected results.
> When I index UTF-8 file it indexes like following
> - 
> - 
>   マイ ネットワーク
>   
> - 
>   UTF-8
>   
> - 
>   text/plain
>   
> - 
>   sample_jap_unicode.txt
>   
> - 
>   28
>   
> - 
>   myfile
>   
> - 
>   text/plain
>   
>   doc2
>   
> So, I can index and search UTF-8 data.




[jira] Updated: (SOLR-2346) Non UTF-8 text files having other than English texts (Japanese/Hebrew) are not getting indexed correctly.

2011-02-02 Thread Prasad Deshpande (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Deshpande updated SOLR-2346:
---

Description: 
I am able to successfully index/search non-English files (like Hebrew, 
Japanese) that were encoded in UTF-8. However, when I tried to index data that 
was encoded in a local encoding like Big5 for Japanese, I could not see the 
desired results. The contents after indexing looked garbled for the 
Big5-encoded document when I searched for all indexed documents. When I index 
the attached non-UTF-8 file, it is indexed in the following way

- 
- 
- 
  �� ��
  
- 
  Big5
  
- 
  zh
  
- 
  zh
  
- 
  17
  
- 
  text/plain
  
  doc2
  
  
  

Here you said it indexes the file in UTF-8; however, it seems that the 
non-UTF-8 file gets indexed in the Big5 encoding.
Here I tried fetching the indexed data stream as Big5 and converting it to 
UTF-8:

String id = (String) resulDocument.getFirstValue("attr_content");
byte[] bytearray = id.getBytes("Big5");
String utf8String = new String(bytearray, "UTF-8");
It does not give the expected results.

When I index UTF-8 file it indexes like following

- 
- 
  マイ ネットワーク
  
- 
  UTF-8
  
- 
  text/plain
  
- 
  sample_jap_unicode.txt
  
- 
  28
  
- 
  myfile
  
- 
  text/plain
  
  doc2
  

So, I can index and search UTF-8 data.


For reference, below is the discussion with Yonik.
Please find attached the TXT file which I was using to index and search.

curl 
"http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8";
 -F "myfile=@sample_jap_non_UTF-8"


One problem is that you are giving Big5-encoded text to Solr and saying that 
it's UTF-8.
Here's one way to actually tell Solr what the encoding of the text you are 
sending is:

curl 
"http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true";
 --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; 
charset=big5'

Now the problem appears to be that, for some reason, this doesn't work...
Could you open a JIRA issue and attach your two test files?

-Yonik
http://lucidimagination.com





> Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are not 
> getting indexed correctly.
> ---
>
> Key: SOLR-2346
> URL: https://issues.apache.org/jira/browse/SOLR-2346
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 1.4.1
> Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows 
> XP SP1, Machine is booted in Japanese Locale.
>Reporter: Prasad Deshpande
>Priority: Critical
> Attachments: sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt
>
>

[jira] Issue Comment Edited: (LUCENE-2208) Token div exceeds length of provided text sized 4114

2011-02-02 Thread Hsiu Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989965#comment-12989965
 ] 

Hsiu Wang edited comment on LUCENE-2208 at 2/3/11 5:55 AM:
---

Added a patch (LUCENE-2208.patch) to fix 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

The exception is caused by HTML escape entities (e.g., &amp;), which are counted 
as 1 character by text.length() in Highlighter.getBestTextFragments, but as N 
characters by HTMLStripCharFilter (&amp; is counted as 5 characters). 

In the patch, I commented out an incorrect test case in 
HTMLStripCharFilterTest.testOffset() ("X & X ( X < > X"). The 
commented-out test case is covered by Robert's test patch. 
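The length mismatch described above can be seen with a toy example (the strings here are chosen for illustration and are not taken from the test suite):

```java
public class EntityOffsetDemo {
    public static void main(String[] args) {
        // The raw stored field, containing a 5-character HTML entity.
        String raw = "X &amp; X";
        // What the HTML-stripping filter hands to the tokenizer: the
        // entity collapses to a single '&'.
        String stripped = "X & X";

        // Token offsets are corrected to point back into the raw text,
        // so they can exceed the length of the stripped text; the two
        // lengths differ by 4 for every occurrence of the entity.
        System.out.println(raw.length());      // 9
        System.out.println(stripped.length()); // 5
    }
}
```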


  was (Author: hwang):
added patch(LUCENE-2208.patch) to fix 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

The exception is caused by HTML escape characters (e.g., &, &) which 
are counted as 1 character in text.length() in 
Highlighter.getBestTextFragments, but in HTMLStripCharfilter, they are counted 
as N characters(& counted as 5). 

In the patch, I commented out an incorrect test case in 
HTMLStripCharFilterTest.testOffset()("X & X ( X < > X"). The 
commented out test case is covered by Robert's test patch. 

  
> Token div exceeds length of provided text sized 4114
> 
>
> Key: LUCENE-2208
> URL: https://issues.apache.org/jira/browse/LUCENE-2208
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 3.0
> Environment:  diagnostics = {os.version=5.1, os=Windows XP, 
> lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=x86, 
> java.version=1.6.0_12, java.vendor=Sun Microsystems Inc.}
>
>Reporter: Ramazan VARLIKLI
> Attachments: LUCENE-2208.patch, LUCENE-2208_test.patch
>
>
> I have a doc which contains HTML code. I want to strip the HTML tags to make 
> the text clean, and then apply the highlighter on the clean text. But the 
> highlighter throws an exception if I strip out the HTML characters; if I 
> don't strip them out, it works fine. It just confuses me at the moment. 
> I copy-paste 3 things here from the console, as they may contain special 
> characters which might cause the problem.
> 1 -) Here is the html text 
>   Starter
>   
> 
> 
>  Learning path: History
>   Key question
>   Did transport fuel the industrial revolution?
>   Learning Objective
> 
>   To categorise points as for or against an argument
>   
> 
>   What to do?
>   
> Watch the clip: Transport fuelled the industrial 
> revolution.
>   
>   The clips claims that transport fuelled the industrial 
> revolution. Some historians argue that the industrial revolution only 
> happened because of developments in transport.
> 
>   Read the statements below and decide which 
> points are for and which points are against the argument 
> that industry expanded in the 18th and 19th centuries because of developments 
> in transport.
>   
>   
>   
>   Industry expanded because of inventions and 
> the discovery of steam power.
>   Improvements in transport allowed goods to 
> be sold all over the country and all over the world so there were more 
> customers to develop industry for.
>   Developments in transport allowed 
> resources, such as coal from mines and cotton from America to come together 
> to manufacture products.
>   Transport only developed because industry 
> needed it. It was slow to develop as money was spent on improving roads, then 
> building canals and the replacing them with railways in order to keep up with 
> industry.
>   
>   
>   Now try to think of 2 more statements of your 
> own.
>   
> 
> 
>   
>   Main activity
>   
> 
> Learning path: 
> History
>   Learning Objective
>   
> To select evidence to support points
>   
>   What to do?
>   
>   Choose the 4 points that you think are most important - 
> try to be balanced by having two for and two 
> against.
> Write one in each of the point boxes of the 
> paragraphs on the sheet  class="link-internal">Constructing a balanced argument. You 
> might like to r



[HUDSON] Lucene-Solr-tests-only-3.x - Build # 4428 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4428/

1 tests failed.
REGRESSION:  org.apache.lucene.search.payloads.TestPayloadTermQuery.test

Error Message:
NaN does not equal: 1

Stack Trace:
junit.framework.AssertionFailedError: NaN does not equal: 1
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977)
at 
org.apache.lucene.search.payloads.TestPayloadTermQuery.test(TestPayloadTermQuery.java:149)




Build Log (for compile errors):
[...truncated 4457 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4445 - Failure

2011-02-02 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4445/

1 tests failed.
REGRESSION:  org.apache.solr.TestGroupingSearch.testRandomGrouping

Error Message:
mismatch: 'PLPO'!='ZHNN' @ grouped/foo_i/groups/[0]/doclist/docs/[24]/id

Stack Trace:
junit.framework.AssertionFailedError: mismatch: 'PLPO'!='ZHNN' @ 
grouped/foo_i/groups/[0]/doclist/docs/[24]/id
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)
at 
org.apache.solr.TestGroupingSearch.testRandomGrouping(TestGroupingSearch.java:500)




Build Log (for compile errors):
[...truncated 8466 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2609) Generate jar containing test classes.

2011-02-02 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989995#comment-12989995
 ] 

Shai Erera commented on LUCENE-2609:


I plan to commit this sometime today or tomorrow if there are no objections. I 
will then make the same changes (+ move the extra files) to trunk.

> Generate jar containing test classes.
> -
>
> Key: LUCENE-2609
> URL: https://issues.apache.org/jira/browse/LUCENE-2609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Drew Farris
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
> LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch
>
>
> The test classes are useful for writing unit tests for code external to the 
> Lucene project. It would be helpful to build a jar of these classes and 
> publish them as a maven dependency.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2609) Generate jar containing test classes.

2011-02-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990002#comment-12990002
 ] 

Uwe Schindler commented on LUCENE-2609:
---

Hi Shai,
I will try this out today, also with Clover. Some changes may be needed in the 
Clover part of build.xml, as Clover needs to find the test source files to 
create the statistics!

> Generate jar containing test classes.
> -
>
> Key: LUCENE-2609
> URL: https://issues.apache.org/jira/browse/LUCENE-2609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.2
>Reporter: Drew Farris
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
> LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch
>
>
> The test classes are useful for writing unit tests for code external to the 
> Lucene project. It would be helpful to build a jar of these classes and 
> publish them as a maven dependency.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org