[jira] Issue Comment Edited: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929488#action_12929488 ] Aaron Powell edited comment on LUCENENET-380 at 11/8/10 4:50 AM: - I've created an external repository to make it easier for managing the testing of the different tools available for converting Java to .NET, available here: https://hg.slace.biz/lucene-porting Note this is only for finding a suitable tool for the conversion and will be rolled back to ASF once a tool is found. was (Author: slace): I've created an external repository to make it easier for managing the testing of the different tools available for converting Java to .NET, available here: https://bitbucket.org/slace/lucene-porting Note this is only for finding a suitable tool for the conversion and will be rolled back to ASF once a tool is found. Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929525#action_12929525 ] Aaron Powell commented on LUCENENET-380: I've started off the wiki over at bitbucket - http://hg.slace.biz/lucene-porting/wiki/Home It's also just a mercurial repo so anyone can update it and send back pull requests: http://hg.slace.biz/lucene-porting/wiki Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: lucene 4.0 release date
thank you. 2010/11/8 Uwe Schindler u...@thetaphi.de: You have to also use Solr 4.0 :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Li Li [mailto:fancye...@gmail.com] Sent: Monday, November 08, 2010 8:47 AM To: dev@lucene.apache.org; simon.willna...@gmail.com Subject: Re: lucene 4.0 release date thank you. So if I want to use a new compress/decompress algorithm, I must use lucene 4.0 from svn? Is there any patch for an old release such as 2.9? Because I need solr 1.4, which is based on lucene 2.9. 2010/11/8 Simon Willnauer simon.willna...@googlemail.com: Li Li, there is no official / unofficial release date for lucene 4.0. If you want to use the latest and greatest features you need to check out trunk or use a nightly build. My guess would be that there are at least 6 to 8 months to the next release, but I could be wrong (more likely it might take even longer). For PFoR etc. you should look into: https://issues.apache.org/jira/browse/LUCENE-1410 https://issues.apache.org/jira/browse/LUCENE-2723 to get started - and read Mike's blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html There is also S9 https://issues.apache.org/jira/browse/LUCENE-2189 and GroupVInt impls https://issues.apache.org/jira/browse/LUCENE-2735 simon On Mon, Nov 8, 2010 at 4:59 AM, Li Li fancye...@gmail.com wrote: hi all, when will lucene 4.0 be released? I want to replace VInt compression with a faster scheme such as PForDelta. In my application, decompressing a docList of 10M takes about 300ms. In "Performance of Compressed Inverted List Caching in Search Engines" (J. Zhang and X. Long, 17th International World Wide Web Conference (WWW), April 2008), the authors say PForDelta is much faster than VInt. I also found a Java implementation at http://code.google.com/p/integer-array-compress-kit/ ; its speed is about 500M ints/sec. But to achieve this, I would have to modify the index file format. I found http://wiki.apache.org/lucene-java/FlexibleIndexing ; lucene 4.0 will support more flexible index formats. I want to know when it will be released, to decide whether to wait for it or do it myself. Thank you. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
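For readers following the VInt vs. PForDelta comparison in the thread above: the heart of the difference is the decode loop. Below is a minimal Java sketch of the classic variable-byte decode, with the same semantics as Lucene's IndexInput.readVInt(); the per-byte read and data-dependent branch are what block-oriented codecs like PForDelta avoid by unpacking whole fixed-width frames at once. This is an illustrative sketch, not the thread's benchmark code.

import java.io.DataInput;
import java.io.IOException;

public final class VIntExample {
    // Mirrors Lucene's readVInt(): 7 payload bits per byte, high bit set
    // on every byte except the last one of each encoded integer.
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}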
Re: lucene 4.0 release date
A question about the svn structure of lucene: I visited http://svn.apache.org/repos/asf/lucene/ and it contains many things: .htaccess board-reports/ dev/ java/ lucene.net/ mahout/ openrelevance/ pylucene/ sandbox/ site/ solr/ I just want to use lucene/java + solr. Which directories should I check out? It seems http://svn.apache.org/repos/asf/lucene/dev/ is the currently developed version and http://svn.apache.org/repos/asf/lucene/java/ is the old version from before 3.0. So do I just need http://svn.apache.org/repos/asf/lucene/dev/? 2010/11/8 Li Li fancye...@gmail.com: [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: lucene 4.0 release date
Apache Lucene and Apache Solr merged to one checkout at: http://svn.apache.org/repos/asf/lucene/dev/ The combined projects now share the same version numbers: Lucene 3.x: - http://svn.apache.org/repos/asf/lucene/dev/branches/branch3.x Lucene trunk: - http://svn.apache.org/repos/asf/lucene/dev/trunk - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Li Li [mailto:fancye...@gmail.com] Sent: Monday, November 08, 2010 9:19 AM To: dev@lucene.apache.org Subject: Re: lucene 4.0 release date [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Antw.: Solr-3.x - Build # 160 - Failure
On Mon, Nov 8, 2010 at 7:26 AM, Uwe Schindler u...@thetaphi.de wrote: No updates on the Hudson issue until now. What should we do? Disable Clover report generation for now? +1 - test / CI-build success is more important to me! I have no idea what else we could do. Uwe --- Uwe Schindler Generics Policeman Bremen, Germany - Reply message - From: Apache Hudson Server hud...@hudson.apache.org Date: Mon., Nov. 8, 2010 06:55 Subject: Solr-3.x - Build # 160 - Failure To: dev@lucene.apache.org Build: https://hudson.apache.org/hudson/job/Solr-3.x/160/ All tests passed Build Log (for compile errors): [...truncated 18776 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Antw.: Solr-3.x - Build # 160 - Failure
-1. For test success we already have other running and working builds. For me the clover report is more important, and that one works! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:04 AM To: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Antw.: Solr-3.x - Build # 160 - Failure
On Mon, Nov 8, 2010 at 10:14 AM, Uwe Schindler u...@thetaphi.de wrote: -1. For test success we already have other running and working builds. For me the clover report is more important, and that one works! Ah, you are right, we have other builds - that still confuses me, never mind. But I disagree that a broken clover is important; it's just annoying. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:04 AM To: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Antw.: Solr-3.x - Build # 160 - Failure
Clover is not broken, only the Hudson plugin that links the clover report in the workspace. And it is important to have at least one version of the clover report; I use it quite often to verify coverage. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:44 AM To: Uwe Schindler Cc: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Antw.: Solr-3.x - Build # 160 - Failure
We got a response to our Clover Hudson bug (see attached mail). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:44 AM To: Uwe Schindler Cc: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] ---BeginMessage--- [ http://issues.hudson-ci.org/browse/HUDSON-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stubbs updated HUDSON-7836: --- Attachment: HUDSON-7836-stacktrace.txt Stack trace from the master Hudson's log at the time the build failed with this error. Clover and cobertura parsing on hudson master fails because of invalid XML -- Key: HUDSON-7836 URL: http://issues.hudson-ci.org/browse/HUDSON-7836 Project: Hudson Issue Type: Bug Components: clover, cobertura Affects Versions: current Reporter: thetaphi Assignee: stephenconnolly Priority: Critical Attachments: HUDSON-7836-stacktrace.txt Since a few days ago, on our Apache Hudson installation, parsing of Clover's clover.xml or Cobertura's coverage.xml file fails (but not in all cases; sometimes it simply passes with the same build and same job configuration). This only happens after transferring to the master; the reports and xml file are created on a Hudson slave. It seems like the network code somehow breaks the xml file during transfer to the master. Downloading the clover.xml from the workspace to my local computer and validating it confirms that it is correctly formatted and has no XML parse errors. - Here are errors that appear during clover publishing: [https://hudson.apache.org/hudson/job/Lucene-trunk/1336/console] - For cobertura: [https://hudson.apache.org/hudson/view/Directory/job/dir-shared-metrics/34/console] -- This message is automatically generated by JIRA.
---End Message--- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Rethinking spatial implementation
Some questions: @Grant: Can you clarify what you mean with the Sinusoidal projection is broken? Would it be possible to use a LGPL library like the Java Topology Suite (JTS: http://www.vividsolutions.com/jts/JTSHome.htm)? Neo4j is using JTS for creating a spatial index (code is here: https://github.com/neo4j/neo4j-spatial)... (I've just seen that JTS has some index creation classes, but I'm not at all familiar with them) Christopher On Mon, Nov 8, 2010 at 1:10 AM, Grant Ingersoll gsing...@apache.org wrote: On Nov 6, 2010, at 5:23 PM, Christopher Schmidt wrote: Hi Ryan, thx for your answer. You mean there is room for improvement and volunteers? We've been looking at replacing it with the Military Grid system. The primary issue with the current is that the Sinusoidal projection is broken which then breaks almost all the tests. I worked on it for a while trying to straighten it out, but gave up and now think it is easier to implement clean. I definitely would like to see a tier/grid implementation. On Friday, November 5, 2010, Ryan McKinley ryan...@gmail.com wrote: Hi Christopher - I do not believe there is any active work on this. From what I understand, the Tier implementation works OK within some constraints, but we could not get it to pass more robust testing that the other methods were using. However, LatLonType and GeoHashField are well tested and work well -- the Tier type may have better performance when your index is really large, but no active developers understand it and no-one has stepped up to figure it out. ryan On Wed, Nov 3, 2010 at 3:16 PM, Christopher Schmidt fakod...@googlemail.com wrote: Hi all, I saw a mail thread Rethinking Cartesian Tiers implementation (here). Is there any work in progress regarding this? If yes, is the current implementation deprecated or do you plan some enhancements (other projections or spatial indexes) ? I am asking because I want to use Lucene's spatial indexing in a production system... -- Christopher twitter: @fakod blog: http://blog.fakod.eu - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Christopher twitter: @fakod blog: http://blog.fakod.eu - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Christopher twitter: @fakod blog: http://blog.fakod.eu
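For context on the projection being asked about above: the textbook sinusoidal (Sanson-Flamsteed) projection is shown below as a small Java sketch. This is the standard formula (the one on Wikipedia), not the contrib/spatial implementation under discussion, so it only illustrates what that code is expected to compute.

public final class SinusoidalExample {
    // x compresses east-west distance by cos(latitude); y is latitude itself.
    static double[] project(double lonDeg, double latDeg, double centralMeridianDeg) {
        double lat = Math.toRadians(latDeg);
        double lon = Math.toRadians(lonDeg - centralMeridianDeg);
        return new double[] { lon * Math.cos(lat), lat };
    }
}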
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929556#action_12929556 ] M Alexander commented on LUCENE-2745: - {quote} I think that ArabicLetterTokenizer, which is the tokenizer used by ArabicAnalyzer, is obsolete (as of version 3.1), since StandardTokenizer, which implements the Unicode word segmentation rules from UAX#29, should be able to properly tokenize Arabic. StandardTokenizer recognizes email addresses, hostnames, and URLs, so your concern would be addressed. (See LUCENE-2167, though, which was just reopened to turn off full URL output.) You can test this by composing your own analyzer, if you're willing to try using the as-yet-unreleased branch_3x, from which 3.1 will be cut (hopefully fairly soon): just copy the ArabicAnalyzer class and swap in StandardTokenizer for ArabicLetterTokenizer {quote} I tried to test this and failed (miserably). I think I struggled to patch LUCENE-2167 correctly through my Eclipse setup. I might just wait for the branch_3x release to make my life easier. I will then create my own Analyzer to perform Arabic text analysis and another one for Farsi text analysis. Both Analyzers will have the ability to handle diacritics as well as email addresses, hostnames and so on. I will close this issue for now (and will re-open it in the future if needed). Quick question - any thoughts on handling Arabic email addresses and hostnames in the future? Thanks to both of you for the time taken; I shall wait for the branch release to solve my issue. ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2 Environment: All Reporter: M Alexander The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example, a...@hotmail.com will be tokenised to [adam] [hotmail] [com]. It would be great if the ArabicAnalyzer could tokenise this to [a...@hotmail.com]. The same applies to hostnames and so on. Can this be resolved? I hope so. Thanks, MAA -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
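For anyone wanting to try the swap Steven describes without waiting for the release, a rough sketch is below. It is written against branch_3x-era APIs (Version.LUCENE_31 and the contrib analyzers in org.apache.lucene.analysis.ar) and omits stopword handling to stay short; treat it as a starting point, not a finished ArabicAnalyzer replacement.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
import org.apache.lucene.analysis.ar.ArabicStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StandardArabicAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer (UAX#29 rules) in place of ArabicLetterTokenizer.
        TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
        stream = new ArabicNormalizationFilter(stream); // normalize letter forms, strip diacritics
        return new ArabicStemFilter(stream);            // light Arabic stemming
    }
}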
[jira] Closed: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M Alexander closed LUCENE-2745. --- Resolution: Later Will wait for the release, which should have the solution within it. ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929558#action_12929558 ] M Alexander commented on LUCENE-2745: - Oh, do you have a rough timing of the branch_3X release date? ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2 Environment: All Reporter: M Alexander The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example, a...@hotmail.com will be tokenised to [adam] [hotmail] [com] It would be great if the ArabicAnalyzer can tokenises this to [a...@hotmail.com]. The same applies to hostnames and so on. Can this be resolved? I hope so Thanks MAA -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929560#action_12929560 ] Robert Muir commented on LUCENE-2167: - {quote} In theory, you should just feed the initial text as a single monster token from hell into the analysis chain, and then you only have TokenFilters, none/one/some of which might split this token. If there are no TokenFilters at all, you get a NOT_ANALYZED case without extra flags, yahoo! The only problem here is the need for the ability to wrap an arbitrary Reader in a TermAttribute :/ {quote} No thanks, I don't want to read my entire documents into RAM and have massive gc'ing going on. We don't need to have a mega-tokenizer that solves everyone's problems... this is just supposed to be a good general-purpose tokenizer. Implement StandardTokenizer with the UAX#29 Standard Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Affects Versions: 3.1, 4.0 Reporter: Shyamal Prasad Assignee: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex Original Estimate: 0.5h Remaining Estimate: 0.5h It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with jflex. Then its name would actually make sense. Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims: bq. This should be a good tokenizer for most European-language documents The new StandardTokenizer could then say bq. This should be a good tokenizer for most languages. All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay with that EuropeanTokenizer, and it could be used by the european analyzers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929566#action_12929566 ] Steven Rowe commented on LUCENE-2745: - bq. Oh, do you have a rough timing of the branch_3X release date? Wild guess: January 2011 ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2 Environment: All Reporter: M Alexander The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example, a...@hotmail.com will be tokenised to [adam] [hotmail] [com] It would be great if the ArabicAnalyzer can tokenises this to [a...@hotmail.com]. The same applies to hostnames and so on. Can this be resolved? I hope so Thanks MAA -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929567#action_12929567 ] George Aroush commented on LUCENENET-380: - A few points: 1) Work on ASF projects needs to be done at ASF. Please use this JIRA issue and the mailing list to communicate questions, report progress and share results. 2) The converted files need to be attached to this JIRA issue, so we have a record of them and they can be evaluated by all. 3) Prescott's point about highlighting pre-/post-processing work is a good and important one. Please write this up as you work on this task. 4) More than one person can work on this JIRA issue; just keep everyone posted. My expected outcome of this JIRA issue is: 1) What pre-/post-processing did you use, if any? It would also help to show the raw output with and without the pre-processing. 2) How close is the result for those 5 attached files to the existing converted C# files? This includes the layout of the code (was anything lost or considerably changed?) but, most importantly, are the public APIs consistent? The reason I picked those 5 files is that they are the ones JLCA had some of the most issues with, so they should be a good barometer of how Sharpen does. Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-2222) Merge duplicates documents with uniqueKey
Merge duplicates documents with uniqueKey - Key: SOLR-2222 URL: https://issues.apache.org/jira/browse/SOLR-2222 Project: Solr Issue Type: Bug Affects Versions: 1.4.1 Reporter: Andreas Laager When merging one core into another, one can get multiple documents for one uniqueKey. As a result the facet counts are wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2222) Merge duplicates documents with uniqueKey
[ https://issues.apache.org/jira/browse/SOLR-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929574#action_12929574 ] Koji Sekiguchi commented on SOLR-2222: -- I think this is expected behavior, because Solr just calls Lucene's IndexWriter.addIndexes() to merge indexes and Lucene doesn't care about uniqueKeys. Merge duplicates documents with uniqueKey - Key: SOLR-2222 URL: https://issues.apache.org/jira/browse/SOLR-2222 Project: Solr Issue Type: Bug Affects Versions: 1.4.1 Reporter: Andreas Laager When merging one core into another, one can get multiple documents for one uniqueKey. As a result the facet counts are wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
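To make Koji's point concrete: the uniqueKey semantics live in the add path, which goes through IndexWriter.updateDocument (a delete-then-add), while core merging is a raw segment copy that never looks at field values. A sketch against the Lucene 2.9 API used by Solr 1.4; the field name "id" is just an example:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public final class MergeDedupExample {
    // The dedup-aware path: what happens on a normal Solr add.
    static void addWithDedup(IndexWriter writer, Document doc, String key) throws IOException {
        writer.updateDocument(new Term("id", key), doc); // removes any older doc with this id
    }

    // The raw merge path: duplicates in `other` survive untouched.
    // (addIndexesNoOptimize is the 2.9-era name; later versions call it addIndexes.)
    static void mergeCore(IndexWriter writer, Directory other) throws IOException {
        writer.addIndexesNoOptimize(new Directory[] { other });
    }
}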
Re: Rethinking spatial implementation
Neo4j is using JTS for creating a spatial index (code is here: https://github.com/neo4j/neo4j-spatial)... (I've just seen that JTS has some index creation classes, but I'm not at all familiar with them) JTS does not have a spatial index -- it is good for spatial operations (check if some shape is within/intersects/etc another shape) In Neo4j, they use JTS to build an RTree that is stored in their native graph format: https://github.com/neo4j/neo4j-spatial/blob/master/src/main/java/org/neo4j/gis/spatial/RTreeIndex.java Building an RTree in lucene is a bit more difficult since we can not easily update the value of a given field. I'd like to figure some way to do this though. ryan - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
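A tiny example of the role Ryan describes for JTS above (predicates over shapes, not index structures), using the com.vividsolutions.jts API of that era:

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;

public final class JtsPredicateExample {
    public static void main(String[] args) {
        GeometryFactory gf = new GeometryFactory();
        Geometry point = gf.createPoint(new Coordinate(-122.4, 37.8));
        // A bounding box as a closed ring (first coordinate == last).
        Geometry box = gf.createPolygon(gf.createLinearRing(new Coordinate[] {
            new Coordinate(-123, 37), new Coordinate(-122, 37),
            new Coordinate(-122, 38), new Coordinate(-123, 38),
            new Coordinate(-123, 37) }), null);
        System.out.println(box.contains(point));   // true
        System.out.println(box.intersects(point)); // true
    }
}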
Re: Rethinking spatial implementation
Hi All, FYI, Apache SIS [1], currently incubating, is working on building an ASLv2-licensed library comparable to JTS or GeoTools. You'll notice (or at least I did) that most of the GIS-related libs out there are GPL or LGPL, so I decided to do something about it. If anyone else is interested in joining the cause, we'd welcome you over there. At present, we have code that implements QuadTree storage and does point-radius and bounding box computations, as well as a RESTful web service to handle spatial location based on those 2 methods. We're close to making an 0.1-incubating release. Cheers, Chris [1] http://incubator.apache.org/sis/ On 11/8/10 2:40 AM, Chris Male gento...@gmail.com wrote: Hi, I'll jump in and give my opinion: Can you clarify what you mean by "the Sinusoidal projection is broken"? Inside Spatial Lucene's Cartesian codebase is an implementation of the Sinusoidal projection. Grant discovered, while working on improving the test coverage of the code, that the implementation doesn't actually match the formula specified on Wikipedia. When we tried to change it, many tests broke, since the overall logic somehow depends on this broken implementation. Would it be possible to use an LGPL library like the Java Topology Suite (JTS: http://www.vividsolutions.com/jts/JTSHome.htm)? This is something we've talked about using. I think it would be nice to offload some of the geography-specific code from Lucene, so using another library would be good. At the same time it limits our options for optimizations and the like. I'm certainly looking into it though. Thanks, Chris [...] ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
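The point-radius computation mentioned for Apache SIS reduces to a great-circle distance test; below is a generic haversine sketch (not SIS code) for readers unfamiliar with the operation:

public final class PointRadiusExample {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                   * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    static boolean withinRadius(double lat1, double lon1,
                                double lat2, double lon2, double radiusKm) {
        return haversineKm(lat1, lon1, lat2, lon2) <= radiusKm;
    }
}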
Re: Antw.: Solr-3.x - Build # 160 - Failure
On Mon, Nov 8, 2010 at 4:44 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Mon, Nov 8, 2010 at 10:14 AM, Uwe Schindler u...@thetaphi.de wrote: -1. For test success we already have other running and working builds. For me the clover report is more important, and that one works! Ah, you are right, we have other builds - that still confuses me, never mind. But I disagree that a broken clover is important; it's just annoying. When it works, it works... I don't think we should disable it; it's useful for finding untested things / bugs. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2222) Merge duplicates documents with uniqueKey
[ https://issues.apache.org/jira/browse/SOLR-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929577#action_12929577 ] Andreas Laager commented on SOLR-2222: -- I've read that Lucene does not care about the unique key. But where does the uniqueKey configuration in the schema.xml come from? Is that part of SOLR? If yes, then SOLR should also care about it when merging cores. Our system uses solr with a live core dedicated to inserts that gets merged into a search core from time to time; we expect better search performance out of this. I expect a negative performance impact if I have to handle all the duplicated documents after the merge. Merge duplicates documents with uniqueKey - Key: SOLR-2222 URL: https://issues.apache.org/jira/browse/SOLR-2222 Project: Solr Issue Type: Bug Affects Versions: 1.4.1 Reporter: Andreas Laager When merging one core into another, one can get multiple documents for one uniqueKey. As a result the facet counts are wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929587#action_12929587 ] Earwin Burrfoot commented on LUCENE-2167: - bq. No thanks, I don't want to read my entire documents into RAM and have massive gc'ing going on. This is obvious. And that's why I was talking about wrapping the Reader in an Attribute, not copying its contents. How to do so is much less obvious. And that's why I called it a problem. bq. We don't need to have a mega-tokenizer that solves everyone's problems... this is just supposed to be a good general-purpose tokenizer. Exactly. That's why I'm thinking of a way to get some composability, instead of having to fully rewrite the tokenizer once you want extras. Implement StandardTokenizer with the UAX#29 Standard Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929590#action_12929590 ] Peter Karich commented on SOLR-792: --- Hi Toke and all, maybe I am a bit evil or stupid, but could someone enlighten me as to why this patch is necessary? Why can't we use the existing mechanisms in Solr (facets!) plus a bit of logic at indexing time: http://markmail.org/message/2aza6nnsiw3l4bbb#query:+page:1+mid:3j3ttojacpjoyfg5+state:results This approach has no performance problems when using tons of categories; we are already using it with lots of categories. It works out of the box with nearly unlimited depth (either you need a DB, which is unlimited, or the URL length is the limit). The only drawback of this approach is that you won't be able to display two or more 'branches' at the same time. Only one current branch with the currently possible categories is shown, which is no limitation in our case, because the UI would be unusable if too many items were visible at the same time. One could introduce a special update component for this feature which uses a category tree (in RAM) built from a json or xml definition. I could create such a component if someone is interested. Regards, Peter. Pivot (ie: Decision Tree) Faceting Component Key: SOLR-792 URL: https://issues.apache.org/jira/browse/SOLR-792 Project: Solr Issue Type: New Feature Reporter: Erik Hatcher Assignee: Yonik Seeley Priority: Minor Attachments: SOLR-792-as-helper-class.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch A component to do multi-level faceting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
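A sketch of the index-time trick Peter refers to: encode every ancestor path of a document's category, depth-prefixed, into an ordinary multivalued string field, then drill down with plain facets plus facet.prefix. Field names here are hypothetical; the SolrJ calls are the standard 1.4-era API.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrInputDocument;

public final class PathFacetExample {
    static SolrInputDocument docForCategory() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "prod-42");
        // One token per level; the numeric prefix pins the depth.
        doc.addField("cat_path", "0/Electronics");
        doc.addField("cat_path", "1/Electronics/Cameras");
        doc.addField("cat_path", "2/Electronics/Cameras/DSLR");
        return doc;
    }

    // Children of Electronics only: facet on the path field, restricted
    // to depth-1 entries under the current branch.
    static SolrQuery childrenOfElectronics() {
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("cat_path");
        q.setFacetPrefix("1/Electronics/");
        return q;
    }
}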
[jira] Commented: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929596#action_12929596 ] Grant Ingersoll commented on SOLR-792: -- Hi Peter, I like to think of it as "what if" faceting; it doesn't require the categories to be defined up front. You can solve this through hierarchical faceting, too, but this (pivot) approach doesn't require a traditional relationship description like hierarchical faceting does. Pivot (ie: Decision Tree) Faceting Component Key: SOLR-792 URL: https://issues.apache.org/jira/browse/SOLR-792 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929606#action_12929606 ] Toke Eskildsen commented on SOLR-792: - I'd be interested to hear what the focus of SOLR-792 is, as opposed to SOLR-64. Or to put it another way: If SOLR-64 was adapted to accept a list of fields for the hierarchy, what would the purpose of SOLR-792 be? Pivot (ie: Decision Tree) Faceting Component Key: SOLR-792 URL: https://issues.apache.org/jira/browse/SOLR-792 Project: Solr Issue Type: New Feature Reporter: Erik Hatcher Assignee: Yonik Seeley Priority: Minor Attachments: SOLR-792-as-helper-class.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch A component to do multi-level faceting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2746) Implement PMC Branding Guidelines
Implement PMC Branding Guidelines - Key: LUCENE-2746 URL: https://issues.apache.org/jira/browse/LUCENE-2746 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Per the Trademark committee's Branding Requirements, there are a number of things we need to do across our projects to comply. See http://www.apache.org/foundation/marks/pmcs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2746) Implement PMC Branding Guidelines
[ https://issues.apache.org/jira/browse/LUCENE-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-2746: Attachment: LUCENE-2746.patch Work in the guidelines. Implement PMC Branding Guidelines - Key: LUCENE-2746 URL: https://issues.apache.org/jira/browse/LUCENE-2746 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: LUCENE-2746.patch Per the Trademark committee's Branding Requirements, there are a number of things we need to do across our projects to comply. See http://www.apache.org/foundation/marks/pmcs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929622#action_12929622 ] Grant Ingersoll commented on LUCENENET-379: --- Please see https://issues.apache.org/jira/browse/LUCENE-2746. Also, keep in mind we will probably be dumping Forrest at some point in the near future in favor of the ASF house CMS. Clean up Lucene.Net website --- Key: LUCENENET-379 URL: https://issues.apache.org/jira/browse/LUCENENET-379 Project: Lucene.Net Issue Type: Task Reporter: George Aroush The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the incubation, out of date design. This JIRA task is to bring it up to date with other ASF project's web page. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adopting it for Lucene.Net. Some examples, https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Lucene-Solr-tests-only-trunk - Build # 1135 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1135/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize Error Message: expected:248 but was:256 Stack Trace: junit.framework.AssertionFailedError: expected:248 but was:256 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:119) at org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:141) Build Log (for compile errors): [...truncated 3079 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929629#action_12929629 ] Jason Rutherglen commented on LUCENE-2680: -- I'm running test-core multiple times and am seeing some lurking test failures (thanks to the randomized tests that have been recently added). I'm guessing they're related to the syncs on IW and DW not being in sync some of the time. I will clean up the patch so that others may properly review it and hopefully we can figure out what's going on. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
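For illustration, a minimal sketch of the generation scheme the description above outlines (class and method names are hypothetical, not taken from any patch on this issue): each merge pinches off the current buffer, and the segment the merge produces is stamped with the next generation so already-applied buffers never touch it again.

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Term;

// Hypothetical sketch of generation-stamped buffered deletes.
class BufferedDeletes {
  final long gen;  // generation this buffer belongs to
  final Map<Term,Integer> terms = new HashMap<Term,Integer>();
  BufferedDeletes(long gen) { this.gen = gen; }
}

class DeleteGenerations {
  private long nextGen;
  private final List<BufferedDeletes> frozen = new ArrayList<BufferedDeletes>();
  private BufferedDeletes current = new BufferedDeletes(nextGen++);

  // Buffer a delete-by-term against the current generation.
  synchronized void delete(Term t, int docIDUpto) {
    current.terms.put(t, docIDUpto);
  }

  // Called as a merge kicks off: freeze the current buffer and return
  // the generation to stamp on the segment the merge will produce.
  synchronized long pinchOff() {
    frozen.add(current);
    current = new BufferedDeletes(nextGen++);
    return current.gen;
  }

  // Buffers still applicable to a segment created at segmentGen: only
  // generations at or after its creation, so a newly merged segment
  // never re-applies deletes that were already folded into the merge.
  synchronized List<BufferedDeletes> applicable(long segmentGen) {
    List<BufferedDeletes> out = new ArrayList<BufferedDeletes>();
    for (BufferedDeletes b : frozen)
      if (b.gen >= segmentGen) out.add(b);
    return out;
  }
}
{code}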
Lucene-Solr-tests-only-trunk - Build # 1137 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1137/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:201) Build Log (for compile errors): [...truncated 8752 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant thanks ryan On Sat, Sep 25, 2010 at 5:42 PM, mark harwood markharw...@yahoo.co.uk wrote: Both these on disk data structures and the ones in a B+ tree have seek offsets into files that require disk seeks. And both could use document ids as key values. Yep. However my approach doesn't use a doc id as a key that is searched in any B+ tree index (which involves disk seeks) - it is used as a direct offset into a file to get the pointer into a links data structure. But do these disk data structures support dynamic addition and deletion of (larger numbers of) document links? Yes, the slide deck I linked to shows how links (like documents) spend the early stages of life being merged frequently in the smaller, newer segments and over time migrate into larger, more stable segments as part of Lucene transactions. That's the theory - I'm currently benchmarking an early prototype. - Original Message From: Paul Elschot paul.elsc...@xs4all.nl To: dev@lucene.apache.org Sent: Sat, 25 September, 2010 22:03:28 Subject: Re: Document links On Saturday 25 September 2010 15:23:39, Mark Harwood wrote: My starting point in the solution I propose was to eliminate linking via any type of key. Key lookups mean indexes and indexes mean disk seeks. Graph traversals have exponential numbers of links and so all these index disk seeks start to stack up. The solution I propose uses doc ids as more-or-less direct pointers into file structures avoiding any index lookup. I've started coding up some tests using the file structures I outlined and will compare that with a traditional key-based approach. Both these on disk data structures and the ones in a B+ tree have seek offsets into files that require disk seeks. And both could use document ids as key values. But do these disk data structures support dynamic addition and deletion of (larger numbers of) document links? B+ trees are a standard solution for problems like this one, and it would probably not be easy to outperform them. It may be possible to improve performance of B+ trees somewhat by specializing for the fairly simple keys that would be needed, and by encoding very short lists of links for a single document directly into a seek offset to avoid the actual seek, but that's about it. Regards, Paul Elschot For reference - playing the Kevin Bacon game on a traditional Lucene index of IMDB data took 18 seconds to find a short path that Neo4j finds in 200 milliseconds on the same data (and this was a disk-based graph of 3m nodes, 10m edges). Going from actor->movies->actors->movies produces a lot of key lookups and the difference between key indexes and direct node pointers becomes clear. I know path finding analysis is perhaps not a typical Lucene application but other forms of link analysis e.g. recommendation engines require similar performance. Cheers Mark On 25 Sep 2010, at 11:41, Paul Elschot wrote: On Friday 24 September 2010 17:57:45, mark harwood wrote: While not exactly equivalent, it reminds me of our earlier discussion around layered segments for dealing with field updates Right. Fast discovery of document relations is a foundation on which lots of things like this can build. Relations can be given types to support a number of different use cases. How about using this (bsd licenced) tree as a starting point: http://bplusdotnet.sourceforge.net/ It has various keys: a.o. byte array, String and long. 
A fixed-size byte array as a key seems to be just fine: two bytes for a field number, four for the segment number and four for the in-segment document id. The separate segment number would allow minimizing the updates in the tree during merges. One could also use the normal doc id directly. The value could then be similar to the key, but without the field number, and with an indication of the direction of the link. Or perhaps the direction of the link should be added to the key. A link would be present twice, once for each direction. Also both directions could have their own payloads. It could be put in its own file as a separate 'segment', or maybe each segment could allow for allocation of a part of this tree. I like this somehow; if it is done right one might never need a relational database again. Well, almost... Regards, Paul Elschot - Original Message From: Grant Ingersoll gsing...@apache.org To: dev@lucene.apache.org Sent: Fri, 24 September, 2010 16:26:27 Subject: Re: Document links While not exactly equivalent, it reminds me of our earlier discussion around layered segments for dealing with field updates [1], [2], albeit this is a bit more generic since one could not only use the links for relating documents, but one could use special links
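As a concrete sketch of the fixed-size key layout Paul describes above (field number in two bytes, segment number in four, in-segment doc id in four; the helper name is made up):

{code}
import java.nio.ByteBuffer;

// Pack a 10-byte link key: field (2 bytes) + segment (4) + doc id (4).
// ByteBuffer's default big-endian order keeps lexicographic byte order
// consistent with numeric order for non-negative values, which is what
// a B+ tree comparing raw byte arrays relies on.
static byte[] linkKey(short fieldNum, int segment, int docId) {
  return ByteBuffer.allocate(10)
      .putShort(fieldNum)
      .putInt(segment)
      .putInt(docId)
      .array();
}
{code}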
Re: Document links
I came to the conclusion that the transient meaning of document ids is too deeply ingrained in Lucene's design to use them to underpin any reliable linking. While it might work for relatively static indexes, any index with a reasonable number of updates or deletes will invalidate any stored document references in ways which are very hard to track. Lucene's compaction shuffles IDs without taking care to preserve identity, unlike graph DBs like Neo4j (see recycling IDs here: http://goo.gl/5UbJi ) Cheers, Mark - Original Message From: Ryan McKinley ryan...@gmail.com To: dev@lucene.apache.org Sent: Mon, 8 November, 2010 19:03:59 Subject: Re: Document links Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant thanks ryan
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929702#action_12929702 ] John Wang commented on LUCENE-2729: --- zoie does not touch index files, only adds an index.directory file containing version information. Index corruption after 'read past EOF' under heavy update load and snapshot export -- Key: LUCENE-2729 URL: https://issues.apache.org/jira/browse/LUCENE-2729 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1, 3.0.2 Environment: Happens on both OS X 10.6 and Windows 2008 Server. Integrated with zoie (using a zoie snapshot from 2010-08-06: zoie-2.0.0-snapshot-20100806.jar). Reporter: Nico Krijnen Attachments: 2010-11-02 IndexWriter infoStream log.zip We have a system running lucene and zoie. We use lucene as a content store for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled backups of the index. This works fine for small indexes and when there are not a lot of changes to the index when the backup is made. On large indexes (about 5 GB to 19 GB), when a backup is made while the index is being changed a lot (lots of document additions and/or deletions), we almost always get a 'read past EOF' at some point, followed by lots of 'Lock obtain timed out'. At that point we get lots of 0 KB files in the index, data gets lost, and the index is unusable. When we stop our server, remove the 0 KB files and restart our server, the index is operational again, but data has been lost. I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. Hopefully someone has some ideas where to look to fix this. Some more details... Stack trace of the read past EOF and following Lock obtain timed out: {code} 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39) at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37) at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245) at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166) at org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725) at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987) at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973) at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162) at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003) at proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203) at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223) at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373) 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem copying segments: Lock 
obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:84) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:957) at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176) at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228) at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) at
[jira] Created: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
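For reference, the suggested char filter can be sketched with the existing MappingCharFilter; a minimal, untested fragment (the Version constant is assumed, since 3.1 is not yet released):

{code}
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Map ZWNJ (U+200C) to a space before UAX#29 tokenization, so a
// converted PersianAnalyzer keeps breaking tokens on ZWNJ.
NormalizeCharMap map = new NormalizeCharMap();
map.add("\u200C", " ");
Reader in = new StringReader("...");  // Persian text containing ZWNJ
Tokenizer tok = new StandardTokenizer(Version.LUCENE_31,
    new MappingCharFilter(map, CharReader.get(in)));
{code}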
Re: Document links
On Monday 08 November 2010 20:03:59 Ryan McKinley wrote: Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant Did you consider merging the bits of each dimension into a NumericField? For example: one dimension a0 a1 .. an and a second dimension b0 b1 .. bn into a0 b0 a1 b1 .. an bn and then index this number as a NumericField. Regards, Paul Elschot - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
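A small sketch of the bit-merging Paul describes, folding two 32-bit dimensions into one 64-bit value that could then be indexed via NumericField (helper name made up):

{code}
// Interleave the bits of a and b as a0 b0 a1 b1 .. a31 b31,
// with a's bits in the even positions and b's in the odd ones.
static long interleave(int a, int b) {
  long result = 0L;
  for (int i = 0; i < 32; i++) {
    result |= (((long) (a >>> i)) & 1L) << (2 * i);
    result |= (((long) (b >>> i)) & 1L) << (2 * i + 1);
  }
  return result;
}
{code}

Points close in both dimensions then share long bit prefixes, which is the property geohash-style range queries exploit.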
Re: Document links
On Mon, Nov 8, 2010 at 3:20 PM, Paul Elschot paul.elsc...@xs4all.nl wrote: On Monday 08 November 2010 20:03:59 Ryan McKinley wrote: Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant Did you consider merging the bits of each dimension into a NumericField? For example: one dimension a0 a1 .. an and a second dimension b0 b1 .. bn into a0 b0 a1 b1 .. an bn and then index this number as a NumericField. Something like the geohash algorithm but with n dimensions? The linking work that Mark discussed seems nice since it would give faster access to navigating the tree -- finding N nearest neighbors etc... - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2223) Separate out generic Solr site from release specific content.
Separate out generic Solr site from release specific content. --- Key: SOLR-2223 URL: https://issues.apache.org/jira/browse/SOLR-2223 Project: Solr Issue Type: Task Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It would be useful for deployment purposes if we separated out the Solr site that is non-release specific from the release specific content. This would make it easier to apply updates, etc. while still keeping release specific info handy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
Convert all Lucene web properties to use the ASF CMS Key: LUCENE-2748 URL: https://issues.apache.org/jira/browse/LUCENE-2748 Project: Lucene - Java Issue Type: Bug Reporter: Grant Ingersoll The new CMS has a lot of nice features (and some kinks to still work out) and Forrest just doesn't cut it anymore, so we should move to the ASF CMS: http://apache.org/dev/cms.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
On Mon, Nov 8, 2010 at 2:52 PM, mark harwood markharw...@yahoo.co.uk wrote: I came to the conclusion that the transient meaning of document ids is too deeply ingrained in Lucene's design to use them to underpin any reliable linking. What about if we define an id field (like in solr)? Whatever does the traversal would need to make a Map<id,docID>, but that is still better than needing to do a query for each link. While it might work for relatively static indexes, any index with a reasonable number of updates or deletes will invalidate any stored document references in ways which are very hard to track. Lucene's compaction shuffles IDs without taking care to preserve identity, unlike graph DBs like Neo4j (see recycling IDs here: http://goo.gl/5UbJi ) oh ya -- and it is even more awkward since each subreader often reuses the same docId ryan - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
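A rough sketch of building such a map with the 3.x API (the unique field name "id" is assumed; and, as noted elsewhere in this thread, the map is only valid for the life of the reader, since docIDs shift on merges):

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Build id -> docID for a unique "id" field.
static Map<String,Integer> buildIdMap(IndexReader reader) throws IOException {
  Map<String,Integer> idToDoc = new HashMap<String,Integer>();
  TermEnum terms = reader.terms(new Term("id", ""));
  TermDocs termDocs = reader.termDocs();
  try {
    do {
      Term t = terms.term();
      if (t == null || !"id".equals(t.field())) break;  // past the id field
      termDocs.seek(t);
      if (termDocs.next()) idToDoc.put(t.text(), termDocs.doc());
    } while (terms.next());
  } finally {
    termDocs.close();
    terms.close();
  }
  return idToDoc;
}
{code}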
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929725#action_12929725 ] Aaron Powell commented on LUCENENET-380: George, The reason I spun up the external repo is so that it's easy to track changes and have a collaborative effort trying to find the right tool for the job. Can we spin up a repo under the ASF so we can collaboratively work on a solution? Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, Lucene.Net.Sharpen20101104.zip, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2747: Attachment: LUCENE-2747.patch here's a quick stab at a patch. I had to add at least minimal support to ReusableAnalyzerBase in case you want charfilters, since it doesn't have any today. maybe there is a better way to do it though. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
On Monday 08 November 2010 21:34:18 Ryan McKinley wrote: On Mon, Nov 8, 2010 at 3:20 PM, Paul Elschot paul.elsc...@xs4all.nl wrote: On Monday 08 November 2010 20:03:59 Ryan McKinley wrote: Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant Did you consider merging the bits of each dimension into a NumericField? For example: one dimension a0 a1 .. an and a second dimension b0 b1 .. bn into a0 b0 a1 b1 .. an bn and then index this number as a NumericField. Something like the geohash algorithm but with n dimensions? Yes. It is also a simple bounded volume hierarchy. Regards, Paul Elschot - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929738#action_12929738 ] Robert Muir commented on LUCENE-2747: - bq. CharFilter must at least also implement read() to read one char. That's incorrect. Only read(char[] cbuf, int off, int len) is abstract in Reader. CharStream extends Reader, but only adds correctOffset. CharFilter extends CharStream, but only delegates read(char[] cbuf, int off, int len). So implementing read() only adds useless code duplication here. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
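To make that concrete, a made-up CharFilter showing that the array read is the only thing a subclass overrides; the single-char read() inherited from java.io.Reader is already defined in terms of read(char[], int, int):

{code}
import java.io.IOException;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.CharStream;

// Hypothetical example: uppercase the stream. Only the array read is
// overridden; Reader.read() delegates to it, so duplicating read()
// here would add nothing.
public class UpperCaseCharFilter extends CharFilter {
  public UpperCaseCharFilter(CharStream in) { super(in); }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int n = input.read(cbuf, off, len);
    for (int i = off; i < off + n; i++)
      cbuf[i] = Character.toUpperCase(cbuf[i]);
    return n;
  }
}
{code}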
Lucene-Solr-tests-only-trunk - Build # 1145 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1145/ 4 tests failed. FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: this writer hit an OutOfMemoryError; cannot complete optimize Stack Trace: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot complete optimize at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2394) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2316) at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:129) at org.apache.lucene.search.TestNumericRangeQuery64.beforeClass(TestNumericRangeQuery64.java:90) FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: null Stack Trace: java.lang.NullPointerException at org.apache.lucene.search.TestNumericRangeQuery64.afterClass(TestNumericRangeQuery64.java:97) FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: directory of test was not closed, opened from: org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653) Stack Trace: junit.framework.AssertionFailedError: directory of test was not closed, opened from: org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653) at org.apache.lucene.util.LuceneTestCase.afterClassLuceneTestCaseJ4(LuceneTestCase.java:331) REGRESSION: org.apache.lucene.search.TestPrefixFilter.testPrefixFilter Error Message: ConcurrentMergeScheduler hit unhandled exceptions Stack Trace: junit.framework.AssertionFailedError: ConcurrentMergeScheduler hit unhandled exceptions at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:458) Build Log (for compile errors): [...truncated 3116 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2749) Lexically sorted shingle filter
Lexically sorted shingle filter --- Key: LUCENE-2749 URL: https://issues.apache.org/jira/browse/LUCENE-2749 Project: Lucene - Java Issue Type: New Feature Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Sometimes people want to know if words have co-occurred within a specific window onto the token stream, but don't care what the order is. A Lucene token filter (LexicallySortedWindowFilter?), perhaps implemented as a ShingleFilter sub-class, could provide this functionality. This feature would allow for exact term set equality queries (in the case of a full-field-width window). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
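As a toy sketch of the idea on plain strings, leaving out the TokenStream plumbing a real ShingleFilter subclass would need (method name made up):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// One lexically sorted shingle per window: "quick brown" and
// "brown quick" produce the same term, and a window as wide as the
// whole field gives exact term-set equality.
static List<String> sortedShingles(List<String> tokens, int window) {
  List<String> shingles = new ArrayList<String>();
  for (int i = 0; i + window <= tokens.size(); i++) {
    List<String> w = new ArrayList<String>(tokens.subList(i, i + window));
    Collections.sort(w);
    StringBuilder sb = new StringBuilder();
    for (String s : w) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(s);
    }
    shingles.add(sb.toString());
  }
  return shingles;
}
{code}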
[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-2680: - Attachment: LUCENE-2680.patch Here's a cleaned-up patch; please take a look. I ran 'ant test-core' 5 times with no failures; however, running the below several times does eventually produce a failure. ant test-core -Dtestcase=TestThreadedOptimize -Dtestmethod=testThreadedOptimize -Dtests.seed=1547315783637080859:5267275843141383546 ant test-core -Dtestcase=TestIndexWriterMergePolicy -Dtestmethod=testMaxBufferedDocsChange -Dtests.seed=7382971652679988823:-6672235304390823521 Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1709) Distributed Date Faceting
[ https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929760#action_12929760 ] Peter Karich commented on SOLR-1709: Hi Peter Sturge, what are the limitations of this patch? Only that earlier + later isn't supported? What are the issues before committing this into trunk? Distributed Date Faceting - Key: SOLR-1709 URL: https://issues.apache.org/jira/browse/SOLR-1709 Project: Solr Issue Type: Improvement Components: SearchComponents - other Affects Versions: 1.4 Reporter: Peter Sturge Priority: Minor Attachments: FacetComponent.java, FacetComponent.java, ResponseBuilder.java, solr-1.4.0-solr-1709.patch This patch is for adding support for date facets when using distributed searches. Date faceting across multiple machines exposes some time-based issues that anyone interested in this behaviour should be aware of: Any time and/or time-zone differences are not accounted for in the patch (i.e. merged date facets are at a time-of-day, not necessarily at a universal 'instant-in-time', unless all shards are time-synced to the exact same time). The implementation uses the first encountered shard's facet_dates as the basis for subsequent shards' data to be merged in. This means that if subsequent shards' facet_dates are skewed in relation to the first by 1 'gap', these 'earlier' or 'later' facets will not be merged in. There are several reasons for this: * Performance: It's faster to check facet_date lists against a single map's data, rather than against each other, particularly if there are many shards * If 'earlier' and/or 'later' facet_dates are added in, this will make the time range larger than that which was requested (e.g. a request for one hour's worth of facets could bring back 2, 3 or more hours of data) This could be dealt with if timezone and skew information was added, and the dates were normalized. One possibility for adding such support is to [optionally] add 'timezone' and 'now' parameters to the 'facet_dates' map. This would tell requesters what time and TZ the remote server thinks it is, and so multiple shards' time data can be normalized. The patch affects 2 files in the Solr core: org.apache.solr.handler.component.FacetComponent.java org.apache.solr.handler.component.ResponseBuilder.java The main changes are in FacetComponent - ResponseBuilder is just to hold the completed SimpleOrderedMap until the finishStage. One possible enhancement is to perhaps make this an optional parameter, but really, if facet.date parameters are specified, it is assumed they are desired. Comments & suggestions welcome. As a favour to ask, if anyone could take my 2 source files and create a PATCH file from it, it would be greatly appreciated, as I'm having a bit of trouble with svn (don't shoot me, but my environment is a Redmond-based os company). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929768#action_12929768 ] Mauricio Scheffer commented on LUCENENET-380: - @Aaron Powell: the ASF has official Git mirrors at github, see https://github.com/apache/lucene.net It's outdated, so there seems to be a problem with the ASF sync; I'd notify the ASF infrastructure team about it. See also http://www.apache.org/dev/git.html Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, Lucene.Net.Sharpen20101104.zip, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Burton-West updated SOLR-2211: -- Attachment: SOLR-2211.patch Patch implements Solr UAX29TokenizerFactory and TestUAX29TokenizerFactory. Tom Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Priority: Minor Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved unicode processing without necessarily including the ip address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Assigned: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-2211: - Assignee: Robert Muir Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Assignee: Robert Muir Priority: Minor Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved unicode processing without necessarily including the ip address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
What about if we define an id field (like in solr)? Last time I floated the idea of supporting primary keys as a core concept in Lucene (in the context of helping doc updates, not linking) there were objections along the lines of "lucene shouldn't try to be a database" On 8 Nov 2010, at 20:47, Ryan McKinley ryan...@gmail.com wrote: What about if we define an id field (like in solr)? Whatever does the traversal would need to make a Map<id,docID>, but that is still better than needing to do a query for each link. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1709) Distributed Date Faceting
[ https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929786#action_12929786 ] Peter Sturge commented on SOLR-1709: Hi Peter, Thanks for your message. There's of course the issue of 'now' as described in some of the above comments. This is perhaps a little ancillary to this issue, but not totally irrelevant. The issue of time zone/skew on distributed shards is currently handled by SOLR-1729 by passing a 'facet.date.now=epochtime' parameter in the search query. This is then used by the participating shards as 'now'. Of course, there are a number of ways to skin that one, but this is a straightforward solution that is backward compatible and still easy to implement in client code. Note that the facet.date.now change is not part of this patch - see SOLR-1729 for a separate patch for this parameter. (kept separate because it's, strictly speaking, a separate issue generally for distributed search) It's not that earlier/later aren't supported - the date facet 'edges' are fine, it's just the patch will 'quantize the ends' of the start/end date facets if the time is skewed from the calling server. This is where SOLR-1729 comes into play, so that this doesn't happen. As this is a pre-3x/4x branch patch, the testing is a bit limited on the latest trunk(s). Having said that, I have this (and SOLR-1729) building/running fine on my svn 3x branch release copy. Any other questions, or info you need, please do let me know. Thanks! Peter Distributed Date Faceting - Key: SOLR-1709 URL: https://issues.apache.org/jira/browse/SOLR-1709 Project: Solr Issue Type: Improvement Components: SearchComponents - other Affects Versions: 1.4 Reporter: Peter Sturge Priority: Minor Attachments: FacetComponent.java, FacetComponent.java, ResponseBuilder.java, solr-1.4.0-solr-1709.patch This patch is for adding support for date facets when using distributed searches. Date faceting across multiple machines exposes some time-based issues that anyone interested in this behaviour should be aware of: Any time and/or time-zone differences are not accounted for in the patch (i.e. merged date facets are at a time-of-day, not necessarily at a universal 'instant-in-time', unless all shards are time-synced to the exact same time). The implementation uses the first encountered shard's facet_dates as the basis for subsequent shards' data to be merged in. This means that if subsequent shards' facet_dates are skewed in relation to the first by 1 'gap', these 'earlier' or 'later' facets will not be merged in. There are several reasons for this: * Performance: It's faster to check facet_date lists against a single map's data, rather than against each other, particularly if there are many shards * If 'earlier' and/or 'later' facet_dates are added in, this will make the time range larger than that which was requested (e.g. a request for one hour's worth of facets could bring back 2, 3 or more hours of data) This could be dealt with if timezone and skew information was added, and the dates were normalized. One possibility for adding such support is to [optionally] add 'timezone' and 'now' parameters to the 'facet_dates' map. This would tell requesters what time and TZ the remote server thinks it is, and so multiple shards' time data can be normalized. 
The patch affects 2 files in the Solr core: org.apache.solr.handler.component.FacetComponent.java org.apache.solr.handler.component.ResponseBuilder.java The main changes are in FacetComponent - ResponseBuilder is just to hold the completed SimpleOrderedMap until the finishStage. One possible enhancement is to perhaps make this an optional parameter, but really, if facet.date parameters are specified, it is assumed they are desired. Comments & suggestions welcome. As a favour to ask, if anyone could take my 2 source files and create a PATCH file from it, it would be greatly appreciated, as I'm having a bit of trouble with svn (don't shoot me, but my environment is a Redmond-based os company). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
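For readers following along, a hypothetical request showing where the SOLR-1729 parameter would sit next to the standard date-facet parameters (all values made up):

{code}
/select?q=*:*&facet=true&facet.date=timestamp
  &facet.date.start=NOW/HOUR-1HOUR&facet.date.end=NOW/HOUR
  &facet.date.gap=%2B1MINUTE&facet.date.now=1289258888000
{code}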
[jira] Resolved: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-2211. --- Resolution: Fixed Fix Version/s: 4.0 3.1 Committed revision 1032776, 1032779 (3x). Thanks Tom! Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Assignee: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved unicode processing without necessarily including the ip address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2040) is ConcurrentLRUCache really a thread-safe/LRU implementation?
[ https://issues.apache.org/jira/browse/SOLR-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929800#action_12929800 ] Yonik Seeley commented on SOLR-2040: ConcurrentLRUCache is not strictly bounded by the size - when the max size is hit, we still allow other puts to proceed while evicting the oldest entries - by design for greater concurrency. If adds to the cache are very cheap to generate, this is not an appropriate cache to use since evictions won't keep up with additions. The uses in Solr are all appropriate however, so trying to fix this via additional synchronization will only result in lower throughput. is ConcurrentLRUCache really a thread-safe/LRU implementation? -- Key: SOLR-2040 URL: https://issues.apache.org/jira/browse/SOLR-2040 Project: Solr Issue Type: Bug Reporter: lszwycn hi, i wrote a simple test
{code}
package lru.solr;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentLRUCacheTest {
    static final int loop = 1;
    static final int threadCount = 500;
    static final ConcurrentLRUCache<Integer, Integer> lruMap =
        new ConcurrentLRUCache<Integer, Integer>(128, 80, 100, 100, false, false, null);
    static final ExecutorService exec = Executors.newFixedThreadPool(threadCount);
    static final AtomicInteger totalRuncounter = new AtomicInteger();
    static final AtomicInteger putCounter = new AtomicInteger();
    static final AtomicInteger sizeCounter = new AtomicInteger();
    static long totalTime = 0;

    public static void main(String[] args) throws Exception {
        List<Callable<Long>> callList = new ArrayList<Callable<Long>>();
        for (int i = 0; i < threadCount; i++) {
            callList.add(new Callable<Long>() {
                int maxCacheSize = 0;
                int maxCacheInternalMapSize = 0;

                public Long call() throws Exception {
                    final long begin = System.nanoTime();
                    Random r = new Random();
                    for (int j = 0; j < loop; j++) {
                        totalRuncounter.getAndIncrement();
                        int n = r.nextInt(1);
                        int currentCacheSize = lruMap.size();
                        int currentCacheInternalMapSize = lruMap.getMap().size();
                        maxCacheSize = Math.max(currentCacheSize, maxCacheSize);
                        maxCacheInternalMapSize = Math.max(currentCacheInternalMapSize, maxCacheInternalMapSize);
                        if (null == lruMap.get(n)) {
                            lruMap.put(n, j);
                            putCounter.getAndIncrement();
                        } else {
                            lruMap.size();
                            sizeCounter.getAndIncrement();
                        }
                    }
                    System.out.println("maxCacheSize:" + maxCacheSize
                        + ",maxCacheInternalMapSize:" + maxCacheInternalMapSize);
                    final long end = System.nanoTime();
                    return (end - begin);
                }
            });
        }
        List<Future<Long>> futureList = exec.invokeAll(callList);
        for (Future<Long> future : futureList) {
            totalTime += future.get();
        }
        System.out.println("final cache size:" + lruMap.size());
        System.out.println("final cache internal map size:" + lruMap.getMap().size());
        System.out.println("total get:" + totalRuncounter + " spend time=" + totalTime / 1000
            + ", put:" + putCounter.get() + ", size:
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929810#action_12929810 ] Jason Rutherglen commented on LUCENE-2680: -- The problem could be that IW deleteDocument is not synced on IW; when I tried adding the sync, there was deadlock, perhaps from DW waitReady. We could be adding pending deletes to segments that are not quite current because we're not adding them in an IW sync block. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929813#action_12929813 ] Prescott Nasser commented on LUCENENET-379: --- Any objections then to just digging into the ASF CMS system? Also, in terms of what the page should look like, do we still want to mimic the other lucene pages? or should we go with the skeleton that apache.org uses? Clean up Lucene.Net website --- Key: LUCENENET-379 URL: https://issues.apache.org/jira/browse/LUCENENET-379 Project: Lucene.Net Issue Type: Task Reporter: George Aroush The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the incubation, out of date design. This JIRA task is to bring it up to date with other ASF project's web page. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adopting it for Lucene.Net. Some examples, https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Lucene-3.x - Build # 175 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/175/ All tests passed Build Log (for compile errors): [...truncated 21371 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929826#action_12929826 ] Uwe Schindler commented on LUCENE-2747: --- Ay, ay, Code Dup Policeman. From a perf standpoint, for real FilterReaders in java.io that would be a no-go, but here it's fine as Tokenizers always buffer. Also, java.io's FilterReaders are different and delegate this method, but CharFilter does not. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1149 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1149/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:201) Build Log (for compile errors): [...truncated 8711 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929834#action_12929834 ] Robert Muir commented on LUCENE-2747: - Wait, that's an interesting point, any advantage to actually using real FilterReaders for this API? Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
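The char filter approach mentioned in the issue description (mapping ZWNJ to a space ahead of StandardTokenizer) can be sketched with the stock 3.x analysis classes. This is a minimal sketch, assuming MappingCharFilter and NormalizeCharMap keep their usual 3.x signatures; it is illustrative only and not the attached patch:
{noformat}
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class ZwnjCharFilterSketch {
  public static void main(String[] args) {
    String text = "foo\u200Cbar"; // hypothetical input containing a ZWNJ
    // Map ZWNJ (U+200C) to a space before tokenization, so the UAX#29
    // rules see a word boundary where ArabicLetterTokenizer used to break.
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " ");
    Reader filtered = new MappingCharFilter(map, CharReader.get(new StringReader(text)));
    StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_31, filtered);
  }
}
{noformat}
Because CharFilter corrects offsets, highlighting against the original text should still line up after the mapping.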
[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929849#action_12929849 ] Robert Muir commented on SOLR-2211: --- Great, I look forward to the results. By the way, on SOLR-2210 I also added the ICU filters; you could consider replacing LowerCaseFilterFactory with ICUNormalizer2Factory (just use the defaults). In addition to better lowercasing (e.g. ß -> ss), this would also bring the advantages described in http://unicode.org/reports/tr15/ Alternatively, if you are already using both LowerCaseFilterFactory and ASCIIFoldingFilterFactory, you can replace both with ICUFoldingFilterFactory, which goes further and also incorporates http://www.unicode.org/reports/tr30/tr30-4.html Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Assignee: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved Unicode processing without necessarily including the IP address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
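To make the suggested swap concrete at the Lucene level: those factories wrap filters from the contrib/icu module, so replacing LowerCaseFilter plus ASCIIFoldingFilter collapses two links of the chain into one. A rough sketch, assuming the single-TokenStream constructor of ICUFoldingFilter from that module (not taken from the SOLR-2210 patch):
{noformat}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class IcuFoldingSketch {
  public static void main(String[] args) {
    // One filter instead of LowerCaseFilter + ASCIIFoldingFilter:
    // folds case (e.g. ß -> ss) and strips diacritics in a single pass.
    TokenStream ts = new StandardTokenizer(Version.LUCENE_31,
        new StringReader("Straße RÉSUMÉ"));
    ts = new ICUFoldingFilter(ts);
  }
}
{noformat}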
Lucene-Solr-tests-only-trunk - Build # 1152 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1152/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety Error Message: null Stack Trace: junit.framework.AssertionFailedError: at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety(TestIndexWriter.java:2385) Build Log (for compile errors): [...truncated 3107 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
add Kamikaze 3.0.1 into Lucene -- Key: LUCENE-2750 URL: https://issues.apache.org/jira/browse/LUCENE-2750 Project: Lucene - Java Issue Type: Sub-task Components: contrib/* Reporter: hao yan Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate this highly efficient PForDelta implementation into a Lucene Codec. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
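For background on why this matters: PForDelta delta-encodes a block of doc IDs and bit-packs most gaps at one fixed small width, spilling rare large gaps to an exception list, which is what makes decompression so much faster than byte-at-a-time VInt decoding. A toy illustration of the frame-of-reference step in plain Java (nothing here is Kamikaze's actual API):
{noformat}
public class ForDeltaToy {
  public static void main(String[] args) {
    int[] docIds = {3, 7, 11, 40, 41}; // one postings block
    int[] deltas = new int[docIds.length];
    int prev = 0, bits = 0;
    for (int i = 0; i < docIds.length; i++) {
      deltas[i] = docIds[i] - prev; // gaps: 3, 4, 4, 29, 1
      prev = docIds[i];
      bits = Math.max(bits, 32 - Integer.numberOfLeadingZeros(deltas[i]));
    }
    // Every gap fits in 'bits' bits (5 here), versus at least 8 bits per
    // VInt byte; real PForDelta picks an even smaller width and stores
    // outliers like 29 separately as exceptions.
    System.out.println("bits per packed delta: " + bits);
  }
}
{noformat}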
Lucene-trunk - Build # 1357 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1357/ All tests passed Build Log (for compile errors): [...truncated 18287 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1155 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1155/ 1 tests failed. REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch Error Message: Some threads threw uncaught exceptions! Stack Trace: junit.framework.AssertionFailedError: Some threads threw uncaught exceptions! at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:437) at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78) at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144) Build Log (for compile errors): [...truncated 8857 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2224) TermVectorComponent did not return results when using distributedProcess in distribution envs
TermVectorComponent did not return results when using distributedProcess in distribution envs - Key: SOLR-2224 URL: https://issues.apache.org/jira/browse/SOLR-2224 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 4.0 Environment: JDK1.6/Tomcat6 Reporter: tom liu When using a distributed query, TVRH did not return any results. In distributedProcess, tv creates one request that uses TermVectorParams.DOC_IDS, for example tv.docIds=10001, but QueryComponent returns ids that are uniqueKeys, not doc IDs. So, in distributed environments, distributedProcess must not be used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2224) TermVectorComponent did not return results when using distributedProcess in distribution envs
[ https://issues.apache.org/jira/browse/SOLR-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929916#action_12929916 ] tom liu commented on SOLR-2224: --- We can delete the distributedProcess method and add a modifyRequest method:
{noformat}
public void modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq) {
  // Only ask shards for term vectors during the GET_FIELDS stage.
  if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS)
    sreq.params.set("tv", true);
  else
    sreq.params.set("tv", false);
}
{noformat}
TermVectorComponent did not return results when using distributedProcess in distribution envs - Key: SOLR-2224 URL: https://issues.apache.org/jira/browse/SOLR-2224 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 4.0 Environment: JDK1.6/Tomcat6 Reporter: tom liu When using a distributed query, TVRH did not return any results. In distributedProcess, tv creates one request that uses TermVectorParams.DOC_IDS, for example tv.docIds=10001, but QueryComponent returns ids that are uniqueKeys, not doc IDs. So, in distributed environments, distributedProcess must not be used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929927#action_12929927 ] Jason Rutherglen commented on LUCENE-2680: -- Ok, TestThreadedOptimize works when the DW sync'ed pushSegmentInfos method isn't called anymore (no extra per-segment deleting is going on), and stops working when pushSegmentInfos is turned back on. Something about the sync on DW is causing a problem. Hmm... We need another way to pass segment infos around consistently. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex, since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929934#action_12929934 ] DM Smith commented on LUCENE-2747: -- I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too. Then I'd be happy. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929936#action_12929936 ] Steven Rowe commented on LUCENE-2747: - Robert, your patch looks good - I have a couple of questions: * You removed {{TestHindiFilters.testTokenizer()}}, {{TestIndicTokenizer.testBasics()}} and {{TestIndicTokenizer.testFormat()}}, but these would be useful in {{TestStandardAnalyzer}} and {{TestUAX29Tokenizer}}, wouldn't they? * You did not remove {{ArabicLetterTokenizer}} and {{IndicTokenizer}}, presumably so that they can be used with Lucene 4.0+ when the supplied {{Version}} is less than 3.1 -- good catch, I had forgotten this requirement -- but when can we actually get rid of these? Since they will be staying, shouldn't their tests remain too, but using {{Version.LUCENE_30}} instead of {{TEST_VERSION_CURRENT}}? Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929945#action_12929945 ] Steven Rowe commented on LUCENE-2747: - bq. I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too. {{UAX29Tokenizer}} could be combined with {{LowercaseFilter}} to provide that, no? Robert is arguing in the reopened LUCENE-2167 for {{StandardTokenizer}} to be stripped down so that it only implements UAX#29 rules (i.e., dropping URL+Email recognition), so if that comes to pass, {{StandardAnalyzer}} would just be UAX#29+lowercase+stopword (with English stopwords by default, but those can be overridden in the ctor) -- would that make you happy? Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
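If it helps the discussion: the UAX#29-plus-lowercase combination could be wired up along the following lines. A sketch only: UAX29Tokenizer is the class exercised by the attached patch's tests, and a plain Reader constructor is assumed here rather than confirmed against the patch.
{noformat}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.UAX29Tokenizer;
import org.apache.lucene.util.Version;

// A "simple" analyzer: UAX#29 word boundaries plus lowercasing,
// with no URL/email recognition and no stopword removal.
public final class UAX29SimpleAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(Version.LUCENE_40, new UAX29Tokenizer(reader));
  }
}
{noformat}
A stopword filter could then be layered on top to get the stripped-down StandardAnalyzer behavior described above.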
Lucene-Solr-tests-only-trunk - Build # 1158 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1158/ 1 tests failed. REGRESSION: org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch Error Message: .response.numFound:35!=67 Stack Trace: junit.framework.AssertionFailedError: .response.numFound:35!=67 at org.apache.solr.BaseDistributedSearchTestCase.compareResponses(BaseDistributedSearchTestCase.java:553) at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:307) at org.apache.solr.cloud.BasicDistributedZkTest.doTest(BasicDistributedZkTest.java:127) at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:562) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) Build Log (for compile errors): [...truncated 8715 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org