RE: [Lucene.Net] I want to help! Also, where are we at?

2012-02-04 Thread Prescott Nasser

Good catch Stefan!
  From: bode...@apache.org
 To: seansevilt...@gmail.com; sean.new...@grantadesign.com
 CC: lucene-net-...@incubator.apache.org
 Date: Sat, 4 Feb 2012 08:07:48 +0100
 Subject: Re: [Lucene.Net] I want to help! Also, where are we at?
 
 Hi Sean,
 
 I only just now realized that the responses you received
 
 http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/201201.mbox/%3CCAFZAm_XwoDKkTK9AuJ=zeegvtqufdmebwbz89pd6lbbjguc...@mail.gmail.com%3E
 http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/201201.mbox/%3cca+p8kvobdicn-njpcua8obeqqzygdkpah5iu4j-6mr3796g...@mail.gmail.com%3E
 
 only went to the Lucene.Net list rather than to you, so you may
 never have received them.
 
 In short: You are more than welcome.  Look around to see if you find
 anything you want to work on, and do what you enjoy doing.  We don't assign
 work to people, people pick the stuff they want to work on.  If you have
 any questions or need help, don't hesitate to ask.
 
 In order to join the mailing list, which is where all discussion and
 coordination happen, you have to send an email to
 lucene-net-dev-subscr...@incubator.apache.org using the email address
 you intend to use when posting to this list.  Everybody can join the
 list.
 
 Cheers
 
 Stefan
  

RE: [Lucene.Net] 3.0.3

2012-02-04 Thread Prescott Nasser

So, Chris, if you did this as a direct port of the Java version 
(https://svn.apache.org/repos/asf/lucene/java/tags/lucene_3_0_3/), does that 
mean that all of the LUCENE JIRA issues 
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+LUCENE+AND+fixVersion+%3D+%223.0.3%22+AND+status+%3D+Closed+ORDER+BY+priority+DESC&mode=hide)
 are already part of this code? That would make 3.0.3 well on its way to 
release... ~P
  From: bode...@apache.org
 To: lucene-net-...@incubator.apache.org
 Date: Wed, 25 Jan 2012 12:35:25 +0100
 Subject: Re: [Lucene.Net] 3.0.3
 
 On 2012-01-25, Michael Herndon wrote:
 
  Do we have a standard of copy or tag of Java's version source that we're
  doing a compare against?  I only see the 3_1 and above in the tags.
 
 Likely because the svn location has changed in between.  I think it must
 be https://svn.apache.org/repos/asf/lucene/java/tags/lucene_3_0_3/
 
 Stefan
  

Re: Changes to enable easy_install of packages using JCC

2012-02-04 Thread Andi Vajda


 Hi Chris,

On Wed, 1 Feb 2012, Andi Vajda wrote:

No objections to these patches in principle but it would be easier for me 
to integrate them if you could provide patches computed from the svn 
repository of JCC: 
http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/ Your patches 
seem to be small enough so I should be able to do without but it would be 
nicer if I didn't have to guess...


I think the patch that I attached was already based on trunk. The git 
repository includes the .svn directories, points to trunk, and I generated 
the patch using svn diff.


Sorry, I missed that you indeed had attached a patch last time.
(to be continued...)

Also, please write small descriptions for these new command line flags to 
go into JCC's __main__.py file:

http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/jcc/__main__.py


Done, new patch attached.


Thank you !


I integrated your patches with rev 1240624.
I moved a few changes around: parameters to their section in __main__.py and 
'maxstack' hardcoding to where it used to be.


Thank you for your contribution.

Andi..



This mess of setuptools patching was meant to be *temporary* until 
setuptools' issue 43 was fixed. As you can see, I filed this bug 3 1/2 
years ago, http://bugs.python.org/setuptools/issue43, and my patch for 
issue 43 still hasn't been accepted, rejected, integrated, anything'ed... 
Dormant. For over three years.


Sorry about that. I've had similar experience with bugs reported against 
ubuntu, hibernate, rails... :(



 * Why does JCC use non-standard command line arguments like --build and
 --install? Can it be modified to make it easier to invoke from a
 setup.py-style environment, such as exporting a setup() function as
 setuptools does?


What standard are you referring to ?
The python extension module build/install/deploy story on Python keeps 
evolving... Add Python 3.x support into the mix, and the mess is complete.


Seriously, though, I think that the right thing to do to better integrate 
JCC with distutils/setuptools/distribute/pip/etc... is to make it into a 
distutils 'compiler'. This requires some work, though, and I haven't done 
it in all these years. Anyone with the itch to hack on distutils is welcome 
to take that on.


I'm afraid I don't fully understand how distutils works, it seems to be 
sparsely documented, and I don't have a lot of time and energy to work on 
refactoring jcc. I am a bit surprised that we can't just generate a source 
distribution containing the jars, .cpp files and a setup.py which does the 
rest like any other Python extension.


Same here. I don't know distutils too well and whenever I tried to dig into 
it, I quickly gave up. I don't know what it means to just generate a source 
distribution.


If they contain .class files, JAR files are not source files. My 
understanding could be wrong here, but I don't think they're even compatible 
between 32- and 64-bit VMs. Or is that an incompatibility between Java 5 and 6?


I have very little itch to dabble in configure scripts either, so I've been 
dragging my feet. If someone were to step forward with a patch for that, 
I'd be delighted to rip out all this patching brittleness.


How would a configure script solve the problem and what would it have to 
do? Generate the .cpp files? How does it integrate with Python extensions?


A configure script for building libjcc.dylib (libjcc.so on Linux, jcc.dll on 
Windows, etc...) would take care of doing what setuptools + the issue43 patch 
is doing for us currently: invoking the C++ compiler and linker against the 
correct Python headers and libraries to produce a vanilla shared library. 
With such a configure script, there would no longer be a need to patch setuptools.


That is a whole different project. If I remember correctly, the JPype 
project is (or was) taking that approach: http://jpype.sourceforge.net


OK, thanks.


 * Could JCC generate a source distribution (sdist) that could be
   uploaded to pypi?


You mean a source distribution that includes the Java sources of all the 
libraries/classes wrapped ?


I was thinking more of the jars. Something like 
https://github.com/aptivate/python-tika that doesn't depend on jcc any 
more.



 * setup.py develop is still broken in the current implementation


I'm not familiar with this 'develop' command, nor was I aware that it is 
broken. What is it supposed to do, and how is it broken?


http://packages.python.org/distribute/setuptools.html#development-mode

It seems that when invoked this way, my setup.py (from python-tika), which 
calls jcc, ends up creating build/_tika as a file (not a directory).


For example, this command:

 sudo pip install -e git+https://github.com/aptivate/python-tika#egg=tika

(note the -e for editable mode) results in this:

 Running setup.py develop for tika
 ...
   Traceback (most recent call last):
 File "<string>", line 1, in <module>
 File "/tmp/src/tika/setup.py", line 108, in <module>
   cpp.jcc(jcc_args)
 File 

Re: Changes to enable easy_install of packages using JCC

2012-02-04 Thread Chris Wilson

Hi Andi,

On Sat, 4 Feb 2012, Andi Vajda wrote:

I integrated your patches with rev 1240624. I moved a few changes around: 
parameters to their section in __main__.py and 'maxstack' hardcoding to 
where it used to be.


Thank you for your contribution.


Thanks :)

Cheers, Chris.
--
Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES

Aptivate is a not-for-profit company registered in England and Wales
with company number 04980791.



[jira] [Commented] (SOLR-3049) UpdateRequestProcessorChain for UIMA : runtimeParameters: not all types supported

2012-02-04 Thread Tommaso Teofili (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200362#comment-13200362
 ] 

Tommaso Teofili commented on SOLR-3049:
---

Hi Harsh, I think there should be a more general way of mapping typed 
parameters; I just need to dig a little deeper to find it.
In the meantime, however, I'll try and test your patch. Thanks!

 UpdateRequestProcessorChain for UIMA : runtimeParameters: not all types 
 supported
 -

 Key: SOLR-3049
 URL: https://issues.apache.org/jira/browse/SOLR-3049
 Project: Solr
  Issue Type: Bug
  Components: update
Reporter: Harsh P
Priority: Minor
  Labels: uima, update_request_handler
 Attachments: SOLR-3049.patch


 The solrconfig.xml file has an option to override certain UIMA runtime
 parameters in the UpdateRequestProcessorChain section.
 There are certain UIMA annotators, like RegexAnnotator, which define
 their runtimeParameters value as an Array, which is not currently supported
 in the Solr-UIMA interface.
 In java/org/apache/solr/uima/processor/ae/OverridingParamsAEProvider.java,
 private Object getRuntimeValue(AnalysisEngineDescription desc, String
 attributeName) function defines override for UIMA analysis engine
 runtimeParameters as they are passed to UIMA Analysis Engine.
 runtimeParameters which are currently supported in the Solr-UIMA interface 
 are:
  String
  Integer
  Boolean
  Float
 I have made a hack that fixes this issue by adding Array support. I would
 like to submit that as a patch if no one else is working on fixing
 this issue.
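For illustration, the override logic described above amounts to a switch on the declared parameter type, and Array support can slot in as one more branch. Below is a self-contained sketch under the assumption that array values arrive as comma-separated strings; `ParamMapper` and its method are hypothetical stand-ins, not the actual OverridingParamsAEProvider code:

```java
import java.util.Arrays;

public class ParamMapper {
    /**
     * Maps a raw string value to a typed object, mimicking the kind of
     * type switch the Solr-UIMA override code performs for String,
     * Integer, Boolean and Float, plus the proposed Array case
     * (comma-separated values split into a String[]).
     */
    public static Object toTypedValue(String type, String raw) {
        if ("Integer".equals(type)) {
            return Integer.valueOf(raw);
        } else if ("Boolean".equals(type)) {
            return Boolean.valueOf(raw);
        } else if ("Float".equals(type)) {
            return Float.valueOf(raw);
        } else if ("Array".equals(type)) {
            // proposed extension: split a comma-separated list
            return raw.split("\\s*,\\s*");
        }
        return raw; // default: plain String
    }

    public static void main(String[] args) {
        String[] values = (String[]) toTypedValue("Array", "a, b, c");
        System.out.println(Arrays.toString(values)); // [a, b, c]
        System.out.println(toTypedValue("Integer", "42")); // 42
    }
}
```

The real fix would additionally have to respect the element type declared in the annotator's descriptor rather than always producing strings.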

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

2012-02-04 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200461#comment-13200461
 ] 

Christian Moen commented on LUCENE-3745:


I'll submit a patch for this tomorrow.

 Need stopwords and stoptags lists for default Japanese configuration
 

 Key: LUCENE-3745
 URL: https://issues.apache.org/jira/browse/LUCENE-3745
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Christian Moen
 Attachments: filter_stoptags.py, top-10.txt, top-100-pos.txt, 
 top-pos.txt


 Stopwords and stoptags lists for Japanese need to be developed, tested and 
 integrated into Lucene.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2649) MM ignored in edismax queries with operators

2012-02-04 Thread Brian Carver (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200476#comment-13200476
 ] 

Brian Carver commented on SOLR-2649:


I'm new to Solr, so I have a tenuous grasp on some of these issues, but I've 
understood boolean logic for a couple of decades, and it seems to me that Solr's 
current behavior is thwarting the expectations of those who understand what 
they want and explicitly ask for it. Mike's example above is what troubles me.

Principles:
1. The maintainer sets whitespace to be interpreted as AND or OR, and Solr 
should do nothing to change that in particular instances.
2. Where a user inputs an ambiguous query, a default rule about how operator 
scope will work is needed and that also should not be changed in particular 
instances.

So, Mike says he sets whitespace to AND, users know this, and then a user 
enters:

Example 1: (A or B or C) D E

Given the above assumptions, the only reasonable interpretation of this is:

(A or B or C) AND D E, which is a conjunction with two conjuncts, both of 
which must be satisfied for a result to be produced; yet Mike/the user gets 
results that satisfy only one of the conjuncts. That shouldn't happen.

I'd agree though that how to understand/apply mm in some of the examples above 
creates hard questions, but that is why many search engines provide two 
interfaces, one natural language interface and one that requires strict use 
of boolean syntax. Allowing people to enter some boolean operators (which 
they're going to expect will be respected-no-matter-what) and simultaneously 
interpreting their query using mm handlers intended for a more rough-and-ready 
approach is just going to lead to confused end users most of the time. So, in 
some ways, ignoring mm when operators are used is a feature, not a bug, but 
that seems orthogonal to the completely unacceptable outcome Mike described: 
whatever is causing THAT is a bug.

 MM ignored in edismax queries with operators
 

 Key: SOLR-2649
 URL: https://issues.apache.org/jira/browse/SOLR-2649
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 3.3
Reporter: Magnus Bergmark
Priority: Minor

 Hypothetical scenario:
   1. User searches for "stocks oil gold" with MM set to 50%
   2. User adds -stockings to the query: "stocks oil gold -stockings"
   3. User gets no hits, since MM was ignored and all terms were AND-ed 
 together
 The behavior seems to be intentional, although the reason why is never 
 explained:
   // For correct lucene queries, turn off mm processing if there
   // were explicit operators (except for AND).
   boolean doMinMatched = (numOR + numNOT + numPluses + numMinuses) == 0; 
 (lines 232-234 taken from 
 tags/lucene_solr_3_3/solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java)
 This makes edismax unsuitable as a replacement for dismax; mm is one of the 
 primary features of dismax.
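The quoted snippet boils down to a single rule: mm processing stays enabled only when the query contains no explicit OR, NOT, '+' or '-' operators (AND alone does not disable it). A standalone sketch of that condition, with the operator counts passed in directly (a hypothetical helper mirroring the logic, not the actual parser code):

```java
public class MinMatchRule {
    /**
     * Mirrors the quoted 3.3 condition: min-should-match applies only
     * when the query contains no explicit OR, NOT, '+' or '-' operators
     * (explicit AND is exempt from the check).
     */
    public static boolean doMinMatched(int numOR, int numNOT,
                                       int numPluses, int numMinuses) {
        return (numOR + numNOT + numPluses + numMinuses) == 0;
    }

    public static void main(String[] args) {
        // "stocks oil gold" -> no explicit operators, mm is honored
        System.out.println(doMinMatched(0, 0, 0, 0)); // true
        // "stocks oil gold -stockings" -> one '-', mm silently ignored
        System.out.println(doMinMatched(0, 0, 0, 1)); // false
    }
}
```

This makes the reported behavior easy to see: adding a single negated term flips the whole query out of mm processing.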

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Solr-3.x - Build # 589 - Failure

2012-02-04 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Solr-3.x/589/

All tests passed

Build Log (for compile errors):
[...truncated 36740 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-tests-only-3.x - Build # 12368 - Failure

2012-02-04 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12368/

No tests ran.

Build Log (for compile errors):
[...truncated 3338 lines...]
[javac] lst.add("errors", numErrors);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/handler/RequestHandlerBase.java:176:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("timeouts", numTimeouts);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/handler/RequestHandlerBase.java:177:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("totalTime",totalTime);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/handler/RequestHandlerBase.java:178:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgTimePerRequest", (float) totalTime / (float) 
this.numRequests);
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/handler/RequestHandlerBase.java:179:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac] lst.add("avgRequestsPerSecond", (float) numRequests*1000 / 
(float)(System.currentTimeMillis()-handlerStart));   
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/handler/admin/CoreAdminHandler.java:216:
 warning: [unchecked] unchecked conversion
[javac] found   : org.apache.solr.util.RefCounted[]
[javac] required: 
org.apache.solr.util.RefCounted<org.apache.solr.search.SolrIndexSearcher>[]
[javac]   searchers = new RefCounted[sourceCores.length];
[javac]   ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/handler/component/ResponseBuilder.java:331:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac]   rsp.getResponseHeader().add( "partialResults", Boolean.TRUE );
[javac]  ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/search/FunctionQParser.java:254:
 warning: [unchecked] unchecked conversion
[javac] found   : java.util.HashMap
[javac] required: java.util.Map<java.lang.String,java.lang.String>
[javac]   int end = QueryParsing.parseLocalParams(qs, start, 
nestedLocalParams, getParams());
[javac]  ^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/handler/component/FacetComponent.java:491:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac]   facet_queries.add(qf.getKey(), num(qf.count));
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/request/SimpleFacets.java:194:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac]   facetResponse.add("facet_queries", getFacetQueryCounts());
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/request/SimpleFacets.java:195:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac]   facetResponse.add("facet_fields", getFacetFieldCounts());
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/request/SimpleFacets.java:196:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 
the raw type org.apache.solr.common.util.NamedList
[javac]   facetResponse.add("facet_dates", getFacetDateCounts());
[javac]^
[javac] 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-3.x/checkout/solr/core/src/java/org/apache/solr/request/SimpleFacets.java:197:
 warning: [unchecked] unchecked call to add(java.lang.String,T) as a member of 

RE: svn commit: r1240035 - in /lucene/dev/branches/branch_3x/lucene/src: java/org/apache/lucene/analysis/TypeTokenFilter.java test/org/apache/lucene/analysis/TestTypeTokenFilter.java

2012-02-04 Thread Uwe Schindler
Hi Tommaso,

As you are a new committer, please take care of the following:
- The branch 3.x of Lucene/Solr must still compile and test with Java 5, so 
after merging from trunk, compile and run all tests with Java 5. There is a 
bug/feature/whatever in Java 6's compiler: it does not complain about 
@Override on interface implementations even with -source 1.5 -target 1.5 
(but it should, as @Override is not allowed there in Java 5).
- You had a merge relict (x somewhere).
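For anyone unfamiliar with the pitfall, here is a minimal case: a real Java 5 compiler rejects @Override on interface-method implementations, while javac 6 accepts it even with -source 1.5 -target 1.5. Class and method names below are invented purely for illustration:

```java
interface Greeter {
    String greet(String name);
}

public class SimpleGreeter implements Greeter {
    // Legal from Java 6 onward; rejected by javac 1.5, which only
    // allowed @Override when overriding a superclass method, not when
    // implementing an interface method. Compiling this file with a
    // Java 6 compiler and -source 1.5 does NOT catch the problem.
    @Override
    public String greet(String name) {
        return "Hello, " + name;
    }

    public static void main(String[] args) {
        System.out.println(new SimpleGreeter().greet("Tommaso")); // Hello, Tommaso
    }
}
```

Hence Uwe's advice: only an actual Java 5 compile run reveals these annotations after a merge from trunk.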

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: tomm...@apache.org [mailto:tomm...@apache.org]
 Sent: Friday, February 03, 2012 10:14 AM
 To: comm...@lucene.apache.org
 Subject: svn commit: r1240035 - in
 /lucene/dev/branches/branch_3x/lucene/src:
 java/org/apache/lucene/analysis/TypeTokenFilter.java
 test/org/apache/lucene/analysis/TestTypeTokenFilter.java
 
 Author: tommaso
 Date: Fri Feb  3 09:14:08 2012
 New Revision: 1240035
 
 URL: http://svn.apache.org/viewvc?rev=1240035&view=rev
 Log:
 [LUCENE-3744] - applied patch for whiteList usage in TypeTokenFilter
 
 Modified:
 
 lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/T
 ypeTokenFilter.java
 
 lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/analysis/T
 estTypeTokenFilter.java
 
 Modified:
 lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/T
 ypeTokenFilter.java
 URL:
 http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/TypeTokenFilter.java?rev=1240035&r1=1240034&r2=1240035&view=diff
 
 ==
 ---
 lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/T
 ypeTokenFilter.java (original)
 +++ lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/anal
 +++ ysis/TypeTokenFilter.java Fri Feb  3 09:14:08 2012
 @@ -29,17 +29,24 @@ public final class TypeTokenFilter exten
 
   private final Set<String> stopTypes;
private final TypeAttribute typeAttribute = 
 addAttribute(TypeAttribute.class);
 +  private final boolean useWhiteList;
 
 -  public TypeTokenFilter(boolean enablePositionIncrements, TokenStream
 input, Set<String> stopTypes) {
 +  public TypeTokenFilter(boolean enablePositionIncrements, TokenStream
 + input, Set<String> stopTypes, boolean useWhiteList) {
  super(enablePositionIncrements, input);
  this.stopTypes = stopTypes;
 +this.useWhiteList = useWhiteList;
 +  }
 +
 +  public TypeTokenFilter(boolean enablePositionIncrements, TokenStream
 input, Set<String> stopTypes) {
 +this(enablePositionIncrements, input, stopTypes, false);
}
 
/**
 -   * Returns the next input Token whose typeAttribute.type() is not a stop 
 type.
 +   * By default accept the token if its type is not a stop type.
 +   * When the useWhiteList parameter is set to true then accept the
 + token if its type is contained in the stopTypes
 */
@Override
protected boolean accept() throws IOException {
 -return !stopTypes.contains(typeAttribute.type());
 +return useWhiteList == stopTypes.contains(typeAttribute.type());
}
  }
 
 Modified:
 lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/analysis/T
 estTypeTokenFilter.java
 URL:
 http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/test/
 org/apache/lucene/analysis/TestTypeTokenFilter.java?rev=1240035r1=12400
 34r2=1240035view=diff
 
 ==
 ---
 lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/analysis/T
 estTypeTokenFilter.java (original)
 +++ lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/anal
 +++ ysis/TestTypeTokenFilter.java Fri Feb  3 09:14:08 2012
 @@ -23,9 +23,9 @@ import org.apache.lucene.analysis.tokena  import
 org.apache.lucene.analysis.tokenattributes.TypeAttribute;
  import org.apache.lucene.util.English;
 
 +import java.util.Collections;
  import java.io.IOException;
  import java.io.StringReader;
 -import java.util.Collections;
  import java.util.Set;
 
 
 @@ -81,6 +81,13 @@ public class TestTypeTokenFilter extends
  stpf.close();
}
 
 +  public void testTypeFilterWhitelist() throws IOException {
 +    StringReader reader = new StringReader("121 is palindrome, while 123 is not");
 +    Set<String> stopTypes = Collections.singleton("<NUM>");
 +    TokenStream stream = new TypeTokenFilter(true, new StandardTokenizer(TEST_VERSION_CURRENT, reader), stopTypes, true);
 +    assertTokenStreamContents(stream, new String[]{"121", "123"});
 + }
 +
// print debug info depending on VERBOSE
private static void log(String s) {
  if (VERBOSE) {
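The one-line change to accept() in the patch above encodes both modes in a single comparison: with useWhiteList false it behaves as the old blacklist (!contains), and with useWhiteList true it keeps only the listed types (contains). A standalone sketch of just that predicate, extracted from the filter for clarity:

```java
import java.util.Collections;
import java.util.Set;

public class AcceptRule {
    /**
     * Equivalent of the patched accept(): "useWhiteList == contains" is
     * "!contains" when useWhiteList is false (blacklist mode) and
     * "contains" when it is true (whitelist mode).
     */
    public static boolean accept(boolean useWhiteList,
                                 Set<String> stopTypes, String type) {
        return useWhiteList == stopTypes.contains(type);
    }

    public static void main(String[] args) {
        Set<String> numTypes = Collections.singleton("<NUM>");
        // blacklist mode: numeric tokens dropped, words kept
        System.out.println(accept(false, numTypes, "<NUM>")); // false
        System.out.println(accept(false, numTypes, "<ALPHANUM>")); // true
        // whitelist mode: only numeric tokens kept
        System.out.println(accept(true, numTypes, "<NUM>")); // true
    }
}
```

This matches the new test in the patch, where only the tokens "121" and "123" survive when whitelisting the "<NUM>" type.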



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: svn commit: r1240035 - in /lucene/dev/branches/branch_3x/lucene/src: java/org/apache/lucene/analysis/TypeTokenFilter.java test/org/apache/lucene/analysis/TestTypeTokenFilter.java

2012-02-04 Thread Uwe Schindler
One more thing:
Please merge changes from trunk to 3.x rather than just applying the patch 
twice. More info about the sometimes complicated merging (because of the move 
of some code parts to modules): http://wiki.apache.org/lucene-java/SvnMerge

I added the missing merge properties.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Saturday, February 04, 2012 6:51 PM
 To: dev@lucene.apache.org
 Cc: tommaso.teof...@gmail.com
 Subject: RE: svn commit: r1240035 - in
 /lucene/dev/branches/branch_3x/lucene/src:
 java/org/apache/lucene/analysis/TypeTokenFilter.java
 test/org/apache/lucene/analysis/TestTypeTokenFilter.java
 
 Hi Tommaso,
 
 As you are a new committer, please take care of the following:
 - The branch 3.x of Lucene/Solr must still compile and test with Java 5, so 
 after
 merging from trunk, run and compile all tests with Java 5. There is a
 bug/feature/whatever in Java 6's compiler that it does not complain about
 @Override on -source 1.5 -target 1.5 when added to interface
 implementations (but it should, as @Override is not allowed there in Java 5).
 - You had a merge relict (x somewhere).
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: tomm...@apache.org [mailto:tomm...@apache.org]
  Sent: Friday, February 03, 2012 10:14 AM
  To: comm...@lucene.apache.org
  Subject: svn commit: r1240035 - in
  /lucene/dev/branches/branch_3x/lucene/src:
  java/org/apache/lucene/analysis/TypeTokenFilter.java
  test/org/apache/lucene/analysis/TestTypeTokenFilter.java
 
  Author: tommaso
  Date: Fri Feb  3 09:14:08 2012
  New Revision: 1240035
 
  URL: http://svn.apache.org/viewvc?rev=1240035&view=rev
  Log:
  [LUCENE-3744] - applied patch for whiteList usage in TypeTokenFilter
 
  Modified:
 
  lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analys
  is/T
  ypeTokenFilter.java
 
  lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/analys
  is/T
  estTypeTokenFilter.java
 
  Modified:
  lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analys
  is/T
  ypeTokenFilter.java
  URL:
  http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/
  java
  /org/apache/lucene/analysis/TypeTokenFilter.java?rev=1240035r1=124003
  4
  r2=1240035view=diff
 
 
  ==
  ---
  lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analys
  is/T
  ypeTokenFilter.java (original)
  +++ lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/an
  +++ al ysis/TypeTokenFilter.java Fri Feb  3 09:14:08 2012
  @@ -29,17 +29,24 @@ public final class TypeTokenFilter exten
 
  private final Set<String> stopTypes;
 private final TypeAttribute typeAttribute =
  addAttribute(TypeAttribute.class);
  +  private final boolean useWhiteList;
 
  -  public TypeTokenFilter(boolean enablePositionIncrements,
  TokenStream input, Set<String> stopTypes) {
  +  public TypeTokenFilter(boolean enablePositionIncrements,
  + TokenStream input, Set<String> stopTypes, boolean useWhiteList) {
   super(enablePositionIncrements, input);
   this.stopTypes = stopTypes;
  +this.useWhiteList = useWhiteList;  }
  +
  +  public TypeTokenFilter(boolean enablePositionIncrements,
  + TokenStream
  input, Set<String> stopTypes) {
  +this(enablePositionIncrements, input, stopTypes, false);
 }
 
 /**
  -   * Returns the next input Token whose typeAttribute.type() is not a stop
 type.
  +   * By default accept the token if its type is not a stop type.
  +   * When the useWhiteList parameter is set to true then accept the
  + token if its type is contained in the stopTypes
  */
 @Override
 protected boolean accept() throws IOException {
  -return !stopTypes.contains(typeAttribute.type());
  +return useWhiteList == stopTypes.contains(typeAttribute.type());
 }
   }
 
  Modified:
  lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/analys
  is/T
  estTypeTokenFilter.java
  URL:
  http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/
  test/
  org/apache/lucene/analysis/TestTypeTokenFilter.java?rev=1240035r1=124
  00
  34r2=1240035view=diff
 
 
  ==
  ---
  lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/analys
  is/T
  estTypeTokenFilter.java (original)
  +++ lucene/dev/branches/branch_3x/lucene/src/test/org/apache/lucene/an
  +++ al ysis/TestTypeTokenFilter.java Fri Feb  3 09:14:08 2012
  @@ -23,9 +23,9 @@ import org.apache.lucene.analysis.tokena  import
  org.apache.lucene.analysis.tokenattributes.TypeAttribute;
   import org.apache.lucene.util.English;
 
  +import java.util.Collections;
   import java.io.IOException;
   import 

[jira] [Created] (SOLR-3096) Add book information to the new website

2012-02-04 Thread David Smiley (Created) (JIRA)
Add book information to the new website
---

 Key: SOLR-3096
 URL: https://issues.apache.org/jira/browse/SOLR-3096
 Project: Solr
  Issue Type: Task
Reporter: David Smiley
 Attachments: website_books.patch

The attached patch modifies the new website design to incorporate the book 
information.  It adds a header mantle slideshow entry with both book images 
(just the 2 current books), and it adds a book page with the 3 books published 
(this includes the 1st edition, which is out of date now).  The image files 
referenced are the same actual binary images as on the current website, but I 
chose a more consistent naming convention.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3096) Add book information to the new website

2012-02-04 Thread David Smiley (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-3096:
---

Attachment: website_books.patch

 Add book information to the new website
 ---

 Key: SOLR-3096
 URL: https://issues.apache.org/jira/browse/SOLR-3096
 Project: Solr
  Issue Type: Task
Reporter: David Smiley
 Attachments: website_books.patch


 The attached patch modifies the new website design to incorporate the book 
 information.  It adds a header mantle slideshow entry with both book images 
 (just the 2 current books), and it adds a book page with the 3 books 
 published (this includes the 1st edition, which is out of date now).  The image 
 files referenced are the same actual binary images as on the current website, 
 but I chose a more consistent naming convention.




[jira] [Created] (LUCENE-3750) Convert Versioned docs to Markdown/New CMS

2012-02-04 Thread Grant Ingersoll (Created) (JIRA)
Convert Versioned docs to Markdown/New CMS
--

 Key: LUCENE-3750
 URL: https://issues.apache.org/jira/browse/LUCENE-3750
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Priority: Minor


Since we are moving our main site to the ASF CMS (LUCENE-2748), we should bring 
in any new versioned Lucene docs into the same format so that we don't have to 
deal w/ Forrest anymore.




[jira] [Updated] (LUCENE-3749) Similarity.java javadocs and simplifications for 4.0

2012-02-04 Thread Robert Muir (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3749:


Attachment: LUCENE-3749_part2.patch

Here's part2: nuking SimilarityProvider (instead use PerFieldSimilarityWrapper 
if you want special per-field stuff).

This really simplifies the APIs, especially for say a casual user who just 
wants to try out a new ranking model.

 Similarity.java javadocs and simplifications for 4.0
 

 Key: LUCENE-3749
 URL: https://issues.apache.org/jira/browse/LUCENE-3749
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 4.0
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-3749.patch, LUCENE-3749_part2.patch


 As part of adding additional scoring systems to lucene, we made a lower-level 
 Similarity
 and the existing stuff became e.g. TFIDFSimilarity which extends it.
 However, I always feel bad about the complexity introduced here (though I do 
 feel there
 are some excuses, that its a difficult challenge).
 In order to try to mitigate this, we also exposed an easier API 
 (SimilarityBase) on top of 
 it that makes some assumptions (and trades off some performance) to try to 
 provide something 
 consumable for e.g. experiments.
 Still, we can cleanup a few things with the low-level api: fix outdated 
 documentation and
 shoot for better/clearer naming etc.




[jira] [Commented] (SOLR-2802) Toolkit of UpdateProcessors for modifying document values

2012-02-04 Thread Commented

[ 
https://issues.apache.org/jira/browse/SOLR-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200620#comment-13200620
 ] 

Jan Høydahl commented on SOLR-2802:
---

Sweet :) You got there before me

 Toolkit of UpdateProcessors for modifying document values
 -

 Key: SOLR-2802
 URL: https://issues.apache.org/jira/browse/SOLR-2802
 Project: Solr
  Issue Type: New Feature
Reporter: Hoss Man
 Attachments: SOLR-2802_update_processor_toolkit.patch, 
 SOLR-2802_update_processor_toolkit.patch, 
 SOLR-2802_update_processor_toolkit.patch, 
 SOLR-2802_update_processor_toolkit.patch


 Frequently users ask questions where the answer is you could do it with an 
 UpdateProcessor, but the number of out-of-the-box UpdateProcessors is 
 generally lacking, and there aren't even very good base classes for the 
 common case of manipulating field values when adding documents.




Re: [jira] [Commented] (SOLR-3049) UpdateRequestProcessorChain for UIMA : runtimeParameters: not all types supported

2012-02-04 Thread Harshad Patil
I will try to find a better way.
I found this issue while using RegexAnnotator.

On Sat, Feb 4, 2012 at 1:55 PM, Tommaso Teofili (Commented) (JIRA)
j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/SOLR-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200362#comment-13200362
  ]

 Tommaso Teofili commented on SOLR-3049:
 ---

 Hi Harsh, I think there should be a more general way of mapping typed 
 parameters; I just need to dig a little deeper to find it.
 However, in the meantime I'll try and test your patch. Thanks!

 UpdateRequestProcessorChain for UIMA : runtimeParameters: not all types 
 supported
 -

                 Key: SOLR-3049
                 URL: https://issues.apache.org/jira/browse/SOLR-3049
             Project: Solr
          Issue Type: Bug
          Components: update
            Reporter: Harsh P
            Priority: Minor
              Labels: uima, update_request_handler
         Attachments: SOLR-3049.patch


 The solrconfig.xml file has an option to override certain UIMA runtime
 parameters in the UpdateRequestProcessorChain section.
 There are certain UIMA annotators like RegexAnnotator which define
 runtimeParameters value as an Array which is not currently supported
 in the Solr-UIMA interface.
 In java/org/apache/solr/uima/processor/ae/OverridingParamsAEProvider.java, the
 private method Object getRuntimeValue(AnalysisEngineDescription desc, String
 attributeName) defines the overrides for UIMA analysis engine
 runtimeParameters as they are passed to the UIMA Analysis Engine.
 runtimeParameters which are currently supported in the Solr-UIMA interface 
 are:
  String
  Integer
  Boolean
  Float
 I have made a hack to fix this issue to add Array support. I would
 like to submit that as a patch if no one else is working on fixing
 this issue.
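The type dispatch described above, extended with the array case, can be sketched as follows. This is an illustrative Python model, not the actual Solr-UIMA Java code; the "Type[]" notation and comma-separated array values are assumptions for illustration.

```python
# Illustrative sketch of coercing a configured override string to the
# type a UIMA analysis-engine parameter declares. Scalar handling
# mirrors the four supported types listed above; the "Type[]" branch
# models the array support the patch adds.
def coerce_runtime_value(declared_type, raw):
    scalars = {
        "String": str,
        "Integer": int,
        "Float": float,
        "Boolean": lambda s: s.strip().lower() == "true",
    }
    if declared_type.endswith("[]"):
        # Array case: coerce each comma-separated element to the
        # declared element type.
        elem = scalars[declared_type[:-2]]
        return [elem(part.strip()) for part in raw.split(",")]
    return scalars[declared_type](raw)
```

A usage example: `coerce_runtime_value("Float[]", "1.5, 2.5")` yields a list of floats, whereas the scalar types behave exactly as before the change.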




[jira] [Updated] (LUCENE-3726) Default KuromojiAnalyzer to use search mode

2012-02-04 Thread Christian Moen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3726:
---

Attachment: LUCENE-3726.patch

 Default KuromojiAnalyzer to use search mode
 ---

 Key: LUCENE-3726
 URL: https://issues.apache.org/jira/browse/LUCENE-3726
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3726.patch, kuromojieval.tar.gz


 Kuromoji supports an option to segment text in a way more suitable for search,
 by preventing long compound nouns from becoming indexing terms.
 In general 'how you segment' can be important depending on the application 
 (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this 
 in Chinese)
 The current algorithm punishes the cost based on some parameters 
 (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc)
 for long runs of kanji.
 Some questions (these can be separate future issues if any useful ideas come 
 out):
 * should these parameters continue to be static-final, or configurable?
 * should POS also play a role in the algorithm (can/should we refine exactly 
 what we decompound)?
 * is the Tokenizer the best place to do this, or should we do it in a 
 tokenfilter? or both?
   with a tokenfilter, one idea would be to also preserve the original 
 indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
   from my understanding this tends to help with noun compounds in other 
 languages, because IDF of the original term boosts 'exact' compound matches.
   but does a tokenfilter provide the segmenter enough 'context' to do this 
 properly?
 Either way, I think as a start we should turn on what we have by default: it's 
 likely a very easy win.
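The overlapping-token idea above can be sketched as a minimal model. This is an illustrative Python sketch of token positions only, assuming a `split` function; it is not Lucene's actual TokenStream/TokenFilter API.

```python
# Sketch of the overlap idea: emit the decompounded parts, then the
# original compound at the same position as the last part (position
# increment 0), so an exact compound query still matches and benefits
# from the original term's IDF.
def decompound_with_overlap(tokens, split):
    out = []  # list of (term, position_increment) pairs
    for tok in tokens:
        parts = split(tok)
        if len(parts) > 1:
            out.extend((p, 1) for p in parts)  # AB, CD each advance a position
            out.append((tok, 0))               # ABCD overlaps with posInc=0
        else:
            out.append((tok, 1))
    return out
```

With a splitter that breaks "ABCD" into "AB" and "CD", the output stream is AB, CD, and ABCD at position increment 0, matching the ABCD -> AB, CD, ABCD(posInc=0) example in the issue.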




[jira] [Updated] (LUCENE-3726) Default KuromojiAnalyzer to use search mode

2012-02-04 Thread Christian Moen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3726:
---

Attachment: LUCENE-3726.patch

 Default KuromojiAnalyzer to use search mode
 ---

 Key: LUCENE-3726
 URL: https://issues.apache.org/jira/browse/LUCENE-3726
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3726.patch, LUCENE-3726.patch, kuromojieval.tar.gz


 Kuromoji supports an option to segment text in a way more suitable for search,
 by preventing long compound nouns from becoming indexing terms.
 In general 'how you segment' can be important depending on the application 
 (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this 
 in Chinese)
 The current algorithm punishes the cost based on some parameters 
 (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc)
 for long runs of kanji.
 Some questions (these can be separate future issues if any useful ideas come 
 out):
 * should these parameters continue to be static-final, or configurable?
 * should POS also play a role in the algorithm (can/should we refine exactly 
 what we decompound)?
 * is the Tokenizer the best place to do this, or should we do it in a 
 tokenfilter? or both?
   with a tokenfilter, one idea would be to also preserve the original 
 indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
   from my understanding this tends to help with noun compounds in other 
 languages, because IDF of the original term boosts 'exact' compound matches.
   but does a tokenfilter provide the segmenter enough 'context' to do this 
 properly?
 Either way, I think as a start we should turn on what we have by default: it's 
 likely a very easy win.




[jira] [Updated] (LUCENE-3726) Default KuromojiAnalyzer to use search mode

2012-02-04 Thread Christian Moen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3726:
---

Attachment: LUCENE-3726.patch

 Default KuromojiAnalyzer to use search mode
 ---

 Key: LUCENE-3726
 URL: https://issues.apache.org/jira/browse/LUCENE-3726
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3726.patch, LUCENE-3726.patch, LUCENE-3726.patch, 
 kuromojieval.tar.gz


 Kuromoji supports an option to segment text in a way more suitable for search,
 by preventing long compound nouns from becoming indexing terms.
 In general 'how you segment' can be important depending on the application 
 (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this 
 in Chinese)
 The current algorithm punishes the cost based on some parameters 
 (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc)
 for long runs of kanji.
 Some questions (these can be separate future issues if any useful ideas come 
 out):
 * should these parameters continue to be static-final, or configurable?
 * should POS also play a role in the algorithm (can/should we refine exactly 
 what we decompound)?
 * is the Tokenizer the best place to do this, or should we do it in a 
 tokenfilter? or both?
   with a tokenfilter, one idea would be to also preserve the original 
 indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
   from my understanding this tends to help with noun compounds in other 
 languages, because IDF of the original term boosts 'exact' compound matches.
   but does a tokenfilter provide the segmenter enough 'context' to do this 
 properly?
 Either way, I think as a start we should turn on what we have by default: it's 
 likely a very easy win.




[jira] [Commented] (LUCENE-3726) Default KuromojiAnalyzer to use search mode

2012-02-04 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200654#comment-13200654
 ] 

Christian Moen commented on LUCENE-3726:


The latest attached patch introduces a default mode in {{Segmenter}}, which is 
now {{Mode.SEARCH}}.

This mode is used by {{KuromojiAnalyzer}} in Lucene without further code 
changes.  The Solr factory duplicated the default mode, but now retrieves it 
from {{Segmenter}}.  This way, we set the default mode for both Solr and Lucene 
in a single place (in {{Segmenter}}), which I find cleaner.

I've also moved some constructors around in {{Segmenter}} and did some minor 
formatting/style changes.
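The single-place default described above can be sketched as follows. This is an illustrative Python model whose class names mirror the Java {{Segmenter}} and {{KuromojiAnalyzer}}; it is not the actual Kuromoji code.

```python
# Sketch of keeping the default mode in one place: Segmenter owns the
# default; the analyzer (and the Solr factory) read it from there
# instead of duplicating the constant.
class Segmenter:
    DEFAULT_MODE = "SEARCH"  # models Mode.SEARCH as the shared default

    def __init__(self, mode=None):
        self.mode = mode if mode is not None else Segmenter.DEFAULT_MODE

class KuromojiAnalyzer:
    def __init__(self):
        # No mode duplicated here: the Segmenter default applies.
        self.segmenter = Segmenter()
```

Changing `Segmenter.DEFAULT_MODE` then changes the default for every consumer at once, which is the maintenance benefit the comment describes.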

 Default KuromojiAnalyzer to use search mode
 ---

 Key: LUCENE-3726
 URL: https://issues.apache.org/jira/browse/LUCENE-3726
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3726.patch, LUCENE-3726.patch, LUCENE-3726.patch, 
 kuromojieval.tar.gz


 Kuromoji supports an option to segment text in a way more suitable for search,
 by preventing long compound nouns from becoming indexing terms.
 In general 'how you segment' can be important depending on the application 
 (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this 
 in Chinese)
 The current algorithm punishes the cost based on some parameters 
 (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc)
 for long runs of kanji.
 Some questions (these can be separate future issues if any useful ideas come 
 out):
 * should these parameters continue to be static-final, or configurable?
 * should POS also play a role in the algorithm (can/should we refine exactly 
 what we decompound)?
 * is the Tokenizer the best place to do this, or should we do it in a 
 tokenfilter? or both?
   with a tokenfilter, one idea would be to also preserve the original 
 indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
   from my understanding this tends to help with noun compounds in other 
 languages, because IDF of the original term boosts 'exact' compound matches.
   but does a tokenfilter provide the segmenter enough 'context' to do this 
 properly?
 Either way, I think as a start we should turn on what we have by default: it's 
 likely a very easy win.




[jira] [Created] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr

2012-02-04 Thread Christian Moen (Created) (JIRA)
Align default Japanese configurations for Lucene and Solr
-

 Key: LUCENE-3751
 URL: https://issues.apache.org/jira/browse/LUCENE-3751
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


The {{KuromojiAnalyzer}} in Lucene should have the same default configuration as 
the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.




[jira] [Updated] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

2012-02-04 Thread Christian Moen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3745:
---

Attachment: LUCENE-3745.patch

 Need stopwords and stoptags lists for default Japanese configuration
 

 Key: LUCENE-3745
 URL: https://issues.apache.org/jira/browse/LUCENE-3745
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Christian Moen
 Attachments: LUCENE-3745.patch, filter_stoptags.py, top-10.txt, 
 top-100-pos.txt, top-pos.txt


 Stopwords and stoptags lists for Japanese need to be developed, tested, and 
 integrated into Lucene.




[jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration

2012-02-04 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200680#comment-13200680
 ] 

Christian Moen commented on LUCENE-3745:


Please find a patch attached.

I've made {{stoptags.txt}} lighter by not stopping all prefixes and also 
allowing auxiliary verbs and interjections to pass.  I didn't come across any 
occurrences of unclassified symbols (記号) in Wikipedia, but the category is now 
stopped, as that seems to align better with our overall stop approach for symbols.

Many of the most frequent terms that now pass have been re-introduced in 
{{stopwords.txt}} so they are stopped using a {{StopFilter}} instead of 
{{KuromojiPartOfSpeechStopFilter}}.  I believe this configuration is more 
balanced.

Overall, I've used the attached term frequencies as a governing guideline 
for what to introduce into {{stopwords.txt}}.  It mostly contains hiragana 
words and expressions and I've deliberately left out common kanji as I'd like 
to keep the stopping fairly light.

I'll create a separate JIRA for introducing stopwords and stoptags to Solr.
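The two-layer stopping described above can be sketched as follows. This is an illustrative Python model; the sample tags and words below are invented for illustration, not the shipped stoptags/stopwords lists.

```python
# Sketch of the two-layer stopping: a part-of-speech filter drops whole
# grammatical classes (stoptags, as in KuromojiPartOfSpeechStopFilter),
# then a word-level stop filter drops individual frequent terms
# (stopwords, as in StopFilter).
def apply_stop_layers(tagged_tokens, stoptags, stopwords):
    # First layer: remove tokens whose part-of-speech tag is stopped.
    kept = [(term, pos) for term, pos in tagged_tokens if pos not in stoptags]
    # Second layer: remove individual stopped words.
    return [(term, pos) for term, pos in kept if term not in stopwords]
```

Moving a frequent term from the POS layer to the word layer (as the patch does) stops that one word without stopping its entire grammatical class, which is what makes the configuration more balanced.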

 Need stopwords and stoptags lists for default Japanese configuration
 

 Key: LUCENE-3745
 URL: https://issues.apache.org/jira/browse/LUCENE-3745
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Christian Moen
 Attachments: LUCENE-3745.patch, filter_stoptags.py, top-10.txt, 
 top-100-pos.txt, top-pos.txt


 Stopwords and stoptags lists for Japanese need to be developed, tested, and 
 integrated into Lucene.




[jira] [Created] (SOLR-3097) Introduce default Japanese stoptags and stopwords to Solr's example configuration

2012-02-04 Thread Christian Moen (Created) (JIRA)
Introduce default Japanese stoptags and stopwords to Solr's example 
configuration
-

 Key: SOLR-3097
 URL: https://issues.apache.org/jira/browse/SOLR-3097
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen


SOLR-3056 discusses introducing a default field type {{text_ja}} for Japanese 
in {{schema.xml}}.  This configuration will be improved by also introducing 
default stopwords and stoptags configuration for the field type.  

I believe this configuration should be easily available and tunable to Solr 
users and I'm proposing that we introduce the same stopwords and stoptags 
provided in LUCENE-3745 to Solr example configuration.  I'm proposing that 
files can live in {{solr/example/solr/conf}} as {{stopwords_ja.txt}} and 
{{stoptags_ja.txt}} alongside {{stopwords_en.txt}} for English.  (Longer term, 
I think we should reconsider our overall approach to this across all languages, 
but that's perhaps a separate discussion.)





[jira] [Updated] (SOLR-3097) Introduce default Japanese stoptags and stopwords to Solr's example configuration

2012-02-04 Thread Christian Moen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated SOLR-3097:
-

Attachment: SOLR-3097.patch

 Introduce default Japanese stoptags and stopwords to Solr's example 
 configuration
 -

 Key: SOLR-3097
 URL: https://issues.apache.org/jira/browse/SOLR-3097
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3097.patch


 SOLR-3056 discusses introducing a default field type {{text_ja}} for Japanese 
 in {{schema.xml}}.  This configuration will be improved by also introducing 
 default stopwords and stoptags configuration for the field type.  
 I believe this configuration should be easily available and tunable to Solr 
 users and I'm proposing that we introduce the same stopwords and stoptags 
 provided in LUCENE-3745 to Solr example configuration.  I'm proposing that 
 files can live in {{solr/example/solr/conf}} as {{stopwords_ja.txt}} and 
 {{stoptags_ja.txt}} alongside {{stopwords_en.txt}} for English.  (Longer 
 term, I think we should reconsider our overall approach to this across all 
 languages, but that's perhaps a separate discussion.)




[jira] [Commented] (SOLR-3097) Introduce default Japanese stoptags and stopwords to Solr's example configuration

2012-02-04 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200686#comment-13200686
 ] 

Christian Moen commented on SOLR-3097:
--

Patch for {{trunk}} and {{branch_3x}} attached.

 Introduce default Japanese stoptags and stopwords to Solr's example 
 configuration
 -

 Key: SOLR-3097
 URL: https://issues.apache.org/jira/browse/SOLR-3097
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3097.patch


 SOLR-3056 discusses introducing a default field type {{text_ja}} for Japanese 
 in {{schema.xml}}.  This configuration will be improved by also introducing 
 default stopwords and stoptags configuration for the field type.  
 I believe this configuration should be easily available and tunable to Solr 
 users and I'm proposing that we introduce the same stopwords and stoptags 
 provided in LUCENE-3745 to Solr example configuration.  I'm proposing that 
 files can live in {{solr/example/solr/conf}} as {{stopwords_ja.txt}} and 
 {{stoptags_ja.txt}} alongside {{stopwords_en.txt}} for English.  (Longer 
 term, I think we should reconsider our overall approach to this across all 
 languages, but that's perhaps a separate discussion.)
