[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-06-28 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508805
 ] 

Grant Ingersoll commented on LUCENE-848:


OK, it looks like that one went through using wget.  I think I will commit, 
since something must have been wrong on my network side.

 Add supported for Wikipedia English as a corpus in the benchmarker stuff
 

 Key: LUCENE-848
 URL: https://issues.apache.org/jira/browse/LUCENE-848
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/benchmark
Reporter: Steven Parkes
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, 
 LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, 
 WikipediaHarvester.java, xerces.jar, xerces.jar, xml-apis.jar


 Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-28 Thread Mathieu Lecarme
Any news about the integration of this patch?

M.

Mathieu Lecarme (JIRA) wrote:
  [ 
 https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 Mathieu Lecarme updated LUCENE-906:
 ---

 Attachment: elision-0.2.patch

 All suggested corrections are done.

   
 Elision filter for simple french analyzing
 --

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme
 Attachments: elision-0.2.patch, elision.patch


 If you don't want to use stemming, StandardAnalyzer misses some French 
 quirks such as elision.
 l'avion, which means "the plane", must be tokenized as avion (plane).
 This filter could be used with other Latin languages where elision exists.
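
A minimal standalone sketch of the elision handling described above. It is
not the code from elision-0.2.patch and does not use the Lucene TokenFilter
API; the class name, method name, and the list of elided articles are all
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; not the filter attached to LUCENE-906. */
final class ElisionSketch {
    // Assumed set of elided French articles and pronouns; the patch may use a different list.
    private static final Set<String> ARTICLES =
        new HashSet<>(Arrays.asList("l", "m", "t", "qu", "n", "s", "j", "d", "c"));

    /** Drops a leading elided article ("l'", "qu'", ...) from a single token. */
    static String stripElision(String token) {
        // Look for the ASCII apostrophe first, then the typographic one (U+2019).
        int pos = token.indexOf('\'');
        if (pos < 0) {
            pos = token.indexOf('\u2019');
        }
        // Only strip when something precedes and follows the apostrophe.
        if (pos > 0 && pos < token.length() - 1) {
            String prefix = token.substring(0, pos).toLowerCase(Locale.ROOT);
            if (ARTICLES.contains(prefix)) {
                return token.substring(pos + 1);
            }
        }
        // Tokens such as "aujourd'hui" are left alone because the prefix is not an article.
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l'avion"));
        System.out.println(stripElision("aujourd'hui"));
    }
}
```

In an analyzer chain, a step like this would presumably run on each token
after tokenization, so that "l'avion" and "avion" index to the same term.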
 

   





[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-06-28 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508830
 ] 

Grant Ingersoll commented on LUCENE-848:


I take back my promise to commit, I am getting (after processing 189500 docs):
 [java] Error: cannot execute the algorithm! term out of order 
(docid:disrs.compareTo(docname:disregardle

*Ar) = 0)
 [java] org.apache.lucene.index.CorruptIndexException: term out of order 
(docid:disrs.compareTo(docname:disregardle

  *Ar) = 0)
 [java] at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:102)
 [java] at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:332)
 [java] at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:297)
 [java] at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:261)
 [java] at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
 [java] at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1883)
 [java] at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:1811)
 [java] at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1742)
 [java] at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1733)
 [java] at org.apache.lucene.index.IndexWriter.maybeFlushRamSegments(IndexWriter.java:1727)
 [java] at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1004)
 [java] at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
 [java] at org.apache.lucene.benchmark.byTask.tasks.AddDocTask.doLogic(AddDocTask.java:74)
 [java] at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:83)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
 [java] at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:90)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
 [java] at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:90)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
 [java] at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:90)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
 [java] at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
 [java] at org.apache.lucene.benchmark.byTask.utils.Algorithm.execute(Algorithm.java:228)
 [java] at org.apache.lucene.benchmark.byTask.Benchmark.execute(Benchmark.java:72)
 [java] at org.apache.lucene.benchmark.byTask.Benchmark.main(Benchmark.java:108)
 [java] 
 [java] ###  D O N E !!! ###
 [java] 


Can you reproduce this?  It seems like an actual issue with core.
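
For readers unfamiliar with the error: the trace shows the exception raised
from TermInfosWriter.add during a segment merge, which expects terms to
arrive in sorted order. The sketch below illustrates that kind of invariant
in isolation. It is not Lucene's implementation; the class and method names
are hypothetical, the term texts are placeholders, and it assumes terms are
ordered by field name first and then by text.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a sorted-term invariant; not Lucene's TermInfosWriter. */
final class TermOrderSketch {
    /** A (field, text) pair ordered by field first, then by text. */
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    /**
     * Returns the index of the first term that is not strictly greater
     * than its predecessor, or -1 if the whole list is in order.
     */
    static int firstOutOfOrder(List<Term> terms) {
        for (int i = 1; i < terms.size(); i++) {
            if (terms.get(i).compareTo(terms.get(i - 1)) <= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Placeholder texts; a "docid" term following a "docname" term violates
        // the ordering because "docid" sorts before "docname".
        List<Term> terms = Arrays.asList(
            new Term("docname", "a"),
            new Term("docid", "b"));
        int bad = firstOutOfOrder(terms);
        System.out.println(bad < 0 ? "in order" : "out of order at " + terms.get(bad));
    }
}
```

If the message above reports the new term first and the previous term
second, then a docid term was being written after a docname term, which is
the situation the placeholder data imitates.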





Stephen Hussey is out of the office.

2007-06-28 Thread Stephen Hussey

I will be out of the office starting  06/28/2007 and will not return until
07/02/2007.

I will respond to your message when I return.

[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-06-28 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508833
 ] 

Steven Parkes commented on LUCENE-848:
--

Trying to reproduce now.

Something that came up while restarting the fetch/decompress/etc. was the 
number of files this procedure creates. It's a lot: one for each article. I 
used the existing benchmark code for doing this stuff but perhaps it's not a 
good idea on this scale? For one thing, it kinda kills ant, since ant wants to 
walk the subtrees for some of its tasks. We need to exclude the work and temp 
directories from ant's walks, and/or we should come up with something better 
than one file per article.

I think Mike mentioned not doing the one file per article. I'll try to look at 
that ...
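
One possible alternative to one file per article, sketched under
assumptions: keep every article in a single line-oriented file, one article
per line, with title and body separated by a tab. Nothing here comes from
the attached patch; the class name, the tab-separated layout, and the
String[]{title, body} representation are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: many articles in one file instead of one file each. */
final class LineFileSketch {
    /** Replaces tabs and line breaks so each article fits on one line. */
    static String clean(String s) {
        return s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ');
    }

    /** Writes each {title, body} pair as "title<TAB>body" on its own line. */
    static void write(Path out, List<String[]> articles) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String[] article : articles) {
                w.write(clean(article[0]));
                w.write('\t');
                w.write(clean(article[1]));
                w.write('\n');
            }
        }
    }

    /** Reads the file back into {title, body} pairs. */
    static List<String[]> read(Path in) throws IOException {
        List<String[]> result = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                result.add(line.split("\t", 2));
            }
        }
        return result;
    }

    /** Writes to a temporary file, reads it back, and deletes the file. */
    static List<String[]> roundTrip(List<String[]> articles) {
        try {
            Path tmp = Files.createTempFile("articles", ".tsv");
            try {
                write(tmp, articles);
                return read(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String[]> back = roundTrip(Arrays.asList(
            new String[] {"Title A", "first line\nsecond line"},
            new String[] {"Title B", "body"}));
        for (String[] article : back) {
            System.out.println(article[0] + " -> " + article[1]);
        }
    }
}
```

With a layout like this, the work directory holds one file no matter how
many articles are extracted, so directory walks stay cheap. The cost is that
line breaks inside an article are flattened to spaces.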





[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-06-28 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508922
 ] 

Doron Cohen commented on LUCENE-848:


Steven wrote:
 I think Mike mentioned not doing the one file per article. I'll try to look 
 at that ...

Perhaps also (re)consider the compress and add on-the-fly approach, similar 
to what TrecDocMaker is doing?

Grant wrote:
 I take back my promise to commit, I am getting (after processing 189500 
 docs): 
[java] Error: cannot execute the algorithm! term out of order 
 (docid:disrs.compareTo(docname:disregardle 
   
  *Ar) = 0) 
   [java] org.apache.lucene.index.CorruptIndexException: term out of order 
 (docid:disrs.compareTo(docname:disregardle 
   
   *Ar) = 0) 

Just to verify that it is not a benchmark issue, could you also post here the 
executed algorithm (as printed, or, if not printed, the actual file)...?





Re: [jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-06-28 Thread Grant Ingersoll


On Jun 28, 2007, at 3:47 PM, Doron Cohen (JIRA) wrote:



[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508922 ]


Doron Cohen commented on LUCENE-848:


Steven wrote:
I think Mike mentioned not doing the one file per article. I'll  
try to look at that ...


Perhaps also (re)consider the compress and add on-the-fly  
approach, similar to what TrecDocMaker is doing?


Grant wrote:
I take back my promise to commit, I am getting (after processing  
189500 docs):
   [java] Error: cannot execute the algorithm! term out of order  
(docid:disrs.compareTo(docname:disregardle
  
  *Ar) = 0)
  [java] org.apache.lucene.index.CorruptIndexException: term out  
of order (docid:disrs.compareTo(docname:disregardle
  
   *Ar) = 0)


Just to verify that it is not a benchmark issue, could you also  
post here the executed algorithm (as printed, or, if not printed,  
the actual file)...?


It is the one in the patch.  I ran ant enwiki







[jira] Resolved: (LUCENE-906) Elision filter for simple french analyzing

2007-06-28 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-906.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Patch applied, thanks.
I reformatted the code to match Lucene style.
I also put the Apache license on top of both files.

Thanks!


 Elision filter for simple french analyzing
 --

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme
Assignee: Otis Gospodnetic
 Attachments: elision-0.2.patch, elision.patch


 If you don't want to use stemming, StandardAnalyzer misses some French 
 quirks such as elision.
 l'avion, which means "the plane", must be tokenized as avion (plane).
 This filter could be used with other Latin languages where elision exists.




Build failed in Hudson: Lucene-Nightly #136

2007-06-28 Thread hudson
See http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/136/changes

Changes:

[otis] - LUCENE-906: Elision filter for French.

--
[...truncated 4178 lines...]
[junit] Writing files byte by byte
[junit] 934 total milliseconds to read, 8112 kb/s
[junit] 86 total milliseconds to delete even files
[junit] 86 total milliseconds to create, 88105 kb/s
[junit] 336 total milliseconds to read, 22550 kb/s
[junit] 80 total milliseconds to delete
[junit] 1438 total milliseconds
[junit] -  ---
   [delete] Deleting: 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ws/trunk/build/contrib/db/bdb-je/test/junitfailed.flag
 

check-1-5:

init:

extBuildPath:
 [echo] Preparing build path for external dependencies

download:

compile-core:
 [echo] Building gdata-core...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

compile-core:
 [echo] Use gdata - compile-core task 
[javac] Compiling 5 source files to 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ws/trunk/build/contrib/gdata-server/core/classes/java
 
 [echo] Building hivemind...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:
 [echo] Building gdata-gom...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:

compile-test:
 [echo] Building gdata-core...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

compile-core:
 [echo] Use gdata - compile-core task 
[javac] Compiling 5 source files to 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ws/trunk/build/contrib/gdata-server/core/classes/java
 

compile-test:
 [echo] Use gdata - compile-test task 
 [echo] Building hivemind...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

compile-test:
 [echo] Building hivemind...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
 [echo] Building gdata-gom...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

compile-test:
 [echo] Building gdata-gom...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:

test:
 [echo] Building gdata-core...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

test:
 [echo] Building gdata-core...

javacc-uptodate-check:

javacc-notice:

common.init:

build-lucene:

init:

compile-core:
 [echo] Use gdata - compile-core task 
[javac] Compiling 5 source files to 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ws/trunk/build/contrib/gdata-server/core/classes/java
 

compile-test:
 [echo] Use gdata - compile-test task 

common.test:
[mkdir] Created dir: 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ws/trunk/build/contrib/gdata-server/core/test
 
[junit] Testsuite: org.apache.lucene.gdata.data.TestGDataUser
[junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.205 sec

[junit] Testsuite: org.apache.lucene.gdata.search.TestStandardGdataSearcher
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.434 sec

[junit] Testsuite: 
org.apache.lucene.gdata.search.analysis.TestContentStrategy
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.189 sec

[junit] Testsuite: org.apache.lucene.gdata.search.analysis.TestDomIndexable
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.391 sec

[junit] - Standard Error -
[junit] Jun 29, 2007 2:37:18 AM 
org.apache.lucene.gdata.server.registry.GDataServerRegistry registerScopeVisitor
[junit] INFO: Register scope visitor -- class 
org.apache.lucene.gdata.server.registry.ProvidedServiceConfig
[junit] -  ---
[junit] Testsuite: 
org.apache.lucene.gdata.search.analysis.TestGdataCategoryStrategy
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.231 sec

[junit] Testsuite: 
org.apache.lucene.gdata.search.analysis.TestGdataDateStrategy
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.248 sec

[junit] Testsuite: org.apache.lucene.gdata.search.analysis.TestHTMLStrategy
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.272 sec

[junit] Testsuite: 
org.apache.lucene.gdata.search.analysis.TestKeywordStrategy
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.217 sec

[junit] Testsuite: org.apache.lucene.gdata.search.analysis.TestMixedStrategy
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.284 sec

[junit] Testsuite: