[ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796066#action_12796066
 ] 

Robert Muir commented on LUCENE-2181:
-------------------------------------

bq. I do have one concern, though: the LineDocSource parser doesn't know how to 
handle comments, so these four files don't have Apache2 license declarations in 
them. We should put a README (or something like it) with these files to 
indicate the license.

Are they really Apache-licensed, or are they derived from Wikipedia content? If 
these files are only downloaded when you run 'ant benchmark' for collation, 
then it is just like the enwiki task in benchmark, which downloads some huge 
Wikipedia data and runs it. Someone please correct me if I am wrong, but I 
don't think we should be putting Apache license headers in these files anyway: 
as with the benchmark enwiki task, we are not shipping them with our source 
distribution.

bq. Different subject: I'm not sure where it would go, but the code I used to 
produce these top-TF wikipedia files may be useful to other people - where do 
you think it could live? An example, maybe?

Hmm, I will have to think about this... anyone got ideas? I think this would be 
useful too (I admit I have not yet looked at the implementation). Here are 
two examples:
* Karl could use this to evaluate his Swedish stemming improvements, taking 
frequency into account.
* The obvious use: when you need to build a stopword list, these top terms 
are where you want to start.
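
To make the stopword case concrete, here is a minimal sketch of picking the 
top terms by frequency. It is not the code used to produce the top-TF 
Wikipedia files (I have not looked at that yet), it does not use any Lucene 
API, and the class name TopTerms, the method topTerms, and the naive 
whitespace tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts tokens and returns the most frequent ones.
public class TopTerms {

    /** Returns up to n lowercased whitespace-separated tokens, most frequent first. */
    static List<String> topTerms(List<String> lines, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count; break ties alphabetically so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("the cat and the dog");
        lines.add("the end of the road and the sea");
        // Prints the two most frequent tokens as stopword candidates.
        System.out.println(topTerms(lines, 2));
    }
}
```

A real version would presumably read term frequencies from an index or from 
the line file rather than from an in-memory list, and the candidate list 
would still need manual review before being used as a stopword list.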

> benchmark for collation
> -----------------------
>
>                 Key: LUCENE-2181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2181
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: LUCENE-2181.patch.zip
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.


