[ 
https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726507#action_12726507
 ] 

Mark Miller commented on LUCENE-1730:
-------------------------------------

I think it makes sense for the default encoding to be the one that TREC 
typically/always uses, but we should probably make this configurable from the 
alg file. We don't want to be locked into a single input encoding. That could 
be done in a separate issue, though. We should allow the same for the other 
content sources as well.
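
A rough sketch of what "configurable with a fixed default" could look like, 
using plain java.util.Properties as a stand-in for the values read from the 
alg file. The property name and the class are assumptions made up for 
illustration, not the benchmark package's actual API, and the code uses a 
newer Java than 2.9 targets:

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Properties;

public class ConfigurableEncodingSketch {
    // Assumed property name, for illustration only.
    static final String ENCODING_KEY = "content.source.encoding";
    // Default proposed in the issue for the TREC collections.
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Returns the configured charset, or the fixed default when the key is absent.
    static Charset resolveCharset(Properties props) {
        return Charset.forName(props.getProperty(ENCODING_KEY, DEFAULT_ENCODING));
    }

    // Wraps the raw stream in a reader that never depends on the platform default.
    static BufferedReader open(InputStream in, Properties props) {
        return new BufferedReader(new InputStreamReader(in, resolveCharset(props)));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolveCharset(props).name());
        props.setProperty(ENCODING_KEY, "UTF-8");
        System.out.println(resolveCharset(props).name());
    }
}
```

The same lookup could be shared by the other content sources, so that each 
one only supplies its own default.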

> TrecContentSource should use a fixed encoding, rather than system dependent
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1730
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: LUCENE-1730.patch
>
>
> TrecContentSource opens InputStreamReader w/o a fixed encoding. On Windows, 
> this means CP1252 (at least on my machine) which is ok. However, when I 
> opened it on a Linux machine w/ a default of UTF-8, it failed to read the 
> files. The patch changes it to use ISO-8859-1, which seems to be the right 
> one (and http://mg4j.dsi.unimi.it/man/manual/ch01s04.html mentions this 
> encoding in its example of a script which reads the data).
> Patch to follow shortly.
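
For illustration, a minimal sketch of the difference the description is 
about: the same bytes decoded with an explicit charset versus a different 
one. The class and method names are hypothetical and this is not the code 
from the attached patch. It only shows that passing the charset to 
InputStreamReader makes the result independent of the platform default:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixedEncodingSketch {
    // Decodes all bytes through an InputStreamReader with an explicit charset.
    static String readAll(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is U+00E9 in ISO-8859-1 but not valid UTF-8 on its own.
        byte[] bytes = {'c', 'a', 'f', (byte) 0xE9};
        String latin1 = readAll(bytes, StandardCharsets.ISO_8859_1);
        String utf8 = readAll(bytes, StandardCharsets.UTF_8);
        System.out.println(latin1.equals("caf\u00e9")); // ISO-8859-1 maps each byte to one char
        System.out.println(latin1.equals(utf8));        // the UTF-8 decoding differs
    }
}
```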
