laimis commented on issue #792:
URL: https://github.com/apache/lucenenet/issues/792#issuecomment-1544439765
I dug into this further, and it is rather bizarre. First, I confirmed that
the Lucene.NET codebase is not doing anything funky here and is not writing
those bytes explicitly. It is .NET Framework that writes a BOM out when
QueryParserTokenManager does this:
```csharp
temp_writer = new StreamWriter(Console.OpenStandardOutput(), Console.Out.Encoding);
temp_writer.AutoFlush = true;
```
In your code example, you set Console.OutputEncoding to Encoding.UTF8, and
that is what Console.Out.Encoding then returns. Setting AutoFlush to true
flushes the stream behind the scenes, and that flush writes the BOM out.
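To see the flush behavior in isolation, here is a minimal sketch that swaps the
console stream for a MemoryStream so the emitted bytes can be inspected. The
class and method names are made up for illustration and are not part of
Lucene.NET; the only real APIs used are StreamWriter, MemoryStream and
Encoding.UTF8.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomFlushDemo
{
    // Hypothetical helper: returns the bytes that reach the underlying stream
    // once AutoFlush is switched on, before any text has been written.
    public static byte[] BytesAfterEnablingAutoFlush(Encoding encoding)
    {
        var stream = new MemoryStream();
        var writer = new StreamWriter(stream, encoding);
        writer.AutoFlush = true; // the setter triggers a flush
        return stream.ToArray();
    }

    public static void Main()
    {
        byte[] bytes = BytesAfterEnablingAutoFlush(Encoding.UTF8);
        // Prints the captured bytes as hex, e.g. the UTF-8 preamble if one was emitted.
        Console.WriteLine(BitConverter.ToString(bytes));
    }
}
```

If the explanation above holds, the array should contain the three UTF-8
preamble bytes (EF BB BF) even though nothing was written to the writer.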
Why doesn't that happen in .NET 7 (the only .NET Core-era runtime I tried;
other versions may well behave the same way)? It appears that the framework
versions differ in how they handle this line:
`Console.OutputEncoding=Encoding.UTF8;`
I wrote a quick test that prints the Console.OutputEncoding and
Console.Out.Encoding properties before and after that assignment:
```csharp
Console.WriteLine("before Console.Out encoding: " + Console.Out.Encoding);
Console.WriteLine("before Console.OutputEncoding: " + Console.OutputEncoding);
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine("after Console.Out encoding: " + Console.Out.Encoding);
Console.WriteLine("after Console.OutputEncoding: " + Console.OutputEncoding);
```
On .NET Framework 4.8, here is the output on my machine:
> before Console.Out encoding: System.Text.SBCSCodePageEncoding
> before Console.OutputEncoding: System.Text.SBCSCodePageEncoding
> after Console.Out encoding: System.Text.UTF8Encoding
> after Console.OutputEncoding: System.Text.UTF8Encoding
Now .NET 7:
> before Console.Out encoding: System.Text.OSEncoding
> before Console.OutputEncoding: System.Text.OSEncoding
> after Console.Out encoding: System.Text.ConsoleEncoding
> after Console.OutputEncoding: System.Text.UTF8Encoding
In .NET 7, Console.Out.Encoding is not set to UTF8Encoding when you set
Console.OutputEncoding, and thus the BOM is not written out.
Really bizarre.
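The type names alone do not show whether an encoding would emit a BOM, so a
follow-up probe could print the preamble length as well. This is a sketch with
a made-up helper name (Describe); whether the .NET 7 ConsoleEncoding wrapper
reports an empty preamble is an assumption this probe is meant to check, not
something confirmed by the output above.

```csharp
using System;
using System.Text;

public static class EncodingProbe
{
    // Hypothetical helper: formats an encoding as
    // "<runtime type name>, preamble: <n> byte(s)".
    public static string Describe(Encoding encoding)
    {
        return encoding.GetType().FullName
            + ", preamble: " + encoding.GetPreamble().Length + " byte(s)";
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // Results depend on the runtime, so no expected output is given here.
        Console.WriteLine("Console.Out.Encoding:   " + Describe(Console.Out.Encoding));
        Console.WriteLine("Console.OutputEncoding: " + Describe(Console.OutputEncoding));
    }
}
```

A preamble length of 0 for Console.Out.Encoding on one runtime and 3 on the
other would line up with the BOM showing up only on .NET Framework.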
Anyway, I will keep this issue open because we can comment out the same code
that the Lucene Java version commented out, but at least we now know exactly
what's going on. You could argue that QueryParserTokenManager should use
Console.OutputEncoding instead of Console.Out.Encoding. But this is the only
place where it happens, and commenting the code out should close the chapter
on this.
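For anyone who wants to keep a debug writer instead of commenting it out, one
alternative is to construct the writer with an encoding that has no preamble.
The helper below is hypothetical and not part of Lucene.NET, and it hard-codes
UTF-8 instead of honoring the console's configured encoding, which may not be
acceptable in every setup.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomFreeWriterSketch
{
    // Hypothetical helper: wraps a stream in an auto-flushing UTF-8 writer
    // whose encoding is constructed without the UTF-8 identifier (BOM).
    public static StreamWriter Create(Stream output)
    {
        var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        return new StreamWriter(output, utf8NoBom) { AutoFlush = true };
    }

    public static void Main()
    {
        // A MemoryStream stands in for Console.OpenStandardOutput() so the
        // bytes can be inspected afterwards.
        var stream = new MemoryStream();
        StreamWriter writer = Create(stream);
        writer.Write("a");
        Console.WriteLine(BitConverter.ToString(stream.ToArray()));
    }
}
```

With this encoding, the stream should hold only the single byte for "a" and no
leading preamble.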