>If a method doesn't intrinsically require a String, then I prefer
>CharSequence. It's probable that sooner or later something is going to
>demand a String, but that's not a good reason to be "that guy" :-)
I lean towards using CharSequence when that makes sense too (i.e. suggesting we
are working on code points, and supporting implementations of CharSequence).
The tdebatty/java-string-similarity library works only with Strings, I think. Others
like LingPipe, ICU4J, Lucene, Apache Commons Text, and Apache OpenNLP use both
CharSequence and String.
Analysing the use of CharSequence and String could be an interesting idea for a
blog post, and could even raise some tickets to fix consistency in the API of
[text] or some other component/project.
>Also, wouldn't some sort of low-space-overhead string storage be a good fit
>for text?
Sounds interesting. Normally when I have an idea like that for [text] (or for
other projects/components) I first note it down somewhere (normally at
http://kinoshita.eti.br/todo/), and then file an issue like TEXT-71, TEXT-77,
TEXT-78, or TEXT-79, to start investigating it.
If you have some idea of how that could be implemented, or know of some
projects that do it, feel free to suggest it in a JIRA ticket, or start another
thread here on the mailing list.
Cheers
Bruno
From: Simon Spero
To: Commons Developers List
Sent: Tuesday, 20 June 2017 1:39 AM
Subject: CharSequence vs. String (was Re: [GitHub] commons-text pull request
#46: TEXT-85:Added CaseUtils class with camel case...)
On Jun 12, 2017 10:47 AM, "arunvinudss" wrote:
Github user arunvinudss commented on a diff in the pull request:
I am a bit biased towards using String instead of CharSequence. Yes,
CharSequence allows us to pass StringBuffers, StringBuilders and other types
as input, potentially increasing the scope of the function, but considering
the nature of the work we do in this particular method it may not necessarily
be a good idea. My basic contention is that the minute we call toString()
on a CharSequence to do any sort of manipulation it becomes a costly
operation and we may lose performance.
True if the particular CharSequence is not in fact an instance of String.
String::toString returns this.
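A minimal sketch of that point, using a hypothetical helper named toStringIsFree (not part of [text] or any library): for a String, toString() hands back the same object, while for a StringBuilder it has to allocate a new String.

```java
final class ToStringCostDemo {

    // Hypothetical helper: true when toString() returns the very same object,
    // i.e. no new String had to be allocated for this CharSequence.
    static boolean toStringIsFree(CharSequence cs) {
        return cs.toString() == cs;
    }

    public static void main(String[] args) {
        CharSequence alreadyString = "camelCase";
        CharSequence builder = new StringBuilder("camelCase");

        System.out.println(toStringIsFree(alreadyString)); // true
        System.out.println(toStringIsFree(builder));       // false
    }
}
```

So a method declared to take CharSequence costs nothing extra for callers who already hold a String.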
The bigger problem is that too many methods use String as a parameter or
return type, when CharSequence would serve just as well. This indeed
requires the invocation of Object::toString.
For methods that use String as the return type, changing the result to
CharSequence is source and binary incompatible, and properly so (since at
some point the user may actually need a String).
A generic method whose type parameter is bounded by CharSequence (T extends
CharSequence) can sometimes be useful, and can be added in addition to
methods taking String arguments, but can't replace them.
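As a rough illustration of adding a bounded generic method next to an existing String one, here is a sketch with a made-up helper, defaultIfEmpty, written for this example only. The String overload stays in place, so existing callers still bind to it, while other CharSequence types get their own type back.

```java
final class BoundedTypeDemo {

    // Existing String-based method, kept so current callers stay
    // source and binary compatible.
    static String defaultIfEmpty(String value, String fallback) {
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    // Added alongside it: works for any CharSequence and returns the
    // caller's own type rather than forcing a conversion to String.
    static <T extends CharSequence> T defaultIfEmpty(T value, T fallback) {
        return (value == null || value.length() == 0) ? fallback : value;
    }

    public static void main(String[] args) {
        // Binds to the String overload (it is the more specific one).
        String s = defaultIfEmpty("", "fallback");

        // Binds to the generic overload; the result is still a StringBuilder.
        StringBuilder sb = defaultIfEmpty(new StringBuilder(), new StringBuilder("fallback"));

        System.out.println(s);
        System.out.println(sb.append('!'));
    }
}
```

The two overloads erase to different signatures (String, String versus CharSequence, CharSequence), which is why they can coexist.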
There are some places in javac that have special treatment for String - for
example, the + operator, but JDK 9 reduces that particular win by indyfying
concat (string concatenation is compiled to an invokedynamic call).
If a method doesn't intrinsically require a String, then I prefer
CharSequence. It's probable that sooner or later something is going to
demand a String, but that's not a good reason to be "that guy" :-)
Note:
Strings can be an incredible waste of memory; roughly 40 + 8·⌈length/4⌉ bytes
on a 64-bit HotSpot JVM with compressed oops (reduced to a mere
40 + 8·⌈length/8⌉ bytes in JDK 9 when compact strings can be used).
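A back-of-the-envelope sketch of where numbers of this kind come from. The layout figures are assumptions about a typical 64-bit HotSpot JVM with compressed oops (a 24-byte String object, a 16-byte array header, 8-byte object alignment), not measurements; the class and method names are made up for the example.

```java
final class StringFootprint {

    // Objects are assumed to be padded to a multiple of 8 bytes.
    static long roundUp8(long n) {
        return (n + 7) / 8 * 8;
    }

    // Assumed layout: 24 bytes for the String object itself, plus a backing
    // array with a 16-byte header and bytesPerChar bytes per character
    // (2 for UTF-16 char[], 1 for a Latin-1 compact string).
    static long estimateBytes(int length, int bytesPerChar) {
        return 24 + roundUp8(16 + (long) length * bytesPerChar);
    }

    public static void main(String[] args) {
        for (int length : new int[] {0, 10, 100}) {
            System.out.println(length + " chars: "
                    + estimateBytes(length, 2) + " bytes as UTF-16, "
                    + estimateBytes(length, 1) + " bytes as Latin-1");
        }
    }
}
```

Under these assumptions even an empty String costs 40 bytes, which is why huge numbers of short strings hurt so much.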
This is incredibly painful if you have a vast number of small "strings",
which may not all need to be materialized simultaneously. See e.g. [1]
(~50MiB of UTF-8 chars becomes ~250MiB of Strings, and since there's no
individual humongous object they all get to make the journey from TLAB to
Old Space the hard way. Note this predates JDK 9, but illustrates some of
the win from compact strings).
Storing the character data in a shared byte array is a huge win. Someone
should tell the jdk implementors to look at applications that do this.
Like, um, javac :-)
Materializing these strings as possibly transient CharSequences is
really convenient... until some method just has to have a String.
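One way the shared-byte-array idea could look, as a sketch only: ByteSlice is a hypothetical class, and it assumes one byte per character (Latin-1) to keep charAt trivial, which real UTF-8 data would not allow. Each slice is a small transient view over the shared pool, and a String is only allocated when toString() is actually called.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical CharSequence view over a region of a shared byte array.
// Assumption: the bytes are Latin-1, so one byte maps to one char.
final class ByteSlice implements CharSequence {
    private final byte[] shared;
    private final int offset;
    private final int length;

    ByteSlice(byte[] shared, int offset, int length) {
        if (offset < 0 || length < 0 || offset > shared.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        this.shared = shared;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return (char) (shared[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        // Another view over the same bytes; nothing is copied.
        return new ByteSlice(shared, offset + start, end - start);
    }

    @Override
    public String toString() {
        // The only place a String (and a copy of the data) is created.
        return new String(shared, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] pool = "owlThingowlNothing".getBytes(StandardCharsets.ISO_8859_1);
        CharSequence thing = new ByteSlice(pool, 0, 8);
        CharSequence nothing = new ByteSlice(pool, 8, 10);

        // APIs that accept CharSequence can consume the views directly.
        StringBuilder out = new StringBuilder();
        out.append(thing).append(' ').append(nothing);
        System.out.println(out);
    }
}
```

Any method that insists on a String parameter forces the toString() copy, which is exactly the cost discussed above.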
Also, wouldn't some sort of low-space-overhead string storage be a good
fit for text?
Simon
[1] Spero, S. (2015). Time And Relative Dimensions In Semantics: Is OWL
Bigger On The Inside? OWLED 2015. Available at
http://cgi.csc.liv.ac.uk/~valli/OWLED2015/OWLED_2015_paper_12.pdf