Hi everyone,
I'm currently working with Tika's language detection and I'm having problems
with the accuracy of its predictions, so I was wondering if anyone has ideas
on how to improve what we have.
My use case is fairly specific: I have short to medium-sized texts that can
be in either one or two languages. When a text has two languages, it is
usually a text immediately followed by its translation, so the position of
the two languages is roughly identifiable.
For this reason, what we currently do is split the text in half, run Tika on
both halves, and then take both outputs (if both halves yield the same
language we count it only once).
The code that does this is the following:
private static List<String> segmentateText(String text, int numSegments, int minLength) {
    // If the text is shorter than the given minLength we don't segment it
    // (in this example minLength is 0)
    if (text.length() <= minLength) {
        List<String> singleSegment = new ArrayList<>();
        singleSegment.add(text);
        return singleSegment;
    }
    // Compute the size of each segment by counting the number of words
    String[] words = text.split("\\s+");
    int segmentSize = words.length / numSegments;
    List<String> segments = new ArrayList<>();
    for (int i = 0; i < numSegments; i++) {
        int start = i * segmentSize;
        // If this is not the last segment, compute the end index;
        // otherwise take all remaining words
        int end;
        if (i < numSegments - 1) {
            end = (i + 1) * segmentSize;
        } else {
            end = words.length;
        }
        // Construct the segment
        StringBuilder segment = new StringBuilder();
        for (int j = start; j < end; j++) {
            segment.append(words[j]);
            // Append a whitespace if this is not the last word
            if (j < end - 1) {
                segment.append(" ");
            }
        }
        segments.add(segment.toString());
    }
    return segments;
}
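To make this concrete, here is what the split does to the short example I
mention further down (numSegments = 2, minLength = 0; the expected outputs in
the comments follow directly from the word-count logic above):

    List<String> segments = segmentateText(
            "Open Access: Soziologische Aspekte ; Open Access: Sociological Implications",
            2, 0);
    // The text has 9 whitespace-separated tokens, so segmentSize = 9 / 2 = 4:
    // segments.get(0) -> "Open Access: Soziologische Aspekte"
    // segments.get(1) -> "; Open Access: Sociological Implications"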
We then simply pass the segments to this other function, which does the
detection:
private static Set<String> detectLanguages(List<String> segments) {
    LanguageDetector detector = new OptimaizeLangDetector().loadModels();
    Set<String> result = new HashSet<>();
    String detectedLang;
    // For each segment the language gets detected, normalized and added to a set
    for (String segment : segments) {
        LanguageResult language = detector.detect(segment);
        detectedLang = language.getLanguage();
        // Utils.normalizeLanguage is an internal helper; you can ignore it here
        detectedLang = (String) Utils.normalizeLanguage(detectedLang);
        result.add(detectedLang);
    }
    return result;
}
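Putting the two together, the end-to-end flow of our pipeline is essentially
this one-liner (inputText stands for whatever string comes in):

    // Split in half, detect each half, deduplicate identical results via the Set
    Set<String> languages = detectLanguages(segmentateText(inputText, 2, 0));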
The problem is that this approach gives okay-ish results for medium-sized
texts but performs poorly on shorter ones, for example:
"Open Access: Soziologische Aspekte ; Open Access: Sociological
Implications"
This should be detected as German and English, but it is instead flagged as
Italian. There are many similar cases across our data, with other language
pairs as well.
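For debugging individual cases I can dump the whole candidate list instead of
only the top result, roughly like this (a sketch, not our production code;
if I read Tika's API right, addText/detectAll on LanguageDetector and
getRawScore/getConfidence on LanguageResult give every candidate with its
score):

    LanguageDetector detector = new OptimaizeLangDetector().loadModels();
    detector.addText("Open Access: Soziologische Aspekte ; Open Access: Sociological Implications");
    for (LanguageResult candidate : detector.detectAll()) {
        // One line per candidate: language code, raw score, confidence bucket
        System.out.println(candidate.getLanguage() + " "
                + candidate.getRawScore() + " " + candidate.getConfidence());
    }

This at least shows how confident the detector is in each candidate on these
short strings, but it doesn't solve the misclassification itself.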
Now, I'm wondering: what can I do to improve detection accuracy for this
specific use case?
I'll list some additional information I think could be relevant:
- I can't know beforehand the set of languages present in the data, so we
can't load only the models we need.
- Computation power is limited, as this code runs in a Big Data pipeline over
millions of strings, so it can't be too slow.
- I'm using the latest version of Tika.
Thanks a lot in advance to everyone who is willing to help,
Francesco