Hi Sean,
Apologies for the delayed reply. I've been away on vacation and then
busy catching up afterwards.
Regarding the evaluation using MulticlassClassificationEvaluator: This is
about a sequence labeling task to identify specific non-standard named
entities. The training and evaluation data is in CoNLL format. Training
works fine using the categorical labels for the NEs. To use the
MulticlassClassificationEvaluator, however, I need to convert these
labels to floats. This is possible and works fine, but the extra step
is inconvenient. I would have expected the
MulticlassClassificationEvaluator to be able to use the labels directly.
I will try to create and propose a code change in this regard if and
when I find the time.
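
To make the extra step concrete, here is a minimal plain-Python sketch of the conversion, outside Spark. It assumes the gold and predicted labels have already been extracted as token-aligned lists of strings. The helper names and the label set (B-PRODUCT, I-PRODUCT, O) are hypothetical and only serve as illustration. The ordering (most frequent label gets index 0.0, ties broken alphabetically) is meant to resemble what an indexer such as Spark ML's StringIndexer produces by default, but that correspondence is an assumption rather than something verified here.

```python
from collections import Counter


def build_label_index(labels):
    """Map each distinct label string to a float index.

    The most frequent label gets 0.0; ties are broken alphabetically.
    (Hypothetical helper, not part of Spark ML or SparkNLP.)
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


def to_prediction_label_pairs(gold, predicted, index):
    """Return (prediction, label) pairs of floats for token-aligned sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences differ in length")
    return [(index[p], index[g]) for g, p in zip(gold, predicted)]


# Hypothetical token-level labels for one short sentence.
gold = ["O", "B-PRODUCT", "I-PRODUCT", "O", "O"]
predicted = ["O", "B-PRODUCT", "O", "O", "O"]

# Build the index over both lists so every predicted label has an entry.
index = build_label_index(gold + predicted)
pairs = to_prediction_label_pairs(gold, predicted, index)

# Token-level accuracy as a simple sanity check on the converted pairs.
accuracy = sum(1 for p, g in pairs if p == g) / len(pairs)
print(index)
print(pairs)
print(accuracy)
```

Building the index over gold and predicted labels together avoids a lookup failure when the model predicts a label that never occurs in the gold data.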
Cheers,
Martin
On 2021-10-25 14:31, Sean Owen wrote:
I don't think the question is representation as double. The question is
how this output represents a label. This looks like the result of an
annotator. What are you classifying? You first need ground truth and
predictions somewhere before any utility can assess classification
metrics.
On Mon, Oct 25, 2021 at 5:42 AM <mar...@wunderlich.com> wrote:
Hello,
I am using SparkNLP to do some NER. The resulting data structure after
training and classification is a Dataset<Row>, with one column each
for labels and predictions. For evaluating the model, I would like to
use the Spark ML class
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.
However, this evaluator expects labels as doubles. In my NER task, the
labels and predictions are of type
array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
It would be possible, of course, to convert this format to the
required doubles. But is there a way to easily apply
MulticlassClassificationEvaluator to the NER task or is there maybe a
better evaluator? I haven't found anything yet (neither in Spark ML
nor in SparkNLP).
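
As a rough illustration of what the conversion involves, the following plain-Python sketch flattens annotation arrays of the shape described above into token-level (gold, predicted) string pairs. Rows are modeled as dictionaries and each annotation as a dictionary with a "result" field. The column names "label" and "ner", the helper names, and the sample labels are all assumptions made for this example, not names taken from SparkNLP.

```python
def annotation_results(annotations):
    """Extract the 'result' string from each annotation struct in a cell."""
    return [a["result"] for a in annotations]


def flatten_rows(rows, label_col="label", pred_col="ner"):
    """Turn per-sentence annotation arrays into token-level (gold, predicted) pairs.

    The column names are hypothetical defaults; both columns must hold
    the same number of annotations per row.
    """
    pairs = []
    for row in rows:
        gold = annotation_results(row[label_col])
        pred = annotation_results(row[pred_col])
        if len(gold) != len(pred):
            raise ValueError("label and prediction arrays are not aligned")
        pairs.extend(zip(gold, pred))
    return pairs


# One hypothetical row with two tokens; only the fields used here are filled in.
rows = [
    {
        "label": [
            {"annotatorType": "named_entity", "begin": 0, "end": 3, "result": "B-PRODUCT"},
            {"annotatorType": "named_entity", "begin": 5, "end": 9, "result": "O"},
        ],
        "ner": [
            {"annotatorType": "named_entity", "begin": 0, "end": 3, "result": "B-PRODUCT"},
            {"annotatorType": "named_entity", "begin": 5, "end": 9, "result": "B-PRODUCT"},
        ],
    }
]

token_pairs = flatten_rows(rows)
print(token_pairs)
```

Once the labels are available as aligned string pairs, each distinct string can be mapped to a double for the evaluator.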
Thanks a lot.
Cheers,
Martin