OK, thank you, Gourav. I didn't realize that Spark works with numerical
formats only by design.
What I am trying to achieve is rather straightforward: Evaluate a
trained model using the standard metrics provided by
MulticlassClassificationEvaluator. Since this isn't possible for text
labels, we'll need to work around it and possibly create a wrapper
evaluator around the Spark standard class.
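To make the wrapper idea concrete, here is a minimal sketch in plain Python. It is not Spark code: `StringLabelEvaluator` and `accuracy` are hypothetical names, and `accuracy` is only a stand-in for a numeric-only evaluator such as MulticlassClassificationEvaluator. The point is just the shape of the wrapper: map text labels to numbers with one shared mapping, then delegate.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Stand-in for a numeric-only evaluator: takes (prediction, label) pairs of floats.
NumericEvaluator = Callable[[Sequence[Tuple[float, float]]], float]


def accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the prediction equals the label."""
    if not pairs:
        raise ValueError("no rows to evaluate")
    return sum(1 for pred, gold in pairs if pred == gold) / len(pairs)


class StringLabelEvaluator:
    """Hypothetical wrapper: accepts text labels, delegates to a numeric evaluator."""

    def __init__(self, inner: NumericEvaluator) -> None:
        self.inner = inner

    def evaluate(self, predictions: List[str], labels: List[str]) -> float:
        if len(predictions) != len(labels):
            raise ValueError("predictions and labels differ in length")
        # One shared mapping for both columns, so equal strings get equal numbers.
        vocab = sorted(set(labels) | set(predictions))
        index: Dict[str, float] = {name: float(i) for i, name in enumerate(vocab)}
        pairs = [(index[p], index[g]) for p, g in zip(predictions, labels)]
        return self.inner(pairs)


evaluator = StringLabelEvaluator(accuracy)
score = evaluator.evaluate(
    predictions=["O", "B-PER", "O", "B-LOC"],
    labels=["O", "B-PER", "B-LOC", "B-LOC"],
)
```

In a real Spark version the inner call would presumably be the standard evaluator applied to a DataFrame with indexed columns; that part is left out here because it depends on the actual pipeline.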
Thanks a lot for the help.
Cheers,
Martin
On 2021-11-11 13:10, Gourav Sengupta wrote:
Hi Martin,
okay, so you will of course need to translate the NER string output to a
numerical format as you would do with any text data before feeding it
to SPARK ML. Please read SPARK ML documentation on this. I think that
they are quite clear on how to do that.
But more importantly, please try to answer Sean's question. Explaining
what you are trying to achieve, and how, always helps.
Regards,
Gourav Sengupta
On Thu, Nov 11, 2021 at 11:03 AM Martin Wunderlich
<mar...@wunderlich.com> wrote:
Hi Gourav,
Mostly correct. The output of SparkNLP here is a trained
pipeline/model/transformer. I am feeding this trained pipeline to the
MulticlassClassificationEvaluator for evaluation and this
MulticlassClassificationEvaluator only accepts floats or doubles as
the labels (instead of NER labels).
Cheers,
Martin
On 11.11.21 at 11:39, Gourav Sengupta wrote:
Hi Martin,
just to confirm, you are taking the output of SPARKNLP, and then trying
to feed it to SPARK ML for running algorithms on the output of
NER generated by SPARKNLP, right?
Regards,
Gourav Sengupta
On Thu, Nov 11, 2021 at 8:00 AM <mar...@wunderlich.com> wrote:
Hi Sean,
Apologies for the delayed reply. I've been away on vacation and then
busy catching up afterwards.
Regarding the evaluation using MulticlassClassificationEvaluator: This
is about a sequence labeling task to identify specific non-standard
named entities. The training and evaluation data is in CoNLL format.
The training all works fine, using the categorical labels for the NEs.
In order to use the MulticlassClassificationEvaluator, however, I need
to convert these to floats. This is possible and also works fine; it is
just inconvenient having to do the extra step. I would have expected
the MulticlassClassificationEvaluator to be able to use the labels
directly.
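For illustration, the extra step amounts to something like the following. This is a minimal sketch in plain Python, not the actual Spark code: `build_label_index` and `to_doubles` are hypothetical helpers, and the ordering (most frequent label first, ties broken alphabetically) is an assumption of this sketch, loosely in the spirit of what Spark ML's StringIndexer does.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_label_index(labels: Iterable[str]) -> Dict[str, float]:
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on."""
    counts = Counter(labels)
    # Assumed ordering: descending frequency, then alphabetical for ties.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


def to_doubles(labels: Iterable[str], index: Dict[str, float]) -> List[float]:
    """Replace each text label by its numeric index (KeyError on unseen labels)."""
    return [index[lab] for lab in labels]


# Made-up token-level tags, only for demonstration.
gold = ["O", "O", "B-PER", "O", "B-LOC", "B-PER"]
predicted = ["O", "B-PER", "B-PER", "O", "O", "B-PER"]

# Build the index once (here from the gold labels) and reuse it for both columns.
index = build_label_index(gold)
gold_idx = to_doubles(gold, index)
pred_idx = to_doubles(predicted, index)
```

The important detail is that labels and predictions go through the same mapping; two separately fitted mappings could assign different numbers to the same tag.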
I will try to create and propose a code change in this regard, if or
when I find the time.
Cheers,
Martin
On 2021-10-25 14:31, Sean Owen wrote:
I don't think the question is representation as double. The question is
how this output represents a label. This looks like the result of an
annotator. What are you classifying? You need, first, ground truth and
prediction somewhere to use any utility to assess classification
metrics.
On Mon, Oct 25, 2021 at 5:42 AM <mar...@wunderlich.com> wrote:
Hello,
I am using SparkNLP to do some NER. The resulting data structure after
training and classification is a Dataset<Row>, with one column each for
labels and predictions. For evaluating the model, I would like to use
the Spark ML class
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.
However, this evaluator expects labels as double numbers. In my NER
task, the results are of type
array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
It would be possible, of course, to convert this format to the required
doubles. But is there a way to easily apply
MulticlassClassificationEvaluator to the NER task or is there maybe a
better evaluator? I haven't found anything yet (neither in Spark ML nor
in SparkNLP).
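To show what such a conversion would have to start from, here is a minimal sketch in plain Python (not Spark code) that flattens rows of this shape into token-level pairs of text labels. It assumes that each row carries two equally long annotation arrays and that the `result` field holds the tag; the column names `label` and `ner`, the helper `flatten_rows`, and the sample values are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Mimics one struct of the array column; only `result` is used below.
Annotation = Dict[str, object]


def flatten_rows(rows: List[Dict[str, List[Annotation]]],
                 label_col: str = "label",
                 pred_col: str = "ner") -> List[Tuple[str, str]]:
    """Turn sentence-level annotation arrays into (gold, predicted) tag pairs."""
    pairs: List[Tuple[str, str]] = []
    for row in rows:
        gold, pred = row[label_col], row[pred_col]
        if len(gold) != len(pred):
            raise ValueError("label and prediction arrays differ in length")
        for g, p in zip(gold, pred):
            pairs.append((str(g["result"]), str(p["result"])))
    return pairs


# One made-up sentence with two tokens.
rows = [{
    "label": [
        {"annotatorType": "named_entity", "begin": 0, "end": 4, "result": "B-PER"},
        {"annotatorType": "named_entity", "begin": 6, "end": 10, "result": "O"},
    ],
    "ner": [
        {"annotatorType": "named_entity", "begin": 0, "end": 4, "result": "B-PER"},
        {"annotatorType": "named_entity", "begin": 6, "end": 10, "result": "B-LOC"},
    ],
}]
pairs = flatten_rows(rows)
```

Once the data is in this flat form, each text tag still has to be mapped to a double before a numeric-only evaluator can consume it.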
Thanks a lot.
Cheers,
Martin