Hi Gourav,
Mostly correct. The output of SparkNLP here is a trained
pipeline/model/transformer. I am feeding this trained pipeline to the
MulticlassClassificationEvaluator for evaluation, and this
MulticlassClassificationEvaluator only accepts floats or doubles as the
labels (instead of NER labels).
Cheers,
Martin
On 11.11.21 at 11:39, Gourav Sengupta wrote:
Hi Martin,
just to confirm, you are taking the output of SPARKNLP, and then
trying to feed it to SPARK ML for running algorithms on the output of
NER generated by SPARKNLP, right?
Regards,
Gourav Sengupta
On Thu, Nov 11, 2021 at 8:00 AM <mar...@wunderlich.com> wrote:
Hi Sean,
Apologies for the delayed reply. I've been away on vacation and
then busy catching up afterwards.
Regarding the evaluation using MulticlassClassificationEvaluator:
This is about a sequence labeling task to identify specific
non-standard named entities. The training and evaluation data is
in CoNLL format. The training works fine, using the
categorical labels for the NEs. In order to use the
MulticlassClassificationEvaluator, however, I need to convert
these to floats. This is possible and also works fine; it is just
inconvenient to have to do the extra step. I would have expected
the MulticlassClassificationEvaluator to be able to use the labels
directly.
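
For illustration, the extra conversion step described above can be sketched in plain Python. This is not the Spark API and not the code used in this thread: the helper names (build_label_index, encode) and the sample tags are hypothetical. The ordering (most frequent label first, ties broken alphabetically) is an assumption chosen to resemble a frequency-based string indexer.

```python
from collections import Counter


def build_label_index(labels):
    """Map each distinct string label to a float index, most frequent first."""
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def encode(labels, index):
    """Replace every string label with its float index."""
    return [index[label] for label in labels]


# Hypothetical token-level NER tags (ground truth and predictions).
gold = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["O", "B-PER", "O", "O", "B-LOC", "O"]

# Build one shared index so both columns use the same numbering.
index = build_label_index(gold + pred)
gold_ids = encode(gold, index)
pred_ids = encode(pred, index)
```

Building the index over both columns together matters: if labels and predictions were indexed separately, the same tag could end up with two different numbers and the metrics would be meaningless.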
I will try to create and propose a code change in this regard, if
or when I find the time.
Cheers,
Martin
On 2021-10-25 14:31, Sean Owen wrote:
I don't think the question is representation as double. The
question is how this output represents a label. This looks like
the result of an annotator. What are you classifying? You need,
first, ground truth and prediction somewhere to use any utility
to assess classification metrics.
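
To make the point about ground truth and prediction concrete, here is a minimal sketch in plain Python of a support-weighted F1 score computed from two aligned lists of numeric labels. It is an illustration under assumptions, not Spark's implementation: the function name weighted_f1 and the sample values are hypothetical, and the weighting (per-class F1 weighted by how often the class occurs in the ground truth) is assumed to correspond to what a weighted F-measure reports.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Support-weighted F1 over two aligned sequences of class labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # true class g was missed

    support = Counter(gold)  # number of true instances per class
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        # Every label here occurs in gold, so the denominator is never zero.
        f1 = 2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        score += (n / total) * f1
    return score


# Hypothetical labels already converted to doubles.
gold = [0.0, 2.0, 3.0, 0.0, 1.0, 0.0]
pred = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
score = weighted_f1(gold, pred)
```

Neither list carries any meaning on its own; the metric only exists once each prediction is paired with the true label for the same token.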
On Mon, Oct 25, 2021 at 5:42 AM <mar...@wunderlich.com> wrote:
Hello,
I am using SparkNLP to do some NER. The resulting data structure
after training and classification is a Dataset<Row>, with one
column each for labels and predictions. For evaluating the
model, I would like to use the Spark ML class
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.
However, this evaluator expects labels as double numbers. In
my NER task, the results are of type
array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
It would be possible, of course, to convert this format to
the required doubles. But is there a way to easily apply
MulticlassClassificationEvaluator to the NER task or is there
maybe a better evaluator? I haven't found anything yet
(neither in Spark ML nor in SparkNLP).
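
As a rough sketch of what the conversion involves, the plain-Python example below flattens rows shaped like the annotation type quoted above into (label, prediction) string pairs, one per token. Rows are modelled as lists of dicts; the helper names (results_of, paired_token_labels, ann) and all sample values, including the "named_entity" annotator type, are assumptions for illustration rather than output from a real pipeline.

```python
def results_of(annotations):
    """Extract the `result` string from every annotation in one row."""
    return [a["result"] for a in annotations]


def paired_token_labels(label_rows, prediction_rows):
    """Flatten two annotation columns into aligned (gold, predicted) pairs."""
    pairs = []
    for row_no, (labels, preds) in enumerate(zip(label_rows, prediction_rows)):
        gold = results_of(labels)
        predicted = results_of(preds)
        # Token-level evaluation only makes sense if both sides line up.
        if len(gold) != len(predicted):
            raise ValueError(
                f"row {row_no}: {len(gold)} labels vs {len(predicted)} predictions"
            )
        pairs.extend(zip(gold, predicted))
    return pairs


def ann(begin, end, result):
    """Build a dict with the same fields as the annotation struct (sample data)."""
    return {
        "annotatorType": "named_entity",  # illustrative value
        "begin": begin,
        "end": end,
        "result": result,
        "metadata": {},
        "embeddings": [],
    }


label_rows = [[ann(0, 5, "B-PER"), ann(7, 11, "O")]]
prediction_rows = [[ann(0, 5, "B-PER"), ann(7, 11, "B-LOC")]]
pairs = paired_token_labels(label_rows, prediction_rows)
```

Once the data is in this one-pair-per-token form, each string can be mapped to a double and passed to a generic multiclass evaluator.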
Thanks a lot.
Cheers,
Martin