Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Martin Wunderlich
There is a discussion on GitHub on this topic, and the recommendation is
to upgrade from 1.x to 2.15.0, due to the vulnerability of 1.x:
https://github.com/apache/logging-log4j2/pull/608


This discussion is also referenced by the German Federal Office for 
Information Security: https://www.bsi.bund.de/EN/Home/home_node.html


Cheers,

Martin

On 13.12.21 at 17:02, Jörn Franke wrote:
Is it in any case appropriate to use log4j 1.x, which is not
maintained anymore and has other security vulnerabilities that won't
be fixed anymore?



On 13.12.2021 at 06:06, Sean Owen wrote:


Check the CVE - the log4j vulnerability appears to affect log4j 2,
not 1.x. There was mention that it could affect 1.x when used with
JNDI or JMS handlers, but Spark uses neither. (Unless anyone can
think of something I'm missing, but I have never heard of or seen that
come up at all in 7 years in Spark.)


The big issue would be applications that themselves configure log4j 
2.x, but that's not a Spark issue per se.
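
A quick way to check which log4j a given deployment actually loads is to ask the classpath directly. A minimal sketch for spark-shell, assuming log4j 1.x is on the classpath (as Spark 3.2 and earlier ship it); the printed version string is illustrative:

    // Print the implementation version of the loaded Logger class to
    // confirm which log4j release is actually in use.
    println(classOf[org.apache.log4j.Logger]
      .getPackage.getImplementationVersion)  // e.g. "1.2.17"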


On Sun, Dec 12, 2021 at 10:46 PM, Pralabh Kumar wrote:


Hi developers, users,

Spark is built using log4j 1.2.17. Is there a plan to upgrade, based
on the recently detected CVE?


Regards
Pralabh kumar


Re: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Martin Wunderlich

Hi Amin,

This might be only marginally relevant to your question, but in my
project I also noticed the following: trained and exported Spark
models (i.e. pipelines saved to binary files) are also not compatible
between versions, at least not between major versions. I noticed this
when trying to load a model built with Spark 2.4.4 after updating to
3.2.0; this didn't work.
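
For illustration, a minimal sketch of the failing step; the path is hypothetical:

    import org.apache.spark.ml.PipelineModel

    // Hedged sketch: load a pipeline that was saved under Spark 2.4.4
    // from a Spark 3.2.0 session. In my case this load failed across
    // the major version boundary. The path below is hypothetical.
    val model = PipelineModel.load("/models/pipeline-saved-with-2.4.4")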


Cheers,

Martin

On 24.11.21 at 20:18, Sean Owen wrote:
I think/hope that it goes without saying that you can't mix Spark
versions within a cluster.
Forwards compatibility is something you don't generally expect as a
default from any piece of software, so I'm not sure there is anything
to document explicitly.
Backwards compatibility is important, and the cases where it doesn't
hold are documented extensively in the Spark docs and release notes.



On Wed, Nov 24, 2021 at 1:16 PM, Amin Borjian wrote:


Thank you very much for the reply. It would be great if these
points were mentioned in the Spark documentation (for example, on
the download page or somewhere similar).

If I understand correctly, it means that we can compile the client
(for example in Java) against a newer version (for example 3.2.0)
within the range of a major version and run it against an older
server (for example 3.1.x) without seeing problems in most cases.
Am I right? (Because backward compatibility can be viewed from both
the server and the client side, I repeat the sentence to make sure
I got it right.)

But what happens if we update the server to 3.2.x while our client
is still on version 3.1.x? Can the client work with the newer
cluster version, because it only uses old features of the server?
(Maybe this is what you meant, and in fact my previous sentence was
wrong and I misunderstood.)

From: Sean Owen
Sent: Wednesday, November 24, 2021 5:38 PM
To: Amin Borjian
Cc: user@spark.apache.org
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

Can you mix different Spark versions on driver and executor? No.

Can you compile against a different version of Spark than you run
on? That typically works within a major release, though forwards
compatibility may not work (you can't use a feature that doesn't
exist in the version on the cluster). Compiling against 3.2.0 and
running on 3.1.x, for example, should work fine in 99% of cases.
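
One common way to arrange that is to compile against Spark as a "provided" dependency, so the application jar carries no Spark classes and the cluster's own jars are used at run time. A sketch in sbt form (project setup and versions are illustrative, not from this thread):

    // build.sbt sketch: compile against 3.2.0, run with the cluster's jars.
    // "provided" keeps these artifacts out of the assembled application jar.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "3.2.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "3.2.0" % "provided"
    )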

On Wed, Nov 24, 2021 at 8:04 AM, Amin Borjian wrote:

I have a simple question about using Spark which, although most
tools explain this kind of question explicitly (in prominent text,
such as a specific format or a separate page), I did not find
answered anywhere. Maybe my search was not thorough enough, but I
thought it would be good to ask, in the hope that the answer will
benefit other people as well.

The Spark binary is usually downloaded from the following link and
installed and configured on the cluster: Download Apache Spark


If, for example, we use the Java language for programming (although
it could be any other supported language), we need the following
dependencies to communicate with Spark:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.2.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.2.0</version>
    </dependency>

As is clear, both the Spark cluster (the Spark binary) and the
dependencies used on the application side have a specific version.
In my opinion, it is obvious that if the version is the same on
both the application side and the server side, everything will most
likely work in its ideal state without any problems.

But what if the two versions are not the same? Is compatibility
between the server and the application possible under specific
conditions (such as not changing the major version)? Or, for
example, is it not a problem if the client is always ahead? Or if
the server is always ahead?

The reason I ask is that there may be a library that I did not
write which is on an old version, while I want to update my cluster
(server version). Or it may not be possible for me to update the
server version and all the application versions at the same time,
so I want to update each one separately. As a result, the
application and server versions differ for a period of time (maybe
a short or a long period). I want to know exactly how Spark behaves
in this situation.


Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread Martin Wunderlich

Hi Gourav,

Mostly correct. The output of SparkNLP here is a trained
pipeline/model/transformer. I am feeding this trained pipeline to the
MulticlassClassificationEvaluator for evaluation, and the
MulticlassClassificationEvaluator only accepts floats or doubles as the
labels (instead of NER labels).


Cheers,

Martin

On 11.11.21 at 11:39, Gourav Sengupta wrote:

Hi Martin,

Just to confirm: you are taking the output of SparkNLP and then
trying to feed it to Spark ML, for running algorithms on the NER
output generated by SparkNLP, right?



Regards,
Gourav Sengupta

On Thu, Nov 11, 2021 at 8:00 AM  wrote:

Hi Sean,

Apologies for the delayed reply. I've been away on vacation and
then busy catching up afterwards.

Regarding the evaluation using the MulticlassClassificationEvaluator:
this is about a sequence labeling task to identify specific
non-standard named entities. The training and evaluation data is
in CoNLL format. The training works fine, using the categorical
labels for the NEs. In order to use the
MulticlassClassificationEvaluator, however, I need to convert
these to floats. This is possible and also works fine; it is just
inconvenient having to do the extra step. I would have expected
the MulticlassClassificationEvaluator to be able to use the labels
directly.
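
For reference, a minimal sketch of that extra conversion step; the Dataset name "df" and the column names "label" and "prediction" are assumptions about the flattened per-token output, not SparkNLP's API:

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.StringIndexer

    // Index the categorical gold tags once, then reuse the same fitted
    // model on the predicted tags so both columns share one encoding.
    // (setHandleInvalid("keep") helps if predictions contain unseen tags.)
    val indexer = new StringIndexer()
      .setInputCol("label").setOutputCol("labelIdx")
      .fit(df)
    val withGold = indexer.transform(df)
    val encoded = indexer
      .setInputCol("prediction").setOutputCol("predIdx")
      .transform(withGold)

    val f1 = new MulticlassClassificationEvaluator()
      .setLabelCol("labelIdx").setPredictionCol("predIdx")
      .setMetricName("f1")
      .evaluate(encoded)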

I will try to create and propose a code change in this regard, if
or when I find the time.

Cheers,

Martin


On 2021-10-25 14:31, Sean Owen wrote:


I don't think the question is the representation as double. The
question is how this output represents a label. This looks like
the result of an annotator. What are you classifying? You need,
first, ground truth and a prediction somewhere, in order to use any
utility to assess classification metrics.

On Mon, Oct 25, 2021 at 5:42 AM  wrote:

Hello,

I am using SparkNLP to do some NER. The result data structure
after training and classification is a Dataset, with one
column each for labels and predictions. For evaluating the
model, I would like to use the Spark ML class
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.
However, this evaluator expects labels as double numbers. In
the case of an NER task, the results in my case are of type

array<struct<annotatorType:string, begin:int, end:int, result:string, metadata:map<string,string>, embeddings:array<float>>>.


It would be possible, of course, to convert this format to
the required doubles. But is there a way to easily apply the
MulticlassClassificationEvaluator to the NER task, or is there
maybe a better evaluator? I haven't found anything yet
(neither in Spark ML nor in SparkNLP).

Thanks a lot.

Cheers,

Martin