Re: Log4j 2.x support in 3.3.0

2021-12-20 Thread Sean Owen
I would suppose it will be at least 2.17. Whatever is needed to resolve
recent security issues. There is no release date for 3.3.0 yet. Probably
within a few months.

On Mon, Dec 20, 2021, 10:46 PM Chintan Mohan Rohila wrote:

> Hi,
>
> I see under JIRA https://issues.apache.org/jira/browse/SPARK-6305 that
> support for log4j 2.x is being added to Spark in release 3.3.0.
>
> Could you let me know which version of log4j 2 will be bundled and what the
> release date of 3.3.0 is, so that we can plan to integrate it into our product?
>
> Any help is highly appreciated!!!
>
> --
> Best regards,
> Chintan Rohila
>


Log4j 2.x support in 3.3.0

2021-12-20 Thread Chintan Mohan Rohila
Hi,

I see under JIRA https://issues.apache.org/jira/browse/SPARK-6305 that
support for log4j 2.x is being added to Spark in release 3.3.0.

Could you let me know which version of log4j 2 will be bundled and what the
release date of 3.3.0 is, so that we can plan to integrate it into our product?

Any help is highly appreciated!!!

-- 
Best regards,
Chintan Rohila


INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-20 Thread Andrew Davidson
Happy Holidays

I am a newbie

I have 16,000 data files; all files have the same number of rows and columns.
The row ids are identical and are in the same order. I want to create a new
data frame that contains the 3rd column from each data file. My PySpark script
runs correctly when I test on a small number of files; however, I get an OOM
when I run on all 16,000.

To try to debug, I ran a small test and set the logging level to INFO. I found the
following:

2021-12-21 00:47:04 INFO  CreateViewCommand:57 - Try to uncache `rawCounts` 
before replacing.

# iteratively join each sample's counts column onto the running "rawCounts" view
for i in range(1, len(self.sampleNamesList)):
    sampleName = self.sampleNamesList[i]

    # select the key and counts from the sample.
    qsdf = quantSparkDFList[i]
    sampleSDF = qsdf\
        .select(["Name", "NumReads"])\
        .withColumnRenamed("NumReads", sampleName)

    sampleSDF.createOrReplaceTempView("sample")

    # the sample name must be quoted in backticks, else column names with a '-'
    # like GTEX-1117F-0426-SM-5EGHI will generate an error:
    # spark thinks the '-' is an expression. '_' is also
    # a special char for the sql like operator
    # https://stackoverflow.com/a/63899306/4586180
    sqlStmt = '\t\t\t\t\t\tselect rc.*, `{}` \n\
                from \n\
                    rawCounts as rc, \n\
                    sample \n\
                where \n\
                    rc.Name == sample.Name \n'.format(sampleName)

    rawCountsSDF = self.spark.sql(sqlStmt)
    rawCountsSDF.createOrReplaceTempView("rawCounts")

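For reference, each pass through the loop is logically an inner join on the Name
column. A minimal sketch of the DataFrame-API equivalent, assuming the running
result were kept in the rawCountsSDF DataFrame rather than in the temp view
(same variable names as above, not tested at this scale):

# equivalent of the SQL for one sample: keep all rawCounts columns and
# append this sample's renamed NumReads column, matching rows on Name
rawCountsSDF = rawCountsSDF.join(sampleSDF, on="Name", how="inner")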

The way I wrote my script, I do a lot of transformations; the first action is
at the end of the script:

retCountDF.coalesce(1).write.csv(outfileCount, mode='overwrite', header=True)

Should I be calling self.spark.sql("UNCACHE TABLE rawCounts") before calling
rawCountsSDF.createOrReplaceTempView("rawCounts")? I expected Spark to manage
the cache automatically, given that I do not explicitly call cache().
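
For what it is worth, the explicit version of that call would look roughly like
this (a sketch only; it assumes the temp view is named rawCounts, as in the loop
above):

# drop any cached data behind the old view before replacing it;
# IF EXISTS avoids an error on the first pass, when the view does not exist yet
self.spark.sql("UNCACHE TABLE IF EXISTS rawCounts")
rawCountsSDF.createOrReplaceTempView("rawCounts")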


How come I do not get a similar warning from
sampleSDF.createOrReplaceTempView("sample")?

Will this reduce my memory requirements?


Kind regards

Andy


RE: Spark 3.0 plugins

2021-12-20 Thread Luca Canali
Hi Anil,

 

To recap: Apache Spark plugins are an interface and configuration mechanism that
allows you to inject code at executor start-up and, among other things, provides
a hook into the Spark metrics system. This provides a way to extend metrics
collection beyond what is available in Apache Spark.

Instrumenting parts of the Spark workload with plugins provides additional
flexibility compared to instrumentation that is committed in the Apache Spark
code: only users who want to activate it need do so, and they can tune the
configuration for their environment, which would not necessarily suit all
possible uses of Apache Spark.

 

The repository https://github.com/cerndb/SparkPlugins that you mentioned
provides code for a few Spark plugins that I developed and found useful,
including plugins for measuring (some) I/O metrics.

At present this is "third-party code", which you are most welcome to use,
although it is not yet part of the Apache Spark project. I'd say it may end up
there, perhaps as a set of examples, if more people find this type of
instrumentation useful.
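
For illustration, this is roughly how the plugins are attached from PySpark. The
Maven coordinates and plugin class names below are my reading of the
SparkPlugins README, so treat them as assumptions and check the repository for
the values matching your Spark version:

from pyspark.sql import SparkSession

# Sketch: pull the third-party plugin jar and register the plugin classes.
# Coordinates and class names are assumed from the cerndb/SparkPlugins docs;
# verify them against the README before use.
spark = (
    SparkSession.builder
    .appName("plugin-metrics-demo")
    .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-plugins_2.12:0.1")
    .config("spark.plugins", "ch.cern.HDFSMetrics,ch.cern.CgroupMetrics")
    .getOrCreate()
)

# The metrics exposed by the plugins then flow through the standard Spark
# metrics system (configured via spark.metrics.conf), alongside the built-in
# executor metrics.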
 

 

In your mail you referenced the DATA+AI Summit talk "What is New with Apache
Spark Performance Monitoring in Spark 3.0"; you can also find additional work on
this in the DATA+AI Summit 2021 talk "Monitor Apache Spark 3 on Kubernetes using
Metrics and Plugins".
 

 

Best,

Luca

 

From: Anil Dasari  
Sent: Monday, December 20, 2021 07:02
To: user@spark.apache.org
Subject: Spark 3.0 plugins

 

Hello everyone,

 

I was going through the "Apache Spark Performance Monitoring in Spark 3.0" talk
and wanted to collect I/O metrics for my Spark application.

I couldn't find Spark 3.0 built-in plugins for I/O metrics, like
https://github.com/cerndb/SparkPlugins, in the Spark 3 documentation. Does the
Spark 3 bundle have built-in I/O metric plugins? Thanks in advance.

 

Regards,

Anil