Re: Hi Team, I put a UDF-Utils jar on Google Cloud Storage, but I can't run it

2021-12-13 Thread Mich Talebzadeh
This is probably because you are using an older version of Spark.

The following works on version 3.1.1:

gsutil ls gs://etcbucket/ojdbc8.jar
gs://etcbucket/ojdbc8.jar

*spark-sql add jar gs://etcbucket/ojdbc8.jar*

21/12/13 08:56:00 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
21/12/13 08:56:05 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout
does not exist
21/12/13 08:56:05 WARN HiveConf: HiveConf of name hive.stats.retries.wait
does not exist
21/12/13 08:56:09 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 2.3.0
21/12/13 08:56:09 WARN ObjectStore: setMetaStoreSchemaVersion called but
recording version is disabled: version = 2.3.0, comment = Set by MetaStore
hduser@10.154.15.201
Spark master: local[*], Application Id: local-1639385763388
spark-sql>
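
If you are stuck on an older Spark (for example 2.3) that rejects the gs://
scheme in ADD JAR, one possible workaround - assuming the gsutil CLI is
configured on the host - is to copy the jar to local disk first and add it
with the file scheme, which the error message lists as supported:

gsutil cp gs://etcbucket/ojdbc8.jar /tmp/ojdbc8.jar

spark-sql> add jar file:///tmp/ojdbc8.jar;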

HTH



On Mon, 13 Dec 2021 at 03:25, bo zhao  wrote:

> Update. My Spark version is 2.3.
>
> bo zhao wrote on Mon, 13 Dec 2021 at 11:24:
>
>> Hi Team,
>> I'm migrating from on-prem to Google Cloud Platform. I have a UDF-Utils
>> jar containing several UDF functions. On-prem, the Spark code just needed
>> "add jar hdfs://x/home/path/UDF-Utils.jar" in Spark SQL. Now I have put
>> this jar on Cloud Storage, but when I run Spark SQL with
>> "add jar gs://x/home/path/UDF-Utils.jar", it fails with: not supported url
>> schema, only support hdfs | file | ivy. So I want to know if there is a
>> solution for this? I can't open an HDFS file system on Google Cloud
>> Platform or put this jar on Artifactory.
>>
>


Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Jörn Franke
Is it in any case appropriate to use log4j 1.x, which is not maintained
anymore and has other security vulnerabilities that won't be fixed?

> On 13 Dec 2021 at 06:06, Sean Owen wrote:
> 
> 
> Check the CVE - the log4j vulnerability appears to affect log4j 2, not 1.x.
> There was mention that it could affect 1.x when used with JNDI or JMS
> handlers, but Spark does neither. (Unless anyone can think of something I'm
> missing, but I've never heard of or seen that come up at all in 7 years with Spark.)
> 
> The big issue would be applications that themselves configure log4j 2.x, but 
> that's not a Spark issue per se.
> 
>> On Sun, Dec 12, 2021 at 10:46 PM Pralabh Kumar  
>> wrote:
>> Hi developers, users
>> 
>> Spark is built using log4j 1.2.17. Is there a plan to upgrade based on the
>> recently detected CVE?
>> 
>> 
>> Regards
>> Pralabh kumar


Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Sean Owen
This has come up several times over the years - search JIRA. The very short
summary is: Spark does not use log4j 1.x, but its dependencies do, and
that's the issue.
Anyone that can successfully complete the surgery at this point is welcome
to, but I failed ~2 years ago.

On Mon, Dec 13, 2021 at 10:02 AM Jörn Franke  wrote:

> Is it in any case appropriate to use log4j 1.x, which is not maintained
> anymore and has other security vulnerabilities that won't be fixed?
>
> On 13 Dec 2021 at 06:06, Sean Owen wrote:
>
> 
> Check the CVE - the log4j vulnerability appears to affect log4j 2, not
> 1.x. There was mention that it could affect 1.x when used with JNDI or JMS
> handlers, but Spark does neither. (Unless anyone can think of something I'm
> missing, but I've never heard of or seen that come up at all in 7 years with Spark.)
>
> The big issue would be applications that themselves configure log4j 2.x,
> but that's not a Spark issue per se.
>
> On Sun, Dec 12, 2021 at 10:46 PM Pralabh Kumar 
> wrote:
>
>> Hi developers, users
>>
>> Spark is built using log4j 1.2.17. Is there a plan to upgrade based on the
>> recently detected CVE?
>>
>>
>> Regards
>> Pralabh kumar
>>
>


Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Sean Owen
You would want to shade this dependency in your app, in which case you
would be using log4j 2. If you don't shade it and just include it, you will
also be using log4j 2, as some of the API classes are different. If they
overlap with log4j 1, you will probably hit errors anyway.
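
With sbt-assembly, the relocation could look roughly like this (a sketch,
assuming an sbt build with sbt-assembly already wired in; the shaded package
name is illustrative, and the key scoping differs across plugin versions):

    // build.sbt: relocate log4j 2 packages inside the application jar so
    // they cannot collide with whatever logging classes Spark provides.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("org.apache.logging.log4j.**" -> "myapp.shaded.log4j.@1").inAll
    )

Note that log4j 2's plugin registry does not always survive relocation
cleanly, so any shading like this needs to be tested end to end.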

On Mon, Dec 13, 2021 at 6:33 PM James Yu  wrote:

> Question: Spark uses log4j 1.2.17. If my application jar contains log4j 2.x
> and gets submitted to the Spark cluster, which version of log4j actually
> gets used during the Spark session?
> --
> *From:* Sean Owen 
> *Sent:* Monday, December 13, 2021 8:25 AM
> *To:* Jörn Franke 
> *Cc:* Pralabh Kumar ; dev ;
> user.spark 
> *Subject:* Re: Log4j 1.2.17 spark CVE
>
> This has come up several times over the years - search JIRA. The very short
> summary is: Spark does not use log4j 1.x, but its dependencies do, and
> that's the issue.
> Anyone that can successfully complete the surgery at this point is welcome
> to, but I failed ~2 years ago.
>
> On Mon, Dec 13, 2021 at 10:02 AM Jörn Franke  wrote:
>
> Is it in any case appropriate to use log4j 1.x, which is not maintained
> anymore and has other security vulnerabilities that won't be fixed?
>
> On 13 Dec 2021 at 06:06, Sean Owen wrote:
>
> 
> Check the CVE - the log4j vulnerability appears to affect log4j 2, not
> 1.x. There was mention that it could affect 1.x when used with JNDI or JMS
> handlers, but Spark does neither. (Unless anyone can think of something I'm
> missing, but I've never heard of or seen that come up at all in 7 years with Spark.)
>
> The big issue would be applications that themselves configure log4j 2.x,
> but that's not a Spark issue per se.
>
> On Sun, Dec 12, 2021 at 10:46 PM Pralabh Kumar 
> wrote:
>
> Hi developers, users
>
> Spark is built using log4j 1.2.17. Is there a plan to upgrade based on the
> recently detected CVE?
>
>
> Regards
> Pralabh kumar
>
>


Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Qian Sun
My understanding is that we don't need to do anything. log4j-core 2.x is not
used in Spark.
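
One quick way to check what a given distribution actually ships with
(assuming the standard binary layout) is to list the bundled jars:

ls $SPARK_HOME/jars | grep -i log4j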

> On 13 Dec 2021 at 12:45 PM, Pralabh Kumar wrote:
> 
> Hi developers, users
> 
> Spark is built using log4j 1.2.17. Is there a plan to upgrade based on the
> recently detected CVE?
> 
> 
> Regards
> Pralabh kumar





Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
No takers here? :)

I can see now why a median function is not available in most data
processing systems. It's pretty annoying to implement!

On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas 
wrote:

> I'm trying to create a new aggregate function. It's my first time working
> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>
> My goal is to create a function to calculate the median
> (https://issues.apache.org/jira/browse/SPARK-26589).
>
> As a very simple solution, I could just define median to be an alias of
> `Percentile(col, 0.5)`. However, the leading comment on the Percentile
> expression
> (https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39)
> highlights that it's very memory-intensive and can easily lead to
> OutOfMemory errors.
>
> So instead of using Percentile, I'm trying to create an Expression that
> calculates the median without needing to hold everything in memory at once.
> I'm considering two different approaches:
>
> 1. Define Median as a combination of existing expressions: The median can
> perhaps be built out of the existing expressions for Count
> (https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48)
> and NthValue
> (https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675).
>
> I don't see a template I can follow for building a new expression out of
> existing expressions (i.e. without having to implement a bunch of methods
> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
> would wrap NthValue to make it usable as a regular aggregate function. The
> wrapped NthValue would need an implicit window that provides the necessary
> ordering.
>
>
> Is there any potential to this idea? Any pointers on how to implement it?
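
A rough user-level illustration of idea 1 - not a Catalyst expression, and
not memory-efficient, since the global window pulls every row into a single
partition - assuming a DataFrame `df` with a numeric column "x":

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{avg, col, row_number}

    def exactMedian(df: DataFrame): DataFrame = {
      val n  = df.count()   // pass 1: total row count
      val lo = (n + 1) / 2  // lower middle rank (1-based)
      val hi = (n + 2) / 2  // upper middle rank; equals lo when n is odd
      // pass 2: rank every row, keep the middle one or two, average them
      val ranked = df.withColumn("rn", row_number().over(Window.orderBy(col("x"))))
      ranked.where(col("rn") === lo || col("rn") === hi)
            .agg(avg(col("x")).as("median"))
    }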
>
>
> 2. Another memory-light approach to calculating the median requires
> multiple passes over the data to converge on the answer. The approach is
> described here
> (https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers).
> (I posted a sketch implementation of this approach using Spark's
> user-level API here
> (https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081).)
>
> I am also struggling to understand how I would build an aggregate function
> like this, since it requires multiple passes over the data. From what I can
> see, Catalyst's aggregate functions are designed to work with a single pass
> over the data.
>
> We don't seem to have an interface for AggregateFunction that supports
> multiple passes over the data. Is there some way to do this?
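
A hedged sketch of what idea 2 might look like against the public API, for
flavor only (assuming a non-empty DoubleType column "x"; each loop iteration
is a separate Spark job, and it converges on a value whose rank is
approximately n/2 rather than the exact middle element):

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.functions.{max, min}

    def medianByBisection(df: DataFrame, tol: Double = 1e-6): Double = {
      // First passes: bracket the search range and count the rows.
      val Row(lo0: Double, hi0: Double) = df.agg(min("x"), max("x")).head()
      val half = df.count() / 2.0
      var (lo, hi) = (lo0, hi0)
      // Each subsequent pass counts how many values fall at or below the
      // candidate midpoint and halves the search interval accordingly.
      while (hi - lo > tol) {
        val mid = (lo + hi) / 2
        if (df.where(df("x") <= mid).count() < half) lo = mid else hi = mid
      }
      (lo + hi) / 2
    }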
>
>
> Again, this is my first serious foray into Catalyst. Any specific
> implementation guidance is appreciated!
>
> Nick
>
>


Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Reynold Xin
tl;dr: there's no easy way to implement aggregate expressions that'd require
multiple passes over the data. It is simply not something that's supported,
and doing so would come at a very high cost.

Would you be OK using approximate percentile? That's relatively cheap.
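
For reference, the built-in approximation is exposed both as the SQL function
approx_percentile and on the DataFrame side; a minimal example, assuming a
DataFrame `df` with a numeric column "x":

    // Median estimate with a 1% relative error bound; tighten the last
    // argument toward 0.0 for more accuracy at higher cost.
    val Array(approxMedian) = df.stat.approxQuantile("x", Array(0.5), 0.01)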

On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> 
> No takers here? :)
> 
> 
> I can see now why a median function is not available in most data
> processing systems. It's pretty annoying to implement!
> 
> On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
> 
> 
>> I'm trying to create a new aggregate function. It's my first time working
>> with Catalyst, so it's exciting---but I'm also in a bit over my head.
>> 
>> 
>> My goal is to create a function to calculate the median (
>> https://issues.apache.org/jira/browse/SPARK-26589 ).
>> 
>> 
>> As a very simple solution, I could just define median to be an alias of
>> `Percentile(col, 0.5)`. However, the leading comment on the Percentile expression (
>> https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39
>> ) highlights that it's very memory-intensive and can easily lead to
>> OutOfMemory errors.
>> 
>> 
>> So instead of using Percentile, I'm trying to create an Expression that
>> calculates the median without needing to hold everything in memory at
>> once. I'm considering two different approaches:
>> 
>> 
>> 1. Define Median as a combination of existing expressions: The median can
>> perhaps be built out of the existing expressions for Count (
>> https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48
>> ) and NthValue (
>> https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675
>> ).
>> 
>> 
>> 
>>> I don't see a template I can follow for building a new expression out of
>>> existing expressions (i.e. without having to implement a bunch of methods
>>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>>> would wrap NthValue to make it usable as a regular aggregate function. The
>>> wrapped NthValue would need an implicit window that provides the necessary
>>> ordering.
>>> 
>> 
>> 
>> 
>> 
>>> Is there any potential to this idea? Any pointers on how to implement it?
>>> 
>> 
>> 
>> 
>> 2. Another memory-light approach to calculating the median requires
>> multiple passes over the data to converge on the answer. The approach is 
>> described
>> here (
>> https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers
>> ). (I posted a sketch implementation of this approach using Spark's
>> user-level API here (
>> https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081
>> ).)
>> 
>> 
>> 
>>> I am also struggling to understand how I would build an aggregate function
>>> like this, since it requires multiple passes over the data. From what I
>>> can see, Catalyst's aggregate functions are designed to work with a single
>>> pass over the data.
>>> 
>>> 
>>> We don't seem to have an interface for AggregateFunction that supports
>>> multiple passes over the data. Is there some way to do this?
>>> 
>> 
>> 
>> Again, this is my first serious foray into Catalyst. Any specific
>> implementation guidance is appreciated!
>> 
>> 
>> Nick
>> 
> 
>



Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
Yeah, I think approximate percentile is good enough most of the time.

I don't have a specific need for a precise median. I was interested in
implementing it more as a Catalyst learning exercise, but it turns out I
picked a bad learning exercise to solve. :)

On Mon, Dec 13, 2021 at 9:46 PM Reynold Xin  wrote:

> tl;dr: there's no easy way to implement aggregate expressions that'd
> require multiple passes over the data. It is simply not something that's
> supported, and doing so would come at a very high cost.
>
> Would you be OK using approximate percentile? That's relatively cheap.
>
>
>
> On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> No takers here? :)
>>
>> I can see now why a median function is not available in most data
>> processing systems. It's pretty annoying to implement!
>>
>> On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I'm trying to create a new aggregate function. It's my first time
>>> working with Catalyst, so it's exciting---but I'm also in a bit over my
>>> head.
>>>
>>> My goal is to create a function to calculate the median
>>> (https://issues.apache.org/jira/browse/SPARK-26589).
>>>
>>> As a very simple solution, I could just define median to be an alias of
>>> `Percentile(col, 0.5)`. However, the leading comment on the Percentile
>>> expression
>>> (https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39)
>>> highlights that it's very memory-intensive and can easily lead to
>>> OutOfMemory errors.
>>>
>>> So instead of using Percentile, I'm trying to create an Expression that
>>> calculates the median without needing to hold everything in memory at once.
>>> I'm considering two different approaches:
>>>
>>> 1. Define Median as a combination of existing expressions: The median
>>> can perhaps be built out of the existing expressions for Count
>>> (https://github.com/apache/spark/blob/9af338cd685bce26abbc2dd4d077bde5068157b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L48)
>>> and NthValue
>>> (https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L675).
>>>
>>> I don't see a template I can follow for building a new expression out of
>>> existing expressions (i.e. without having to implement a bunch of methods
>>> for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
>>> would wrap NthValue to make it usable as a regular aggregate function. The
>>> wrapped NthValue would need an implicit window that provides the necessary
>>> ordering.
>>>
>>>
>>> Is there any potential to this idea? Any pointers on how to implement it?
>>>
>>>
>>> 2. Another memory-light approach to calculating the median requires
>>> multiple passes over the data to converge on the answer. The approach is
>>> described here
>>> (https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers).
>>> (I posted a sketch implementation of this approach using Spark's
>>> user-level API here
>>> (https://issues.apache.org/jira/browse/SPARK-26589?focusedCommentId=17452081&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17452081).)
>>>
>>> I am also struggling to understand how I would build an aggregate
>>> function like this, since it requires multiple passes over the data. From
>>> what I can see, Catalyst's aggregate functions are designed to work with a
>>> single pass over the data.
>>>
>>> We don't seem to have an interface for AggregateFunction that supports
>>> multiple passes over the data. Is there some way to do this?
>>>
>>>
>>> Again, this is my first serious foray into Catalyst. Any specific
>>> implementation guidance is appreciated!
>>>
>>> Nick
>>>
>>
>