Re: Dataframe multiple joins with same dataframe not able to resolve correct join columns

2018-07-11 Thread Ben White
Sounds like the same root cause as SPARK-14948 or SPARK-10925.

A workaround is to "clone" df3 like this:

val df3clone = df3.toDF(df3.schema.fieldNames: _*)

Then use df3clone in place of df3 in the second join.
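
For reference, the "Perhaps you need to use aliases" hint in the warning can also be followed directly. Below is a minimal, untested sketch (reusing the df2/df3 and column names from this thread) of what that could look like for the second join:

import org.apache.spark.sql.functions.col

// Untested sketch: alias both sides and reference the join keys through the
// aliases, so the analyzer resolves two distinct attributes instead of a
// trivially true self-comparison.
val df4 = df3.alias("l")
  .join(df2.alias("r"),
    col("l.EMPLOYEE_ID") === col("r.EMPLOYEE_ID") and
    col("l.BUSINESS_ID") === col("r.BUSINESS_ID"))
  .drop(col("r.EMPLOYEE_ID")) // drop the duplicate key columns from the right side
  .drop(col("r.BUSINESS_ID"))
  // .select(...) as in the original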



On Wed, Jul 11, 2018 at 2:52 PM Nirav Patel  wrote:

> I am trying to join df1 with df2, and then join the result with df2 again.
>
> df2 is the dataframe common to both joins.
>
> val df3 = df1
>   .join(df2,
>   df1("PARTICIPANT_ID") === df2("PARTICIPANT_ID") and
>   df1("BUSINESS_ID") === df2("BUSINESS_ID"))
>   .drop(df1("BUSINESS_ID")) //dropping duplicates
>   .drop(df1("PARTICIPANT_ID")) //dropping duplicates
>   .select("EMPLOYEE_ID",...)
>
> val df4 = df3
>   .join(df2,
>   df3("EMPLOYEE_ID") === df2("EMPLOYEE_ID") and
>   df3("BUSINESS_ID") === df2("BUSINESS_ID"))
>   .drop(df2("BUSINESS_ID")) //dropping duplicates
>   .drop(df2("EMPLOYEE_ID")) //dropping duplicates
>   .select(...)
>
>
> I am getting the following warning, and most likely it's a Cartesian join,
> which is not what I want.
> 14:30:32.193 12262 [main] WARN   org.apache.spark.sql.Column - Constructing
> trivially true equals predicate, 'EMPLOYEE_ID#83 = EMPLOYEE_ID#83'.
> Perhaps you need to use aliases.
>
> 14:30:32.195 12264 [main] WARN   org.apache.spark.sql.Column -
> Constructing trivially true equals predicate, 'BUSINESS_ID#312 =
> BUSINESS_ID#312'. Perhaps you need to use aliases.
>
> As you can see, one of my join predicates is converted to
> "(EMPLOYEE_ID#83 = EMPLOYEE_ID#83)". I think this should be okay because they
> should still be columns from different dataframes (df3 and df2).
>
> Just want to confirm that this warning is harmless in this scenario.
>
> Problem is similar to this one:
>
> https://stackoverflow.com/questions/32190828/spark-sql-performing-carthesian-join-instead-of-inner-join


Dataframe multiple joins with same dataframe not able to resolve correct join columns

2018-07-11 Thread Nirav Patel
I am trying to join df1 with df2, and then join the result with df2 again.

df2 is the dataframe common to both joins.

val df3 = df1
  .join(df2,
  df1("PARTICIPANT_ID") === df2("PARTICIPANT_ID") and
  df1("BUSINESS_ID") === df2("BUSINESS_ID"))
  .drop(df1("BUSINESS_ID")) //dropping duplicates
  .drop(df1("PARTICIPANT_ID")) //dropping duplicates
  .select("EMPLOYEE_ID",...)

val df4 = df3
  .join(df2,
  df3("EMPLOYEE_ID") === df2("EMPLOYEE_ID") and
  df3("BUSINESS_ID") === df2("BUSINESS_ID"))
  .drop(df2("BUSINESS_ID")) //dropping duplicates
  .drop(df2("EMPLOYEE_ID")) //dropping duplicates
  .select(...)


I am getting the following warning, and most likely it's a Cartesian join,
which is not what I want.
14:30:32.193 12262 [main] WARN   org.apache.spark.sql.Column - Constructing
trivially true equals predicate, 'EMPLOYEE_ID#83 = EMPLOYEE_ID#83'.
Perhaps you need to use aliases.

14:30:32.195 12264 [main] WARN   org.apache.spark.sql.Column - Constructing
trivially true equals predicate, 'BUSINESS_ID#312 = BUSINESS_ID#312'.
Perhaps you need to use aliases.

As you can see, one of my join predicates is converted to "(EMPLOYEE_ID#83
= EMPLOYEE_ID#83)". I think this should be okay because they should still
be columns from different dataframes (df3 and df2).

Just want to confirm that this warning is harmless in this scenario.

Problem is similar to this one:
https://stackoverflow.com/questions/32190828/spark-sql-performing-carthesian-join-instead-of-inner-join
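
One way to confirm whether the warning actually degraded the join into a cross join (a suggestion added here, not from the original mail) is to inspect the physical plan:

// Hedged check, assuming the df4 defined above: a collapsed join shows up as
// CartesianProduct or BroadcastNestedLoopJoin in the physical plan instead of
// a SortMergeJoin/BroadcastHashJoin.
df4.explain(true)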



CVE-2018-8024 Apache Spark XSS vulnerability in UI

2018-07-11 Thread Sean Owen
Severity: Medium

Vendor: The Apache Software Foundation

Versions Affected:
Spark versions through 2.1.2
Spark 2.2.0 through 2.2.1
Spark 2.3.0

Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's
possible for a malicious user to construct a URL pointing to a Spark
cluster's UI's job and stage info pages. If a user can be tricked into
accessing the URL, it can be used to cause script to execute and expose
information from the user's view of the Spark UI. While some browsers, like
recent versions of Chrome and Safari, are able to block this type of attack,
current versions of Firefox (and possibly others) do not.

Mitigation:
1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer
2.2.x users should upgrade to 2.2.2 or newer
2.3.x users should upgrade to 2.3.1 or newer

Credit:
Spencer Gietzen, Rhino Security Labs

References:
https://spark.apache.org/security.html


CVE-2018-1334 Apache Spark local privilege escalation vulnerability

2018-07-11 Thread Sean Owen
Severity: High

Vendor: The Apache Software Foundation

Versions affected:
Spark versions through 2.1.2
Spark 2.2.0 to 2.2.1
Spark 2.3.0

Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when
using PySpark or SparkR, it's possible for a different local user to
connect to the Spark application and impersonate the user running the Spark
application.

Mitigation:
1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer
2.2.x users should upgrade to 2.2.2 or newer
2.3.x users should upgrade to 2.3.1 or newer
Otherwise, affected users should avoid using PySpark and SparkR in
multi-user environments.

Credit:
Nehmé Tohmé, Cloudera, Inc.

References:
https://spark.apache.org/security.html


Spark accessing fakes3

2018-07-11 Thread Patrick Roemer
Hi,

does anybody know if (and how) it's possible to get a (dev-local) Spark
installation to talk to fakes3 for s3[n|a]:// URLs?

I have managed to connect to AWS S3 from my local installation by adding
hadoop-aws and aws-java-sdk to the jars and using s3:// URLs as arguments for
SparkContext#textFile(), but I'm at a loss as to how to get it to work with a
local fakes3.
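
For what it's worth, a dev-local S3 clone is usually reached by overriding the s3a endpoint in the Hadoop configuration. A rough, untested sketch (the endpoint, port, bucket and credentials below are placeholders for wherever fakes3 is listening; hadoop-aws and aws-java-sdk are assumed to be on the classpath as above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fakes3-test")
  .getOrCreate()

// Standard hadoop-aws (s3a) settings, redirected to the local fake endpoint.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.endpoint", "http://127.0.0.1:4567")  // placeholder: wherever fakes3 listens
hc.set("fs.s3a.access.key", "dummy")                // placeholder credentials
hc.set("fs.s3a.secret.key", "dummy")
hc.set("fs.s3a.path.style.access", "true")          // no virtual-host-style buckets locally
hc.set("fs.s3a.connection.ssl.enabled", "false")

val lines = spark.sparkContext.textFile("s3a://some-bucket/some/key.txt")
println(lines.count())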

The only reference I've found so far is this issue, where somebody seems
to have gotten close, but unfortunately he's forgotten about the details:

https://github.com/jubos/fake-s3/issues/108

Thanks and best regards,
Patrick

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on Mesos - Weird behavior

2018-07-11 Thread Pavel Plotnikov
Oh, sorry, I missed that you use Spark without dynamic allocation. Anyway,
I don't know whether these parameters work without dynamic allocation.

On Wed, Jul 11, 2018 at 5:11 PM Thodoris Zois  wrote:

> Hello,
>
> Yeah you are right, but I think that works only if you use Spark dynamic
> allocation. Am I wrong?
>
> -Thodoris
>
> On 11 Jul 2018, at 17:09, Pavel Plotnikov 
> wrote:
>
> Hi, Thodoris
> You can configure resources per executor and manipulate the number of
> executors instead of using spark.max.cores. I think the
> spark.dynamicAllocation.minExecutors
> and spark.dynamicAllocation.maxExecutors configuration values can help
> you.
>
> On Tue, Jul 10, 2018 at 5:07 PM Thodoris Zois  wrote:
>
>> Actually, after some experiments we figured out that spark.max.cores /
>> spark.executor.cores is the upper bound on the number of executors. Spark
>> apps will run even if only one executor can be launched.
>>
>> Is there any way to also specify the lower bound? It is a bit annoying
>> that we seemingly can't control the resource usage of an application. By
>> the way, we are not using dynamic allocation.
>>
>> - Thodoris
>>
>>
>> On 10 Jul 2018, at 14:35, Pavel Plotnikov 
>> wrote:
>>
>> Hello Thodoris!
>> Have you checked the following:
>>  - Does the Mesos cluster have available resources?
>>  - Have tasks been waiting in the Spark queue for longer than the
>> spark.dynamicAllocation.schedulerBacklogTimeout configuration value?
>>  - And have you checked that Mesos sends offers to the Spark app's Mesos
>> framework with at least 10 cores and 2GB of RAM?
>>
>> If Mesos has no available offers with 10 cores but does have offers with
>> 8 or 9, you can use smaller executors (for example, 4 cores and 1 GB of RAM)
>> to better fit the resources available on the nodes.
>>
>> Cheers,
>> Pavel
>>
>> On Mon, Jul 9, 2018 at 9:05 PM Thodoris Zois  wrote:
>>
>>> Hello list,
>>>
>>> We are running Apache Spark on a Mesos cluster and we are seeing weird
>>> executor behavior. When we submit an app with e.g. 10 cores and 2GB of
>>> memory per executor and max cores 30, we expect to see 3 executors running
>>> on the cluster. However, sometimes there are only 2... Spark applications
>>> are not the only ones that run on the cluster. I guess that Spark starts
>>> executors on the available offers even if they do not satisfy our needs. Is
>>> there any configuration we can use to prevent Spark from starting when
>>> there are no resource offers for the total number of executors?
>>>
>>> Thank you
>>> - Thodoris
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>


Re: Spark on Mesos - Weird behavior

2018-07-11 Thread Thodoris Zois
Hello,

Yeah you are right, but I think that works only if you use Spark dynamic 
allocation. Am I wrong?

-Thodoris

> On 11 Jul 2018, at 17:09, Pavel Plotnikov wrote:
> 
> Hi, Thodoris
> You can configure resources per executor and manipulate the number of
> executors instead of using spark.max.cores. I think the
> spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors
> configuration values can help you.
> 
> On Tue, Jul 10, 2018 at 5:07 PM Thodoris Zois wrote:
> Actually, after some experiments we figured out that spark.max.cores /
> spark.executor.cores is the upper bound on the number of executors. Spark
> apps will run even if only one executor can be launched.
>
> Is there any way to also specify the lower bound? It is a bit annoying
> that we seemingly can't control the resource usage of an application. By
> the way, we are not using dynamic allocation.
> 
> - Thodoris 
> 
> 
> On 10 Jul 2018, at 14:35, Pavel Plotnikov wrote:
> 
>> Hello Thodoris!
>> Have you checked the following:
>>  - Does the Mesos cluster have available resources?
>>  - Have tasks been waiting in the Spark queue for longer than the
>> spark.dynamicAllocation.schedulerBacklogTimeout configuration value?
>>  - And have you checked that Mesos sends offers to the Spark app's Mesos
>> framework with at least 10 cores and 2GB of RAM?
>>
>> If Mesos has no available offers with 10 cores but does have offers with
>> 8 or 9, you can use smaller executors (for example, 4 cores and 1 GB of RAM)
>> to better fit the resources available on the nodes.
>> 
>> Cheers,
>> Pavel
>> 
>> On Mon, Jul 9, 2018 at 9:05 PM Thodoris Zois wrote:
>> Hello list,
>> 
>> We are running Apache Spark on a Mesos cluster and we are seeing weird
>> executor behavior. When we submit an app with e.g. 10 cores and 2GB of
>> memory per executor and max cores 30, we expect to see 3 executors running
>> on the cluster. However, sometimes there are only 2... Spark applications
>> are not the only ones that run on the cluster. I guess that Spark starts
>> executors on the available offers even if they do not satisfy our needs. Is
>> there any configuration we can use to prevent Spark from starting when
>> there are no resource offers for the total number of executors?
>> 
>> Thank you 
>> - Thodoris 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> 
>> 



Re: Spark on Mesos - Weird behavior

2018-07-11 Thread Pavel Plotnikov
Hi, Thodoris
You can configure resources per executor and manipulate the number of
executors instead of using spark.max.cores. I think the
spark.dynamicAllocation.minExecutors
and spark.dynamicAllocation.maxExecutors configuration values can help you.
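
As a rough sketch of what those settings could look like together (the numbers are just the example values from this thread; the property written here as spark.max.cores is presumably spark.cores.max, and dynamic allocation on Mesos also needs the external shuffle service):

import org.apache.spark.sql.SparkSession

// Illustrative settings only, not a recommendation.
val spark = SparkSession.builder()
  .appName("mesos-executor-sizing")
  .config("spark.executor.cores", "10")                // cores per executor
  .config("spark.executor.memory", "2g")               // memory per executor
  .config("spark.cores.max", "30")                     // upper bound on total cores
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")     // required for dynamic allocation on Mesos
  .config("spark.dynamicAllocation.minExecutors", "3") // lower bound on executors
  .config("spark.dynamicAllocation.maxExecutors", "3")
  .getOrCreate()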

On Tue, Jul 10, 2018 at 5:07 PM Thodoris Zois  wrote:

> Actually, after some experiments we figured out that spark.max.cores /
> spark.executor.cores is the upper bound on the number of executors. Spark
> apps will run even if only one executor can be launched.
>
> Is there any way to also specify the lower bound? It is a bit annoying
> that we seemingly can't control the resource usage of an application. By
> the way, we are not using dynamic allocation.
>
> - Thodoris
>
>
> On 10 Jul 2018, at 14:35, Pavel Plotnikov 
> wrote:
>
> Hello Thodoris!
> Have you checked the following:
>  - Does the Mesos cluster have available resources?
>  - Have tasks been waiting in the Spark queue for longer than the
> spark.dynamicAllocation.schedulerBacklogTimeout configuration value?
>  - And have you checked that Mesos sends offers to the Spark app's Mesos
> framework with at least 10 cores and 2GB of RAM?
>
> If Mesos has no available offers with 10 cores but does have offers with
> 8 or 9, you can use smaller executors (for example, 4 cores and 1 GB of RAM)
> to better fit the resources available on the nodes.
>
> Cheers,
> Pavel
>
> On Mon, Jul 9, 2018 at 9:05 PM Thodoris Zois  wrote:
>
>> Hello list,
>>
>> We are running Apache Spark on a Mesos cluster and we are seeing weird
>> executor behavior. When we submit an app with e.g. 10 cores and 2GB of
>> memory per executor and max cores 30, we expect to see 3 executors running
>> on the cluster. However, sometimes there are only 2... Spark applications
>> are not the only ones that run on the cluster. I guess that Spark starts
>> executors on the available offers even if they do not satisfy our needs. Is
>> there any configuration we can use to prevent Spark from starting when
>> there are no resource offers for the total number of executors?
>>
>> Thank you
>> - Thodoris
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: DataTypes of an ArrayType

2018-07-11 Thread Patrick McCarthy
Arrays need to have a single element type; I think you're looking for a struct
column. See:
https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803
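
A minimal sketch of that idea (written in Scala here for illustration, although the question is about PySpark; the field names are made up): keep the array's element type fixed, but make that element a struct whose fields carry the mixed types.

import org.apache.spark.sql.types._

// Each array element is a struct with two FloatType fields and one DecimalType
// field. The array itself still has a single element type (the struct).
val elementType = StructType(Seq(
  StructField("f1", FloatType, nullable = false),
  StructField("f2", FloatType, nullable = false),
  StructField("f3", DecimalType(38, 18), nullable = false) // precision/scale chosen arbitrarily
))

val schema = StructType(Seq(
  StructField("Column_Name", ArrayType(elementType, containsNull = false), nullable = false)
))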

On Wed, Jul 11, 2018 at 6:37 AM, dimitris plakas 
wrote:

> Hello everyone,
>
> I am new to PySpark and I would like to ask if there is any way to have a
> DataFrame column which is an ArrayType and has a different DataType for each
> element of the array. For example,
> to have something like:
>
> StructType([StructField("Column_Name", ArrayType(ArrayType(FloatType(),
> FloatType(), DecimalType(), False),False), False)]).
>
> I want to have an ArrayType column with 2 elements as FloatType and 1
> element as DecimalType
>
> Thank you in advance
>


Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-11 Thread Tien Dat
Thanks for your suggestion.

I have been checking out Spark-jobserver. Just an off-topic question about this
project: does the Apache Spark project have any support for, or connection to,
the Spark-jobserver project? I noticed that they do not have a release for the
newest version of Spark (e.g., 2.3.1).

As you mentioned, many organizations and individuals have been using it,
so wouldn't it be better to have it developed within the Spark community?

Best
Tien Dat



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



DataTypes of an ArrayType

2018-07-11 Thread dimitris plakas
Hello everyone,

I am new to PySpark and I would like to ask if there is any way to have a
DataFrame column which is an ArrayType and has a different DataType for each
element of the array. For example,
to have something like:

StructType([StructField("Column_Name", ArrayType(ArrayType(FloatType(),
FloatType(), DecimalType(), False),False), False)]).

I want to have an ArrayType column with 2 elements as FloatType and 1
element as DecimalType

Thank you in advance