Avoiding collect but use foreach

2019-01-31 Thread Aakash Basu
Hi,

This:


*to_list = [list(row) for row in df.collect()]*


Gives:


[[5, 1, 1, 1, 2, 1, 3, 1, 1, 0], [5, 4, 4, 5, 7, 10, 3, 2, 1, 0], [3, 1, 1,
1, 2, 2, 3, 1, 1, 0], [6, 8, 8, 1, 3, 4, 3, 7, 1, 0], [4, 1, 1, 3, 2, 1, 3,
1, 1, 0]]


I want to avoid the collect operation, but still convert the DataFrame to a
Python list of lists, just as above, for downstream operations.


Is there a way I can do it, maybe with code that performs better than using
collect?
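
One alternative I'm aware of (hedged, since it may not fit every downstream
use): df.toLocalIterator() pulls rows to the driver one partition at a time
instead of in a single large transfer. The rows still end up on the driver,
so the final list is just as large.

    # Sketch: iterate partition by partition instead of collect()ing
    # everything at once; the resulting list is identical to the one above.
    to_list = [list(row) for row in df.toLocalIterator()]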


Thanks,

Aakash.


Re: Aws

2019-01-31 Thread Hiroyuki Nagata
Hi Pedro,


I have also started using AWS EMR, with Spark 2.4.0, and I'm looking for
performance tuning methods.

Do you configure dynamic allocation?

FYI:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

I've not tested it yet. I guess spark-submit needs to specify the number of
executors.
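
As a rough sketch (untested, and the values below are only illustrative),
dynamic allocation is normally switched on with a few standard configuration
keys; on YARN/EMR the external shuffle service must be enabled as well:

    from pyspark.sql import SparkSession

    # Sketch only: enable dynamic allocation so the executor count floats
    # between a minimum and a maximum instead of being fixed up front.
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-sketch")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.shuffle.service.enabled", "true")      # required on YARN
        .config("spark.dynamicAllocation.minExecutors", "1")  # illustrative
        .config("spark.dynamicAllocation.maxExecutors", "20") # illustrative
        .getOrCreate()
    )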

Regards,
Hiroyuki

On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tuerope...@gmail.com) wrote:

> Hi guys,
> I usually run Spark jobs on AWS EMR.
> Recently I switched from AWS EMR label 5.16 to 5.20 (which uses Spark 2.4.0).
> I've noticed that a lot of steps are taking longer than before.
> I think it is related to the automatic configuration of cores per executor.
> In version 5.16, executors took more cores if the instance allowed it.
> Let's say an instance had 8 cores and 40 GB of RAM, and the RAM configured
> per executor was 10 GB; then AWS EMR automatically assigned 2 cores per
> executor.
> Now in label 5.20, unless I configure the number of cores manually, only
> one core is assigned per executor.
>
> I don't know if it is related to Spark 2.4.0 or if it is something managed
> by AWS...
> Does anyone know if there is a way to automatically use more cores when it
> is physically possible?
>
> Thanks,
> Peter.
>


Fwd: Spark driver pod scheduling fails on auto scaled node

2019-01-31 Thread Prudhvi Chennuru (CONT)
Hi,

I am using Kubernetes *v1.11.5* and Spark *v2.3.0*, with
*calico (DaemonSet)* as the overlay network plugin and the Kubernetes
*cluster autoscaler* feature to autoscale the cluster when needed. When the
cluster is autoscaling, calico pods are scheduled on the new nodes but are not
ready for 40 to 50 seconds, and the driver and executor pods scheduled on
those nodes fail because calico is not ready.
   So is there a way to overcome this issue by not scheduling
driver and executor pods until *calico* is ready, or by introducing a delay
before driver or executor pods are scheduled on the nodes?

*I am not using spark operator.*
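
One heavily hedged idea, not something confirmed here: if something outside
Spark can label a node once calico is ready on it, Spark's standard
node-selector setting would keep driver and executor pods off unlabeled
nodes. The label name and file names below are hypothetical placeholders.

    import subprocess

    # Sketch: pass the node selector at submit time so both driver and
    # executor pods are constrained to nodes carrying the hypothetical label.
    subprocess.run([
        "spark-submit",
        "--master", "k8s://https://kubernetes.default.svc",  # placeholder
        "--deploy-mode", "cluster",
        "--conf", "spark.kubernetes.node.selector.network-ready=true",
        "my_app.py",                                          # placeholder
    ])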

-- 
*Thanks,*
*Prudhvi Chennuru.*




Aws

2019-01-31 Thread Pedro Tuero
Hi guys,
I usually run Spark jobs on AWS EMR.
Recently I switched from AWS EMR label 5.16 to 5.20 (which uses Spark 2.4.0).
I've noticed that a lot of steps are taking longer than before.
I think it is related to the automatic configuration of cores per executor.
In version 5.16, executors took more cores if the instance allowed it.
Let's say an instance had 8 cores and 40 GB of RAM, and the RAM configured per
executor was 10 GB; then AWS EMR automatically assigned 2 cores per executor.
Now in label 5.20, unless I configure the number of cores manually, only
one core is assigned per executor.

I don't know if it is related to Spark 2.4.0 or if it is something managed
by AWS...
Does anyone know if there is a way to automatically use more cores when it
is physically possible?
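
For reference, the manual workaround mentioned above looks roughly like the
sketch below; the values are only illustrative, and on EMR these properties
are usually set in the step's spark-submit arguments or in a configuration
classification rather than in code.

    from pyspark.sql import SparkSession

    # Sketch: pin cores and memory per executor explicitly instead of
    # relying on whatever default the EMR release picks.
    spark = (
        SparkSession.builder
        .config("spark.executor.cores", "2")     # illustrative
        .config("spark.executor.memory", "10g")  # illustrative
        .getOrCreate()
    )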

Thanks,
Peter.


unsubscribe

2019-01-31 Thread Daniel O' Shaughnessy



Re: CVE-2018-11760: Apache Spark local privilege escalation vulnerability

2019-01-31 Thread Imran Rashid
I received some questions about what the exact change was that fixed the
issue, and the PMC decided to post the info in JIRA to make it easier for the
community to track. The relevant details are all on

https://issues.apache.org/jira/browse/SPARK-26802

On Mon, Jan 28, 2019 at 1:08 PM Imran Rashid  wrote:

> Severity: Important
>
> Vendor: The Apache Software Foundation
>
> Versions affected:
> All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
> Spark 2.2.0 to 2.2.2
> Spark 2.3.0 to 2.3.1
>
> Description:
> When using PySpark, it's possible for a different local user to connect
> to the Spark application and impersonate the user running the Spark
> application.  This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and
> 2.3.0 to 2.3.1.
>
> Mitigation:
> 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
> 2.3.x users should upgrade to 2.3.2 or newer
> Otherwise, affected users should avoid using PySpark in multi-user
> environments.
>
> Credit:
> This issue was reported by Luca Canali and Jose Carlos Luna Duran from
> CERN.
>
> References:
> https://spark.apache.org/security.html
>


Exactly-Once delivery with Structured Streaming and Kafka

2019-01-31 Thread William Briggs
I noticed that Spark 2.4.0 implemented support for reading only committed
messages in Kafka, and was excited. Are there currently any plans to update
the Kafka output sink to support exactly-once delivery?
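
For context, a minimal sketch of the reading side, under the assumption that
the consumer's isolation level is passed through with the usual
"kafka."-prefixed option; broker and topic names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-read-committed").getOrCreate()

    # Sketch: consume only committed (transactional) Kafka messages.
    committed = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
        .option("subscribe", "input-topic")                  # placeholder
        .option("kafka.isolation.level", "read_committed")
        .load()
    )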

Thanks,
Will


Please stop asking to unsubscribe

2019-01-31 Thread Andrew Melo
The correct way to unsubscribe is to mail

user-unsubscr...@spark.apache.org

Just mailing the list with "unsubscribe" doesn't actually do anything...

Thanks
Andrew

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2019-01-31 Thread Ahmed Abdulla
unsubscribe


Driver OOM does not shut down Spark Context

2019-01-31 Thread Bryan Jeffrey
Hello.

I am running Spark 2.3.0 via YARN.  I have a Spark Streaming application
where the driver threw an uncaught out-of-memory exception:

19/01/31 13:00:59 ERROR Utils: Uncaught exception in thread element-tracking-store-worker
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:154)
        at org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:248)
        at org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:203)
        at org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$27/1691147907.compare(Unknown Source)
        at java.util.TimSort.binarySort(TimSort.java:296)
        at java.util.TimSort.sort(TimSort.java:239)
        at java.util.Arrays.sort(Arrays.java:1512)
        at java.util.ArrayList.sort(ArrayList.java:1462)
        at java.util.Collections.sort(Collections.java:175)
        at org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:203)
        at scala.collection.convert.Wrappers$JIterableWrapper.iterator(Wrappers.scala:54)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at org.apache.spark.status.AppStatusListener$$anonfun$org$apache$spark$status$AppStatusListener$$cleanupStages$1.apply(AppStatusListener.scala:894)
        at org.apache.spark.status.AppStatusListener$$anonfun$org$apache$spark$status$AppStatusListener$$cleanupStages$1.apply(AppStatusListener.scala:874)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$cleanupStages(AppStatusListener.scala:874)
        at org.apache.spark.status.AppStatusListener$$anonfun$3.apply$mcVJ$sp(AppStatusListener.scala:84)
        at org.apache.spark.status.ElementTrackingStore$$anonfun$write$1$$anonfun$apply$1$$anonfun$apply$mcV$sp$1.apply(ElementTrackingStore.scala:109)
        at org.apache.spark.status.ElementTrackingStore$$anonfun$write$1$$anonfun$apply$1$$anonfun$apply$mcV$sp$1.apply(ElementTrackingStore.scala:107)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.status.ElementTrackingStore$$anonfun$write$1$$anonfun$apply$1.apply$mcV$sp(ElementTrackingStore.scala:107)
        at org.apache.spark.status.ElementTrackingStore$$anonfun$write$1$$anonfun$apply$1.apply(ElementTrackingStore.scala:105)
        at org.apache.spark.status.ElementTrackingStore$$anonfun$write$1$$anonfun$apply$1.apply(ElementTrackingStore.scala:105)
        at org.apache.spark.util.Utils$.tryLog(Utils.scala:2001)
        at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:91)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Despite the uncaught exception, the Streaming application never terminated.
No new batches were started.  As a result, my job did not process data for
some period of time (until our ancillary monitoring noticed the issue).

*Ask: What can we do to ensure that the driver is shut down when this type
of exception occurs?*
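
One hedged option, not a definitive fix: with Java 8u92 or newer the JVM can
be told to exit as soon as an OutOfMemoryError is thrown, so YARN sees the
failure and can re-attempt the application instead of leaving a wedged
driver. These options are normally passed at submit time; the retry value and
file name below are placeholders.

    import subprocess

    # Sketch: make the driver JVM exit on OOM and give YARN a retry budget.
    subprocess.run([
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", "spark.driver.extraJavaOptions=-XX:+ExitOnOutOfMemoryError",
        "--conf", "spark.yarn.maxAppAttempts=2",   # illustrative
        "my_streaming_app.py",                     # placeholder application
    ])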

Regards,

Bryan Jeffrey


[no subject]

2019-01-31 Thread Ahmed Abdulla
unsubscribe


Survey on Data Stream Processing

2019-01-31 Thread Alexandre Strapacao Guedes Vianna
Hello People,

I'm conducting PhD research on applications using data stream processing,
which aims to investigate practices, tools, and experiences with the
development, testing, and validation of data stream software.
We'll be grateful if you share your expertise by answering a questionnaire
(it takes about 15 minutes).
Survey: http://bit.ly/strproc

Your help is essential to guide academic research.

*The information entered will be kept anonymous. The data will be used only
to publish academic papers. If you have any questions or concerns, please
contact me.*
Alexandre Vianna (a...@cin.ufpe.br)
PhD Student at Federal University of Pernambuco/Brazil


[no subject]

2019-01-31 Thread Daniel O' Shaughnessy
unsubscribe


unsubscribe

2019-01-31 Thread Junior Alvarez
unsubscribe


Unsubscribe

2019-01-31 Thread Samuel Zhou
Unsubscribe


unsubscribe

2019-01-31 Thread Junior Alvarez
Unsubscribe