Re: Executor lost for unknown reasons error Spark 2.3 on kubernetes

2018-07-31 Thread purna pradeep
More details, from the Spark driver pod logs, about the executor pod that died abruptly:


2018-07-30 19:58:41 ERROR TaskSchedulerImpl:70 - Lost executor 3 on
10.*.*.*.*: Executor lost for unknown reasons.

2018-07-30 19:58:41 WARN  TaskSetManager:66 - Lost task 32.0 in stage 9.0
(TID 133, 10.10.147.6, executor 3): ExecutorLostFailure (executor 3 exited
caused by one of the running tasks) Reason: Executor lost for unknown
reasons.

2018-07-30 19:58:41 WARN  KubernetesClusterSchedulerBackend:66 - Received
delete event of executor pod
accelerate-snowflake-test-5b6ba9d5495b3ae9a1358ae9c3f9a8c3-exec-3. Reason:
null

2018-07-30 19:58:41 WARN  KubernetesClusterSchedulerBackend:347 - Executor
with id 3 was not marked as disconnected, but the watch received an event
of type DELETED for this executor. The executor may have failed to start in
the first place and never registered with the driver.

2018-07-30 19:58:41 INFO  TaskSetManager:54 - Starting task 32.1 in stage
9.0 (TID 134, 10.*.*.*.*, executor 7, partition 32, ANY, 9262 bytes)

2018-07-30 19:58:42 INFO  ContextCleaner:54 - Cleaned accumulator 246

2018-07-30 19:58:42 INFO  ContextCleaner:54 - Cleaned accumulator 252

2018-07-30 19:58:42 INFO  ContextCleaner:54 - Cleaned accumulator 254

2018-07-30 19:58:42 INFO  BlockManagerInfo:54 - Removed broadcast_11_piece0
on spark-1532979165550-driver-svc.spark.svc:7079 in memory (size: 6.9 KB,
free: 997.6 MB)

2018-07-30 19:58:42 INFO  BlockManagerInfo:54 - Removed broadcast_11_piece0
on 10.*.*.*.*:43815 on disk (size: 6.9 KB)

2018-07-30 19:58:42 WARN  TransportChannelHandler:78 - Exception in
connection from /10.*.*.*.*:37578

java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)

2018-07-30 19:58:42 ERROR TransportResponseHandler:154 - Still have 1
requests outstanding when connection from /10.*.*.*.*:37578 is closed

2018-07-30 19:58:42 INFO
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Disabling
executor 7.

2018-07-30 19:58:42 INFO  DAGScheduler:54 - Executor lost: 7 (epoch 1)

2018-07-30 19:58:42 WARN  BlockManagerMaster:87 - Failed to remove
broadcast 11 with removeFromMaster = true - Connection reset by peer

java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at

Executor lost for unknown reasons error Spark 2.3 on kubernetes

2018-07-31 Thread purna pradeep
> Hello,
>
>
>
> I’m getting the error below in the Spark driver pod logs: executor pods are
> getting killed midway through while the job is running, and even the driver pod
> terminates with the intermittent error below. This happens if I run multiple
> jobs in parallel.
>
>
>
> I’m not able to see the executor logs, as the executor pods are killed.
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 23
> in stage 36.0 failed 4 times, most recent failure: Lost task 23.3 in stage
> 36.0 (TID 1006, 10.10.125.119, executor 1): ExecutorLostFailure (executor 1
> exited caused by one of the running tasks) Reason: Executor lost for
> unknown reasons.
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
> ... 42 more
>
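One avenue that is commonly checked when executor pods disappear before their logs can be read (offered here as a possibility to rule out, not a confirmed cause): executors that exceed their container memory limit are OOM-killed by Kubernetes, which the driver only sees as an abrupt executor loss. Giving each executor more off-heap headroom is the usual first experiment. A minimal PySpark sketch, with placeholder sizes that would need tuning for the real job:

    from pyspark.sql import SparkSession

    # Sketch only: the memory values below are placeholders, not recommendations.
    spark = (
        SparkSession.builder
        .appName("executor-memory-overhead-sketch")
        .config("spark.executor.memory", "4g")
        # Extra non-heap memory per executor pod; with too little headroom the pod
        # can be killed by Kubernetes even though the JVM heap itself looks healthy.
        .config("spark.executor.memoryOverhead", "1g")
        .getOrCreate()
    )

The same properties can also be passed as --conf options to spark-submit.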


Re: Use Arrow instead of Pickle without pandas_udf

2018-07-31 Thread Hichame El Khalfi
Thanks Bryan for the pointer +1

Hichame

From: cutl...@gmail.com
Sent: July 30, 2018 6:40 PM
To: hich...@elkhalfi.com
Cc: hol...@pigscanfly.ca; user@spark.apache.org
Subject: Re: Use Arrow instead of Pickle without pandas_udf


Here is a link to the JIRA for adding StructType support for scalar pandas_udf 
https://issues.apache.org/jira/browse/SPARK-24579


On Wed, Jul 25, 2018 at 3:36 PM, Hichame El Khalfi <hich...@elkhalfi.com> wrote:
Hey Holden,
Thanks for your reply,

We are currently using a Python function that produces a Row(TS=LongType(),
bin=BinaryType()).
We use this function like this:
dataframe.rdd.map(my_function).toDF().write.parquet()

To reuse it in a pandas_udf, we changed the return type to
StructType(StructField(LongType), StructField(BinaryType)).

But we hit the issue that StructType is not supported as a return type for pandas_udf.

So I was wondering whether we could keep using dataframe.rdd.map but still get an
improvement in serialization by using the Arrow format instead of Pickle.
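
For readers following along, a minimal sketch of the pattern described above; my_function, the input columns and the paths are placeholders rather than the real code:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.types import BinaryType, LongType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()

    out_schema = StructType([
        StructField("TS", LongType(), False),
        StructField("bin", BinaryType(), False),
    ])

    # Placeholder for the real transformation: one Row(TS=..., bin=...) per input row.
    def my_function(row):
        return Row(TS=int(row.ts), bin=bytes(row.payload))

    df = spark.read.parquet("/path/to/input")      # placeholder input
    df.rdd.map(my_function).toDF(out_schema).write.parquet("/path/to/output")

Because this goes through a plain RDD map, rows travel between the JVM and the Python workers via the Pickle serializer; the pandas_udf route would use Arrow, but it needs the StructType return support tracked in the JIRA Bryan linked above.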

From: hol...@pigscanfly.ca
Sent: July 25, 2018 4:41 PM
To: hich...@elkhalfi.com
Cc: user@spark.apache.org
Subject: Re: Use Arrow instead of Pickle without pandas_udf


Not currently. What's the problem with pandas_udf for your use case?

On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi <hich...@elkhalfi.com> wrote:

Hi There,


Is there a way to use the Arrow format instead of Pickle, but without using
pandas_udf?


Thanks for your help,


Hichame



--
Twitter: https://twitter.com/holdenkarau



Re: How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
FYI

The relevant StackOverflow question on the same topic:
https://stackoverflow.com/questions/51610482/how-to-do-pca-with-spark-streaming-dataframe

On Tue, Jul 31, 2018 at 3:18 PM, Aakash Basu 
wrote:

> Hi,
>
> Just curious to know: how can we run a Principal Component Analysis on
> streaming data in distributed mode? If we can, is it mathematically valid
> enough?
>
> Has anyone done that before? Can you share your experience with it?
> Is there any API Spark provides to do this in Spark Streaming mode?
>
> Thanks,
> Aakash.
>


How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
Hi,

Just curious to know: how can we run a Principal Component Analysis on
streaming data in distributed mode? If we can, is it mathematically valid
enough?

Has anyone done that before? Can you share your experience with it?
Is there any API Spark provides to do this in Spark Streaming mode?

Thanks,
Aakash.
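
For reference, the batch API that already exists: Spark ML provides a PCA estimator for static DataFrames, and the open question is really how to fit it when the data arrives as a stream, since an estimator has to see data before its model can transform anything. A minimal batch-mode sketch with toy data:

    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy vectors standing in for real feature rows.
    df = spark.createDataFrame(
        [(Vectors.dense([2.0, 0.0, 3.0, 4.0]),),
         (Vectors.dense([4.0, 0.0, 0.0, 6.0]),),
         (Vectors.dense([6.0, 7.0, 0.0, 8.0]),)],
        ["features"],
    )

    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)  # fitting needs a bounded (batch) DataFrame
    model.transform(df).select("pca_features").show(truncate=False)

One possible compromise, offered as an idea rather than an established practice, is to fit the model on a representative batch sample and then apply model.transform to each incoming micro-batch; whether that is mathematically sound for a drifting stream is exactly the question raised above.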


Re: Query on Profiling Spark Code

2018-07-31 Thread Aakash Basu
Okay, sure!

On Tue, Jul 31, 2018 at 1:06 PM, Patil, Prashasth <prashasth.pa...@spglobal.com> wrote:

> Hi Aakash,
>
> On a related note, you may want to try SparkLens for profiling which is
> quite helpful in my opinion.
>
>
>
>
>
> -Prash
>
>
>
> From: Aakash Basu [mailto:aakash.spark@gmail.com]
> Sent: Tuesday, July 17, 2018 12:41 PM
> To: user
> Subject: Query on Profiling Spark Code
>
>
>
> Hi guys,
>
>
>
> I'm trying to profile my Spark code with cProfile and check where most of the
> time is taken. I found that the most time is taken by some socket object, and
> I'm quite clueless as to where it is used.
>
>
>
> Can anyone shed some light on this?
>
>
>
>
>
> ncalls    tottime    percall    cumtime    percall    filename:lineno(function)
> 11789     479.8      0.0407     479.8      0.0407     ~:0()
>
>
>
>
>
> Thanks,
>
> Aakash.
>


RE: Query on Profiling Spark Code

2018-07-31 Thread Patil, Prashasth
Hi Aakash,
On a related note, you may want to try SparkLens for profiling which is quite 
helpful in my opinion.


-Prash

From: Aakash Basu [mailto:aakash.spark@gmail.com]
Sent: Tuesday, July 17, 2018 12:41 PM
To: user
Subject: Query on Profiling Spark Code

Hi guys,

I'm trying to profile my Spark code with cProfile and check where most of the
time is taken. I found that the most time is taken by some socket object, and
I'm quite clueless as to where it is used.

Can anyone shed some light on this?


ncalls    tottime    percall    cumtime    percall    filename:lineno(function)
11789     479.8      0.0407     479.8      0.0407     ~:0()



Thanks,
Aakash.
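
Two general observations that may help, offered as such rather than as a diagnosis of this particular trace. First, when cProfile is run on the driver script of a PySpark job, large amounts of time attributed to socket methods are often just the driver blocking while the JVM runs the actual jobs (the Python process talks to the JVM over a local socket), so it is not necessarily Python work at all. Second, PySpark ships a built-in cProfile-based profiler for the Python worker side, which tends to be more informative for UDF-heavy code. A minimal sketch:

    from pyspark import SparkConf, SparkContext

    # Enable PySpark's built-in profiling of the Python workers.
    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(conf=conf)

    # Placeholder job, just to generate some Python worker activity.
    sc.parallelize(range(1000000)).map(lambda x: x * x).count()

    # Prints the aggregated cProfile stats collected from the Python workers, per RDD.
    sc.show_profiles()

SparkLens, as suggested above, looks at the problem from the JVM side (scheduler and executor utilization), so the two approaches complement each other.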





RE: Split a row into multiple rows Java

2018-07-31 Thread Patil, Prashasth
Hi,
Have you tried using Spark DataFrame's pivot feature?


-Original Message-
From: nookala [mailto:srinook...@gmail.com]
Sent: Thursday, July 26, 2018 7:33 AM
To: user@spark.apache.org
Subject: Split a row into multiple rows Java

I'm trying to generate multiple rows from a single row.

I have the schema

Name Id Date 0100 0200 0300 0400

and would like to convert it into a vertical format with the schema

Name Id Date Time

I have the code below and get the following error:

Caused by: java.lang.RuntimeException: org.apache.spark.sql.catalyst.expressions.GenericRow is not a valid external type for schema of string
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

StructType schemata = DataTypes.createStructType(
    new StructField[]{
        DataTypes.createStructField("Name", DataTypes.StringType, false),
        DataTypes.createStructField("Id", DataTypes.StringType, false),
        DataTypes.createStructField("Date", DataTypes.StringType, false),
        DataTypes.createStructField("Time", DataTypes.StringType, false)
    }
);

ExpressionEncoder<Row> encoder = RowEncoder.apply(schemata);

Dataset<Row> modifiedRDD = intervalDF.flatMap(new FlatMapFunction<Row, Row>() {
    @Override
    public Iterator<Row> call(Row row) throws Exception {
        List<Row> rowList = new ArrayList<>();
        String[] timeList = {"0100", "0200", "0300", "0400"};
        for (String time : timeList) {
            Row r1 = RowFactory.create(row.getAs("sdp_id"),
                    "WGL",
                    row.getAs("Name"),
                    row.getAs("Id"),
                    row.getAs("Date"),
                    timeList[0],
                    row.getAs(timeList[0]));

            //updated row by creating new Row
            rowList.add(RowFactory.create(r1));
        }
        return rowList.iterator();
    }
}, encoder);

modifiedRDD.write().csv("file:///Users/mod/out");



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/





