Python's ReduceByKeyAndWindow DStream Keeps Growing
When I use reduceByKeyAndWindow with func and invFunc (in PySpark), the size of the window keeps growing. I am appending code that reproduces the issue; it prints the count() of the DStream, which goes up by 10 elements every batch. Is this a bug in the Python version relative to the Scala one, or is this expected behavior?

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

print 'Initializing ssc'
ssc = StreamingContext(SparkContext(), batchDuration=1)
ssc.checkpoint('ckpt')

ds = ssc.textFileStream('input') \
    .map(lambda event: (event, 1)) \
    .reduceByKeyAndWindow(
        func=lambda count1, count2: count1 + count2,
        invFunc=lambda count1, count2: count1 - count2,
        windowDuration=10,
        slideDuration=2)
ds.pprint()
ds.count().pprint()

print 'Starting ssc'
ssc.start()

import itertools
import time
from distutils import dir_util

def batch_write(batch_data, batch_file_path):
    with open(batch_file_path, 'w') as batch_file:
        for element in batch_data:
            batch_file.write(str(element) + '\n')

def xrange_write(batch_size=5, batch_dir='input', batch_duration=1):
    '''Every batch_duration, write a file with batch_size numbers, forever.
    Start at 0 and keep incrementing. Intended for testing Spark Streaming code.'''
    dir_util.mkpath(batch_dir)
    for i in itertools.count():
        batch_data = xrange(batch_size * i, batch_size * (i + 1))
        batch_write(batch_data, batch_dir + '/' + str(i))
        time.sleep(batch_duration)

print 'Feeding data to app'
xrange_write()

ssc.awaitTermination()
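One thing worth checking (a hedged suggestion based on the reduceByKeyAndWindow API, not a confirmed diagnosis of the above): when an inverse function is supplied, keys whose counts have dropped to zero stay in the window state unless a filter function removes them, which would make count() grow exactly as described. Both the Scala and the PySpark signatures accept a filterFunc argument for this. A minimal Scala sketch, where events stands in for the input DStream[String]:

import org.apache.spark.streaming.Seconds

// filterFunc decides which (key, value) pairs are kept in the window state;
// dropping zero counts keeps the state (and count()) from growing without bound.
val counts = events.map(e => (e, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,                 // reduce
  (a: Int, b: Int) => a - b,                 // inverse reduce
  Seconds(10),
  Seconds(2),
  numPartitions = 2,
  filterFunc = { case (_, count) => count > 0 }
)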
Re: Spark executor lost because of time out even after setting quite long time out value 1000 seconds
It could be stuck on a GC pause. Can you check a bit more in the executor logs and see what's going on? Also, from the driver UI you can see at which stage it is getting stuck. Thanks Best Regards

On Sun, Aug 16, 2015 at 11:45 PM, unk1102 umesh.ka...@gmail.com wrote:

Hi, I have written a Spark job which seems to be working fine for almost an hour, and after that executors start getting lost because of timeouts. I see the following log statement:

15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeout 100 ms

I don't see any errors, but I see the above warning, and because of it the executor gets removed by YARN, and I see "Rpc client disassociated" errors, IOException connection refused, and FetchFailedException. After an executor gets removed, I see it getting added again and starting to work, and then some other executors fail again. My question is: is it normal for executors to get lost? What happens to the tasks those lost executors were working on? My Spark job keeps on running since it is long, around 4-5 hours. I have a very good cluster with 1.2 TB memory and a good number of CPU cores. To solve the above timeout issue I tried to increase the timeout spark.akka.timeout to 1000 seconds, but no luck. I am using the following command to run my Spark job. Please guide, I am new to Spark. I am using Spark 1.4.1. Thanks in advance.

/spark-submit --class com.xyz.abc.MySparkJob
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=512M
  --driver-java-options -XX:MaxPermSize=512m
  --driver-memory 4g
  --master yarn-client
  --executor-memory 25G
  --executor-cores 8
  --num-executors 5
  --jars /path/to/spark-job.jar
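A hedged sketch of where I would raise the limits (assuming Spark 1.4, where spark.network.timeout governs the heartbeat/RPC paths rather than the older Akka setting; the values are illustrative, mirroring the 1000 seconds tried above):

val conf = new SparkConf()
  .setAppName("MySparkJob")
  .set("spark.network.timeout", "1000s")           // covers heartbeat/RPC timeouts in 1.4+
  .set("spark.executor.heartbeatInterval", "60s")  // fewer heartbeats to miss under GC pressure
val sc = new SparkContext(conf)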
Re: Spark hangs on collect (stuck on scheduler delay)
You need to debug further and figure out the bottleneck. Why are you doing a collect? If the dataset is huge, that will likely hang the driver machine. It would be good if you could paste the sample code; without that it's really hard to understand the flow of your program. Thanks Best Regards

On Sun, Aug 16, 2015 at 1:14 PM, Sagi r stsa...@gmail.com wrote:

Hi, I'm building a Spark application in which I load some data from an Elasticsearch cluster (using the latest elasticsearch-hadoop connector) and continue to perform some calculations on the Spark cluster. In one case, I use collect on the RDD as soon as it is created (loaded from ES). However, it sometimes hangs on one (and sometimes more) node and doesn't continue. In the web UI, I can see that one node is stuck on scheduler delay and prevents the job from continuing (while others have finished). Do you have any idea what is going on here? The data that is being loaded is fairly small and only gets mapped once to domain objects before being collected. Thank you
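If the collect itself turns out to be the bottleneck, a hedged sketch of keeping the driver out of trouble (rdd stands in for the RDD loaded from ES):

val preview = rdd.take(100)   // bounded: only 100 rows come back to the driver
val total   = rdd.count()     // aggregate on the cluster instead of collecting
// call collect() only once the dataset is known to fit comfortably in driver memory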
Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation
Or can I generally create a new RDD from a transformation and enrich its partitions with some metadata, so that I could copy the OffsetRanges into my new RDD in the DStream?

On Mon, Aug 17, 2015 at 1:08 PM, Petr Novak oss.mli...@gmail.com wrote:

Hi all, I need to transform a KafkaRDD into a new stream of deserialized case classes. I want to use the new stream to save it to file and to perform additional transformations on it. To save it I want to use offsets in the filenames, hence I need OffsetRanges in the transformed RDD. But KafkaRDD is private, hence I don't know how to do it. Alternatively, I could deserialize directly in messageHandler, before the KafkaRDD, but that seems to be a 1:1 transformation, while I need to drop bad messages (KafkaRDD => RDD would be a flatMap). Is there a way to do it using messageHandler, or is there another approach? Many thanks for any help. Petr
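A sketch of the pattern from the Kafka integration guide, which side-steps KafkaRDD's visibility: grab the ranges in the first transformation, while the RDD is still the one produced by the direct stream, and keep them in a driver-side variable for use when saving. Here directStream is the stream from KafkaUtils.createDirectStream and deserialize is a hypothetical parser returning Option of your case class:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

directStream.transform { rdd =>
  // must be the first operation: only the original KafkaRDD carries the ranges
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.flatMap(deserialize)        // dropping bad messages, hence flatMap rather than map
 .foreachRDD { rdd =>
   val suffix = offsetRanges.map(r => s"${r.topic}-${r.partition}-${r.fromOffset}").mkString("_")
   rdd.saveAsTextFile(s"output/batch-$suffix")   // filenames built from the offsets
 }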
Re: SparkPi is getting java.lang.NoClassDefFoundError: scala/collection/Seq
Yeah, lots of libraries need to be changed to compile in order to run the examples in IntelliJ. Thanks, Xiaohe

On Mon, Aug 17, 2015 at 10:01 AM, Jeff Zhang zjf...@gmail.com wrote:

Check the example module's dependencies (right click examples and click Open Module Settings). By default scala-library is provided; you need to change it to compile to run SparkPi in IntelliJ. As I remember, you also need to change the guava and jetty related libraries to compile too.

On Mon, Aug 17, 2015 at 2:14 AM, xiaohe lan zombiexco...@gmail.com wrote:

Hi, I am trying to run SparkPi in IntelliJ and getting NoClassDefFoundError. Has anyone else seen this issue before?

Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Seq
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 6 more

Process finished with exit code 1

Thanks, Xiaohe

-- Best Regards Jeff Zhang
Re: Cannot cast to Tuple when running in cluster mode
That looks like a Scala version mismatch. Thanks Best Regards

On Fri, Aug 14, 2015 at 9:04 PM, saif.a.ell...@wellsfargo.com wrote:

Hi All, I have a working program in which I create two big Tuple2s out of the data. This seems to work locally, but when I switch over to cluster standalone mode, I get this error at the very beginning:

15/08/14 10:22:25 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 10, 162.101.194.44): java.lang.ClassCastException: scala.collection.Iterator$$anon$13 cannot be cast to scala.Tuple2
    at org.apache.spark.sql.DataFrame$$anonfun$33.apply(DataFrame.scala:1189)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The data comes from JDBC, but I also tried persisting it into memory to turn it into a collection, in case JDBC was the problem. Any advice? Saif
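A hedged sketch of pinning the versions in an sbt build (assuming sbt; stock Spark 1.4 binaries are built against Scala 2.10, so the application must compile against the same minor line):

// build.sbt
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"
// %% appends the Scala binary version, so a 2.11 scalaVersion here would silently
// pull 2.11 artifacts and can produce exactly this kind of ClassCastException at runtime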
Re: Too many files/dirs in hdfs
In Spark Streaming you can simply check whether your RDD contains any records, and save only when records are there:

dstream.foreachRDD { rdd =>
  val count = rdd.count()
  if (count > 0) {
    // SAVE YOUR STUFF
  }
}

This will not create unnecessary files of 0 bytes.

On Mon, Aug 17, 2015 at 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Currently, Spark Streaming creates a new directory for every batch and stores the data in it (whether it has anything or not). There is no direct append call as of now, but you can achieve this either with FileUtil.copyMerge http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which does the clean up for you. Thanks Best Regards

On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

Spark Streaming seems to be creating 0 byte files even when there is no data. Also, I have 2 concerns here: 1) extra unnecessary files are being created in the output, and 2) Hadoop doesn't work really well with too many files, and I see that it is creating a directory with a timestamp every 1 second. Is there a better way of writing the files, maybe some kind of append mechanism where one doesn't have to change the batch interval?
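A slightly cheaper variant (a hedged sketch; RDD.isEmpty has been available since Spark 1.3 and short-circuits at the first element instead of counting everything; the output path is a placeholder):

dstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // one directory per non-empty batch, named by the batch time
    rdd.saveAsTextFile(s"/output/batch-${time.milliseconds}")
  }
}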
Meaning of local[2]
What does this mean: .setMaster("local[2]")? Is this applicable only for standalone mode? Can I do this in a cluster setup, e.g. .setMaster("hostname:port[2]")? Is it the number of threads per worker node?
Re: spark streaming 1.3 doubts (force it to not consume anything)
How to create a ClassTag in Java? Also, the constructor of DirectKafkaInputDStream takes a Function1, not a Function, but KafkaUtils.createDirectStream allows a Function. I have the below as an overridden DirectKafkaInputDStream:

public class CustomDirectKafkaInputDstream extends
    DirectKafkaInputDStream<byte[], byte[], kafka.serializer.DefaultDecoder,
        kafka.serializer.DefaultDecoder, byte[][]> {

  public CustomDirectKafkaInputDstream(
      StreamingContext ssc_,
      Map<String, String> kafkaParams,
      Map<TopicAndPartition, Object> fromOffsets,
      Function1<MessageAndMetadata<byte[], byte[]>, byte[][]> messageHandler,
      ClassTag<byte[]> evidence$1,
      ClassTag<byte[]> evidence$2,
      ClassTag<DefaultDecoder> evidence$3,
      ClassTag<DefaultDecoder> evidence$4,
      ClassTag<byte[][]> evidence$5) {
    super(ssc_, kafkaParams, fromOffsets, messageHandler,
        evidence$1, evidence$2, evidence$3, evidence$4, evidence$5);
  }

  @Override
  public Option<KafkaRDD<byte[], byte[], DefaultDecoder, DefaultDecoder, byte[][]>> compute(
      Time validTime) {
    int processed = processedCounter.value();
    int failed = failedProcessingsCounter.value();
    if (processed == failed) {
      System.out.println("backing off since its 100 % failure");
      return Option.empty();
    } else {
      System.out.println("starting the stream");
      return super.compute(validTime);
    }
  }
}

To create this stream I am using:

scala.collection.immutable.Map<String, String> scalakafkaParams =
    JavaConverters.mapAsScalaMapConverter(kafkaParams).asScala()
        .toMap(Predef.<Tuple2<String, String>>conforms());
scala.collection.immutable.Map<TopicAndPartition, Long> scalaktopicOffsetMap =
    JavaConverters.mapAsScalaMapConverter(topicOffsetMap).asScala()
        .toMap(Predef.<Tuple2<TopicAndPartition, Long>>conforms());

scala.Function1<MessageAndMetadata<byte[], byte[]>, byte[][]> handler =
    new Function<MessageAndMetadata<byte[], byte[]>, byte[][]>() { .. };

JavaDStream<byte[][]> directKafkaStream = new CustomDirectKafkaInputDstream(
    jssc, scalakafkaParams, scalaktopicOffsetMap, handler,
    byte[].class, byte[].class,
    kafka.serializer.DefaultDecoder.class, kafka.serializer.DefaultDecoder.class,
    byte[][].class);

How to pass a ClassTag to the constructor in CustomDirectKafkaInputDstream? And how to use Function instead of Function1?

On Thu, Aug 13, 2015 at 12:16 AM, Cody Koeninger c...@koeninger.org wrote:

I'm not aware of an existing api per se, but you could create your own subclass of the DStream that returns None for compute() under certain conditions.

On Wed, Aug 12, 2015 at 1:03 PM, Shushant Arora shushantaror...@gmail.com wrote:

Hi Cody, can you help here: does streaming 1.3 have any api for not consuming any message in the next few runs? Thanks

---------- Forwarded message ----------
From: Shushant Arora shushantaror...@gmail.com
Date: Wed, Aug 12, 2015 at 11:23 PM
Subject: spark streaming 1.3 doubts (force it to not consume anything)
To: user user@spark.apache.org

I can't make my streaming application's batch interval change at run time. It is always fixed, and it always creates jobs at the specified batch interval and enqueues them if the earlier batch is not finished. My requirement is to process the events and post them to some external server, and if the external server is down I want to increase the batch time. That is not possible, but can I make it not consume any messages in, say, the next 5 successive runs?
Re: S3n, parallelism, partitions
s3n underneath uses the Hadoop API, so I guess it would partition according to your Hadoop configuration (128 MB per partition by default). Thanks Best Regards

On Mon, Aug 17, 2015 at 2:29 PM, matd matd...@gmail.com wrote:

Hello, I would like to understand how the work is parallelized across a Spark cluster (and what is left to the driver) when I read several files from a single folder in S3: s3n://bucket_xyz/some_folder_having_many_files_in_it/ How are files (or file parts) mapped to partitions? Thanks Mathieu
Re: How to run spark in standalone mode on cassandra with high availability?
Have a look at Mesos. Thanks Best Regards

On Sat, Aug 15, 2015 at 1:03 PM, Vikram Kone vikramk...@gmail.com wrote:

Hi, we are planning to install Spark in standalone mode on a Cassandra cluster. The problem is that, since Cassandra has a no-SPOF architecture (i.e., any node can become the master for the cluster), it creates a problem for the Spark master, since Spark is not a peer-to-peer architecture where any node can become the master. What are our options here? Are there any frameworks or tools out there that would allow any application to run on a cluster of machines with high availability?
Re: Meaning of local[2]
Hi Praveen,

On Mon, Aug 17, 2015 at 12:34 PM, praveen S mylogi...@gmail.com wrote:

What does this mean in .setMaster("local[2]")?

Local mode (executor in the same JVM) with 2 executor threads.

Is this applicable only for standalone mode?

It is not applicable for standalone mode, only for local.

Can I do this in a cluster setup, e.g. .setMaster("hostname:port[2]")?

No. It's faster to try than to ask a mailing list, actually. Also it's documented at http://spark.apache.org/docs/latest/submitting-applications.html#master-urls .

Is it the number of threads per worker node?

You can control the number of total threads with spark-submit's --total-executor-cores parameter, if that's what you're looking for.
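For reference, a minimal sketch of the common master URL forms from the page linked above (host and port are placeholders):

val localConf  = new SparkConf().setMaster("local[2]")           // one JVM, 2 worker threads
val standalone = new SparkConf().setMaster("spark://host:7077")  // standalone cluster; no [n] suffix here
val yarn       = new SparkConf().setMaster("yarn-client")        // cluster managed by YARN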
Re: Help with persist: Data is requested again
Are you triggering an action within the while loop? How are you loading the data from JDBC? You need to make sure the job has enough partitions to run in parallel to increase the performance. Thanks Best Regards

On Sat, Aug 15, 2015 at 2:41 AM, saif.a.ell...@wellsfargo.com wrote:

Hello all, I am writing a program which calls from a database. I run a couple of computations, but in the end I have a while loop in which I make a modification to the persisted data, e.g.:

val data = PairRDD... persist()
var i = 0
while (i < 10) {
    val data_mod = data.map(x => (x._1 + 1, x._2))
    val data_joined = data.join(data_mod)
    ... do stuff with data_joined ...
}

Sadly, the result is that the shuffle inside the while loop causes a JDBC call, and that is very slow. It is not finding the data locally. How can I help myself? Saif
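A hedged sketch of the materialization point implied by the first question above: persist is lazy, so without an action before the loop the JDBC read can still happen inside it (loadFromJdbc is a hypothetical stand-in for however the pair RDD is built from the database):

import org.apache.spark.storage.StorageLevel

val data = loadFromJdbc(sc).persist(StorageLevel.MEMORY_AND_DISK)
data.count()   // action here forces one JDBC read up front; the loop then hits the cache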
Re: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
Have you tried adding the path to the hbase-protocol jar to spark.driver.extraClassPath and spark.executor.extraClassPath? Cheers

On Mon, Aug 17, 2015 at 7:51 PM, stark_summer stark_sum...@qq.com wrote:

spark version: 1.4.1
java version: 1.7
hadoop version: Hadoop 2.3.0-cdh5.1.0

I submit a spark job to a YARN cluster that reads HBase data; after the job runs, it comes up with the error below:

15/08/17 19:28:33 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(RpcRetryingCaller.java:210)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:121)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:90)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:264)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:169)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:164)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:107)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:736)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:178)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:82)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isTableAvailable(HConnectionManager.java:962)
    at org.apache.hadoop.hbase.client.HBaseAdmin.isTableAvailable(HBaseAdmin.java:1081)
    at org.apache.hadoop.hbase.client.HBaseAdmin.isTableAvailable(HBaseAdmin.java:1089)
    at com.umeng.dp.yuliang.play.HBaseToES$.main(HBaseToES.scala:28)
    at com.umeng.dp.yuliang.play.HBaseToES.main(HBaseToES.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
Caused by: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildScanRequest(RequestConverter.java:434)
    at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:297)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:157)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:57)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
    ... 18 more

PS: running a Hadoop MR job on YARN that reads HBase data also has this error (see https://issues.apache.org/jira/browse/HBASE-10304; that is an HBase issue). When submitting Hadoop MR, adding export HADOOP_CLASSPATH=./hbase/hbase-protocol/target/hbase-protocol-0.99.0-SNAPSHOT.jar to the shell command, or adding export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar to the Linux /etc/bashrc file, makes it work well; but when submitting a spark job, it does not work.
how do I execute a job on a single worker node in standalone mode
I have a 4 node cluster and have been playing around with the num-executors, executor-memory, and executor-cores parameters. I set the following:

--executor-memory=10G --num-executors=1 --executor-cores=8

But when I run the job, I see that each worker is running one executor which has 2 cores and 2.5G memory. What I'd like to do instead is have Spark allocate the job to a single worker node. Is that possible in standalone mode, or do I need a job/resource scheduler like YARN to do that? Thanks in advance, -Axel
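One setting worth trying (a hedged suggestion for the standalone master, not a verified fix for this exact layout): by default the standalone master spreads an application's cores across all workers; turning that off packs executors onto as few workers as possible. For example, on the master:

# spark-env.sh on the master node (sketch)
SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"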
Spark 1.4.1 - Mac OSX Yosemite
Has anyone experienced issues running Spark 1.4.1 on Mac OS X Yosemite? I've been running a standalone 1.3.1 fine, but it failed when trying to run 1.4.1 (I also tried 1.4.0). I've tried both the pre-built packages as well as compiling from source, both with the same results. (I can successfully compile with both mvn and sbt, after fixing the sbt.jar, which was corrupt.) After downloading/building Spark and running ./bin/pyspark or ./bin/spark-shell, it silently exits with code 1. Creating a context in Python I get:

Exception: Java gateway process exited before sending the driver its port number

I couldn't find any specific resolutions on the web. I did add 'pyspark-shell' to PYSPARK_SUBMIT_ARGS, but to no effect. Anyone have any further ideas I can explore? Cheers -Alun.
Re: graphx issue spark 1.3
The code below is taken from the Spark website and generates the error detailed. Hi, using Spark 1.3 and trying some sample code:

val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationships with a missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)

When I run graph.numEdges all works well, but with graph.numVertices it falls over and I get a whole heap of errors:

Failed to open file: /tmp/spark..shuffle_0_21_0.index
    at org.apache.spark.network.shuffle.ExternalShuffleBlockManager.getSortBasedShuffleBlockData(ExternalShuffleBlockManager.java:202)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockManager.getBlockData(ExternalShuffleBlockManager.java:112)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
    at org.apache.spark.network.server.Transpor... SLF4J: Class path contains multiple SLF4J bindings.

Is anyone else experiencing this? I've tried different graphs and always end up with the same results. Thanks

On Tue, 18 Aug 2015 at 12:15 am, Sonal Goyal sonalgoy...@gmail.com wrote:

I have been using GraphX in production on 1.3 and 1.4 with no issues. What's the exception you see and what are you trying to do?

On Aug 17, 2015 10:49 AM, dizzy5112 dave.zee...@gmail.com wrote:

Hi, using spark 1.3 and trying some sample code: when I run it, all works well, but with the other call it falls over and I get a whole heap of errors. Is anyone else experiencing this? I've tried different graphs and always end up with the same results. Thanks
Re: Spark 1.4.1 - Mac OSX Yosemite
I had success earlier today on OS X Yosemite 10.10.4 building Spark 1.4.1 using these instructions http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html (using `$ sbt/sbt clean assembly`), with the additional step of downloading the proper sbt-launch.jar (0.13.7) from here http://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/ and replacing the one that is in build/, as you noted. Have you set the SCALA_HOME and JAVA_HOME environment variables?

On Mon, Aug 17, 2015 at 8:36 PM, Alun Champion a...@achampion.net wrote:

Has anyone experienced issues running Spark 1.4.1 on Mac OS X Yosemite? I've been running a standalone 1.3.1 fine, but it failed when trying to run 1.4.1 (I also tried 1.4.0). I've tried both the pre-built packages as well as compiling from source, both with the same results. (I can successfully compile with both mvn and sbt, after fixing the sbt.jar, which was corrupt.) After downloading/building Spark and running ./bin/pyspark or ./bin/spark-shell, it silently exits with code 1. Creating a context in Python I get:

Exception: Java gateway process exited before sending the driver its port number

I couldn't find any specific resolutions on the web. I did add 'pyspark-shell' to PYSPARK_SUBMIT_ARGS, but to no effect. Anyone have any further ideas I can explore? Cheers -Alun.
Re: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
Approach 1: when submitting the spark job, add the below:

--conf spark.driver.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar
--conf spark.executor.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar

such as:

/home/dp/spark/spark-1.4/spark-1.4.1/bin/spark-submit --class com.umeng.dp.yuliang.play.HBaseToES --master yarn-cluster --conf spark.driver.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar --conf spark.executor.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar --jars /home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar ScalaMR-0.0.1-jar-with-dependencies.jar

Approach 2: add the below config to $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.extraClassPath /home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar
spark.executor.extraClassPath /home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar
Re: Serializing MLlib MatrixFactorizationModel
I'd recommend using the built-in save and load, which will be better for cross-version compatibility. You should be able to call myModel.save(sc, path) and load it back with MatrixFactorizationModel.load(sc, path).

On Mon, Aug 17, 2015 at 6:31 AM, Madawa Soysa madawa...@cse.mrt.ac.lk wrote:

Hi All, I have an issue when I try to serialize a MatrixFactorizationModel object as a Java object in a Java application. When I deserialize the object, I get the following exception:

Caused by: java.lang.ClassNotFoundException: org.apache.spark.OneToOneDependency cannot be found by org.scala-lang.scala-library_2.10.4.v20140209-180020-VFINAL-b66a39653b

Any solution for this? Madawa Soysa, Undergraduate, Department of Computer Science and Engineering, University of Moratuwa.
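A minimal sketch of the built-in persistence (MLlib's Saveable API, available since Spark 1.3; the path is a placeholder):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

model.save(sc, "hdfs:///models/als")   // writes model metadata plus the factor RDDs
val restored = MatrixFactorizationModel.load(sc, "hdfs:///models/als")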
Re: Spark 1.4.1 - Mac OSX Yosemite
Yes, they are both set. I just recompiled and still no success, a silent failure. Which versions of Java and Scala are you using?

On 17 August 2015 at 19:59, Charlie Hack charles.t.h...@gmail.com wrote:

I had success earlier today on OS X Yosemite 10.10.4 building Spark 1.4.1 using these instructions http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html (using `$ sbt/sbt clean assembly`), with the additional step of downloading the proper sbt-launch.jar (0.13.7) from here http://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/ and replacing the one that is in build/, as you noted. Have you set the SCALA_HOME and JAVA_HOME environment variables?

On Mon, Aug 17, 2015 at 8:36 PM, Alun Champion a...@achampion.net wrote:

Has anyone experienced issues running Spark 1.4.1 on Mac OS X Yosemite? I've been running a standalone 1.3.1 fine, but it failed when trying to run 1.4.1 (I also tried 1.4.0). I've tried both the pre-built packages as well as compiling from source, both with the same results. (I can successfully compile with both mvn and sbt, after fixing the sbt.jar, which was corrupt.) After downloading/building Spark and running ./bin/pyspark or ./bin/spark-shell, it silently exits with code 1. Creating a context in Python I get:

Exception: Java gateway process exited before sending the driver its port number

I couldn't find any specific resolutions on the web. I did add 'pyspark-shell' to PYSPARK_SUBMIT_ARGS, but to no effect. Anyone have any further ideas I can explore? Cheers -Alun.
Re: spark streaming 1.3 doubts (force it to not consume anything)
Look at the definitions of the Java-specific KafkaUtils.createDirectStream methods (the ones that take a JavaStreamingContext).

On Mon, Aug 17, 2015 at 5:13 AM, Shushant Arora shushantaror...@gmail.com wrote:

How to create a ClassTag in Java? Also, the constructor of DirectKafkaInputDStream takes a Function1, not a Function, but KafkaUtils.createDirectStream allows a Function. How to pass a ClassTag to the constructor in CustomDirectKafkaInputDstream? And how to use Function instead of Function1?
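For reference, in Java a ClassTag can typically be obtained via scala.reflect.ClassTag$.MODULE$.apply(byte[].class), which is what the Java-friendly KafkaUtils methods do internally. Below is a hedged Scala sketch of the subclass suggested earlier in the thread; in Scala the ClassTag and Function1 friction disappears because the implicits are supplied automatically. DirectKafkaInputDStream is package-private, so this file would have to be compiled into the same package, and shouldSkip is a hypothetical back-off predicate (e.g. comparing the two counters from the Java version):

package org.apache.spark.streaming.kafka

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.{StreamingContext, Time}

class PausableDirectStream(
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[Array[Byte], Array[Byte]] => Array[Array[Byte]],
    shouldSkip: () => Boolean)
  extends DirectKafkaInputDStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder,
    Array[Array[Byte]]](ssc, kafkaParams, fromOffsets, messageHandler) {

  override def compute(validTime: Time): Option[KafkaRDD[Array[Byte], Array[Byte],
      DefaultDecoder, DefaultDecoder, Array[Array[Byte]]]] = {
    // returning None makes this batch consume nothing, as Cody described
    if (shouldSkip()) None else super.compute(validTime)
  }
}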
rdd count is throwing null pointer exception
Hi All, thank you very much for the detailed explanation. I have a scenario like this: I have an RDD of ticket records and another RDD of booking records. For each ticket record, I need to check whether any link exists in the booking table.

val ticketCachedRdd = ticketRdd.cache

ticketRdd.foreach { ticket =>
  // this function queries the booking table and retrieves the booking rows
  val bookingRecords = queryOnBookingTable(date, flightNumber, flightCarrier)
  println(ticketCachedRdd.count)   // this is throwing a NullPointerException
}

Is there something wrong with the count? I am trying to use the count of the cached RDD while looping through the actual RDD. What's wrong in this? Thanks, Padma Ch
Re: Spark Interview Questions
This statement is from Spark's website itself. Regards, Sandeep Giri, www.KnowBigData.com

On Wed, Aug 12, 2015 at 10:42 PM, Peyman Mohajerian mohaj...@gmail.com wrote:

I think this statement is inaccurate: "Q7: What are Actions? A: An action brings back the data from the RDD to the local machine." Also, I wouldn't say Spark is 100x faster than Hadoop and that it is memory based. This is the kind of statement that will not get you the job. When it comes to shuffle it has to write to disk; it is faster in many cases, but 100x is just a marketing statement for a very narrow set of use cases.

On Thu, Jul 30, 2015 at 4:55 AM, Sandeep Giri sand...@knowbigdata.com wrote:

I have prepared some interview questions: http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-1 http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2 Please provide your feedback.

On Wed, Jul 29, 2015, 23:43 Pedro Rodriguez ski.rodrig...@gmail.com wrote:

You might look at the edX course on Apache Spark or ML with Spark. There are probably some homework problems or quiz questions that might be relevant. I haven't looked at the course myself, but that's where I would go first. https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x -- Pedro Rodriguez, PhD Student in Distributed Machine Learning, CU Boulder. UC Berkeley AMPLab Alumni. ski.rodrig...@gmail.com | pedrorodriguez.io | Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
Re: rdd count is throwing null pointer exception
The error could be because of the missing parentheses after the word cache, i.e. ticketRdd.cache()

On Aug 17, 2015, at 7:26 AM, Priya Ch learnings.chitt...@gmail.com wrote:

Hi All, thank you very much for the detailed explanation. I have a scenario like this: I have an RDD of ticket records and another RDD of booking records. For each ticket record, I need to check whether any link exists in the booking table.

val ticketCachedRdd = ticketRdd.cache

ticketRdd.foreach { ticket =>
  // this function queries the booking table and retrieves the booking rows
  val bookingRecords = queryOnBookingTable(date, flightNumber, flightCarrier)
  println(ticketCachedRdd.count)   // this is throwing a NullPointerException
}

Is there something wrong with the count? I am trying to use the count of the cached RDD while looping through the actual RDD. What's wrong in this? Thanks, Padma Ch
Paper on Spark SQL
Hi, I can't access http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. Could someone try to see if it is available and reply with it? Thanks!
Re: Paper on Spark SQL
I got a 404 when trying to access the link.

On Aug 17, 2015, at 5:31 AM, Todd bit1...@163.com wrote:

Hi, I can't access http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. Could someone try to see if it is available and reply with it? Thanks!
Re: Paper on Spark SQL
An extra "," is at the end. -- Nan Zhu http://codingcat.me

On Monday, August 17, 2015 at 9:28 AM, Ted Yu wrote:

I got a 404 when trying to access the link.

On Aug 17, 2015, at 5:31 AM, Todd bit1...@163.com wrote:

Hi, I can't access http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf, (the hyperlink target was http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf, with the trailing comma). Could someone try to see if it is available and reply with it? Thanks!
Exception when S3 path contains colons
Hi, I'm running Spark on Amazon EMR (Spark 1.4.1, Hadoop 2.6.0). I'm seeing the exception below when encountering file names that contain colons. Any idea on how to get around this?

scala> val files = sc.textFile("s3a://redactedbucketname/*")
2015-08-18 04:38:34,567 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(242224) called with curMem=669367, maxMem=285203496
2015-08-18 04:38:34,568 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_3 stored as values in memory (estimated size 236.5 KB, free 271.1 MB)
2015-08-18 04:38:34,663 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(21533) called with curMem=911591, maxMem=285203496
2015-08-18 04:38:34,664 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.0 KB, free 271.1 MB)
2015-08-18 04:38:34,665 INFO [sparkDriver-akka.actor.default-dispatcher-19] storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Added broadcast_3_piece0 in memory on 10.182.184.26:60338 (size: 21.0 KB, free: 271.9 MB)
2015-08-18 04:38:34,667 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Created broadcast 3 from textFile at <console>:21
files: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:21

scala> files.count
2015-08-18 04:38:37,262 INFO [main] s3a.S3AFileSystem (S3AFileSystem.java:listStatus(533)) - List status for path: s3a://redactedbucketname/
2015-08-18 04:38:37,262 INFO [main] s3a.S3AFileSystem (S3AFileSystem.java:getFileStatus(684)) - Getting path status for s3a://redactedbucketname/ ()
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: [922-212-4438]-[119]-[1]-[2015-08-13T15:43:12.346193%5D-%5B2015-01-01T00:00:00%5D-redacted.csv
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.hadoop.fs.Path.<init>(Path.java:94)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:240)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1700)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:200)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:279)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
    at $iwC$$iwC$$iwC.<init>(<console>:37)
    at $iwC$$iwC.<init>(<console>:39)
    at $iwC.<init>(<console>:41)
    at <init>(<console>:43)
    at .<init>(<console>:47)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) at
Re: Paper on Spark SQL
Thanks Nan. That is why I always put an extra space between a URL and punctuation in my comments / emails.

On Mon, Aug 17, 2015 at 6:31 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

An extra "," is at the end. -- Nan Zhu http://codingcat.me

On Monday, August 17, 2015 at 9:28 AM, Ted Yu wrote:

I got a 404 when trying to access the link.

On Aug 17, 2015, at 5:31 AM, Todd bit1...@163.com wrote:

Hi, I can't access http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. Could someone try to see if it is available and reply with it? Thanks!
Re: Left outer joining big data set with small lookups
Try doing a count on both lookups to force the caching to occur before the join.

On 8/17/15, 12:39 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote:

Thanks for your help. I tried to cache the lookup tables and left outer join with the big table (DF). The join does not seem to be using a broadcast join; it still goes with a hash-partitioned join and shuffles the big table. Here is the scenario:

table1 as big_df left outer join table2 as lkup on big_df.lkupid = lkup.lkupid

table1 above is well distributed across all 40 partitions because of sqlContext.sql("SET spark.sql.shuffle.partitions=40"). table2 is small, using just 2 partitions. After the join stage, the Spark UI showed me that all activity ended up in just 2 executors. When I tried to dump the data to HDFS after the join stage, all data ended up in 2 partition files and the remaining 38 files were 0-sized.

Since the above did not work, I tried to broadcast the DF and register it as a table before the join:

val table2_df = sqlContext.sql("select * from table2")
val broadcast_table2 = sc.broadcast(table2_df)
broadcast_table2.value.registerTempTable("table2")

The broadcast also has the same issue as explained above: all data is processed by just 2 executors due to the lookup skew. Any more ideas to tackle this issue with Spark DataFrames? Thanks, Vijay

On Aug 14, 2015, at 10:27 AM, Silvio Fiorito silvio.fior...@granturing.com wrote:

You could cache the lookup DataFrames; it'll then do a broadcast join.

On 8/14/15, 9:39 AM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote:

Hi, I am facing a huge performance problem when I try to left outer join a very big data set (~140GB) with a bunch of small lookups [star schema type]. I am using DataFrames in Spark SQL. It looks like the data is shuffled and skewed when that join happens. Is there any way to improve the performance of such a join in Spark? How can I hint the optimizer to go with a replicated join, etc., to avoid the shuffle? Would it help to create broadcast variables on the small lookups? If I create broadcast variables, how can I convert them into DataFrames and use them in a Spark SQL type of join? Thanks, Vijay
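A hedged sketch of the approach being suggested (Spark 1.4-era DataFrame API; the threshold value is illustrative): materialize the small side first, then let the optimizer pick a broadcast join when the table fits under spark.sql.autoBroadcastJoinThreshold.

val lkup = sqlContext.sql("select * from table2").cache()
lkup.count()                          // action forces the cache before the join is planned
lkup.registerTempTable("table2")

// threshold in bytes; the cached small table must fit under it to be broadcast
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")

val joined = sqlContext.sql(
  "select * from table1 big_df left outer join table2 lkup on big_df.lkupid = lkup.lkupid")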
Re: Setting up Spark/flume/? to Ingest 10TB from FTP
With the right FTP client JAR on your classpath (I forget which), you can use ftp:// as a source for a Hadoop FS operation. You may even be able to use it as an input for some Spark (non-streaming) jobs directly.

On 14 Aug 2015, at 14:11, Varadhan, Jawahar varad...@yahoo.com.INVALID wrote:

Thanks Marcelo. But our problem is a little more complicated. We have 10+ FTP sites that we will be transferring data from. The FTP server info, filename, and credentials all come via Kafka messages. So I want to read those Kafka messages, dynamically connect to the FTP site, download those fat files, and store them in HDFS. Hence, I was planning to use Spark Streaming with Kafka, or Flume with Kafka. But Flume runs in a JVM and may not be the best option, as the huge file will create memory issues. Please suggest some way to run it inside the cluster.

From: Marcelo Vanzin van...@cloudera.com
To: Varadhan, Jawahar varad...@yahoo.com
Cc: d...@spark.apache.org
Sent: Friday, August 14, 2015 3:23 PM
Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Why do you need to use Spark or Flume for this? You can just use curl and hdfs:

curl ftp://blah | hdfs dfs -put - /blah

On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar varad...@yahoo.com.invalid wrote:

What is the best way to bring such a huge file from an FTP server into Hadoop to persist in HDFS? Since a single JVM process might run out of memory, I was wondering if I can use Spark or Flume to do this. Any help on this matter is appreciated. I prefer an application/process running inside Hadoop which does this transfer. Thanks.

-- Marcelo
Re: rdd count is throwing null pointer exception
Looks like it's because of SPARK-5063: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

On Mon, Aug 17, 2015 at 8:13 PM, Preetam preetam...@gmail.com wrote:

The error could be because of the missing brackets after the word cache - ticketRdd.cache()

On Aug 17, 2015, at 7:26 AM, Priya Ch learnings.chitt...@gmail.com wrote:

Hi All,

Thank you very much for the detailed explanation. I have a scenario like this: I have an RDD of ticket records and another RDD of booking records. For each ticket record, I need to check whether any link exists in the booking table.

    val ticketCachedRdd = ticketRdd.cache
    ticketRdd.foreach { ticket =>
      // this function queries the booking table and retrieves the booking rows
      val bookingRecords = queryOnBookingTable(date, flightNumber, flightCarrier)
      println(ticketCachedRdd.count) // this is throwing a NullPointerException
    }

Is there something wrong in the count? I am trying to use the count of the cached RDD while looping through the actual RDD. What's wrong with this?

Thanks,
Padma Ch
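A minimal sketch of one way around SPARK-5063 here, assuming the count is all that's needed inside the loop (names reused from the snippet above): run the action on the driver first and close over the plain result, not the RDD.

    // compute the count on the driver, outside any transformation
    val ticketCachedRdd = ticketRdd.cache()
    val ticketCount = ticketCachedRdd.count() // a Long, safe to use in closures

    ticketRdd.foreach { ticket =>
      val bookingRecords = queryOnBookingTable(date, flightNumber, flightCarrier)
      println(ticketCount) // no RDD method is invoked inside the closure
    }

If per-ticket lookups against the booking RDD are actually needed, a join between the two RDDs is the idiomatic replacement for calling RDD methods inside foreach.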
Re: S3n, parallelism, partitions
This will also depend on the file format you are using. A word of advice: you would be much better off with the s3a file system. As I found out recently the hard way, s3n has some issues with reading through entire files even when only looking for headers.

On Mon, Aug 17, 2015 at 2:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

s3n underneath uses the Hadoop API, so I guess it would partition according to your Hadoop configuration (128MB per partition by default).

Thanks
Best Regards

On Mon, Aug 17, 2015 at 2:29 PM, matd matd...@gmail.com wrote:

Hello,

I would like to understand how the work is parallelized across a Spark cluster (and what is left to the driver) when I read several files from a single folder in S3: s3n://bucket_xyz/some_folder_having_many_files_in_it/

How are files (or file parts) mapped to partitions?

Thanks
Mathieu
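A quick way to see the mapping for yourself, sketched under the assumption of a Hadoop 2.x client with the s3a connector (hadoop-aws) on the classpath and credentials already configured; the minPartitions argument is the documented hint for requesting more splits than the block-size default:

    // hedged sketch: inspect how a folder of S3 objects maps to partitions
    val rdd = sc.textFile("s3a://bucket_xyz/some_folder_having_many_files_in_it/")
    println(rdd.partitions.length) // roughly one partition per block-sized chunk per file

    // ask for at least 200 splits if the default is too coarse
    val finer = sc.textFile("s3a://bucket_xyz/some_folder_having_many_files_in_it/", 200)
    println(finer.partitions.length)

Only the split computation happens on the driver; reading the actual bytes is done by the executors, one task per partition.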
Re: Apache Spark - Parallel Processing of messages from Kafka - Java
    val numStreams = 4
    val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }

In Java, in a for loop you would create four streams using KafkaUtils.createStream(), so that each receiver runs in a different thread. For more information, please visit http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

Hope it helps!
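The same pattern written out as a hedged sketch: it assumes a spark-streaming-kafka dependency, an existing StreamingContext named ssc, and placeholder ZooKeeper/group/topic values (none of which come from the thread):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // create several receivers so consumption is parallelized, then union them
    val numStreams = 4
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))
    }
    val unified = ssc.union(kafkaStreams) // downstream operators see one DStream
    unified.print()

The union step matters: without it each of the four streams would need its own downstream pipeline.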
What's the logic in RangePartitioner.rangeBounds method of Apache Spark
*Firstly, sorry for my poor English.*

I was reading the source code of Apache Spark 1.4.1 and I really got stuck on the logic of the RangePartitioner.rangeBounds method. The relevant lines are quoted below. Can anyone please explain:

1. What is the 3.0 for in the line val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.size).toInt? Why choose 3.0 rather than some other value?

2. Why does fraction * n > sampleSizePerPartition mean that a partition contains much more than the average number of items? Can you give an example where we would need to re-sample a partition?

Thanks a lot!
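As a purely illustrative worked example of the second condition (the numbers are invented, not from the post): suppose the RDD has n = 1,000,000 items across 10 partitions and sampleSize = 200. Then sampleSizePerPartition = ceil(3.0 * 200 / 10) = 60 and fraction = 200 / 1,000,000 = 0.0002. A balanced partition holds 100,000 items, so fraction * 100,000 = 20 <= 60 and its sample is accepted. A skewed partition holding 400,000 items gives fraction * 400,000 = 80 > 60, i.e. it holds far more items than its per-partition sample budget assumed, so it gets re-sampled at a higher rate to keep the computed range bounds statistically fair.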
Programmatically create SparkContext on YARN
Hi all,

When running the Spark cluster in standalone mode, I am able to create the Spark context from Java via the following code snippet:

    SparkConf conf = new SparkConf()
        .setAppName("MySparkApp")
        .setMaster("spark://SPARK_MASTER:7077")
        .setJars(jars);
    JavaSparkContext sc = new JavaSparkContext(conf);

As soon as I'm done with my processing, I can just close it via sc.stop().

Now my question: Is the same also possible when running Spark on YARN? I currently don't see how this should be possible without submitting your application as a packaged jar file. Is there a way to get this kind of interactivity from within your Scala/Java code?

Regards,
Andrea
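For what it's worth, a hedged sketch of the yarn-client variant (shown in Scala, mirroring the Java snippet above). It assumes Spark 1.x with HADOOP_CONF_DIR/YARN_CONF_DIR exported so the client can find the cluster configuration, and the Spark assembly reachable by the executors (e.g., via spark.yarn.jar); yarn-cluster mode genuinely does require going through spark-submit:

    import org.apache.spark.{SparkConf, SparkContext}

    // works for yarn-client only; the driver stays in this JVM, executors run on YARN
    val conf = new SparkConf()
      .setAppName("MySparkApp")
      .setMaster("yarn-client")
      .setJars(jars) // application classes still have to reach the executors
    val sc = new SparkContext(conf)
    // ... interactive work ...
    sc.stop()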
Re: issue Running Spark Job on Yarn Cluster
Did you resolve this issue?
Embarassingly parallel computation in SparkR?
Hi,

I'm wondering how to achieve, say, a Monte Carlo simulation in SparkR without using the low-level RDD functions that were made private in 1.4, such as parallelize and map. Something like:

    parallelize(sc, 1:1000).map(
      ### R code that does my computation
    )

where the code is the same on every node, only with different seeds. (I'm going to use this code with SparkR:::parallelize, but I'm wondering if there is a better way, or whether this might be a use case that would justify not making those functions private?)

Many thanks!
kristina
registering an empty RDD as a temp table in a PySpark SQL context
I have an RDD queried from a scan of a data source. Sometimes the RDD has rows and at other times it has none. I would like to register this RDD as a temporary table in a SQL context. I suspect this will work in Scala, but in PySpark some code assumes that the RDD has rows in it, which are used to verify the schema:

https://github.com/apache/spark/blob/branch-1.3/python/pyspark/sql/context.py#L299

Before I attempt to extend the Scala code to handle an empty RDD, or provide an empty DataFrame that can be registered, I was wondering what people recommend in this case. Perhaps there's a simple way of registering an empty RDD as a temporary table in a PySpark SQL context that I'm overlooking.

An alternative is to add special-case logic in the client code to deal with an RDD backed by an empty table scan. But since the SQL will already handle that, I was hoping to avoid special-case logic.

Eric
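One hedged workaround sketch on the Scala side that the post alludes to, assuming the schema is known up front (the column names and table name below are invented): with an explicit StructType there are no rows to inspect, so an empty RDD is harmless.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // build an empty DataFrame from an explicit schema and register it
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))
    val emptyDf = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
    emptyDf.registerTempTable("scan_results") // queries simply return zero rows

Since createDataFrame with an explicit schema skips inference entirely, the same shape of fix is what the PySpark code path would need.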
Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL
You were building against 1.4.x, right? In the master branch, switch-to-scala-2.11.sh is gone. There is a scala-2.11 profile.

FYI

On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com wrote:

I am building Spark with the following options - most notably the **scala-2.11**:

    . dev/switch-to-scala-2.11.sh
    mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests -Dmaven.javadoc.skip=true clean package

The build goes pretty far but fails in one of the minor modules, *repl*:

    [ERROR] Failed to execute goal on project spark-repl_2.11: Could not resolve
    dependencies for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
    Could not find artifact org.scala-lang:jline:jar:2.11.7 in central
    (https://repo1.maven.org/maven2) -> [Help 1]

Upon investigation: from 2.11.5 onward, the Scala version of jline is no longer required - they use the default jline distribution. And in fact the repl pom only shows a dependency on jline for the 2.10.4 Scala version:

    <profile>
      <id>scala-2.10</id>
      <activation>
        <property><name>!scala-2.11</name></property>
      </activation>
      <properties>
        <scala.version>2.10.4</scala.version>
        <scala.binary.version>2.10</scala.binary.version>
        <jline.version>${scala.version}</jline.version>
        <jline.groupid>org.scala-lang</jline.groupid>
      </properties>
      <dependencyManagement>
        <dependencies>
          <dependency>
            <groupId>${jline.groupid}</groupId>
            <artifactId>jline</artifactId>
            <version>${jline.version}</version>
          </dependency>
        </dependencies>
      </dependencyManagement>
    </profile>

So then it is not clear why this error is occurring. Pointers appreciated.
Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL
In 1.4 it is change-scala-version.sh 2.11. But the problem was that it is -Dscala-2.11, not a -P. I misread the docs.

2015-08-17 14:17 GMT-07:00 Ted Yu yuzhih...@gmail.com:

You were building against 1.4.x, right? In the master branch, switch-to-scala-2.11.sh is gone. There is a scala-2.11 profile.

FYI
Spark Job Hangs on our production cluster
I am comparing the Spark logs line by line between the hanging case (big dataset) and the non-hanging case (small dataset). In the hanging case, Spark's log looks identical to the non-hanging case while reading the first block of data from HDFS. But after that, starting from line 438 in spark-hang.log, I only see log lines generated by the Worker, like the following, for the next 10 minutes:

    15/08/14 14:24:19 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
    15/08/14 14:24:19 DEBUG Worker: [actor] handled message (0.121965 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
    ...
    15/08/14 14:33:04 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
    15/08/14 14:33:04 DEBUG Worker: [actor] handled message (0.136146 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]

until, almost 10 minutes later, I have to kill the job; I know it will hang forever. But in the good log (spark-finished.log), starting from line 361, Spark started to read the 2nd split of data, and I can see all the debug messages from BlockReaderLocal and BlockManager.

Comparing the logs of the 2 cases: in the good case, from line 478 I can see this message:

    15/08/14 14:37:09 DEBUG BlockReaderLocal: putting FileInputStream for ..

But in the hanging case, for reading the 2nd split, I don't see this message any more (it existed for the 1st split). I believe this log message should show up in this case too, as the 2nd split block also exists on this Spark node; just before it, I can see the following debug messages:

    15/08/14 14:24:11 DEBUG BlockReaderLocal: Created BlockReaderLocal for file /services/contact2/data/contacts/20150814004805-part-r-2.avro block BP-834217708-10.20.95.130-1438701195738:blk_1074484553_1099531839081 in datanode 10.20.95.146:50010
    15/08/14 14:24:11 DEBUG Project: Creating MutableProj: WrappedArray(), inputSchema: ArrayBuffer(account_id#0L, contact_id#1, sequence_id#2, state#3, name#4, kind#5, prefix_name#6, first_name#7, middle_name#8, company_name#9, job_title#10, source_name#11, source_details#12, provider_name#13, provider_details#14, created_at#15L, create_source#16, updated_at#17L, update_source#18, accessed_at#19L, deleted_at#20L, delta#21, birthday_day#22, birthday_month#23, anniversary#24L, contact_fields#25, related_contacts#26, contact_channels#27, contact_notes#28, contact_service_addresses#29, contact_street_addresses#30), codegen: false

This log is generated on node 10.20.95.146, and Spark created a BlockReaderLocal to read the data from the local node.

Now my question is: can someone give me any idea why "DEBUG BlockReaderLocal: putting FileInputStream for" doesn't show up any more in this case? I attached the log files again in this email, and really hope I can get some help from this list.

Thanks
Yong

From: java8...@hotmail.com
To: user@spark.apache.org
Subject: RE: Spark Job Hangs on our production cluster
Date: Fri, 14 Aug 2015 15:14:10 -0400

I still want to check if anyone can provide any help related to Spark 1.2.2 hanging on our production cluster when reading big HDFS data (7800 Avro blocks), while it looks fine for small data (769 Avro blocks). I enabled the debug level in the Spark log4j, and attached the log file in case it helps with troubleshooting.
Summary of our cluster:

    IBM BigInsight V3.0.0.2 (running Hadoop 2.2.0 + Hive 0.12)
    42 data nodes, each running an HDFS data node process + task tracker + Spark worker
    One master, running the HDFS Name node + Spark master
    Another master node, running the 2nd Name node + JobTracker

The test cases I ran are 2, using a very simple spark-shell session to read 2 folders: one is big data with 1T of Avro files; the other is small data with 160G of Avro files. The Avro schemas of the 2 folders are different, but I don't think that makes any difference here. The test script is like the following:

    import org.apache.spark.sql.SQLContext
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import com.databricks.spark.avro._
    val testdata = sqlContext.avroFile("hdfs://namenode:9000/bigdata_folder")
    // vs sqlContext.avroFile("hdfs://namenode:9000/smalldata_folder")
    testdata.registerTempTable("testdata")
    testdata.count()

Both cases are kicked off the same way:

    /opt/spark/bin/spark-shell --jars /opt/ibm/cclib/spark-avro.jar --conf spark.ui.port=4042 --executor-memory 24G --total-executor-cores 42 --conf spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000

When the script points to the small data folder, Spark finishes very fast: each task scanning an HDFS block finishes within 30 seconds or less. When the script points to the big data folder, most of the nodes finish scanning the first HDFS block within 2 minutes (longer than in case 1), then the scanning will hang.
[survey] [spark-ec2] What do you like/dislike about spark-ec2?
Howdy folks!

I’m interested in hearing what people think of spark-ec2 (http://spark.apache.org/docs/latest/ec2-scripts.html) outside of the formal JIRA process. Your answers will all be anonymous and public.

If the embedded form below doesn’t work for you, you can use this link to get the same survey: http://goo.gl/forms/erct2s6KRR

Cheers!
Nick
Calling hiveContext.sql(insert into table xyz...) in multiple threads?
Hi,

I have around 2000 Hive source partitions to process, inserting data into the same table but different partitions. For example, I have the following query:

    hiveContext.sql("insert into table myTable partition(mypartition=somepartition) bla bla")

If I call the above query in the Spark driver program, it runs fine and creates the corresponding partition in HDFS. This works, but it is very slow: it takes 4-5 hours to process all 2000 partitions. So I thought of using an ExecutorService and calling the above query, along with a couple of similar insert-into queries, in Callable threads. Using threads is definitely faster, but now I don't see any partition created in HDFS. Is it a concurrency issue, since every thread is trying to insert into the same table but a different partition? I see tasks running very fast and getting finished, but I don't see any partition in HDFS. Please guide me, I am new to Spark and Hive.
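To pin down the shape of the threaded approach being described, a minimal sketch (the partition values, pool size, and staging-table names are all invented; whether concurrent inserts through one HiveContext behave correctly is exactly the open question in this post):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

    val partitionValues = Seq("p1", "p2", "p3") // stand-ins for the ~2000 real values
    val inserts = partitionValues.map { p =>
      Future {
        hiveContext.sql(
          s"insert into table myTable partition(mypartition='$p') select * from staging_$p")
      }
    }
    Await.result(Future.sequence(inserts), Duration.Inf) // block until every insert finishes

One thing worth checking with this pattern: each statement is still planned and run through the single shared HiveContext, so missing partitions may point at the threads racing on that shared state rather than at the HDFS writes themselves.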
Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?
Hi Nick,

I forgot to mention in the survey that Ganglia is never installed properly, for some reason. I get this exception every time I launch the cluster:

    Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory [FAILED]

Best Regards,
Jerry