Re: overriding spark.streaming.blockQueueSize default value

2016-03-29 Thread Spark Newbie
Pinging back. I hope someone else has seen this behavior where
spark.streaming.blockQueueSize becomes a bottleneck. Is there a suggestion
on how to adjust the queue size, or any documentation on what the effects
would be? It seems straightforward, but I'm just trying to learn from
others' experiences.


Thanks,

On Mon, Mar 28, 2016 at 10:40 PM, Spark Newbie <sparknewbie1...@gmail.com>
wrote:

> Hi All,
>
> The default value for spark.streaming.blockQueueSize is 10 in
> https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala.
> In Spark Kinesis ASL 1.4 the received Kinesis records are stored by calling
> addData on line 115 -
> https://github.com/apache/spark/blob/branch-1.4/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala#L115
> - which pushes one data item at a time into the buffer. This is a problem
> because, at application startup, a single Kinesis Worker gains the lease for
> all (or a majority of) the shards of the Kinesis stream. This is by design:
> the KCL load-balances as new Workers are started. But the single Worker that
> initially holds the leases for many shards ends up blocked on the addData
> method, because many KinesisRecordProcessor threads are trying to add the
> received data to the buffer. The buffer is an ArrayBlockingQueue whose size
> is specified by spark.streaming.blockQueueSize, which defaults to 10. The
> ArrayBlockingQueue is flushed to the memory store every 100 ms, so the
> KinesisRecordProcessor threads can stay blocked for a long period (up to an
> hour) at application startup. The impact is that some Kinesis shards are not
> consumed by the Spark Streaming application until their
> KinesisRecordProcessor threads get unblocked.
>
> To fix or work around the issue, would it be OK to increase
> spark.streaming.blockQueueSize to a larger value? I suppose the main
> consideration when increasing this size would be the memory allocated to
> the executor. I haven't seen much documentation on this config, so any
> advice on how to fine-tune it would be useful.
>
> Thanks,
> Spark newbie
>


overriding spark.streaming.blockQueueSize default value

2016-03-28 Thread Spark Newbie
Hi All,

The default value for spark.streaming.blockQueueSize is 10 in
https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala.
In Spark Kinesis ASL 1.4 the received Kinesis records are stored by calling
addData on line 115 -
https://github.com/apache/spark/blob/branch-1.4/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala#L115
- which pushes one data item at a time into the buffer. This is a problem
because, at application startup, a single Kinesis Worker gains the lease for
all (or a majority of) the shards of the Kinesis stream. This is by design:
the KCL load-balances as new Workers are started. But the single Worker that
initially holds the leases for many shards ends up blocked on the addData
method, because many KinesisRecordProcessor threads are trying to add the
received data to the buffer. The buffer is an ArrayBlockingQueue whose size
is specified by spark.streaming.blockQueueSize, which defaults to 10. The
ArrayBlockingQueue is flushed to the memory store every 100 ms, so the
KinesisRecordProcessor threads can stay blocked for a long period (up to an
hour) at application startup. The impact is that some Kinesis shards are not
consumed by the Spark Streaming application until their
KinesisRecordProcessor threads get unblocked.

To fix or work around the issue, would it be OK to increase
spark.streaming.blockQueueSize to a larger value? I suppose the main
consideration when increasing this size would be the memory allocated to
the executor. I haven't seen much documentation on this config, so any
advice on how to fine-tune it would be useful.
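
For illustration, here is a minimal sketch of how I imagine the override
would look, assuming the undocumented spark.streaming.blockQueueSize key is
read by BlockGenerator as in the branch-1.6 source linked above (the value
100 and the app name are just placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: raise the receiver block queue above its default of 10 so
// more KinesisRecordProcessor threads can enqueue data before blocking.
// The extra memory held per receiver is roughly
// queue size x block interval x ingest rate, so size it against executor memory.
val conf = new SparkConf()
  .setAppName("kinesis-streaming-app")            // placeholder name
  .set("spark.streaming.blockQueueSize", "100")   // default is 10

val ssc = new StreamingContext(conf, Seconds(10))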

Thanks,
Spark newbie


Re: SparkException: Failed to get broadcast_10_piece0

2015-11-30 Thread Spark Newbie
Pinging again ...

On Wed, Nov 25, 2015 at 4:19 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Which Spark release are you using ?
>
> Please take a look at:
> https://issues.apache.org/jira/browse/SPARK-5594
>
> Cheers
>
> On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie <sparknewbie1...@gmail.com>
> wrote:
>
>> Hi Spark users,
>>
>> I'm seeing the exception below once in a while; it causes tasks to fail
>> (even after retries, so I think it is a non-recoverable exception), and
>> hence the stage fails and then the job gets aborted.
>>
>> Exception ---
>> java.io.IOException: org.apache.spark.SparkException: Failed to get
>> broadcast_10_piece0 of broadcast_10
>>
>> Any idea why this exception occurs and how to avoid/handle these
>> exceptions? Please let me know if you have seen this exception and know a
>> fix for it.
>>
>> Thanks,
>> Bharath
>>
>
>


Error in block pushing thread puts the KinesisReceiver in a stuck state

2015-11-25 Thread Spark Newbie
Hi Spark users,

I have been seeing an issue where receivers enter a "stuck" state after
encountering the following exception: "Error in block pushing thread -
java.util.concurrent.TimeoutException: Futures timed out".
I am running the application on spark-1.4.1 and using kinesis-asl-1.4.

When this happens, I observe in CloudWatch that the
Kinesis.ProcessTask.shard.MillisBehindLatest metric is no longer published,
which indicates that the workers associated with the receiver are no longer
checkpointing for the shards they were reading from.

This looks like a bug in the BlockGenerator code, here -
https://github.com/apache/spark/blob/branch-1.4/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala#L171
- when pushBlock encounters an exception (in this case the TimeoutException)
it stops pushing blocks. Is this really the expected behavior?

Has anyone else seen this error, and have you also seen the issue where
receivers stop receiving records? I'm also trying to find the root cause
of the TimeoutException. If anyone has an idea on this, please share.
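
In case it helps anyone watching for the same symptom, below is a minimal
sketch of a StreamingListener that at least surfaces receiver errors/stops
in the driver log (class name and log wording are mine, not from
kinesis-asl):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverError, StreamingListenerReceiverStopped}

// Sketch only: log when a receiver reports an error or stops, so a stuck
// KinesisReceiver is visible without digging through executor logs.
class ReceiverHealthListener extends StreamingListener {
  override def onReceiverError(e: StreamingListenerReceiverError): Unit = {
    val info = e.receiverInfo
    println(s"Receiver ${info.streamId} error: ${info.lastErrorMessage}")
  }
  override def onReceiverStopped(s: StreamingListenerReceiverStopped): Unit = {
    println(s"Receiver ${s.receiverInfo.streamId} stopped")
  }
}

// Usage: ssc.addStreamingListener(new ReceiverHealthListener)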

Thanks,

Bharath


SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Spark Newbie
Hi Spark users,

I'm seeing the exception below once in a while; it causes tasks to fail
(even after retries, so I think it is a non-recoverable exception), and
hence the stage fails and then the job gets aborted.

Exception ---
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_10_piece0 of broadcast_10

Any idea why this exception occurs and how to avoid/handle these
exceptions? Please let me know if you have seen this exception and know a
fix for it.

Thanks,
Bharath


Re: SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Spark Newbie
Using Spark-1.4.1

On Wed, Nov 25, 2015 at 4:19 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Which Spark release are you using ?
>
> Please take a look at:
> https://issues.apache.org/jira/browse/SPARK-5594
>
> Cheers
>
> On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie <sparknewbie1...@gmail.com>
> wrote:
>
>> Hi Spark users,
>>
>> I'm seeing the exception below once in a while; it causes tasks to fail
>> (even after retries, so I think it is a non-recoverable exception), and
>> hence the stage fails and then the job gets aborted.
>>
>> Exception ---
>> java.io.IOException: org.apache.spark.SparkException: Failed to get
>> broadcast_10_piece0 of broadcast_10
>>
>> Any idea why this exception occurs and how to avoid/handle these
>> exceptions? Please let me know if you have seen this exception and know a
>> fix for it.
>>
>> Thanks,
>> Bharath
>>
>
>


Re: s3a file system and spark deployment mode

2015-10-15 Thread Spark Newbie
Are you using EMR?
You can install Hadoop-2.6.0 along with Spark-1.5.1 in your EMR cluster.
That brings the s3a jars onto the worker nodes, so they become available to
your application.

On Thu, Oct 15, 2015 at 11:04 AM, Scott Reynolds wrote:

> List,
>
> Right now we build our spark jobs with the s3a hadoop client. We do this
> because our machines are only allowed to use IAM access to the s3 store. We
> can build our jars with the s3a filesystem and the aws sdk just fine and
> these jars run great in *client mode*.
>
> We would like to move from client mode to cluster mode as that will allow
> us to be more resilient to driver failure. In order to do this either:
> 1. the jar file has to be on worker's local disk
> 2. the jar file is in shared storage (s3a)
>
> We would like to put the jar file in s3 storage, but when we give the jar
> path as s3a://.., the worker node doesn't have the hadoop s3a and aws
> sdk in its classpath / uber jar.
>
> Other than building Spark with those two dependencies, what other options
> do I have? We are using 1.5.1, so SPARK_CLASSPATH is no longer a thing.
>
> We need s3a access on both the master (so that we can write the Spark event
> log to s3) and the worker processes (driver, executor).
>
> Looking for ideas before just adding the dependencies to our spark build
> and calling it a day.
>
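
One approach that is sometimes used (not from this thread, and assuming the
hadoop-aws and aws-java-sdk jars have already been copied to the same local
path on every node) is to put them on the driver and executor extra
classpath instead of rebuilding Spark; a rough sketch with placeholder
paths:

import org.apache.spark.SparkConf

// Sketch only: the /opt/libs paths are placeholders and must exist on every
// node. With these set, the s3a filesystem classes resolve on the executors
// and on the driver in cluster mode.
// Note: for cluster mode the driver-side entry is usually passed via
// spark-defaults.conf or --conf on spark-submit rather than set in code.
val conf = new SparkConf()
  .set("spark.driver.extraClassPath",   "/opt/libs/hadoop-aws.jar:/opt/libs/aws-java-sdk.jar")
  .set("spark.executor.extraClassPath", "/opt/libs/hadoop-aws.jar:/opt/libs/aws-java-sdk.jar")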


Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-15 Thread Spark Newbie
What is the best way to fail the application when a job gets aborted?

On Wed, Oct 14, 2015 at 1:27 PM, Tathagata Das <t...@databricks.com> wrote:

> When a job gets aborted, it means that the internal tasks were retried a
> number of times before the system gave up. You can control the number of
> retries (see Spark's configuration page). The job by default does not get
> resubmitted.
>
> You could try getting the logs of the failed executor, to see what caused
> the failure. Could be a memory limit issue, and YARN killing it somehow.
>
>
>
> On Wed, Oct 14, 2015 at 11:05 AM, Spark Newbie <sparknewbie1...@gmail.com>
> wrote:
>
>> Is it slowing things down or blocking progress.
>> >> I didn't see processing slow down, but I do see jobs aborted
>> consecutively for a period of 18 batches (5 minute batch intervals). So I
>> am worried about what happened to the records that these jobs were
>> processing.
>> Also, one more thing to mention: the
>> StreamingListenerBatchCompleted.numRecords information shows all
>> received records as processed even if the batch/job failed. The processing
>> time also shows the same value as for a successful batch.
>> It seems that numRecords is the input count for the batch, regardless of
>> whether the records were successfully processed.
>>
>> On Wed, Oct 14, 2015 at 11:01 AM, Spark Newbie <sparknewbie1...@gmail.com
>> > wrote:
>>
>>> I ran 2 different spark 1.5 clusters that have been running for more
>>> than a day now. I do see jobs getting aborted due to task retries maxing
>>> out (default 4) on ConnectionException. It seems like the executors die
>>> and get restarted and I was unable to find the root cause (same app code
>>> and conf used on spark 1.4.1 I don't see ConnectionException).
>>>
>>> Another question related to this, what happens to the kinesis records
>>> received when Job gets aborted? In Spark-1.5 and kinesis-asl-1.5 (which I
>>> am using), does the job get resubmitted with the same received records? Or
>>> does the kinesis-asl library get those records again based on sequence
>>> numbers it tracks? It would be good for me to understand the story around
>>> lossless processing of kinesis records in Spark-1.5 + kinesis-asl-1.5 when
>>> jobs are aborted. Any pointers or quick explanation would be very helpful.
>>>
>>>
>>> On Tue, Oct 13, 2015 at 4:04 PM, Tathagata Das <t...@databricks.com>
>>> wrote:
>>>
>>>> Is this happening too often? Is it slowing things down or blocking
>>>> progress. Failures once in a while is part of the norm, and the system
>>>> should take care of itself.
>>>>
>>>> On Tue, Oct 13, 2015 at 2:47 PM, Spark Newbie <
>>>> sparknewbie1...@gmail.com> wrote:
>>>>
>>>>> Hi Spark users,
>>>>>
>>>>> I'm seeing the below exception in my spark streaming application. It
>>>>> happens in the first stage where the kinesis receivers receive records and
>>>>> perform a flatMap operation on the unioned DStream. A coalesce step also
>>>>> happens as a part of that stage for optimizing the performance.
>>>>>
>>>>> This is happening on my spark 1.5 instance using kinesis-asl-1.5. When
>>>>> I look at the executor logs I do not see any exceptions indicating the
>>>>> root cause of why there is no connectivity on xxx.xx.xx.xxx:36684 or
>>>>> when that service went down.
>>>>>
>>>>> Any help debugging this problem will be helpful.
>>>>>
>>>>> 15/10/13 16:36:07 ERROR shuffle.RetryingBlockFetcher: Exception while
>>>>> beginning fetch of 1 outstanding blocks
>>>>> java.io.IOException: Failed to connect to
>>>>> ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684
>>>>> at
>>>>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>>>>> at
>>>>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>>>> at
>>>>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>>>>> at
>>>>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>>>> at
>>>>> org
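
If it helps, my understanding is that the retry count mentioned above maps
to spark.task.maxFailures (default 4); a minimal sketch of raising it, with
an arbitrary value:

import org.apache.spark.SparkConf

// Sketch only: allow more task attempts before a stage (and hence the job)
// is aborted on repeated fetch/connection failures. The value 8 is arbitrary.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")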

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Spark Newbie
Is it slowing things down or blocking progress.
>> I didn't see processing slow down, but I do see jobs aborted
consecutively for a period of 18 batches (5 minute batch intervals). So I
am worried about what happened to the records that these jobs were
processing.
Also, one more thing to mention: the
StreamingListenerBatchCompleted.numRecords information shows all received
records as processed even if the batch/job failed. The processing time also
shows the same value as for a successful batch.
It seems that numRecords is the input count for the batch, regardless of
whether the records were successfully processed.
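
For reference, a minimal sketch of the kind of listener these numbers come
from (names are mine; as far as I can tell the 1.5 BatchInfo does not expose
per-output-operation failure details, which is consistent with the behavior
described above):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch only: numRecords is the count of records that entered the batch,
// and it is reported the same way whether the batch's jobs succeeded or
// were aborted.
class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"batch=${info.batchTime} numRecords=${info.numRecords} " +
      s"processingDelayMs=${info.processingDelay.getOrElse(-1L)}")
  }
}

// Usage: ssc.addStreamingListener(new BatchStatsListener)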

On Wed, Oct 14, 2015 at 11:01 AM, Spark Newbie <sparknewbie1...@gmail.com>
wrote:

> I ran 2 different spark 1.5 clusters that have been running for more than
> a day now. I do see jobs getting aborted due to task retries maxing out
> (default 4) on ConnectionException. It seems like the executors die and
> get restarted and I was unable to find the root cause (same app code and
> conf used on spark 1.4.1 I don't see ConnectionException).
>
> Another question related to this, what happens to the kinesis records
> received when Job gets aborted? In Spark-1.5 and kinesis-asl-1.5 (which I
> am using), does the job get resubmitted with the same received records? Or
> does the kinesis-asl library get those records again based on sequence
> numbers it tracks? It would be good for me to understand the story around
> lossless processing of kinesis records in Spark-1.5 + kinesis-asl-1.5 when
> jobs are aborted. Any pointers or quick explanation would be very helpful.
>
>
> On Tue, Oct 13, 2015 at 4:04 PM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> Is this happening too often? Is it slowing things down or blocking
>> progress. Failures once in a while is part of the norm, and the system
>> should take care of itself.
>>
>> On Tue, Oct 13, 2015 at 2:47 PM, Spark Newbie <sparknewbie1...@gmail.com>
>> wrote:
>>
>>> Hi Spark users,
>>>
>>> I'm seeing the below exception in my spark streaming application. It
>>> happens in the first stage where the kinesis receivers receive records and
>>> perform a flatMap operation on the unioned DStream. A coalesce step also
>>> happens as a part of that stage for optimizing the performance.
>>>
>>> This is happening on my spark 1.5 instance using kinesis-asl-1.5. When I
>>> look at the executor logs I do not see any exceptions indicating the root
>>> cause of why there is no connectivity on xxx.xx.xx.xxx:36684 or when
>>> that service went down.
>>>
>>> Any help debugging this problem will be helpful.
>>>
>>> 15/10/13 16:36:07 ERROR shuffle.RetryingBlockFetcher: Exception while
>>> beginning fetch of 1 outstanding blocks
>>> java.io.IOException: Failed to connect to
>>> ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684
>>> at
>>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>>> at
>>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>> at
>>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>>> at
>>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>> at
>>> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>>> at
>>> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
>>> at
>>> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
>>> at
>>> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
>>> at
>>> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:593)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at
>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at
>>> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:593)
>>> at
>>> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:579)
>>> at
>>> org.apache.spark.storage.BlockManager.get(BlockManager.scala:623)
>>> at
>>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Spark Newbie
I ran 2 different spark 1.5 clusters that have been running for more than a
day now. I do see jobs getting aborted due to task retries maxing out
(default 4) on ConnectionException. It seems like the executors die and
get restarted, and I was unable to find the root cause (with the same app
code and conf on spark 1.4.1 I don't see ConnectionException).

Another question related to this: what happens to the Kinesis records
received when a job gets aborted? In Spark-1.5 and kinesis-asl-1.5 (which I
am using), does the job get resubmitted with the same received records? Or
does the kinesis-asl library get those records again based on the sequence
numbers it tracks? It would be good for me to understand the story around
lossless processing of Kinesis records in Spark-1.5 + kinesis-asl-1.5 when
jobs are aborted. Any pointers or a quick explanation would be very helpful.


On Tue, Oct 13, 2015 at 4:04 PM, Tathagata Das <t...@databricks.com> wrote:

> Is this happening too often? Is it slowing things down or blocking
> progress. Failures once in a while is part of the norm, and the system
> should take care of itself.
>
> On Tue, Oct 13, 2015 at 2:47 PM, Spark Newbie <sparknewbie1...@gmail.com>
> wrote:
>
>> Hi Spark users,
>>
>> I'm seeing the below exception in my spark streaming application. It
>> happens in the first stage where the kinesis receivers receive records and
>> perform a flatMap operation on the unioned DStream. A coalesce step also
>> happens as a part of that stage for optimizing the performance.
>>
>> This is happening on my spark 1.5 instance using kinesis-asl-1.5. When I
>> look at the executor logs I do not see any exceptions indicating the root
>> cause of why there is no connectivity on xxx.xx.xx.xxx:36684 or when
>> that service went down.
>>
>> Any help debugging this problem will be helpful.
>>
>> 15/10/13 16:36:07 ERROR shuffle.RetryingBlockFetcher: Exception while
>> beginning fetch of 1 outstanding blocks
>> java.io.IOException: Failed to connect to
>> ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
>> at
>> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
>> at
>> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
>> at
>> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:593)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:593)
>> at
>> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:579)
>> at
>> org.apache.spark.storage.BlockManager.get(BlockManager.scala:623)
>> at
>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:139)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:135)
>> at
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>> at
>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:135)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>

Spark 1.5 java.net.ConnectException: Connection refused

2015-10-13 Thread Spark Newbie
Hi Spark users,

I'm seeing the below exception in my spark streaming application. It
happens in the first stage, where the Kinesis receivers receive records and
perform a flatMap operation on the unioned DStream. A coalesce step also
happens as part of that stage to optimize performance.

This is happening on my spark 1.5 instance using kinesis-asl-1.5. When I
look at the executor logs I do not see any exceptions indicating the root
cause of why there is no connectivity on xxx.xx.xx.xxx:36684 or when that
service went down.

Any help debugging this problem will be helpful.

15/10/13 16:36:07 ERROR shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to
ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
at
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:593)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:593)
at
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:579)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:623)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:139)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:135)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:135)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused:
ip-xxx-xx-xx-xxx.ec2.internal/xxx.xx.xx.xxx:36684

Thanks,
Bharath


DEBUG level log in receivers and executors

2015-10-12 Thread Spark Newbie
Hi Spark users,

Is there an easy way to turn on DEBUG logs in receivers and executors?
Setting sparkContext.setLogLevel seems to turn on the DEBUG level only on
the driver.
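
One approach I have seen suggested elsewhere (not verified by me) is to ship
a custom log4j.properties to the executors and point their JVMs at it; a
rough sketch, with the file name as a placeholder:

import org.apache.spark.SparkConf

// Sketch only: "log4j-debug.properties" is a placeholder file that sets
// log4j.rootCategory=DEBUG. spark.files ships it to each executor's working
// directory, and the -D option makes log4j pick it up there.
val conf = new SparkConf()
  .set("spark.files", "/local/path/log4j-debug.properties")
  .set("spark.executor.extraJavaOptions",
    "-Dlog4j.configuration=log4j-debug.properties")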

Thanks,


Re: Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Spark Newbie
Unfortunately I don't have the before-stop logs anymore, since the log was
overwritten in my next run.

I created the rdd-_$folder$ marker file in S3 that was missing compared to
the other checkpointed rdd- directories. The app then started without the
IllegalArgumentException. Do you still need the after-restart log4j logs? I
can send them if that will help dig into the root cause.

On Fri, Oct 9, 2015 at 2:18 PM, Tathagata Das <t...@databricks.com> wrote:

> Can you provide the before stop and after restart log4j logs for this?
>
> On Fri, Oct 9, 2015 at 2:13 PM, Spark Newbie <sparknewbie1...@gmail.com>
> wrote:
>
>> Hi Spark Users,
>>
>> I'm seeing checkpoint restore failures causing the application startup to
>> fail with the below exception. When I do "ls" on the s3 path I see the key
>> listed sometimes and not listed sometimes. There are no part files
>> (checkpointed files) in the specified S3 path. This is possible because I
>> killed the app and restarted as a part of my testing to see if kinesis-asl
>> library's implementation of lossless kinesis receivers works.
>>
>> Has anyone seen the below exception before? If so is there a recommended
>> way to handle this case?
>>
>> 15/10/09 21:02:09 DEBUG S3NativeFileSystem: getFileStatus could not find
>> key ''
>> Exception in thread "main" java.lang.IllegalArgumentException:
>> requirement failed: Checkpoint directory does not exist: <s3 path to the
>> checkpointed rdd>
>> at scala.Predef$.require(Predef.scala:233)
>> at
>> org.apache.spark.rdd.ReliableCheckpointRDD.(ReliableCheckpointRDD.scala:45)
>> at
>> org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1218)
>> at
>> org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1218)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>> at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>> at
>> org.apache.spark.SparkContext.checkpointFile(SparkContext.scala:1217)
>> at
>> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:112)
>> at
>> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:109)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>> at
>> org.apache.spark.streaming.dstream.DStreamCheckpointData.restore(DStreamCheckpointData.scala:109)
>> at
>> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:487)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at
>> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at
>> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at
>> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
>>  

Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Spark Newbie
Hi Spark Users,

I'm seeing checkpoint restore failures causing the application startup to
fail with the below exception. When I do "ls" on the s3 path I see the key
listed sometimes and not listed sometimes. There are no part files
(checkpointed files) in the specified S3 path. This is possible because I
killed the app and restarted as a part of my testing to see if kinesis-asl
library's implementation of lossless kinesis receivers works.

Has anyone seen the below exception before? If so is there a recommended
way to handle this case?

15/10/09 21:02:09 DEBUG S3NativeFileSystem: getFileStatus could not find
key ''
Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: Checkpoint directory does not exist: <s3 path to the checkpointed rdd>
at scala.Predef$.require(Predef.scala:233)
at
org.apache.spark.rdd.ReliableCheckpointRDD.(ReliableCheckpointRDD.scala:45)
at
org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1218)
at
org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1218)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
at
org.apache.spark.SparkContext.checkpointFile(SparkContext.scala:1217)
at
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:112)
at
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:109)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at
org.apache.spark.streaming.dstream.DStreamCheckpointData.restore(DStreamCheckpointData.scala:109)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:487)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.streaming.DStreamGraph.restoreCheckpointData(DStreamGraph.scala:153)
at
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:158)
at
org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
at
org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:837)
at foo$.getStreamingContext(foo.scala:72)

Thanks,
Bharath
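
For context, a minimal sketch of the recovery path that raises this
exception, assuming the usual getOrCreate pattern (the checkpoint path,
batch interval, and app name are placeholders, not my actual setup):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: the S3 path is a placeholder.
val checkpointDir = "s3://my-bucket/checkpoints/my-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kinesis-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(300))
  // ... set up the Kinesis DStreams and output operations here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// On restart, getOrCreate rebuilds the DStream graph from the checkpoint
// files; an eventually consistent S3 listing of the rdd-<n> checkpoint
// directories is where the "Checkpoint directory does not exist"
// requirement can fail.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()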