Re: BlockManager issues

2014-09-22 Thread Hortonworks
Actually, I ran into a similar issue when doing groupByKey followed by count
when the shuffle size is big, e.g. 1 TB.
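
For concreteness, the failing pattern looks roughly like this (input path and
key extraction are hypothetical, not the actual job; assumes an existing sc):

  import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

  // A wide shuffle followed by an action. At ~1 TB of shuffle data,
  // this is where the fetch failures show up.
  val pairs = sc.textFile("hdfs:///input").map(line => (line.split("\t")(0), line))
  val grouped = pairs.groupByKey() // shuffles the full dataset across the cluster
  println(grouped.count())         // action that triggers the shuffle fetches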

Thanks.

Zhan Zhang

Sent from my iPhone


Re: BlockManager issues

2014-09-22 Thread Christoph Sawade
Hey all. We also hit the same problem Nishkam described, in almost the same
big-data setting. We fixed the fetch failures by increasing the timeout for
acks in the driver:

conf.set("spark.core.connection.ack.wait.timeout", "600") // 10-minute timeout for acks between nodes
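
In case it helps, a self-contained sketch of how we apply it when building the
context (the app name is just a placeholder); the value is in seconds:

  import org.apache.spark.{SparkConf, SparkContext}

  // Raise the ack timeout before the SparkContext is created.
  val conf = new SparkConf()
    .setAppName("big-shuffle-job")
    .set("spark.core.connection.ack.wait.timeout", "600")
  val sc = new SparkContext(conf)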

Cheers, Christoph


Re: BlockManager issues

2014-09-22 Thread David Rowe
I've run into this with large shuffles - I assumed there was contention for
memory between the shuffle output files and the JVM. Whenever we start getting
these fetch failures, they correspond with high load on the machines the
blocks are being fetched from, and in some cases complete unresponsiveness
(no ssh, etc.). Setting the timeout higher, or the JVM heap lower (as a
percentage of total machine memory), seemed to help.
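
To make that concrete, a sketch of the kind of settings we tuned - the numbers
are illustrative, not a recommendation:

  import org.apache.spark.SparkConf

  // Give the executor JVM only part of the machine's memory, leaving
  // headroom for the OS page cache that serves the shuffle files.
  // E.g. ~half of a 48 GB node, plus the higher ack timeout from above.
  val conf = new SparkConf()
    .set("spark.executor.memory", "24g")
    .set("spark.core.connection.ack.wait.timeout", "600") // seconds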




BlockManager issues

2014-09-21 Thread Nishkam Ravi
Recently upgraded to 1.1.0 and saw a bunch of fetch failures for one of the
workloads. Tried tracing the problem through change-set analysis; the
offending commit looks like 4fde28c from Aug 4th for PR1707. Please see
SPARK-3633 for more details.

Thanks,
Nishkam


Re: BlockManager issues

2014-09-21 Thread Reynold Xin
It seems like you just need to raise the ulimit?





Re: BlockManager issues

2014-09-21 Thread Patrick Wendell
Hey, the numbers you mentioned don't quite line up - did you mean PR 2711?




Re: BlockManager issues

2014-09-21 Thread Patrick Wendell
Ah, I see - it was SPARK-2711 (and PR1707). In that case, it's possible
that you are just seeing more spilling as a result of the patch, and so
the filesystem is opening more files. I would try increasing the
ulimit.

How much memory do your executors have?

- Patrick




Re: BlockManager issues

2014-09-21 Thread Nishkam Ravi
Thanks for the quick follow-up, Reynold and Patrick. Tried a run with a
significantly higher ulimit; it doesn't seem to help. The executors have 35GB
each. Btw, with a recent version of the branch, the error message is "fetch
failures" as opposed to "too many open files". Not sure if they are
related. Please note that the workload runs fine with head set to 066765d.
In case you want to reproduce the problem: I'm running a slightly modified
ScalaPageRank (with KryoSerializer and persistence level
MEMORY_AND_DISK_SER) on a 30GB input dataset and a 6-node cluster.
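
For reference, a rough sketch of that setup (paths and parsing are
placeholders, not the exact modified example):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("ScalaPageRank")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)

  // ~30 GB of edge pairs; persist the grouped links serialized,
  // spilling to disk when they don't fit in memory.
  val links = sc.textFile("hdfs:///pagerank/edges")
    .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
    .groupByKey()
    .persist(StorageLevel.MEMORY_AND_DISK_SER)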

Thanks,
Nishkam
