Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
Seen this before; we had a very(!) complex plan behind a DataFrame, to the point 
where any additional transformation went OOM on the driver.

A quick and ugly solution was to break the plan - convert the DataFrame to an RDD 
and back to a DF at certain points to make the plan shorter. This has obvious 
drawbacks, and is not recommended in general, but at least we had something 
working. The real, long-term solution was to replace the many (> 200) 
withColumn() calls with only a few select() calls. You can easily find sources on 
the internet for why this is better. (It was on Spark 2.3, but I think the main 
principles remain.) TBH, it was easier than I expected, as it mainly involved 
moving pieces of code from one place to another, and not a "real", meaningful 
refactoring.
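
A minimal Scala sketch of both workarounds (hypothetical column names, not the original code):

```
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Instead of chaining many withColumn() calls (each call grows the plan),
// build all derived columns in a single select():
def addDerivedColumns(df: DataFrame): DataFrame =
  df.select(
    df.columns.map(col) ++ Seq(
      (col("a") * 2).as("a2"),   // "a", "b" and the derived names are placeholders
      (col("b") * 2).as("b2")
    ): _*
  )

// The "quick and ugly" plan break: round-trip through an RDD so the downstream
// plan starts from the materialized rows instead of the full lineage.
def breakPlan(spark: SparkSession, df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, df.schema)
```

The select version builds the whole projection in one shot, so the analyzer sees a single projection instead of hundreds of nested ones.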



From: Mich Talebzadeh 
Sent: Monday, May 27, 2024 15:43
Cc: user@spark.apache.org 
Subject: Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame 
Processing


A few ideas off the top of my head for how to go about solving the problem:


  1.  Try with subsets: Try reproducing the issue with smaller subsets of your 
data to pinpoint the specific operation causing the memory problems.
  2.  Explode or Flatten Nested Structures: If your DataFrame schema involves 
deep nesting, consider using techniques like explode or flattening to transform 
it into a less nested structure. This can reduce memory usage during operations 
like withColumn.
  3.  Lazy Evaluation: Use select before withColumn: this ensures lazy 
evaluation, meaning Spark only materializes the data when necessary. This can 
improve memory usage compared to directly calling withColumn on the entire 
DataFrame.
  4.  spark.sql.shuffle.partitions: Setting this configuration to a value close 
to the number of executors can improve shuffle performance and potentially 
reduce memory usage.
  5.  Spark UI Monitoring: Utilize the Spark UI to monitor memory usage 
throughout your job execution and identify potential memory bottlenecks.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


 
View my LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed. It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions" (Wernher 
von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).



On Mon, 27 May 2024 at 12:50, Gaurav Madan 
 wrote:
Dear Community,

I'm reaching out to seek your assistance with a memory issue we've been facing 
while processing certain large and nested DataFrames using Apache Spark. We 
have encountered a scenario where the driver runs out of memory when applying 
the `withColumn` method on specific DataFrames in Spark 3.4.1. However, the 
same DataFrames are processed successfully in Spark 2.4.0.

Problem Summary:
For certain DataFrames, applying the `withColumn` method in Spark 3.4.1 causes 
the driver to choke and run out of memory. However, the same DataFrames are 
processed successfully in Spark 2.4.0.

Heap Dump Analysis:
We performed a heap dump analysis after enabling heap dump on out-of-memory 
errors, and the analysis revealed the following significant frames and local 
variables:

```

org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:4273)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

org.apache.spark.sql.Dataset.select(Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:1622)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:2820)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

org.apache.spark.sql.Dataset.withColumn(Ljava/lang/String;Lorg/apache/spark/sql/Column;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:2759)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.addStartTimeInPartitionColumn(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)Lorg/apache/spark/sql/Dataset;
 (DataPersistenceUtil.scala:88)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

com.urbanclap.dp.eventpersistence.utils.Data

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
A few ideas off the top of my head for how to go about solving the problem:


   1. Try with subsets: Try reproducing the issue with smaller subsets of
   your data to pinpoint the specific operation causing the memory problems.
   2. Explode or Flatten Nested Structures: If your DataFrame schema
   involves deep nesting, consider using techniques like explode or flattening
   to transform it into a less nested structure. This can reduce memory usage
   during operations like withColumn.
   3. Lazy Evaluation: Use select before withColumn: this ensures lazy
   evaluation, meaning Spark only materializes the data when necessary. This
   can improve memory usage compared to directly calling withColumn on the
   entire DataFrame.
   4. spark.sql.shuffle.partitions: Setting this configuration to a value
   close to the number of executors can improve shuffle performance and
   potentially reduce memory usage (see the config sketch after this list).
   5. Spark UI Monitoring: Utilize the Spark UI to monitor memory usage
   throughout your job execution and identify potential memory bottlenecks.
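
For item 4, a minimal Scala sketch of where spark.sql.shuffle.partitions would be set (the app name and value are placeholders, not recommendations):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataframe-memory-debug")               // hypothetical app name
  .config("spark.sql.shuffle.partitions", "64")    // placeholder value; tune toward your executor count
  .getOrCreate()

// The same setting can also be changed at runtime for SQL/DataFrame workloads:
spark.conf.set("spark.sql.shuffle.partitions", "64")
```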

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Wernher von Braun).



On Mon, 27 May 2024 at 12:50, Gaurav Madan
 wrote:

> Dear Community,
>
> I'm reaching out to seek your assistance with a memory issue we've been
> facing while processing certain large and nested DataFrames using Apache
> Spark. We have encountered a scenario where the driver runs out of memory
> when applying the `withColumn` method on specific DataFrames in Spark
> 3.4.1. However, the same DataFrames are processed successfully in Spark
> 2.4.0.
>
>
> *Problem Summary:*For certain DataFrames, applying the `withColumn`
> method in Spark 3.4.1 causes the driver to choke and run out of memory.
> However, the same DataFrames are processed successfully in Spark 2.4.0.
>
>
> *Heap Dump Analysis:*We performed a heap dump analysis after enabling
> heap dump on out-of-memory errors, and the analysis revealed the following
> significant frames and local variables:
>
> ```
>
> org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:4273)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> org.apache.spark.sql.Dataset.select(Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:1622)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:2820)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> org.apache.spark.sql.Dataset.withColumn(Ljava/lang/String;Lorg/apache/spark/sql/Column;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:2759)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.addStartTimeInPartitionColumn(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)Lorg/apache/spark/sql/Dataset;
> (DataPersistenceUtil.scala:88)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.writeRawData(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Ljava/lang/String;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
> (DataPersistenceUtil.scala:19)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.step.bronze.BronzeStep$.persistRecords(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
> (BronzeStep.scala:23)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.MainJob$.processRecords(Lorg/apache/spark/sql/SparkSession;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;Lorg/apache/spark/sql/Dataset;)V
> (MainJob.scala:78)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.MainJob$.processBatchCB(Lorg/apache/spark/sql/SparkSession;Lcom/urbanclap/dp/eventpersistence/JobMetadata;Lcom/urbanclap/dp/factory/output/Event;Lorg/apache/spark/sql/Dataset;)V
> 

Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Gaurav Madan
Dear Community,

I'm reaching out to seek your assistance with a memory issue we've been
facing while processing certain large and nested DataFrames using Apache
Spark. We have encountered a scenario where the driver runs out of memory
when applying the `withColumn` method on specific DataFrames in Spark
3.4.1. However, the same DataFrames are processed successfully in Spark
2.4.0.


*Problem Summary:* For certain DataFrames, applying the `withColumn` method
in Spark 3.4.1 causes the driver to choke and run out of memory. However,
the same DataFrames are processed successfully in Spark 2.4.0.


*Heap Dump Analysis:* We performed a heap dump analysis after enabling heap
dump on out-of-memory errors, and the analysis revealed the following
significant frames and local variables:

```

org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:4273)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

org.apache.spark.sql.Dataset.select(Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:1622)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:2820)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

org.apache.spark.sql.Dataset.withColumn(Ljava/lang/String;Lorg/apache/spark/sql/Column;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:2759)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.addStartTimeInPartitionColumn(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)Lorg/apache/spark/sql/Dataset;
(DataPersistenceUtil.scala:88)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.writeRawData(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Ljava/lang/String;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
(DataPersistenceUtil.scala:19)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.step.bronze.BronzeStep$.persistRecords(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
(BronzeStep.scala:23)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.MainJob$.processRecords(Lorg/apache/spark/sql/SparkSession;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;Lorg/apache/spark/sql/Dataset;)V
(MainJob.scala:78)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.MainJob$.processBatchCB(Lorg/apache/spark/sql/SparkSession;Lcom/urbanclap/dp/eventpersistence/JobMetadata;Lcom/urbanclap/dp/factory/output/Event;Lorg/apache/spark/sql/Dataset;)V
(MainJob.scala:66)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

```


*Driver Configuration:*
1. Driver instance: c6g.xlarge with 4 vCPUs and 8 GB RAM.
2. `spark.driver.memory` and `spark.driver.memoryOverhead` are set to default values.


*Observations:*
- The DataFrame schema is very nested and large, which might be contributing to the memory issue.
- Despite similar configurations, Spark 2.4.0 processes the DataFrame without issues, while Spark 3.4.1 does not.


*Tried Solutions:* We have tried several solutions, including disabling
Adaptive Query Execution, setting the driver max result size, increasing
driver cores, and enabling specific optimizer rules. However, the issue
persisted until we increased the driver memory to 48 GB and the memory overhead
to 5 GB, which allowed the driver to schedule the tasks successfully.
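
For reference, a minimal sketch of where the runtime-settable knobs above would go (placeholder values; the driver memory and overhead themselves were supplied at launch time, e.g. `spark-submit --driver-memory 48g --conf spark.driver.memoryOverhead=5g`, not from inside the application):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bronze-persistence-debug")             // hypothetical app name
  .config("spark.sql.adaptive.enabled", "false")   // disabling AQE, as tried above
  .config("spark.driver.maxResultSize", "4g")      // placeholder value for the max result size
  .getOrCreate()
```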


*Request for Suggestions:* Are there any additional configurations or
optimizations that could help mitigate this memory issue without always
resorting to a larger machine? We would greatly appreciate any insights or
recommendations from the community on how to resolve this issue effectively.

I have attached the DataFrame schema and the complete stack trace from the
heap dump analysis for your reference.

Doc explaining the issue

DataFrame Schema


Thank you in advance for your assistance.

Best regards,
Gaurav Madan
LinkedIn 
*Personal Mail: *gauravmadan...@gmail.com
*Work Mail:* gauravma...@urbancompany.com


[Advanced][Spark Core][Debug] Spark incorrectly deserializes object during Dataset.map

2021-09-24 Thread Eddie
I am calling the Spark Dataset API (map method) and getting exceptions on 
deserialization of task results. I am calling this API from Clojure using 
standard JVM interop syntax.

This gist has a tiny Clojure program that shows the problem, as well as the 
corresponding (working) Scala implementation. There is also a full stack trace 
and Spark logs for a run of the Clojure code. 
https://gist.github.com/erp12/233a60574dc157aa544079959108f9db

Interestingly, the exception is not raised if the Clojure code is first 
compiled to bytecode, or if it is executed from a REPL. It seems that, 
depending on the context, the Spark deserializer is improperly deserializing a 
Function3 as a SerializedLambda. 

Does anyone have any ideas on why this is happening? Is this a Spark bug?

[Debug] [Spark Core 2.4.4] org.apache.spark.storage.BlockException: Negative block size -9223372036854775808

2020-06-29 Thread Adam Tobey
Hi,

I'm encountering a strange exception in spark 2.4.4 (on AWS EMR 5.29):
org.apache.spark.storage.BlockException: Negative block size
-9223372036854775808.
I've seen this mostly from this line (for remote blocks)
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$splitLocalRemoteBlocks$3.apply(ShuffleBlockFetcherIterator.scala:295)
But also from this line (for local blocks)
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$splitLocalRemoteBlocks$3.apply(ShuffleBlockFetcherIterator.scala:281)
The block size of -9223372036854775808 (Long.MinValue) is the same every
time.

I only see this exception coming from a single physical node in every
cluster (i3.16xlarge EC2 instance hosting multiple executors), but it
affects multiple executors across separate jobs running on this node over
relatively long periods of time (e.g. 1+ hours) and outlives the first
executors that encounter the exception. This has happened on multiple EMR
clusters. We have dynamic allocation enabled, so it could be related
somehow to the external shuffle service, which would continue running
across these jobs. We are also using Kryo as the serializer.

This exception occurs in multiple stages, but all these stages are reading
shuffle output from a single stage with 15,000 partitions. When this
exception occurs, the job does not fail, but it loses shuffle data between
stages (the number of shuffle records written from upstream stages is
slightly more than the number read) and the job output becomes corrupted.
Re-running the job on a new cluster produces correct output as long as this
exception is never thrown.

From reading the code, it seems to me the only possible way to have
Long.MinValue as a block size is from the avgSize of a
HighlyCompressedMapStatus since the size compression scheme of taking log
base 1.1 cannot produce a negative size (negative inputs map to 1:
https://github.com/apache/spark/blob/v2.4.4/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L75-L95).
I don't see how the average computation itself can output Long.MinValue due
to the size checks above, even in case of overflow (
https://github.com/apache/spark/blob/v2.4.4/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L203-L240
).
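
For reference, a rough Scala paraphrase of that size-encoding scheme (not the exact Spark source), which shows why decoding an individual compressed size cannot produce a negative number:

```
// A block size is stored as a single byte holding ceil(log_1.1(size)),
// so every decoded value is a non-negative power of 1.1.
val LOG_BASE = 1.1

def compressSize(size: Long): Byte =
  if (size == 0) 0
  else if (size <= 1L) 1
  else math.min(255, math.ceil(math.log(size.toDouble) / math.log(LOG_BASE)).toInt).toByte

def decompressSize(compressed: Byte): Long =
  if (compressed == 0) 0
  else math.pow(LOG_BASE, compressed & 0xFF).toLong
```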

Does anyone have ideas as to how this block size of Long.MinValue is
possible? Thanks!


Re: How to debug Spark job

2018-09-08 Thread Marco Mistroni
Hi,
Might sound like dumb advice, but try to break your process apart.
It sounds like you are doing ETL.
Start basic with just the E and T steps, and make the changes that result in issues.
If there is no problem, add the load step.
Enable Spark logging so that you can post error messages to the list.
I think you can have a look at the Spark console to see if your process has
memory issues.
Another thing you can do is run with a subset of the data and increase the
load until you find where it blows up.
Sorry, hth

On Sep 7, 2018 10:48 AM, "James Starks" 
wrote:


I have a Spark job that reads from a postgresql (v9.5) table, and write
result to parquet. The code flow is not complicated, basically

case class MyCaseClass(field1: String, field2: String)
val df = spark.read.format("jdbc")...load()
df.createOrReplaceTempView(...)
val newdf = spark.sql("seslect field1, field2 from
mytable").as[MyCaseClass].map { row =>
  val fieldX = ... // extract something from field2
  (field1, fieldX)
}.filter { ... /* filter out field 3 that's not valid */ }
newdf.write.mode(...).parquet(destPath)

This job worked correctly without a problem. But it doesn't look like it's
working ok (the job looks like it hangs) when adding more fields. The refactored
job looks as below
...
val newdf = spark.sql("seslect field1, field2, ... fieldN from
mytable").as[MyCaseClassWithMoreFields].map { row =>
...
NewCaseClassWithMoreFields(...) // all fields plus fieldX
}.filter { ... }
newdf.write.mode(...).parquet(destPath)

Basically what the job does is extract some info from one of the fields in
the db table, append that newly extracted field to the original row, and then
dump the whole new table to parquet.

new field + (original field1 + ... + original fieldN)
...
...

Records loaded by Spark SQL into the Spark job (before the refactoring) are
around 8MM; this remains the same, but when the refactored job runs, it looks
like it hangs there without progress. The only output on the console is (there is
no crash, no exceptions thrown)

WARN  HeartbeatReceiver:66 - Removing executor driver with no recent
heartbeats: 137128 ms exceeds timeout 12 ms

Memory in top command looks like

VIRT     RES     SHR    %CPU   %MEM
15.866g  8.001g  41.4m  740.3  25.6

The command used to submit the Spark job is

spark-submit --class ... --master local[*] --driver-memory 10g
--executor-memory 10g ... --files ... --driver-class-path ... 
...

How can I debug or check which part of my code might cause the problem (so
I can improve it)?

Thanks


Re: [External Sender] How to debug Spark job

2018-09-08 Thread Sonal Goyal
You could also try to profile your program on the executor or driver by
using jvisualvm or yourkit to see if there is any memory/cpu optimization
you could do.

Thanks,
Sonal
Nube Technologies 





On Fri, Sep 7, 2018 at 6:35 PM, James Starks  wrote:

> Got the root cause eventually as it throws java.lang.OutOfMemoryError:
> Java heap space. Increasing --driver-memory temporarily fixes the problem.
> Thanks.
>
> ‐‐‐ Original Message ‐‐‐
> On 7 September 2018 12:32 PM, Femi Anthony 
> wrote:
>
> One way I would go about this would be to try running a new_df.show(numcols,
> truncate=False) on a few columns before you try writing to parquet to
> force computation of newdf and see whether the hanging is occurring at that
> point or during the write. You may also try doing a newdf.count() as well.
>
> Femi
>
> On Fri, Sep 7, 2018 at 5:48 AM James Starks 
> wrote:
>
>>
>> I have a Spark job that reads from a postgresql (v9.5) table, and write
>> result to parquet. The code flow is not complicated, basically
>>
>> case class MyCaseClass(field1: String, field2: String)
>> val df = spark.read.format("jdbc")...load()
>> df.createOrReplaceTempView(...)
>> val newdf = spark.sql("seslect field1, field2 from
>> mytable").as[MyCaseClass].map { row =>
>>   val fieldX = ... // extract something from field2
>>   (field1, fileldX)
>> }.filter { ... /* filter out field 3 that's not valid */ }
>> newdf.write.mode(...).parquet(destPath)
>>
>> This job worked correct without a problem. But it's doesn't look working
>> ok (the job looks like hanged) when adding more fields. The refactored job
>> looks as below
>> ...
>> val newdf = spark.sql("seslect field1, field2, ... fieldN from
>> mytable").as[MyCaseClassWithMoreFields].map { row =>
>> ...
>> NewCaseClassWithMoreFields(...) // all fields plus fieldX
>> }.filter { ... }
>> newdf.write.mode(...).parquet(destPath)
>>
>> Basically what the job does is extracting some info from one of a field
>> in db table, appends that newly extracted field to the original row, and
>> then dumps the whole new table to parquet.
>>
>> new filed + (original field1 + ... + original fieldN)
>> ...
>> ...
>>
>> Records loaded by spark sql to spark job (before refactored) are around
>> 8MM, this remains the same, but when the refactored spark runs, it looks
>> hanging there without progress. The only output on the console is (there is
>> no crash, no exceptions thrown)
>>
>> WARN  HeartbeatReceiver:66 - Removing executor driver with no recent
>> heartbeats: 137128 ms exceeds timeout 12 ms
>>
>> Memory in top command looks like
>>
>> VIRT RES SHR%CPU %MEM
>> 15.866g 8.001g  41.4m 740.3   25.6
>>
>> The command used to  submit spark job is
>>
>> spark-submit --class ... --master local[*] --driver-memory 10g
>> --executor-memory 10g ... --files ... --driver-class-path ... 
>> ...
>>
>> How can I debug or check which part of my code might cause the problem
>> (so I can improve it)?
>>
>> Thanks
>>
>>
>>
>
> --
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>
>
>


Re: [External Sender] How to debug Spark job

2018-09-07 Thread James Starks
Got the root cause eventually as it throws java.lang.OutOfMemoryError: Java 
heap space. Increasing --driver-memory temporarily fixes the problem. Thanks.

‐‐‐ Original Message ‐‐‐
On 7 September 2018 12:32 PM, Femi Anthony  
wrote:

> One way I would go about this would be to try running a new_df.show(numcols, 
> truncate=False) on a few columns before you try writing to parquet to force 
> computation of newdf and see whether the hanging is occurring at that point 
> or during the write. You may also try doing a newdf.count() as well.
>
> Femi
>
> On Fri, Sep 7, 2018 at 5:48 AM James Starks  
> wrote:
>
>> I have a Spark job that reads from a postgresql (v9.5) table, and write 
>> result to parquet. The code flow is not complicated, basically
>>
>> case class MyCaseClass(field1: String, field2: String)
>> val df = spark.read.format("jdbc")...load()
>> df.createOrReplaceTempView(...)
>> val newdf = spark.sql("seslect field1, field2 from 
>> mytable").as[MyCaseClass].map { row =>
>>   val fieldX = ... // extract something from field2
>>   (field1, fileldX)
>> }.filter { ... /* filter out field 3 that's not valid */ }
>> newdf.write.mode(...).parquet(destPath)
>>
>> This job worked correct without a problem. But it's doesn't look working ok 
>> (the job looks like hanged) when adding more fields. The refactored job 
>> looks as below
>> ...
>> val newdf = spark.sql("seslect field1, field2, ... fieldN from 
>> mytable").as[MyCaseClassWithMoreFields].map { row =>
>> ...
>> NewCaseClassWithMoreFields(...) // all fields plus fieldX
>> }.filter { ... }
>> newdf.write.mode(...).parquet(destPath)
>>
>> Basically what the job does is extracting some info from one of a field in 
>> db table, appends that newly extracted field to the original row, and then 
>> dumps the whole new table to parquet.
>>
>> new filed + (original field1 + ... + original fieldN)
>> ...
>> ...
>>
>> Records loaded by spark sql to spark job (before refactored) are around 8MM, 
>> this remains the same, but when the refactored spark runs, it looks hanging 
>> there without progress. The only output on the console is (there is no 
>> crash, no exceptions thrown)
>>
>> WARN  HeartbeatReceiver:66 - Removing executor driver with no recent 
>> heartbeats: 137128 ms exceeds timeout 12 ms
>>
>> Memory in top command looks like
>>
>> VIRT RES SHR%CPU %MEM
>> 15.866g 8.001g  41.4m 740.3   25.6
>>
>> The command used to  submit spark job is
>>
>> spark-submit --class ... --master local[*] --driver-memory 10g 
>> --executor-memory 10g ... --files ... --driver-class-path ...  ...
>>
>> How can I debug or check which part of my code might cause the problem (so I 
>> can improve it)?
>>
>> Thanks
>
> ---
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates and may only be used solely in 
> performance of work or services for Capital One. The information transmitted 
> herewith is intended only for use by the individual or entity to which it is 
> addressed. If the reader of this message is not the intended recipient, you 
> are hereby notified that any review, retransmission, dissemination, 
> distribution, copying or other use of, or taking of any action in reliance 
> upon this information is strictly prohibited. If you have received this 
> communication in error, please contact the sender and delete the material 
> from your computer.

Re: [External Sender] How to debug Spark job

2018-09-07 Thread Femi Anthony
One way I would go about this would be to try running a new_df.show(numcols,
truncate=False) on a few columns before you try writing to parquet to force
computation of newdf and see whether the hanging is occurring at that point
or during the write. You may also try doing a newdf.count() as well.
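
A minimal Scala sketch of that suggestion, reusing the newdf/destPath names from the quoted code below (the write mode is a placeholder):

```
newdf.cache()                                  // optional: avoid recomputing for the write
newdf.show(20, truncate = false)               // materializes a handful of rows
println(s"newdf count = ${newdf.count()}")     // forces a full pass over the data
newdf.write.mode("overwrite").parquet(destPath)
```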

Femi

On Fri, Sep 7, 2018 at 5:48 AM James Starks 
wrote:

>
> I have a Spark job that reads from a postgresql (v9.5) table, and write
> result to parquet. The code flow is not complicated, basically
>
> case class MyCaseClass(field1: String, field2: String)
> val df = spark.read.format("jdbc")...load()
> df.createOrReplaceTempView(...)
> val newdf = spark.sql("seslect field1, field2 from
> mytable").as[MyCaseClass].map { row =>
>   val fieldX = ... // extract something from field2
>   (field1, fileldX)
> }.filter { ... /* filter out field 3 that's not valid */ }
> newdf.write.mode(...).parquet(destPath)
>
> This job worked correct without a problem. But it's doesn't look working
> ok (the job looks like hanged) when adding more fields. The refactored job
> looks as below
> ...
> val newdf = spark.sql("seslect field1, field2, ... fieldN from
> mytable").as[MyCaseClassWithMoreFields].map { row =>
> ...
> NewCaseClassWithMoreFields(...) // all fields plus fieldX
> }.filter { ... }
> newdf.write.mode(...).parquet(destPath)
>
> Basically what the job does is extracting some info from one of a field in
> db table, appends that newly extracted field to the original row, and then
> dumps the whole new table to parquet.
>
> new filed + (original field1 + ... + original fieldN)
> ...
> ...
>
> Records loaded by spark sql to spark job (before refactored) are around
> 8MM, this remains the same, but when the refactored spark runs, it looks
> hanging there without progress. The only output on the console is (there is
> no crash, no exceptions thrown)
>
> WARN  HeartbeatReceiver:66 - Removing executor driver with no recent
> heartbeats: 137128 ms exceeds timeout 12 ms
>
> Memory in top command looks like
>
> VIRT RES SHR%CPU %MEM
> 15.866g 8.001g  41.4m 740.3   25.6
>
> The command used to  submit spark job is
>
> spark-submit --class ... --master local[*] --driver-memory 10g
> --executor-memory 10g ... --files ... --driver-class-path ... 
> ...
>
> How can I debug or check which part of my code might cause the problem (so
> I can improve it)?
>
> Thanks
>
>
>


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


How to debug Spark job

2018-09-07 Thread James Starks
I have a Spark job that reads from a postgresql (v9.5) table, and write result 
to parquet. The code flow is not complicated, basically

case class MyCaseClass(field1: String, field2: String)
val df = spark.read.format("jdbc")...load()
df.createOrReplaceTempView(...)
val newdf = spark.sql("seslect field1, field2 from 
mytable").as[MyCaseClass].map { row =>
  val fieldX = ... // extract something from field2
  (field1, fieldX)
}.filter { ... /* filter out field 3 that's not valid */ }
newdf.write.mode(...).parquet(destPath)

This job worked correctly without a problem. But it doesn't look like it's working ok 
(the job looks like it hangs) when adding more fields. The refactored job looks 
as below
...
val newdf = spark.sql("seslect field1, field2, ... fieldN from 
mytable").as[MyCaseClassWithMoreFields].map { row =>
...
NewCaseClassWithMoreFields(...) // all fields plus fieldX
}.filter { ... }
newdf.write.mode(...).parquet(destPath)

Basically what the job does is extract some info from one of the fields in the db 
table, append that newly extracted field to the original row, and then dump 
the whole new table to parquet.

new field + (original field1 + ... + original fieldN)
...
...

Records loaded by Spark SQL into the Spark job (before the refactoring) are around 8MM; 
this remains the same, but when the refactored job runs, it looks like it hangs 
there without progress. The only output on the console is (there is no crash, 
no exceptions thrown)

WARN  HeartbeatReceiver:66 - Removing executor driver with no recent 
heartbeats: 137128 ms exceeds timeout 12 ms

Memory in top command looks like

VIRT     RES     SHR    %CPU   %MEM
15.866g  8.001g  41.4m  740.3  25.6

The command used to submit the Spark job is

spark-submit --class ... --master local[*] --driver-memory 10g 
--executor-memory 10g ... --files ... --driver-class-path ...  ...

How can I debug or check which part of my code might cause the problem (so I 
can improve it)?

Thanks

Re: how to debug spark app?

2016-08-04 Thread Ben Teeuwen
Related question: what are good profiling tools other than following along in the 
application master while the code runs? 
Are there things that can be logged during the run? If I have, say, 2 ways of 
accomplishing the same thing, and I want to learn about the time/memory/general 
resource-blocking performance of both, what is the best way of doing that? Something 
like what tic, toc does in Matlab, or profile on / profile report.
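
One minimal option (a hand-rolled sketch, not a Spark API) is a tic/toc-style wrapper around each variant, forcing each one with an action:

```
// Wall-clock timing helper for comparing two ways of computing the same result.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.1f ms")
  result
}

// Usage (dfA/dfB are hypothetical DataFrames for the two variants):
// val c1 = timed("variant A") { dfA.count() }
// val c2 = timed("variant B") { dfB.count() }
```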

> On Aug 4, 2016, at 3:19 AM, Sumit Khanna  wrote:
> 
> Am not really sure of the best practices on this , but I either consult the 
> localhost:4040/jobs/ etc 
> or better this :
> 
> val customSparkListener: CustomSparkListener = new CustomSparkListener()
> sc.addSparkListener(customSparkListener)
> class CustomSparkListener extends SparkListener {
>  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd) {
>   debug(s"application ended at time : ${applicationEnd.time}")
>  }
>  override def onApplicationStart(applicationStart: 
> SparkListenerApplicationStart): Unit ={
>   debug(s"[SPARK LISTENER DEBUGS] application Start app attempt id : 
> ${applicationStart.appAttemptId}")
>   debug(s"[SPARK LISTENER DEBUGS] application Start app id : 
> ${applicationStart.appId}")
>   debug(s"[SPARK LISTENER DEBUGS] application start app name : 
> ${applicationStart.appName}")
>   debug(s"[SPARK LISTENER DEBUGS] applicaton start driver logs : 
> ${applicationStart.driverLogs}")
>   debug(s"[SPARK LISTENER DEBUGS] application start spark user : 
> ${applicationStart.sparkUser}")
>   debug(s"[SPARK LISTENER DEBUGS] application start time : 
> ${applicationStart.time}")
>  }
>  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): 
> Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.executorId}")
>   debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.executorInfo}")
>   debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.time}")
>  }
>  override  def onExecutorRemoved(executorRemoved: 
> SparkListenerExecutorRemoved): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] the executor removed Id : 
> ${executorRemoved.executorId}")
>   debug(s"[SPARK LISTENER DEBUGS] the executor removed reason : 
> ${executorRemoved.reason}")
>   debug(s"[SPARK LISTENER DEBUGS] the executor temoved at time : 
> ${executorRemoved.time}")
>  }
> 
>  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] job End id : ${jobEnd.jobId}")
>   debug(s"[SPARK LISTENER DEBUGS] job End job Result : ${jobEnd.jobResult}")
>   debug(s"[SPARK LISTENER DEBUGS] job End time : ${jobEnd.time}")
>  }
>  override def onJobStart(jobStart: SparkListenerJobStart) {
>   debug(s"[SPARK LISTENER DEBUGS] Job started with properties 
> ${jobStart.properties}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with time ${jobStart.time}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with job id 
> ${jobStart.jobId.toString}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with stage ids 
> ${jobStart.stageIds.toString()}")
>   debug(s"[SPARK LISTENER DEBUGS] Job started with stages 
> ${jobStart.stageInfos.size} : $jobStart")
>  }
> 
>  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): 
> Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] Stage ${stageCompleted.stageInfo.stageId} 
> completed with ${stageCompleted.stageInfo.numTasks} tasks.")
>   debug(s"[SPARK LISTENER DEBUGS] Stage details : 
> ${stageCompleted.stageInfo.details.toString}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage completion time : 
> ${stageCompleted.stageInfo.completionTime}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage details : 
> ${stageCompleted.stageInfo.rddInfos.toString()}")
>  }
>  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): 
> Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] Stage properties : 
> ${stageSubmitted.properties}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage rddInfos : 
> ${stageSubmitted.stageInfo.rddInfos.toString()}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage submission Time : 
> ${stageSubmitted.stageInfo.submissionTime}")
>   debug(s"[SPARK LISTENER DEBUGS] Stage submission details : 
> ${stageSubmitted.stageInfo.details.toString()}")
>  }
>  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] task ended reason : ${taskEnd.reason}")
>   debug(s"[SPARK LISTENER DEBUGS] task type : ${taskEnd.taskType}")
>   debug(s"[SPARK LISTENER DEBUGS] task Metrics : ${taskEnd.taskMetrics}")
>   debug(s"[SPARK LISTENER DEBUGS] task Info : ${taskEnd.taskInfo}")
>   debug(s"[SPARK LISTENER DEBUGS] task stage Id : ${taskEnd.stageId}")
>   debug(s"[SPARK LISTENER DEBUGS] task stage attempt Id : 
> ${taskEnd.stageAttemptId}")
>   debug(s"[SPARK LISTENER DEBUGS] task ended reason : ${taskEnd.reason}")
>  }
>  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
>   debug(s"[SPARK LISTENER DEBUGS] stage Attempt id : 
> ${taskStart.stageAttemptId}")
>   debug(s"[SPARK 

Re: how to debug spark app?

2016-08-03 Thread Sumit Khanna
I am not really sure of the best practices on this, but I either consult the
localhost:4040/jobs/ pages etc.,
or, better, this:

val customSparkListener: CustomSparkListener = new CustomSparkListener()
sc.addSparkListener(customSparkListener)

class CustomSparkListener extends SparkListener {
 override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd) {
  debug(s"application ended at time : ${applicationEnd.time}")
 }
 override def onApplicationStart(applicationStart:
SparkListenerApplicationStart): Unit ={
  debug(s"[SPARK LISTENER DEBUGS] application Start app attempt id :
${applicationStart.appAttemptId}")
  debug(s"[SPARK LISTENER DEBUGS] application Start app id :
${applicationStart.appId}")
  debug(s"[SPARK LISTENER DEBUGS] application start app name :
${applicationStart.appName}")
  debug(s"[SPARK LISTENER DEBUGS] applicaton start driver logs :
${applicationStart.driverLogs}")
  debug(s"[SPARK LISTENER DEBUGS] application start spark user :
${applicationStart.sparkUser}")
  debug(s"[SPARK LISTENER DEBUGS] application start time :
${applicationStart.time}")
 }
 override def onExecutorAdded(executorAdded:
SparkListenerExecutorAdded): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.executorId}")
  debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.executorInfo}")
  debug(s"[SPARK LISTENER DEBUGS] ${executorAdded.time}")
 }
 override  def onExecutorRemoved(executorRemoved:
SparkListenerExecutorRemoved): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] the executor removed Id :
${executorRemoved.executorId}")
  debug(s"[SPARK LISTENER DEBUGS] the executor removed reason :
${executorRemoved.reason}")
  debug(s"[SPARK LISTENER DEBUGS] the executor temoved at time :
${executorRemoved.time}")
 }

 override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] job End id : ${jobEnd.jobId}")
  debug(s"[SPARK LISTENER DEBUGS] job End job Result : ${jobEnd.jobResult}")
  debug(s"[SPARK LISTENER DEBUGS] job End time : ${jobEnd.time}")
 }
 override def onJobStart(jobStart: SparkListenerJobStart) {
  debug(s"[SPARK LISTENER DEBUGS] Job started with properties
${jobStart.properties}")
  debug(s"[SPARK LISTENER DEBUGS] Job started with time ${jobStart.time}")
  debug(s"[SPARK LISTENER DEBUGS] Job started with job id
${jobStart.jobId.toString}")
  debug(s"[SPARK LISTENER DEBUGS] Job started with stage ids
${jobStart.stageIds.toString()}")
  debug(s"[SPARK LISTENER DEBUGS] Job started with stages
${jobStart.stageInfos.size} : $jobStart")
 }

 override def onStageCompleted(stageCompleted:
SparkListenerStageCompleted): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] Stage
${stageCompleted.stageInfo.stageId} completed with
${stageCompleted.stageInfo.numTasks} tasks.")
  debug(s"[SPARK LISTENER DEBUGS] Stage details :
${stageCompleted.stageInfo.details.toString}")
  debug(s"[SPARK LISTENER DEBUGS] Stage completion time :
${stageCompleted.stageInfo.completionTime}")
  debug(s"[SPARK LISTENER DEBUGS] Stage details :
${stageCompleted.stageInfo.rddInfos.toString()}")
 }
 override def onStageSubmitted(stageSubmitted:
SparkListenerStageSubmitted): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] Stage properties :
${stageSubmitted.properties}")
  debug(s"[SPARK LISTENER DEBUGS] Stage rddInfos :
${stageSubmitted.stageInfo.rddInfos.toString()}")
  debug(s"[SPARK LISTENER DEBUGS] Stage submission Time :
${stageSubmitted.stageInfo.submissionTime}")
  debug(s"[SPARK LISTENER DEBUGS] Stage submission details :
${stageSubmitted.stageInfo.details.toString()}")
 }
 override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] task ended reason : ${taskEnd.reason}")
  debug(s"[SPARK LISTENER DEBUGS] task type : ${taskEnd.taskType}")
  debug(s"[SPARK LISTENER DEBUGS] task Metrics : ${taskEnd.taskMetrics}")
  debug(s"[SPARK LISTENER DEBUGS] task Info : ${taskEnd.taskInfo}")
  debug(s"[SPARK LISTENER DEBUGS] task stage Id : ${taskEnd.stageId}")
  debug(s"[SPARK LISTENER DEBUGS] task stage attempt Id :
${taskEnd.stageAttemptId}")
  debug(s"[SPARK LISTENER DEBUGS] task ended reason : ${taskEnd.reason}")
 }
 override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] stage Attempt id :
${taskStart.stageAttemptId}")
  debug(s"[SPARK LISTENER DEBUGS] stage Id : ${taskStart.stageId}")
  debug(s"[SPARK LISTENER DEBUGS] task Info : ${taskStart.taskInfo}")
 }
 override def onUnpersistRDD(unpersistRDD: SparkListenerUnpersistRDD): Unit = {
  debug(s"[SPARK LISTENER DEBUGS] the unpersist RDD id : ${unpersistRDD.rddId}")
 }
}

and then I usually check the logs. P.S.: I am running it as a jar.
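
Note the snippet assumes a debug(...) helper; a minimal sketch of one way to provide it (an assumption; any logging facility would do):

```
import org.apache.spark.scheduler.SparkListener
import org.slf4j.LoggerFactory

class CustomSparkListener extends SparkListener {
  private val log = LoggerFactory.getLogger(getClass)
  private def debug(msg: String): Unit = log.debug(msg)

  // ... the onApplicationStart/onJobStart/onStageCompleted/... overrides
  // shown above go here, unchanged ...
}
```

It is registered exactly as in the snippet: sc.addSparkListener(new CustomSparkListener()).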

Thanks,


On Thu, Aug 4, 2016 at 6:46 AM, Ted Yu  wrote:

> Have you looked at:
>
> https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application
>
> If you use Mesos:
>
> https://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging
>
> On Wed, Aug 3, 2016 at 6:13 PM, glen  

Re: how to debug spark app?

2016-08-03 Thread Ted Yu
Have you looked at:
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application

If you use Mesos:
https://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging

On Wed, Aug 3, 2016 at 6:13 PM, glen  wrote:

> Any tool like gdb? Which support break point at some line or some function?
>
>
>
>
>
>


how to debug spark app?

2016-08-03 Thread glen
Any tool like gdb, which supports breakpoints at some line or in some function?





Re: Debug spark jobs on Intellij

2016-05-31 Thread Marcelo Oikawa
> Is this python right? I'm not used to it, I'm used to scala, so
>

No. It is Java.


> val toDebug = rdd.foreachPartition(partition -> { //breakpoint stop here
> *// by val toDebug I mean to assign the result of foreachPartition to a
> variable*
> partition.forEachRemaining(message -> {
> //breakpoint doenst stop here
>
>  })
> });
>
> *toDebug.first* // now is when this method will run
>

foreachPartition is a void method.


>
>
> 2016-05-31 17:59 GMT-03:00 Marcelo Oikawa :
>
>>
>>
>>> Hi Marcelo, this is because the operations in rdd are lazy, you will
>>> only stop at this inside foreach breakpoint when you call a first, a
>>> collect or a reduce operation.
>>>
>>
>> Does forEachRemaining isn't a final method as first, collect or reduce?
>> Anyway, I guess this is not the problem itself because the code inside
>> forEachRemaining runs well but I can't debug this block.
>>
>>
>>> This is when the spark will run the operations.
>>> Have you tried that?
>>>
>>> Cheers.
>>>
>>> 2016-05-31 17:18 GMT-03:00 Marcelo Oikawa :
>>>
 Hello, list.

 I'm trying to debug my spark application on Intellij IDE. Before I
 submit my job, I ran the command line:

 export
 SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000

 after that:

 bin/spark-submit app-jar-with-dependencies.jar 

 The IDE connects with the running job but all code that is running on
 worker machine is unreachable to debug. See below:

 rdd.foreachPartition(partition -> { //breakpoint stop here

 partition.forEachRemaining(message -> {

 //breakpoint doenst stop here

  })
 });

 Does anyone know if is is possible? How? Any ideas?



>>>
>>
>


Re: Debug spark jobs on Intellij

2016-05-31 Thread Dirceu Semighini Filho
Try this:
This is Python, right? I'm not used to it; I'm used to Scala, so

val toDebug = rdd.foreachPartition(partition -> { //breakpoint stop here
*// by val toDebug I mean to assign the result of foreachPartition to a
variable*
partition.forEachRemaining(message -> {
//breakpoint doesn't stop here

 })
});

*toDebug.first* // now is when this method will run


2016-05-31 17:59 GMT-03:00 Marcelo Oikawa :

>
>
>> Hi Marcelo, this is because the operations in rdd are lazy, you will only
>> stop at this inside foreach breakpoint when you call a first, a collect or
>> a reduce operation.
>>
>
> Does forEachRemaining isn't a final method as first, collect or reduce?
> Anyway, I guess this is not the problem itself because the code inside
> forEachRemaining runs well but I can't debug this block.
>
>
>> This is when the spark will run the operations.
>> Have you tried that?
>>
>> Cheers.
>>
>> 2016-05-31 17:18 GMT-03:00 Marcelo Oikawa :
>>
>>> Hello, list.
>>>
>>> I'm trying to debug my spark application on Intellij IDE. Before I
>>> submit my job, I ran the command line:
>>>
>>> export
>>> SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
>>>
>>> after that:
>>>
>>> bin/spark-submit app-jar-with-dependencies.jar 
>>>
>>> The IDE connects with the running job but all code that is running on
>>> worker machine is unreachable to debug. See below:
>>>
>>> rdd.foreachPartition(partition -> { //breakpoint stop here
>>>
>>> partition.forEachRemaining(message -> {
>>>
>>> //breakpoint doenst stop here
>>>
>>>  })
>>> });
>>>
>>> Does anyone know if is is possible? How? Any ideas?
>>>
>>>
>>>
>>
>


Re: Debug spark jobs on Intellij

2016-05-31 Thread Marcelo Oikawa
> Hi Marcelo, this is because the operations in rdd are lazy, you will only
> stop at this inside foreach breakpoint when you call a first, a collect or
> a reduce operation.
>

Isn't forEachRemaining a terminal method like first, collect or reduce?
Anyway, I guess this is not the problem itself, because the code inside
forEachRemaining runs well but I can't debug this block.


> This is when the spark will run the operations.
> Have you tried that?
>
> Cheers.
>
> 2016-05-31 17:18 GMT-03:00 Marcelo Oikawa :
>
>> Hello, list.
>>
>> I'm trying to debug my spark application on Intellij IDE. Before I submit
>> my job, I ran the command line:
>>
>> export
>> SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
>>
>> after that:
>>
>> bin/spark-submit app-jar-with-dependencies.jar 
>>
>> The IDE connects with the running job but all code that is running on
>> worker machine is unreachable to debug. See below:
>>
>> rdd.foreachPartition(partition -> { //breakpoint stop here
>>
>> partition.forEachRemaining(message -> {
>>
>> //breakpoint doenst stop here
>>
>>  })
>> });
>>
>> Does anyone know if is is possible? How? Any ideas?
>>
>>
>>
>


Re: Debug spark jobs on Intellij

2016-05-31 Thread Dirceu Semighini Filho
Hi Marcelo, this is because the operations on an RDD are lazy; you will only
stop at the breakpoint inside foreach when you call a first, a collect or
a reduce operation.
That is when Spark will run the operations.
Have you tried that?
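
A minimal, generic illustration of that laziness point (assuming an existing SparkContext named sc; not the original foreachPartition job):

```
val numbers = sc.parallelize(1 to 10)

val doubled = numbers.map { n =>
  n * 2            // a breakpoint here is not hit yet: map is a lazy transformation
}

val total = doubled.sum()   // sum() is an action, so the closure above runs now
println(total)
```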

Cheers.

2016-05-31 17:18 GMT-03:00 Marcelo Oikawa :

> Hello, list.
>
> I'm trying to debug my spark application on Intellij IDE. Before I submit
> my job, I ran the command line:
>
> export
> SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
>
> after that:
>
> bin/spark-submit app-jar-with-dependencies.jar 
>
> The IDE connects with the running job but all code that is running on
> worker machine is unreachable to debug. See below:
>
> rdd.foreachPartition(partition -> { //breakpoint stop here
>
> partition.forEachRemaining(message -> {
>
> //breakpoint doenst stop here
>
>  })
> });
>
> Does anyone know if is is possible? How? Any ideas?
>
>
>


Debug spark jobs on Intellij

2016-05-31 Thread Marcelo Oikawa
Hello, list.

I'm trying to debug my spark application on Intellij IDE. Before I submit
my job, I ran the command line:

export
SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000

after that:

bin/spark-submit app-jar-with-dependencies.jar 

The IDE connects to the running job, but all the code that runs on the
worker machine is unreachable to debug. See below:

rdd.foreachPartition(partition -> { //breakpoint stop here

partition.forEachRemaining(message -> {

//breakpoint doesn't stop here

 })
});

Does anyone know if this is possible? How? Any ideas?


Re: Debug spark core and streaming programs in scala

2016-05-16 Thread Ted Yu
From https://spark.apache.org/docs/latest/monitoring.html#metrics :

   - JmxSink: Registers metrics for viewing in a JMX console.

FYI

On Sun, May 15, 2016 at 11:54 PM, Mich Talebzadeh  wrote:

> Have you tried Spark GUI on 4040. This will show jobs being executed by
> executors is each stage and the line of code as well.
>
> [image: Inline images 1]
>
> Also command line tools like jps and jmonitor
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 May 2016 at 06:25, Deepak Sharma  wrote:
>
>> Hi
>> I have scala program consisting of spark core and spark streaming APIs
>> Is there any open source tool that i can use to debug the program for
>> performance reasons?
>> My primary interest is to find the block of codes that would be exeuted
>> on driver and what would go to the executors.
>> Is there JMX extension of Spark?
>>
>> --
>> Thanks
>> Deepak
>>
>>
>


Re: Debug spark core and streaming programs in scala

2016-05-16 Thread Mich Talebzadeh
Have you tried the Spark GUI on port 4040? This will show jobs being executed by
executors in each stage, and the line of code as well.

[image: Inline images 1]

Also command line tools like jps and jmonitor

HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 May 2016 at 06:25, Deepak Sharma  wrote:

> Hi
> I have scala program consisting of spark core and spark streaming APIs
> Is there any open source tool that i can use to debug the program for
> performance reasons?
> My primary interest is to find the block of codes that would be exeuted on
> driver and what would go to the executors.
> Is there JMX extension of Spark?
>
> --
> Thanks
> Deepak
>
>


Debug spark core and streaming programs in scala

2016-05-15 Thread Deepak Sharma
Hi,
I have a Scala program consisting of Spark Core and Spark Streaming APIs.
Is there any open source tool that I can use to debug the program for
performance reasons?
My primary interest is to find the blocks of code that would be executed on
the driver and what would go to the executors.
Is there a JMX extension for Spark?
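
To make the question concrete, a generic sketch (assuming an existing SparkSession named spark): code outside the transformation closures runs in the driver JVM, while the closures themselves run on the executors.

```
val sc = spark.sparkContext                              // driver side
val words = sc.parallelize(Seq("spark", "core", "streaming"))

val lengths = words.map { w =>
  w.length                                               // this closure runs on the executors
}

val collected = lengths.collect()                        // results are shipped back to the driver
println(collected.mkString(", "))                        // driver side again
```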

-- 
Thanks
Deepak


Fwd: [Help]:Strange Issue :Debug Spark Dataframe code

2016-04-17 Thread Divya Gehlot
Reposting again, as I am unable to find the root cause of where things are going
wrong.

Experts, please help.


-- Forwarded message --
From: Divya Gehlot <divya.htco...@gmail.com>
Date: 15 April 2016 at 19:13
Subject: [Help]:Strange Issue :Debug Spark Dataframe code
To: "user @spark" <user@spark.apache.org>


Hi,
I am using Spark 1.5.2 with Scala 2.10.
Is there any other option apart from "explain(true)" to debug Spark
DataFrame code?

I am facing a strange issue.
I have a lookup dataframe and am using it to join another dataframe on different
columns.

I am getting an *Analysis exception* in the third join.
When I checked the logical plan, it's using the same reference for the key,
but while selecting, the column references are changing.
For example
df1 = COLUMN1#15,COLUMN2#16,COLUMN3#17

In the first two joins
I am getting the same references and the join works.
For the first two joins, with the column COLUMN1#15 I am getting COLUMN2#16 and
COLUMN3#17.

But at the third join COLUMN1#15 is the same, but the other column references have
been updated to COLUMN2#167, COLUMN3#168.

It's throwing a Spark AnalysisException:

> org.apache.spark.sql.AnalysisException: resolved attribute(s) COLUMN1#15
> missing from


After the first two joins, the dataframe has more than 25 columns.

Could anybody help light the path by holding the torch.
Would really appreciate the help.

Thanks,
Divya


[Help]:Strange Issue :Debug Spark Dataframe code

2016-04-15 Thread Divya Gehlot
Hi,
I am using Spark 1.5.2 with Scala 2.10.
Is there any other option apart from "explain(true)" to debug Spark
DataFrame code?

I am facing a strange issue.
I have a lookup dataframe and am using it to join another dataframe on different
columns.

I am getting an *Analysis exception* in the third join.
When I checked the logical plan, it's using the same reference for the key,
but while selecting, the column references are changing.
For example
df1 = COLUMN1#15,COLUMN2#16,COLUMN3#17

In the first two joins
I am getting the same references and the join works.
For the first two joins, with the column COLUMN1#15 I am getting COLUMN2#16 and
COLUMN3#17.

But at the third join COLUMN1#15 is the same, but the other column references have
been updated to COLUMN2#167, COLUMN3#168.

It's throwing a Spark AnalysisException:

> org.apache.spark.sql.AnalysisException: resolved attribute(s) COLUMN1#15
> missing from


After the first two joins, the dataframe has more than 25 columns.
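
For reference, a minimal sketch of the join pattern (hypothetical table and column names, not the actual code), which may help reproduce the resolution problem in isolation:

```
val lookup = spark.table("lookup_df")    // the lookup dataframe with COLUMN1/2/3
val base   = spark.table("base_df")      // hypothetical base dataframe with key1/2/3

val step1 = base.join(lookup, base("key1") === lookup("COLUMN1"))
val step2 = step1.join(lookup, step1("key2") === lookup("COLUMN2"))
val step3 = step2.join(lookup, step2("key3") === lookup("COLUMN3"))  // AnalysisException here
```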

Could anybody help light the path by holding the torch.
Would really appreciate the help.

Thanks,
Divya


How to debug spark-core with function call stack?

2016-02-16 Thread DaeJin Jung
hello everyone,

I would like to draw the call stack of Spark Core by analyzing the source
code, but I'm not sure how to apply a debugging tool like gdb which
can support the backtrace command.

Please let me know if you have any suggestions.

Best Regards,
Daejin Jung


How to debug Spark source using IntelliJ/ Eclipse

2015-12-05 Thread jatinganhotra
Hi,

I am trying to understand Spark internals and wanted to debug the Spark
source, to add a new feature. I have tried the steps laid out here on the
Spark Wiki page IDE setup
<https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup>,
but they don't work.

I also found other posts in the Dev mailing list such as - 

1.  Spark-1-5-0-setting-up-debug-env
<http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-5-0-setting-up-debug-env-td14056.html>
 
, and

2.  using-IntelliJ-to-debug-SPARK-1-1-Apps-with-mvn-sbt-for-beginners
<http://apache-spark-developers-list.1001551.n3.nabble.com/Intro-to-using-IntelliJ-to-debug-SPARK-1-1-Apps-with-mvn-sbt-for-beginners-td9429.html>
  

But, I found many issues with both the links. I have tried both these
articles many times, often re-starting the whole process from scratch after
deleting everything and re-installing again, but I always face some
dependency issues.

It would be great if someone from the Spark users group could point me to
the steps for setting up Spark debug environment.







Re: Debug Spark

2015-12-02 Thread Masf
This is very interesting.

Thanks!!!

On Thu, Dec 3, 2015 at 8:28 AM, Sudhanshu Janghel <
sudhanshu.jang...@cloudwick.com> wrote:

> Hi,
>
> Here is a doc that I had created for my team. This has steps along with
> snapshots of how to setup debugging in spark using IntelliJ locally.
>
>
> https://docs.google.com/a/cloudwick.com/document/d/13kYPbmK61di0f_XxxJ-wLP5TSZRGMHE6bcTBjzXD0nA/edit?usp=sharing
>
> Kind Regards,
> Sudhanshu
>
> On Thu, Dec 3, 2015 at 6:46 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> This doc will get you started
>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>>
>> Thanks
>> Best Regards
>>
>> On Sun, Nov 29, 2015 at 9:48 PM, Masf <masfwo...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>>
>>> Thanks
>>>
>>> --
>>> Regards.
>>> Miguel Ángel
>>>
>>
>>
>


-- 


Saludos.
Miguel Ángel


Re: Debug Spark

2015-12-02 Thread Sudhanshu Janghel
Hi,

Here is a doc that I created for my team. It has steps, along with
screenshots, showing how to set up debugging in Spark using IntelliJ locally.

https://docs.google.com/a/cloudwick.com/document/d/13kYPbmK61di0f_XxxJ-wLP5TSZRGMHE6bcTBjzXD0nA/edit?usp=sharing

Kind Regards,
Sudhanshu

On Thu, Dec 3, 2015 at 6:46 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> This doc will get you started
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>
> Thanks
> Best Regards
>
> On Sun, Nov 29, 2015 at 9:48 PM, Masf <masfwo...@gmail.com> wrote:
>
>> Hi
>>
>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>
>> Thanks
>>
>> --
>> Regards.
>> Miguel Ángel
>>
>
>


Re: Debug Spark

2015-12-02 Thread Akhil Das
This doc will get you started
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ

Thanks
Best Regards

On Sun, Nov 29, 2015 at 9:48 PM, Masf <masfwo...@gmail.com> wrote:

> Hi
>
> Is it possible to debug spark locally with IntelliJ or another IDE?
>
> Thanks
>
> --
> Regards.
> Miguel Ángel
>


Re: Debug Spark

2015-11-30 Thread Jacek Laskowski
Hi,

Yes, that's possible -- I'm doing it every day in local and standalone modes.

Just use SPARK_PRINT_LAUNCH_COMMAND=1 before any Spark command, e.g.
spark-submit or spark-shell, to see the command used to start it:

$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell

The SPARK_PRINT_LAUNCH_COMMAND environment variable controls whether the
Spark launch command is printed to the standard error output, i.e.
System.err.

Once you've got the command, add the following command-line option to
enable the JDWP agent and have the JVM suspended (suspend=y) until a remote
debugging client connects (on port 5005):

-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

In IntelliJ IDEA, define a new debug configuration for Remote and
press Debug. You're done.

https://www.jetbrains.com/idea/help/debugging-2.html might help.
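
A shortcut that usually saves editing the printed command by hand is to pass
the same agent options through SPARK_SUBMIT_OPTS, which the launcher adds to
the JVM it starts (a sketch; port 5005 matches the Remote configuration
above):

$ export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
$ ./bin/spark-shell    # the JVM now waits on port 5005 until the debugger attaches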

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
> Hi
>
> Is it possible to debug spark locally with IntelliJ or another IDE?
>
> Thanks
>
> --
> Regards.
> Miguel Ángel

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
Spark Job server allows you to submit your apps to any kind of deployment
(Standalone, Cluster). I think that it could be suitable for your use case.
Check the following Github repo:
https://github.com/spark-jobserver/spark-jobserver

Ardo

On Sun, Nov 29, 2015 at 6:42 PM, Նարեկ Գալստեան <ngalsty...@gmail.com>
wrote:

> A question regarding the topic,
>
> I am using Intellij to write spark applications and then have to ship the
> source code to my cluster on the cloud to compile and test
>
> is there a way to automatise the process using Intellij?
>
> Narek Galstyan
>
> Նարեկ Գալստյան
>
> On 29 November 2015 at 20:51, Ndjido Ardo BAR <ndj...@gmail.com> wrote:
>
>> Masf, the following link sets the basics to start debugging your spark
>> apps in local mode:
>>
>>
>> https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940
>>
>> Ardo
>>
>> On Sun, Nov 29, 2015 at 5:34 PM, Masf <masfwo...@gmail.com> wrote:
>>
>>> Hi Ardo
>>>
>>>
>>> Some tutorial to debug with Intellij?
>>>
>>> Thanks
>>>
>>> Regards.
>>> Miguel.
>>>
>>>
>>> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com>
>>> wrote:
>>>
>>>> hi,
>>>>
>>>> IntelliJ is just great for that!
>>>>
>>>> cheers,
>>>> Ardo.
>>>>
>>>> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>>>>
>>>>> Thanks
>>>>>
>>>>> --
>>>>> Regards.
>>>>> Miguel Ángel
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> Saludos.
>>> Miguel Ángel
>>>
>>
>>
>


Re: Debug Spark

2015-11-29 Thread Նարեկ Գալստեան
A question regarding the topic,

I am using IntelliJ to write Spark applications and then have to ship the
source code to my cluster in the cloud to compile and test.

Is there a way to automate the process using IntelliJ?

Narek Galstyan

Նարեկ Գալստյան

On 29 November 2015 at 20:51, Ndjido Ardo BAR <ndj...@gmail.com> wrote:

> Masf, the following link sets the basics to start debugging your spark
> apps in local mode:
>
>
> https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940
>
> Ardo
>
> On Sun, Nov 29, 2015 at 5:34 PM, Masf <masfwo...@gmail.com> wrote:
>
>> Hi Ardo
>>
>>
>> Some tutorial to debug with Intellij?
>>
>> Thanks
>>
>> Regards.
>> Miguel.
>>
>>
>> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com>
>> wrote:
>>
>>> hi,
>>>
>>> IntelliJ is just great for that!
>>>
>>> cheers,
>>> Ardo.
>>>
>>> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Regards.
>>>> Miguel Ángel
>>>>
>>>
>>>
>>
>>
>> --
>>
>>
>> Saludos.
>> Miguel Ángel
>>
>
>


Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
Masf, the following link sets the basics to start debugging your spark apps
in local mode:

https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940

Ardo

On Sun, Nov 29, 2015 at 5:34 PM, Masf <masfwo...@gmail.com> wrote:

> Hi Ardo
>
>
> Some tutorial to debug with Intellij?
>
> Thanks
>
> Regards.
> Miguel.
>
>
> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com> wrote:
>
>> hi,
>>
>> IntelliJ is just great for that!
>>
>> cheers,
>> Ardo.
>>
>> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>>
>>> Thanks
>>>
>>> --
>>> Regards.
>>> Miguel Ángel
>>>
>>
>>
>
>
> --
>
>
> Saludos.
> Miguel Ángel
>


Re: Debug Spark

2015-11-29 Thread Danny Stephan
Hi,

You can use "jdwp" to debug everything that runs on top of the JVM, including Spark.

Specifically for IntelliJ, maybe this link can help you:

http://danosipov.com/?p=779 <http://danosipov.com/?p=779>

regards,
Danny


> Op 29 nov. 2015, om 17:34 heeft Masf <masfwo...@gmail.com> het volgende 
> geschreven:
> 
> Hi Ardo
> 
> 
> Some tutorial to debug with Intellij?
> 
> Thanks
> 
> Regards.
> Miguel.
> 
> 
> On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com 
> <mailto:ndj...@gmail.com>> wrote:
> hi,
> 
> IntelliJ is just great for that!
> 
> cheers,
> Ardo.
> 
> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com 
> <mailto:masfwo...@gmail.com>> wrote:
> Hi
> 
> Is it possible to debug spark locally with IntelliJ or another IDE?
> 
> Thanks
> 
> -- 
> Regards.
> Miguel Ángel
> 
> 
> 
> 
> -- 
> 
> 
> Saludos.
> Miguel Ángel



Debug Spark

2015-11-29 Thread Masf
Hi

Is it possible to debug spark locally with IntelliJ or another IDE?

Thanks

-- 
Regards.
Miguel Ángel


Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
hi,

IntelliJ is just great for that!

cheers,
Ardo.

On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:

> Hi
>
> Is it possible to debug spark locally with IntelliJ or another IDE?
>
> Thanks
>
> --
> Regards.
> Miguel Ángel
>


Re: Debug Spark

2015-11-29 Thread Masf
Hi Ardo


Some tutorial to debug with Intellij?

Thanks

Regards.
Miguel.


On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com> wrote:

> hi,
>
> IntelliJ is just great for that!
>
> cheers,
> Ardo.
>
> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
>
>> Hi
>>
>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>
>> Thanks
>>
>> --
>> Regards.
>> Miguel Ángel
>>
>
>


-- 


Saludos.
Miguel Ángel


Re: Debug Spark Streaming in PyCharm

2015-07-10 Thread Tathagata Das
spark-submit does a lot of configuration magic (classpaths, etc.) under the
covers to enable PySpark to find the Spark JARs. I am not sure how you can
run things directly from the PyCharm IDE; others in the community may be
able to answer. For now the main way to run PySpark code is through
spark-submit, or pyspark (which uses spark-submit underneath).
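
One thing that sometimes helps here is setting PYSPARK_SUBMIT_ARGS in the
PyCharm run configuration, so the Kafka assembly jar is handed to the
launcher without touching the program. A sketch only - the jar path is
illustrative, and on Spark 1.4+ the value has to end with pyspark-shell:

PYSPARK_SUBMIT_ARGS="--jars /path/to/spark-streaming-kafka-assembly.jar pyspark-shell"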

On Fri, Jul 10, 2015 at 6:28 AM, blbradley bradleytas...@gmail.com wrote:

 Hello,

 I'm trying to debug a PySpark app with Kafka Streaming in PyCharm. However,
 PySpark cannot find the jar dependencies for Kafka Streaming without
 editing
 the program. I can temporarily use SparkConf to set 'spark.jars', but I'm
 using Mesos for production and don't want to edit my program everytime I
 want to debug. I'd like to find a way to debug without editing the source.

 Here's what my PyCharm debug execution command looks like:

 home/brandon/.pyenv/versions/coinspark/bin/python2.7
 /opt/pycharm-community/helpers/pydev/pydevd.py --multiproc --client
 127.0.0.1 --port 59042 --file
 /home/brandon/src/coins/coinspark/streaming.py

 I might be able to use spark-submit has the command PyCharm runs, but I'm
 not sure if that will work with the debugger.

 Thoughts?

 Cheers!
 Brandon Bradley



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Debug-Spark-Streaming-in-PyCharm-tp23766.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Debug Spark Streaming in PyCharm

2015-07-10 Thread blbradley
Hello,

I'm trying to debug a PySpark app with Kafka Streaming in PyCharm. However,
PySpark cannot find the jar dependencies for Kafka Streaming without editing
the program. I can temporarily use SparkConf to set 'spark.jars', but I'm
using Mesos for production and don't want to edit my program every time I
want to debug. I'd like to find a way to debug without editing the source.

Here's what my PyCharm debug execution command looks like:

home/brandon/.pyenv/versions/coinspark/bin/python2.7
/opt/pycharm-community/helpers/pydev/pydevd.py --multiproc --client
127.0.0.1 --port 59042 --file /home/brandon/src/coins/coinspark/streaming.py

I might be able to use spark-submit as the command PyCharm runs, but I'm
not sure if that will work with the debugger.

Thoughts?

Cheers!
Brandon Bradley



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Debug-Spark-Streaming-in-PyCharm-tp23766.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to debug spark in IntelliJ Idea

2015-05-18 Thread Yi.Zhang
Hi all,

Currently, I wrote some code lines to access spark master which was deployed
on standalone style. I wanted to set the breakpoint for spark master which
was running on the different process. I am wondering maybe I need attach
process in IntelliJ, so that when AppClient sent the message to remote
actor(spark master), the breakpoint would be enabled. 

I don't know how to debug it in IntelliJ Idea. I need help. Thanks.

Regards,
Yi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-debug-spark-in-IntelliJ-Idea-tp22932.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to debug Spark on Yarn?

2015-04-28 Thread Steve Loughran

On 27 Apr 2015, at 07:51, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

Spark 1.3

1. View stderr/stdout from executor from Web UI: when the job is running i 
figured out the executor that am suppose to see, and those two links show 4 
special characters on browser.

2. Tail on Yarn logs:


/apache/hadoop/bin/yarn logs -applicationId  application_1429087638744_151059 | 
less

Threw me: Application has not completed. Logs are only available after an 
application completes


Any other ideas that i can try ?


There's some support in Hadoop 2.6+ for streaming the logs of running apps to
HDFS. I don't know if Spark supports that (I haven't tested it), so I won't
give the details right now.

You can go from the RM to the node managers running the containers, and view 
the logs that way.


From some other notes of mine:




One configuration to aid debugging is to tell the NodeManagers to keep data
for a short period after containers finish:

<!-- 10 minutes after a failure to see what is left in the directory -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>


You can then retrieve the logs either via the web UI, or by connecting to the
server (usually over ssh) and reading them from the log directory.

We also recommend making sure that YARN kills processes:

<!-- time before the process gets a -9 -->
<property>
  <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
  <value>3</value>
</property>




Re: How to debug Spark on Yarn?

2015-04-27 Thread ๏̯͡๏
Spark 1.3

1. View stderr/stdout from the executor from the Web UI: when the job is running I
figured out the executor I am supposed to see, and those two links show 4
special characters in the browser.

2. Tail on Yarn logs:

/apache/hadoop/bin/yarn logs -applicationId
application_1429087638744_151059 | less
Threw me: Application has not completed. Logs are only available after an
application completes


Any other ideas that i can try ?



On Sat, Apr 25, 2015 at 12:07 AM, Sven Krasser kras...@gmail.com wrote:

 On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
 wrote:


 Spark 1.3 should have links to the executor logs in the UI while the
 application is running. Not yet in the history server, though.


 You're absolutely correct -- didn't notice it until now. This is a great
 addition!

 --
 www.skrasser.com http://www.skrasser.com/?utm_source=sig




-- 
Deepak


Re: How to debug Spark on Yarn?

2015-04-27 Thread ๏̯͡๏
1) Application container logs from the RM Web UI never load in the browser. I
eventually have to kill the browser.
2) /apache/hadoop/bin/yarn logs -applicationId
application_1429087638744_151059 | less
emits logs only after the application has completed.

Are there no better ways to see the logs as they are emitted? Something
similar to the Hadoop world?


On Mon, Apr 27, 2015 at 1:58 PM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:

 You can check container logs from RM web UI or when log-aggregation is
 enabled with the yarn command. There are other, but less convenient
 options.

 On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Spark 1.3

 1. View stderr/stdout from executor from Web UI: when the job is running
 i figured out the executor that am suppose to see, and those two links show
 4 special characters on browser.

 2. Tail on Yarn logs:

 /apache/hadoop/bin/yarn logs -applicationId
 application_1429087638744_151059 | less
 Threw me: Application has not completed. Logs are only available after an
 application completes


 Any other ideas that i can try ?



 On Sat, Apr 25, 2015 at 12:07 AM, Sven Krasser kras...@gmail.com wrote:

 On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
 wrote:


 Spark 1.3 should have links to the executor logs in the UI while the
 application is running. Not yet in the history server, though.


 You're absolutely correct -- didn't notice it until now. This is a great
 addition!

 --
 www.skrasser.com http://www.skrasser.com/?utm_source=sig




 --
 Deepak




-- 
Deepak


Re: How to debug Spark on Yarn?

2015-04-27 Thread Zoltán Zvara
You can check container logs from the RM web UI or, when log aggregation is
enabled, with the yarn command. There are other, but less convenient, options.

On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Spark 1.3

 1. View stderr/stdout from executor from Web UI: when the job is running i
 figured out the executor that am suppose to see, and those two links show 4
 special characters on browser.

 2. Tail on Yarn logs:

 /apache/hadoop/bin/yarn logs -applicationId
 application_1429087638744_151059 | less
 Threw me: Application has not completed. Logs are only available after an
 application completes


 Any other ideas that i can try ?



 On Sat, Apr 25, 2015 at 12:07 AM, Sven Krasser kras...@gmail.com wrote:

 On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
 wrote:


 Spark 1.3 should have links to the executor logs in the UI while the
 application is running. Not yet in the history server, though.


 You're absolutely correct -- didn't notice it until now. This is a great
 addition!

 --
 www.skrasser.com http://www.skrasser.com/?utm_source=sig




 --
 Deepak




Re: How to debug Spark on Yarn?

2015-04-24 Thread Marcelo Vanzin
On top of what's been said...

On Wed, Apr 22, 2015 at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 1) I can go to Spark UI and see the status of the APP but cannot see the
 logs as the job progresses. How can i see logs of executors as they progress
 ?

Spark 1.3 should have links to the executor logs in the UI while the
application is running. Not yet in the history server, though.

 2) In case the App fails/completes then Spark UI vanishes and i get a YARN
 Job page that says job failed, there are no link to Spark UI again. Now i
 take the job ID and run yarn application logs appid and my console fills up
 (with huge scrolling) with logs of all executors. Then i copy and paste into
 a text editor and search for keywords Exception , Job aborted due to .
 Is this the right way to view logs ?

Aside from Ted's suggestion, you could also pipe the yarn logs
output to less.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to debug Spark on Yarn?

2015-04-24 Thread Sven Krasser
For #1, click on a worker node on the YARN dashboard. From there,
Tools > Local logs > Userlogs has the logs for each application, and you can
view them by executor even while an application is running. (This is for
Hadoop 2.4, things may have changed in 2.6.)
-Sven

On Thu, Apr 23, 2015 at 6:27 AM, Ted Yu yuzhih...@gmail.com wrote:

 For step 2, you can pipe application log to a file instead of
 copy-pasting.

 Cheers



 On Apr 22, 2015, at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I submit a spark app to YARN and i get these messages

  15/04/22 22:45:04 INFO yarn.Client: Application report for
 application_1429087638744_101363 (state: RUNNING)

 15/04/22 22:45:04 INFO yarn.Client: Application report for
 application_1429087638744_101363 (state: RUNNING).

 ...


 1) I can go to Spark UI and see the status of the APP but cannot see the
 logs as the job progresses. How can i see logs of executors as they
 progress ?

 2) In case the App fails/completes then Spark UI vanishes and i get a YARN
 Job page that says job failed, there are no link to Spark UI again. Now i
 take the job ID and run yarn application logs appid and my console fills up
 (with huge scrolling) with logs of all executors. Then i copy and paste
 into a text editor and search for keywords Exception , Job aborted due
 to . Is this the right way to view logs ?

 --
 Deepak




-- 
www.skrasser.com http://www.skrasser.com/?utm_source=sig


Re: How to debug Spark on Yarn?

2015-04-24 Thread Sven Krasser
On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin van...@cloudera.com
wrote:


 Spark 1.3 should have links to the executor logs in the UI while the
 application is running. Not yet in the history server, though.


You're absolutely correct -- didn't notice it until now. This is a great
addition!

-- 
www.skrasser.com http://www.skrasser.com/?utm_source=sig


Re: How to debug Spark on Yarn?

2015-04-23 Thread Ted Yu
For step 2, you can pipe application log to a file instead of copy-pasting. 
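
Something along these lines, reusing the application id from your message,
and then grepping the file for the keywords you mentioned (the output file
name is illustrative):

/apache/hadoop/bin/yarn logs -applicationId application_1429087638744_101363 > app_101363.log
grep -nE "Exception|Job aborted due to" app_101363.log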

Cheers



 On Apr 22, 2015, at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 
 I submit a spark app to YARN and i get these messages
 
 
 
 15/04/22 22:45:04 INFO yarn.Client: Application report for 
 application_1429087638744_101363 (state: RUNNING)
 
 
 15/04/22 22:45:04 INFO yarn.Client: Application report for 
 application_1429087638744_101363 (state: RUNNING).
 
 ...
 
 
 
 1) I can go to Spark UI and see the status of the APP but cannot see the logs 
 as the job progresses. How can i see logs of executors as they progress ?
 
 2) In case the App fails/completes then Spark UI vanishes and i get a YARN 
 Job page that says job failed, there are no link to Spark UI again. Now i 
 take the job ID and run yarn application logs appid and my console fills up 
 (with huge scrolling) with logs of all executors. Then i copy and paste into 
 a text editor and search for keywords Exception , Job aborted due to . Is 
 this the right way to view logs ?
 
 
 -- 
 Deepak
 


How to debug Spark on Yarn?

2015-04-22 Thread ๏̯͡๏
I submit a Spark app to YARN and I get these messages:

 15/04/22 22:45:04 INFO yarn.Client: Application report for
application_1429087638744_101363 (state: RUNNING)

15/04/22 22:45:04 INFO yarn.Client: Application report for
application_1429087638744_101363 (state: RUNNING).

...


1) I can go to the Spark UI and see the status of the app, but I cannot see
the logs as the job progresses. How can I see the logs of executors as they
progress?

2) In case the app fails/completes, the Spark UI vanishes and I get a YARN
job page that says the job failed; there is no link to the Spark UI anymore.
Now I take the job ID, fetch the yarn application logs for that app id, and
my console fills up (with huge scrolling) with the logs of all executors.
Then I copy and paste into a text editor and search for the keywords
"Exception" and "Job aborted due to". Is this the right way to view logs?

-- 
Deepak


How to properly debug spark streaming?

2014-10-28 Thread kpeng1
I am still fairly new to Spark and Spark Streaming. I have been struggling
with how to properly debug Spark Streaming, and I was wondering what the best
approach is. I have been basically putting println statements everywhere,
but sometimes they show up when I run the job and sometimes they don't. I
currently run the job using the following command:
/usr/bin/spark-submit --class project.TheMain --master local[2]
/home/cloudera/my.jar 100
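
One common reason the println output only shows up some of the time is that
statements inside transformations run only when an action forces them, and
on a cluster they run on the executors, so the output lands in the executor
logs rather than the driver console. A minimal sketch of printing a small
sample on the driver instead - the socket source and all names here are
illustrative, not taken from the original job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DebugPrint {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DebugPrint").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
    lines.foreachRDD { rdd =>
      // take() pulls a few records back to the driver, so this println is
      // guaranteed to appear in the spark-submit console output
      rdd.take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

DStream.print() does much the same thing with less code.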





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-properly-debug-spark-streaming-tp17571.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Debug Spark in Cluster Mode

2014-10-10 Thread Ilya Ganelin
I would also be interested in knowing more about this. I have used Cloudera
Manager and the Spark resource interface (clientnode:4040) but would love to
know if there are other tools out there - either for post-processing or for
better observation during execution.
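
For the post-processing side in particular, one option is to enable the event
log so the History Server can replay applications after they finish. A sketch
of spark-defaults.conf - the HDFS path is illustrative:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-event-logs
spark.history.fs.logDirectory    hdfs:///spark-event-logs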
On Oct 9, 2014 4:50 PM, Rohit Pujari rpuj...@hortonworks.com wrote:

 Hello Folks:

 What're some best practices to debug Spark in cluster mode?


 Thanks,
 Rohit

 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity
 to which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.


Debug Spark in Cluster Mode

2014-10-09 Thread Rohit Pujari
Hello Folks:

What're some best practices to debug Spark in cluster mode?


Thanks,
Rohit

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Gerard Maas
I think that you have two options:

- to run your code locally, you can use local mode by using the 'local'
master like so:
 new SparkConf().setMaster("local[4]")
where 4 is the number of cores assigned to local mode.

- to run your code remotely you need to build the jar with dependencies and
add it to your context:
 new SparkConf().setMaster("spark://uri")
   .setJars(Array("/path/to/target/jar-with-dependencies.jar"))
You will need to run Maven before running your program to ensure the latest
version of your jar is built.

-regards, Gerard.
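
Applied to the program quoted below, option 1 boils down to something like
this minimal sketch (the hdfs:// paths are the same illustrative ones from
the original mail, shortened):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object App {
  def main(args: Array[String]): Unit = {
    // local[4] runs the driver and 4 worker threads in the IDE's own JVM,
    // so breakpoints and println output stay in a single process
    val conf  = new SparkConf().setMaster("local[4]").setAppName("WordCount")
    val spark = new SparkContext(conf)

    val counts = spark.textFile("hdfs://xxx:9000/wcinput/pg1184.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://xxx:9000/wcoutput")
    spark.stop()
  }
}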



On Sat, Jun 7, 2014 at 3:10 AM, Wei Tan w...@us.ibm.com wrote:

 Hi,

   I am trying to write and debug Spark applications in scala-ide and
 maven, and in my code I target at a Spark instance at spark://xxx

  object App {

    def main(args: Array[String]) {
      println("Hello World!")
      val sparkConf = new
        SparkConf().setMaster("spark://xxx:7077").setAppName("WordCount")

      val spark = new SparkContext(sparkConf)
      val file = spark.textFile("hdfs://xxx:9000/wcinput/pg1184.txt")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://flex05.watson.ibm.com:9000/wcoutput")
    }

  }

 I added spark-core and hadoop-client in maven dependency so the code
 compiles fine.

 When I click run in Eclipse I got this error:

 14/06/06 20:52:18 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.ClassNotFoundException
 java.lang.ClassNotFoundException: samples.App$$anonfun$2

 I googled this error and it seems that I need to package my code into a
 jar file and push it to spark nodes. But since I am debugging the code, it
 would be handy if I can quickly see results without packaging and uploading
 jars.

 What is the best practice of writing a spark application (like wordcount)
 and debug quickly on a remote spark instance?

 Thanks!
 Wei


 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center
 *http://researcher.ibm.com/person/us-wtan*
 http://researcher.ibm.com/person/us-wtan


Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Madhu
For debugging, I run locally inside Eclipse without Maven.
I just add the Spark assembly jar to my Eclipse project build path and click
'Run As... Scala Application'.
I have done the same with Java and ScalaTest; it's quick and easy.
I didn't see any third-party jar dependencies in your code, so that should
be sufficient for your example.



-
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/best-practice-write-and-debug-Spark-application-in-scala-ide-and-maven-tp7151p7183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


best practice: write and debug Spark application in scala-ide and maven

2014-06-06 Thread Wei Tan
Hi,

  I am trying to write and debug Spark applications in scala-ide and 
maven, and in my code I target at a Spark instance at spark://xxx

object App {

  def main(args: Array[String]) {
    println("Hello World!")
    val sparkConf = new SparkConf().setMaster("spark://xxx:7077").setAppName("WordCount")

    val spark = new SparkContext(sparkConf)
    val file = spark.textFile("hdfs://xxx:9000/wcinput/pg1184.txt")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://flex05.watson.ibm.com:9000/wcoutput")
  }

}

I added spark-core and hadoop-client as Maven dependencies, so the code
compiles fine.

When I click run in Eclipse I got this error:

14/06/06 20:52:18 WARN scheduler.TaskSetManager: Loss was due to 
java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: samples.App$$anonfun$2

I googled this error and it seems that I need to package my code into a
jar file and push it to the Spark nodes. But since I am debugging the code,
it would be handy if I could quickly see results without packaging and
uploading jars.

What is the best practice for writing a Spark application (like wordcount)
and debugging it quickly against a remote Spark instance?

Thanks!
Wei


-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan