Hi,

I saw the following error message in the executor logs:
*Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000662f00000, 520093696, 0) failed; error='Cannot allocate memory' (errno=12)*

By increasing the RAM of my nodes to 40 GB each, I was able to get rid of the RPC connection failures. However, the results I am getting after copying the data are still incorrect. Before termination, the executor logs show this error message:

*ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM*

I believe the executors are not shutting down gracefully and that is causing Spark to lose some data. Can anyone please explain how I can debug this further?

Thanks,
Faraz

On Mon, Feb 26, 2018 at 4:46 PM, Faraz Mateen <fmat...@an10.io> wrote:

> Hi,
>
> I think I have a situation where Spark is silently failing to write data
> to my Cassandra table. Let me explain my current situation.
>
> I have a source table of around 402 million records with 84 columns. Its
> schema is roughly:
>
> *id (text) | datetime (timestamp) | field1 (text) | ... | field84 (text)*
>
> To optimize queries on the data, I am splitting it into multiple tables
> using the Spark job mentioned below. Each new table is meant to hold the
> data of a single field from the source table and has the following
> structure:
>
> *id (text) | datetime (timestamp) | day (date) | value (text)*
>
> where the "value" column holds one field column from the source table.
> The source table's *402 million* records amount to around *85 GB* of data
> distributed over *3 nodes (27 + 32 + 26 GB)*. Each new table being
> populated should end up with the same number of records, but some data is
> missing.
>
> Initially, I assumed a problem with the data in the source table, so I
> copied one week of data from it into another table with the same schema
> and split that the same way as before. This time, the field-specific
> table had the same number of records as the source. I repeated this with
> a data set from another time period, and again the record counts matched.
>
> This has led me to believe that there is some problem with Spark's
> handling of the large data set. Here is my spark-submit command to split
> the data:
>
> *~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master
> spark://10.128.0.18:7077 --packages
> datastax:spark-cassandra-connector:2.0.1-s_2.11 --conf
> spark.cassandra.connection.host="10.128.1.1,10.128.1.2,10.128.1.3" --conf
> "spark.storage.memoryFraction=1" --conf spark.local.dir=/media/db/
> --executor-memory 10G --num-executors=6 --executor-cores=3
> --total-executor-cores 18 split_data.py*
>
> *split_data.py* is the name of my pyspark application. It is essentially
> executing the following query:
>
> *("select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "+field+"
> as value from data " )*
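>
> For reference, a simplified sketch of what the full job in split_data.py
> is doing (keyspace and destination table names here are placeholders, and
> the destination table already exists in Cassandra):
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("split_data").getOrCreate()
>
> # Register the 84-column source table as a temp view for the query.
> spark.read \
>     .format("org.apache.spark.sql.cassandra") \
>     .options(keyspace="my_keyspace", table="data") \
>     .load() \
>     .createOrReplaceTempView("data")
>
> field = "field1"  # the source column being split out on this run
>
> result = spark.sql(
>     "select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "
>     + field + " as value from data ")
>
> # Append the result into the per-field destination table.
> result.write \
>     .format("org.apache.spark.sql.cassandra") \
>     .options(keyspace="my_keyspace", table="data_" + field) \
>     .mode("append") \
>     .save()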
>
> With the spark-submit logging level set to WARN, I saw the following
> WARNINGS and ERRORS on the console:
>
> https://gist.github.com/anonymous/e05f1aaa131348c9a5a9a2db6d141f8c#file-gistfile1-txt
>
> The Spark job does not crash after these errors and warnings. However,
> when I check the number of records in the new table, it is always less
> than the number of records in the source table. Moreover, the number of
> records in the destination table is not the same after each run of the
> query.
>
> My cluster consists of *3 gcloud VMs*, with a Spark and a Cassandra node
> deployed on each VM.
> Each VM has *8 cores* of CPU and *30 GB* of RAM.
> Spark is deployed in standalone cluster mode.
> The Spark version is *2.1.0*.
> I am using datastax spark-cassandra-connector version *2.0.1*.
> The Cassandra version is *3.9*.
> Each Spark executor is allowed 10 GB of RAM, and there are 2 executors
> running on each node.
>
> Is the problem related to my machine resources? How can I root-cause or
> fix this?
> Any help will be greatly appreciated.
>
> Thanks,
> Faraz
>
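
In case it helps, the way I am verifying the record counts after each run
is essentially a count comparison through the connector, along these lines
(simplified; keyspace and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify_counts").getOrCreate()

def table_count(keyspace, table):
    # Full scan through the connector; slow on ~400M rows, but it gives
    # an exact figure to compare source vs. destination.
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load()
            .count())

src = table_count("my_keyspace", "data")
dst = table_count("my_keyspace", "data_field1")
print("source=%d destination=%d missing=%d" % (src, dst, src - dst))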