Hi,

I think I have a situation where Spark is silently failing to write data to
my Cassandra table. Let me explain my setup.

I have a table with around 402 million records and 84 columns. The table
schema is something like this:


*id (text)  |   datetime (timestamp)  |   field1 (text) | ..... |   field84 (text)*


To optimize queries on the data, I am splitting it into multiple tables
using the Spark job mentioned below. Each separate table holds the data from
just one field of the source table. Each new table has the following structure:


*id (text)  |   datetime (timestamp)  |   day (date)  |   value (text)*


where, "value" column will contain the field column from the source table.
Source table has around *402 million* records which is around *85 GB* of
data distributed on *3 nodes (27 + 32 + 26)*. New table being populated is
supposed to have the same number of records but it is missing some data.

Initially, I assumed some problem with the data in the source table. So I
copied one week of data from the source table into another table with the
same schema. Then I split the data as before, but this time the
field-specific table had the same number of records as the source table. I
repeated this with another data set from another time period, and again the
number of records in the field-specific table was equal to the number of
records in the source table.

This has led me to believe that there is some problem with Spark's handling
of the large data set. Here is my spark-submit command to separate the data:

~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://10.128.0.18:7077 \
  --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
  --conf spark.cassandra.connection.host="10.128.1.1,10.128.1.2,10.128.1.3" \
  --conf "spark.storage.memoryFraction=1" \
  --conf spark.local.dir=/media/db/ \
  --executor-memory 10G --num-executors=6 --executor-cores=3 \
  --total-executor-cores 18 split_data.py


*split_data.py* is the name of my PySpark application. It essentially
executes the following query:


*("select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "+field+"
as value  from data  " )*
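
For reference, here is roughly what split_data.py does. This is a minimal
sketch; the keyspace name "mykeyspace", the table names, and the field name
are placeholders for my actual values, and the destination table is assumed
to already exist in Cassandra:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split_data").getOrCreate()

field = "field1"  # the source column currently being split out (placeholder)

# read the full source table through the Cassandra connector
src = (spark.read
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="mykeyspace", table="data")
       .load())
src.createOrReplaceTempView("data")

# project id, datetime, the derived day, and the selected field as "value"
result = spark.sql(
    "select id, datetime, DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "
    + field + " as value from data")

# append into the per-field destination table (assumed to exist already)
(result.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="mykeyspace", table="data_" + field)
 .mode("append")
 .save())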

The Spark job does not crash, but when I check the number of records in the
new table, it is always less than the number of records in the source table.
Moreover, the number of records in the destination table is not the same
after each run of the query (a count-comparison sketch follows the gist link
below). I changed the logging level for spark-submit to WARN and saw the
following warnings and errors on the console:

https://gist.github.com/anonymous/e05f1aaa131348c9a5a9a2db6d141f8c#file-gistfile1-txt
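
This is roughly how I compare the counts, counting through Spark (again a
sketch, reusing the same SparkSession and placeholder names as above):

def cassandra_count(keyspace, table):
    # full-table count through the Cassandra connector
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load()
            .count())

print("source rows     :", cassandra_count("mykeyspace", "data"))
print("destination rows:", cassandra_count("mykeyspace", "data_field1"))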

My cluster consists of *3 gcloud VMs*, with a Spark node and a Cassandra
node deployed on each VM. Each VM has *8 CPU cores* and *30 GB* of RAM.
Spark is deployed in standalone cluster mode.
Spark version is *2.1.0*.
I am using the DataStax spark-cassandra-connector version *2.0.1*.
Cassandra version is *3.9*.
Each Spark executor is allowed 10 GB of RAM and there are 2 executors
running on each node.

Is the problem related to my machine resources? How can I root cause or fix
this?
Any help will be greatly appreciated.

Thanks,
Faraz
