Hi, I think I have a situation where spark is silently failing to write data to my Cassandra table. Let me explain my current situation.
I have a table consisting of around 402 million records. The table consists of 84 columns. Table schema is something like this: *id (text) | datetime (timestamp) | field1 (text) | ..... | field 84 (text)* To optimize queries on the data, I am splitting it into multiple tables using spark job mentioned below. Each separated table must have data from just one field from the source table. New table has the following structure: *id (text) | datetime (timestamp) | day (date) | value (text)* where, "value" column will contain the field column from the source table. Source table has around *402 million* records which is around *85 GB* of data distributed on *3 nodes (27 + 32 + 26)*. New table being populated is supposed to have the same number of records but it is missing some data. Initially, I assumed some problem with the data in source table. So, I copied 1 weeks of data from the source table into another table with the same schema. Then I split the data like I did before but this time, field specific table had the same number of records as the source table. I repeated this again with another data set from another time period and again number of records in field specific table were equal to number of records in the source table. This has led me to believe that there is some problem with spark's handling of large data set. Here is my spark submit command to separate the data: *~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://10.128.0.18:7077 <http://10.128.0.18:7077/> --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 --con**f spark.cassandra.connection.host="10.128.1.1,10.128.1.2,10.128.1.3" --conf "spark.storage.memoryFraction=1" --conf spark.local.dir=/media/db/ --executor-memory 10G --num-executors=6 --executo**r-cores=3 --total-executor-cores 18 split_data.py* *split_data.py* is the name of my pyspark application. It is essentially executing the following query: *("select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "+field+" as value from data " )* The spark job does not crash after these errors and warnings. However when I check the number of records in the new table, it is always less than the number of records in source table. Moreover, the number of records in destination table is not the same after each run of the query. I changed logging level for spark submit to WARN and saw the following WARNINGS and ERRORS on the console: https://gist.github.com/anonymous/e05f1aaa131348c9a5a9a2db6d 141f8c#file-gistfile1-txt My cluster consists of *3 gcloud VMs*. A spark and a cassandra node is deployed on each VM. Each VM has *8 cores* of CPU and* 30 GB* RAM. Spark is deployed in standalone cluster mode. Spark version is *2.1.0* I am using datastax spark cassandra connector version *2.0.1* Cassandra Version is *3.9* Each spark executor is allowed 10 GB of RAM and there are 2 executors running on each node. Is the problem related to my machine resources? How can I root cause or fix this? Any help will be greatly appreciated. Thanks, Faraz