Hi,

I saw the following error message in the executor logs:
*Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000662f00000, 520093696, 0) failed; error='Cannot allocate memory' (errno=12)*

By increasing the RAM of my nodes to 40 GB each, I was able to get rid of the RPC connection failures. However, the results I am getting after copying the data are still incorrect. Before termination, the executor logs show this error message:

*ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM*

I believe the executors are not shutting down gracefully and that is causing Spark to lose some data. Can anyone please explain how I can debug this further?

Thanks,
Faraz

On Mon, Feb 26, 2018 at 4:46 PM, Faraz Mateen <fmat...@an10.io> wrote:

> Hi,
>
> I think I have a situation where Spark is silently failing to write data
> to my Cassandra table. Let me explain my current situation.
>
> I have a source table of around 402 million records with 84 columns. Its
> schema is roughly:
>
> *id (text) | datetime (timestamp) | field1 (text) | ... | field84 (text)*
>
> To optimize queries on the data, I am splitting it into multiple tables
> using the Spark job mentioned below. Each new table is meant to hold the
> data of a single field from the source table and has the following
> structure:
>
> *id (text) | datetime (timestamp) | day (date) | value (text)*
>
> where the "value" column holds one field column from the source table.
> The source table's *402 million* records amount to around *85 GB* of data
> distributed over *3 nodes (27 + 32 + 26 GB)*. Each new table being
> populated should end up with the same number of records, but some data is
> missing.
>
> Initially, I assumed a problem with the data in the source table, so I
> copied one week of data from it into another table with the same schema
> and split that the same way as before. This time, the field-specific
> table had the same number of records as the source. I repeated this with
> a data set from another time period, and again the record counts matched.
>
> This has led me to believe that there is some problem with Spark's
> handling of the large data set. Here is my spark-submit command to split
> the data:
>
> *~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master
> spark://10.128.0.18:7077 --packages
> datastax:spark-cassandra-connector:2.0.1-s_2.11 --conf
> spark.cassandra.connection.host="10.128.1.1,10.128.1.2,10.128.1.3" --conf
> "spark.storage.memoryFraction=1" --conf spark.local.dir=/media/db/
> --executor-memory 10G --num-executors=6 --executor-cores=3
> --total-executor-cores 18 split_data.py*
>
> *split_data.py* is the name of my pyspark application. It is essentially
> executing the following query:
>
> *("select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "+field+"
> as value from data " )*
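>
> For reference, a simplified sketch of what the full job in split_data.py
> is doing (keyspace and destination table names here are placeholders, and
> the destination table already exists in Cassandra):
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("split_data").getOrCreate()
>
> # Register the 84-column source table as a temp view for the query.
> spark.read \
>     .format("org.apache.spark.sql.cassandra") \
>     .options(keyspace="my_keyspace", table="data") \
>     .load() \
>     .createOrReplaceTempView("data")
>
> field = "field1"  # the source column being split out on this run
>
> result = spark.sql(
>     "select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "
>     + field + " as value from data ")
>
> # Append the result into the per-field destination table.
> result.write \
>     .format("org.apache.spark.sql.cassandra") \
>     .options(keyspace="my_keyspace", table="data_" + field) \
>     .mode("append") \
>     .save()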
>
> With the spark-submit logging level set to WARN, I saw the following
> WARNINGS and ERRORS on the console:
>
> https://gist.github.com/anonymous/e05f1aaa131348c9a5a9a2db6d141f8c#file-gistfile1-txt
>
> The Spark job does not crash after these errors and warnings. However,
> when I check the number of records in the new table, it is always less
> than the number of records in the source table. Moreover, the number of
> records in the destination table is not the same after each run of the
> query.
>
> My cluster consists of *3 gcloud VMs*, with a Spark and a Cassandra node
> deployed on each VM.
> Each VM has *8 cores* of CPU and *30 GB* of RAM.
> Spark is deployed in standalone cluster mode.
> The Spark version is *2.1.0*.
> I am using datastax spark-cassandra-connector version *2.0.1*.
> The Cassandra version is *3.9*.
> Each Spark executor is allowed 10 GB of RAM, and there are 2 executors
> running on each node.
>
> Is the problem related to my machine resources? How can I root-cause or
> fix this?
> Any help will be greatly appreciated.
>
> Thanks,
> Faraz
>
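
In case it helps, the way I am verifying the record counts after each run
is essentially a count comparison through the connector, along these lines
(simplified; keyspace and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify_counts").getOrCreate()

def table_count(keyspace, table):
    # Full scan through the connector; slow on ~400M rows, but it gives
    # an exact figure to compare source vs. destination.
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load()
            .count())

src = table_count("my_keyspace", "data")
dst = table_count("my_keyspace", "data_field1")
print("source=%d destination=%d missing=%d" % (src, dst, src - dst))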