Hi

We are trying to insert millions of rows into a table by executing batches of 
PreparedStatements.

We found that when we use 'session.execute(batch)', it writes more data but is 
very slow.
However, if we use 'session.execute_async(batch)', it is relatively fast, but 
once it reaches a certain limit it fills up the memory of the Python process.

Our implementation:
Cassandra 3.7.0 cluster ring with 3 nodes (RedHat, 150GB disk, 8GB of RAM each)

Python 2.7.12

Does anybody know how to reduce the memory use of the Cassandra Python driver 
API, specifically for execute_async? Thank you!



===CODE ======================================
        sqlQuery = ("INSERT INTO tableV (id, sample_name, pos, ref_base, "
                    "var_base) values (?,?,?,?,?)")
        random_numbers_for_strains = random.sample(xrange(1,300), 200)
        random_numbers = random.sample(xrange(1,2000000), 200000)

        totalCounter  = 0
        c = 0
        time_init = time.time()
        for random_number_strain in random_numbers_for_strains:

            sample_name = None
            sample_name = 'sample'+str(random_number_strain)

            cassandraCluster = CassandraCluster.CassandraCluster()
            cluster = cassandraCluster.create_cluster_with_protocol2()
            session = cluster.connect()
            #session.default_timeout = 1800
            session.set_keyspace(self.KEYSPACE_NAME)

            preparedStatement = session.prepare(sqlQuery)

            counter = 0
            c = c + 1

            for random_number in random_numbers:

                totalCounter += 1
                if counter == 0 :
                    batch = BatchStatement()

                counter += 1
                if totalCounter % 10000 == 0 :
                    print "Total Count "+ str(totalCounter)

                batch.add(preparedStatement.bind([
                    uuid.uuid1(), sample_name, random_number,
                    random.choice('GT'), random.choice('AC')]))
                if counter % 50 == 0:
                    session.execute_async(batch)
                    #session.execute(batch)
                    batch = None
                    del batch
                    counter = 0

            time.sleep(2)
            session.cluster.shutdown()
            random_number= None
            del random_number
            preparedStatement = None
            session = None
            del session
            cluster = None
            del cluster
            cassandraCluster = None
            del cassandraCluster
            gc.collect()

===CODE ======================================
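
For discussion, here is a minimal sketch of one way the number of outstanding 
execute_async requests might be capped, since the loop above submits batches 
without ever waiting for them to complete. It assumes only that the session 
has execute_async() and that the returned future has add_callbacks(), as in 
the DataStax Python driver. The names execute_throttled and max_in_flight are 
hypothetical and not part of the driver, and this has not been run against 
our cluster.

```python
import threading


def execute_throttled(session, statements, max_in_flight=100):
    """Submit statements with session.execute_async while keeping at most
    max_in_flight requests outstanding (hypothetical helper, not driver API).

    Returns (number_submitted, list_of_errors).
    """
    slots = threading.Semaphore(max_in_flight)
    errors = []

    def on_success(_rows):
        # The request finished, so free its slot for the next statement.
        slots.release()

    def on_error(exc):
        # Record the failure, then free the slot so the loop keeps going.
        errors.append(exc)
        slots.release()

    submitted = 0
    for statement in statements:
        # Blocks here while max_in_flight requests are still pending, so
        # pending futures cannot pile up without bound.
        slots.acquire()
        future = session.execute_async(statement)
        future.add_callbacks(on_success, on_error)
        submitted += 1

    # Drain: once every slot can be re-acquired, all requests have completed.
    for _ in range(max_in_flight):
        slots.acquire()

    return submitted, errors
```

In the loop above, each finished BatchStatement could be yielded from a 
generator and passed in as statements, instead of calling 
session.execute_async(batch) directly. A suitable value for max_in_flight 
is a guess and would need tuning against the cluster.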



Kind regards,
Rajesh Radhakrishnan

