Re: [google-appengine] Memory usage while parsing XML file into Google App Engine datastore

2013-08-01 Thread djidjadji
Have you looked into using the Remote API?

https://developers.google.com/appengine/docs/python/tools/remoteapi
https://developers.google.com/appengine/articles/remote_api

You use your local machine to parse the file [no memory problems],
create the needed objects, and put them in the datastore in batches.

You can do the batch writes with a modified version of the Mapper class found in

https://developers.google.com/appengine/articles/deferred
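
As a rough illustration of the parse-locally-and-write-in-batches idea (not the
Mapper class from the article), here is a minimal sketch. It assumes the
<product> elements are direct children of the root element and that each field
is a child of its <product>. The names iter_products, batches, load and
put_batch are hypothetical; put_batch stands in for whatever does the actual
write, for example a db.put() call on a list of Product entities over the
Remote API.

```python
import xml.etree.ElementTree as ET

# Field names taken from the process_element function in the original post.
FIELDS = ("programname", "name", "description", "sku", "manufacturer",
          "price", "buyurl", "imageurl", "advertisercategory")


def iter_products(source):
    """Yield one dict per <product>, detaching parsed elements as we go."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == "product":
            yield dict((f, elem.findtext(f, default="")) for f in FIELDS)
            # Assumption: products sit directly under the root, so clearing
            # the root releases every product that has been handled so far.
            root.clear()


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load(source, put_batch, size=500):
    """Parse `source`, pass each batch to `put_batch`, return the row count."""
    total = 0
    for batch in batches(iter_products(source), size):
        # Hypothetical hook: build the Product entities from the dicts here
        # and write the whole list with a single datastore call.
        put_batch(batch)
        total += len(batch)
    return total
```

Only one batch of rows is held at a time, so memory use should stay roughly
flat regardless of the file size. The batch size of 500 is an arbitrary
default, not a documented limit.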

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to google-appengine+unsubscr...@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
For more options, visit https://groups.google.com/groups/opt_out.




[google-appengine] Memory usage while parsing XML file into Google App Engine datastore

2013-07-30 Thread Andrew Free


I'm trying to parse a big (5GB) XML file (a product catalog) into the Google 
App Engine datastore. The issue I am having is that it takes up a lot of 
memory. I was able to get the memory used by the parsing part down by reading 
the file line by line and deleting elements as I go. However, something is 
still being left behind.

My code is http://pastebin.com/ESARQikC

I believe the issue is occurring in this specific function (process_element):

def process_element(self,item):
    if item.tag == "programname":
        self.Plist.append(item.text)
    elif item.tag == 'name':
        self.Plist.append(item.text)
    elif item.tag == 'description':
        self.Plist.append(item.text)
    elif item.tag == 'sku':
        self.Plist.append(item.text)
    elif item.tag == 'manufacturer':
        self.Plist.append(item.text)
    elif item.tag == 'price':
        self.Plist.append(item.text)
    elif item.tag == 'buyurl':
        self.Plist.append(item.text)
    elif item.tag == 'imageurl':
        self.Plist.append(item.text)
    elif item.tag == 'advertisercategory':
        self.Plist.append(item.text)
    elif item.tag=="product":
        Product(
            programname=("%s" % self.Plist[0]),
            name=("%s" % self.Plist[1]),
            description=("%s" % self.Plist[2][0:500]),
            sku=("%s" % self.Plist[3]),
            manufacturer=("%s" % self.Plist[4]),
            price=("%s" % self.Plist[5]),
            buyurl=("%s" % self.Plist[6]),
            imageurl=("%s" % self.getBigImageUrl(self.Plist[7])),
            advertisercategory=("%s" % self.Plist[8])).put()

        self.count+=1
        print self.count
        if self.count%15000 == 0:
            time.sleep(1)
        for ob in self.Plist:
            del ob
        del self.Plist
        self.Plist=[]
    del item

When I comment out the Product().put() line and run it, it can go through 
tons of lines without much of a memory impact. I added the sleep in the 
middle because I thought some subprocesses that GAE spawns were adding the 
data to the datastore and might need some time to operate. So I waited after 
adding 15000 items to see if any RAM would be freed up (I purged memory on 
the OS side as well), but it did not help. Is this something in my code, or 
something I can't change that is related to adding data to the datastore? 
I'm stuck and confused after hours/days of playing around with it.
