A file/job that large seems like a job for MapReduce rather than 
App Engine frontend or backend instances, IMO.

My 2c.

On Tuesday, July 30, 2013 2:29:31 AM UTC-4, Andrew Free wrote:
>
> I'm trying to parse a big (5GB) XML file (a product catalog) into the Google 
> datastore. The issue I am having is that it takes up a lot of memory. I was 
> able to get the memory use of the parsing part down by reading the file line 
> by line and deleting elements as I go. However, something is still being 
> left behind.
>
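
For reference, the "parse a piece, then delete it" approach described above can be sketched with the standard library's xml.etree.ElementTree.iterparse. This is not the pastebin code, only a minimal illustration; the <catalog> root element and the sample data are assumptions, while the product, name and sku tags are taken from the post.

```python
import io
import xml.etree.ElementTree as ET


def iter_products(source, record_tag="product"):
    """Yield one {child tag: text} dict per record, freeing each record as it goes."""
    # Asking for "start" events lets us grab the root element before any
    # child has been fully parsed.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            # Detach finished records from the root so the partially built
            # tree does not keep every parsed product alive.
            root.clear()


# Made-up sample standing in for the real 5GB catalog.
sample = io.BytesIO(
    b"<catalog>"
    b"<product><name>Widget</name><sku>W-1</sku></product>"
    b"<product><name>Gadget</name><sku>G-2</sku></product>"
    b"</catalog>"
)
products = list(iter_products(sample))
```

Clearing the root, and not only the element just handled, matters here: otherwise the root keeps a reference to every emptied <product> element for the whole run.
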
> My code is http://pastebin.com/ESARQikC
>
> I believe the issue is occurring in this specific function 
> (process_element):
>
> def process_element(self,item):
>     if item.tag == "programname":
>         self.Plist.append(item.text)
>     elif item.tag == 'name':
>         self.Plist.append(item.text)        
>     elif item.tag == 'description':
>         self.Plist.append(item.text)
>     elif item.tag == 'sku':
>         self.Plist.append(item.text)
>     elif item.tag == 'manufacturer':
>         self.Plist.append(item.text)
>     elif item.tag == 'price':
>         self.Plist.append(item.text)
>     elif item.tag == 'buyurl':
>         self.Plist.append(item.text)
>     elif item.tag == 'imageurl':
>         self.Plist.append(item.text)
>     elif item.tag == 'advertisercategory':
>         self.Plist.append(item.text)
>     elif item.tag=="product":
>         Product(
>             programname=("%s" % self.Plist[0]),
>             name=("%s" % self.Plist[1]),
>             description=("%s" % self.Plist[2][0:500]),
>             sku=("%s" % self.Plist[3]),
>             manufacturer=("%s" % self.Plist[4]),
>             price=("%s" % self.Plist[5]),
>             buyurl=("%s" % self.Plist[6]),
>             imageurl=("%s" % self.getBigImageUrl(self.Plist[7])),
>             advertisercategory=("%s" % self.Plist[8])).put()
>
>         self.count+=1
>         print self.count
>         if self.count%15000 == 0:      
>             time.sleep(10000)
>         for ob in self.Plist:
>             del ob
>         del self.Plist
>         self.Plist=[]
>     del item
>
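
A side note on the function above: "for ob in self.Plist: del ob" only unbinds the loop variable, and "del item" only unbinds the local name, so neither frees anything that "self.Plist = []" does not already free. The positional indexing (self.Plist[0] to self.Plist[8]) also assumes every product carries all nine tags in the same order. A rough sketch of the same function keyed by tag name follows. The save callback is a hypothetical stand-in for Product(...).put(), and the getBigImageUrl step is left out for brevity.

```python
import io
import xml.etree.ElementTree as ET

FIELD_TAGS = frozenset([
    "programname", "name", "description", "sku", "manufacturer",
    "price", "buyurl", "imageurl", "advertisercategory",
])


class ProductCollector(object):
    """Collects child-element text by tag name until a <product> closes."""

    def __init__(self, save):
        self.save = save   # hypothetical callback, stands in for Product(...).put()
        self.fields = {}
        self.count = 0

    def process_element(self, item):
        if item.tag in FIELD_TAGS:
            self.fields[item.tag] = item.text or ""
        elif item.tag == "product":
            record = dict(self.fields)
            # Same 500-character cap on the description as the original.
            record["description"] = record.get("description", "")[:500]
            self.save(record)
            self.count += 1
            # Rebinding is enough: the old dict is freed once nothing
            # else references it.
            self.fields = {}


saved = []
collector = ProductCollector(saved.append)
xml_bytes = b"<product><name>Widget</name><price>9.99</price></product>"
# iterparse reports "end" events by default, so children arrive before
# their enclosing <product>.
for _, elem in ET.iterparse(io.BytesIO(xml_bytes)):
    collector.process_element(elem)
```

With a dict, a product that is missing a tag simply lacks that key, and values can no longer shift into the wrong field.
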
> When I comment out the Product().put() line and run it, it can go through 
> tons of lines without making much of a memory impact. The reason I added 
> the sleep in the middle is that I thought some subprocesses that GAE 
> spawns were adding the data to the datastore and might need some time to 
> operate. So I waited after adding 15000 items to see if any RAM would be 
> freed up (I purged memory on the OS side as well); however, it did not 
> help. Is this something in my code, or something I can't change that is 
> related to adding data to a datastore? I'm stuck and confused after 
> hours/days of playing around with it.
>
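
Two things may be worth checking here. First, time.sleep() takes seconds, so time.sleep(10000) pauses for roughly 2.8 hours. Second, one put() per product means one datastore call per product. Writing in batches is a common alternative. Below is a minimal, datastore-agnostic sketch: put_batch is a hypothetical callback that could wrap db.put(list_of_entities) or ndb.put_multi(list_of_entities), and the batch size of 500 is only an illustrative default. If Product is an ndb model (the post does not say), ndb's in-context cache also keeps entities written during a request, so it may be worth writing with use_cache=False to see whether memory still grows.

```python
def in_batches(items, size):
    """Yield lists of at most `size` items from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []   # fresh list, so the previous batch can be freed
    if batch:
        yield batch      # final, possibly shorter, batch


def load(records, put_batch, batch_size=500):
    """Hand records to put_batch (a hypothetical callback) one batch at a time."""
    total = 0
    for batch in in_batches(records, batch_size):
        put_batch(batch)
        total += len(batch)
    return total


# Usage with a fake writer that just records each call.
calls = []
written = load(({"sku": "S-%d" % i} for i in range(1200)), calls.append)
```

Because records is consumed lazily, only one batch is held in memory at a time, provided the writer does not keep its own references to what it has written.
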
