Hi Shivam,

  Is it possible to use Python instead of Java on Dataflow to do the update 
on Datastore? If so, where can I find an example?

Best regards,

On Tuesday, 15 August 2017 18:04:58 UTC+1, Shivam(Google Cloud Support) 
wrote:
>
> The job will tend to be slow for that number of entities when following the 
> example here 
> <https://cloud.google.com/appengine/articles/update_schema#updating-existing-entities>.
>  
> The proper solution for a Datastore MapReduce-style job in the cloud would 
> be Datastore I/O using Dataflow 
> <https://cloud.google.com/dataflow/model/datastore-io>.
>
>
> Dataflow SDKs provide an API for reading data from and writing data to a 
> Google Cloud Datastore database. Its programming model is designed to 
> simplify the mechanics of large-scale data processing. When you program 
> with a Dataflow SDK, you are essentially creating a data processing job to 
> be executed by one of the Cloud Dataflow runner services. This model lets 
> you concentrate on the logical composition of your data processing job, 
> rather than the physical orchestration of parallel processing. You can 
> focus on what you need your job to do instead of exactly how that job gets 
> executed.
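As a sketch of what such a pipeline could look like in the Beam Python SDK (the `v1new` module paths, `ReadFromDatastore`/`WriteToDatastore`, and the `fix_entity` logic are illustrative assumptions, not taken from this thread):

```python
# Sketch of a Datastore mass update as a Dataflow/Beam Python pipeline.
# The update logic is a plain function so it can be tested without GCP;
# the pipeline wiring assumes the apache_beam "v1new" Datastore API.

def fix_entity(entity):
    """Strip whitespace from 'name' and default 'indexed' to 0."""
    props = entity.properties  # v1new entities expose a plain dict here
    if isinstance(props.get('name'), str):
        props['name'] = props['name'].strip()
    props.setdefault('indexed', 0)
    return entity

def run(project, kind):
    # Imported lazily so the update logic above stays testable locally.
    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1new.datastoreio import (
        ReadFromDatastore, WriteToDatastore)
    from apache_beam.io.gcp.datastore.v1new.types import Query

    with beam.Pipeline() as p:
        (p
         | 'Read' >> ReadFromDatastore(Query(kind=kind, project=project))
         | 'Fix' >> beam.Map(fix_entity)
         | 'Write' >> WriteToDatastore(project))
```

With the DataflowRunner selected in the pipeline options (project, region, temp_location), Dataflow shards the read and write across workers for you.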
>
> If you choose to stick with MapReduce on App Engine, it is recommended to 
> file any issues you experience directly with the engineering team on their 
> GitHub repository 
> <https://github.com/GoogleCloudPlatform/appengine-mapreduce>.
>
> On Tuesday, August 15, 2017 at 6:00:37 AM UTC-4, Filipe Caldas wrote:
>>
>> The job was actually doing slightly more than just setting a property to 
>> a default value; it was also doing a .strip() on one of the fields due to 
>> an error in our insert scripts. So in some cases there is a need to do a 
>> mass update on all entities; it definitely doesn't happen often, but we 
>> would rather not re-insert all the entities in the table.
>>
>> The documented method of updating entities works fine, but as many other 
>> users have noticed, in any case where the number of rows is large (> 10M) 
>> it takes over a week to finish. It is also definitely much cheaper to run 
>> that MapReduce, but it takes too long.
>>
>> The way we found to do it "safely" (meaning we can be sure the task will 
>> be done, and in a bounded amount of time) was to instead use a VM that 
>> spawns about 5 threads and reads/updates the entities on Datastore in 
>> parallel (and even this is still taking about 2 days to finish for 12M 
>> entities).
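A rough sketch of that VM approach with the google-cloud-datastore client and a small thread pool (`Client.query` and `put_multi` are the standard client calls, but the batching helper, `update_batch`, and the overall wiring are illustrative assumptions):

```python
# Sketch of a parallel mass update from a single VM: a few worker
# threads, each writing batches of up to 500 entities per put_multi
# call (the Datastore per-call limit).

from concurrent.futures import ThreadPoolExecutor

BATCH = 500

def batches(items, size=BATCH):
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def update_batch(client, entities):
    """Apply the fix in place, then write the whole batch back."""
    for e in entities:
        if isinstance(e.get('name'), str):
            e['name'] = e['name'].strip()
        e.setdefault('indexed', 0)
    client.put_multi(entities)

def run(project, kind, workers=5):
    # A real job would page through results with cursors instead of
    # holding 12M entities in memory; this is deliberately simplified.
    from google.cloud import datastore
    client = datastore.Client(project=project)
    entities = list(client.query(kind=kind).fetch())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in batches(entities):
            pool.submit(update_batch, client, chunk)
```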
>>
>> On Friday, 11 August 2017 21:46:20 UTC+1, Shivam(Google Cloud Support) 
>> wrote:
>>>
>>> There should be no actual need to mass-put a new property on all of your 
>>> entities and set that new property to a default value, since Datastore 
>>> supports entities both with and without set property values (as you have 
>>> noticed with the failed MapReduce job). 
>>>
>>> You can assume that if an entity does not have the property, it is equal 
>>> to the default ("indexed=0"). You can then set this value directly in 
>>> your application at read time: if the property exists, read it and use 
>>> it; otherwise use a hard-coded default and set the value then in your 
>>> code (i.e. only when the entity is being read).
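That read-time pattern might be sketched like this (plain-Python illustration; the property name "indexed" comes from the thread, the helper itself is hypothetical):

```python
# Lazy migration at read time: entities written before the schema
# change have no 'indexed' property, so fall back to a default and
# backfill it so the entity is migrated on its next put().

def read_indexed(entity, default=0):
    if 'indexed' not in entity:
        entity['indexed'] = default  # backfill only when actually read
    return entity['indexed']
```

This spreads the migration cost over normal traffic instead of paying for one huge batch job.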
>>>
>>> Updating existing entities is documented here 
>>> <https://cloud.google.com/appengine/articles/update_schema#updating-existing-entities>.
>>>
>>> Without knowing what happened exactly, it is not possible to know the 
>>> reason for the 70M reads. However, I would recommend viewing this post 
>>> <https://stackoverflow.com/a/15946970>, which might answer your question.
>>>
>>>
>>> On Friday, August 11, 2017 at 9:02:53 AM UTC-4, Filipe Caldas wrote:
>>>>
>>>> Hi,
>>>>
>>>>   I am currently trying to update a kind in my database and add a field 
>>>> (indexed=0); the table has more than 10M entities.
>>>>
>>>>   I tried to use MapReduce for App Engine and launched a fairly simple 
>>>> job where the mapper only sets the property and yields an 
>>>> operation.db.Put(). The only problem is that some of the shards failed, 
>>>> so the job was stopped and automatically restarted.
>>>>
>>>>   The problem is, launching this job on 10M entities cost me about 
>>>> $100, and the job did not finish (the retry was going slowly, so I 
>>>> don't think they billed much for that). 
>>>>   
>>>> The extra annoying thing is that there is no other way that I know of 
>>>> to update these properties "fast" enough (the MapReduce took over 7 
>>>> hours to fail on 10M). I know Beam/Dataflow is apparently the way to 
>>>> go, but documentation on doing basic operations like updating Datastore 
>>>> entities is still very poor (not sure if it can even be done).
>>>>
>>>>   So, my question is: is there a fast and *safe* way to update my 
>>>> entities that does not consist of doing 10M fetches and puts in sequence?
>>>>
>>>>   Bonus question: does anyone know why I was billed 70M reads on only 
>>>> 10M entities?
>>>>
>>>> Best regards,
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to google-appengine+unsubscr...@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at https://groups.google.com/group/google-appengine.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/google-appengine/8b53db45-adf8-4d7f-bd0a-54e0398b619e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
