Thanks for all your help, Ahmet!

Comments inline.

On Thu, Jun 8, 2017 at 6:32 PM, Ahmet Altay <[email protected]> wrote:

> Thank you for the update, some questions inline.
>
> On Thu, Jun 8, 2017 at 6:21 PM, Dmitry Demeshchuk <[email protected]>
> wrote:
>
>> FYI, I tried to install a psycopg2 wheel from a file using the
>> "extra_packages" argument (although wheel installation is apparently
>> still an experimental feature), but this led to a UCS-2 vs UCS-4
>> compatibility problem (it looks like the Dataflow build of Python is
>> using UCS-2, while wheels for Linux generally use UCS-4).
>>
>
> What is the UCS-2 vs UCS-4 problem, and what is the compatibility issue?
>

Basically, when I tried to import psycopg2 inside a module, the pipeline
died with the following error:

/usr/local/lib/python2.7/dist-packages/psycopg2/_psycopg.so: undefined
symbol: PyUnicodeUCS2_DecodeUTF8

This issue is explained in the official Python FAQ:
https://docs.python.org/2.7/faq/extending.html#when-importing-module-x-why-do-i-get-undefined-symbol-pyunicodeucs2

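When Python 2 is compiled, a ./configure option (--enable-unicode=ucs2 or
--enable-unicode=ucs4) selects how many bytes are used per Unicode
character. According to that FAQ entry, an undefined PyUnicodeUCS2_* symbol
means the extension was built against a 2-byte (UCS-2) Python and is being
imported into a 4-byte (UCS-4) interpreter, so the mismatch is probably the
reverse of what I wrote above: Dataflow's Python would be UCS-4, and the
wheel I installed was built for UCS-2.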
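
A quick way to check which variant a given interpreter uses is
sys.maxunicode. Below is a minimal sketch; the helper name
unicode_build_width is my own, not part of any library. Running it on a
worker and on the machine that built the wheel would show whether the two
disagree.

```python
import sys


def unicode_build_width(maxunicode=sys.maxunicode):
    """Classify an interpreter as a narrow (UCS-2) or wide (UCS-4) build.

    On Python 2, sys.maxunicode is 0xFFFF for builds configured with
    --enable-unicode=ucs2 and 0x10FFFF for --enable-unicode=ucs4.
    """
    return "UCS-2" if maxunicode == 0xFFFF else "UCS-4"


if __name__ == "__main__":
    # Prints the variant of the interpreter running this script.
    print(unicode_build_width())
```

Note that Python 3.3 and later no longer have this distinction:
sys.maxunicode is always 0x10FFFF there, so the check is only meaningful
on Python 2 and early Python 3 interpreters.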



>
>
>
>>
>> What ended up working for me ultimately, though, is an approach similar
>> to juliaset, with a few small differences:
>> https://gist.github.com/doubleyou/27bf3abb0fc77a2bc9257e6adc5cfe8f
>>
>> Note two things here:
>>
>> 1. We import the "install" class from setuptools, not from distutils.
>> This, in fact, has been the core problem for me. I haven't yet checked
>> whether the juliaset example works for me at all, but I strongly suspect
>> that it may not, precisely because of this issue.
>>
>
> Please let us know if juliaset does not work for you as is.
>

Will do! I'll try to find some time to test it out tomorrow.


>
>
>>
>> 2. We handle commands in a simpler fashion, by just using one single
>> class.
>>
>> I'll make a Jira ticket later today or tomorrow to reflect my findings,
>> maybe make a pull request if I confirm that juliaset is not universally
>> working either, if that's fine.
>>
>
> It would be great if you could share this information in a JIRA issue.
> Juliaset is only an example of running commands at setup time; it does not
> solve all possible issues.
>

Sounds good. I'll create a JIRA when I have enough input information to
provide and a clean reproduction case.


>
> Ahmet
>
>
>>
>> On Tue, Jun 6, 2017 at 8:46 PM, Dmitry Demeshchuk <[email protected]>
>> wrote:
>>
>>> Yeah, I wasn't really pinning it myself, it's one of the dependency
>>> packages that depends on that specific version.
>>>
>>> Thanks for the information, I'll try to explicitly install 33.1.1 and
>>> see if it changes anything.
>>>
>>> On Tue, Jun 6, 2017 at 7:13 PM, Ahmet Altay <[email protected]> wrote:
>>>
>>>> Pinning setuptools is generally not a good practice. The reason is that,
>>>> at installation time, it might cause removal of the setuptools that is
>>>> being used to install packages.
>>>>
>>>> FWIW, Dataflow workers should have setuptools 33.1.1, which was
>>>> released on 2017/01/16.
>>>>
>>>> Ahmet
>>>>
>>>> On Tue, Jun 6, 2017 at 6:53 PM, Dmitry Demeshchuk <[email protected]
>>>> > wrote:
>>>>
>>>>> Thanks, Ahmet, it turned out that Stackdriver really had more logs
>>>>> than just the Dataflow logs section.
>>>>>
>>>>> So, I ended up seeing this error, which occurs consistently:
>>>>>
>>>>> I    Running setup.py install for dataflow: started
>>>>> I      Running setup.py install for dataflow: finished with status 'error'
>>>>> I      Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile:
>>>>> I      usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
>>>>> I         or: -c --help [cmd1 cmd2 ...]
>>>>> I         or: -c --help-commands
>>>>> I         or: -c cmd --help
>>>>> I
>>>>> I      error: option --single-version-externally-managed not recognized
>>>>> I
>>>>> I      ----------------------------------------
>>>>> I  Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-bXyST4-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-sHw6oI-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-bXyST4-build/
>>>>> I  /usr/local/bin/pip failed with exit status 1
>>>>>
>>>>>
>>>>> This seems to mean that the natively installed setuptools are too old,
>>>>> and the new command has been generated with a newer version of setuptools
>>>>> (specifically, my project has setuptools==36.0.1 as a dependency of some
>>>>> package). I'm still digging more through the Stackdriver logs but so far
>>>>> couldn't find out the exact reason of the failure.
>>>>>
>>>>> I'm also talking to the Dataflow folks; maybe they'll have a better
>>>>> idea. I'll also try to compare this to the output of successful
>>>>> pipelines and see if it gives me any ideas.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> On Tue, Jun 6, 2017 at 4:40 PM, Ahmet Altay <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 6, 2017 at 2:07 PM, Dmitry Demeshchuk <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Ahmet,
>>>>>>>
>>>>>>> Thanks a lot for pointing out that doc, I somehow missed it from the
>>>>>>> official Python SDK page!
>>>>>>>
>>>>>>> One thing that comes to mind is that one should generally use the
>>>>>>> 'install' command in setuptools, not 'build' as is done in the
>>>>>>> juliaset example:
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py#L113
>>>>>>> The reason is that the 'build' step seems to be executed on the
>>>>>>> original machine, not inside the runner's containers, while 'install'
>>>>>>> is triggered inside them. If I run a pipeline that uses setup.py with
>>>>>>> a "build" step, it fails because it cannot run
>>>>>>> "apt-get install libpq-dev" on a Mac.
>>>>>>>
>>>>>>
>>>>>> Thank you. I believe this example should work similarly with install
>>>>>> commands. Also, if possible, please file a JIRA issue with your ideas
>>>>>> and we can work on improving things.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I'm still trying to make it work with either build or install steps,
>>>>>>> talking to the Dataflow folks in parallel to get more understanding
>>>>>>> of what I'm doing wrong (Dataflow doesn't send out installation
>>>>>>> failure logs to Stackdriver, only runtime logs, so it seems).
>>>>>>>
>>>>>>
>>>>>> Have you tried looking at the worker-startup logs? All of the logs
>>>>>> should be in Stackdriver.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 6, 2017 at 9:21 AM, Ahmet Altay <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Please see Managing Python Pipeline Dependencies [1] for various
>>>>>>>> ways of installing additional dependencies. The section on non-Python
>>>>>>>> dependencies is relevant to your question.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Ahmet
>>>>>>>>
>>>>>>>> [1] https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
>>>>>>>>
>>>>>>>> On Mon, Jun 5, 2017 at 11:52 PM, Morand, Sebastien <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm interested too. For instance, it would be nice to add an SFTP
>>>>>>>>> BoundedSource, but that requires compiling paramiko against the SSL
>>>>>>>>> library (and therefore installing ssl-dev).
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> *Sébastien MORAND*
>>>>>>>>> Team Lead Solution Architect
>>>>>>>>> Technology & Operations / Digital Factory
>>>>>>>>> Veolia - Group Information Systems & Technology (IS&T)
>>>>>>>>> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
>>>>>>>>> Bureau 0144C (Ouest)
>>>>>>>>> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
>>>>>>>>> *www.veolia.com <http://www.veolia.com>*
>>>>>>>>>
>>>>>>>>> On 6 June 2017 at 08:01, Dmitry Demeshchuk <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi again, folks,
>>>>>>>>>>
>>>>>>>>>> How should I go about installing Python packages that need to be
>>>>>>>>>> built and/or require native dependencies such as shared libraries?
>>>>>>>>>>
>>>>>>>>>> I guess I could potentially build the C-based modules using the
>>>>>>>>>> same versions of the kernel and glibc that Dataflow is running, but
>>>>>>>>>> it doesn't seem like there's any way to install shared libraries on
>>>>>>>>>> these boxes, right?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Dmitry Demeshchuk.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Dmitry Demeshchuk.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Dmitry Demeshchuk.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Dmitry Demeshchuk.
>>>
>>
>>
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
>


-- 
Best regards,
Dmitry Demeshchuk.
