[ https://issues.apache.org/jira/browse/ARROW-6776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147676#comment-17147676 ]
Antoine Pitrou commented on ARROW-6776:
---------------------------------------

The latest PyArrow wheels (*) are much lighter:
{code}
$ du -hs venv-3.7/lib/python3.7/site-packages/pyarrow/
57M  venv-3.7/lib/python3.7/site-packages/pyarrow/
{code}

PS: see here for nightly PyArrow wheels: https://arrow.apache.org/docs/python/install.html#installing-nightly-packages

> [Python] Need a lite version of pyarrow
> ---------------------------------------
>
>                 Key: ARROW-6776
>                 URL: https://issues.apache.org/jira/browse/ARROW-6776
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: Haowei Yu
>            Priority: Major
>
> I am currently building a library package on top of pyarrow, so I declare
> pyarrow as a dependency and ship the package to our customers. When a
> customer installs our package, pip also installs pyarrow and pyarrow's own
> dependency (numpy), and the combined size of these dependencies is huge.
> {code:bash}
> (py36env) [hyu@c6x64-hyu-newuser-final-clone connector]$ ls -l --block-size=M /home/hyu/py36env/lib/python3.6/site-packages/pyarrow/
> total 186M
> {code}
> And numpy is around 80 MB. The total is more than 250 MB.
> Our customers want to bundle all dependencies and run the code inside AWS
> Lambda, but they hit the size limit and the code fails to run.
> Looking into pyarrow, I saw that multiple .so files are shipped both with and
> without a version suffix. I wonder if you could remove one of the two copies
> (either with or without the suffix); that alone would reduce the package size
> by about half.
> Further, our library only needs to use IPC and read data as record batches. I
> don't need Arrow Flight at all (it is the biggest .so file and takes around
> 100 MB). I wonder if you could publish a lite version of pyarrow so that I
> can specify the lite version as the dependency. Or maybe I need to build my
> own lite version and push it to PyPI.
> However, this approach causes a further problem if our customer is already
> using the "fat" version of pyarrow, unless you change the namespace of the
> lite version of pyarrow.
>
> Another alternative is to bundle pyarrow with our library (copy the whole
> directory into a vendored namespace) and ship it to our customers without
> specifying pyarrow as a dependency. The advantage of this approach is that I
> can build pyarrow with whatever options, sub-modules, and libraries I need.
> However, I tried many times and failed, because pyarrow uses absolute
> imports, which fail to import the scripts from the new location.
>
> Any insight into how I should resolve this issue?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
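
The report's observation that .so files ship both with and without a version suffix can be checked with a short script. What follows is only a minimal sketch using the Python standard library. The helper names (dir_size_bytes, versioned_duplicates) are hypothetical and not part of pyarrow, and the filename pattern is an assumption about how such copies are named (for example a libarrow.so next to a libarrow.so.14; the report does not list the actual filenames).

```python
import importlib.util
import os
import re
from collections import defaultdict

# Assumed naming scheme: "<name>.so" optionally followed by numeric version
# components, e.g. libarrow.so, libarrow.so.14, libarrow.so.14.1.0.
_SO_PATTERN = re.compile(r"^(?P<base>.+\.so)(?:\.\d+)*$")


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def versioned_duplicates(root):
    """Group shared libraries directly inside root by their unversioned name.

    Returns {base_name: [(file_name, size_in_bytes), ...]} for every base
    name that has more than one real (non-symlink) file.
    """
    groups = defaultdict(list)
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        match = _SO_PATTERN.match(name)
        if match and os.path.isfile(path) and not os.path.islink(path):
            groups[match.group("base")].append((name, os.path.getsize(path)))
    return {base: files for base, files in groups.items() if len(files) > 1}


if __name__ == "__main__":
    # Locate the installed package without importing it.
    spec = importlib.util.find_spec("pyarrow")
    if spec is None or not spec.submodule_search_locations:
        print("pyarrow is not installed in this environment")
    else:
        pkg_dir = list(spec.submodule_search_locations)[0]
        print(f"{pkg_dir}: {dir_size_bytes(pkg_dir) / 2**20:.0f} MiB")
        for base, files in versioned_duplicates(pkg_dir).items():
            for name, size in files:
                print(f"  {base}: {name} ({size / 2**20:.1f} MiB)")
```

Running this inside the virtualenv that has pyarrow installed prints the package size and any library that appears under more than one name, which gives a rough idea of how much a de-duplicated or Flight-free build would save.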