Re: [Celebrate] Arrow has reached 2000 stargazers

2018-05-28 Thread simba nyatsanga
Congratulations everyone! On Mon, 28 May 2018 at 21:42 Li Jin wrote: > Congrats everyone! > On Mon, May 28, 2018 at 3:21 PM Jacques Nadeau wrote: > > > Woo! > > > > On Mon, May 28, 2018 at 4:50 PM, Wes McKinney > wrote: > > > > > Congrats all! The journey continues > > > > > > On Mon, May 28,

Memory mapping error on pq.read_table

2018-02-08 Thread simba nyatsanga
Hi Everyone, I've encountered a memory mapping error when attempting to read a parquet file into a Pandas DataFrame. It seems to happen only intermittently, though; so far I've encountered it once. In my case the pq.read_table code is being invoked in a Linux docker container. I had a look at the do

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-30 Thread simba nyatsanga
hu, 25 Jan 2018 at 15:37 simba nyatsanga wrote: > Thanks all for the great feedback! > > Thanks Daniel for the sample data sets. I loaded them up and they're quite > comparable in size to some of the data I'm dealing with. In my case the > shapes range from 150 to ~100millio

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-25 Thread simba nyatsanga
instance, if > you > > > store measurements, it is very typical to have very strong > correlations. > > > Likewise if the rows are, say, the time evolution of an optimization. > You > > > also have a very small number of rows which can penalize system that > >

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
them somewhere and link > to them in the mails. Attachments are always stripped on Apache > mailing lists. > Uwe > > > On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote: > > Hi Everyone, > > > > I did some benchmarking to compare the disk size performance w

[Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
Hi Everyone, I did some benchmarking to compare the disk size performance when writing Pandas DataFrames to parquet files using Snappy and Brotli compression. I then compared these numbers with those of my current file storage solution. In my current (non-Arrow+Parquet) solution, every column in
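
A minimal sketch of the kind of disk-size comparison this thread describes. It is not the author's benchmark. Snappy and Brotli are not part of the Python standard library, so zlib, bz2 and lzma are used here as assumed stand-ins. The helper name compressed_sizes and the sample data are hypothetical, and the sizes it prints say nothing about the figures reported in the thread.

```python
import bz2
import lzma
import zlib


def compressed_sizes(payload: bytes) -> dict:
    """Return the size in bytes of payload, raw and under each stand-in codec."""
    return {
        "raw": len(payload),
        "zlib": len(zlib.compress(payload)),
        "bz2": len(bz2.compress(payload)),
        "lzma": len(lzma.compress(payload)),
    }


# Hypothetical column of repetitive measurements, serialized as text.
sample = ",".join(str(i % 100) for i in range(50_000)).encode("ascii")

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    # Report each size alongside its fraction of the uncompressed size.
    print(f"{name:>4}: {size} bytes ({size / sizes['raw']:.1%} of raw)")
```

The same pattern would apply to Parquet output: write the same DataFrame once per codec and compare the resulting file sizes, keeping the input data identical across runs.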

Re: Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

2018-01-22 Thread simba nyatsanga
t. > > - Wes > > On Mon, Jan 22, 2018 at 4:50 PM, simba nyatsanga > wrote: > > Hi Uwe, > > > > Thank you very much for the detailed explanation. I have a much better > > understanding now. > > > > Cheers > > > > On Mon, 22 Jan 2018 at

Re: Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

2018-01-22 Thread simba nyatsanga
Hi Uwe, Thank you very much for the detailed explanation. I have a much better understanding now. Cheers On Mon, 22 Jan 2018 at 19:37 Uwe L. Korn wrote: > Hello Simba, > > find the answers inline. > > On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote: > > Hi Every

Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

2018-01-21 Thread simba nyatsanga
Hi Everyone, I've got two questions that I'd like help with: 1. Pandas and NumPy arrays can handle multiple types in a sequence, e.g. a float and a string, by using dtype=object. From what I gather, Arrow arrays enforce a uniform type depending on the type of the first encountered element in a s

Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
r an > ndarray. Returning ndarray is faster and much more memory efficient; > producing lists would require creating a lot of Python objects. > > Hypothetically, we could add an option to return lists instead of > ndarrays if there were a strong enough need. > > - Wes >

Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
arrow_to_pandas.cc#L541 > > - Wes > > On Thu, Jan 18, 2018 at 1:26 PM, simba nyatsanga > wrote: > > > Good day everyone, > > > > I noticed what looks like type inference happening after persisting a > > pandas DataFrame where one of the column values is a li

PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
Good day everyone, I noticed what looks like type inference happening after persisting a pandas DataFrame where one of the column values is a list. When I load up the DataFrame again and do df.to_dict(), the value is no longer a list but a numpy array. I dug through functions in the pandas_compat.

Re: Trying to build to build pyarrow for python 2.7

2018-01-17 Thread simba nyatsanga
is progressing. > > - Wes > > On Sun, Jan 14, 2018 at 9:19 AM, simba nyatsanga > wrote: > > Thanks a lot. I see that there's a PR that's been opened to resolve the > > encoding issue - https://github.com/apache/arrow/pull/1476 > > > > Do you think this

Re: Trying to build to build pyarrow for python 2.7

2018-01-14 Thread simba nyatsanga
's merged? Kind Regards On Sun, 14 Jan 2018 at 15:50 Uwe L. Korn wrote: > Nice to hear that it worked. > > Updating the docs should not be necessary, we should rather see that we > soon get a 0.9.0 release out (but that will also take some more weeks) > > Uwe > > On Su

Re: Trying to build to build pyarrow for python 2.7

2018-01-14 Thread simba nyatsanga
the package discovery using pkg-config instead of the > *_HOME variables. Currently this is the only path on which we can > auto-detect the extension of the parquet shared library. > > Nevertheless, I will take a shot at fixing the issues as it seems that > multiple users run into it.

Re: Trying to build to build pyarrow for python 2.7

2018-01-11 Thread simba nyatsanga
rquet.1.dylib -> libparquet.1.3.2.dylib
-rw-r--r--  1 simba staff 3.0M Jan 11 18:45 libparquet.a
lrwxr-xr-x  1 simba staff  18B Jan 11 18:45 libparquet.dylib -> libparquet.1.dylib
Just to clarify also, I'm attempting to build the wheel from within the *arrow/python* folder where th

Re: Trying to build to build pyarrow for python 2.7

2018-01-10 Thread simba nyatsanga
ng development instructions in > > http://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos > or something else? > > - Wes > > On Wed, Jan 10, 2018 at 11:20 AM, simba nyatsanga > wrote: > > Hi, > > > > I've created a python 2.7 v

Trying to build to build pyarrow for python 2.7

2018-01-10 Thread simba nyatsanga
Hi, I've created a Python 2.7 virtualenv in my attempt to build the pyarrow project. But I'm having trouble running one of the commands specified in the development docs on GitHub, specifically this command: cd arrow/python python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \ --with-p