Re: PyArrow: Using is_in compute to filter list of strings in a Table

2020-11-04 Thread Niklas B
Never mind, I realized I can use pyarrow.compute.invert. Thank you again
for the super fast answer.
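
For the archives, the end-to-end pattern then looks roughly like this (a small
sketch with made-up ids and column names, not lifted verbatim from my code):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"id": ["uuid1", "uuid2", "uuid3"], "value": [1, 2, 3]})
# Boolean mask marking the rows I want to drop
drop = pc.is_in(table["id"], value_set=pa.array(["uuid1", "uuid2"]))
# Invert it and keep everything else
filtered = table.filter(pc.invert(drop))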

> On 4 Nov 2020, at 15:13, Niklas B  wrote:
> 
> Thank you! This looks awesome. Any good way to invert the ChunkedArray? I
> know I can cast to NumPy (experimental) and do it there, but would love a
> native Arrow version :)
> 
>> On 4 Nov 2020, at 14:45, Joris Van den Bossche 
>>  wrote:
>> 
>> Hi Niklas,
>> 
>> The "is_in" docstring is not directly clear about it, but you need to pass
>> the second argument as a keyword argument, using the "value_set" keyword name.
>> Small example:
>> 
>> In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a",
>> "c"]))
>> Out[19]:
>> 
>> [
>> true,
>> false,
>> true,
>> false
>> ]
>> 
>> You can find this keyword in the keywords of pc.SetLookupOptions.
>> We know the docstrings are not yet in a good state. This was recently
>> improved in https://issues.apache.org/jira/browse/ARROW-9164, and
>> we should maybe also try to inject the option keywords into the function
>> docstring.
>> 
>> Best,
>> Joris
>> 
>> On Wed, 4 Nov 2020 at 14:14, Niklas B  wrote:
>> 
>>> Hi,
>>> 
>>> I’m trying, in Python, to filter out certain rows (based on uuid strings)
>>> without reading the entire parquet file into memory. My approach is to read
>>> each row group, then try to filter it without casting it to pandas (since
>>> that’s expensive for data frames with lots of strings in them). Looking at
>>> the compute function list, my hope was to be able to use the `is_in`
>>> operator. How would you actually use it? My naive approach would be:
>>> 
>>> import pyarrow.compute as pc
>>> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
>>> # somehow invert the mask since it shows the ones that I don’t want
>>> 
>>> Above gives:
>>> 
>>>>>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
>>> Traceback (most recent call last):
>>> File "<stdin>", line 1, in <module>
>>> TypeError: wrapper() takes 1 positional argument but 2 were given
>>> 
>>> Trying to pass anything else into is_in, like a pa.array, results in
>>> segfaults.
>>> 
>>> Is the above at all possible with is_in?
>>> 
>>> Normally I would use pyarrow.parquet.ParquetDataset().filters(), but I need
>>> to apply the filter per row group, not to the entire file, so that I can
>>> write the modified row group back to disk in another file.
>>> 
>>> Regards,
>>> Niklas
> 



Re: PyArrow: Using is_in compute to filter list of strings in a Table

2020-11-04 Thread Niklas B
Thank you! This looks awesome. Any good way to invert the ChunkedArray? I know
I can cast to NumPy (experimental) and do it there, but would love a native
Arrow version :)

> On 4 Nov 2020, at 14:45, Joris Van den Bossche  
> wrote:
> 
> Hi Niklas,
> 
> The "is_in" docstring is not directly clear about it, but you need to pass
> the second argument as a keyword argument, using the "value_set" keyword name.
> Small example:
> 
> In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a",
> "c"]))
> Out[19]:
> 
> [
>  true,
>  false,
>  true,
>  false
> ]
> 
> You can find this keyword in the keywords of pc.SetLookupOptions.
> We know the docstrings are not yet in a good state. This was recently
> improved in https://issues.apache.org/jira/browse/ARROW-9164, and
> we should maybe also try to inject the option keywords into the function
> docstring.
> 
> Best,
> Joris
> 
> On Wed, 4 Nov 2020 at 14:14, Niklas B  wrote:
> 
>> Hi,
>> 
>> I’m trying, in Python, to filter out certain rows (based on uuid strings)
>> without reading the entire parquet file into memory. My approach is to read
>> each row group, then try to filter it without casting it to pandas (since
>> that’s expensive for data frames with lots of strings in them). Looking at
>> the compute function list, my hope was to be able to use the `is_in`
>> operator. How would you actually use it? My naive approach would be:
>> 
>> import pyarrow.compute as pc
>> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
>> # somehow invert the mask since it shows the ones that I don’t want
>> 
>> Above gives:
>> 
>>>>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
>> Traceback (most recent call last):
>>  File "<stdin>", line 1, in <module>
>> TypeError: wrapper() takes 1 positional argument but 2 were given
>> 
>> Trying to pass anything else into is_in, like a pa.array, results in
>> segfaults.
>> 
>> Is the above at all possible with is_in?
>> 
>> Normally I would use pyarrow.parquet.ParquetDataset().filters(), but I need
>> to apply the filter per row group, not to the entire file, so that I can
>> write the modified row group back to disk in another file.
>> 
>> Regards,
>> Niklas



PyArrow: Using is_in compute to filter list of strings in a Table

2020-11-04 Thread Niklas B
Hi,

I’m trying, in Python, to filter out certain rows (based on uuid strings)
without reading the entire parquet file into memory. My approach is to read each
row group, then try to filter it without casting it to pandas (since that’s
expensive for data frames with lots of strings in them). Looking at the compute
function list, my hope was to be able to use the `is_in` operator. How would you
actually use it? My naive approach would be:

import pyarrow.compute as pc
mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
# somehow invert the mask since it shows the ones that I don’t want

Above gives:

>>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wrapper() takes 1 positional argument but 2 were given

Trying to pass anything else into is_in, like a pa.array, results in
segfaults.

Is the above at all possible with is_in?

Normally I would use pyarrow.parquet.ParquetDataset().filters(), but I need to
apply the filter per row group, not to the entire file, so that I can write the
modified row group back to disk in another file.
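
In code, the loop I'm aiming for is roughly the following (a sketch: the file
names, the "id" column and the uuid list are placeholders, the mask-building
comes from the value_set/invert reply above in this archive, and
ParquetFile.schema_arrow / read_row_group are assumed to behave as documented):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

unwanted_ids = pa.array(["uuid1", "uuid2"])      # placeholder values
source = pq.ParquetFile("input.parquet")

with pq.ParquetWriter("output.parquet", source.schema_arrow) as writer:
    for i in range(source.num_row_groups):
        group = source.read_row_group(i)                       # one row group as a Table
        drop = pc.is_in(group["id"], value_set=unwanted_ids)   # rows to filter out
        writer.write_table(group.filter(pc.invert(drop)))      # keep only the rest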

Regards,
Niklas

Arrow on PyPy3 patch

2020-10-22 Thread Niklas B
Hi,

I’ve been working (together with the PyPy team) on getting Arrow to build on
PyPy3. I’m not looking for full feature coverage, but specifically at getting it
to work with pandas read_parquet/to_parquet, which it now does. There were a few
roadblocks, solved by the awesome Matti Picus on the PyPy team, and we now have a
patch that successfully builds pyarrow on PyPy3. The PyPy3 side has already been
patched.

The patch for pyarrow is on 
https://gist.githubusercontent.com/mattip/c9c8398b58721ae5893dc8134c353f28/raw/0daff3e11ceed6dcde485a56e6b8bd2b7ca48bbc/gistfile1.txt

A Dockerfile which builds everything is available on 
https://github.com/bivald/pyarrow-on-pypy3/blob/feature/latest-pypy-latest-pyarrow/Dockerfile
 
(https://github.com/bivald/pyarrow-on-pypy3/tree/feature/latest-pypy-latest-pyarrow)

A surprising number of tests pass (such as all the parquet tests, when I last
tested), while some other areas segfault, but none of those are super important
for me right now.

Would the arrow project be open to a PR with the above patch, even though it 
doesn’t give you full PyPy support?

Regards,
Niklas

Re: Using DNF-like filters on a (py)arrow Table already in memory (or probably: convert pyarrow table to UnionDataset)

2020-10-01 Thread Niklas B
Amazing, just what I need. Thank you so much!
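
For reference, chaining a couple of conditions into a DNF-style mask then looks
roughly like this (a sketch with made-up columns; I'm assuming pc.and_, pc.or_
and pc.greater behave like the other compute kernels shown below):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"a": [0, 1, 2, 3, 4], "b": [1, 2, 1, 2, 1]})
# (b == 1 AND a > 2) OR (b == 2)
mask = pc.or_(
    pc.and_(pc.equal(table["b"], pa.scalar(1)),
            pc.greater(table["a"], pa.scalar(2))),
    pc.equal(table["b"], pa.scalar(2)),
)
filtered = table.filter(mask)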

> On 1 Oct 2020, at 15:48, Joris Van den Bossche  
> wrote:
> 
> Hi Niklas,
> 
> In the datasets project, there is indeed the notion of an in-memory dataset
> (from a RecordBatch or Table); however, constructing such a dataset is
> currently not directly exposed in Python (except for writing it).
> 
> But, for RecordBatch/Table objects, you can also directly filter those with
> a boolean mask. A small example with a Table (but it will work the same with
> a RecordBatch):
> 
>>>> table = pa.table({'a': range(5), 'b': [1, 2, 1, 2, 1]})
>>>> table.filter(np.array([True, False, False, False, True]))
> pyarrow.Table
> a: int64
> b: int64
>>>> table.filter(np.array([True, False, False, False, True])).to_pandas()
>   a  b
> 0  0  1
> 1  4  1
> 
> Creating the boolean mask based on the table can be done with the
> pyarrow.compute module:
> 
>>>> import pyarrow.compute as pc
>>>> mask = pc.equal(table['b'], pa.scalar(1))
>>>> table.filter(mask).to_pandas()
>   a  b
> 0  0  1
> 1  2  1
> 2  4  1
> 
> So it's still more manual work than just specifying a DNF filter, but normally
> all the necessary building blocks are available (the goal is certainly to use
> those building blocks in a more general query engine that works for both
> in-memory tables and file datasets, eventually, but that doesn't exist yet).
> 
> Best,
> Joris
> 
> On Thu, 1 Oct 2020 at 15:01, Niklas B  <mailto:niklas.biv...@enplore.com>> wrote:
> 
>> Hi,
>> 
>> I have an in-memory dataset from Plasma that I need to filter before
>> running `to_pandas()`. It’s a very text-heavy dataset with a lot of rows
>> and columns (only about 30% of which is applicable for any operation). Now
>> I know that you can use DNF filters to filter a parquet file before reading
>> it into memory. I’m now trying to do the same for my pa.Table that is
>> already in memory. https://issues.apache.org/jira/browse/ARROW-7945
>> indicates it should be possible to unify datasets after construction, but
>> my arrow skills aren’t quite there yet.
>> 
>>> […]
>>> [data] = plasma_client.get_buffers([object_id], timeout_ms=100)
>>> buffer = pa.BufferReader(data)
>>> reader = pa.RecordBatchStreamReader(buffer)
>>> record_batch = pa.Table.from_batches(reader)
>> 
>> 
>> I’ve been reading up on Dataset, UnionDataset and how ParquetDataset does it
>> (http://arrow.apache.org/docs/_modules/pyarrow/parquet.html#ParquetDataset).
>> My thinking is that if I can cast my table to a UnionDataset, I can use the
>> same _filter() code as ParquetDataset does. But I’m not having any luck
>> with my (albeit very naive) approach:
>> 
>>>>>> pyarrow.dataset.UnionDataset(table.schema, [table])
>> 
>> But that just gives me:
>> 
>>> Traceback (most recent call last):
>>>  File "<stdin>", line 1, in <module>
>>>  File "pyarrow/_dataset.pyx", line 429, in
>> pyarrow._dataset.UnionDataset.__init__
>>> TypeError: Cannot convert pyarrow.lib.Table to pyarrow._dataset.Dataset
>> 
>> 
>> I’m guessing that somewhere deep in
>> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_dataset.py
>> there is an example of how to make a Dataset out of a table, but I’m not
>> finding it.
>> 
>> Any help would be greatly appreciated :)
>> 
>> Regards,
>> Niklas



Using DNF-like filters on a (py)arrow Table already in memory (or probably: convert pyarrow table to UnionDataset)

2020-10-01 Thread Niklas B
Hi,

I have an in-memory dataset from Plasma that I need to filter before running
`to_pandas()`. It’s a very text-heavy dataset with a lot of rows and columns
(only about 30% of which is applicable for any operation). Now I know that you
can use DNF filters to filter a parquet file before reading it into memory. I’m
now trying to do the same for my pa.Table that is already in memory.
https://issues.apache.org/jira/browse/ARROW-7945 indicates it should be
possible to unify datasets after construction, but my arrow skills aren’t
quite there yet.

> […]
> [data] = plasma_client.get_buffers([object_id], timeout_ms=100)
> buffer = pa.BufferReader(data)
> reader = pa.RecordBatchStreamReader(buffer)
> record_batch = pa.Table.from_batches(reader)


I’ve been reading up on Dataset, UnionDataset and how ParquetDataset does it
(http://arrow.apache.org/docs/_modules/pyarrow/parquet.html#ParquetDataset).
My thinking is that if I can cast my table to a UnionDataset, I can use the same
_filter() code as ParquetDataset does. But I’m not having any luck with my
(albeit very naive) approach:

> >>> pyarrow.dataset.UnionDataset(table.schema, [table])

But that just gives me:

> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_dataset.pyx", line 429, in 
> pyarrow._dataset.UnionDataset.__init__
> TypeError: Cannot convert pyarrow.lib.Table to pyarrow._dataset.Dataset


I’m guessing that somewhere deep in
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_dataset.py
there is an example of how to make a Dataset out of a table, but I’m not finding
it.

Any help would be greatly appreciated :)

Regards,
Niklas

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-09-27 Thread Niklas B
We too rely heavily on Plasma (we use Ray as well, but also Plasma independent
of Ray). I’ve started a thread on the Ray dev list to see if Ray’s Plasma can be
used standalone outside of Ray as well. That would allow those of us who use
Plasma to move to a standalone “Ray Plasma” when/if it’s removed from Arrow.

> On 26 Sep 2020, at 00:30, Wes McKinney  wrote:
> 
> I'd suggest as a preliminary that we stop packaging Plasma for 1-2
> releases to see who is affected by the component's removal. Usage may
> be more widespread than we realize, and we don't have much telemetry
> to know for certain.
> 
> On Tue, Aug 18, 2020 at 1:26 PM Antoine Pitrou  wrote:
>> 
>> 
>> Also, the fact that Ray has forked Plasma means their implementation
>> becomes potentially incompatible with Arrow's.  So even if we keep
>> Plasma in our codebase, we can't guarantee interoperability with Ray.
>> 
>> Regards
>> 
>> Antoine.
>> 
>> 
>> Le 18/08/2020 à 19:51, Wes McKinney a écrit :
>>> I do not think there is an urgency to remove Plasma from the Arrow
>>> codebase (as it currently does not cause much maintenance burden), but
>>> the reality is that Ray has already hard-forked and so new maintainers
>>> will need to come out of the woodwork to help support the project if
>>> it is to continue having a life of its own. I started this thread to
>>> create more awareness of the issue so that existing Plasma
>>> stakeholders can make themselves known and possibly volunteer their
>>> time to develop and maintain the codebase.
>>> 
>>> On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
>>>  wrote:
 
 We are very interested in Plasma as a stand-alone project. The fork would
 hit us doubly hard, because it reduces both the appeal of an Arrow-specific
 use case as well as our planned Ray integration.
 
 We are developing effectively a database for network activity data that
 runs with Arrow as data plane. See https://github.com/tenzir/vast for
 details. One of our upcoming features is supporting a 1:N output channel
 using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
 process the same data set that's exactly materialized in memory once. We
 currently don't have the developer bandwidth to prioritize this effort, but
 the concurrent, multi-tool processing capability was one of the main
 strategic reasons to go with Arrow as data plane. If Plasma has no future,
 Arrow has a reduced appeal for us in the medium term.
 
 We also have Ray as a data consumer on our roadmap, but the dependency
 chain seems now inverted. If we have to do costly custom plumbing for Ray,
 with a custom version of Plasma, the Ray integration will lose quite a bit
 of appeal because it doesn't fit into the existing 1:N model. That is, even
 though the fork may make sense from a Ray-internal point of view, it
 decreases the appeal of Ray from the outside. (Again, only speaking shared
 data plane here.)
 
 In the future, we're happy to contribute cycles when it comes to keeping
 Plasma as a useful standalone project. We recently made sure that static
 builds work as expected . As of
 now, we unfortunately cannot commit to anything specific though, but our
 interest extends to Gandiva, Flight, and lots of other parts of the Arrow
 ecosystem.
 
 On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara 
 
 wrote:
 
> To answer Wes's question, the Plasma inside of Ray is not currently usable
> in a C++ library context, though it wouldn't be impossible to make that
> happen.
> 
> I (or someone) could conduct a simple poll via Google Forms on the user
> mailing list to gauge demand if we are concerned about breaking a lot of
> people's workflow.
> 
> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou  wrote:
> 
>> Le 15/08/2020 à 17:56, Wes McKinney a écrit :
>>> What isn't clear is whether the Plasma that's in Ray is usable in a
>>> C++ library context (e.g. what we currently ship as libplasma-dev e.g.
>>> on Ubuntu/Debian). That seems still useful, but if the project isn't
>>> being actively maintained / developed (which, given the series of
>>> stale PRs over the last year or two, it doesn't seem to be) it's
>>> unclear whether we want to keep shipping it.
>> 
>> At least on GitHub, the C++ API seems to be getting little use.  Most
>> search results below are forks/copies of the Arrow or Ray codebases.
>> There are also a couple stale experiments:
>> https://github.com/search?l=

Re: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

2020-09-21 Thread Niklas B
Hi,

I’ve tried both, with little success. I made a JIRA:
https://issues.apache.org/jira/browse/ARROW-10052

Looking at it now that I've made a minimal example, I see something I didn't
realize before: while the memory usage is increasing, it doesn't appear to be
linear in the size of the file written. This possibly indicates (I guess) that
it isn't actually the written dataset that is being kept around, but something else.

I’ll keep digging; sorry for it not being as clear as I would have wanted.
In the real world we see writing a 3 GB parquet file exhaust 10 GB of memory when
writing incrementally.
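
For the record, the two suggestions translate into code roughly as follows (a
sketch: the schema, the generator and the 10_000 row-group size are placeholders
I'm experimenting with, not recommendations):

import pyarrow as pa
import pyarrow.parquet as pq

# Suggestion [3]: ask jemalloc to hand freed memory back to the OS quickly
# (only relevant where jemalloc is the active allocator, e.g. Linux/macOS wheels)
pa.jemalloc_set_decay_ms(0)

arrow_schema = pa.schema([("id", pa.string()), ("value", pa.int64())])

def rows_from_database():
    # stand-in for the real database cursor
    for i in range(10):
        yield {"id": [str(i)] * 1000, "value": list(range(1000))}

with pq.ParquetWriter("output.parquet", arrow_schema, compression="snappy") as writer:
    for rows in rows_from_database():
        writer.write_table(
            pa.Table.from_pydict(rows, schema=arrow_schema),
            row_group_size=10_000,  # Suggestion [1]: smaller-than-default row groups
        )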

Regards,
Niklas

> On 20 Sep 2020, at 06:07, Micah Kornfield  wrote:
> 
> Hi Niklas,
> Two suggestions:
> * Try to adjust row_group_size on write_table [1] to a smaller-than-default
> value.  If I read the code correctly this is currently 64 million rows [2],
> which seems potentially too high as a default (I'll open a JIRA about this).
> * If this is on Linux/macOS, try setting the jemalloc decay, which can return
> memory to the OS more quickly [3].
> 
> Just to confirm this is a local disk (not a blob store?) that you are
> writing to?
> 
> If you can produce a minimal example that still seems to hold onto all
> memory, after trying these two items please open a JIRA as there could be a
> bug or some unexpected buffering happening.
> 
> Thanks,
> Micah
> 
> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write_table
> [2]
> https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
> [3]
> https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
> 
> On Tue, Sep 15, 2020 at 8:46 AM Niklas B  wrote:
> 
>> First of all: Thank you so much for all hard work on Arrow, it’s an
>> awesome project.
>> 
>> Hi,
>> 
>> I'm trying to write a large parquet file to disk (larger than memory)
>> using PyArrow's ParquetWriter and write_table, but even though the file is
>> written incrementally to disk, it still appears to keep the entire dataset
>> in memory (eventually getting OOM killed). Basically what I am trying to do
>> is:
>> 
>> with pq.ParquetWriter(
>>output_file,
>>arrow_schema,
>>compression='snappy',
>>allow_truncated_timestamps=True,
>>version='2.0',  # Highest available schema
>>data_page_version='2.0',  # Highest available schema
>>) as writer:
>>for rows_dataframe in function_that_yields_data():
>>writer.write_table(
>>pa.Table.from_pydict(
>>rows_dataframe,
>>arrow_schema
>>)
>>)
>> 
>> Here I have a function that yields data, and I then write it in chunks using
>> write_table.
>> 
>> Is it possible to force the ParquetWriter to not keep the entire dataset
>> in memory, or is it simply not possible for good reasons?
>> 
>> I’m streaming data from a database and writing it to Parquet. The
>> end consumer has plenty of RAM, but the machine that does the conversion
>> doesn’t.
>> 
>> Regards,
>> Niklas
>> 
>> PS: I’ve also created a stack overflow question, which I will update with
>> any answer I might get from the mailing list
>> 
>> https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem



PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

2020-09-15 Thread Niklas B
First of all: Thank you so much for all hard work on Arrow, it’s an awesome 
project. 

Hi,

I'm trying to write a large parquet file to disk (larger than memory) using
PyArrow's ParquetWriter and write_table, but even though the file is written
incrementally to disk, it still appears to keep the entire dataset in memory
(eventually getting OOM killed). Basically what I am trying to do is:

with pq.ParquetWriter(
    output_file,
    arrow_schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',  # Highest available schema
    data_page_version='2.0',  # Highest available schema
) as writer:
    for rows_dataframe in function_that_yields_data():
        writer.write_table(
            pa.Table.from_pydict(
                rows_dataframe,
                arrow_schema
            )
        )

Here I have a function that yields data, and I then write it in chunks using
write_table.

Is it possible to force the ParquetWriter to not keep the entire dataset in 
memory, or is it simply not possible for good reasons?

I’m streaming data from a database and writing it to Parquet. The end consumer
has plenty of RAM, but the machine that does the conversion doesn’t.

Regards,
Niklas

PS: I’ve also created a stack overflow question, which I will update with any 
answer I might get from the mailing list 
https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem