[Java] Memory Allocation Tips
Hi,

Does the Arrow community have any tips, recommendations, or best practices on how to manage Arrow memory in Java? Is there a way to rely on the GC exclusively (i.e. is there support for heap-only allocation)?

Best,
Razvan
[jira] [Created] (ARROW-6899) to_pandas() not implemented on list
Razvan Chitu created ARROW-6899:
-----------------------------------

             Summary: to_pandas() not implemented on list
                 Key: ARROW-6899
                 URL: https://issues.apache.org/jira/browse/ARROW-6899
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.0, 0.13.0
            Reporter: Razvan Chitu
         Attachments: encoded.arrow

Hi,

{{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector whose data vector is of type "dictionary encoded string". Here is the table schema as printed by pyarrow:

{code:java}
pyarrow.Table
encodedList: list<$data$: dictionary not null> not null
  child 0, $data$: dictionary not null
metadata
OrderedDict()
{code}

the data (also attached in a file to this ticket):

{code:java}
[
  [
    -- dictionary:
      ["a", "b", "c", "d"]
    -- indices:
      [0, 1, 2],
    -- dictionary:
      ["a", "b", "c", "d"]
    -- indices:
      [0, 3]
  ]
]
{code}

and the exception I got:

{code:java}
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
in
----> 1 df.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
    700
    701     _check_data_column_metadata_consistency(all_columns)
--> 702     blocks = _table_to_blocks(options, table, categories)
    703     columns = _deserialize_column_index(table, all_columns, column_indexes)
    704

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories)
    972
    973     # Convert an arrow table to Block from the internal pandas API
--> 974     result = pa.lib.table_to_blocks(options, block_table, categories)
    975
    976     # Defined above

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: dictionary
{code}

Note that the data vector itself can be loaded successfully by to_pandas. It'd be great if this could be addressed in the next version of pyarrow. For now, is there anything I can do on my end to bypass this unimplemented conversion?

Thanks,
Razvan

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
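One possible workaround, assuming the file is produced from Arrow Java in the first place: decode the dictionary-encoded child vector back to plain strings before writing, so pyarrow sees a `list<string>` it already knows how to convert. Below is a hedged, minimal sketch of decoding with Arrow Java's `DictionaryEncoder`; the vector names, the dictionary id, and the standalone-vector setup are illustrative and not taken from the ticket (the ticket's data lives inside a ListVector, which needs extra wiring not shown here).

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryEncoder;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;

public class DecodeDictionary {
    public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VarCharVector dictVector = new VarCharVector("dict", allocator);
             IntVector indices = new IntVector("indices", allocator)) {
            // Dictionary values ["a", "b", "c", "d"], as in the ticket.
            dictVector.allocateNew();
            String[] values = {"a", "b", "c", "d"};
            for (int i = 0; i < values.length; i++) {
                dictVector.setSafe(i, values[i].getBytes(StandardCharsets.UTF_8));
            }
            dictVector.setValueCount(values.length);

            // Indices [0, 1, 2] referencing the dictionary.
            indices.allocateNew();
            indices.setSafe(0, 0);
            indices.setSafe(1, 1);
            indices.setSafe(2, 2);
            indices.setValueCount(3);

            // Dictionary id 1 is arbitrary here; null index type defaults to int32.
            Dictionary dictionary = new Dictionary(
                    dictVector, new DictionaryEncoding(1L, false, null));

            // Decode back to a plain VarCharVector ("a", "b", "c") and write
            // that to the IPC stream instead of the encoded vector.
            try (VarCharVector decoded =
                         (VarCharVector) DictionaryEncoder.decode(indices, dictionary)) {
                System.out.println(decoded.getObject(0));
            }
        }
    }
}
```

The trade-off is that decoding materializes the strings and gives up the compactness of the dictionary encoding, but the resulting file round-trips through `to_pandas()` without hitting the unimplemented conversion.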
Re: IPC Tensor + Indices
Sure. I'd like to bundle an M x N shaped tensor along with the M row labels (dates) and N column labels (string identifiers) in one response.

Razvan

On Fri, Jul 12, 2019, 6:53 PM Wes McKinney wrote:
> hi Razvan -- can you clarify what "together with a row and a column
> index" means?
>
> On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu wrote:
> >
> > Hi,
> >
> > Does the IPC format currently support streaming a tensor together with a
> > row and a column index? If not, are there any plans for this to be
> > supported? It'd be quite useful for matrices that could have tens of
> > thousands of rows, columns, or both. For my use case I am currently
> > representing matrices as record batches, but performance is not that great
> > when there are many columns and few rows.
> >
> > Thanks,
> > Razvan
IPC Tensor + Indices
Hi,

Does the IPC format currently support streaming a tensor together with a row and a column index? If not, are there any plans for this to be supported? It'd be quite useful for matrices that could have tens of thousands of rows, columns, or both. For my use case I am currently representing matrices as record batches, but performance is not that great when there are many columns and few rows.

Thanks,
Razvan
Re: Java OutOfMemoryException!
Hi Tanveer,

The stack trace seems to indicate that you've breached the limit of the allocator used by the ArrowStreamReader, so that's where I'd look first. The limit is usually set when constructing an allocator (e.g. new RootAllocator(myLimit)) or when getting a child allocator (e.g. rootAllocator.newChildAllocator(...)).

Razvan

On Sun, Mar 24, 2019 at 12:33 PM Tanveer Ahmad - EWI wrote:
> Hi,
>
> I am de-serializing multiple plasma objects in Java at the same time.
> Everything is working fine, but when the data size increases, the following
> error occurs for some threads. Any suggestion where I can
> increase/change the memory allocation for these processes (I have more
> memory available)? Is it JVM related or Arrow specific?
>
> Exception in thread "Thread-1"
> org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of
> size 634729984 due to memory limit. Current allocation: 0
>     at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:273)
>     at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:249)
>     at org.apache.arrow.vector.ipc.message.MessageChannelReader.readMessageBody(MessageChannelReader.java:88)
>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:204)
>     at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:116)
>
> Thanks.
>
> Regards,
> Tanveer Ahmad
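For reference, a minimal sketch of how those limits are typically wired up with Arrow Java's memory API; the allocator name "ipc-reader" and the specific limit values are illustrative, not from the thread:

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class AllocatorLimits {
    public static void main(String[] args) {
        // Root allocator capped at 1 GiB. Arrow buffers live off-heap, so
        // allocations past this cap throw OutOfMemoryException no matter how
        // much JVM heap (-Xmx) is available.
        try (BufferAllocator root = new RootAllocator(1024L * 1024L * 1024L);
             // Child allocator for the stream reader: initial reservation 0,
             // limit 512 MiB. Breaching this limit is what produces the
             // "due to memory limit" error in the stack trace above.
             BufferAllocator readerAllocator =
                     root.newChildAllocator("ipc-reader", 0, 512L * 1024L * 1024L)) {
            // Pass readerAllocator to new ArrowStreamReader(in, readerAllocator);
            // raise either limit (or use Long.MAX_VALUE) if batches are large.
            System.out.println("limit = " + readerAllocator.getLimit());
        }
    }
}
```

Note that a 634 MB allocation failing against "Current allocation: 0" suggests the reader's allocator limit is simply smaller than a single record batch, so raising that one limit is usually enough.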
Memory mapped files in Java
Hi,

I was looking for a way to interact with memory mapped Arrow files in Java and I found this thread: http://mail-archives.apache.org/mod_mbox/arrow-dev/201709.mbox/%3CCAOgX8szfO-F=ccsqcggucqfzqkgu2wy+pihztbv1gkat4eq...@mail.gmail.com%3E . Are there any updates on the status of an implementation (or a plan / design)?

Best,
Razvan
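For context, the JDK primitive such an implementation would likely build on is java.nio memory mapping. A minimal, self-contained sketch using only the standard library (the file written here is a stand-in, not a real Arrow file, and "ARROW1" is shown only because it is the Arrow file format's magic bytes):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // Create a small file to map; a real use case would open an
        // existing .arrow file instead.
        Path path = Files.createTempFile("example", ".arrow");
        Files.write(path, "ARROW1".getBytes(StandardCharsets.UTF_8));

        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // Map the whole file read-only. The OS pages data in lazily, so
            // large files can be accessed without copying them onto the heap.
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] magic = new byte[6];
            mapped.get(magic);
            System.out.println(new String(magic, StandardCharsets.UTF_8)); // prints "ARROW1"
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

A zero-copy Arrow reader would wrap such a mapped region in Arrow buffers instead of copying bytes out, which is presumably what the linked thread was discussing.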