[Java] Memory Allocation Tips

2020-04-20 Thread Razvan Chitu
Hi,

Does the Arrow community have any tips / recommendations / best practices
on how to manage Arrow memory in Java? Is there a way to rely on the GC
exclusively (i.e. is there support for heap-only allocation)?

Best,
Razvan


[jira] [Created] (ARROW-6899) to_pandas() not implemented on list

2019-10-16 Thread Razvan Chitu (Jira)
Razvan Chitu created ARROW-6899:
---

             Summary: to_pandas() not implemented on list
                 Key: ARROW-6899
                 URL: https://issues.apache.org/jira/browse/ARROW-6899
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.0, 0.13.0
            Reporter: Razvan Chitu
         Attachments: encoded.arrow

Hi,

{{pyarrow.Table.to_pandas()}} fails on an Arrow List Vector where the data 
vector is of type "dictionary encoded string". Here is the table schema as 
printed by pyarrow:
{code:java}
pyarrow.Table
encodedList: list<$data$: dictionary not null> not null
  child 0, $data$: dictionary not null
metadata

OrderedDict() {code}
and the data (also attached in a file to this ticket)
{code:java}

[
  [

    -- dictionary:
      [
        "a",
        "b",
        "c",
        "d"
      ]
    -- indices:
      [
        0,
        1,
        2
      ],

    -- dictionary:
      [
        "a",
        "b",
        "c",
        "d"
      ]
    -- indices:
      [
        0,
        3
      ]
  ]
] {code}
and the exception I got
{code:java}
---
ArrowNotImplementedError  Traceback (most recent call last)
 in 
----> 1 df.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
    700 
    701     _check_data_column_metadata_consistency(all_columns)
--> 702     blocks = _table_to_blocks(options, table, categories)
    703     columns = _deserialize_column_index(table, all_columns, column_indexes)
    704 

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories)
    972 
    973     # Convert an arrow table to Block from the internal pandas API
--> 974     result = pa.lib.table_to_blocks(options, block_table, categories)
    975 
    976     # Defined above

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

~/.local/share/virtualenvs/jupyter-BKbz0SEp/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: dictionary {code}
Note that the data vector itself can be loaded successfully by to_pandas.

It'd be great if this could be addressed in the next version of pyarrow. For
now, is there anything I can do on my end to work around this unimplemented
conversion?

Thanks,

Razvan





Re: IPC Tensor + Indices

2019-07-12 Thread Razvan Chitu
Sure. I'd like to bundle an M x N shaped tensor along with the M row labels
(dates) and N column labels (string identifiers) in one response.

Razvan

On Fri, Jul 12, 2019, 6:53 PM Wes McKinney wrote:

> hi Razvan -- can you clarify what "together with a row and a column
> index" means?
>
> On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu wrote:
> >
> > Hi,
> >
> > Does the IPC format currently support streaming a tensor together with a
> > row and a column index? If not, are there any plans for this to be
> > supported? It'd be quite useful for matrices that could have 10s of
> > thousands of either rows, columns or both. For my use case I am currently
> > representing matrices as record batches, but performance is not that great
> > when there are many columns and few rows.
> >
> > Thanks,
> > Razvan
>


IPC Tensor + Indices

2019-07-12 Thread Razvan Chitu
Hi,

Does the IPC format currently support streaming a tensor together with a
row and a column index? If not, are there any plans for this to be
supported? It'd be quite useful for matrices that could have 10s of
thousands of either rows, columns or both. For my use case I am currently
representing matrices as record batches, but performance is not that great
when there are many columns and few rows.

Thanks,
Razvan


Re: Java OutOfMemoryException!

2019-03-24 Thread Razvan Chitu
Hi Tanveer,

The stack trace seems to indicate that you've breached the limit of the
allocator used by the ArrowStreamReader, so that's where I'd look first.
The limit is usually set when constructing an allocator (e.g. new
RootAllocator(myLimit)) or when getting a child allocator (e.g.
rootAllocator.newChildAllocator(...)).
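
To make the limit check concrete, here is a toy, standard-library-only model
of how a per-allocator limit and a parent limit interact. It is not Arrow's
implementation: the class AllocatorLimitSketch and its methods root, child,
reserve and release are hypothetical names invented for this sketch, and the
512 MiB and 1 GiB limits are assumed values. The request size is the one
from your stack trace.

```java
/**
 * Toy model of limit accounting in a tree of allocators.
 * Hypothetical names throughout; this is NOT Arrow's BaseAllocator.
 */
final class AllocatorLimitSketch {
    private final String name;
    private final AllocatorLimitSketch parent; // null for the root
    private final long limit;                  // max bytes this node may hold
    private long allocated;                    // bytes currently reserved here

    private AllocatorLimitSketch(String name, AllocatorLimitSketch parent, long limit) {
        this.name = name;
        this.parent = parent;
        this.limit = limit;
    }

    static AllocatorLimitSketch root(long limit) {
        return new AllocatorLimitSketch("root", null, limit);
    }

    AllocatorLimitSketch child(String childName, long childLimit) {
        return new AllocatorLimitSketch(childName, this, childLimit);
    }

    /** Reserves size bytes here and in every ancestor, or throws and reserves nothing. */
    synchronized void reserve(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        // Compare against the remaining headroom to avoid overflow in the sum.
        if (size > limit - allocated) {
            throw new IllegalStateException(name + ": unable to reserve " + size
                    + " bytes due to memory limit " + limit
                    + ". Current allocation: " + allocated);
        }
        if (parent != null) {
            parent.reserve(size); // throws before anything is recorded locally
        }
        allocated += size;
    }

    /** Returns size bytes to this node and to every ancestor. */
    synchronized void release(long size) {
        if (size < 0 || size > allocated) {
            throw new IllegalArgumentException("cannot release " + size + " bytes");
        }
        allocated -= size;
        if (parent != null) {
            parent.release(size);
        }
    }

    synchronized long allocated() {
        return allocated;
    }

    public static void main(String[] args) {
        long request = 634_729_984L; // buffer size from the stack trace

        // Assumed 512 MiB root limit: a single request exceeds it even
        // though nothing has been allocated yet.
        AllocatorLimitSketch small = root(512L << 20);
        try {
            small.child("reader", Long.MAX_VALUE).reserve(request);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Assumed 1 GiB root limit: the same request now fits.
        AllocatorLimitSketch large = root(1L << 30);
        large.child("reader", Long.MAX_VALUE).reserve(request);
        System.out.println("accepted, root holds " + large.allocated() + " bytes");
    }
}
```

The point of the model is that a request can be rejected while the current
allocation is still 0, simply because the single buffer is larger than the
tightest limit anywhere between the requesting allocator and the root.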

Razvan



On Sun, Mar 24, 2019 at 12:33 PM Tanveer Ahmad - EWI wrote:

> Hi,
>
> I am de-serializing multiple plasma objects in Java at the same time.
> Everything works fine, but when the data size increases the following
> error occurs on some threads. Any suggestions on where I can
> increase/change the memory allocation for these processes (I have more
> memory available)? Is it JVM related or Arrow specific?
>
> Exception in thread "Thread-1" org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of size 634729984 due to memory limit. Current allocation: 0
>     at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:273)
>     at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:249)
>     at org.apache.arrow.vector.ipc.message.MessageChannelReader.readMessageBody(MessageChannelReader.java:88)
>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:204)
>     at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:116)
>
>
>
> Thanks.
>
>
> Regards,
> Tanveer Ahmad
>


Memory mapped files in Java

2019-03-19 Thread Razvan Chitu
Hi,

I was looking for a way to interact with memory mapped Arrow files in Java
and I found this thread:
http://mail-archives.apache.org/mod_mbox/arrow-dev/201709.mbox/%3CCAOgX8szfO-F=ccsqcggucqfzqkgu2wy+pihztbv1gkat4eq...@mail.gmail.com%3E
Are there any updates on the status of an implementation (or a plan /
design)?

Best,
Razvan