[jira] [Created] (ARROW-7214) [Python] unpickling a pyarrow table with dictionary fields crashes

2019-11-19 Thread Yevgeni Litvin (Jira)
Yevgeni Litvin created ARROW-7214:
-

 Summary: [Python] unpickling a pyarrow table with dictionary 
fields crashes
 Key: ARROW-7214
 URL: https://issues.apache.org/jira/browse/ARROW-7214
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1, 0.15.0, 0.14.1, 0.14.0
Reporter: Yevgeni Litvin


The following code crashes with this check failure:
{code:java}
F1120 07:51:37.523720 12432 array.cc:773]  Check failed: (data->dictionary) != 
(nullptr) 
{code}
Used pandas 0.24.2. 

 
{code:python}
import cPickle as pickle
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([{"cat": "a", "val": 1}, {"cat": "b", "val": 2}])
df["cat"] = df["cat"].astype('category')
index_table = pa.Table.from_pandas(df, preserve_index=False)

with open('/tmp/zz.pickle', 'wb') as f:
    pickle.dump(index_table, f, protocol=2)

with open('/tmp/zz.pickle', 'rb') as f:
    index_table = pickle.load(f)
{code}
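
A minimal sketch (not from the original report) that builds the dictionary column directly instead of going through pandas' 'category' dtype; on the affected versions the same pickle round trip would be expected to hit the same check:
{code:python}
import pickle

import pyarrow as pa

# Hypothetical variant of the repro above: build a dictionary-encoded column
# without pandas.
arr = pa.array(["a", "b"]).dictionary_encode()
table = pa.Table.from_arrays([arr], names=["cat"])

# Pickle and unpickle the table in memory; on affected versions this should
# trigger the same "(data->dictionary) != (nullptr)" check on load.
restored = pickle.loads(pickle.dumps(table, protocol=2))
print(restored)
{code}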
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: MIME type

2019-11-19 Thread Micah Kornfield
I would propose:
application/apache-arrow-stream
application/apache-arrow-file

I'm not attached to those names but I think there should be two different
mime-types, since the formats are not interchangeable.
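
For illustration, a sketch of how the two formats are produced on the pyarrow side; the MIME strings below are just the proposal from this thread, nothing registered or built into the library:

```
import pyarrow as pa

# Proposed (not registered) MIME types from this thread.
ARROW_STREAM_MIME = "application/apache-arrow-stream"
ARROW_FILE_MIME = "application/apache-arrow-file"

table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])

# IPC streaming format -> would be labeled application/apache-arrow-stream.
stream_sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(stream_sink, table.schema)
writer.write_table(table)
writer.close()
stream_body = stream_sink.getvalue()

# IPC file (random access) format -> would be labeled application/apache-arrow-file.
file_sink = pa.BufferOutputStream()
writer = pa.RecordBatchFileWriter(file_sink, table.schema)
writer.write_table(table)
writer.close()
file_body = file_sink.getvalue()
```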

On Tue, Nov 19, 2019 at 10:31 PM Sutou Kouhei  wrote:

> Hi,
>
> What MIME type should be used for Apache Arrow data?
> application/arrow?
>
> Should we use the same MIME type for IPC Streaming Format[1]
> and IPC File Format[2]? Or should we use different MIME
> types for them?
>
> [1]
> https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
> [2] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
>
>
> Thanks,
> --
> kou
>


MIME type

2019-11-19 Thread Sutou Kouhei
Hi,

What MIME type should be used for Apache Arrow data?
application/arrow?

Should we use the same MIME type for IPC Streaming Format[1]
and IPC File Format[2]? Or should we use different MIME
types for them?

[1] https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
[2] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format


Thanks,
--
kou


[jira] [Created] (ARROW-7213) [Java] Represent a data element of a vector as a tree of ArrowBufPointer

2019-11-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-7213:
---

 Summary: [Java] Represent a data element of a vector as a tree of 
ArrowBufPointer
 Key: ARROW-7213
 URL: https://issues.apache.org/jira/browse/ARROW-7213
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For a fixed/variable width vector, each of its data elements can be represented
as an ArrowBufPointer object, which refers to a contiguous memory segment. This
makes many tasks easier and more efficient (without memory copy): calculating
hash codes, comparing values, etc.

This cannot be achieved for complex vectors, because their values often reside
in more than one contiguous memory region. However, the contiguous memory
regions backing each data element form a tree-like structure, with the regions
themselves as the leaf nodes. For example, a data element of a struct vector
forms a tree whose root corresponds to the struct vector, while the child
vectors correspond to the child nodes of the root.

In this issue, we provide a data structure that represents each data element of
a vector as a tree whose leaf nodes are ArrowBufPointers, each representing a
contiguous memory region of the data element.

With this data structure, many tasks become easier and more efficient:
calculating hash codes and comparing vector elements (ordering & equality). In
addition, we can do things that could not be done before, like placing data
elements into a hash table/hash set.
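
As a rough illustration of the multi-buffer point (using Python/pyarrow rather than the Java API this issue targets), one element of a struct vector already spans several contiguous regions contributed by its children:
{code:python}
import pyarrow as pa

# A struct array: the root contributes a validity buffer, and each child
# field contributes its own buffers (the leaf memory regions of the tree).
arr = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "yz"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]))

# Flattened list of all buffers backing the array; a single struct element
# is spread over several of these contiguous regions.
for i, buf in enumerate(arr.buffers()):
    print(i, None if buf is None else buf.size)
{code}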




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7212) "go test -bench=8192 -run=. ./math" fails

2019-11-19 Thread Michael Poole (Jira)
Michael Poole created ARROW-7212:


 Summary: "go test -bench=8192 -run=. ./math" fails
 Key: ARROW-7212
 URL: https://issues.apache.org/jira/browse/ARROW-7212
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Affects Versions: 0.15.0
 Environment: go version go1.13.4 linux/amd64
Reporter: Michael Poole


Starting at commit 699878dfc (ARROW-4081), "go test -bench=8192 -run=. ./math" 
fails:
{noformat}
% go test -bench=8192 -run=. ./math   
--- FAIL: BenchmarkFloat64Funcs_Sum_8192
float64_test.go:69: invalid memory size exp=0, got=67584
--- FAIL: BenchmarkInt64Funcs_Sum_8192
int64_test.go:69: invalid memory size exp=0, got=67584
--- FAIL: BenchmarkUint64Funcs_Sum_8192
uint64_test.go:69: invalid memory size exp=0, got=67584
FAIL
exit status 1
FAIL    github.com/apache/arrow/go/arrow/math   0.008s
FAIL
{noformat}
Adding a call to 
{code:go}
vec.Release(){code}
at the end of the benchmark\{{.Name}}Funcs_Sum() template fixes this.
  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7211) [Rust] [Parquet] Support writing to byte buffers

2019-11-19 Thread Onur Satici (Jira)
Onur Satici created ARROW-7211:
--

 Summary: [Rust] [Parquet] Support writing to byte buffers
 Key: ARROW-7211
 URL: https://issues.apache.org/jira/browse/ARROW-7211
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Onur Satici


Parquet in Rust currently only supports writing to files. Extending this to 
include byte buffers would enable Rust to write to remote targets such as S3.
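
For reference, the analogous capability in pyarrow (not Rust) looks roughly like this, writing the Parquet output into an in-memory buffer instead of a local file:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])

# Write Parquet into an in-memory buffer rather than onto the filesystem;
# the resulting bytes could then be pushed to a remote target such as S3.
sink = pa.BufferOutputStream()
pq.write_table(table, sink)
buf = sink.getvalue()
print(buf.size)
{code}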



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: pyarrow read_csv with different amount of columns per row

2019-11-19 Thread Maarten Ballintijn
Hi Elisa,

One option is to preprocess the file and add the missing columns.
You can do this in two passes (reading once to determine the largest number of
columns, and once writing out the lines padded to that number of columns).
This does not need much memory, as you can read line by line.
(see http://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects )

Another option is to make an iterator to do this on the fly.
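
A rough sketch of the two-pass variant (the file names and the comma delimiter
are assumptions, and it assumes no quoted commas inside fields), padding every
line to the widest row before handing the result to pyarrow:

```
from pyarrow import csv

in_path = "data.csv"          # assumed input file
out_path = "data_padded.csv"  # padded copy written in the second pass

# Pass 1: find the widest row.
max_cols = 0
with open(in_path) as f:
    for line in f:
        max_cols = max(max_cols, line.rstrip("\n").count(",") + 1)

# Pass 2: pad every row with empty fields up to that width.
with open(in_path) as src, open(out_path, "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split(",")
        fields += [""] * (max_cols - len(fields))
        dst.write(",".join(fields) + "\n")

# Every row now has the same number of columns, so pyarrow can read it.
read_options = csv.ReadOptions(column_names=[str(i) for i in range(max_cols)])
table = csv.read_csv(out_path, read_options=read_options)
```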

Either way the resulting DataFrame is going to be large:

14e6 rows * 43 columns * 8 bytes/float ~ 5 Gbyte  (in the best case)

Cheers,
Maarten



> On Nov 19, 2019, at 4:51 AM, Antoine Pitrou  wrote:
> 
> 
> No, there is no way to load CSV files with irregular dimensions, and we
> don't have any plans currently to support them.  Sorry :-(
> 
> Regards
> 
> Antoine.
> 
> 
> Le 19/11/2019 à 05:54, Micah Kornfield a écrit :
>> +dev@arrow to see if there is a more definitive answer, but I don't believe
>> this type of functionality is supported currently.
>> 
>> 
>> 
>> 
>> On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <
>> elisa.scandell...@gmail.com> wrote:
>> 
>>> Hi,
>>> I'm trying to improve the performance of my program that loads csv data
>>> and manipulates it.
>>> My CSV file contains 14 million rows and has a variable amount of columns.
>>> The first 27 columns will always be available, and a row can have up to 16
>>> more columns for a total of 43.
>>> 
>>> Using vanilla pandas I've found this workaround:
>>> ```
>>> largest_column_count = 0
>>> with open(data_file, 'r') as temp_f:
>>>     lines = temp_f.readlines()
>>>     for l in lines:
>>>         column_count = len(l.split(',')) + 1
>>>         largest_column_count = column_count if largest_column_count < column_count else largest_column_count
>>> temp_f.close()
>>> column_names = [i for i in range(0, largest_column_count)]
>>> all_columns_df = pd.read_csv(file, header=None, delimiter=',', names=column_names,
>>>                              dtype='category').replace(pd.np.nan, '', regex=True)
>>> ```
>>> This will create the table with all my data plus empty cells where the
>>> data is not available.
>>> With a smaller file, this works perfectly well. With the complete file, my
>>> memory usage goes over the roof.
>>> 
>>> I've been reading about Apache Arrow and, after a few attempts to load a
>>> structured csv file (same amount of columns for every row), I'm extremely
>>> impressed.
>>> I've tried to load my data file, using the same concept as above:
>>> ```
>>> fixed_column_names = [str(i) for i in range(0, 27)]
>>> extra_column_names = [str(i) for i in range(len(fixed_column_names), largest_column_count)]
>>> total_columns = fixed_column_names
>>> total_columns.extend(extra_column_names)
>>> read_options = csv.ReadOptions(column_names=total_columns)
>>> convert_options = csv.ConvertOptions(include_columns=total_columns,
>>>                                      include_missing_columns=True,
>>>                                      strings_can_be_null=True)
>>> table = csv.read_csv(edr_filename, read_options=read_options,
>>>                      convert_options=convert_options)
>>> ```
>>> but I get the following error
>>> Exception: CSV parse error: Expected 43 columns, got 32
>>> 
>>> I need to use the csv provided by pyarrow, if not I wouldn't be able to
>>> create the pyarrow table to then convert to pandas
>>> ```from pyarrow import csv```
>>> 
>>> I guess that the csv library provided by pyarrow is more streamlined than
>>> the complete one.
>>> 
>>> Is there any way I can load this file? Maybe using some ReadOptions and/or
>>> ConvertOptions?
>>> I'd be using pandas to manipulate the data after it's been loaded.
>>> 
>>> Thank you in advance
>>> 
>>> 
>> 



[jira] [Created] (ARROW-7210) [C++] Scalar cast should support time-based types

2019-11-19 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7210:
-

 Summary: [C++] Scalar cast should support time-based types
 Key: ARROW-7210
 URL: https://issues.apache.org/jira/browse/ARROW-7210
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


This would allow supporting at least minimal expression evaluation on
time-based arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas

2019-11-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7209:


 Summary: [Python] tests with pandas master are failing now 
__from_arrow__ support landed in pandas
 Key: ARROW-7209
 URL: https://issues.apache.org/jira/browse/ARROW-7209
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


I implemented the pandas <-> arrow roundtrip for pandas' integer+string dtypes
in https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But
our tests were assuming this did not yet work in pandas, and thus need to be
updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7208) Arrow using ParquetFile class

2019-11-19 Thread Roelant Stegmann (Jira)
Roelant Stegmann created ARROW-7208:
---

 Summary: Arrow using ParquetFile class
 Key: ARROW-7208
 URL: https://issues.apache.org/jira/browse/ARROW-7208
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1
Reporter: Roelant Stegmann


We somehow get the same errors. We are working with pyarrow 0.15.1, trying to 
access a folder of `parquet` files generated with Amazon Athena.

```python
table2 = pq.read_table('C:/Data/test-parquet')
```

works fine in contrast to

```python
parquet_file = pq.ParquetFile('C:/Data/test-parquet')
# parquet_file.read_row_group(0)
```

which raises

`ArrowIOError: Failed to open local file 'C:/Data/test-parquet', error: Access 
is denied.`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Disabling Gandiva, Plasma, or other components

2019-11-19 Thread Wes McKinney
The relevant JIRA is

https://issues.apache.org/jira/browse/ARROW-6776

This is not a very complex project (changing flags and refactoring for
code reuse between the "slim" and "comprehensive" build). If there
were interested maintainers, we could even have a "pyarrow-slim" on
PyPI. But I cannot do it and I do not have the funding to commit my
team's time to maintain it, so the work will have to be done and
maintained by someone else.

On Sun, Nov 17, 2019 at 5:27 AM Micah Kornfield  wrote:
>
> Hi Carlos,
> I'm not sure simply removing the files from the pyarrow folder would work.
> I don't have much knowledge in this area, but I think you would
> potentially need to build your own wheel by modifying the script referenced
> in the issue [1].
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/master/python/manylinux1/build_arrow.sh
>
> On Thu, Nov 14, 2019 at 12:44 PM Segovia, Carlos EXT <
> carlos.sego...@comafi.com.ar> wrote:
>
> > Hi, I read from https://github.com/apache/arrow/issues/4216 that it's
> > possible to disable Gandiva, Plasma, or other components that you do not
> > require.
> > I'm trying to deploy an AWS Lambda with pandas and pyarrow, but I get the
> > error "Unzipped size must be smaller than 262144000 bytes".
> > How can I disable these components? I deleted them from the
> > venv.../site-packages/pyarrow folder but it doesn't work.
> >
> > Thanks in advance
> > Regards
> > Carlos Segovia
> >


Re: Apache Arrow build with needed dependencies only

2019-11-19 Thread Richard Bachmann

Hello Wes and Sebastien,
First off a correction from earlier: It appears I misinterpreted the 
documentation and thought that 'thirdparty/download_dependencies.sh' 
would download all dependencies no matter what, which isn't the case. 
Apologies.


We were _originally_ building Arrow with the following command:

${long_path}/bin/cmake \
    ${another_long_path}/arrow-0.14.1/src/arrow/0.14.1/cpp \
    -DARROW_USE_SSE=ON \
    -DARROW_PYTHON=ON \
    -DCMAKE_INSTALL_PREFIX=${path_to_install_dir} \
    -DCMAKE_CXX_COMPILER=${long_path}/bin/g++ \
    -DCMAKE_CXX_STANDARD=17 \
    -DARROW_WITH_ZSTD=OFF \
    -DARROW_BUILD_TESTS=OFF \
    -DARROW_BUILD_BENCHMARKS=OFF \
    -DARROW_PARQUET=ON \
    "-DCXX_COMMON_FLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
    "-DARROW_CXXFLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
    -DBoost_NO_BOOST_CMAKE=ON \
    -DBoost_ADDITIONAL_VERSIONS=1.70

This produced the following in our build logs:
    [  7%] Performing download step (download, verify and extract) for 'rapidjson_ep'
    [  8%] Performing download step (download, verify and extract) for 'double-conversion_ep'
    [  8%] Performing download step (download, verify and extract) for 'snappy_ep'
    [  8%] Performing download step (download, verify and extract) for 'lz4_ep'
    [  8%] Performing download step (download, verify and extract) for 'jemalloc_ep'
    [  8%] Performing download step (download, verify and extract) for 'gflags_ep'
    [  9%] Performing download step (download, verify and extract) for 'thrift_ep'
    [  9%] Performing download step (download, verify and extract) for 'brotli_ep'



Thank you for opening the Jira issue. I agree, the difficulty of telling why
some of these packages are downloaded is a core part of the issue. In the
example above I had some difficulty figuring out why Snappy, for instance, was
downloaded. The build's
`projects/arrow-0.14.1/src/arrow/0.14.1/cpp/CMakeLists.txt` revealed that the
ARROW_ORC setting is the likely cause. Similarly, it was unclear why jemalloc,
which already exists in our stack, was not taken from the system. I now
understand that this is done in order to use a specific version which can be
reliably patched, but it would be nice to have some clearer labeling.


To avoid builds being interrupted by unavailable download mirrors, we have
taken the following steps: the packages downloaded above have now been added
properly to the stack and listed as dependencies of Arrow. Arrow is now built
like so:


ENVIRONMENT FLATBUFFERS_HOME=${flatbuffers_home} ARROW_JEMALLOC_URL=${local_jemalloc_tar.bz2}
${long_path}/bin/cmake \
    ${another_long_path}/arrow-0.14.1/src/arrow/0.14.1/cpp \
    -DARROW_PYTHON=ON \
    -DCMAKE_INSTALL_PREFIX=${path_to_install_dir} \
    -DCMAKE_CXX_COMPILER=${long_path}/bin/g++ \
    -DCMAKE_CXX_STANDARD=17 \
    -DARROW_WITH_ZSTD=OFF \
    -DARROW_BUILD_TESTS=OFF \
    -DARROW_BUILD_BENCHMARKS=OFF \
    -DARROW_PARQUET=ON \
    -DRapidJSON_ROOT=${rapidjson_home} \
    -DRAPIDJSON_INCLUDE_DIR=${rapidjson_home}/include \
    "-DCXX_COMMON_FLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
    "-DARROW_CXXFLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
    -DBoost_NO_BOOST_CMAKE=ON \
    -DBoost_ADDITIONAL_VERSIONS=1.70

The dependencies are detected (no longer downloaded), except for 
jemalloc where the find function has been disabled. As a work-around the 
ARROW_JEMALLOC_URL is supplied to take the tarball from local storage.
Thrift is now built with CMake, identically to how Arrow would do it 
internally, with the addition of the -fPIC flag. We will look into what 
features can be safely disabled for Arrow and Thrift in the future. 
Thank you Sebastien for the pointer to the ALICE build script.


We ended up not going for the full 'offline builds' solution of 
specifying all URLs, as this would introduce additional complexities in 
the form of a 'special' set of packages which are not version controlled 
like the others.


Thank you for the advice.
Kind regards:

    - Richard

On 11/7/19 5:10 PM, Wes McKinney wrote:

I just opened https://issues.apache.org/jira/browse/ARROW-7089 about
increasing transparency around what options are causing thirdparty
dependencies to be required.

On Thu, Nov 7, 2019 at 10:05 AM Wes McKinney  wrote:

hi Richard,

On Thu, Nov 7, 2019 at 9:59 AM Richard Bachmann
  wrote:

Hello,
I'm contacting you on behalf of the LCG Releases team at CERN. We
provide a common software stack for LHCb, ATLAS and others to be used at
CERN and the worldwide computing grid.

Right now we're looking into optimizing the way we're building Apache
Arrow (C++ & Python) and its dependencies. Ideally we'd like to build
Arrow using only the minimum of necessary dependencies to run it, and to
use packages already installed in the stack to fulfill these
dependencies. The former would be nice to keep the stack clean, the
latter wo

[jira] [Created] (ARROW-7207) [Rust] Update Generated Flatbuffer Files

2019-11-19 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-7207:
-

 Summary: [Rust] Update Generated Flatbuffer Files
 Key: ARROW-7207
 URL: https://issues.apache.org/jira/browse/ARROW-7207
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


We last built the fbs files early in the year, and since then there have been 
some changes like LargeLists. We should update the generated Rust files to 
incorporate these changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2019-11-19-0

2019-11-19 Thread Crossbow


Arrow Build Report for Job nightly-2019-11-19-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0

Failed Tasks:
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-azure-conda-osx-clang-py37
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-homebrew-cpp
- test-conda-python-3.7-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-conda-python-3.7-dask-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-conda-python-3.7-spark-master
- test-debian-10-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-debian-10-cpp
- test-debian-10-python-3:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-debian-10-python-3
- test-debian-c-glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-debian-c-glib
- test-debian-ruby:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-debian-ruby
- test-fedora-29-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-fedora-29-cpp
- test-fedora-29-python-3:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-fedora-29-python-3
- test-ubuntu-14.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-14.04-cpp
- test-ubuntu-18.04-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-18.04-cpp-release
- test-ubuntu-18.04-cpp-static:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-18.04-cpp-static
- test-ubuntu-18.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-18.04-cpp
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-18.04-docs
- test-ubuntu-18.04-python-3:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-18.04-python-3
- test-ubuntu-c-glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-c-glib
- test-ubuntu-fuzzit:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-fuzzit
- test-ubuntu-ruby:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-circle-test-ubuntu-ruby
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-azure-ubuntu-xenial
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux1-cp27m
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux1-cp27mu
- wheel-manylinux1-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux1-cp35m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux1-cp36m
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux1-cp37m
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux2010-cp27m
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux2010-cp27mu
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux2010-cp35m
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux2010-cp36m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-travis-wheel-manylinux2010-cp37m

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-19-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/b

[jira] [Created] (ARROW-7206) avoid string concatenation when calling Preconditions#checkArgument

2019-11-19 Thread stephane campinas (Jira)
stephane campinas created ARROW-7206:


 Summary: avoid string concatenation when calling 
Preconditions#checkArgument
 Key: ARROW-7206
 URL: https://issues.apache.org/jira/browse/ARROW-7206
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: stephane campinas
Assignee: stephane campinas
 Attachments: after.png, before.png

Preconditions#checkArgument is called in VectorLoader with the String message 
already built.

This causes some noticeable overhead as can be seen from the attached flame 
graphs.

 

Calling checkArgument with an error template instead avoids the call to 
StringBuilder as can be seen in the `after` image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: pyarrow read_csv with different amount of columns per row

2019-11-19 Thread Antoine Pitrou


No, there is no way to load CSV files with irregular dimensions, and we
don't have any plans currently to support them.  Sorry :-(

Regards

Antoine.


Le 19/11/2019 à 05:54, Micah Kornfield a écrit :
> +dev@arrow to see if there is a more definitive answer, but I don't believe
> this type of functionality is supported currently.
> 
> 
> 
> 
> On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <
> elisa.scandell...@gmail.com> wrote:
> 
>> Hi,
>> I'm trying to improve the performance of my program that loads csv data
>> and manipulates it.
>> My CSV file contains 14 million rows and has a variable amount of columns.
>> The first 27 columns will always be available, and a row can have up to 16
>> more columns for a total of 43.
>>
>> Using vanilla pandas I've found this workaround:
>> ```
>> largest_column_count = 0
>> with open(data_file, 'r') as temp_f:
>>     lines = temp_f.readlines()
>>     for l in lines:
>>         column_count = len(l.split(',')) + 1
>>         largest_column_count = column_count if largest_column_count < column_count else largest_column_count
>> temp_f.close()
>> column_names = [i for i in range(0, largest_column_count)]
>> all_columns_df = pd.read_csv(file, header=None, delimiter=',', names=column_names,
>>                              dtype='category').replace(pd.np.nan, '', regex=True)
>> ```
>> This will create the table with all my data plus empty cells where the
>> data is not available.
>> With a smaller file, this works perfectly well. With the complete file, my
>> memory usage goes over the roof.
>>
>> I've been reading about Apache Arrow and, after a few attempts to load a
>> structured csv file (same amount of columns for every row), I'm extremely
>> impressed.
>> I've tried to load my data file, using the same concept as above:
>> ```
>> fixed_column_names = [str(i) for i in range(0, 27)]
>> extra_column_names = [str(i) for i in range(len(fixed_column_names), largest_column_count)]
>> total_columns = fixed_column_names
>> total_columns.extend(extra_column_names)
>> read_options = csv.ReadOptions(column_names=total_columns)
>> convert_options = csv.ConvertOptions(include_columns=total_columns,
>>                                      include_missing_columns=True,
>>                                      strings_can_be_null=True)
>> table = csv.read_csv(edr_filename, read_options=read_options,
>>                      convert_options=convert_options)
>> ```
>> but I get the following error
>> Exception: CSV parse error: Expected 43 columns, got 32
>>
>> I need to use the csv provided by pyarrow, if not I wouldn't be able to
>> create the pyarrow table to then convert to pandas
>> ```from pyarrow import csv```
>>
>> I guess that the csv library provided by pyarrow is more streamlined than
>> the complete one.
>>
>> Is there any way I can load this file? Maybe using some ReadOptions and/or
>> ConvertOptions?
>> I'd be using pandas to manipulate the data after it's been loaded.
>>
>> Thank you in advance
>>
>>
> 


[jira] [Created] (ARROW-7205) [C++][Gandiva] Implement regexp_matches, regexp_like functions in ganidva

2019-11-19 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7205:
-

 Summary: [C++][Gandiva] Implement regexp_matches, regexp_like 
functions in ganidva
 Key: ARROW-7205
 URL: https://issues.apache.org/jira/browse/ARROW-7205
 Project: Apache Arrow
  Issue Type: Task
Reporter: Projjal Chanda






--
This message was sent by Atlassian Jira
(v8.3.4#803005)