RE: [Celebrate] Arrow has reached 2000 stargazers

2018-05-29 Thread Atul Dambalkar
Congratulations to the entire Arrow team!!!

-Original Message-
From: Wes McKinney [mailto:wesmck...@gmail.com] 
Sent: Monday, May 28, 2018 9:20 PM
To: dev@arrow.apache.org
Subject: Re: [Celebrate] Arrow has reached 2000 stargazers

Congrats all! The journey continues

On Mon, May 28, 2018 at 9:43 AM, Krisztián Szűcs wrote:
> Which makes Arrow the 33rd most starred Apache repository (out of 1555,
> according to GitHub).
>
> Congratulations!


[jira] [Created] (ARROW-2642) [Python] Fail building parquet binding on Windows

2018-05-29 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2642:

 Summary: [Python] Fail building parquet binding on Windows
 Key: ARROW-2642
 URL: https://issues.apache.org/jira/browse/ARROW-2642
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


For some reason I get the following error. I'm not sure why Thrift is needed 
here:

{code}
-- Found the Parquet library: C:/Miniconda3/envs/arrow/Library/lib/parquet.lib
-- THRIFT_HOME:
-- Thrift compiler/libraries NOT found:  (THRIFT_INCLUDE_DIR-NOTFOUND, THRIFT_STATIC_LIB-NOTFOUND). Looked in system search paths.
-- Boost version: 1.66.0
-- Found the following Boost libraries:
--   regex
Added static library dependency boost_regex: C:/Miniconda3/envs/arrow/Library/lib/libboost_regex.lib
Added static library dependency parquet: C:/Miniconda3/envs/arrow/Library/lib/parquet_static.lib
CMake Error at C:/t/arrow/cpp/cmake_modules/BuildUtils.cmake:88 (message):
  No static or shared library provided for thrift
Call Stack (most recent call first):
  CMakeLists.txt:376 (ADD_THIRDPARTY_LIB)

{code}

The {{thrift-cpp}} package from conda-forge is installed.
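
A possible workaround to try (untested; this assumes the find module honors a {{THRIFT_HOME}} environment variable, which the empty "-- THRIFT_HOME:" line above suggests) is to point it at the conda environment prefix where thrift-cpp is installed, e.g.:

{code}
:: Hypothetical workaround: point the Thrift detection at the same conda
:: Library prefix where parquet.lib was found above, then re-run cmake.
set THRIFT_HOME=C:\Miniconda3\envs\arrow\Library
{code}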





[jira] [Created] (ARROW-2643) [C++] Travis-CI build failure with cpp toolchain enabled

2018-05-29 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2643:

 Summary: [C++] Travis-CI build failure with cpp toolchain enabled
 Key: ARROW-2643
 URL: https://issues.apache.org/jira/browse/ARROW-2643
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


This is a new failure, perhaps triggered by a conda-forge package update. See 
example at https://travis-ci.org/apache/arrow/jobs/385002355#L2235

{code}
/usr/bin/ld: /home/travis/build/apache/arrow/cpp-toolchain/lib/libz.a(deflate.o): relocation R_X86_64_32S against `zcalloc' can not be used when making a shared object; recompile with -fPIC
{code}
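
The linker message itself points at the cause: the static libz.a shipped in the cpp-toolchain was built without position-independent code, so it cannot be folded into a shared library. If the conda-forge package cannot be fixed quickly, a hypothetical local workaround would be to rebuild zlib from source with -fPIC into the toolchain prefix (paths are illustrative):

{code}
# Hypothetical rebuild of zlib with position-independent code so the
# static archive can be linked into shared objects.
CFLAGS="-fPIC" ./configure --prefix=/home/travis/build/apache/arrow/cpp-toolchain
make && make install
{code}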





JDBC Adapter PR - 1759

2018-05-29 Thread Atul Dambalkar
Hi Sid, Laurent, Uwe,

Any idea when someone can take a look at the PR
https://github.com/apache/arrow/pull/1759/?

Laurent had given a bunch of comments earlier, and we have now addressed most
of those. We have also added multiple test cases. It would be great if someone
could take a look.

Regards,
-Atul



Re: Peak memory usage for pyarrow.parquet.read_table

2018-05-29 Thread Bryant Menn
Following up on what I have found with Uwe's advice and poking around the
code base.

* `columns=` helped, but only because it forced me to realize I did not need
all of the columns at once every time. No particular column was
significantly worse in memory usage.
* There seems to be some interaction between
`parquet::internal::RecordReader` and `arrow::PoolBuffer` or
`arrow::DefaultMemoryPool`. `RecordReader` requests an allocation large
enough to hold the entire column in memory without compression/encoding,
even though Arrow supports dictionary encoding (and the column is
dictionary encoded).

I imagine `RecordReader` requests enough memory to hold the data without
encoding/compression for good reason (perhaps more robust assumptions about
the underlying memory pool?), but is there a way to request only the memory
required for dictionary encoding when it is an option?

My (incomplete) understanding comes from the surrounding lines here:
https://github.com/apache/parquet-cpp/blob/c405bf36506ec584e8009a6d53349277e600467d/src/parquet/arrow/record_reader.cc#L232
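
In case it helps anyone else hitting the same peak, here is a minimal
sketch of what worked for me: reading only the columns that are needed,
optionally one row group at a time to bound the allocation. File and
column names are placeholders:

{code}
import pyarrow.parquet as pq

# Read only the columns actually needed (placeholder names).
table = pq.read_table('data.parquet', columns=['col_a', 'col_b'])

# Or bound peak memory further by processing one row group at a time.
pf = pq.ParquetFile('data.parquet')
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i, columns=['col_a'])
    # ... process `chunk`, then let it go out of scope ...
{code}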

On Wed, Apr 25, 2018 at 2:23 PM Bryant Menn wrote:

> Uwe,
>
> I'll try pinpointing things further with `columns=` and try to reproduce
> what I find with data I can share.
>
> Thanks for the pointer.
>
> -Bryant
>
> On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn wrote:
>
>> No, there is no need to pass any options on reading. Sometimes they are
>> beneficial depending on what you want to achieve but defaults are ok, too.
>>
>> I'm not sure if you're able to post an example but it would be nice if
>> you could post the resulting Arrow schema from the table. It might be
>> related to a specific type. A quick way to debug this on your side would
>> also be to specify only a subset of columns to read using the `columns=`
>> attribute on read_table. Maybe you can already pinpoint the memory problems
>> to a specific column. Having these hints would make it easier for us to
>> diagnose what the underlying problem is.
>>
>> Uwe
>>
>> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
>> > Uwe,
>> >
>> > I am not. Should I be? I forgot to mention earlier that the Parquet file
>> > came from Spark/PySpark.
>> >
>> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn wrote:
>> >
>> > > Hello Bryant,
>> > >
>> > > are you using any options on `pyarrow.parquet.read_table` or a
>> possible
>> > > `to_pandas` afterwards?
>> > >
>> > > Uwe
>> > >
>> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
>> > > > I tried reading a Parquet file (<200MB, lots of text with snappy)
>> using
>> > > > read_table and saw the memory usage peak over 8GB before settling
>> back
>> > > down
>> > > > to ~200MB. This surprised me as I was expecting to be able to
>> handle a
>> > > > Parquet file of this size with much less RAM (doing some processing
>> with
>> > > > smaller VMs).
>> > > >
>> > > > I am not sure if this is expected, but I thought I might check with
>> everyone
>> > > > here and learn something new. Poking around it seems to be related
>> with
>> > > > ParquetReader.read_all?
>> > > >
>> > > > Thanks in advance,
>> > > > Bryant
>> > >
>>
>


[jira] [Created] (ARROW-2644) [Python] parquet binding fails building on AppVeyor

2018-05-29 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2644:

 Summary: [Python] parquet binding fails building on AppVeyor
 Key: ARROW-2644
 URL: https://issues.apache.org/jira/browse/ARROW-2644
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


This is a new issue (perhaps due to a new Cython version). See e.g. 
https://ci.appveyor.com/project/pitrou/arrow/build/1.0.509/job/dxdqcdk30kmiy6pd#L4291

Excerpt:

{code}
-- Running cmake --build for pyarrow
C:\Program Files (x86)\CMake\bin\cmake.exe --build . --config release
[1/8] cmd.exe /C "cd /D C:\projects\arrow\python\build\temp.win-amd64-3.6\Release && C:\Miniconda36-x64\envs\arrow\python.exe -m cython --cplus --working C:/projects/arrow/python --output-file C:/projects/arrow/python/build/temp.win-amd64-3.6/Release/_parquet.cpp C:/projects/arrow/python/pyarrow/_parquet.pyx"
[2/8] cmd.exe /c
[3/8] cmd.exe /C "cd /D C:\projects\arrow\python\build\temp.win-amd64-3.6\Release && C:\Miniconda36-x64\envs\arrow\python.exe -m cython --cplus --working C:/projects/arrow/python --output-file C:/projects/arrow/python/build/temp.win-amd64-3.6/Release/lib.cpp C:/projects/arrow/python/pyarrow/lib.pyx"
[4/8] cmd.exe /c
[5/8] C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1414~1.264\bin\Hostx64\x64\cl.exe /TP -DARROW_EXPORTING -D_CRT_SECURE_NO_WARNINGS -D_parquet_EXPORTS -IC:\Miniconda36-x64\envs\arrow\lib\site-packages\numpy\core\include -IC:\Miniconda36-x64\envs\arrow\include -I..\..\..\src -IC:\Miniconda36-x64\envs\arrow\Library\include /bigobj /W3 /wd4800 /DWIN32 /D_WINDOWS /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /WX /wd4190 /wd4293 /wd4800 /MD /O2 /Ob2 /DNDEBUG /showIncludes /FoCMakeFiles\_parquet.dir\_parquet.cpp.obj /FdCMakeFiles\_parquet.dir\ /FS -c _parquet.cpp
FAILED: CMakeFiles/_parquet.dir/_parquet.cpp.obj
C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1414~1.264\bin\Hostx64\x64\cl.exe /TP -DARROW_EXPORTING -D_CRT_SECURE_NO_WARNINGS -D_parquet_EXPORTS -IC:\Miniconda36-x64\envs\arrow\lib\site-packages\numpy\core\include -IC:\Miniconda36-x64\envs\arrow\include -I..\..\..\src -IC:\Miniconda36-x64\envs\arrow\Library\include /bigobj /W3 /wd4800 /DWIN32 /D_WINDOWS /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /WX /wd4190 /wd4293 /wd4800 /MD /O2 /Ob2 /DNDEBUG /showIncludes /FoCMakeFiles\_parquet.dir\_parquet.cpp.obj /FdCMakeFiles\_parquet.dir\ /FS -c _parquet.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.14.26428.1 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.
_parquet.cpp(6790): error C2220: warning treated as error - no 'object' file generated
_parquet.cpp(6790): warning C4244: 'argument': conversion from 'int64_t' to 'long', possible loss of data
[6/8] C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1414~1.264\bin\Hostx64\x64\cl.exe /TP -DARROW_EXPORTING -D_CRT_SECURE_NO_WARNINGS -Dlib_EXPORTS -IC:\Miniconda36-x64\envs\arrow\lib\site-packages\numpy\core\include -IC:\Miniconda36-x64\envs\arrow\include -I..\..\..\src -IC:\Miniconda36-x64\envs\arrow\Library\include /bigobj /W3 /wd4800 /DWIN32 /D_WINDOWS /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /WX /wd4190 /wd4293 /wd4800 /MD /O2 /Ob2 /DNDEBUG /showIncludes /FoCMakeFiles\lib.dir\lib.cpp.obj /FdCMakeFiles\lib.dir\ /FS -c lib.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.14.26428.1 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.
ninja: build stopped: subcommand failed.
error: command 'C:\\Program Files (x86)\\CMake\\bin\\cmake.exe' failed with exit status 1
(arrow) C:\projects\arrow\python>set lastexitcode=1
{code}
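
The failure reduces to warning C4244 ('int64_t' to 'long') in the Cython-generated _parquet.cpp being promoted to an error by /WX. The proper fix is presumably an explicit cast in the generated code or a Cython update; as a stopgap (an untested guess), the warning could be suppressed alongside the ones already disabled in the compile line above:

{code}
/WX /wd4190 /wd4293 /wd4800 /wd4244
{code}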






Re: JDBC Adapter PR - 1759

2018-05-29 Thread Siddharth Teotia
Hi Atul,

I will take a look today.

Thanks,
Sidd

On Tue, May 29, 2018 at 2:45 AM, Atul Dambalkar wrote:

> Hi Sid, Laurent, Uwe,
>
> Any idea when someone can take a look at the PR
> https://github.com/apache/arrow/pull/1759/?
>
> Laurent had given a bunch of comments earlier, and we have now addressed
> most of those. We have also added multiple test cases. It would be great if
> someone could take a look.
>
> Regards,
> -Atul
>
>


[jira] [Created] (ARROW-2645) [Java] ArrowStreamWriter accumulates DictionaryBatch ArrowBlocks

2018-05-29 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-2645:

 Summary: [Java] ArrowStreamWriter accumulates DictionaryBatch ArrowBlocks
 Key: ARROW-2645
 URL: https://issues.apache.org/jira/browse/ARROW-2645
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Vectors
Reporter: Bryan Cutler
Assignee: Bryan Cutler


While reading the code, I noticed that the base method ensureStarted in
ArrowStreamWriter accumulates DictionaryBatch ArrowBlocks. The accumulated
block list is used by ArrowFileWriter but not by ArrowStreamWriter.





Re: JDBC Adapter PR - 1759

2018-05-29 Thread Laurent Goujon
Same here.

On Tue, May 29, 2018 at 9:59 AM, Siddharth Teotia wrote:

> Hi Atul,
>
> I will take a look today.
>
> Thanks,
> Sidd
>
> On Tue, May 29, 2018 at 2:45 AM, Atul Dambalkar <atul.dambal...@xoriant.com> wrote:
>
> > Hi Sid, Laurent, Uwe,
> >
> > Any idea when someone can take a look at the PR
> > https://github.com/apache/arrow/pull/1759/?
> >
> > Laurent had given a bunch of comments earlier, and we have now addressed
> > most of those. We have also added multiple test cases. It would be great
> > if someone could take a look.
> >
> > Regards,
> > -Atul
> >
> >
>