Re: [Java] Append multiple record batches together?

2019-11-14 Thread Fan Liya
One use-case for ChunkedArray that comes to my mind is external sort for
large vectors.

Best,
Liya Fan

On Fri, Nov 15, 2019 at 2:14 PM Micah Kornfield 
wrote:

> >
> > Maybe Java can add the concept of Tables and ChunkedArrays sometime in
> the
> > future.
>
>
> Is there a concrete use-case here?  It might pay to open up some JIRAs.
> I'm still not 100% clear on the rationale for the way VectorSchemaRoot is
> designed and how that would relate to Table/ChunkedArrays (or maybe they
> are completely separate)?
>
> On Tue, Nov 12, 2019 at 11:28 AM Bryan Cutler  wrote:
>
> > Yes, you are correct. I think I was mixing up a couple different things.
> I
> > like the way C++/Python distinguishes it where a RecordBatch is
> contiguous
> > memory and a Table can be chunked. So since you are just talking about
> > RecordBatches, I think we should keep it contiguous and concat would
> > require memcpy. Maybe Java can add the concept of Tables and
> ChunkedArrays
> > sometime in the future.
> >
> > On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield 
> > wrote:
> >
> >> I think having a chunked array with multiple vector buffers would be
> >>> ideal, similar to C++. It might take a fair amount of work to add this
> but
> >>> would open up a lot more functionality.
> >>
> >>
> >> There are potentially two different use-cases.  ChunkedArray is
> >> logical/lazy concatenation, whereas concat physically rebuilds the
> >> vectors to be a single vector.
> >>
> >> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler  wrote:
> >>
> >>> I think having a chunked array with multiple vector buffers would be
> >>> ideal, similar to C++. It might take a fair amount of work to add this
> but
> >>> would open up a lot more functionality. As for the API,
> >>> VectorSchemaRoot.concat(Collection) seems good to me.
> >>>
> >>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya  wrote:
> >>>
>  Hi Micah,
> 
>  Thanks for bringing this up.
> 
>  > 1.  An efficient solution already exists? It seems like TransferPair
>  implementations could possibly be improved upon or have they already
>  been
>  optimized?
> 
>  Fundamentally, memory copy is unavoidable, IMO, because the source and
>  target memory regions are likely to be in non-contiguous regions.
>  An alternative is to make ArrowBuf support a number of non-contiguous
>  memory regions. However, that would harm the performance of ArrowBuf, and
>  ArrowBuf is the core of the Arrow library.
> 
>  > 2.  What the preferred API for doing this would be?  Some options I
>  can
>  think of:
> 
>  > * VectorSchemaRoot.concat(Collection)
>  > * VectorSchemaRoot.from(Collection)
>  > * VectorLoader.load(Collection)
> 
>  IMO, option 1 is required, as we have scenarios that need to concatenate
>  vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from
>  delta dictionaries).
>  Options 2 and 3 are optional for us.
> 
>  Best,
>  Liya Fan
> 
>  On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield  >
>  wrote:
> 
>  > Hi,
>  > A colleague opened up
>  https://issues.apache.org/jira/browse/ARROW-7048 for
>  > having similar functionality to the python APIs that allow for
>  creating one
>  > larger data structure from a series of record batches.  I just
> wanted
>  to
>  > surface it here in case:
>  > 1.  An efficient solution already exists? It seems like TransferPair
>  > implementations could possibly be improved upon or have they already
>  been
>  > optimized?
>  > 2.  What the preferred API for doing this would be?  Some options I
>  can
>  > think of:
>  >
>  > * VectorSchemaRoot.concat(Collection)
>  > * VectorSchemaRoot.from(Collection)
>  > * VectorLoader.load(Collection)
>  >
>  > Thanks,
>  > Micah
>  >
> 
> >>>
>


[jira] [Created] (ARROW-7175) [Website] Add a security page to track when vulnerabilities are patched

2019-11-14 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7175:
--

 Summary: [Website] Add a security page to track when 
vulnerabilities are patched
 Key: ARROW-7175
 URL: https://issues.apache.org/jira/browse/ARROW-7175
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Micah Kornfield


We might also want to give a brief tutorial on safely using the C++ library
(e.g. pointers to validation methods).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Java] Append multiple record batches together?

2019-11-14 Thread Micah Kornfield
>
> Maybe Java can add the concept of Tables and ChunkedArrays sometime in the
> future.


Is there a concrete use-case here?  It might pay to open up some JIRAs.
I'm still not 100% clear on the rationale for the way VectorSchemaRoot is
designed and how that would relate to Table/ChunkedArrays (or maybe they
are completely separate)?

On Tue, Nov 12, 2019 at 11:28 AM Bryan Cutler  wrote:

> Yes, you are correct. I think I was mixing up a couple different things. I
> like the way C++/Python distinguishes it where a RecordBatch is contiguous
> memory and a Table can be chunked. So since you are just talking about
> RecordBatches, I think we should keep it contiguous and concat would
> require memcpy. Maybe Java can add the concept of Tables and ChunkedArrays
> sometime in the future.
>
> On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield 
> wrote:
>
>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality.
>>
>>
>> There are potentially two different use-cases.  ChunkedArray is
>> logical/lazy concatenation, whereas concat physically rebuilds the vectors
>> to be a single vector.
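To make that distinction concrete, here is a rough Java sketch of what a physical concat over two VectorSchemaRoots with identical schemas could look like. This is not an existing API, only an illustration of why value-by-value (memcpy-style) copying is unavoidable; copyFromSafe is used for brevity and, depending on the Arrow Java version, may only be available on the concrete vector classes rather than the FieldVector interface.

import java.util.Arrays;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

// Hypothetical sketch only: physically rebuild one VectorSchemaRoot from two
// inputs with identical schemas. Every value is copied, which is why a real
// implementation cannot avoid memcpy-style work.
public final class ConcatSketch {
  public static VectorSchemaRoot concat(BufferAllocator allocator,
                                        VectorSchemaRoot a, VectorSchemaRoot b) {
    VectorSchemaRoot out = VectorSchemaRoot.create(a.getSchema(), allocator);
    out.allocateNew();
    for (int col = 0; col < out.getFieldVectors().size(); col++) {
      FieldVector target = out.getFieldVectors().get(col);
      int row = 0;
      for (VectorSchemaRoot src : Arrays.asList(a, b)) {
        FieldVector source = src.getFieldVectors().get(col);
        for (int i = 0; i < src.getRowCount(); i++) {
          target.copyFromSafe(i, row++, source);   // value-by-value copy
        }
      }
    }
    out.setRowCount(a.getRowCount() + b.getRowCount());
    return out;
  }
}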
>>
>> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler  wrote:
>>
>>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality. As for the API,
>>> VectorSchemaRoot.concat(Collection) seems good to me.
>>>
>>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya  wrote:
>>>
 Hi Micah,

 Thanks for bringing this up.

 > 1.  An efficient solution already exists? It seems like TransferPair
 implementations could possibly be improved upon or have they already
 been
 optimized?

 Fundamentally, memory copy is unavoidable, IMO, because the source and
 target memory regions are likely to be in non-contiguous regions.
 An alternative is to make ArrowBuf support a number of non-contiguous
 memory regions. However, that would harm the performance of ArrowBuf, and
 ArrowBuf is the core of the Arrow library.

 > 2.  What the preferred API for doing this would be?  Some options I
 can
 think of:

 > * VectorSchemaRoot.concat(Collection)
 > * VectorSchemaRoot.from(Collection)
 > * VectorLoader.load(Collection)

 IMO, option 1 is required, as we have scenarios that need to concatenate
 vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from
 delta dictionaries).
 Options 2 and 3 are optional for us.

 Best,
 Liya Fan

 On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield 
 wrote:

 > Hi,
 > A colleague opened up
 https://issues.apache.org/jira/browse/ARROW-7048 for
 > having similar functionality to the python APIs that allow for
 creating one
 > larger data structure from a series of record batches.  I just wanted
 to
 > surface it here in case:
 > 1.  An efficient solution already exists? It seems like TransferPair
 > implementations could possibly be improved upon or have they already
 been
 > optimized?
 > 2.  What the preferred API for doing this would be?  Some options I
 can
 > think of:
 >
 > * VectorSchemaRoot.concat(Collection)
 > * VectorSchemaRoot.from(Collection)
 > * VectorLoader.load(Collection)
 >
 > Thanks,
 > Micah
 >

>>>


[jira] [Created] (ARROW-7174) [Python] Expose dictionary size parameter in python.

2019-11-14 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7174:
--

 Summary: [Python] Expose dictionary size parameter in python.
 Key: ARROW-7174
 URL: https://issues.apache.org/jira/browse/ARROW-7174
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Micah Kornfield


In some cases it might be useful to have dictionaries larger than the current
default of 1MB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7173) Add test to verify Map field names can be arbitrary

2019-11-14 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7173:
---

 Summary: Add test to verify Map field names can be arbitrary
 Key: ARROW-7173
 URL: https://issues.apache.org/jira/browse/ARROW-7173
 Project: Apache Arrow
  Issue Type: Test
  Components: Integration
Reporter: Bryan Cutler


A Map has child fields, and the format spec only recommends that they be named
"entries", "key", and "value"; they could be named anything. Currently,
integration tests for Map arrays verify that the exchanged schema is equal, so
the child fields are always named the same. There should be tests that use
different names to verify that implementations can accept this.
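For example, a test schema along these lines (a hedged Java sketch; the names "kv", "the_key", and "the_value" are arbitrary non-default choices) should round-trip through implementations:

{code:java}
import java.util.Arrays;
import java.util.Collections;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// A Map field whose children use non-default names instead of
// "entries"/"key"/"value"; per the spec, readers should still accept it.
Field key = Field.notNullable("the_key", new ArrowType.Utf8());
Field value = Field.nullable("the_value", new ArrowType.Int(32, true));
Field entries = new Field("kv",
    FieldType.notNullable(new ArrowType.Struct()),
    Arrays.asList(key, value));
Field map = new Field("m",
    FieldType.nullable(new ArrowType.Map(/*keysSorted=*/false)),
    Collections.singletonList(entries));
{code}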



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[C++][Parquet]: Stream API handling of optional fields

2019-11-14 Thread Gawain Bolton

Hello,

I would like to add support for handling optional fields to the 
parquet::StreamReader and parquet::StreamWriter classes which I recently 
contributed (thank you!).


Ideally I would do this by using std::optional like this:

    parquet::StreamWriter writer{ parquet::ParquetFileWriter::Open(...) };

    std::optional<double> d;

    writer << d;

    ...

    parquet::StreamReader reader{parquet::ParquetFileReader::Open(...)};

    reader >> d;

However, std::optional is only available in C++17 and Arrow is compiled
in C++11 mode.


From what I see, Arrow does use Boost to a limited extent, and in fact
gandiva/cache.h uses the boost::optional class.


So would it be possible to use the boost::optional class in parquet?

Or perhaps someone can suggest another way of handling optional fields?

Thanks in advance for your help,

Gawain




Re: Building Arrow 0.15.1 using dependencies in local source folder

2019-11-14 Thread Neal Richardson
I am not an expert on this, but it seems you can specify `*_ROOT` arguments
to cmake, like
https://github.com/apache/arrow/blob/master/ci/PKGBUILD#L90-L91

Maybe that does what you need?

Neal


On Thu, Nov 14, 2019 at 12:45 PM Tahsin Hassan 
wrote:

> Hi all,
>
> I am trying to build Arrow 0.15.1. The dependencies for Arrow, e.g.
> Thrift and double-conversion, are in a local source folder and we need to
> build the dependencies from that location.
>
> I read up on
>
> https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#offline-builds
>
>   *   BUNDLED: Building the dependency automatically from source
>   *   SYSTEM: Finding the dependency in system paths using CMake's
> built-in find_package function, or using pkg-config for packages that do
> not have this feature
> Unfortunately, that’s not exactly what I want.
> and
>
> https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#offline-builds
> but that basically downloads the tar(s) into a folder, extracts them, and
> sets up the build using that.
> e.g.
> $./download_dependencies.sh /sandbox/someArrowStuff/
> # Environment variables for offline Arrow build
> export ARROW_AWSSDK_URL=/sandbox/someArrowStuff/aws-sdk-cpp-1.7.160.tar.gz
> export ARROW_BOOST_URL=/sandbox/someArrowStuff/boost-1.67.0.tar.gz
> export ARROW_BROTLI_URL=/sandbox/someArrowStuff/brotli-v1.0.7.tar.gz
> …
>
>
> What I kind of wanted was a set of environment variables that allow
> setting a source folder path, e.g.
> export ARROW_BOOST_MYPATH=/sandbox/someArrowStuff/3p/boost/
> where /sandbox/someArrowStuff/3p/boost/ already holds the necessary Boost
> source folder and ARROW_BOOST_MYPATH is some kind of variable to help locate
> the necessary source folder.
>
> Is there some option like that? Where can I dig for more information
> regarding that?
>
> Thanks,
> Tahsin
>
>
>
>
>
>


[jira] [Created] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString

2019-11-14 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7172:
---

 Summary: [C++][Dataset] Improve format of Expression::ToString
 Key: ARROW-7172
 URL: https://issues.apache.org/jira/browse/ARROW-7172
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read 
{{"b"_ > int32(3)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7171) [Ruby] Pass Array for Arrow::Table#filter

2019-11-14 Thread Yosuke Shiro (Jira)
Yosuke Shiro created ARROW-7171:
---

 Summary: [Ruby] Pass Array for Arrow::Table#filter
 Key: ARROW-7171
 URL: https://issues.apache.org/jira/browse/ARROW-7171
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Ruby
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Achieving parity with Java extension types in Python

2019-11-14 Thread Justin Polchlopek
I made a PR for this issue at https://github.com/apache/arrow/pull/5835.
Would love some more detail about what was intended by the initial issue
and what would be a better way.

On Tue, Nov 12, 2019 at 11:25 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Sorry for the delay in response. I would suggest that you open a PR (or
> point to a branch with those changes), that will make it easier to discuss
> specific implementation options (rather than trying to explain and
> understand it in words) and give advice.
>
> On Wed, 6 Nov 2019 at 20:29, Justin Polchlopek 
> wrote:
>
> > Hi.  I'm looking into this issue and I have some questions as someone new
> > to the project.  The comment from Joris earlier in the thread suggests
> that
> > the solution here is to create an Array subclass for each extension type
> > that wants to use one.  This will give a nice symmetry w.r.t. the Java
> > interface, but in the Python case, this seems to suggest having to travel
> > some fairly byzantine code paths (rather quickly, we end up in C++ code,
> > where I lose the thread of what's happening—specifically as regards
> > `pyarrow_wrap_array`, as suggested in ARROW-6176).
> >
>
> The goal here is that for the end user, it is possible to do this without
> involving C++ code, and I *think* implementing it should be possible from
> cython. How did you end up in C++?
>
>
> > I came up with a quick-and-dirty method wherein the ExtensionType
> subclass
> > simply provides a method to translate from the storage type to the output
> > type, and ExtensionArray has a __getitem__ implementation that passes the
> > element from storage through the translation function.  This doesn't feel
> > outside of the realm of what is often acceptable in the python world, but
> > it isn't nearly as typeful as Arrow seems to be leaning toward.  Plus,
> > this feels
> > very far from what was intended in the issue, and I believe that I'm not
> > understanding the underlying design principles.
> >
> > Can I get a bit of advice on this?
> >
> > Thanks.
> > -J
> >
> > On Tue, Oct 29, 2019 at 12:26 PM Justin Polchlopek <
> jpolchlo...@azavea.com
> > >
> > wrote:
> >
> > > That sounds about right.  We're doing some work here that might require
> > > this feature sooner than later, and if we decide to go the route that
> > needs
> > > this improved support, I'd be happy to make this PR.  Thanks for
> showing
> > > that issue.  I'll be sure to tag any contribution with that ticket
> > number.
> > >
> > > On Tue, Oct 29, 2019 at 9:01 AM Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com> wrote:
> > >
> > >>
> > >> On Mon, 28 Oct 2019 at 22:41, Wes McKinney 
> wrote:
> > >>
> > >>> Adding dev@
> > >>>
> > >>> I don't believe we have APIs yet for plugging in user-defined Array
> > >>> subtypes. I assume you've read
> > >>>
> > >>>
> > >>>
> >
> http://arrow.apache.org/docs/python/extending_types.html#defining-extension-types-user-defined-types
> > >>>
> > >>> There may be some JIRA issues already about this (defining subclasses
> > >>> of pa.Array with custom behavior) -- since Joris has been working on
> > >>> this I'm interested in more comments
> > >>>
> > >>
> > >> Yes, there is https://issues.apache.org/jira/browse/ARROW-6176 for
> > >> exactly this issue.
> > >> What I proposed there is to allow one to subclass
> pyarrow.ExtensionArray
> > >> and to attach this to an attribute on the custom ExtensionType (eg
> > >> __arrow_ext_array_class__ in line with the other __arrow_ext_..
> > >> methods). That should make it possible to achieve similar functionality
> > >> to what is available in Java, I think.
> > >>
> > >> If that seems a good way to do this, I think we certainly welcome a PR
> > >> for that (I can also look into it otherwise before 1.0).
> > >>
> > >> Joris
> > >>
> > >>
> > >>>
> > >>> On Mon, Oct 28, 2019 at 3:56 PM Justin Polchlopek
> > >>>  wrote:
> > >>> >
> > >>> > Hi!
> > >>> >
> > >>> > I've been working through understanding extension types in Arrow.
> > >>> It's a great feature, and I've had no problems getting things working
> > in
> > >>> Java/Scala; however, Python has been a bit of a different story.  Not
> > that
> > >>> I am unable to create and register extension types in Python, but
> > rather
> > >>> that I can't seem to recreate the functionality provided by the Java
> > API's
> > >>> ExtensionTypeVector class.
> > >>> >
> > >>> > In Java, ExtensionType::getNewVector() provides a clear pathway
> from
> > >>> the registered type to output a vector in something other than the
> > >>> underlying vector type, and I am at a loss for how to get this same
> > >>> functionality in Python.  Am I missing something?
> > >>> >
> > >>> > Thanks for any hints.
> > >>> > -Justin
> > >>>
> > >>
> >
>


Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

2019-11-14 Thread Micah Kornfield
OK, anything else to discuss?  Otherwise I'll plan on a new vote with the
original language plus an explicit call-out in the PR that dictionary
replacement isn't supported for the file format.

On Thursday, November 14, 2019, Antoine Pitrou  wrote:

>
> Right.  The dictionaries can be found from the file footer, so it seems ok.
>
> Thank you
>
> Regards
>
> Antoine.
>
>
> Le 14/11/2019 à 07:11, Micah Kornfield a écrit :
> > I'll add for:
> >
> > If so, how does this play with the fact that there potentially are delta
> >> dictionaries in the "stream"?
> >
> > That in this case the important feature is the dictionary batches have an
> > explicit ordering in the file format based on metadata.  So their
> ordering
> > in the "stream" is largely irrelevant.  As Wes pointed out the most
> > convenient implementation for this would have to load all dictionary
> > batches before doing random access (and would be very similar to the
> stream
> > code).
> >
> > Does this make sense?
> >
> >
> > On Tue, Nov 12, 2019 at 2:01 PM Wes McKinney 
> wrote:
> >
> >> Hi Antoine,
> >>
> >> Each *record batch* is intended to be readable in random order. To read
> any
> >> record batch requires loading the dictionaries indicated in the schema,
> so
> >> appending the deltas as part of this process does not seem like it would
> >> introduce hardship given that such logic is needed to properly handle
> the
> >> stream format. Dictionary replacements in the file format (at least as
> >> currently conceived) do not seem possible.
> >>
> >>
> >> On Tue, Nov 12, 2019, 10:13 AM Antoine Pitrou 
> wrote:
> >>
> >>>
> >>> Hi,
> >>>
> >>> Sorry for the delay.
> >>>
> >>> My high-level question is the following:  is the file format intended
> to
> >>> be readable in random order (rather than having to read through it in
> >>> sequence as with the stream format)?  If so, how does this play with
> the
> >>> fact that there potentially are delta dictionaries in the "stream"?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> Le 30/10/2019 à 21:11, Wes McKinney a écrit :
>  Returning to this discussion, as there seems to be a lack of consensus in
>  the vote thread.
> 
>  Copying Micah's proposals in the VOTE thread here, I wanted to state
>  my opinions so we can discuss further and see where there is potential
>  disagreement
> 
>  1.  It is not required that all dictionary batches occur at the
>  beginning of the IPC stream format (if the first record batch has an
>  all-null dictionary-encoded column, the null column's dictionary might
>  not be sent until later in the stream).
> 
>  This seems preferable to requiring a placeholder empty dictionary
>  batch. This does mean more to test but the integration tests will
>  force the issue
> 
>  2.  A second dictionary batch for the same ID that is not a "delta batch"
>  in an IPC stream indicates the dictionary should be replaced.
> 
>  Agree.
> 
>  3.  Clarifies that the file format can only contain 1 "NON-delta"
>  dictionary batch and multiple "delta" dictionary batches.
> 
>  Agree -- it is also worth stating explicitly that dictionary
>  replacements are not allowed in the file format.
> 
>  In the file format, all the dictionaries must be "loaded" up front.
>  The code path for loading the dictionaries ideally should use nearly
>  the same code as the stream-reader code that sees follow-up dictionary
>  batches interspersed in the stream. The only downside is that it will
>  not be possible to exactly preserve the dictionary "state" as of each
>  record batch being written.
> 
>  So if we had a file containing
> 
>  DICTIONARY ID=0
>  RECORD BATCH
>  RECORD BATCH
>  DICTIONARY DELTA ID=0
>  RECORD BATCH
>  RECORD BATCH
> 
>  Then after processing/loading the dictionaries, the first two record
>  batches will have a dictionary that is "larger" (on account of the
>  delta) than when they were written. Since dictionaries are
>  fundamentally about data representation, they still represent the same
>  data so I think this is acceptable.
> 
>  4.  Add an enum to dictionary metadata for possible future changes in
>  what format dictionary batches can be sent (the most likely would be an
>  array Map).  An enum is needed as a placeholder to allow for forward
>  compatibility past the 1.0.0 release.
> 
>  I'm least sure about this but I do not think it is harmful to have a
>  forward-compatible "escape hatch" for future evolutions in dictionary
>  encoding.
> 
>  On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <
> emkornfi...@gmail.com
> >>>
> >>> wrote:
> >
> > I'll plan on starting a vote in the next day or two if there are no
> >>> further
> > objections/comments.
> >
> > On Sun, Oct 13, 2019 at 11:06 AM 

[jira] [Created] (ARROW-7170) [C++] Bundled ORC fails linking

2019-11-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7170:
-

 Summary: [C++] Bundled ORC fails linking
 Key: ARROW-7170
 URL: https://issues.apache.org/jira/browse/ARROW-7170
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


This shows up when building the tests as well:
{code}
[1/2] Linking CXX executable debug/orc-adapter-test
FAILED: debug/orc-adapter-test 
: && /usr/bin/ccache /usr/bin/clang++-7  -Qunused-arguments -fcolor-diagnostics 
-fuse-ld=gold -ggdb -O0  -Wall -Wextra -Wdocumentation 
-Wno-unused-parameter -Wno-unknown-warning-option -Werror 
-Wno-unknown-warning-option -msse4.2 -maltivec  -D_GLIBCXX_USE_CXX11_ABI=1 
-D_GLIBCXX_USE_CXX11_ABI=1 -fno-omit-frame-pointer -g  -rdynamic 
src/arrow/adapters/orc/CMakeFiles/orc-adapter-test.dir/adapter_test.cc.o  -o 
debug/orc-adapter-test  
-Wl,-rpath,/home/antoine/arrow/dev/cpp/build-test/debug:/home/antoine/miniconda3/envs/pyarrow/lib
 /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -lpthread -ldl 
debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
orc_ep-install/lib/liborc.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -ldl 
double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libssl.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libbrotlienc-static.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libbrotlidec-static.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libbrotlicommon-static.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libprotobuf.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.1.0.0 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.1.0.0 -lm 
-lpthread /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a 
mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.0/libmimalloc-debug.a -pthread -lrt 
-Wl,-rpath-link,/home/antoine/miniconda3/envs/pyarrow/lib && :
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:284:
 error: undefined reference to 'deflateInit2_'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:232:
 error: undefined reference to 'deflateReset'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:254:
 error: undefined reference to 'deflate'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:291:
 error: undefined reference to 'deflateEnd'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:405:
 error: undefined reference to 'inflateInit2_'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:430:
 error: undefined reference to 'inflateEnd'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:471:
 error: undefined reference to 'inflateReset'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:477:
 error: undefined reference to 'inflate'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:820:
 error: undefined reference to 'snappy::GetUncompressedLength(char const*, 
unsigned long, unsigned long*)'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:828:
 error: undefined reference to 'snappy::RawUncompress(char const*, unsigned 
long, char*)'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:894:
 error: undefined reference to 'LZ4_decompress_safe'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7169) [C++] Vendor uriparser library

2019-11-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7169:
-

 Summary: [C++] Vendor uriparser library
 Key: ARROW-7169
 URL: https://issues.apache.org/jira/browse/ARROW-7169
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


The [uriparser C library|https://github.com/uriparser/uriparser]  is used 
internally for URI parsing. Instead of having an explicit dependency, we could 
simply vendor it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2019-11-14-0

2019-11-14 Thread Crossbow


Arrow Build Report for Job nightly-2019-11-14-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0

Failed Tasks:
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-homebrew-cpp
- test-conda-python-3.7-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-python-3.7-dask-master
- test-ubuntu-14.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-ubuntu-14.04-cpp
- test-ubuntu-18.04-r-sanitizer:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-ubuntu-18.04-r-sanitizer
- test-ubuntu-fuzzit:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-ubuntu-fuzzit
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-wheel-manylinux2010-cp27m
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-wheel-manylinux2010-cp27mu
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-wheel-manylinux2010-cp35m
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-wheel-manylinux2010-cp36m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-wheel-manylinux2010-cp37m
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-wheel-osx-cp35m

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-linux-gcc-py37
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-osx-clang-py37
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-conda-win-vs2015-py37
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-azure-debian-stretch
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-gandiva-jar-osx
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-gandiva-jar-trusty
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-python-2.7-pandas-master
- test-conda-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-14-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: 

[jira] [Created] (ARROW-7168) pa.array() doesn't respect provided dictionary type with all NaNs

2019-11-14 Thread Thomas Buhrmann (Jira)
Thomas Buhrmann created ARROW-7168:
--

 Summary: pa.array() doesn't respect provided dictionary type with 
all NaNs
 Key: ARROW-7168
 URL: https://issues.apache.org/jira/browse/ARROW-7168
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.1
Reporter: Thomas Buhrmann


This might be related to ARROW-6548 and others dealing with all NaN columns. 
When creating a dictionary array, even when fully specifying the desired type, 
this type is not respected when the data contains only NaNs:


{code:python}
# This may look a little artificial but easily occurs when processing
# categorical data in batches where a particular batch contains only NaNs
ser = pd.Series([None, None]).astype('object').astype('category')
typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
pa.array(ser, type=typ).type
{code}

results in

{noformat}
>> DictionaryType(dictionary)
{noformat}

which means that one cannot e.g. serialize batches of categoricals if the 
possibility of all-NaN batches exists, even when trying to enforce that each 
batch has the same schema (because the schema is not respected).

I understand that inferring the type in this case would be difficult, but I'd 
imagine that a fully specified type should be respected in this case?

In the meantime, is there a workaround to manually create a dictionary array of 
the desired type containing only NaNs?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[Discuss][Java] Appropriate semantics for comparing values in UnionVector

2019-11-14 Thread Fan Liya
Dear all,

The problem arises from the discussion in a PR:
https://github.com/apache/arrow/pull/5544#discussion_r338394941.

We are trying to come up with a proper semantics to compare values in
UnionVectors.

According to the current logic in the code base, two values from two
UnionVectors are compared in two steps:

1. Child vectors for the two UnionVectors are compared, to make sure both
vectors have the same types of child vectors.
2. If step 1 passes, we continue to compare values in the corresponding
slots in the two union vectors.

This is a legitimate equality semantics (being reflexive, symmetric, and
transitive). However, we think it is overly strict for equality
determination, as it compares child vectors first, and this may lead to
unexpected results.

An example related to dictionary-encoding UnionVectors is given: Suppose
our dictionary is a union vector with 3 elements: {Int(0), Long(1),
Byte(2)}. This dictionary vector has 3 child vectors: an IntVector, a
BigIntVector, and a SmallIntVector.

We want to encode another union vector with 2 elements: {Int(0), Byte(2)}.
The encoded vector should be an integer vector {0, 2}.

However, since the vector to encode has only 2 children: an IntVector and a
SmallIntVector, the check for child vectors will always fail, so no value
will be considered equal to any value in the dictionary, and dictionary
encoding will always fail.

So our proposed change is: we no longer compare child vectors, and only
compare value slots for UnionVectors. That is, we compare values in 2
steps too:

1. Make sure the slots in both vectors are of the same type (e.g. both are
IntVectors).
2. Compare values stored in the slots.
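As a rough illustration of this two-step, slot-wise comparison, here is a Java-style sketch (getTypeValue/getObject are used for brevity; a production comparator would dispatch on the concrete child vectors, as the range/type-equals visitors do):

import java.util.Objects;
import org.apache.arrow.vector.complex.UnionVector;

static boolean slotsEqual(UnionVector left, int leftIndex,
                          UnionVector right, int rightIndex) {
  // Step 1: both slots must hold a value of the same minor type.
  if (left.getTypeValue(leftIndex) != right.getTypeValue(rightIndex)) {
    return false;
  }
  // Step 2: compare the values stored in the slots.
  return Objects.equals(left.getObject(leftIndex), right.getObject(rightIndex));
}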

This is the *problem one* we want to discuss. What do you think?

*Problem two* is proposed by Micah Kornfield. Should we consider any of the
following semantics for comparing UnionVectors?

1. Is it OK for unions to compare against any other vector? (For example,
if the value slot of a union vector has type IntVector, is it valid to
compare it with a real IntVector?)
2. Can we compare a dense union vector against a sparse union vector?
3. Is it only OK to compare unions that have the exact same metadata?

Please give your valuable feedback. Thank you in advance.

Best,
Liya Fan


Re: [Java] Question About Vector Allocation

2019-11-14 Thread Fan Liya
Hi Azim,

According to the current API, after filling in some values, you have to set
the value count manually (through the setValueCount method).
Otherwise, the value count remains 0.
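For example, a minimal sketch with a Float4Vector (imports shown; try-with-resources releases the buffers):

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float4Vector;

try (BufferAllocator allocator = new RootAllocator();
     Float4Vector vector = new Float4Vector("v", allocator)) {
  vector.allocateNew(16);                        // expands capacity only
  vector.set(0, 100.5f);
  vector.set(2, 201.5f);
  System.out.println(vector.getValueCount());    // prints 0
  vector.setValueCount(16);                      // must be set explicitly
  System.out.println(vector.getValueCount());    // prints 16
}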

Best,
Liya Fan


On Thu, Nov 14, 2019 at 6:33 PM azim afroozeh  wrote:

> Thanks for your answer. So the valueCount shows the number of values filled
> in the vector.
>
> Then I would like to ask you why the valueCount after setting some values
> is 0? For example: (
>
> https://github.com/apache/arrow/blob/3fbbcdaf77a9e354b6bd07ec1fd1dac005a505c9/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L609
> )
>
>
> System.out.print(vector.getValueCount()); // prints 0
> /* populate the vector */
> vector.set(0, 100.5f);
> vector.set(2, 201.5f);
> vector.set(4, 300.3f);
> vector.set(6, 423.8f);
> vector.set(8, 555.6f);
> vector.set(10, 66.6f);
> vector.set(12, 78.8f);
> vector.set(14, 89.5f);
> System.out.print(vector.getValueCount()); // prints 0
>
>
> If I add these two print lines, they will print 0.
>
>
> Also, if I add the following code to isSet, again some tests fail.
>
> if (valueCount == getValueCapacity()) {
>   return 1;
> }
>
>
>
> Thanks,
>
>
> Azim Afroozeh
>
> On Fri, Nov 8, 2019 at 10:57 AM Fan Liya  wrote:
>
> > Hi Azim,
> >
> > I think we should be aware of two distinct concepts:
> >
> > 1. vector capacity: the max number of values that can be stored in the
> > vector, without reallocation
> > 2. vector length: the number of values actually filled in the vector
> >
> > For any valid vector, we always have vector length <= vector capacity.
> >
> > The allocateNew method expands the vector capacity, but it does not fill
> > in any value, so it does not affect the vector length.
> >
> > For the code above, if the vector length is 0, the value of isSet(index)
> > (where index > 0) should be undefined. So throwing an exception is the
> > correct behavior.
> >
> > Hope this answers your question.
> >
> > Best,
> > Liya Fan
> >
> >
> > On Fri, Nov 8, 2019 at 5:38 PM azim afroozeh 
> wrote:
> >
> > > Hi everyone,
> > >
> > > I have a question about the Java implementation of Apache Arrow. Should
> > we
> > > always call setValueCount after creating a vector with allocateNew()?
> > >
> > > I can see that in some tests setValueCount is called immediately
> > > after allocateNew. For example here:
> > >
> > >
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L285
> > > ,
> > > but not in other tests:
> > >
> > >
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L792
> > > .
> > >
> > > To illustrate the problem more, if I change the isSet(int index)
> > > function as follows:
> > >
> > > public int isSet(int index) {
> > >  if (valueCount == 0) {
> > >  return 0;
> > >  }
> > >  final int byteIndex = index >> 3;
> > >  final byte b = validityBuffer.getByte(byteIndex);
> > >  final int bitIndex = index & 7;
> > >  return (b >> bitIndex) & 0x01;
> > > }
> > >
> > > Many tests will fail, while logically they should not, because if the
> > > valueCount is 0 then the value returned by isSet for every index should
> > > be zero.
> > > The problem comes from the allocateNew method which does not initialize
> > the
> > > valueCount variable.
> > >
> > > One potential solution to this problem is to initialize the valueCount
> > > in allocateNew function, as I did here:
> > >
> > >
> >
> https://github.com/azimafroozeh/arrow/commit/4281613b7ed1370252a155192f12b9bca494dbeb
> > > .
> > > The classes BaseVariableWidthVector and BaseFixedWidthVector both have an
> > > allocateNew function that needs to be changed. Is this an acceptable
> > > approach? Or am I missing some semantics?
> > >
> > > Thanks,
> > >
> > > Azim Afroozeh
> > >
> >
>


Re: [Java] Question About Vector Allocation

2019-11-14 Thread azim afroozeh
Thanks for your answer. So the valueCount shows the number of values filled
in the vector.

Then I would like to ask you why the valueCount after setting some values
is 0? For example: (
https://github.com/apache/arrow/blob/3fbbcdaf77a9e354b6bd07ec1fd1dac005a505c9/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L609
)


System.out.print(vector.getValueCount()); // prints 0
/* populate the vector */
vector.set(0, 100.5f);
vector.set(2, 201.5f);
vector.set(4, 300.3f);
vector.set(6, 423.8f);
vector.set(8, 555.6f);
vector.set(10, 66.6f);
vector.set(12, 78.8f);
vector.set(14, 89.5f);
System.out.print(vector.getValueCount()); // prints 0


If I add these two print lines, they will print 0.


Also, if I add the following code to isSet, again some tests fail.

if (valueCount == getValueCapacity()) {
  return 1;
}



Thanks,


Azim Afroozeh

On Fri, Nov 8, 2019 at 10:57 AM Fan Liya  wrote:

> Hi Azim,
>
> I think we should be aware of two distinct concepts:
>
> 1. vector capacity: the max number of values that can be stored in the
> vector, without reallocation
> 2. vector length: the number of values actually filled in the vector
>
> For any valid vector, we always have vector length <= vector capacity.
>
> The allocateNew method expands the vector capacity, but it does not fill in
> any value, so it does not affect the vector length.
>
> For the code above, if the vector length is 0, the value of isSet(index)
> (where index > 0) should be undefined. So throwing an exception is the
> correct behavior.
>
> Hope this answers your question.
>
> Best,
> Liya Fan
>
>
> On Fri, Nov 8, 2019 at 5:38 PM azim afroozeh  wrote:
>
> > Hi everyone,
> >
> > I have a question about the Java implementation of Apache Arrow. Should
> we
> > always call setValueCount after creating a vector with allocateNew()?
> >
> > I can see that in some tests setValueCount is called immediately
> > after allocateNew. For example here:
> >
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L285
> > ,
> > but not in other tests:
> >
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L792
> > .
> >
> > To illustrate the problem more, if I change the isSet(int index) function
> > as follows:
> >
> > public int isSet(int index) {
> >  if (valueCount == 0) {
> >  return 0;
> >  }
> >  final int byteIndex = index >> 3;
> >  final byte b = validityBuffer.getByte(byteIndex);
> >  final int bitIndex = index & 7;
> >  return (b >> bitIndex) & 0x01;
> > }
> >
> > Many tests will fail, while logically they should not, because if the
> > valueCount is 0 then the value returned by isSet for every index should be zero.
> > The problem comes from the allocateNew method which does not initialize
> the
> > valueCount variable.
> >
> > One potential solution to this problem is to initialize the valueCount
> > in allocateNew function, as I did here:
> >
> >
> https://github.com/azimafroozeh/arrow/commit/4281613b7ed1370252a155192f12b9bca494dbeb
> > .
> > The classes BaseVariableWidthVector and BaseFixedWidthVector both have an
> > allocateNew function that needs to be changed. Is this an acceptable
> > approach? Or am I missing some semantics?
> >
> > Thanks,
> >
> > Azim Afroozeh
> >
>


[jira] [Created] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions

2019-11-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7167:


 Summary: [CI][Python] Add nightly tests for older pandas versions 
to Github Actions
 Key: ARROW-7167
 URL: https://issues.apache.org/jira/browse/ARROW-7167
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)