[jira] [Created] (ARROW-9010) [Java] Framework and interface changes for RecordBatch IPC buffer compression
Liya Fan created ARROW-9010: --- Summary: [Java] Framework and interface changes for RecordBatch IPC buffer compression Key: ARROW-9010 URL: https://issues.apache.org/jira/browse/ARROW-9010 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan This is the first sub-work item of ARROW-8672 ([Java] Implement RecordBatch IPC buffer compression from ARROW-300). However, it does not involve any concrete compression algorithms. The purpose of this PR is to establish basic interfaces for data compression, and to change the IPC framework so that different compression algorithms can be plugged in smoothly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8973) [Java] Support batch value appending for large varchar/varbinary vectors
Liya Fan created ARROW-8973: --- Summary: [Java] Support batch value appending for large varchar/varbinary vectors Key: ARROW-8973 URL: https://issues.apache.org/jira/browse/ARROW-8973 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Support appending values in batch for LargeVarCharVector/LargeVarBinaryVector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8972) [Java] Support range value comparison for large varchar/varbinary vectors
Liya Fan created ARROW-8972: --- Summary: [Java] Support range value comparison for large varchar/varbinary vectors Key: ARROW-8972 URL: https://issues.apache.org/jira/browse/ARROW-8972 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Support comparing a range of values for LargeVarCharVector and LargeVarBinaryVector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8940) [Java] Fix the performance degradation of integration tests
Liya Fan created ARROW-8940: --- Summary: [Java] Fix the performance degradation of integration tests Key: ARROW-8940 URL: https://issues.apache.org/jira/browse/ARROW-8940 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan In the past, we ran integration tests from main methods; recently, we changed this to run them through the failsafe plugin. This is a good change, but it also leads to significant performance degradation. It used to take about 10s to run {{ITTestLargeVector#testLargeDecimalVector}}; now it takes more than half an hour. Our investigation shows that the problem is caused by calling {{HistoricalLog#recordEvent}} repeatedly. This method is called only when {{BaseAllocator#DEBUG}} is enabled, and in a unit/integration test the flag is enabled by default. We solve the problem with the following steps:
1. We set a system property to disable the {{BaseAllocator#DEBUG}} flag.
2. We change the logic so that the system property takes precedence over the {{AssertionUtil#isAssertionsEnabled}} method.
This makes the integration tests as fast as before.
[jira] [Created] (ARROW-8771) [C++] Add boost/process library to build support
Liya Fan created ARROW-8771: --- Summary: [C++] Add boost/process library to build support Key: ARROW-8771 URL: https://issues.apache.org/jira/browse/ARROW-8771 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Liya Fan Some of our test source code requires the process.hpp file (and its dependent libraries). Our current build support does not include these files, causing build failures like: fatal error: boost/process.hpp: No such file or directory -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8761) [C++] Improve the performance of minmax kernel
Liya Fan created ARROW-8761: --- Summary: [C++] Improve the performance of minmax kernel Key: ARROW-8761 URL: https://issues.apache.org/jira/browse/ARROW-8761 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Liya Fan Assignee: Liya Fan We improve the performance of the min-max kernel with a simple idea: if the current value is smaller than the current min value, there is no need to compare it against the current max value, because it must also be smaller than the current max value. This simple trick reduces the expected number of comparisons from 2n to 1.5n, which can be notable for large arrays.
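The idea can be sketched as follows. This is a minimal Java illustration of the trick only, not the C++ kernel; the class and method names are hypothetical, and null handling is left out.

```java
final class MinMaxSketch {
  /** Returns {min, max} of a non-empty array. */
  static int[] minMax(int[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("empty input");
    }
    int min = values[0];
    int max = values[0];
    for (int i = 1; i < values.length; i++) {
      int v = values[i];
      if (v < min) {
        // v < min <= max, so v cannot be a new max: skip that comparison.
        min = v;
      } else if (v > max) {
        max = v;
      }
    }
    return new int[] {min, max};
  }
}
```

The second comparison is only reached when the first one fails, which is where the saving comes from.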
[jira] [Created] (ARROW-8481) [Java] Provide an allocation manager based on Unsafe API
Liya Fan created ARROW-8481: --- Summary: [Java] Provide an allocation manager based on Unsafe API Key: ARROW-8481 URL: https://issues.apache.org/jira/browse/ARROW-8481 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan This is in response to the discussion in https://github.com/apache/arrow/pull/6323#issuecomment-614195070 In this issue, we provide an allocation manager that is capable of allocating large (> 2GB) buffers. In addition, it does not depend on the netty library, which aligns with the general trend of removing netty dependencies. In the future, we are going to make it the default allocation manager.
[jira] [Created] (ARROW-8468) [Document] Fix the incorrect null bits description
Liya Fan created ARROW-8468: --- Summary: [Document] Fix the incorrect null bits description Key: ARROW-8468 URL: https://issues.apache.org/jira/browse/ARROW-8468 Project: Apache Arrow Issue Type: Bug Components: Documentation Reporter: Liya Fan Assignee: Liya Fan The description of the null bits in arrays.rst is incorrect.
[jira] [Created] (ARROW-8402) [Java] Support ValidateFull methods in Java
Liya Fan created ARROW-8402: --- Summary: [Java] Support ValidateFull methods in Java Key: ARROW-8402 URL: https://issues.apache.org/jira/browse/ARROW-8402 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We need to support ValidateFull methods in Java, just like we do in C++. This is required by ARROW-5926. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison
Liya Fan created ARROW-8392: --- Summary: [Java] Fix overflow related corner cases for vector value comparison Key: ARROW-8392 URL: https://issues.apache.org/jira/browse/ARROW-8392 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan 1. Fix corner cases related to overflow. 2. Provide test cases for the corner cases. -- This message was sent by Atlassian Jira (v8.3.4#803005)
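The issue does not say which corner cases are meant. As an assumption, one common overflow pitfall in value comparison is comparing by subtraction; the hypothetical sketch below contrasts it with an overflow-safe comparison.

```java
final class OverflowSafeCompare {
  /** Broken for operands far apart: the difference can wrap around. */
  static int compareBySubtraction(int a, int b) {
    return a - b;
  }

  /** Safe: never computes a difference, so nothing can overflow. */
  static int compareSafely(int a, int b) {
    return Integer.compare(a, b);
  }
}
```

With `a = Integer.MIN_VALUE` and `b = 1`, the subtraction wraps to a positive number and reports the wrong order, while `Integer.compare` returns a negative result.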
[jira] [Created] (ARROW-8230) [Java] Move Netty memory manager into a separate module
Liya Fan created ARROW-8230: --- Summary: [Java] Move Netty memory manager into a separate module Key: ARROW-8230 URL: https://issues.apache.org/jira/browse/ARROW-8230 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Move Netty memory manager into a separate module such that the basic allocator does not depend on it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8229) [Java] Move ArrowBuf into the Arrow package
Liya Fan created ARROW-8229: --- Summary: [Java] Move ArrowBuf into the Arrow package Key: ARROW-8229 URL: https://issues.apache.org/jira/browse/ARROW-8229 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan After ARROW-7505 and ARROW-7935 are done, we are ready to move ArrowBuf into Arrow's package, and make it independent of Netty library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8169) [Java] Improve the performance of JDBC adapter by allocating memory proactively
Liya Fan created ARROW-8169: --- Summary: [Java] Improve the performance of JDBC adapter by allocating memory proactively Key: ARROW-8169 URL: https://issues.apache.org/jira/browse/ARROW-8169 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan The current implementation uses {{setSafe}} methods to dynamically allocate memory if necessary. For fixed width vectors (which are frequently used in JDBC), however, we can allocate memory proactively, since the vector size is known as a configuration parameter. So for fixed width vectors, we can use {{set}} methods instead. This change leads to two benefits:
1. When processing each value, we no longer have to check the vector capacity and reallocate memory if needed. This leads to better performance.
2. If we allow the memory to expand automatically (each time by 2x), the amount of memory usually ends up being more than necessary. By allocating memory according to the configuration parameter, we allocate no more and no less than needed.
Benchmark results show notable performance improvements:
Before:
Benchmark                               Mode  Cnt    Score   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt    5  521.700 ± 4.837  us/op
After:
Benchmark                               Mode  Cnt    Score   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt    5  430.523 ± 9.932  us/op
[jira] [Created] (ARROW-8121) [Java] Enhance code style checking for Java code (add space after commas, semi-colons and type casts)
Liya Fan created ARROW-8121: --- Summary: [Java] Enhance code style checking for Java code (add space after commas, semi-colons and type casts) Key: ARROW-8121 URL: https://issues.apache.org/jira/browse/ARROW-8121 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan This is in response to a discussion in https://github.com/apache/arrow/pull/6039#discussion_r375161992 We found that the current style checking for Java code is not sufficient. So we want to enhance it in a series of "small" steps, in order to avoid having to change too many files at once. In this issue, we add spaces after commas, semi-colons and type casts.
[jira] [Created] (ARROW-8108) [Java] Extract a common interface for dictionary encoders
Liya Fan created ARROW-8108: --- Summary: [Java] Extract a common interface for dictionary encoders Key: ARROW-8108 URL: https://issues.apache.org/jira/browse/ARROW-8108 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan In this issue, we extract a common interface from the existing dictionary encoders. This can be useful for scenarios where the client does not care about the encoder implementations.
[jira] [Created] (ARROW-8009) [Java] Fix the hash code methods for BitVector
Liya Fan created ARROW-8009: --- Summary: [Java] Fix the hash code methods for BitVector Key: ARROW-8009 URL: https://issues.apache.org/jira/browse/ARROW-8009 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan The current hash code methods of BitVector are based on the implementations in BaseFixedWidthVector, which rely on the type width of the vector. For BitVector, the type width is 0, so the underlying data is not actually used when computing the hash code. That means the hash code is always 0, regardless of whether the value is null, and regardless of whether the underlying bit is 0 or 1. We fix this by overriding the methods in BitVector.
[jira] [Created] (ARROW-7955) [Java] Support large buffer for file/stream IPC
Liya Fan created ARROW-7955: --- Summary: [Java] Support large buffer for file/stream IPC Key: ARROW-7955 URL: https://issues.apache.org/jira/browse/ARROW-7955 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan After supporting 64-bit ArrowBuf, we need to make file/stream IPC work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7935) [Java] Remove Netty dependency for BufferAllocator and ReferenceManager
Liya Fan created ARROW-7935: --- Summary: [Java] Remove Netty dependency for BufferAllocator and ReferenceManager Key: ARROW-7935 URL: https://issues.apache.org/jira/browse/ARROW-7935 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan With previous work (ARROW-7329 and ARROW-7505), Netty based allocation is only one of the possible implementations. So we need to revise BufferAllocator and ReferenceManager to make them general and independent of Netty libraries.
[jira] [Created] (ARROW-7746) [Java] Support large buffer for IPC
Liya Fan created ARROW-7746: --- Summary: [Java] Support large buffer for IPC Key: ARROW-7746 URL: https://issues.apache.org/jira/browse/ARROW-7746 Project: Apache Arrow Issue Type: Task Components: Java Reporter: Liya Fan The motivation is described in https://github.com/apache/arrow/pull/6323#issuecomment-580137629. When the size of the ArrowBuf exceeds 2GB, our Flight library does not work due to integer overflow. This is because internally, we have used some data structures which are based on 32-bit integers. To resolve the problem, we must revise/replace the data structures to make them support 64-bit integers. As a concrete example, when the server sends data through IPC, an org.apache.arrow.flight.ArrowMessage object is created and wrapped as an InputStream through the `asInputStream` method. In this method, we use data structures like java.io.ByteArrayOutputStream and io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit integers).
[jira] [Created] (ARROW-7699) [Java] Support concating dense union vectors in batch
Liya Fan created ARROW-7699: --- Summary: [Java] Support concating dense union vectors in batch Key: ARROW-7699 URL: https://issues.apache.org/jira/browse/ARROW-7699 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan After supporting the dense union vector, we need to support concating dense union vectors in batch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7616) [Java] Support comparing value ranges for dense union vector
Liya Fan created ARROW-7616: --- Summary: [Java] Support comparing value ranges for dense union vector Key: ARROW-7616 URL: https://issues.apache.org/jira/browse/ARROW-7616 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan After we support dense union vectors, we should support range value comparisons for them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7506) [Java] JMH benchmarks should be called from main methods
Liya Fan created ARROW-7506: --- Summary: [Java] JMH benchmarks should be called from main methods Key: ARROW-7506 URL: https://issues.apache.org/jira/browse/ARROW-7506 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan Some benchmarks are called as unit tests in our current code base. They should be called from main methods, because:
1. This is the recommended way of writing JMH benchmarks. The automatically generated benchmarks are called from main, and the sample benchmarks provided by JMH [1] are also called from main.
2. Some compilers do not support calling JMH as a unit test. For example, "javac with error prone" reports the following error: Error:(100, 15) java: [JUnit4TearDownNotRun] tearDown() method will not be run; please add JUnit's @After annotation (see https://errorprone.info/bugpattern/JUnit4TearDownNotRun) Did you mean '@After'?
3. When a benchmark is run as a unit test, the enable-assertions flag is turned on by default, so some test/debug operations are performed. This distorts the benchmark results. A related discussion can be found in [2].
[1] https://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
[2] https://github.com/apache/arrow/pull/5842#issuecomment-558082914
[jira] [Created] (ARROW-7505) [Java] Remove Netty dependency for ArrowBuf
Liya Fan created ARROW-7505: --- Summary: [Java] Remove Netty dependency for ArrowBuf Key: ARROW-7505 URL: https://issues.apache.org/jira/browse/ARROW-7505 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan This is part of the first step of issue ARROW-4526. In this step, we remove netty dependency for ArrowBuf, BufferAllocator and ReferenceManager. In this issue, we remove the dependency for ArrowBuf. The task for BufferAllocator and ReferenceManager will not start until ARROW-7329 is finished. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7491) [Java] Improve the performance of aligning
Liya Fan created ARROW-7491: --- Summary: [Java] Improve the performance of aligning Key: ARROW-7491 URL: https://issues.apache.org/jira/browse/ARROW-7491 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Aligning is an important and frequent operation when writing IPC data. It writes no more than 7 zero bytes to the output. The current implementation creates a new byte array each time, which adds performance overhead and increases GC pressure. We improve it by means of a shared byte array. Benchmark evaluation shows a 10% performance gain.
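A minimal sketch of the shared-array approach, assuming 8-byte alignment and a plain `OutputStream`. The class and method names are hypothetical and are not the actual Arrow API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class AlignSketch {
  // Shared, read-only source of zero bytes; never modified after creation.
  private static final byte[] ZEROS = new byte[8];

  /**
   * Pads the stream with zero bytes up to the next 8-byte boundary.
   * Returns the number of padding bytes written (0 to 7).
   */
  static int align(OutputStream out, long position) {
    int padding = (int) ((8 - (position & 7)) & 7);
    try {
      // No per-call allocation: write a slice of the shared array.
      out.write(ZEROS, 0, padding);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return padding;
  }
}
```

Because the shared array is only ever read, it can be used from multiple threads without synchronization.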
[jira] [Created] (ARROW-7469) [C++] Improve division related bit operations
Liya Fan created ARROW-7469: --- Summary: [C++] Improve division related bit operations Key: ARROW-7469 URL: https://issues.apache.org/jira/browse/ARROW-7469 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Liya Fan Assignee: Liya Fan Improve some operations in bit_util:
1. Eliminate one division for CeilDiv
2. Avoid overflow for RoundUp
3. Add a utility for CeilDiv(value, 8)
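The three points can be illustrated with a hedged Java sketch. It is not the actual bit_util code: the names are hypothetical, and it assumes a non-negative value and a positive divisor.

```java
final class BitUtilSketch {
  /** Ceiling division with a single division (no separate modulo). */
  static long ceilDiv(long value, long divisor) {
    return value == 0 ? 0 : 1 + (value - 1) / divisor;
  }

  /**
   * Rounds value up to a multiple of factor. Unlike
   * (value + factor - 1) / factor * factor, no intermediate sum is formed,
   * so it only overflows when the result itself does not fit.
   */
  static long roundUp(long value, long factor) {
    return ceilDiv(value, factor) * factor;
  }

  /** Specialization of ceilDiv(bits, 8) using shifts and masks. */
  static long bytesForBits(long bits) {
    return (bits >> 3) + ((bits & 7) != 0 ? 1 : 0);
  }
}
```

For example, `roundUp(Long.MAX_VALUE - 1, 3)` returns `Long.MAX_VALUE - 1` here, whereas the add-then-divide form would wrap around in the intermediate sum.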
[jira] [Created] (ARROW-7437) [Java] ReadChannel#readFully does not set writer index correctly
Liya Fan created ARROW-7437: --- Summary: [Java] ReadChannel#readFully does not set writer index correctly Key: ARROW-7437 URL: https://issues.apache.org/jira/browse/ARROW-7437 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan The writer index should be incremented by the amount of data actually read. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7429) [Java] Enhance code style checking for Java code (remove consecutive spaces)
Liya Fan created ARROW-7429: --- Summary: [Java] Enhance code style checking for Java code (remove consecutive spaces) Key: ARROW-7429 URL: https://issues.apache.org/jira/browse/ARROW-7429 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan This issue is opened in response to a discussion in https://github.com/apache/arrow/pull/5861#discussion_r348917065. We found that the current style checking for Java code is not sufficient. So we want to enhance it in a series of "small" steps, in order to avoid having to change too many files at once. In this issue, we remove consecutive spaces between tokens, so that tokens are separated by single spaces.
[jira] [Created] (ARROW-7400) [Java] Avoids the worst case for quick sort
Liya Fan created ARROW-7400: --- Summary: [Java] Avoids the worst case for quick sort Key: ARROW-7400 URL: https://issues.apache.org/jira/browse/ARROW-7400 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan This issue is in response to a discussion in: https://github.com/apache/arrow/pull/5540#discussion_r329487232. The quick sort algorithm can degenerate to an O(n^2) algorithm if the pivot is selected poorly. This is an important problem, because the worst case can happen when the input vector is already sorted, which is frequently encountered in practice. After some investigation, we solve the problem with a simple but effective approach: take 3 samples and choose the median (with at most 3 comparisons) as the pivot. With this approach, an already sorted vector is sorted in O(nlogn) time.
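A minimal sketch of median-of-three pivot selection on a plain int array. The choice of samples (first, middle, last), the Lomuto partition, and all names are assumptions for illustration, not the Arrow sorter.

```java
final class MedianOfThreeSketch {
  /** Returns the index of the median of a[low], a[mid], a[high]; at most 3 comparisons. */
  static int choosePivot(int[] a, int low, int high) {
    int mid = low + (high - low) / 2;
    if (a[low] < a[mid]) {
      if (a[mid] < a[high]) return mid;       // low < mid < high
      return a[low] < a[high] ? high : low;   // high <= mid, pick the larger of low/high
    } else {
      if (a[low] < a[high]) return low;       // mid <= low < high
      return a[mid] < a[high] ? high : mid;   // high <= low, pick the larger of mid/high
    }
  }

  static void sort(int[] a, int low, int high) {
    if (low >= high) return;
    int p = choosePivot(a, low, high);
    int pivot = a[p];
    swap(a, p, high);                         // park the pivot at the end
    int store = low;
    for (int i = low; i < high; i++) {
      if (a[i] < pivot) swap(a, i, store++);
    }
    swap(a, store, high);                     // move the pivot to its final position
    sort(a, low, store - 1);
    sort(a, store + 1, high);
  }

  private static void swap(int[] a, int i, int j) {
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
  }
}
```

On sorted input, the median of the three samples is the middle element, so each partition splits the range roughly in half.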
[jira] [Created] (ARROW-7349) [C++] Fix the bug of parsing string hex values
Liya Fan created ARROW-7349: --- Summary: [C++] Fix the bug of parsing string hex values Key: ARROW-7349 URL: https://issues.apache.org/jira/browse/ARROW-7349 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Liya Fan Assignee: Liya Fan std::lower_bound returns the end of the search range, when failing to find a match. The end of the search range is one position after the last valid position. So the value in this position is undefined, and we should not reference the value here to compare it with the target value. -- This message was sent by Atlassian Jira (v8.3.4#803005)
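The required check can be sketched in Java with a hand-written lower bound that mirrors the contract of std::lower_bound. The hex digit table and all names are hypothetical. In Java, reading the end position would throw an exception rather than being undefined, but the guard is the same.

```java
final class HexLookupSketch {
  // Sorted in character order: '0'..'9' come before 'A'..'F'.
  private static final char[] HEX_DIGITS = "0123456789ABCDEF".toCharArray();

  /** Index of the first element >= target, or sorted.length if there is none. */
  static int lowerBound(char[] sorted, char target) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Returns the value of an upper-case hex digit, or -1 if c is not one. */
  static int hexValue(char c) {
    int pos = lowerBound(HEX_DIGITS, c);
    // Check for the end position BEFORE reading the element at pos.
    if (pos == HEX_DIGITS.length || HEX_DIGITS[pos] != c) {
      return -1;
    }
    return pos;
  }
}
```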
[jira] [Created] (ARROW-7301) [Java] Sql type DATE should correspond to DateDayVector
Liya Fan created ARROW-7301: --- Summary: [Java] Sql type DATE should correspond to DateDayVector Key: ARROW-7301 URL: https://issues.apache.org/jira/browse/ARROW-7301 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan According to SQL convention, the SQL type DATE should correspond to a format of YYYY-MM-DD, without the components for hour/minute/second/millis. Therefore, the JDBC type DATE should correspond to DateDayVector, with a type width of 4 instead of 8.
[jira] [Created] (ARROW-7277) [Document] Add discussion about vector lifecycle
Liya Fan created ARROW-7277: --- Summary: [Document] Add discussion about vector lifecycle Key: ARROW-7277 URL: https://issues.apache.org/jira/browse/ARROW-7277 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan As discussed in https://issues.apache.org/jira/browse/ARROW-7254?focusedCommentId=16983284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983284, we need a discussion about the lifecycle of a vector. Each vector has a lifecycle, and different operations should be performed in particular phases of the lifecycle. If we violate this, some unexpected results may be produced. This may cause some confusion for Arrow users. So we want to add a new section to the prose document, to make it clear and explicit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7216) [Java] Improve the performance of setting/clearing individual bits
Liya Fan created ARROW-7216: --- Summary: [Java] Improve the performance of setting/clearing individual bits Key: ARROW-7216 URL: https://issues.apache.org/jira/browse/ARROW-7216 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Setting/clearing individual bits are key operations for Arrow. In this issue, we improve the performance of these operations by:
1. replacing arithmetic operations with bit-wise operations
2. removing unnecessary casts between int/byte
3. providing a new API to remove the if branch
Benchmark results show that for clearing a bit, the performance improves by 11%, and for the general set/clear operation, the performance improves by 4.7%:
before:
BitVectorHelperBenchmarks.setValidityBitBenchmark        avgt  5  4.524 ± 0.015  us/op
after:
BitVectorHelperBenchmarks.setValidityBitBenchmark        avgt  5  4.313 ± 0.011  us/op
BitVectorHelperBenchmarks.setValidityBitToZeroBenchmark  avgt  5  4.020 ± 0.016  us/op
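The bit-wise style described above might look like the sketch below. It works on a plain byte array rather than an ArrowBuf, and the names are hypothetical. Dedicated set and clear methods mean the caller does not branch on the bit value.

```java
final class BitSketch {
  /** Sets bit 'index' to 1: index >> 3 replaces index / 8, index & 7 replaces index % 8. */
  static void setBit(byte[] buf, int index) {
    buf[index >> 3] |= 1 << (index & 7);
  }

  /** Clears bit 'index' to 0 without testing the value first. */
  static void clearBit(byte[] buf, int index) {
    buf[index >> 3] &= ~(1 << (index & 7));
  }

  static boolean getBit(byte[] buf, int index) {
    return ((buf[index >> 3] >> (index & 7)) & 1) != 0;
  }
}
```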
[jira] [Created] (ARROW-7213) [Java] Represent a data element of a vector as a tree of ArrowBufPointer
Liya Fan created ARROW-7213: --- Summary: [Java] Represent a data element of a vector as a tree of ArrowBufPointer Key: ARROW-7213 URL: https://issues.apache.org/jira/browse/ARROW-7213 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan For a fixed/variable width vector, each of its data elements can be represented as an ArrowBufPointer object, which represents a contiguous memory segment. This makes many tasks easier and more efficient (without memory copy): calculating hash codes, comparing values, etc. This cannot be achieved for complex vectors, because their values often reside in more than one contiguous memory region. However, the contiguous memory regions for each data element form a tree-like structure, whose leaf nodes are the contiguous memory regions. For example, a data element of a struct vector forms a tree whose root corresponds to the struct vector, while the child vectors correspond to the child nodes of the tree root. In this issue, we provide a data structure that represents each data element of a vector as a tree, whose leaf nodes are ArrowBufPointers representing contiguous memory regions for the data element. With this data structure, many tasks also become easier and more efficient: calculating hash codes, comparing vector elements (ordering & equality). In addition, we can do things that could not be done in the past, like placing data elements into a hash table/hash set.
[jira] [Created] (ARROW-7177) [Java] Provide a utility to improve the performance of vector loading/unloading
Liya Fan created ARROW-7177: --- Summary: [Java] Provide a utility to improve the performance of vector loading/unloading Key: ARROW-7177 URL: https://issues.apache.org/jira/browse/ARROW-7177 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Vector loading/unloading transforms a set of vectors to and from a set of buffers with metadata. It is heavily used in flight/IPC. In the loading/unloading operations, only the number of type buffers is really needed. However, the current code logic gets a copy of the type buffers, which is not necessary. In this issue, we provide a utility to get the number of type buffers, given an arrow type. It improves the performance by:
1. avoiding creating objects unnecessarily.
2. avoiding list copying for vector unloading (which calls TypeLayout#getBufferTypes).
[jira] [Created] (ARROW-7166) [Java] Remove redundant code for Jdbc adapters
Liya Fan created ARROW-7166: --- Summary: [Java] Remove redundant code for Jdbc adapters Key: ARROW-7166 URL: https://issues.apache.org/jira/browse/ARROW-7166 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan As discussed in https://github.com/apache/arrow/pull/5508#issuecomment-543011016, we need a separate issue to extract common logic to a common super class. This makes the code clearer, and we need to make sure we have no performance regression. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7106) [Java] Fix the problem that flight perf test hangs endlessly
Liya Fan created ARROW-7106: --- Summary: [Java] Fix the problem that flight perf test hangs endlessly Key: ARROW-7106 URL: https://issues.apache.org/jira/browse/ARROW-7106 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan Flight performance test (org.apache.arrow.flight.perf.TestPerf) is an important tool for tracking the current throughput of IPC. In this issue, we improve it in two ways: 1. We fix the problem that the test hangs endlessly after all runs have been finished. This is because the thread pool is not released. 2. We add a summary to the output report, so that we can easily evaluate the overall results for all runs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7098) [Java] Improve the performance of comparing two memory blocks
Liya Fan created ARROW-7098: --- Summary: [Java] Improve the performance of comparing two memory blocks Key: ARROW-7098 URL: https://issues.apache.org/jira/browse/ARROW-7098 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan We often use the 8-4-1 paradigm to compare two blocks of memory:
1. First compare by 8-byte blocks in a loop
2. Then compare by 4-byte blocks in a loop
3. Last compare by 1-byte blocks in a loop
It can be proved that the second loop runs at most once: after the first loop, fewer than 8 bytes remain. So we can replace the loop with an if statement, which saves us a comparison and two jump operations. According to the discussion in https://github.com/apache/arrow/pull/5508#discussion_r343973982, loops can be expensive.
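A hedged sketch of the 8-4-1 comparison with the middle loop replaced by an if statement. It uses `java.nio.ByteBuffer` as a stand-in for Arrow memory, and the names are hypothetical.

```java
import java.nio.ByteBuffer;

final class MemoryCompareSketch {
  /** Returns true if the first 'length' bytes of both buffers are equal. */
  static boolean equal(ByteBuffer left, ByteBuffer right, int length) {
    int i = 0;
    // Step 1: 8-byte blocks in a loop.
    while (length - i >= 8) {
      if (left.getLong(i) != right.getLong(i)) return false;
      i += 8;
    }
    // Step 2: fewer than 8 bytes remain, so at most one 4-byte block fits.
    if (length - i >= 4) {
      if (left.getInt(i) != right.getInt(i)) return false;
      i += 4;
    }
    // Step 3: at most 3 trailing bytes.
    while (i < length) {
      if (left.get(i) != right.get(i)) return false;
      i++;
    }
    return true;
  }
}
```

The absolute `getLong`/`getInt`/`get` calls do not move the buffer positions, so the buffers can be reused afterwards.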
[jira] [Created] (ARROW-7073) [Java] Support concating vectors values in batch
Liya Fan created ARROW-7073: --- Summary: [Java] Support concating vectors values in batch Key: ARROW-7073 URL: https://issues.apache.org/jira/browse/ARROW-7073 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We need a way to copy vector values in batch. Currently, we have copyFrom and copyFromSafe APIs. However, they are not enough, as copying values individually is not performant. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7072) [Java] Support concating validity bits efficiently
Liya Fan created ARROW-7072: --- Summary: [Java] Support concating validity bits efficiently Key: ARROW-7072 URL: https://issues.apache.org/jira/browse/ARROW-7072 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan For scenarios where we need to concatenate vectors (like the scenario in ARROW-7048, and delta dictionaries), we need a way to concatenate validity bits. Currently, we have a bit-level API to read/write individual validity bits. However, it is not efficient, and we need a way to copy more bits at a time.
[jira] [Created] (ARROW-7020) [Java] Fix the bugs when calculating vector hash code
Liya Fan created ARROW-7020: --- Summary: [Java] Fix the bugs when calculating vector hash code Key: ARROW-7020 URL: https://issues.apache.org/jira/browse/ARROW-7020 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan When calculating the hash code for a value in the vector, the validity bit must be taken into account. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7019) [Java] Improve the performance of loading validity buffers
Liya Fan created ARROW-7019: --- Summary: [Java] Improve the performance of loading validity buffers Key: ARROW-7019 URL: https://issues.apache.org/jira/browse/ARROW-7019 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan On the receiver side of Flight, loading the validity buffer is an important operation, as each vector has a validity buffer. For non-nullable vectors, the current implementation of loading the validity buffer is inefficient. We improve the performance of this operation by efficiently setting the bits of a memory region to 1. Benchmark results show that the changes lead to a 35% performance improvement:
Before:
BitVectorHelperBenchmarks.loadValidityBufferAllOne  avgt  5  748.916 ± 23.290  ns/op
After:
BitVectorHelperBenchmarks.loadValidityBufferAllOne  avgt  5  487.352 ± 15.046  ns/op
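One way to set a run of bits to 1 efficiently is to fill whole bytes at once and mask only the last partial byte. The sketch below assumes a plain byte array and hypothetical names; the actual change may work differently.

```java
import java.util.Arrays;

final class ValiditySketch {
  /** Builds a validity bitmap in which the first valueCount bits are 1. */
  static byte[] allValid(int valueCount) {
    int fullBytes = valueCount >> 3;
    int remainder = valueCount & 7;
    byte[] buf = new byte[fullBytes + (remainder != 0 ? 1 : 0)];
    // Whole bytes are set in bulk instead of bit by bit.
    Arrays.fill(buf, 0, fullBytes, (byte) 0xFF);
    if (remainder != 0) {
      // Only the low 'remainder' bits of the last byte are set.
      buf[fullBytes] = (byte) ((1 << remainder) - 1);
    }
    return buf;
  }
}
```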
[jira] [Created] (ARROW-6935) [Java] Improve the performance of comparing two blocks of heap data
Liya Fan created ARROW-6935: --- Summary: [Java] Improve the performance of comparing two blocks of heap data Key: ARROW-6935 URL: https://issues.apache.org/jira/browse/ARROW-6935 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Implement methods to compare data word by word, instead of byte by byte. Benchmarks show a 4.5x performance improvement:
ByteFunctionHelpersBenchmarks.builtInByteArrayEquals  avgt  5  437.504 ± 1.120  ns/op
ByteFunctionHelpersBenchmarks.byteArrayEquals         avgt  5   97.700 ± 0.178  ns/op
[jira] [Created] (ARROW-6933) [Java] Support linear dictionary encoder
Liya Fan created ARROW-6933: --- Summary: [Java] Support linear dictionary encoder Key: ARROW-6933 URL: https://issues.apache.org/jira/browse/ARROW-6933 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan For many scenarios, the distribution of dictionary entries is highly skewed. In other words, a few dictionary entries occur much more frequently than others. If we sort the dictionary in non-increasing order of entry frequency, and compare each value to encode against the entries from the beginning of the dictionary, we get the following benefits: 1) We need no extra memory space or data structure. 2) The search is extremely efficient, as we are likely to find a match in the first few entries of the dictionary. This is the basic idea behind the linear dictionary encoder. When the scenario is right (highly skewed dictionary distribution), it outperforms both search based encoders and hash table based encoders. -- This message was sent by Atlassian Jira (v8.3.4#803005)
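The linear scan described in ARROW-6933 can be sketched as follows. This is an illustration only: it uses plain Java lists of strings instead of Arrow vectors, the class name is hypothetical, and it assumes the dictionary has already been sorted by non-increasing frequency.

```java
import java.util.List;

final class LinearDictionaryEncoder {
  /** Encodes each value as its index in the (frequency-sorted) dictionary. */
  static int[] encode(List<String> values, List<String> dictionary) {
    int[] encoded = new int[values.size()];
    for (int i = 0; i < values.size(); i++) {
      int index = -1;
      // Scan from the front: frequent entries match after very few comparisons.
      for (int j = 0; j < dictionary.size(); j++) {
        if (dictionary.get(j).equals(values.get(i))) {
          index = j;
          break;
        }
      }
      if (index < 0) {
        throw new IllegalArgumentException("value not in dictionary: " + values.get(i));
      }
      encoded[i] = index;
    }
    return encoded;
  }
}
```

No hash table or sorted index is built; the only state is the dictionary itself, which is why the approach needs no extra memory.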
[jira] [Created] (ARROW-6911) [Java] Provide composite comparator
Liya Fan created ARROW-6911: --- Summary: [Java] Provide composite comparator Key: ARROW-6911 URL: https://issues.apache.org/jira/browse/ARROW-6911 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan A composite comparator is a sub-class of VectorValueComparator that contains an array of inner comparators, with each comparator corresponding to one column for comparison. It can be used to support sort/comparison operations for VectorSchemaRoot/StructVector. The composite comparator works like this: it first uses the first internal comparator (for the primary sort key) to compare vector values. If it gets a non-zero value, we just return it; otherwise, we use the second comparator to break the tie, and so on, until a non-zero value is produced by some internal comparator, or all internal comparators have been used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
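The tie-breaking logic described in ARROW-6911 can be sketched like this. The IndexComparator interface is a hypothetical stand-in for VectorValueComparator (it compares the rows at two indices), so the code only shows the control flow, not the real Arrow API.

```java
final class CompositeComparatorSketch {
  /** Hypothetical stand-in for VectorValueComparator: compares two row indices. */
  interface IndexComparator {
    int compare(int left, int right);
  }

  /** Builds a comparator that consults the inner comparators in order. */
  static IndexComparator composite(IndexComparator... inner) {
    return (left, right) -> {
      for (IndexComparator comparator : inner) {
        int result = comparator.compare(left, right);
        if (result != 0) {
          // The first comparator that distinguishes the rows decides.
          return result;
        }
      }
      // All inner comparators reported a tie.
      return 0;
    };
  }
}
```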
[jira] [Created] (ARROW-6896) [Java] Vector schema root should not share vectors
Liya Fan created ARROW-6896: --- Summary: [Java] Vector schema root should not share vectors Key: ARROW-6896 URL: https://issues.apache.org/jira/browse/ARROW-6896 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan Vector schema roots should not share vectors. Otherwise, unexpected behavior may occur. Please note that VectorSchemaRoot is not just a container for vectors; it is also a resource (it implements the AutoCloseable interface), and it manages the life cycle of its inner vectors. When two VectorSchemaRoots share vectors, something unexpected may happen. Consider the following scenario, which is frequently encountered in a SQL engine.
1. We create a batch: VectorSchemaRoot oldBatch = ...
2. We add a vector to it, which results in a new batch: VectorSchemaRoot newBatch = oldBatch.addVector(vector);
3. We are done with the old batch, and release the resource: oldBatch.close();
4. We continue to use the new batch, but get an exception, because some inner vectors have been released by the old batch.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6888) [Java] Support copy operation for vector value comparators
Liya Fan created ARROW-6888: --- Summary: [Java] Support copy operation for vector value comparators Key: ARROW-6888 URL: https://issues.apache.org/jira/browse/ARROW-6888 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan In this issue, we provide copy operations for vector value comparators. This operation creates another comparator with the same type and comparison logic. This feature is useful in multi-threading scenarios where multiple threads use the comparator to perform their own tasks. In this scenario, we have no way of making sure the compare method is thread safe, so a safe way is to create a new comparator for each thread. The copy operation will support this. An immediate application of this is the parallel searcher for ordering semantics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6866) [Java] Improve the performance of calculating hash code for struct vector
Liya Fan created ARROW-6866: --- Summary: [Java] Improve the performance of calculating hash code for struct vector Key: ARROW-6866 URL: https://issues.apache.org/jira/browse/ARROW-6866 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Improve the performance of the hashCode(int) method for StructVector: 1. We can get the child vectors directly, so there is no need to get the name from the child vector and then use the name to get the vector. 2. The child vectors cannot be null, so there is no need to check for null. The performance improvement depends on the complexity of the hash algorithm. For computationally intensive hash algorithms, the improvement can be small, while for simple hash algorithms, the improvement can be notable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6865) [Java] Improve the performance of comparing an ArrowBuf against a byte array
Liya Fan created ARROW-6865: --- Summary: [Java] Improve the performance of comparing an ArrowBuf against a byte array Key: ARROW-6865 URL: https://issues.apache.org/jira/browse/ARROW-6865 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan We change the way of comparing an ArrowBuf against a byte array from byte-wise comparison to comparison by long/int/byte. Benchmarks show that there is a 6.7x performance improvement. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6863) [Java] Provide parallel searcher
Liya Fan created ARROW-6863: --- Summary: [Java] Provide parallel searcher Key: ARROW-6863 URL: https://issues.apache.org/jira/browse/ARROW-6863 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan For scenarios where the vector is large and a low response time is required, we need to search the vector in parallel to improve the responsiveness. This issue tries to provide a parallel searcher for the equality semantics (the support for ordering semantics is not ready yet, as we need a way to distribute the comparator). The implementation is based on multi-threading. -- This message was sent by Atlassian Jira (v8.3.4#803005)
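A rough sketch of a multi-threaded search with equality semantics, in the spirit of ARROW-6863. It searches a plain int array rather than an Arrow vector, and the class name, the chunking scheme, and the use of a fixed thread pool are our own assumptions, not the design of the actual searcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSearchSketch {
  /** Returns an index i with data[i] == key, or -1 if there is none. */
  static int search(int[] data, int key, int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      // Split the array into numThreads contiguous chunks.
      int chunk = (data.length + numThreads - 1) / numThreads;
      List<Future<Integer>> results = new ArrayList<>();
      for (int t = 0; t < numThreads; t++) {
        final int start = Math.min(data.length, t * chunk);
        final int end = Math.min(data.length, start + chunk);
        Callable<Integer> task = () -> {
          for (int i = start; i < end; i++) {
            if (data[i] == key) {
              return i;
            }
          }
          return -1;
        };
        results.add(pool.submit(task));
      }
      // Collect results in chunk order, so the lowest matching chunk wins.
      for (Future<Integer> result : results) {
        int index = result.get();
        if (index >= 0) {
          return index;
        }
      }
      return -1;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

Each task only reads the shared array, so no comparator state is shared between threads; this is the part that becomes harder for ordering semantics.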
[jira] [Created] (ARROW-6738) [Java] Fix problems with current union comparison logic
Liya Fan created ARROW-6738: --- Summary: [Java] Fix problems with current union comparison logic Key: ARROW-6738 URL: https://issues.apache.org/jira/browse/ARROW-6738 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan There are some problems with the current union comparison logic. For example: 1. For type check, we should not require fields to be equal. It is possible that two vectors' value ranges are equal but their fields are different. 2. We should not compare the number of sub vectors, as it is possible that two union vectors have different numbers of sub vectors, but have equal values in the range. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6732) [Java] Implement quick sort in a non-recursive way to avoid stack overflow
Liya Fan created ARROW-6732: --- Summary: [Java] Implement quick sort in a non-recursive way to avoid stack overflow Key: ARROW-6732 URL: https://issues.apache.org/jira/browse/ARROW-6732 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan The current quick sort algorithm is implemented recursively. The problem is that in the worst case, the recursion depth is equal to the length of the vector. For large vectors, this will cause a stack overflow. To solve this problem, we implement the quick sort algorithm as a non-recursive algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005)
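A minimal sketch of a non-recursive quick sort, as proposed in ARROW-6732. It sorts a plain int array, and the class name, the Lomuto partition scheme, and the heap-allocated stack of ranges are our own choices for illustration; the actual sorter may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class IterativeQuickSort {
  static void sort(int[] a) {
    // An explicit stack of [low, high] ranges replaces the call stack.
    Deque<int[]> pending = new ArrayDeque<>();
    pending.push(new int[] {0, a.length - 1});
    while (!pending.isEmpty()) {
      int[] range = pending.pop();
      int low = range[0];
      int high = range[1];
      if (low >= high) {
        continue; // ranges with fewer than two elements are already sorted
      }
      int p = partition(a, low, high);
      pending.push(new int[] {low, p - 1});
      pending.push(new int[] {p + 1, high});
    }
  }

  // Lomuto partition with the last element as pivot; returns the pivot's final position.
  private static int partition(int[] a, int low, int high) {
    int pivot = a[high];
    int i = low;
    for (int j = low; j < high; j++) {
      if (a[j] < pivot) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
        i++;
      }
    }
    int tmp = a[i];
    a[i] = a[high];
    a[high] = tmp;
    return i;
  }
}
```

The pending ranges live on the heap, so even the worst case (an already sorted input with this pivot choice) cannot overflow the thread stack, although it still takes quadratic time.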
[jira] [Created] (ARROW-6723) [Java] Reduce the range of synchronized block when releasing an ArrowBuf
Liya Fan created ARROW-6723: --- Summary: [Java] Reduce the range of synchronized block when releasing an ArrowBuf Key: ARROW-6723 URL: https://issues.apache.org/jira/browse/ARROW-6723 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan
When releasing an ArrowBuf, we will run the following piece of code:

  private int decrement(int decrement) {
    allocator.assertOpen();
    final int outcome;
    synchronized (allocationManager) {
      outcome = bufRefCnt.addAndGet(-decrement);
      if (outcome == 0) {
        lDestructionTime = System.nanoTime();
        allocationManager.release(this);
      }
    }
    return outcome;
  }

It can be seen that we need to acquire the allocation manager lock, no matter whether we need to release the buffer. In addition, the operation of decrementing the refcount is only carried out after the lock is acquired. This leads to unnecessary resource contention, and may degrade performance. We propose to change the code like this:

  private int decrement(int decrement) {
    allocator.assertOpen();
    final int outcome;
    outcome = bufRefCnt.addAndGet(-decrement);
    if (outcome == 0) {
      lDestructionTime = System.nanoTime();
      synchronized (allocationManager) {
        allocationManager.release(this);
      }
    }
    return outcome;
  }

Note that this change can be dangerous, as it lies in the core of our code base, so we should be careful with it. On the other hand, it may have non-trivial performance implications. As far as I know, when a distributed task is getting closed, a large number of ArrowBufs will be closed simultaneously. If we reduce the range of the synchronized block, we can significantly improve the performance. What do you think?
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6722) [Java] Provide a uniform way to get vector name
Liya Fan created ARROW-6722: --- Summary: [Java] Provide a uniform way to get vector name Key: ARROW-6722 URL: https://issues.apache.org/jira/browse/ARROW-6722 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Currently, the getName method is defined in BaseValueVector, an abstract class. However, some vectors do not extend BaseValueVector, like StructVector, UnionVector, and ZeroVector. In this issue, we move the method to the ValueVector interface, the base interface for all vectors. This makes it easier to get a vector's name without checking its type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6707) [Java] Improve the performance of JDBC adapters by using nullable information
Liya Fan created ARROW-6707: --- Summary: [Java] Improve the performance of JDBC adapters by using nullable information Key: ARROW-6707 URL: https://issues.apache.org/jira/browse/ARROW-6707 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan JDBC metadata has a field that indicates whether a column can contain nulls. We can make use of this information when transforming JDBC data to Arrow vectors. In particular, if the column cannot contain nulls, there is no need to call the JDBC API for each value to check whether the last value read was null. This will improve the performance of transforming JDBC data to Arrow vectors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6672) [Java] Extract a common interface for dictionary builders
Liya Fan created ARROW-6672: --- Summary: [Java] Extract a common interface for dictionary builders Key: ARROW-6672 URL: https://issues.apache.org/jira/browse/ARROW-6672 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We need a common interface for dictionary builders to support more sophisticated scenarios, like collecting dictionary statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6598) [Java] Sort the code for ApproxEqualsVisitor
Liya Fan created ARROW-6598: --- Summary: [Java] Sort the code for ApproxEqualsVisitor Key: ARROW-6598 URL: https://issues.apache.org/jira/browse/ARROW-6598 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan As a follow up issue of ARROW-6458, we finalize the code for ApproxEqualsVisitor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6580) [Java] Support comparison for unsigned integers
Liya Fan created ARROW-6580: --- Summary: [Java] Support comparison for unsigned integers Key: ARROW-6580 URL: https://issues.apache.org/jira/browse/ARROW-6580 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan In this issue, we support the comparison of unsigned integer vectors, including UInt1Vector, UInt2Vector, UInt4Vector, and UInt8Vector. With support for comparison for these vectors, the sort for them is also supported automatically. -- This message was sent by Atlassian Jira (v8.3.2#803003)
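For ARROW-6580, the core of an unsigned comparison can be sketched with the JDK's own helpers. The class and method names are hypothetical, and the sketch compares raw Java primitives rather than vector elements; it only shows how a signed byte/int/long can be compared as an unsigned value.

```java
final class UnsignedCompare {
  /** Compares two bytes as unsigned 8-bit values (as stored in a UInt1Vector). */
  static int compareUInt1(byte left, byte right) {
    // Masking widens each byte to its unsigned value in the range 0..255.
    return Integer.compare(left & 0xFF, right & 0xFF);
  }

  /** Compares two ints as unsigned 32-bit values. */
  static int compareUInt4(int left, int right) {
    return Integer.compareUnsigned(left, right);
  }

  /** Compares two longs as unsigned 64-bit values. */
  static int compareUInt8(long left, long right) {
    return Long.compareUnsigned(left, right);
  }
}
```

With a signed comparison, the bit pattern of all ones (-1) would sort before 1; with the methods above it sorts after, as the largest unsigned value.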
[jira] [Created] (ARROW-6458) [Java] Improve the performance and code structure for ApproxEqualsVisitor
Liya Fan created ARROW-6458: --- Summary: [Java] Improve the performance and code structure for ApproxEqualsVisitor Key: ARROW-6458 URL: https://issues.apache.org/jira/browse/ARROW-6458 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan As discussed in https://github.com/apache/arrow/pull/5195#issuecomment-526157961, there are some problems with the current way of comparing floating point vectors. We solve them in this PR: 1. There are if statements/duplicated members in ApproxEqualsVisitor, making the code redundant and less clear. 2. The comparison of float4 and float8 values is based on the wrapped objects Float and Double, which may incur a performance penalty. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6420) [Java] Improve the performance of UnionVector when getting underlying vectors
Liya Fan created ARROW-6420: --- Summary: [Java] Improve the performance of UnionVector when getting underlying vectors Key: ARROW-6420 URL: https://issues.apache.org/jira/browse/ARROW-6420 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Getting the underlying vector is a frequent operation for UnionVector. It relies on this operation to get/set data at each index. The current implementation is inefficient. In particular, it first gets the minor type at the given index, and then compares it against all possible minor types in a switch statement, until a match is found. We improve the performance by storing the internal vectors in an array, whose index is the ordinal of the minor type. So given a minor type, its corresponding underlying vector can be obtained in O(1) time. It should be noted that this technique is also applicable to UnionReader and UnionWriter, and support for UnionReader is already implemented. -- This message was sent by Atlassian Jira (v8.3.2#803003)
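The ordinal-indexed lookup described in ARROW-6420 can be sketched as follows. The enum here is a small hypothetical stand-in for Arrow's minor types, and the vectors are stored as plain Objects, so this only illustrates the lookup technique and not the UnionVector code itself.

```java
final class OrdinalLookupSketch {
  /** Hypothetical stand-in for Arrow's minor types. */
  enum MinorType { INT, BIGINT, VARCHAR }

  // One slot per minor type, indexed by the type's ordinal.
  private final Object[] vectorsByType = new Object[MinorType.values().length];

  void put(MinorType type, Object vector) {
    vectorsByType[type.ordinal()] = vector;
  }

  /** Returns the vector registered for the type, or null; a single array access. */
  Object get(MinorType type) {
    return vectorsByType[type.ordinal()];
  }
}
```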
[jira] [Created] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector
Liya Fan created ARROW-6394: --- Summary: [Java] Support conversions between delta vector and partial sum vector Key: ARROW-6394 URL: https://issues.apache.org/jira/browse/ARROW-6394 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan
What is a delta vector/partial sum vector? Given an integer vector a with length n, its partial sum vector is another integer vector b with length n + 1, with values defined as:

  b(0) = initial sum
  b(i) = b(0) + a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another integer vector b with length n, with values defined as:

  b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1

In this issue, we provide utilities to convert between delta vectors and partial sum vectors. It is interesting to note that the two operations correspond to discrete integration and differentiation. These conversions have wide applications. For example, 1. The run-length vector proposed by Micah is based on the partial sum vector, while the deduplication functionality is based on the delta vector. This issue provides conversions between them. 2. The current VarCharVector/VarBinaryVector implementations are based on the partial sum vector. We can transform them to delta vectors before IPC, to reduce network traffic. 3. Converting to a delta vector can be considered a form of data compression. The operation can be applied more than once to further reduce the data volume.
-- This message was sent by Atlassian Jira (v8.3.2#803003)
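The two conversions in ARROW-6394 can be sketched on plain long arrays. The class and method names are hypothetical, and the sketch assumes the delta of a vector a with n + 1 elements is b(i) = a(i + 1) - a(i) for i = 0, ..., n - 1, so that the two operations are inverses of each other.

```java
final class DeltaPartialSum {
  /** Converts n deltas into n + 1 partial sums, starting from initialSum. */
  static long[] toPartialSum(long[] deltas, long initialSum) {
    long[] sums = new long[deltas.length + 1];
    sums[0] = initialSum;
    for (int i = 0; i < deltas.length; i++) {
      sums[i + 1] = sums[i] + deltas[i];
    }
    return sums;
  }

  /** Converts n + 1 partial sums back into n deltas. */
  static long[] toDelta(long[] sums) {
    if (sums.length == 0) {
      throw new IllegalArgumentException("a partial sum vector has at least one element");
    }
    long[] deltas = new long[sums.length - 1];
    for (int i = 0; i < deltas.length; i++) {
      deltas[i] = sums[i + 1] - sums[i];
    }
    return deltas;
  }
}
```

For instance, value lengths {3, 0, 5} with an initial sum of 0 give the offsets {0, 3, 3, 8}, which is the shape of an offset buffer for three variable-width values.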
[jira] [Created] (ARROW-6374) [Java] Refactor the code for TimeXXVectors
Liya Fan created ARROW-6374: --- Summary: [Java] Refactor the code for TimeXXVectors Key: ARROW-6374 URL: https://issues.apache.org/jira/browse/ARROW-6374 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan This is based on the discussion in https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E. The internals of TimeXXVectors are simply IntVector or BigIntVector. There is duplicated code for setting/getting int/long. We want to refactor the code as follows: # Push the get/set methods into the base class BaseFixedWidthVector, and make them protected. # The APIs in TimeXXVectors reference the methods in the base class. Note that this issue not only reduces redundant code, it also centralizes the logic for getting/setting int/long, making it easy to maintain and change. If it looks good, later we will make other integer based vectors rely on the base class implementations. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6366) [Java] Make field vectors final explicitly
Liya Fan created ARROW-6366: --- Summary: [Java] Make field vectors final explicitly Key: ARROW-6366 URL: https://issues.apache.org/jira/browse/ARROW-6366 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan According to the discussion in https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E, field vectors should not be extended, so they should be made final explicitly. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6355) [Java] Make range equal visitor reusable
Liya Fan created ARROW-6355: --- Summary: [Java] Make range equal visitor reusable Key: ARROW-6355 URL: https://issues.apache.org/jira/browse/ARROW-6355 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan According to the discussion in https://github.com/apache/arrow/pull/4993#discussion_r316009165, we often encounter this scenario: we compare values repeatedly, and the comparisons differ only in the parameters (vector to compare, start index, etc). According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead. To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6335) [Java] Improve the performance of DictionaryHashTable
Liya Fan created ARROW-6335: --- Summary: [Java] Improve the performance of DictionaryHashTable Key: ARROW-6335 URL: https://issues.apache.org/jira/browse/ARROW-6335 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan When comparing two entries in the dictionary hash table, it is more efficient to compare the indices directly, rather than using Objects.equals, because they are both ints. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6334) [Java] Improve the dictionary builder API to return the position of the value in the dictionary
Liya Fan created ARROW-6334: --- Summary: [Java] Improve the dictionary builder API to return the position of the value in the dictionary Key: ARROW-6334 URL: https://issues.apache.org/jira/browse/ARROW-6334 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan This is an improvement of the {{addValue}} method. Previously, the method returned a boolean, indicating whether the value had been successfully added to the dictionary. After the change, the method returns an integer, which is the position of the value in the dictionary. The purpose of this change: # The dictionary position contains more information than a boolean indicating whether the value was added successfully. # The index in the dictionary can be useful, for example, to collect statistics about the dictionary. With the dictionary position, whether a value has been added can be easily determined. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6307) [Java] Provide RLE vector
Liya Fan created ARROW-6307: --- Summary: [Java] Provide RLE vector Key: ARROW-6307 URL: https://issues.apache.org/jira/browse/ARROW-6307 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan RLE (run length encoding) is a widely used encoding/decoding technique. Compared with other encoding/decoding techniques, it is easier to work with the encoded data. We want to provide an RLE vector implementation in Arrow. The design details include: 1. RleVector implements ValueVector. 2. the data structure of RleVector includes an inner vector, plus a repetition buffer. 3. we do not provide random access over the RleVector 4. In the future, we will provide iterators to access the vector in sequence. 5. RleVector does not support update, but supports appending. 6. In the future, we will provide encoder/decoder to efficiently transform encoded/decoded vectors. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6306) [Java] Support stable sort by stable comparators
Liya Fan created ARROW-6306: --- Summary: [Java] Support stable sort by stable comparators Key: ARROW-6306 URL: https://issues.apache.org/jira/browse/ARROW-6306 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Stable sort is desirable in many scenarios. It means equal elements preserve their relative order after sorting. There are stable sort algorithms. However, in practice, the best sort algorithm is quick sort and quick sort is not stable. To make the best of both worlds, we support stable sort by stable comparators. It differs from an ordinary comparator in that it breaks ties by comparing the value indices. With the stable comparator, the quick sort algorithm becomes a stable algorithm. -- This message was sent by Atlassian Jira (v8.3.2#803003)
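The stable comparator from ARROW-6306 can be sketched as a wrapper around an ordinary comparator. As above, IndexComparator is a hypothetical stand-in for VectorValueComparator. The sketch assumes the sort permutes indices into an untouched vector (as an index sorter would), so that an index still reflects the original position of its value.

```java
final class StableComparatorSketch {
  /** Hypothetical stand-in for VectorValueComparator: compares two row indices. */
  interface IndexComparator {
    int compare(int left, int right);
  }

  /** Wraps a comparator so that ties are broken by the original value indices. */
  static IndexComparator stable(IndexComparator inner) {
    return (left, right) -> {
      int result = inner.compare(left, right);
      // Equal values keep their relative order: the smaller index comes first.
      return result != 0 ? result : Integer.compare(left, right);
    };
  }
}
```

Because two distinct indices never compare as equal under the wrapped comparator, the result of any correct sort is fully determined, which is what makes quick sort behave as a stable sort here.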
[jira] [Created] (ARROW-6297) [Java] Compare ArrowBufPointers by unsigned integers
Liya Fan created ARROW-6297: --- Summary: [Java] Compare ArrowBufPointers by unsigned integers Key: ARROW-6297 URL: https://issues.apache.org/jira/browse/ARROW-6297 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Currently, ArrowBufPointers are compared byte by byte in lexicographic order. Another way is to compare by unsigned integers (longs, ints, & bytes). The second way involves additional bit operations for each iteration. However, it can compare 8 bytes at a time, so it is faster overall: Compare by unsigned integers: ArrowBufPointerBenchmarks.compareBenchmark avgt 5 65.722 ± 0.381 ns/op Compare byte-wise: ArrowBufPointerBenchmarks.compareBenchmark avgt 5 681.372 ± 0.604 ns/op -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6266) [Java] Resolve the ambiguous method overload in RangeEqualsVisitor
Liya Fan created ARROW-6266: --- Summary: [Java] Resolve the ambiguous method overload in RangeEqualsVisitor Key: ARROW-6266 URL: https://issues.apache.org/jira/browse/ARROW-6266 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan In RangeEqualsVisitor, there are overload methods for both super class and sub class. This will lead to unexpected behavior. For example, if we call RangeEqualsVisitor#visit(v), where v is a fixed width vector, the method actually called may be visit(ValueVector), which is unexpected. In general, in the visitor pattern, it is not a good idea to support method overload for both super class and sub-class as parameters. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6264) [Java] There is no need to consider byte order in ArrowBufHasher
Liya Fan created ARROW-6264: --- Summary: [Java] There is no need to consider byte order in ArrowBufHasher Key: ARROW-6264 URL: https://issues.apache.org/jira/browse/ARROW-6264 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan According to the discussion in https://github.com/apache/arrow/pull/5063#issuecomment-521276547, Arrow has a mechanism to make sure the data is stored in little-endian, so there is no need to check byte order. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6247) [Java] Provide a common interface for float4 and float8 vectors
Liya Fan created ARROW-6247: --- Summary: [Java] Provide a common interface for float4 and float8 vectors Key: ARROW-6247 URL: https://issues.apache.org/jira/browse/ARROW-6247 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We want to provide an interface for floating point vectors (float4 & float8). This interface will make it convenient for many operations on a vector. With this interface, the client code will be greatly simplified, with many branches/switch statements removed. The design is similar to BaseIntVector (the interface for all integer vectors). We provide 3 methods for setting & getting floating point values: setWithPossibleTruncate, setSafeWithPossibleTruncate, and getValueAsDouble. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6245) [DISCUSS][Java] Provide an interface for numeric vectors
Liya Fan created ARROW-6245: --- Summary: [DISCUSS][Java] Provide an interface for numeric vectors Key: ARROW-6245 URL: https://issues.apache.org/jira/browse/ARROW-6245 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan We want to provide an interface for all vectors with numeric types (small int, float4, float8, etc). This interface will make it convenient for many operations on a vector, like average, sum, variance, etc. With this interface, the client code will be greatly simplified, with many branches/switch statements removed. The design is similar to BaseIntVector (the interface for all integer vectors). We provide 3 methods for setting & getting numeric values: setWithPossibleRounding, setSafeWithPossibleRounding, and getValueAsDouble. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6221) [Java] Improve the performance of RangeEqualVisitor for comparing variable-width vectors
Liya Fan created ARROW-6221: --- Summary: [Java] Improve the performance of RangeEqualVisitor for comparing variable-width vectors Key: ARROW-6221 URL: https://issues.apache.org/jira/browse/ARROW-6221 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Two improvements: # Compare the whole range of the data buffer, instead of comparing individual elements. # If two elements are of different sizes, there is no need to compare them. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6212) [Java] Support vector rank operation
Liya Fan created ARROW-6212: --- Summary: [Java] Support vector rank operation Key: ARROW-6212 URL: https://issues.apache.org/jira/browse/ARROW-6212 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Given an unsorted vector, we want to get the index of the ith smallest element in the vector. This function is supported by the rank operation. We provide an implementation that gets the index with the desired rank, without sorting the vector (the vector is left intact), and the implementation takes O(n) time, where n is the vector length. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
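One way to realize the rank operation from ARROW-6212 is quickselect over an array of indices, sketched below. This is an assumption on our side and not necessarily the algorithm in the actual patch: the class name is hypothetical, the input is a plain int array, and quickselect with this pivot choice runs in O(n) time on average but can degrade to quadratic time in the worst case.

```java
final class RankSketch {
  /** Returns the index in values of the element with the given rank (0 = smallest). */
  static int indexOfRank(int[] values, int rank) {
    if (rank < 0 || rank >= values.length) {
      throw new IllegalArgumentException("rank out of range: " + rank);
    }
    // Permute a separate index array so that the input is left intact.
    int[] idx = new int[values.length];
    for (int i = 0; i < idx.length; i++) {
      idx[i] = i;
    }
    int low = 0;
    int high = idx.length - 1;
    while (true) {
      // Partition idx[low..high] around the value referenced by idx[high].
      int pivot = values[idx[high]];
      int p = low;
      for (int j = low; j < high; j++) {
        if (values[idx[j]] < pivot) {
          swap(idx, p, j);
          p++;
        }
      }
      swap(idx, p, high);
      if (p == rank) {
        return idx[p];
      }
      // Continue only in the side that contains the requested rank.
      if (p < rank) {
        low = p + 1;
      } else {
        high = p - 1;
      }
    }
  }

  private static void swap(int[] a, int i, int j) {
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }
}
```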
[jira] [Created] (ARROW-6209) [Java] Extract set null method to the base class for fixed width vectors
Liya Fan created ARROW-6209: --- Summary: [Java] Extract set null method to the base class for fixed width vectors Key: ARROW-6209 URL: https://issues.apache.org/jira/browse/ARROW-6209 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Currently, each fixed width vector has the setNull method. All these implementations are identical, so we move them to the base class. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6198) [Java] Support online dictionary builder and encoder
Liya Fan created ARROW-6198: --- Summary: [Java] Support online dictionary builder and encoder Key: ARROW-6198 URL: https://issues.apache.org/jira/browse/ARROW-6198 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan An online dictionary builder and encoder is used for the scenarios where the dictionary is used as is (for some cases, the dictionary may need to be sorted by content or frequency, before encoding). For such scenarios, the dictionary building and encoding can be performed simultaneously. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6185) [Java] Provide hash table based dictionary builder
Liya Fan created ARROW-6185: --- Summary: [Java] Provide hash table based dictionary builder Key: ARROW-6185 URL: https://issues.apache.org/jira/browse/ARROW-6185 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan This is related to ARROW-5862. We provide another type of dictionary builder based on a hash table. Compared with a search based dictionary encoder, a hash table based encoder processes each new element in O(1) time, but requires extra memory space. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6184) [Java] Provide hash table based dictionary encoder
Liya Fan created ARROW-6184: --- Summary: [Java] Provide hash table based dictionary encoder Key: ARROW-6184 URL: https://issues.apache.org/jira/browse/ARROW-6184 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan This is the second part of ARROW-5917. We provide a sort based encoder, as well as a hash table based encoder, to solve the problem with the current dictionary encoder. In particular, we solve the following problems with the current encoder: # There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)). # Unnecessary memory copy (the vector data must be copied to the hash table). # The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either). # The output vector should not be created/managed by the encoder (just like in the out-of-place sorter) # The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6172) [Java] Avoid creating value holders repeatedly when reading data from JDBC
Liya Fan created ARROW-6172: --- Summary: [Java] Avoid creating value holders repeatedly when reading data from JDBC Key: ARROW-6172 URL: https://issues.apache.org/jira/browse/ARROW-6172 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan When converting JDBC data to Arrow data, a value holder is created for every single value. The following code snippet gives an example:

    NullableSmallIntHolder holder = new NullableSmallIntHolder();
    holder.isSet = isNonNull ? 1 : 0;
    if (isNonNull) {
      holder.value = (short) value;
    }
    smallIntVector.setSafe(rowCount, holder);
    smallIntVector.setValueCount(rowCount + 1);

This is inefficient in terms of both memory usage and computation. For most types, we can improve the performance by setting the value directly. For example, the benchmarks on IntVector show that setting the int value directly reduces the average time per operation by about 20%:

    Benchmark                         Mode  Cnt   Score   Error  Units
    IntBenchmarks.setIntDirectly      avgt    5  15.397 ± 0.018  us/op
    IntBenchmarks.setWithValueHolder  avgt    5  19.198 ± 0.789  us/op

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6156) [Java] Support compare semantics for ArrowBufPointer
Liya Fan created ARROW-6156: --- Summary: [Java] Support compare semantics for ArrowBufPointer Key: ARROW-6156 URL: https://issues.apache.org/jira/browse/ARROW-6156 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Compare two Arrow buffer pointers by their content in lexicographic order. Null is considered smaller than non-null, and the shorter buffer is considered smaller. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
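A minimal sketch of these comparison rules on plain byte arrays follows. The class BufferCompare is hypothetical and is not the ArrowBufPointer API. Two points are assumptions not stated in the issue: bytes are compared as unsigned values, and "shorter is smaller" is applied when one buffer is a prefix of the other.

```java
/** Hypothetical sketch of lexicographic comparison; not the ArrowBufPointer API. */
final class BufferCompare {
  /**
   * Compares two byte ranges by content.
   * Null sorts before any non-null array, and a prefix sorts before a longer array.
   */
  static int compare(byte[] a, byte[] b) {
    if (a == null || b == null) {
      if (a == b) {
        return 0; // both null
      }
      return a == null ? -1 : 1;
    }
    int common = Math.min(a.length, b.length);
    for (int i = 0; i < common; i++) {
      // Assumption: bytes are compared as unsigned values in the range 0..255.
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x < y ? -1 : 1;
      }
    }
    // All shared bytes are equal, so the shorter array is the smaller one.
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    System.out.println(compare(new byte[] {1, 2}, new byte[] {1, 2, 3}));
  }
}
```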
[jira] [Created] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments
Liya Fan created ARROW-6155: --- Summary: [Java] Extract a super interface for vectors whose elements reside in continuous memory segments Key: ARROW-6155 URL: https://issues.apache.org/jira/browse/ARROW-6155 Project: Apache Arrow Issue Type: New Feature Reporter: Liya Fan Assignee: Liya Fan Vectors whose data elements reside in continuous memory segments should implement a common super interface. This will avoid unnecessary code branches. For now, such vectors include fixed-width vectors and variable-width vectors. In the future, more vectors may be included. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6143) [Java] Unify the copyFrom and copyFromSafe methods for all vectors
Liya Fan created ARROW-6143: --- Summary: [Java] Unify the copyFrom and copyFromSafe methods for all vectors Key: ARROW-6143 URL: https://issues.apache.org/jira/browse/ARROW-6143 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Some vectors have their own implementations of copyFrom and copyFromSafe methods. Since we have extracted the copyFrom and copyFromSafe methods to the base interface (see ARROW-6021), we want all vectors' implementations to override the methods from the super interface. This will provide a unified way of copying data elements. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector
Liya Fan created ARROW-6117: --- Summary: [Java] Fix the set method of FixedSizeBinaryVector Key: ARROW-6117 URL: https://issues.apache.org/jira/browse/ARROW-6117 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan For the set method, if the parameter is null, it should clear the validity bit. However, the current implementation throws a NullPointerException. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6113) [Java] Support vector deduplicate function
Liya Fan created ARROW-6113: --- Summary: [Java] Support vector deduplicate function Key: ARROW-6113 URL: https://issues.apache.org/jira/browse/ARROW-6113 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Remove adjacent duplicate elements from a vector. This function can be used, for example, in finding distinct values, or in compressing the vector data. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
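As an illustration of the operation described here, the following sketch removes adjacent duplicates from an int array. The class AdjacentDedup is hypothetical and uses plain arrays in place of Arrow vectors.

```java
import java.util.Arrays;

/** Hypothetical sketch: drops elements equal to their immediate predecessor. */
final class AdjacentDedup {
  static int[] removeAdjacentDuplicates(int[] values) {
    if (values.length == 0) {
      return new int[0];
    }
    int[] out = new int[values.length];
    int count = 0;
    out[count++] = values[0]; // the first element is always kept
    for (int i = 1; i < values.length; i++) {
      // Keep an element only when it differs from the one right before it.
      if (values[i] != values[i - 1]) {
        out[count++] = values[i];
      }
    }
    return Arrays.copyOf(out, count);
  }

  public static void main(String[] args) {
    int[] result = removeAdjacentDuplicates(new int[] {1, 1, 2, 2, 2, 3, 1, 1});
    System.out.println(Arrays.toString(result));
  }
}
```

Only neighbouring repeats are removed, so a value can still appear more than once in the output. When the input is sorted, all equal values are adjacent and the result contains each distinct value exactly once.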
[jira] [Created] (ARROW-6080) [Java] Support search operation for BaseRepeatedValueVector
Liya Fan created ARROW-6080: --- Summary: [Java] Support search operation for BaseRepeatedValueVector Key: ARROW-6080 URL: https://issues.apache.org/jira/browse/ARROW-6080 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6070) [Java] Avoid creating new schema before IPC sending
Liya Fan created ARROW-6070: --- Summary: [Java] Avoid creating new schema before IPC sending Key: ARROW-6070 URL: https://issues.apache.org/jira/browse/ARROW-6070 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan If a dictionary is attached to a schema, it may need to be converted before IPC sending. When this is not the case (which is most likely in practice), there is no need to do the conversion and no need to create a new schema. We solve the above problem by quickly determining if conversion is required, and if not, we avoid creating a new schema and return the original one immediately. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6056) [Java] Handle exceptions when flight service processes put requests
Liya Fan created ARROW-6056: --- Summary: [Java] Handle exceptions when flight service processes put requests Key: ARROW-6056 URL: https://issues.apache.org/jira/browse/ARROW-6056 Project: Apache Arrow Issue Type: Improvement Reporter: Liya Fan The current approach swallows the exception silently and only prints a log message. This makes debugging and problem diagnosis difficult. We need to handle the exception explicitly. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6031) [Java] Support iterating a vector by ArrowBufPointer
Liya Fan created ARROW-6031: --- Summary: [Java] Support iterating a vector by ArrowBufPointer Key: ARROW-6031 URL: https://issues.apache.org/jira/browse/ARROW-6031 Project: Apache Arrow Issue Type: New Feature Reporter: Liya Fan Assignee: Liya Fan Provide the functionality to traverse a vector (fixed-width vector & variable-width vector) by an iterator. This is convenient when vector elements are accessed in sequence. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6030) [Java] Efficiently compute hash code for ArrowBufPointer
Liya Fan created ARROW-6030: --- Summary: [Java] Efficiently compute hash code for ArrowBufPointer Key: ARROW-6030 URL: https://issues.apache.org/jira/browse/ARROW-6030 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Now that ArrowBufHasher has been introduced, we can compute the hash code of a continuous region within an ArrowBuf. We optimize the process to avoid recomputing the hash code. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6024) [Java] Provide more hash algorithms
Liya Fan created ARROW-6024: --- Summary: [Java] Provide more hash algorithms Key: ARROW-6024 URL: https://issues.apache.org/jira/browse/ARROW-6024 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Provide more hash algorithms to choose from for different scenarios. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6021) [Java] Extract copyFrom and copyFromSafe to ValueVector
Liya Fan created ARROW-6021: --- Summary: [Java] Extract copyFrom and copyFromSafe to ValueVector Key: ARROW-6021 URL: https://issues.apache.org/jira/browse/ARROW-6021 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Currently we have copyFrom and copyFromSafe methods in fixed-width and variable-width vectors. Extracting them to the common super interface will make it much more convenient to use them, and avoid unnecessary if-else statements. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6013) Support range searcher
Liya Fan created ARROW-6013: --- Summary: Support range searcher Key: ARROW-6013 URL: https://issues.apache.org/jira/browse/ARROW-6013 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan For a sorted vector, the range searcher finds the first/last occurrence of a particular element. The search is based on binary search, which takes O(log n) time. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
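The first/last occurrence search can be sketched as a binary search that keeps narrowing after a match instead of stopping. The class RangeSearch below is hypothetical and works on a sorted int array in place of an Arrow vector. Returning -1 for a missing key is an assumption of this sketch.

```java
/** Hypothetical sketch of a range search over a sorted int array. */
final class RangeSearch {
  /** Returns the index of the first occurrence of key, or -1 if it is absent. */
  static int firstOccurrence(int[] sorted, int key) {
    int low = 0;
    int high = sorted.length - 1;
    int result = -1;
    while (low <= high) {
      int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
      if (sorted[mid] < key) {
        low = mid + 1;
      } else if (sorted[mid] > key) {
        high = mid - 1;
      } else {
        result = mid;   // remember the match
        high = mid - 1; // and keep looking to the left
      }
    }
    return result;
  }

  /** Returns the index of the last occurrence of key, or -1 if it is absent. */
  static int lastOccurrence(int[] sorted, int key) {
    int low = 0;
    int high = sorted.length - 1;
    int result = -1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      if (sorted[mid] < key) {
        low = mid + 1;
      } else if (sorted[mid] > key) {
        high = mid - 1;
      } else {
        result = mid;  // remember the match
        low = mid + 1; // and keep looking to the right
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[] data = {1, 2, 2, 2, 3, 5};
    System.out.println(firstOccurrence(data, 2) + ".." + lastOccurrence(data, 2));
  }
}
```

Each loop iteration halves the remaining interval, so both searches stay within O(log n) comparisons even when the key is repeated many times.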
[jira] [Created] (ARROW-5998) [Java] Open a document to track the API changes
Liya Fan created ARROW-5998: --- Summary: [Java] Open a document to track the API changes Key: ARROW-5998 URL: https://issues.apache.org/jira/browse/ARROW-5998 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan We need a document to track the API behavior changes, so as not to forget about them for the next release. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5996) [Java] Avoid resource leak in flight service
Liya Fan created ARROW-5996: --- Summary: [Java] Avoid resource leak in flight service Key: ARROW-5996 URL: https://issues.apache.org/jira/browse/ARROW-5996 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan # In FlightService#doPutCustom, the flight stream must be closed, even if an exception is thrown during the call to responseObserver.onError. # An exception that occurs during the call to acceptPut should not be swallowed. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
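The two requirements above can be sketched with a try/catch/finally pattern. Everything in this sketch is hypothetical: PutHandlerSketch, TrackedStream and ErrorObserver are stand-ins and not the Flight classes, and rethrowing the original exception after reporting it is only one possible way of not swallowing it.

```java
/** Hypothetical sketch of the close-and-propagate pattern; not the FlightService code. */
final class PutHandlerSketch {
  /** Stand-in for a response observer. */
  interface ErrorObserver {
    void onError(Throwable error);
  }

  /** Stand-in for a stream that records whether it has been closed. */
  static final class TrackedStream implements AutoCloseable {
    boolean closed;

    @Override
    public void close() {
      closed = true;
    }
  }

  static void handlePut(TrackedStream stream, Runnable acceptPut, ErrorObserver observer) {
    try {
      acceptPut.run();
    } catch (RuntimeException e) {
      try {
        observer.onError(e); // reporting the failure may itself throw
      } catch (RuntimeException onErrorFailure) {
        e.addSuppressed(onErrorFailure); // keep both failures visible
      }
      throw e; // the original exception is not swallowed
    } finally {
      stream.close(); // runs on success, on failure, and when onError throws
    }
  }

  public static void main(String[] args) {
    TrackedStream stream = new TrackedStream();
    handlePut(stream, () -> { }, error -> { });
    System.out.println("closed=" + stream.closed);
  }
}
```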
[jira] [Created] (ARROW-5973) [Java] Variable width vectors' get methods should return null when the underlying data is null
Liya Fan created ARROW-5973: --- Summary: [Java] Variable width vectors' get methods should return null when the underlying data is null Key: ARROW-5973 URL: https://issues.apache.org/jira/browse/ARROW-5973 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan For variable-width vectors (VarCharVector and VarBinaryVector), when the validity bit is not set, it means the underlying data is null, so the get method should return null. However, the current implementation throws an IllegalStateException when NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is clear. Maybe the purpose of this design is to be consistent with fixed-width vectors. However, the scenario is different: fixed-width vectors (e.g. IntVector) throw an IllegalStateException simply because the primitive types are non-nullable. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
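The proposed behavior can be illustrated on raw buffers. The sketch below assumes the usual Arrow variable-width layout (a validity bitmap with least-significant-bit numbering, an offsets buffer with one more entry than there are values, and a data buffer), but models the buffers as plain Java arrays. The class VarWidthGetSketch and its get method are hypothetical and are not the VarBinaryVector API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of a null-aware get for a variable-width layout. */
final class VarWidthGetSketch {
  /**
   * Returns a copy of the bytes stored at index, or null when the validity bit is clear.
   * offsets holds (valueCount + 1) entries; value i spans offsets[i]..offsets[i + 1].
   */
  static byte[] get(byte[] validity, int[] offsets, byte[] data, int index) {
    // Bit (index % 8) of byte (index / 8) tells whether the value is non-null.
    boolean isSet = (validity[index >> 3] & (1 << (index & 7))) != 0;
    if (!isSet) {
      return null; // null value: neither an exception nor an empty array
    }
    return Arrays.copyOfRange(data, offsets[index], offsets[index + 1]);
  }

  public static void main(String[] args) {
    // Three values: "ab", null, "cde".
    byte[] validity = {0b00000101};
    int[] offsets = {0, 2, 2, 5};
    byte[] data = "abcde".getBytes(StandardCharsets.US_ASCII);
    for (int i = 0; i < 3; i++) {
      byte[] value = get(validity, offsets, data, i);
      System.out.println(value == null ? "null" : new String(value, StandardCharsets.US_ASCII));
    }
  }
}
```

Returning null keeps a null value distinguishable from a valid empty value, which an empty array cannot do.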
[jira] [Created] (ARROW-5970) [Java] Provide pointer to Arrow buffer
Liya Fan created ARROW-5970: --- Summary: [Java] Provide pointer to Arrow buffer Key: ARROW-5970 URL: https://issues.apache.org/jira/browse/ARROW-5970 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Introduce a pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating hash codes and determining equality within a vector. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5920) [Java] Support sort & compare for all variable width vectors
Liya Fan created ARROW-5920: --- Summary: [Java] Support sort & compare for all variable width vectors Key: ARROW-5920 URL: https://issues.apache.org/jira/browse/ARROW-5920 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan All variable-width vectors can reuse the same comparator for sorting & searching. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5918) [Java] Revise the BaseIntVector interface
Liya Fan created ARROW-5918: --- Summary: [Java] Revise the BaseIntVector interface Key: ARROW-5918 URL: https://issues.apache.org/jira/browse/ARROW-5918 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan 1. The set method should not use long as its parameter type. It is hardly the case that there are more than 2^32 distinct values in a dictionary. If that really happens, it probably means we should not have used a dictionary in the first place. 2. In addition to the get method, there should also be a set method. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5917) [Java] Redesign the dictionary encoder
Liya Fan created ARROW-5917: --- Summary: [Java] Redesign the dictionary encoder Key: ARROW-5917 URL: https://issues.apache.org/jira/browse/ARROW-5917 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead, which prevents it from being useful in practice: # There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)). # Unnecessary memory copy (the vector data must be copied to the hash table). # The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either). # The output vector should not be created/managed by the encoder (just like in the out-of-place sorter) # The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed. We plan to implement a new one in the algorithm module, and gradually deprecate the current one. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5911) [Java] Make ListVector and MapVector create reader lazily
Liya Fan created ARROW-5911: --- Summary: [Java] Make ListVector and MapVector create reader lazily Key: ARROW-5911 URL: https://issues.apache.org/jira/browse/ARROW-5911 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan The current implementation creates the reader eagerly, which may waste resources and time. This issue changes the behavior to create the reader lazily. This is a follow-up issue for ARROW-5897. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
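Lazy creation of this kind can be sketched as follows. The class LazyReaderHolder is hypothetical and generic. It is not the ListVector or MapVector code, and it is not thread-safe, which is an assumption of the sketch.

```java
import java.util.function.Supplier;

/** Hypothetical sketch: creates the reader on first use instead of at construction. */
final class LazyReaderHolder<R> {
  private final Supplier<R> factory;
  private R reader; // stays null until the reader is first requested

  LazyReaderHolder(Supplier<R> factory) {
    this.factory = factory;
  }

  /** Creates the reader on the first call and returns the same instance afterwards. */
  R getReader() {
    if (reader == null) {
      reader = factory.get();
    }
    return reader;
  }

  public static void main(String[] args) {
    LazyReaderHolder<StringBuilder> holder = new LazyReaderHolder<>(StringBuilder::new);
    System.out.println(holder.getReader() == holder.getReader());
  }
}
```

A holder that is never asked for its reader never pays the construction cost, which is the saving the issue is after.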