Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization
> > Regarding Plasma, you're right we should have started this conversation > earlier! The way it's being developed in Ray currently isn't useful as a > standalone project. We realized that tighter integration with Ray's object > lifetime tracking could be important, and removing IPCs and making it a > separate thread in the same process as our scheduler could make a big > difference for performance. Some of these optimizations wouldn't be easy > without a tight integration, so there are some trade-offs here. So I guess the question is whether it is worth continuing to try to maintain a separate version of Plasma within the Arrow repo? On Tue, Jul 21, 2020 at 9:28 AM Robert Nishihara wrote: > Hi all, > > Regarding Plasma, you're right we should have started this conversation > earlier! The way it's being developed in Ray currently isn't useful as a > standalone project. We realized that tighter integration with Ray's object > lifetime tracking could be important, and removing IPCs and making it a > separate thread in the same process as our scheduler could make a big > difference for performance. Some of these optimizations wouldn't be easy > without a tight integration, so there are some trade-offs here. > > Regarding the Python serialization format, I agree with Antoine that it > should be deprecated. We began developing it before pickle 5, but now that > pickle 5 has taken off, it makes less sense (it's useful in its own right, > but at the end of the day, we were interested in it as a way to serialize > arbitrary Python objects). > > -Robert > > On Sun, Jul 12, 2020 at 5:26 PM Wes McKinney wrote: > > > I'll add deprecation warnings to the pyarrow.serialize functions in > > question, it will be pretty simple. > > > > On Sun, Jul 12, 2020, 6:34 PM Neal Richardson < > neal.p.richard...@gmail.com > > > > > wrote: > > > > > This seems like something to investigate after the 1.0 release. > > > > > > Neal > > > > > > On Sun, Jul 12, 2020 at 11:53 AM Antoine Pitrou > > > wrote: > > > > > > > > > > > I'd certainly like to deprecate our custom Python serialization > format, > > > > and using pickle protocol 5 instead is a very good idea. > > > > > > > > We can probably keep it in 1.0 while raising a FutureWarning. > > > > > > > > Regards > > > > > > > > Antoine. > > > > > > > > > > > > On 12/07/2020 at 19:22, Wes McKinney wrote: > > > > > It appears that the Ray developers have decided to fork Plasma and > > > > > decouple from the Arrow codebase: > > > > > > > > > > https://github.com/ray-project/ray/pull/9154 > > > > > > > > > > This is a disappointing development to occur without any discussion > > on > > > > > this mailing list but given the lack of development activity on > > Plasma > > > > > I would like to see how others in the community would like to > > proceed. > > > > > > > > > > It appears additionally that the Union-based serialization format > > > > > implemented by arrow/python/serialize.h and the > pyarrow/serialize.py > > > > > has been dropped in favor of pickle5. If there is not value in > > > > > maintaining this code then it would probably be preferable for us > to > > > > > remove this from the codebase. > > > > > > > > > > Thanks, > > > > > Wes > > > > > > > > > > > > > > >
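For readers following the deprecation, a minimal sketch of the pickle protocol 5 pattern the thread points to as the replacement for pyarrow.serialize (assumes Python 3.8+ or the pickle5 backport; the payload object is purely illustrative):

import pickle
import numpy as np

# Hypothetical payload: any picklable Python object containing large buffers.
payload = {"name": "example", "values": np.arange(1_000_000)}

# Out-of-band buffers let large memory blocks bypass the pickle stream,
# which is roughly what pyarrow.serialize was being used for.
buffers = []
data = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)

# The consumer supplies the same buffers back when deserializing.
restored = pickle.loads(data, buffers=buffers)
assert restored["name"] == "example"

The out-of-band buffers give roughly the zero-copy behaviour the custom serialization format provided, without Arrow having to maintain its own format.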
Re: [DISCUSS] Support of higher bit-width Decimal type
Hi Jacques, > Do we have a good definition of what is necessary to add a new data type? > Adding a type but not pulling it through most of the code seems less than > ideal since it means one part of Arrow doesn't work with another (providing > a less optimal end-user experience). I think what I proposed below is a minimum viable integration plan (and covers previously discussed requirements for new types). It demonstrates interop between two reference implementations and allows conversion to/from idiomatic language analogues. So it covers the basic goal of "arrow interop". > For example, would this work include making Gandiva and all the kernels > support this new data type where appropriate? Not initially. There needs to be a stepping stone to start supporting new types. I don't think it is feasible to try to land all of this functionality in one PR. I'll lend a hand at trying to get support for built-in compute after we get the first part done. Thanks, Micah On Fri, Aug 14, 2020 at 5:08 PM Jacques Nadeau wrote: > Do we have a good definition of what is necessary to add a new data type? > Adding a type but not pulling it through most of the code seems less than > ideal since it means one part of Arrow doesn't work with another (providing > a less optimal end-user experience). > > For example, would this work include making Gandiva and all the kernels > support this new data type where appropriate? > > On Wed, Aug 5, 2020 at 12:01 PM Wes McKinney wrote: > > > Sounds fine to me. I guess one question is what needs to be formalized > > in the Schema.fbs files or elsewhere in the columnar format > > documentation (and we will need to hold an associated vote for that I > > think) > > > > On Mon, Aug 3, 2020 at 11:30 PM Micah Kornfield > > wrote: > > > > > > Given no objections, we'll go ahead and start implementing support for > > 256-bit decimals. > > > > > > I'm considering setting up another branch to develop all the components > > so they can be merged to master atomically. > > > > > > Thanks, > > > Micah > > > > > > On Tue, Jul 28, 2020 at 6:39 AM Wes McKinney > > wrote: > > >> > > >> Generally this sounds fine to me. At some point it would be good to > > >> add 32-bit and 64-bit decimal support but this can be done in the > > >> future. > > >> > > >> On Tue, Jul 28, 2020 at 7:28 AM Fan Liya > wrote: > > >> > > > >> > Hi Micah, > > >> > > > >> > Thanks for opening the discussion. > > >> > I am aware of some scenarios where decimal requires more than 16 > > bytes, so > > >> > I think it would be beneficial to support this in Arrow. > > >> > > > >> > Best, > > >> > Liya Fan > > >> > > > >> > > > >> > On Tue, Jul 28, 2020 at 11:12 AM Micah Kornfield < > > emkornfi...@gmail.com> > > >> > wrote: > > >> > > > > >> > > Hi Arrow Dev, > > >> > > ZetaSQL (Google's open source standard SQL library) recently > > introduced a > > >> > > BigNumeric [1] type which requires a 256 bit width to properly > > support it. > > >> > > I'd like to add support (possibly in collaboration with some of my > > >> > > colleagues) to add support for 256 bit width Decimals in Arrow to > > support a > > >> > > type corresponding to BigNumeric. > > >> > > > > >> > > In past discussions on this, I don't think we established a > minimum > > bar for > > >> > > supporting additional bit-widths within Arrow. > > >> > > > > >> > > I'd like to propose the following requirements: > > >> > > 1. A vote agreeing on adding support for a new bitwidth (we can > > discuss > > >> > > any objections here). > > >> > > 2.
Support in Java and C++ for integration tests verifying the > > ability to > > >> > > round-trip the value. > > >> > > 3. Support in Java for conversion to/from BigDecimal [2] > > >> > > 4. Support in Python converting to/from Decimal [3] > > >> > > > > >> > > Is there anything else that people feel like is a requirement for > > basic > > >> > > support of an additional bit width for Decimal's? > > >> > > > > >> > > Thanks, > > >> > > Micah > > >> > > > > >> > > > > >> > > [1] > > >> > > > > >> > > > > > https://github.com/google/zetasql/blob/1aefaa7c62fc7a50def879bb7c4225ec6974b7ef/zetasql/public/numeric_value.h#L486 > > >> > > [2] > > https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html > > >> > > [3] https://docs.python.org/3/library/decimal.html > > >> > > > > >
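As a point of reference for requirement 4, a minimal sketch of the to/from decimal.Decimal conversion that already exists for 128-bit decimals in pyarrow; the proposal would extend the same round trip to a 256-bit type (nothing below uses the new type, which does not exist yet):

import decimal
import pyarrow as pa

# Existing behaviour with 128-bit decimals: Python Decimal values round-trip
# through an Arrow decimal array.
values = [decimal.Decimal("123456789.123456789"), None]
arr = pa.array(values, type=pa.decimal128(precision=38, scale=9))

assert arr.type == pa.decimal128(38, 9)
assert arr[0].as_py() == decimal.Decimal("123456789.123456789")

# Requirement 4 asks for the same conversion once a wider (256-bit) decimal
# type is added, so values needing more than 38 digits of precision can
# also round-trip.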
Re: Gandiva and Threads
@ravin...@dremio.com @prav...@dremio.com thoughts? On Tue, Jul 28, 2020 at 3:39 PM Wes McKinney wrote: > Perhaps Gandiva does not handle sliced arrays properly? This would be > worth investigating > > On Mon, Jul 27, 2020 at 7:43 PM Matt Youill > wrote: > > > > Managed to track down the issue (sort of). > > > > I removed a call to set_chunksize on TableBatchReader where the chunk > > size was less than the number of rows in a table being read. Runs fine > > now (tested with 100s threads over mils of rows). > > > > Strangely, Gandiva fails if I don't call set_chunksize for a subsequent > > filter evaluation. It's very strange. > > > > > > On 28/7/20 7:58 am, Wes McKinney wrote: > > > Crashing when running from multiple threads doesn't sound right, > > > perhaps there are some missing synchronizations in internal data > > > structures. Could you open a JIRA issue and show the backtraces of any > > > crashes or other clues about how to reproduce the issues? > > > > > > On Sun, Jul 26, 2020 at 8:12 PM Matt Youill > wrote: > > >> Hi, > > >> > > >> Are there any guidelines for using the Gandiva C++ lib from multiple > > >> threads? As I increase the number of threads in an application Gandiva > > >> begins to raise memory faults (e.g. double free). Sometimes it is on > > >> projector creation, sometimes evaluation, seems to happen in multiple > > >> unpredictable places inside Gandiva. Can it be used from multiple > > >> threads, and if so which parts need to be guarded, if any? > > >> > > >> Thanks, Matt > > >> >
Re: [DISCUSS] How to extended time value range for Timestamp type?
+1, let's be cautious adding these kinds of things. On Wed, Aug 5, 2020 at 5:49 AM Wes McKinney wrote: > I also am not sure there is a good case for a new built-in type since it > introduces a good deal of complexity, particularly when there is the > extension type option. We’ve been living with 64-bit nanoseconds in pandas > for a decade, for example (and without the option for lower resolutions!!), > and while it does arise as a limitation from time to time, the use cases > are so specialized that it has never made sense to do anything about it. > > On Tue, Aug 4, 2020 at 11:26 PM Micah Kornfield > wrote: > > > I think a stronger case needs to be made for adding a new builtin type to > > support this. Can you provide concrete use-cases? Why can't dates > outside > > of the one representable by int64 be truncated (even for nano precision > > 64-bits max value is is over 200 years in the future)? It seems like in > > most cases values at the nanosecond level that are outside the values > > representable by 64-bits, are generally sentinel values. > > > > FWIW, Parquet had an int96 type that was used for timestamps but it has > > been deprecated [1] in favor of int64 nanos. > > > > -Micah > > > > [1] https://issues.apache.org/jira/browse/PARQUET-323 > > > > On Tue, Aug 4, 2020 at 8:52 PM Fan Liya wrote: > > > > > Hi Ji, > > > > > > This sounds like a universal requirement, as 64-bit is not sufficient > to > > > hold the precision for nano-second. > > > > > > For the extension type, we have two choices: > > > 1. Extending struct(int64, int32), which represents the design of SoA > > > (Struct of Arrays). > > > 2. Extending fixed width binary(12), which represents the design of AoS > > > (Array of Structs) > > > > > > Given the universal requirement, I'd prefer a new type. > > > > > > Best, > > > Liya Fan > > > > > > > > > On Wed, Aug 5, 2020 at 11:18 AM Ji Liu wrote: > > > > > > > Hi all, > > > > > > > > Now in Arrow Timestamp type, it support different TimeUnit(seconds, > > > > milliseconds, microseconds, nanoseconds) with int64 type for storage. > > In > > > > most cases this is enough, but if the timestamp value range of > external > > > > system exceeds int64_t::max, then it's impossible to directly convert > > to > > > > Arrow Timestamp, consider the following user case: > > > > > > > > A timestamp in other system with int64 + int32(stores milliseconds > and > > > > nanoseconds) can represent data from -00-00 to -12-31 > > > > 23:59:59.9, if we want to convert type like this, how should > we > > > do? > > > > One probably create an extension type with struct(int64, int32) for > > > > storage. > > > > > > > > Besides ExtensionType, are we considering extending our Timestamp for > > > wider > > > > range or maybe a new type for cases above? > > > > > > > > > > > > Thanks, > > > > Ji Liu > > > > > > > > > >
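To make the extension-type option Liya describes concrete, a rough pyarrow sketch of choice 1 (struct(int64, int32) storage); the type name and field names are hypothetical and this only illustrates the mechanism, not a proposed design:

import pyarrow as pa

class ExtendedTimestampType(pa.ExtensionType):
    """Hypothetical extension type: seconds in int64 plus nanoseconds in int32."""

    def __init__(self):
        storage = pa.struct([("seconds", pa.int64()), ("nanos", pa.int32())])
        super().__init__(storage, "example.extended_timestamp")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to persist in this sketch

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

ext_type = ExtendedTimestampType()
storage = pa.array(
    [{"seconds": 2**40, "nanos": 123456789}],
    type=ext_type.storage_type,
)
arr = pa.ExtensionArray.from_storage(ext_type, storage)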
Re: change in pyarrow scalar equality?
Thanks for the detailed response and background on this Joris! My case was certainly not necessary to compare pyarrow scalars, so it would have been better to just raise an error, but there are probably other cases where that wouldn't be preferred. Anyway, I think it would be a good idea to document this since I'm sure others will hit it. I made https://issues.apache.org/jira/browse/ARROW-9750 for adding some docs. On Thu, Aug 6, 2020 at 12:18 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > Hi Bryan, > > This indeed changed in 1.0. The full scalar implementation in pyarrow was > refactored (there were two types of scalars before, see > https://issues.apache.org/jira/browse/ARROW-9017 / > https://github.com/apache/arrow/pull/7519). > > Due to that PR, there was discussion about what "==" should mean > (originally triggered by comparison with Null returning Null, but then > expanded to comparison in general, see the mailing list thread "Extremely > dubious Python equality semantics" -> > > https://lists.apache.org/thread.html/rdd11d3635c751a3a626e14106f1a95f3cddba4dd3bf44247edefde49%40%3Cdev.arrow.apache.org%3E > ). > The options for "==" are: is it a strict "data structure / object" equality > (like the '.equals(..)' method), or is it an "analytical/semantic" equality > (like the element-wise 'equal' compute method)? > > In the end, we opted for the object equality, and then made it actually > strict to only have it compare equal to actual pyarrow scalars (and not do > automatic conversion of python scalars to pyarrow scalars). But note that > even different types will not compare equal like that at the moment: > > >>> a = pa.array([1,2,3], type="int64") > >>> b = pa.array([1,2,3], type="int32") > >>> a[0] == b[0] > False > >>> a[0] == 1 > False > >>> a[0].equals(1) > ... > TypeError: Argument 'other' has incorrect type (expected > pyarrow.lib.Scalar, got int) > > Using the pyarrow.compute module, you _should_ get the analytical equality > as you expected in this case. However, it seems that the "equal" kernel is > not yet implemented for differing types (I suppose an automatic casting > step is still missing): > > >>> import pyarrow.compute as pc > >>> pc.equal(a[0], b[0]) > ... > ArrowNotImplementedError: Function equal has no kernel matching input types > (scalar[int64], scalar[int32]) > >>> pc.equal(a[0], 1) > ... > TypeError: Got unexpected argument type for compute function > > For this last one, we should probably do an attempt to convert the python > scalar to a pyarrow scalar, and maybe for the "a[0] == 1" case as well > (however, coerce to which type if there are multiple possibilities (eg > int64 vs int32)?) > > I agree the new behaviour might be confusing (if you expect semantic > equality), but on the other hand is also clear avoiding dubious cases. But > I don't think this is already set in stone, so more feedback is certainly > welcome. > > Joris > > On Thu, 6 Aug 2020 at 01:12, Bryan Cutler wrote: > > > Hi all, > > > > I came across a behavior change from 0.17.1 when comparing array scalar > > values with python objects. This used to work for 0.17.1 and before, but > in > > 1.0.0 equals always returns false. I saw there was a previous discussion > on > > Python equality semantics, but not sure if the conclusion is the behavior > > I'm seeing. 
For example: > > > > In [4]: a = pa.array([1,2,3]) > > > > > > In [5]: a[0] == 1 > > > > Out[5]: False > > > > In [6]: a[0].as_py() == 1 > > > > Out[6]: True > > > > I know the scalars can be converted with `as_py()`, but it does seem a > > little strange to return False when compared with a python object. Is > this > > the expected behavior for 1.0.0+? > > > > Thanks, > > Bryan > > >
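For anyone hitting this, a small sketch of explicit ways to compare under the 1.0 semantics Joris describes: convert to Python objects, or cast to a common Arrow type before comparing (the expected results follow from the object-equality rules above; worth verifying on your pyarrow version):

import pyarrow as pa

a = pa.array([1, 2, 3], type="int64")
b = pa.array([1, 2, 3], type="int32")

# Object equality only holds for pyarrow scalars of the same type and value.
assert (a[0] == 1) is False
assert a[0].as_py() == 1                  # compare as Python objects instead
assert a[0] == b.cast(pa.int64())[0]      # or cast to a common type first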
Re: [DISCUSS] Support of higher bit-width Decimal type
Do we have a good definition of what is necessary to add a new data type? Adding a type but not pulling it through most of the code seems less than ideal since it means one part of Arrow doesn't work with another (providing a less optimal end-user experience). For example, would this work include making Gandiva and all the kernels support this new data type where appropriate? On Wed, Aug 5, 2020 at 12:01 PM Wes McKinney wrote: > Sounds fine to me. I guess one question is what needs to be formalized > in the Schema.fbs files or elsewhere in the columnar format > documentation (and we will need to hold an associated vote for that I > think) > > On Mon, Aug 3, 2020 at 11:30 PM Micah Kornfield > wrote: > > > > Given no objections, we'll go ahead and start implementing support for > 256-bit decimals. > > > > I'm considering setting up another branch to develop all the components > so they can be merged to master atomically. > > > > Thanks, > > Micah > > > > On Tue, Jul 28, 2020 at 6:39 AM Wes McKinney > wrote: > >> > >> Generally this sounds fine to me. At some point it would be good to > >> add 32-bit and 64-bit decimal support but this can be done in the > >> future. > >> > >> On Tue, Jul 28, 2020 at 7:28 AM Fan Liya wrote: > >> > > >> > Hi Micah, > >> > > >> > Thanks for opening the discussion. > >> > I am aware of some scenarios where decimal requires more than 16 > bytes, so > >> > I think it would be beneficial to support this in Arrow. > >> > > >> > Best, > >> > Liya Fan > >> > > >> > > >> > On Tue, Jul 28, 2020 at 11:12 AM Micah Kornfield < > emkornfi...@gmail.com> > >> > wrote: > >> > > >> > > Hi Arrow Dev, > >> > > ZetaSQL (Google's open source standard SQL library) recently > introduced a > >> > > BigNumeric [1] type which requires a 256 bit width to properly > support it. > >> > > I'd like to add support (possibly in collaboration with some of my > >> > > colleagues) to add support for 256 bit width Decimals in Arrow to > support a > >> > > type corresponding to BigNumeric. > >> > > > >> > > In past discussions on this, I don't think we established a minimum > bar for > >> > > supporting additional bit-widths within Arrow. > >> > > > >> > > I'd like to propose the following requirements: > >> > > 1. A vote agreeing on adding support for a new bitwidth (we can > discuss > >> > > any objections here). > >> > > 2. Support in Java and C++ for integration tests verifying the > ability to > >> > > round-trip the value. > >> > > 3. Support in Java for conversion to/from BigDecimal [2] > >> > > 4. Support in Python converting to/from Decimal [3] > >> > > > >> > > Is there anything else that people feel like is a requirement for > basic > >> > > support of an additional bit width for Decimal's? > >> > > > >> > > Thanks, > >> > > Micah > >> > > > >> > > > >> > > [1] > >> > > > >> > > > https://github.com/google/zetasql/blob/1aefaa7c62fc7a50def879bb7c4225ec6974b7ef/zetasql/public/numeric_value.h#L486 > >> > > [2] > https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html > >> > > [3] https://docs.python.org/3/library/decimal.html > >> > > >
Re: [DISCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
Per my comments there, the introduction of field buffers was added as part of the FieldVector addition, when we had vectors that weren't field-level. This meant that getBuffers and getFieldBuffers were at different levels of the hierarchy (getBuffers being more general). I believe we no longer have the case of a vector that is not a field. As such, collapsing FieldVector and ValueVector makes a lot of sense. In the thread I tried to say that we should capture the use cases and then consolidate to support those use cases. I think there are definitely two use cases (with sub-cases in one of them) that I'm aware of: - get me buffers for writing to IO (socket or disk) - transfer the buffers - keep the buffers in the vector (double reference counts) - get me the buffers so I can do some kind of low-level memory operations (as cheap as possible please) I actually think we've also failed to complete the other work that we talked about wrt removing readerIndex and writerIndex from ArrowBuf. They aren't well maintained by any of the vectors and thus are mostly broken (and create a bunch of this confusion). So I'd actually be inclined to fix the readerIndex/writerIndex issue before we mess with the buffer methods. Removing that whole concept will collapse the use cases, I believe. Then we only have transfer versus not (which is what the getBuffers() method already provides). On Mon, Aug 10, 2020 at 1:44 AM Ji Liu wrote: > Hi Micah, I am afraid it's not a reasonable solution. > 1. The status is that getFieldBuffers has right order buffer and was used > in IPC, getBuffers was not used in IPC. > 2. The purpose of this PR is to use getBuffers in IPC instead, and making > changes in getFieldBuffers does not seem to help this problem since it will > break IPC format by using getBuffers. > > On Sat, Aug 8, 2020 at 11:50 AM Micah Kornfield wrote: > > Thinking about this some more, I think maybe we should also potentially > > try to deprecate hold off on any changes to getFieldBuffers. It should > > likely follow the same sort of pattern for getBuffers (create a new > method > > that doesn't set reader/writer indices and deprecate getFieldBuffers). > But > > I think that can be handled in a separate PR? > > > > Anybody else have thoughts? > > > > -Micah > > > > On Tue, Aug 4, 2020 at 11:24 PM Ji Liu wrote: > > > hi liya, > > > Thanks for your careful review, it is a typo, the order of getBuffers > is > > > wrong. > > > > > > On Wed, Aug 5, 2020 at 2:14 PM Fan Liya wrote: > > > > Hi Ji, > > > > > > > > IMO, for the correct order, the validity buffer should precede the > > offset > > > > buffer (e.g. this is the order used by BaseVariableWidthVector & > > > > BaseLargeVariableWidthVector). > > > > In ListVector#getBuffers, the offset buffer precedes the validity > > buffer, > > > > so I am a little confused why you say the order of > > ListVector#getBuffers > > > is > > > > right? > > > > > > > > Best, > > > > Liya Fan > > > > > > > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield < emkornfi...@gmail.com > > > > > > wrote: > > > > > FWIW, I lack historical context on how these methods evolved, so > I'd > > > > > appreciate insight from anyone who has worked on the java codebase > > for > > > a > > > > > longer period of time. The current situation seems less than > ideal. > > > > > > > > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu > wrote: > > > > > > Hi all, > > > > > > > > > > > > > > > > > > When I worked on ARROW-7539[1], I met some problems and not sure > > > what's > > > > > the > > > > > > proper way to solve it.
> > > > > > > > > > > > > > > > > > This issue was about to avoid set reader/writer indices in > > > > > > FieldVector#getFieldBuffers according to the following reasons: > > > > > > > > > > > > i. getBuffers set reader/writer indices and it's right for the > > > purpose > > > > of > > > > > > sending the data over the wire > > > > > > > > > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for > > > > getting > > > > > > access to underlying data for higher-performance algorithms > > > > > > > > > > > > > > > > > > Currently in VectorUnloader, we used getFieldBuffers to create > > > > > > ArrowRecordBatch that's why we keep writer/reader indices in > > > > > > getFieldBuffers > > > > > > , we should use getBuffers instead. But during the change, we > found > > > > > another > > > > > > problem: > > > > > > > > > > > > The order of validity and offset buffers are not in the same > order > > in > > > > > > ListVector(getBuffers's order is right), changing the API in > > > > > VectorUnloader > > > > > > creates problems with serialization/deserialization resulting in > > test > > > > > > failures in Dremio which would break backward compatibility with > > > > existing > > > > > > serialised files. > > > > > > > > > > > > > > > > > > Micah gives a solution but seems doesn't reach consistent in the > PR > > > > > > thread[2 > > > > > > ]: > > > > > > > > > > > >1. Remove setReaderWrite
Re: [DISCUSS] Adding a pull-style iterator API to the C data interface
I think this unlocks a bunch of use cases. I think people are generally using Arrow in simpler, non-streaming ways right now and thus the quiet. Producing an iterator pattern is logical as you move to streams of smaller chunks (common in distributed and multi-tenant systems). On Mon, Aug 10, 2020 at 11:56 AM Wes McKinney wrote: > I'm still in need of it. I'd be interested in developing a solution > that can be used in some database APIs, e.g. using it for the result > interface for an embedded SQL database like SQLite or DuckDB would be > an interesting motivating use case. > > One approach would be to create something unofficial and used only in > the C++ library's implementation of the C API so that it can make > breaking changes for a time and then propose to formalize it in the > ABI later. > > On Mon, Aug 10, 2020 at 9:22 AM Antoine Pitrou > wrote: > > > > > > From the absence of response, it would seem there isn't much interest > > in this. Please speak up if you think this would be useful to you. > > > > Regards > > > > Antoine. > > > > > > On Tue, 7 Jul 2020 07:49:17 -0500 > > Wes McKinney wrote: > > > Any opinions about this? It seems the next steps would be a concrete > > > API proposal and perhaps a reference implementation thereof. > > > > > > On Sun, Jun 28, 2020 at 11:26 PM Wes McKinney > wrote: > > > > > > > > In ARROW-8301 [1] and elsewhere we've been discussing how to > > > > communicate what amounts to a sequence of arrays or a sequence of > > > > RecordBatch objects using the C data interface. > > > > > > > > Example use cases: > > > > > > > > * Returning a sequence of record / row batches from a database driver > > > > * Sending a C++ arrow::ChunkedArray or arrow::Table to a consumer > > > > using only the C interface > > > > > > > > Applications could define their own custom iterator interfaces to > > > > communicate what amounts to a sequence of the ArrowArray C interface > > > > objects, but it is likely a common enough use case to have an > > > > off-the-shelf solution so that we can support this solution in our > > > > reference libraries (e.g. Arrow C++, pyarrow, Arrow R) > > > > > > > > I suggested a C structure as follows > > > > > > > > struct ArrowArrayStream { > > > > void (*get_schema)(struct ArrowSchema*); > > > > // Non-zero return value indicates an error? > > > > int (*get_next)(struct ArrowArray*); > > > > void (*get_error)(... ERROR HANDLING TODO ); > > > > void (*release)(struct ArrowArrayStream*); > > > > void* private_data; > > > > }; > > > > > > > > The producer would populate this object with pointers to its > > > > implementations of these functions. > > > > > > > > Thoughts about this? > > > > > > > > Thanks, > > > > Wes > > > > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-8301 > > > > > > > > > >
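For comparison, the consumer side of the proposed get_schema / get_next contract mirrors the pull-style iteration pyarrow already exposes for IPC streams; a rough Python sketch of that shape (the in-memory stream is purely illustrative, and the C ABI version would surface these operations through the struct above rather than Python methods):

import pyarrow as pa

# Build a tiny IPC stream in memory purely for illustration.
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# Pull-style consumption: get the schema once, then pull batches until the
# stream is exhausted -- the same shape as get_schema / get_next above.
reader = pa.ipc.open_stream(sink.getvalue())
schema = reader.schema
while True:
    try:
        next_batch = reader.read_next_batch()
    except StopIteration:
        break
    # ... process next_batch ...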
Re: [Java] Supporting Big Endian
Hey Micah, thanks for starting the discussion. I just skimmed that thread and it isn't entirely clear that there was a conclusion that the overhead was worth it. I think everybody agrees that it would be nice to have the code work on both platforms. On the flipside, the code noise for a rare case makes the cost-benefit questionable. In the Java code, we wrote the code to explicitly disallow big endian platforms and put preconditions checks in. I definitely think if we want to support this, it should be done holistically across the code with an appropriate test plan (both functional and perf). To me, the question is really about how many use cases are blocked by this. I'm not sure I've heard anyone say that the limiting factor to leveraging Java Arrow was the block on endianness. Keep in mind that until very recently, using any Arrow Java code would throw a preconditions check before you could even get started on big-endian and I don't think we've seen a bunch of messages on that exception. Adding if conditions throughout the codebase like this patch: [1] isn't exactly awesome and it can also risk performance impacts depending on how carefully it is done. If there isn't a preponderance of evidence of many users being blocked by this capability, I don't think we should accept the code. We already have a backlog of items that we need to address just to ensure existing use cases work well. Expanding to new use cases for which there is no clear demand will likely just increase code development cost at little benefit. What do others think? [1] https://github.com/apache/arrow/pull/7923#issuecomment-67439 On Fri, Aug 14, 2020 at 4:36 PM Micah Kornfield wrote: > Kazuaki Ishizaki has started working on Big Endian support in Java > (including setting up CI for it). Thank you! > > We previously discussed support for Big Endian architectures in C++ [1] and > generally agreed that it was a reasonable thing to do. > > Similar to C++ I think as long as we have a working CI setup it is > reasonable for Java to support Big Endian machines. > > But I think there might be differing opinions so it is worth a discussion > to see if there are technical blockers or other reasons for not supporting > Big Endian architectures in the existing java implementation. > > Thanks, > Micah > > > [1] > > https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E >
Re: [Java] Supporting Big Endian
Hi Micah, Thank you for picking up this topic. It is great to discuss support for Big Endian in the Java implementation on the mailing list. I may have missed some technical blockers; any comments are appreciated. Best Regards, Kazuaki Ishizaki From: Micah Kornfield To: dev Date: 2020/08/15 08:36 Subject: [EXTERNAL] [Java] Supporting Big Endian Kazuaki Ishizaki has started working on Big Endian support in Java (including setting up CI for it). Thank you! We previously discussed support for Big Endian architectures in C++ [1] and generally agreed that it was a reasonable thing to do. Similar to C++ I think as long as we have a working CI setup it is reasonable for Java to support Big Endian machines. But I think there might be differing opinions so it is worth a discussion to see if there are technical blockers or other reasons for not supporting Big Endian architectures in the existing java implementation. Thanks, Micah [1] https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
[Java] Supporting Big Endian
Kazuaki Ishizaki has started working on Big Endian support in Java (including setting up CI for it). Thank you! We previously discussed support for Big Endian architectures in C++ [1] and generally agreed that it was a reasonable thing to do. Similar to C++, I think as long as we have a working CI setup it is reasonable for Java to support Big Endian machines. But I think there might be differing opinions so it is worth a discussion to see if there are technical blockers or other reasons for not supporting Big Endian architectures in the existing Java implementation. Thanks, Micah [1] https://lists.apache.org/thread.html/rcae745f1d848981bb5e8dddacfc4554641aba62e3c949b96bfd8b019%40%3Cdev.arrow.apache.org%3E
Re: My focus for Rust implementation for 2.0.0
First, an update on progress. Once the PRs for ARROW-9711 and ARROW-9716 are merged, it is possible to run TPC-H query 1 against a 100 GB data set with similar performance to Apache Spark in local mode. I plan on testing larger datasets over the weekend. To answer Kirill's question, I wouldn't necessarily characterize it as giving up on exploring any integration with Gandiva. There are several integrations that I would be interested in exploring, including with the Arrow C Data Interface, and the C++ Dataset work that is happening, but I only have so much time available to contribute to this project and I have some specific goals that I am working towards that are a much higher priority for me right now. Also, I am encouraged by the performance I'm seeing from DataFusion after some of the changes this week, and I know there is plenty of room for improvement still. This perhaps makes it less compelling to explore delegating to C++ at this point. However, it would be nice to see some performance comparisons between DataFusion and the C++ Dataset work. Thanks, Andy. On Fri, Aug 14, 2020 at 2:18 AM Kirill Lykov wrote: > Sounds interesting as we wanted to start using DataFusion. > Btw, I vaguely remember that in the original repository you had issue > like "investigate DataFusion with Gandiva", I'm curious why you have > decided to give up with it? > > On Thu, Aug 13, 2020 at 5:11 PM Andy Grove wrote: > > > > Some of you may have noticed a sudden flurry of activity from me after a > > bit of a break from the project, so I thought it might be useful to > explain > > what I am up to. > > > > As of 1.0.0, DataFusion isn't really useful against any real-world data > > sets for a number of reasons, but most of all due to the simplistic > > threading/partitioning model. There are a few small bugs as well. > > > > My current focus is to be able to run TPC-H query 1 against decent size > > datasets (starting with the 100 GB dataset) with hundreds of partitions. > I > > believe that I can get this working with some fairly small changes. > Later, > > we can experiment with more advanced threading models and async, using > the > > same benchmark to measure improvements. > > > > Let me know if you have any questions. > > > > Thanks, > > > > Andy. > > > > -- > Best regards, > Kirill Lykov >
Re: Building an executable with arrow flight (C++)
Using ExternalProject should work as well (again, if it doesn't, it's a bug and should be reported). We should augment our examples to include an example use with ExternalProject https://issues.apache.org/jira/browse/ARROW-9740 On Thu, Aug 13, 2020 at 9:44 PM Radu Teodorescu wrote: > Hi Wes, > > I will certainly give that a shot and provide feedback - my typical setup > with arrow has so far used ExternalProject and I tend to prefer this for > development vs the install path since it makes it easier track problems, > step into the code, run arrow examples and tests when I need a quick usage > sample etc. > > So if possible I would like to stick to that or one of the other cmake > options for including the arrow source into a project > > > > > On Aug 13, 2020, at 7:27 PM, Wes McKinney wrote: > > > > > > hi Radu, > > > > > > If you use the approach in > > > > > > https://github.com/apache/arrow/blob/master/cpp/examples/minimal_build > > > > > > It should be sufficient to use > > > > > > find_package(ArrowFlight REQUIRED) > > > > > > and then use the imported arrow_flight_static target (or > > > arrow_flight_shared, depending on your needs) when linking. If that > > > does not work, it's a bug and you should open a JIRA issue. We just > > > worked a bunch on this for 1.0.0 and after so it's important that this > > > work consistently. > > > > > >> On Thu, Aug 13, 2020 at 4:20 PM Radu Teodorescu > > >> wrote: > > >> > > >> I can produce something isolated shortly - but really the questions is > how can one build a hello world type flight server that does something like > > >> { > > >> FlightServerBase server; > > >> server.Serve(); > > >> //Yes I know this would fail at runtime but I just need to get there > first > > >> } > > >> > > >> with a fully self contained CMake project (i.e. that doesn’t depend on > having arrow or it’s dependencies preinstalled). > > >> > > >> If you have something like that that works, I can take it from there > > >> Thank you > > >> Radu > > >> > > On Aug 13, 2020, at 4:42 PM, Sutou Kouhei wrote: > > >>> > > >>> Hi, > > >>> > > >>> Could you share a minimal CMake and C++ file set to > > >>> reproduce your case? > > >>> > > >>> > > >>> Thanks, > > >>> -- > > >>> kou > > >>> > > >>> In > > >>> "Building an executable with arrow flight (C++)" on Thu, 13 Aug 2020 > 12:06:49 -0400, > > >>> Radu Teodorescu wrote: > > >>> > > Hello, > > I am trying to build a server that uses arrow flight and getting into > a bit of a rabbit hole with dependency inclusion. > > I have arrow included as an external project and so far everything > has worked really smoothly (I have executables building with arrow, parquet > arrow and I also have arrow flight libraty building fine). > > When I try to build an executable that user flight lib, I am getting > a never-ending stream of missing dependencies (mostly grpc related). > > The flight-test-server is building without any issues but I cannot > see a clean way to point my cmake to the same list of dependencies that are > built internally by arrow CMake stack (without duplicating a lot of the > existing arrow CMake and/or manually defining all the dependencies) > > > > I realize this is mostly a gRPC and CMake question, but I am hoping > someone had walked this road before or there is some public domain project > I can use as an integration reference. > > Thank you > > Radu > > >> > > > >
[NIGHTLY] Arrow Build Report for Job nightly-2020-08-14-0
Arrow Build Report for Job nightly-2020-08-14-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0 Failed Tasks: - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-test-conda-cpp-valgrind - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-kartothek-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-test-conda-python-3.7-kartothek-latest - test-conda-python-3.7-kartothek-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-test-conda-python-3.7-kartothek-master - ubuntu-focal-arm64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-ubuntu-focal-arm64 Succeeded Tasks: - centos-6-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-centos-6-amd64 - centos-7-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-centos-7-aarch64 - centos-7-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-centos-7-amd64 - centos-8-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-centos-8-aarch64 - centos-8-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-centos-8-amd64 - conda-clean: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-clean - conda-linux-gcc-py36-cpu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-linux-gcc-py36-cpu - conda-linux-gcc-py36-cuda: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-linux-gcc-py36-cuda - conda-linux-gcc-py37-cpu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-linux-gcc-py37-cpu - conda-linux-gcc-py37-cuda: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-linux-gcc-py37-cuda - conda-linux-gcc-py38-cpu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-linux-gcc-py38-cpu - conda-linux-gcc-py38-cuda: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-linux-gcc-py38-cuda - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-osx-clang-py38 - conda-win-vs2017-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-win-vs2017-py36 - conda-win-vs2017-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-win-vs2017-py37 - conda-win-vs2017-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-azure-conda-win-vs2017-py38 - debian-buster-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-debian-buster-amd64 - debian-buster-arm64: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-debian-buster-arm64 - debian-stretch-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-debian-stretch-amd64 - debian-stretch-arm64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-debian-stretch-arm64 - example-cpp-minimal-build-static-system-dependency: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-example-cpp-minimal-build-static-system-dependency - example-cpp-minimal-build-static: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-example-cpp-minimal-build-static - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-gandiva-jar-osx - gandiva-jar-xenial: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-gandiva-jar-xenial - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-homebrew-cpp - homebrew-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-travis-homebrew-r-autobrew - nuget: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-14-0-github-nuget - test-conda-cpp: URL: https://github
RE: Arrow Flight + Go, Arrow for Realtime
Thanks Wes, I'll likely work on that once I get my head around Arrow in general and confirm we will use it for the project. How to account for the streaming-append problem with an otherwise immutable dataset is my current concern. Still thinking through that. Regards Mark. -Original Message- From: Wes McKinney Sent: Wednesday, August 12, 2020 3:59 PM To: dev Subject: Re: Arrow Flight + Go, Arrow for Realtime There's a WIP patch for Flight support in Go https://github.com/apache/arrow/pull/6731 I hope to see someone taking up this work as first-class Flight support in Go would be very useful for building data services. On Wed, Aug 12, 2020 at 5:08 AM Adam Lippai wrote: > > Arrow is mainly about batching data and leveraging all the > opportunities this gives. > This means you either have to buffer the data yourself and flush it > when a reasonable sized batch is complete or play with preallocating > Arrow structures This was discussed recently, you might be interested > in the thread: > https://www.mail-archive.com/dev@arrow.apache.org/msg19862.html > > Note: I'm not an Arrow developer, I'm just following the "streaming" > features of the Arrow lib, I'm interested in having a "rolling window" > API (like a fixed size FIFO queue). > > Best regards, > Adam Lippai > > On Wed, Aug 12, 2020 at 11:29 AM wrote: > > > I'm looking at using Arrow for a realtime IoT project which includes > > use cases both on server, and also for transferring /using in a > > Browser via WASM, and have a few questions. > > > > > > > > Language in use is Go. > > > > > > > > Is anyone working on implementing Arrow-Flight in Go ? (According to > > the feature matrix, nothing ready yet, so wanted to check. > > > > > > > > Has anyone tried using Apache Arrow in Go WASM (Webassembly) ? if so, > > any issues ? > > > > > > > > Any pointers/documentation on using/extending Arrow for realtime streaming > > cases. (Specifically where a DataFrame is requested, but then it needs to > > 'grow' as new data arrives, often at high speed). > > > > Not language specific, just trying to understand the right pattern > > for using Arrow for this, and couldn't find much in the docs. > > > > > > > > Regards > > > > > > > > Mark. > > > >
RE: Arrow Flight + Go, Arrow for Realtime
Thanks Wes & Sebastien, I've tested Arrow in Go-WASM now and it is working fine. Still getting my head around best way to use it for our Use case (IoT Data) My goal here is to hit a Flight endpoint from the Browser (GO-WASM specifically), and pull (all or part of) an Arrow dataset on the server, into the Browser for visualization and local analysis. One of the issues I will contend with is that visualization can 'walk backwards' in a larger dataset, (Scrolling up), not just forwards like an analytic generally does. Second goal is to update this visualization from a realtime stream, which can be as fast as 1 sample per second. I'm wondering if a use (abuse ?) of batches and pre-allocation might work for streaming updates. Note: It may be that using arrow like this for visualization is not appropriate, but I think it would be great if it can. Regards Mark. -Original Message- From: Sebastien Binet Sent: Wednesday, August 12, 2020 1:53 PM To: dev@arrow.apache.org Subject: Re: Arrow Flight + Go, Arrow for Realtime Mark, AFAIK, nobody's actively working on Arrow-Flight for Go (I think somebody started that work at some point but I don't remember anything hitting the main repo) as for Go+WASM: https://lists.apache.org/thread.html/e15dc80debf9dea1b33581fa6ba95fd84b57c0ccd0162505d5d25079%40%3Cdev.arrow.apache.org%3E ie: === I've just tried compiling this example: - https://godoc.org/github.com/apache/arrow/go/arrow#example-package--Table to wasm. compilation went fine: $> GOOS=js GOARCH=wasm go build -o foo.wasm foo.go $> go-wasm ./foo.wasm rec[0]["f1-i32"]: [1 2 3 4 5] rec[0]["f2-f64"]: [1 2 3 4 5] rec[1]["f1-i32"]: [6 7 8 (null) 10] rec[1]["f2-f64"]: [6 7 8 9 10] rec[2]["f1-i32"]: [11 12 13 14 15] rec[2]["f2-f64"]: [11 12 13 14 15] rec[3]["f1-i32"]: [16 17 18 19 20] rec[3]["f2-f64"]: [16 17 18 19 20] and it ran fine once this patch was added: - https://github.com/apache/arrow/pull/3707 hth, -s PS: go-wasm is an alias of mine for this file: https://github.com/golang/go/blob/master/misc/wasm/go_js_wasm_exec === hth, -s ‐‐‐ Original Message ‐‐‐ On Wednesday, August 12, 2020 11:29 AM, wrote: > I'm looking at using Arrow for a realtime IoT project which includes > use cases both on server, and also for transferring /using in a > Browser via WASM, and have a few questions. > > Language in use is Go. > > Is anyone working on implementing Arrow-Flight in Go ? (According to > the feature matrix, nothing ready yet, so wanted to check. > > Has anyone tried using Apache Arrow in Go WASM (Webassembly) ? if so, > any issues ? > > Any pointers/documentation on using/extending Arrow for realtime > streaming cases. (Specifically where a DataFrame is requested, but > then it needs to 'grow' as new data arrives, often at high speed). > > Not language specific, just trying to understand the right pattern for > using Arrow for this, and couldn't' find much in the docs. > > Regards > > Mark.
Re: My focus for Rust implementation for 2.0.0
Sounds interesting as we wanted to start using DataFusion. Btw, I vaguely remember that in the original repository you had an issue like "investigate DataFusion with Gandiva"; I'm curious why you have decided to give up on it? On Thu, Aug 13, 2020 at 5:11 PM Andy Grove wrote: > > Some of you may have noticed a sudden flurry of activity from me after a > bit of a break from the project, so I thought it might be useful to explain > what I am up to. > > As of 1.0.0, DataFusion isn't really useful against any real-world data > sets for a number of reasons, but most of all due to the simplistic > threading/partitioning model. There are a few small bugs as well. > > My current focus is to be able to run TPC-H query 1 against decent size > datasets (starting with the 100 GB dataset) with hundreds of partitions. I > believe that I can get this working with some fairly small changes. Later, > we can experiment with more advanced threading models and async, using the > same benchmark to measure improvements. > > Let me know if you have any questions. > > Thanks, > > Andy. -- Best regards, Kirill Lykov