Re: [JIRA] Request contributor role

2021-03-19 Thread Neal Richardson
Hi Bob, have you subscribed to the dev@arrow.apache.org mailing list? If you're having trouble sending messages to the list, that would definitely help. (I'm one of the moderators for the list, and I notice that many of your messages are requiring approval, which usually only happens when non-subsc

Re: [JIRA] Request contributor role

2021-03-19 Thread bobtins
Thanks! Also, I noticed you changed the description to "Updates to make dev on Windows easier" instead of "Windows and Java". I guess the issues I've run into would affect development in other languages; for example, the checkstyle config is not specific to Java, nor is the flatc compiler, but I

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-19 Thread Wes McKinney
> I might be misunderstanding, but I think Weld [1] is another project > targeting the lower level components? Weld IR is _really_ low level (not an expert, but have read the papers), see [1] for more > Also, I think there was a little bit of effort to come up with a common > expression represe

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Wes McKinney
Part of the rationale for the file format was to enable custom applications to put indexing structures in the file metadata. I still think this is useful and it's hard for us to know exactly how people are using this out in the wild. If you don't do this, then you must do a bunch of IPC reconstruct

Re: [JIRA] Request contributor role

2021-03-19 Thread Sutou Kouhei
Hi Bob, Done. Could you try again? Thanks, -- kou In <2042358562.2371734.1616184276...@mail.yahoo.com> "[JIRA] Request contributor role" on Fri, 19 Mar 2021 20:04:36 + (UTC), Bob Tinsman wrote: > I've logged a couple bugs and would like to assign myself. My id is > bobtinsman on JIRA

[JIRA] Request contributor role

2021-03-19 Thread Bob Tinsman
I've logged a couple bugs and would like to assign myself. My id is bobtinsman on JIRA; here is one of the bugs I logged: [ARROW-12006] updates to make dev on Java and Windows easier - ASF JIRA I tr

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Antoine Pitrou
One more general question is whether the file format is really beneficial over the stream format in practice. I understand the theoretical argument for direct access to specific batches, but are there situations where it really matters? Intuitively, it seems to me that if your data is real

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Wes McKinney
Okay, let’s open an issue then to address that at some point. What I recall from our last discussion was that the dictionaries would be “processed” when beginning to read the file, appending all the deltas to yield one set of dictionaries for reassembly. The downside is that the “partial dictionari
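A rough sketch of the "append all the deltas up front" idea described above, in plain Python rather than any Arrow API. The (dictionary_id, values, is_delta) tuple shape and the function name are assumptions made up for illustration; a real reader works on IPC dictionary messages.

```python
def merge_dictionary_deltas(messages):
    """Fold dictionary messages into one dictionary per id.

    `messages` is a hypothetical iterable of (dictionary_id, values, is_delta)
    tuples, in file order. This is not Arrow's API, only an illustration.
    """
    dictionaries = {}
    for dict_id, values, is_delta in messages:
        if dict_id not in dictionaries:
            if is_delta:
                raise ValueError(f"delta for unknown dictionary {dict_id}")
            # First message for this id defines the dictionary.
            dictionaries[dict_id] = list(values)
        elif is_delta:
            # A delta appends new values; existing indices stay valid.
            dictionaries[dict_id].extend(values)
        else:
            # A second non-delta message would be a replacement.
            raise ValueError(f"replacement of dictionary {dict_id} is not supported")
    return dictionaries


merged = merge_dictionary_deltas([
    (0, ["a", "b"], False),
    (0, ["c"], True),
])
# Indices from any batch can now be decoded against the merged dictionary.
decoded = [merged[0][i] for i in [2, 0, 1]]
```

Because deltas only append, indices written before a delta keep their meaning after the merge, which is why one merged dictionary is enough for reassembly in this sketch.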

Re: [DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Wes McKinney
> It seems that the schema changes to arrow is a custom solution for just > Perspective and it might be prudent to wait for Arrow 4 that will have a > standard way of representing this information. Arrow 4.0.0 is not going to have the pivot table structures you are looking for (speaking as the o

Re: [ALL] Integration tests for dense and sparse tensor

2021-03-19 Thread Antoine Pitrou
Golden files can also make it easier to implement the read side without firing up the entire integration machinery. Regards Antoine.
On 19/03/2021 at 17:56, Micah Kornfield wrote:
> For historical context golden files were first introduced so we could verify backwards compatibility. I th

Re: [ALL] Integration tests for dense and sparse tensor

2021-03-19 Thread Micah Kornfield
For historical context, golden files were first introduced so we could verify backwards compatibility. I think the preferred method is still to do "live" testing (i.e., having one implementation consume JSON and output a binary file, read the binary file with the second implementation and emit JSON, a
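To make the "live" flow concrete, here is a minimal sketch in Python. The message is cut off before the last step, so the final comparison of the two JSON documents is an assumption. The producer and consumer below are toy stand-ins (hypothetical names), not the actual integration tooling of any Arrow implementation.

```python
import json


def live_round_trip(json_text, produce_binary, consume_binary):
    """Run JSON through a producer and a consumer and compare the results.

    `produce_binary` stands in for implementation A (JSON in, bytes out) and
    `consume_binary` for implementation B (bytes in, JSON out). Both are
    hypothetical callables supplied by the caller.
    """
    binary = produce_binary(json_text)
    emitted_json = consume_binary(binary)
    # Compare parsed documents so that whitespace and key order do not matter.
    return json.loads(emitted_json) == json.loads(json_text)


def toy_producer(json_text):
    # Toy "binary" encoding: compact UTF-8 JSON bytes.
    return json.dumps(json.loads(json_text), separators=(",", ":")).encode("utf-8")


def toy_consumer(binary):
    # Toy reader: decode the bytes and re-emit the JSON with other formatting.
    return json.dumps(json.loads(binary.decode("utf-8")), indent=2)


ok = live_round_trip('{"a": [1, 2, 3]}', toy_producer, toy_consumer)
```

A golden-file test would replace `toy_producer` with a file checked into the repository, so only the read side runs.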

Re: [ALL] Integration tests for dense and sparse tensor

2021-03-19 Thread Jorge Cardoso Leitão
Hi, Thanks a lot for bringing this up, Fernando. I had the same thought when I first looked at the tensor implementation in Rust. Now it is a bit more clear :) So, if I understood correctly, the direction would be to declare a "JSON-integration" equivalent for tensors, generate a set of "golden b

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-19 Thread Antoine Pitrou
If we want this format to be common to different execution engines then it seems like it should represent logical expressions indeed (which may be implemented by different physical operators, depending on the execution engine). But I'm no expert in the matter. Regards Antoine. On 18/03/

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Antoine Pitrou
On 19/03/2021 at 13:37, Wes McKinney wrote:
> I am also under the impression that the file format is supposed to support deltas, but not replacements. Is this not implemented in C++?
Definitely not. Also, I was not aware that the file format was supposed to support deltas. Regards Antoine

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Nate Bauernfeind
Actually, I slightly want to rephrase my claim. I see the footer is defined as: table Footer { version: org.apache.arrow.flatbuf.MetadataVersion; schema: org.apache.arrow.flatbuf.Schema; dictionaries: [ Block ]; recordBatches: [ Block ]; /// User-defined metadata custom_metadata: [

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Nate Bauernfeind
The dictionary is not allowed to change throughout the file, which is ultimately OP's request. This is because all of the dictionary definition is in the footer of the file, which was clearly done to support random access of record batches. To quote the documentation: > We define a “file format”
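As a rough illustration of why a footer of block locations gives random access, here is a toy layout in Python. It is not the Arrow IPC file format: the fixed-width (offset, length) entries, the trailing footer length, and the function names are all assumptions made up for this sketch.

```python
import io
import struct


def write_toy_file(batches):
    """Write payloads back to back, then a footer of (offset, length) blocks."""
    buf = io.BytesIO()
    blocks = []
    for payload in batches:
        blocks.append((buf.tell(), len(payload)))
        buf.write(payload)
    # Footer: one pair of little-endian unsigned 64-bit integers per batch.
    footer = b"".join(struct.pack("<QQ", off, size) for off, size in blocks)
    buf.write(footer)
    # The footer length goes last so a reader can find the footer from the end.
    buf.write(struct.pack("<I", len(footer)))
    return buf.getvalue()


def read_toy_batch(data, index):
    """Fetch batch `index` by consulting only the footer, not earlier batches."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = data[len(data) - 4 - footer_len:len(data) - 4]
    off, size = struct.unpack_from("<QQ", footer, index * 16)
    return data[off:off + size]


data = write_toy_file([b"first", b"second", b"third"])
third = read_toy_batch(data, 2)
```

The reader jumps straight to the requested batch. If the meaning of a batch depended on which dictionary messages preceded it in the stream, that jump would no longer be safe, which is the constraint described above.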

Re: [DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Michael Lavina
Hey Tim, Maybe you can shed some light on this for me. Again, sorry if this is well known, but I just found out about Perspective and I have been playing around with it. Is the thought that the output of to_arrow() should not be used in a non-Perspective context? For my use case we are thinking of u

Re: [DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Tim Paine
Perspective uses Arrow across the wire but internally uses its own formats. Tim Paine tim.paine.nyc 908-721-1185 > On Mar 19, 2021, at 09:46, Michael Lavina wrote: > > Hey Benjamin, > > That sounds really awesome. Thank you. > > Sorry if this was already a well known thing as I am fairly n

Re: [DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Michael Lavina
Hey Benjamin, That sounds really awesome. Thank you. Sorry if this was already a well known thing as I am fairly new to the Arrow ecosystem. Is there a way to track a roadmap for Arrow 4 and be involved in that? Is there anywhere I can read more just general information on that? -Michael From

Re: [DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Benjamin Kietzman
Hi Michael, We are targeting grouped aggregation for 4.0 as part of a general query engine buildout. We also intend to bring DataFrame functionality into core Arrow (which would probably include an analog of pandas' pivot_table), but the query engine work is a prerequisite. Ben Kietzman On Fri,

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Wes McKinney
I am also under the impression that the file format is supposed to support deltas, but not replacements. Is this not implemented in C++?
On Thu, Mar 18, 2021 at 9:57 PM Nate Bauernfeind wrote:
> If dictionary replacements were supported, then the IPC file format
> couldn't guarantee random acces

[DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Michael Lavina
Hey Team, Sorry if this is answered already somewhere; I tried searching emails and issues but couldn’t find anything. I am wondering if there is a standard way to encode row or column pivots in Arrow? I know pandas already does it in some way https://pandas.pydata.org/pandas-docs/stable/reference