Re: sparse data array

2021-03-31 Thread bobtins
I appreciate the feedback. I realize it's a tricky nut to crack; there's always 
going to be a desire to use compression to improve scaling, and I was trying to 
identify a connecting thread between various requests for compression 
enhancements on this list and my own experience. I'll look at the spec again 
and put it on my back burner.

On 2021/03/31 04:03:07, Micah Kornfield  wrote: 
> Hi Bob,
> 
> 
> > I can observe that in a project like Arrow, there is always a tension
> > between compatibility and extensibility, and it makes me wonder if it would
> > be helpful to add capabilities without changing the spec. An extension type
> > can be defined in terms of one of the built-in layouts, but it would define
> > semantics (such as compression) that would be used to interpret that layout.
> 
> 
> I'm not sure if this is referring to existing extension types [1], but I
> believe they are insufficient for this purpose.  The compression
> techniques being discussed don't work well, because compression violates
> the fundamental assumptions of the existing protocols: each array is
> expected to have an equal number of slots.  So an array compressed as a
> struct would cause misalignment with non-encoded arrays.
> 
> For example, integers are stored in blocks of 4096 values, with each block
> > the minimum size to hold all the values. You access the value n with the
> > expression "block[n >> 12][n % 4096]".
> > Take an example with 1M int32 values. Value 0 is 1e9 but all the others
> > are 0 to 9.
> > Normally you would use 4M bytes to store these, but you could instead have
> > 1 block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage for
> > block offsets, so a savings of almost 75%. If you could have uint4 blocks
> > you could save about 87%.
> 
> 
> This is difficult with the existing RecordBatch stream approach, since the
> schema is fixed ahead of time.  One could theoretically standardize a
> notion of schema replacement in communications.  The proposal I linked takes
> a different approach and allows for adjusting encodings on a per-message
> basis.  Both are potentially viable.
> 
> [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> 
> On Tue, Mar 30, 2021 at 5:09 PM bobtins  wrote:
> 
> > From your response, I'm inferring that in order to introduce this kind of
> > compression, support in the spec is needed, similar to how compression
> > types and parameters are enumerated in
> > https://github.com/apache/arrow/blob/master/format/Message.fbs. Any
> > change in the spec is a Big Deal (and it should be).
> >
> > I can observe that in a project like Arrow, there is always a tension
> > between compatibility and extensibility, and it makes me wonder if it would
> > be helpful to add capabilities without changing the spec. An extension type
> > can be defined in terms of one of the built-in layouts, but it would define
> > semantics (such as compression) that would be used to interpret that
> > layout.
> >
> > > > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > > > jorgecarlei...@gmail.com> wrote:
> > > > >
> > > > > > Would it be an option to use a StructArray for that? One array
> > with the
> > > > > > values, and one with the repetitions:
> > > > > >
> > > > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > > > >
> > > > > > StructArray([
> > > > > > "values": Int32([1, 2, 3, 1, 2]),
> > > > > > "repetitions": UInt32([1, 3, 6, 1, 1]),
> > > > > > ])
> > > > > >
> > > > > > It does not have the same API, but I think that the physical
> > operations
> > > > > > would be different, anyways: ("array + 2" would only operate on
> > > > > "values").
> > > > > > I think that a small struct / object with some operator overloading
> > > > would
> > > > > > address this, and writing something on the metadata would allow
> > others
> > > > to
> > > > > > consume it, a-la extension type?
> > > > > >
> > > > > > On a related note, such encoding would address DataFusion's issue
> > of
> > > > > > representing scalars / constant arrays: a constant array would be
> > > > > > represented as a repetition. Currently we just unpack (i.e.
> > > > > > allocate) a constant array when we want to transfer through a
> > > > > > RecordBatch.

Re: sparse data array

2021-03-30 Thread bobtins
From your response, I'm inferring that in order to introduce this kind of 
compression, support in the spec is needed, similar to how compression types 
and parameters are enumerated in 
https://github.com/apache/arrow/blob/master/format/Message.fbs. Any change in 
the spec is a Big Deal (and it should be).

I can observe that in a project like Arrow, there is always a tension between 
compatibility and extensibility, and it makes me wonder if it would be helpful 
to add capabilities without changing the spec. An extension type can be defined 
in terms of one of the built-in layouts, but it would define semantics (such as 
compression) that would be used to interpret that layout. 

> > > On Thu, Mar 25, 2021 at 2:17 AM Jorge Cardoso Leitão <
> > > jorgecarlei...@gmail.com> wrote:
> > >
> > > > Would it be an option to use a StructArray for that? One array with the
> > > > values, and one with the repetitions:
> > > >
> > > > Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->
> > > >
> > > > StructArray([
> > > > "values": Int32([1, 2, 3, 1, 2]),
> > > > "repetitions": UInt32([1, 3, 6, 1, 1]),
> > > > ])
> > > >
> > > > It does not have the same API, but I think that the physical operations
> > > > would be different, anyways: ("array + 2" would only operate on
> > > "values").
> > > > I think that a small struct / object with some operator overloading
> > would
> > > > address this, and writing something on the metadata would allow others
> > to
> > > > consume it, a-la extension type?
> > > >
> > > > On a related note, such encoding would address DataFusion's issue of
> > > > representing scalars / constant arrays: a constant array would be
> > > > represented as a repetition. Currently we just unpack (i.e. allocate) a
> > > > constant array when we want to transfer through a RecordBatch.
> > > >
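Jorge's values/repetitions encoding can be sketched in a few lines of plain Python (hypothetical helpers, not Arrow API; note the example input contains a run of six 3s, so the repetitions sum to the input length):

```python
from itertools import groupby

def rle_encode(arr):
    """Collapse consecutive runs into parallel values/repetitions lists,
    mirroring the StructArray idea above."""
    values, repetitions = [], []
    for v, run in groupby(arr):
        values.append(v)
        repetitions.append(sum(1 for _ in run))
    return {"values": values, "repetitions": repetitions}

def rle_decode(enc):
    """Expand the runs back into a flat array."""
    out = []
    for v, n in zip(enc["values"], enc["repetitions"]):
        out.extend([v] * n)
    return out

data = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]
enc = rle_encode(data)
assert enc == {"values": [1, 2, 3, 1, 2], "repetitions": [1, 3, 6, 1, 1]}
assert rle_decode(enc) == data
```

As noted in the quoted message, an operation like "array + 2" would only touch the short "values" array, which is where the compute savings come from.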

I just reread the whole thread, and realized Jorge was saying a similar thing, 
that this new type could be built on an existing layout. But I guess I'm also 
imagining some more general capabilities:

* Define an extension type in terms of an existing layout
* Register an extension type implementation
* Enumerate available extension types

It would make life more complicated, for sure, but it would allow things like 
compression to evolve more quickly. I'm thinking about the in-memory 
implementation that I built, where I did various things to save memory. 

For example, integers are stored in blocks of 4096 values, with each block the 
minimum size to hold all the values. You access the value n with the expression 
"block[n >> 12][n % 4096]".
Take an example with 1M int32 values. Value 0 is 1e9 but all the others are 0 
to 9. 
Normally you would use 4M bytes to store these, but you could instead have 1 
block of int32 (16k) and 255 blocks of int8 (1020k) plus 1K storage for block 
offsets, so a savings of almost 75%. If you could have uint4 blocks you could 
save about 87%.
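The blocked layout above can be sketched in Python with NumPy (function names and the dtype-selection logic are mine, not from the original implementation; NumPy has no uint4, so this stops at the ~75% savings case):

```python
import numpy as np

def pack_blocks(values, block_size=4096):
    """Split an int array into fixed-size blocks, storing each block
    in the narrowest NumPy integer dtype that holds all of its values."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = np.asarray(values[start:start + block_size])
        # pick the smallest dtype covering both the min and max of the block
        dtype = np.promote_types(np.min_scalar_type(int(chunk.min())),
                                 np.min_scalar_type(int(chunk.max())))
        blocks.append(chunk.astype(dtype))
    return blocks

def get(blocks, n, block_size=4096):
    # equivalent to block[n >> 12][n % 4096] when block_size == 4096
    return int(blocks[n // block_size][n % block_size])

# 1M values: value 0 is 1e9, all others are 0 to 9
values = [10**9] + [i % 10 for i in range(1, 2**20)]
blocks = pack_blocks(values)
packed_bytes = sum(b.nbytes for b in blocks)
assert get(blocks, 0) == 10**9 and get(blocks, 13) == 3
# one 4-byte block (~16k) plus 255 one-byte blocks (~1020k),
# versus 4 MiB for a flat 4-byte array: roughly 75% saved
assert packed_bytes < len(values) * 4 * 0.3
```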

On the other hand, I had the luxury of total control over the implementation, 
so it was easy for me to try something and make it work. For Arrow, having the 
memory layout standardized makes it easy to implement various SIMD and GPU 
optimizations, which all go out the window if you allow arbitrary new data 
access semantics. 

So if this idea is hopelessly naive and half-baked and would result in a train 
wreck, please feel free to enlighten me/rip me a new one--I'm here to learn.



On 2021/03/27 17:35:32, Wes McKinney  wrote: 
> I’ve also heard interest from folks in the academic database community
> about adding compressed (sparse) in memory structures to the Arrow format /
> specification, so I think it makes more sense to try to figure things out
> at the specification / protocol level and then work on an implementation. I
> agree this seems above and beyond what I would think an intern could
> accomplish in a 10-12 week period given the process that has been involved
> with other significant additions to Arrow over the last several years.
> 
> On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov  wrote:
> 
> > Thanks for the information and ideas; I need to check them out (especially
> > the one with structs).
> > The RLE PR proposal is very interesting, since people internally have
> > expressed interest in this feature.
> > For the intern, I thought to ask them to work primarily at the data
> > structures level (an array adapter or something like that).
> > So I haven't thought about the communication layer, but it is a useful
> > feature per se.
> > However, it might have limited value in terms of contribution to Arrow
> > and, hence, not be that attractive for an intern.
> >
> > On Sat, Mar 27, 2021 at 12:50 AM Micah Kornfield 
> > wrote:
> >
> > > I made a proposal a while ago that covers a form of RLE encoding [1].  I
> > > haven't had time to work on it, since it is a substantial effort to
> > > implement.
> > >
> > > I wouldn't expect an intern to be able to complete the work necessary to
> > > get this merged over the course of a normal 3 month 

Re: [Java] Source control of generated flatbuffers code

2021-03-26 Thread bobtins
OK, originally this was part of 
https://issues.apache.org/jira/browse/ARROW-12006 and I was going to just add 
some doc on flatc, but I will make this a new bug because it's a little bigger: 
https://issues.apache.org/jira/browse/ARROW-12111

On 2021/03/23 23:40:50, Micah Kornfield  wrote: 
> >
> > I have a concern, though. Four other languages (Java would be five) check
> > in the generated flatbuffers code, and it appears (based on a quick scan of
> > Git logs) that this is done manually. Is there a danger that the binary
> > format could change, but some language might get forgotten, and thus be
> > working with the old format?
> 
> The format changes relatively slowly and any changes at this point should
> be backwards compatible.
> 
> 
> 
> > Or is there enough interop testing that the problem would get caught right
> > away?
> 
> In most cases I would expect integration tests to catch these types of
> error.
> 
> On Tue, Mar 23, 2021 at 4:26 PM bobtins  wrote:
> 
> > I'm happy to check in the generated Java source. I would also update the
> > Java build info to reflect this change and document how to regenerate the
> > source as needed.
> >
> > I have a concern, though. Four other languages (Java would be five) check
> > in the generated flatbuffers code, and it appears (based on a quick scan of
> > Git logs) that this is done manually. Is there a danger that the binary
> > format could change, but some language might get forgotten, and thus be
> > working with the old format? Or is there enough interop testing that the
> > problem would get caught right away?
> >
> > I'm new to the project and don't know how big of an issue this is in
> > practice. Thanks for any enlightenment.
> >
> > On 2021/03/23 07:39:16, Micah Kornfield  wrote:
> > > I think checking in the java files is fine and probably better than
> > relying
> > > on a third party package.  We should make sure there are instructions on
> > > how to regenerate them along with the PR.
> > >
> > > On Monday, March 22, 2021, Antoine Pitrou  wrote:
> > >
> > > >
> > > > Le 22/03/2021 à 20:17, bobtins a écrit :
> > > >
> > > >> TL;DR: The Java implementation doesn't have generated flatbuffers code
> > > >> under source control, and the code generation depends on an
> > > >> unofficially-maintained Maven artifact. Other language
> > implementations do
> > > >> check in the generated code; would it make sense for this to be done
> > for
> > > >> Java as well?
> > > >>
> > > >> I'm currently focusing on Java development; I started building on
> > Windows
> > > >> and got a failure under java/format, because I couldn't download the
> > > >> flatbuffers compiler (flatc) to generate Java source.
> > > >> The artifact for the flatc binary is provided "unofficially" (not by
> > the
> > > >> flatbuffers project), and there was no Windows version, so I had to
> > jump
> > > >> through hoops to build it and proceed.
> > > >>
> > > >
> > > > While this does not answer the more general question of checking in the
> > > > generated Flatbuffers code (which sounds like a good idea, but I'm not
> > a
> > > > Java developer), note that you could workaround this by installing the
> > > > Conda-provided flatbuffers package:
> > > >
> > > >   $ conda install flatbuffers
> > > >
> > > > which should get you the `flatc` compiler, even on Windows.
> > > > (see https://docs.conda.io/projects/conda/en/latest/ for installing
> > conda)
> > > >
> > > > You may also try other package managers such as Chocolatey:
> > > >
> > > >   https://chocolatey.org/packages/flatc
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > >
> >
> 


Re: [Java] Source control of generated flatbuffers code

2021-03-23 Thread bobtins
I'm happy to check in the generated Java source. I would also update the Java 
build info to reflect this change and document how to regenerate the source as 
needed.

I have a concern, though. Four other languages (Java would be five) check in 
the generated flatbuffers code, and it appears (based on a quick scan of Git 
logs) that this is done manually. Is there a danger that the binary format 
could change, but some language might get forgotten, and thus be working with 
the old format? Or is there enough interop testing that the problem would get 
caught right away?

I'm new to the project and don't know how big of an issue this is in practice. 
Thanks for any enlightenment.

On 2021/03/23 07:39:16, Micah Kornfield  wrote: 
> I think checking in the java files is fine and probably better than relying
> on a third party package.  We should make sure there are instructions on
> how to regenerate them along with the PR.
> 
> On Monday, March 22, 2021, Antoine Pitrou  wrote:
> 
> >
> > Le 22/03/2021 à 20:17, bobtins a écrit :
> >
> >> TL;DR: The Java implementation doesn't have generated flatbuffers code
> >> under source control, and the code generation depends on an
> >> unofficially-maintained Maven artifact. Other language implementations do
> >> check in the generated code; would it make sense for this to be done for
> >> Java as well?
> >>
> >> I'm currently focusing on Java development; I started building on Windows
> >> and got a failure under java/format, because I couldn't download the
> >> flatbuffers compiler (flatc) to generate Java source.
> >> The artifact for the flatc binary is provided "unofficially" (not by the
> >> flatbuffers project), and there was no Windows version, so I had to jump
> >> through hoops to build it and proceed.
> >>
> >
> > While this does not answer the more general question of checking in the
> > generated Flatbuffers code (which sounds like a good idea, but I'm not a
> > Java developer), note that you could workaround this by installing the
> > Conda-provided flatbuffers package:
> >
> >   $ conda install flatbuffers
> >
> > which should get you the `flatc` compiler, even on Windows.
> > (see https://docs.conda.io/projects/conda/en/latest/ for installing conda)
> >
> > You may also try other package managers such as Chocolatey:
> >
> >   https://chocolatey.org/packages/flatc
> >
> > Regards
> >
> > Antoine.
> >
> 


[Java] Source control of generated flatbuffers code

2021-03-22 Thread bobtins
TL;DR: The Java implementation doesn't have generated flatbuffers code
under source control, and the code generation depends on an
unofficially-maintained Maven artifact. Other language implementations do
check in the generated code; would it make sense for this to be done for
Java as well?

I'm currently focusing on Java development; I started building on Windows
and got a failure under java/format, because I couldn't download the
flatbuffers compiler (flatc) to generate Java source.
The artifact for the flatc binary is provided "unofficially" (not by the
flatbuffers project), and there was no Windows version, so I had to jump
through hoops to build it and proceed.
I wanted to document this procedure (see
https://issues.apache.org/jira/browse/ARROW-12006) but I was curious to
know whether this affects other implementations, and I found that these languages
have generated flatbuffers code checked in:

   - C++
   - JS
   - Rust
   - C#

I would like to consider adding Java to the list; this would eliminate a
hurdle for Java developers under Windows, and eliminate depending on the
unofficial artifact provided for other platforms (which BTW, is at 1.9,
behind the 1.12 version used by other languages).
Let me know if this makes sense, or if I'm missing something.


Re: [JIRA] Request contributor role

2021-03-19 Thread bobtins
Thanks!
Also, I noticed you changed the description to "Updates to make dev on Windows 
easier" instead of "Windows and Java". I guess the issues I've run into would 
affect development on other languages; for example, the checkstyle config is 
not specific to Java, nor is the flatc compiler, but I'd have to review builds 
for the other languages.

It's a worthwhile goal to remove blocks to developers on Windows, but I'm not 
sure about the scope of this. I think the flatc dependency would be the biggest 
headache. I guess I'll just throw it out for general comment--are there other 
roadblocks that Windows developers run into?

On 2021/03/19 21:58:44, Sutou Kouhei  wrote: 
> Hi Bob,
> 
> Done. Could you try again?
> 
> 
> Thanks,
> --
> kou
> 
> In <2042358562.2371734.1616184276...@mail.yahoo.com>
>   "[JIRA] Request contributor role" on Fri, 19 Mar 2021 20:04:36 + (UTC),
>   Bob Tinsman  wrote:
> 
> > I've logged a couple bugs and would like to assign myself. My id is 
> > bobtinsman on JIRA; here is one of the bugs I logged:
> > [ARROW-12006] updates to make dev on Java and Windows easier - ASF JIRA
> > 
> > I tried creating a new email on the archive page Pony Mail! 
> > 
> > 
> > 
> > but it didn't seem to work.
> > 
> > 
> 


Re: [Rust][DataFusion] Query Engine Design / DataFusion Implementation talk

2021-03-17 Thread bobtins
I missed the talk but watched the video, which was fascinating. It helped me 
get the whole picture of what DataFusion does, which is impressive. In my 
previous job, I built a data analysis engine on a smaller scale in Java, so 
some of the problems that DataFusion tackles are familiar to me.

The initial implementation of my engine would load some data from a relational 
DB into a columnar memory store that I implemented (very much like Arrow); it 
would then perform various transformations analogous to the logical plan in 
DataFusion (sort, group, filter, aggregate, etc), but also supporting OLAP-like 
multi-level hierarchies and cubes. This query model didn't have a language 
itself; the UI manipulated an object model which contained the logical plan 
(although unfortunately the query model was tangled with other layers).

This was later enhanced to generate SQL queries so you wouldn't have to load 
everything into memory, but you could do in-memory operations on top of the SQL 
result. I came up with an expression language close to SQL which could be 
translated into in-memory or SQL operations. I had to do something like the 
merge operator in DataFusion to support multi-stage aggregation (e.g. implement 
count(x) -> sum(count(x)), average(x) -> sum(sum(x))/sum(count(x)), etc.).
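That two-stage pattern, e.g. computing an average by merging per-partition partial sums and counts, can be sketched in plain Python (names are mine, not from any of the engines mentioned):

```python
def partial_agg(chunk):
    """Stage 1: per-partition partial state (count and sum are enough
    to later finalize both count(x) and average(x))."""
    return {"count": len(chunk), "sum": sum(chunk)}

def merge(states):
    """Stage 2: count(x) -> sum(count(x)),
    average(x) -> sum(sum(x)) / sum(count(x))."""
    total_count = sum(s["count"] for s in states)
    total_sum = sum(s["sum"] for s in states)
    return {"count": total_count, "avg": total_sum / total_count}

chunks = [[1, 2, 3], [4, 5], [6]]
result = merge([partial_agg(c) for c in chunks])
assert result == {"count": 6, "avg": 3.5}
```

The key point is that averaging the per-partition averages would be wrong when partitions differ in size; carrying (sum, count) pairs through the merge is what makes the aggregation decomposable.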

Like I said, my framework was nowhere near as heavy-duty as DataFusion + Arrow, 
but my familiarity with the power of in-memory columnar stores is what drew me 
to Arrow in the first place. 

I am curious about how the various language implementations in Arrow are 
evolving computation frameworks; for Rust, there is DataFusion, and I noticed 
that there has been a lot of work going on in C++/Python. For Java, it seems 
like this would be in the realm of Gandiva or the Dremio product...and of 
course there's Spark! I am still surveying the terrain, but any pointers to 
work people are doing in Java would be welcome.

On 2021/03/12 19:39:16, Andrew Lamb  wrote: 
> Here are links to the content, should anyone be interested:
> 
> Query Engine Design and the Rust-Based DataFusion in Apache Arrow
> recording: https://www.youtube.com/watch?v=K6eCAVEk4kU
> slides: (datafusion content starts on slide 6):
> https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934
> 
> On Thu, Mar 4, 2021 at 4:05 PM Andrew Lamb  wrote:
> 
> > In case anyone is interested in the topic in general or DataFusion in
> > particular, I plan a tech talk [1] next week about "Query Engine Design and
> > the Rust based DataFusion in Apache Arrow."
> >
> > If you are curious how (SQL) query engines in general are structured, I
> > plan to describe the typical high level architecture, using DataFusion as
> > an exemplar.
> >
> > It will be held next Wednesday, March 10, 2021 at 8:00 am PST | 4:00 pm
> > GMT, and posted publicly afterwards.
> >
> > Andrew
> >
> > [1] https://www.influxdata.com/community-showcase/influxdb-tech-talks/
> >
> >
> 


Re: [JAVA] issues encountered during build

2021-03-17 Thread bobtins



On 2021/03/12 06:36:24, Fan Liya  wrote: 
> Hi Bob,
> 
> Thanks for reporting the issues.
> I remember encountering the same problems with the JDBC tests (over one
> year ago).
> 
> Maybe it is not just related to the time zone, it is also related to the
> machine locale.
> I think we can open an issue to track it.
> 
OK, opened an issue: https://issues.apache.org/jira/browse/ARROW-11957
I'll create the pull request today, probably.
I noticed that I can't add watchers or even assign to myself; saw on the doc 
that someone needs to make me a Contributor.
Here's the bug for various Windows build nits: 
https://issues.apache.org/jira/browse/ARROW-12006


@Liya, I don't think it has anything to do with locale; it's the offset 
associated with the time zone which is showing up.

@Micah, I guess I'm used to doing local builds and having the JVM pick up my 
timezone; now that we're on daylight savings, everything will be off by 7 hours 
;-)

> 
> 
> On Fri, Mar 12, 2021 at 12:09 PM Micah Kornfield 
> wrote:
> 
> > Hi Bob,
> > Thanks for some feedback, I don't think a lot of people are developing on
> > windows.  Some answers in line:
> >
> > * Build does require Java 8, not "8 or later" as stated in java/README.md
> > >   There's a reference to sun.misc.Unsafe
> > > in
> > memory/memory-core/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java
> > > which of course went away in JDK 9.
> >
> > Is this the case even using "-Dio.netty.tryReflectionSetAccessible=true" as
> > mentioned in the README?
> >
> >
> > * The build won't work on Windows because the java/format POM downloads a
> > > binary flatc executable, but there's no artifact for Windows, just Linux
> > > and OSX. I wound up downloading Visual Studio and building the
> > flatbuffers
> > > project.
> >
> > This unfortunately sounds familiar; I think this indicates how few people
> > develop on Windows.  I also think the hosting of the Mac and Linux
> > binaries is not exactly official (they are hosted by a former
> > contributor).  Updating the README might be a good first step, with
> > instructions on how to do this.
> >
> > I see in the pom.xml that user.timezone is set to UTC. I have seen these
> > > types of errors in tests before; I know there are ways to insulate the
> > test
> > > from the user's current timezone but maybe someone knows what's going on
> >
> > This is somewhat surprising.  I would have thought we had the user.timezone set
> > for exactly this reason.  There might have been a regression, this might
> > make a good second contribution if you wanted to look into fixing it.
> >
> > * I bumped into what looks like a spurious checkstyle error: it reports
> > > memory/src/test/java/io/netty/buffer/TestExpandableByteBuf.java having no
> > > linefeed at the end when it definitely does. I set up Git not to do
> > Windows
> > > conversions, and I checked with various editors and binary dump
> > utilities.
> > > One source says that because I'm running on Windows, checkstyle
> > > actually expects a CR-LF and throws an error if it doesn't find it! I've
> > > worked around this by disabling the check.
> >
> > It looks like we can force the checker to assume linux line feeds:
> >
> > https://stackoverflow.com/questions/997021/how-to-get-rid-of-checkstyle-message-file-does-not-end-with-a-newline
> > (second answer).  This would also make a good contribution for someone new.
> >
> > Cheers,
> > Micah
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 11, 2021 at 7:14 PM bobtins  wrote:
> >
> > > My mail client took out all the linefeeds, so let me reformat; sorry
> > about
> > > that!
> > >
> > >
> > > In the process of slogging through the build, I've bumped into various
> > > issues. I'm happy to document them in java/README.md or make any other
> > > changes that might be helpful to others.
> > >
> > > I'm pretty experienced with Java and Maven, so I think these are not
> > > beginner's mistakes, but let me know if I'm missing something.
> > >
> > > A lot of these may be Windows-specific. I normally prefer Linux but just
> > > got a new laptop and haven't set it up, but this experience is giving me
> > a
> > > lot of incentive to run screaming back to Linux ;-)
> > >
> > > Environment details:
> > > * Windows 10
> > > * Java 8
> > > here's the output of java -version:
> > > openjdk v

Re: [JAVA] issues encountered during build

2021-03-12 Thread bobtins



On 2021/03/12 04:09:21, Micah Kornfield  wrote: 
> 
> * Build does require Java 8, not "8 or later" as stated in java/README.md
> >   There's a reference to sun.misc.Unsafe
> > in 
> > memory/memory-core/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java
> > which of course went away in JDK 9.
> 
> Is this the case even using "-Dio.netty.tryReflectionSetAccessible=true" as
> mentioned in the README?
> 
OK, now I can't reproduce the compile-time issue; if I point it at JDK 11 
it gets a warning, but previously it was actually unable to resolve 
"sun.misc.Unsafe" and got a compile-time error. I'm not sure why it happened 
before and doesn't now, so I'll move on.
> 
> * The build won't work on Windows because the java/format POM downloads a
> > binary flatc executable, but there's no artifact for Windows, just Linux
> > and OSX. I wound up downloading Visual Studio and building the flatbuffers
> > project.
> 
> This unfortunately sounds familiar; I think this indicates how few people
> develop on Windows.  I also think the hosting of the Mac and Linux binaries
> is not exactly official (they are hosted by a former contributor).  Updating
> the README might be a good first step, with instructions on how to do this.
> 
As far as providing instructions on building the flatbuffers project on 
Windows, I'm not an expert at all with the Microsoft ecosystem, but I could 
provide a summary.

> I see in the pom.xml that user.timezone is set to UTC. I have seen these
> > types of errors in tests before; I know there are ways to insulate the test
> > from the user's current timezone but maybe someone knows what's going on
> 
> This is somewhat surprising.  I would have thought we had the user.timezone set
> for exactly this reason.  There might have been a regression, this might
> make a good second contribution if you wanted to look into fixing it.
> 
I would think this would show up in nightly builds. I guess I could try 
older versions, or keep tracking it down to find the cause.

> * I bumped into what looks like a spurious checkstyle error: it reports
> > memory/src/test/java/io/netty/buffer/TestExpandableByteBuf.java having no
> > linefeed at the end when it definitely does. I set up Git not to do Windows
> > conversions, and I checked with various editors and binary dump utilities.
> > One source says that because I'm running on Windows, checkstyle
> > actually expects a CR-LF and throws an error if it doesn't find it! I've
> > worked around this by disabling the check.
> 
> It looks like we can force the checker to assume linux line feeds:
> https://stackoverflow.com/questions/997021/how-to-get-rid-of-checkstyle-message-file-does-not-end-with-a-newline
> (second answer).  This would also make a good contribution for someone new.
> 
Sure! 
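For reference, the fix from that linked answer amounts to pinning the expected line separator; a hypothetical checkstyle.xml fragment using the standard NewlineAtEndOfFile module would look roughly like:

```xml
<!-- Force the end-of-file newline check to expect LF even on Windows,
     instead of the platform default (CR-LF). -->
<module name="NewlineAtEndOfFile">
  <property name="lineSeparator" value="lf"/>
</module>
```

The exact placement depends on the project's existing checkstyle configuration.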
> 
> 



[JAVA] issues encountered during build

2021-03-11 Thread bobtins
My mail client took out all the linefeeds, so let me reformat; sorry about that!


In the process of slogging through the build, I've bumped into various issues. 
I'm happy to document them in java/README.md or make any other changes that 
might be helpful to others. 

I'm pretty experienced with Java and Maven, so I think these are not beginner's 
mistakes, but let me know if I'm missing something. 

A lot of these may be Windows-specific. I normally prefer Linux but just got a 
new laptop and haven't set it up, but this experience is giving me a lot of 
incentive to run screaming back to Linux ;-)

Environment details:
* Windows 10
* Java 8
here's the output of java -version:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode)

* Cygwin environment
* Maven 3.6.2


Issues encountered thus far:

* Build does require Java 8, not "8 or later" as stated in java/README.md
  There's a reference to sun.misc.Unsafe in 
memory/memory-core/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java 
which of course went away in JDK 9.   
* The build won't work on Windows because the java/format POM downloads a 
binary flatc executable, but there's no artifact for Windows, just Linux and 
OSX. I wound up downloading Visual Studio and building the flatbuffers project.

* I bumped into what looks like a spurious checkstyle error: it reports  
memory/src/test/java/io/netty/buffer/TestExpandableByteBuf.java having no 
linefeed at the end when it definitely does. I set up Git not to do Windows 
conversions, and I checked with various editors and binary dump utilities. One 
source says that because I'm running on Windows, checkstyle actually 
expects a CR-LF and throws an error if it doesn't find it! I've worked around 
this by disabling the check.

* The one thing that I'm stuck on now is failures on the jdbc module:
[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   JdbcToArrowDataTypesTest.testJdbcToArrowValues:146->testDataSets:209 
expected:<45935000> but was:<74735000>
[ERROR]   JdbcToArrowDataTypesTest.testJdbcToArrowValues:146->testDataSets:213 
expected:<1518439535000> but was:<1518468335000>
[ERROR]   JdbcToArrowDataTypesTest.testJdbcToArrowValues:146->testDataSets:205 
expected:<-365> but was:<-364>
[ERROR]   
JdbcToArrowNullTest.testJdbcToArrowValues:123->testDataSets:165->testAllVectorValues:209
 expected:<17574> but was:<17573>
[ERROR]   JdbcToArrowTest.testJdbcToArrowValues:138->testDataSets:206 
expected:<17574> but was:<17573>
[ERROR]   
JdbcToArrowVectorIteratorTest.testJdbcToArrowValuesNoLimit:107->validate:199->assertDateDayVectorValues:277
 expected:<17574> but was:<17573>
[ERROR]   
JdbcToArrowVectorIteratorTest.testJdbcToArrowValues:95->validate:199->assertDateDayVectorValues:277
 expected:<17574> but was:<17573>
[INFO]
[ERROR] Tests run: 93, Failures: 7, Errors: 0, Skipped: 0

I attached the full build output.
Looking more closely at these errors, they seem to be due to the timezone 
difference; for example, the difference between 74735000 (actual value) and 
45935000 (expected) is 28,800,000, or 8 hours in milliseconds, which is the PST 
timezone offset. 
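A quick sanity check of that arithmetic (plain Python; the millisecond and epoch-day values are taken from the test failures above):

```python
# The delta between the actual and expected test values is exactly
# the PST offset expressed in milliseconds.
actual_ms, expected_ms = 74_735_000, 45_935_000
offset_ms = actual_ms - expected_ms
assert offset_ms == 8 * 60 * 60 * 1000  # 28,800,000 ms == 8 hours (UTC-8)

# The same offset explains the off-by-one epoch-day failures
# (expected 17574, got 17573): midnight UTC shifted back 8 hours
# lands on the previous day.
day_ms = 24 * 60 * 60 * 1000
assert (17_574 * day_ms - offset_ms) // day_ms == 17_573
```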

I see in the pom.xml that user.timezone is set to UTC. I have seen these types 
of errors in tests before; I know there are ways to insulate the test from the 
user's current timezone but maybe someone knows what's going on.

Thanks for any input! 


On 2021/03/11 23:49:37, Bob Tinsman  wrote: 
> I've been mostly lurking for awhile, but I would like to start picking off 
> some bugs in the Java implementation.In the process of slogging through the 
> build, I've bumped into various issues. I'm happy to document them in 
> java/README.md or make any other changes that might be helpful to others. I'm 
> pretty experienced with Java and Maven, so I think these are not 
> super-obvious, but let me know if I'm missing something.A lot of these may be 
> Windows-specific. I normally prefer Linux but just got a new laptop and 
> haven't set it up, but this experience is giving me a lot of incentive to run 
> screaming back to Linux ;-)
> Environment details:- Windows 10- Java 8:openjdk version "1.8.0_282"OpenJDK 
> Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08)OpenJDK 64-Bit Server 
> VM (AdoptOpenJDK)(build 25.282-b08, mixed mode)- Cygwin environment- Maven 
> 3.6.2
> Issues encountered thus far:- Build does require Java 8, not "8 or later" as 
> stated in java/README.md    There's a reference to sun.misc.Unsafe in 
> memory/memory-core/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java 
> which of course went away in JDK 9.    I can update the build instructions.- 
> The build won't work on Windows because the java/format POM downloads a 
> binary flatc executables; when I looked, there was no version for Windows, 
> just Linux and OSX. I wound up downloading Visual Studio and building the 
> flatbuffers project.- I