Re: [Rust][DataFusion] profiling TPC-H benchmarks with flamegraphs

2022-03-21 Thread Bob Tinsman
Wow, hotspot looks very cool! However, I was only able to download v1.1
which is very slow at processing the perf.data; it took something like 30
minutes for a perf.data file of about 0.5G. I am resorting to building Qt
from source, so I can build hotspot from source...

On Mon, Mar 21, 2022 at 2:05 PM Daniël Heres  wrote:

> Hi Bob,
>
> One command I've been using successfully for some time for profiling is as
> follows (with different flags, a CSV file works just as well):
>
> perf record --call-graph=dwarf ../target/release/tpch benchmark datafusion
> --path [path] --format parquet --query 6 --iterations 1 -n 16
> And using hotspot (https://github.com/KDAB/hotspot) to load/visualize the
> perf.data file (this takes 1-30s for me depending on the size of the file).
>
> Best regards,
>
> Daniël
>
> On Mon, Mar 21, 2022 at 21:12, Bob Tinsman  wrote:
>
> > Andrew, thanks for your feedback! I started looking at IOx and pprof, and
> > I'm slowly getting a better picture of DataFusion performance work. In
> > particular, I can see that IOx is driving some of this (in particular
> [1]).
> > I'm still in sponge mode, but I can think of a few useful things to do
> > around benchmarking/profiling:
> > - Parameterizing the TPC-H benchmark for scale factor
> > - Parameterizing criterion benchmarks for data size, etc.
> > - Instrument the TPC-H benchmark with pprof
> > - Document profiling with pprof or other means (I realized that criterion
> > has pprof integration already [2])
> >
> > I'd welcome anybody's feedback.
> >
> > I have a couple questions as well:
> > - Since pprof is an "internal" profiler (i.e. you write some code to
> > integrate it), can you point me to how it's integrated in IOx?
> > - Not really on topic, but can you give some advice on building faster? I
> > am new to Rust and am not sure whether it's Rust or datafusion in
> > particular. Part of it may be the combo of debug + optimized profile
> that I
> > need for profiling; maybe it's time to upgrade my box (ryzen 1700/32G
> > ram/m.2 ssd), but sometimes you can have a monster system and it's still
> > slow [3].
> >
> > Thanks, Bob
> >
> >
> > [1] https://github.com/influxdata/influxdb_iox/issues/3994
> > [2] https://github.com/tikv/pprof-rs
> > [3] https://fasterthanli.me/articles/why-is-my-rust-build-so-slow
> >
> > On Mon, Mar 21, 2022 at 7:27 AM Andrew Lamb 
> wrote:
> >
> > > Thank you for writing up your findings
> > >
> > > If you use the `--mem-table` / `-m` command, the CSV file is read once
> > and
> > > then the query is executed subsequently
> > >
> > > As for better ways of profiling rust, we have had good luck using
> `pprof`
> > > [1] in InfluxDB IOx (which also uses DataFusion), so I have mostly
> never
> > > tried to profile the tpch benchmark program directly
> > >
> > > Making the profiling process easier / documenting it would definitely
> be
> > > useful in my opinion
> > >
> > > Andrew
> > >
> > >
> > > [1] https://crates.io/crates/pprof
> > >
> > > On Fri, Mar 18, 2022 at 6:10 PM Bob Tinsman 
> wrote:
> > >
> > > > I've been diving into DataFusion benchmarking because I'm interested
> in
> > > > understanding its performance. Here's a summary of my experience thus
> > > far.
> > > > TL;DR: I was able to do it, but it's very slow, ironically.
> > > > I'd love to hear about anyone else's experiences or recommendations
> > > > profiling DataFusion (or any other Rust projects for that matter).
> > > >
> > > > I decided to start with the TPC-H benchmarks, which have support in
> the
> > > > benchmarks directory [2], and use flamegraphs [1] to visualize CPU
> > > profile
> > > > data. Gathering and preparing the profile data can be complicated,
> but
> > > > there is a "flamegraph" cargo command [3] which conveniently wraps up
> > the
> > > > whole process.
> > > >
> > > > My steps:
> > > >
> > > > Followed the benchmark [2] instructions for generating TPC-H data
> > > > Tested the DataFusion benchmark for query 1:
> > > >
> > > > cd benchmarks
> > > >
> > > > cargo run --release --bin tpch -- benchmark datafusion --iterations 3
> > > --path ./data --format tbl --query 1 --batch-size 4096
> > > >
> > > > This took about 4 seconds per iteration on my system (Ryzen 1700
> wi

Re: [Rust][DataFusion] profiling TPC-H benchmarks with flamegraphs

2022-03-21 Thread Bob Tinsman
Andrew, thanks for your feedback! I started looking at IOx and pprof, and
I'm slowly getting a better picture of DataFusion performance work. In
particular, I can see that IOx is driving some of this (in particular [1]).
I'm still in sponge mode, but I can think of a few useful things to do
around benchmarking/profiling:
- Parameterizing the TPC-H benchmark for scale factor
- Parameterizing criterion benchmarks for data size, etc.
- Instrument the TPC-H benchmark with pprof
- Document profiling with pprof or other means (I realized that criterion
has pprof integration already [2]; a rough sketch is below)
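
For the criterion + pprof point, the kind of wiring I have in mind is roughly
the following (an untested sketch based on the pprof-rs README; it assumes
pprof is pulled in with its "criterion" and "flamegraph" features, and the
benchmark body is just a stand-in, not a real DataFusion benchmark):

use criterion::{criterion_group, criterion_main, Criterion};
use pprof::criterion::{Output, PProfProfiler};

// Placeholder workload; a real benchmark would call into DataFusion instead.
fn sum_squares(n: u64) -> u64 {
    (0..n).map(|x| x * x).sum()
}

fn bench_sum_squares(c: &mut Criterion) {
    c.bench_function("sum_squares 10k", |b| b.iter(|| sum_squares(10_000)));
}

criterion_group! {
    name = benches;
    // Sample at ~100 Hz and write a flamegraph per benchmark under
    // target/criterion/<benchmark>/profile/ when run with --profile-time.
    config = Criterion::default()
        .with_profiler(PProfProfiler::new(100, Output::Flamegraph(None)));
    targets = bench_sum_squares
}
criterion_main!(benches);

If I understand the docs right, you would then run something like
"cargo bench --bench <name> -- --profile-time 10" to get the profile output.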

I'd welcome anybody's feedback.

I have a couple questions as well:
- Since pprof is an "internal" profiler (i.e. you write some code to
integrate it), can you point me to how it's integrated in IOx? (A sketch of
the kind of pattern I mean is below, after these questions.)
- Not really on topic, but can you give some advice on building faster? I
am new to Rust and am not sure whether it's Rust or datafusion in
particular. Part of it may be the combo of debug + optimized profile that I
need for profiling; maybe it's time to upgrade my box (ryzen 1700/32G
ram/m.2 ssd), but sometimes you can have a monster system and it's still
slow [3].
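
(To make the first question concrete: by "internal" I mean the pattern from
the pprof-rs README, roughly like this untested sketch -- the "flamegraph"
feature is assumed, and the workload is a placeholder:

use std::fs::File;

fn main() {
    // Start sampling this process at ~100 Hz; profiling stops when the guard drops.
    let guard = pprof::ProfilerGuard::new(100).unwrap();

    // ... run the code to be profiled here ...

    // Build a report and write out a flamegraph (needs pprof's "flamegraph" feature).
    if let Ok(report) = guard.report().build() {
        let file = File::create("flamegraph.svg").unwrap();
        report.flamegraph(file).unwrap();
    }
}

So the question is really where IOx puts this kind of guard/report code.)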

Thanks, Bob


[1] https://github.com/influxdata/influxdb_iox/issues/3994
[2] https://github.com/tikv/pprof-rs
[3] https://fasterthanli.me/articles/why-is-my-rust-build-so-slow

On Mon, Mar 21, 2022 at 7:27 AM Andrew Lamb  wrote:

> Thank you for writing up your findings
>
> If you use the `--mem-table` / `-m` command, the CSV file is read once and
> then the query is executed subsequently
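>
> For example, roughly (the same flags as the command further down, with the
> in-memory flag added; worth double-checking the exact flag name against
> --help):
>
> cargo run --release --bin tpch -- benchmark datafusion --iterations 3
> --path ./data --format tbl --query 1 --batch-size 4096 --mem-table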
>
> As for better ways of profiling rust, we have had good luck using `pprof`
> [1] in InfluxDB IOx (which also uses DataFusion), so I have mostly never
> tried to profile the tpch benchmark program directly
>
> Making the profiling process easier / documenting it would definitely be
> useful in my opinion
>
> Andrew
>
>
> [1] https://crates.io/crates/pprof
>
> On Fri, Mar 18, 2022 at 6:10 PM Bob Tinsman  wrote:
>
> > I've been diving into DataFusion benchmarking because I'm interested in
> > understanding its performance. Here's a summary of my experience thus
> far.
> > TL;DR: I was able to do it, but it's very slow, ironically.
> > I'd love to hear about anyone else's experiences or recommendations
> > profiling DataFusion (or any other Rust projects for that matter).
> >
> > I decided to start with the TPC-H benchmarks, which have support in the
> > benchmarks directory [2], and use flamegraphs [1] to visualize CPU
> profile
> > data. Gathering and preparing the profile data can be complicated, but
> > there is a "flamegraph" cargo command [3] which conveniently wraps up the
> > whole process.
> >
> > My steps:
> >
> > Followed the benchmark [2] instructions for generating TPC-H data
> > Tested the DataFusion benchmark for query 1:
> >
> > cd benchmarks
> >
> > cargo run --release --bin tpch -- benchmark datafusion --iterations 3
> --path ./data --format tbl --query 1 --batch-size 4096
> >
> > This took about 4 seconds per iteration on my system (Ryzen 1700 with a
> > pretty fast SSD).
> >
> > The flamegraph command uses release profile by default but you will need
> > symbols, so add "debug = 1" under "[profile.release]" in the top level
> > Cargo.toml.
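> >
> > For reference, that stanza ends up looking like this (nothing else under
> > it needs to change):
> >
> > [profile.release]
> > debug = 1
> >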
> > I also did top level "cargo clean" to make sure I had symbols for
> > everything.
> >
> > To use flamegraph, just substitute "flamegraph" for "run" in the original
> > command:
> >
> > cargo flamegraph --release --bin tpch -- benchmark datafusion
> --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
> >
> > I got the following output:
> > Finished release [optimized + debuginfo] target(s) in 0.13s
> > ...omitting various gripes about kernel symbols
> > Running benchmarks with the following options: DataFusionBenchmarkOpt {
> > query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096,
> > path: "./data", file_format: "tbl", mem_table: false, output_path: None }
> > Query 1 iteration 0 took 4106.1 ms and returned 4 rows
> > Query 1 iteration 1 took 4025.6 ms and returned 4 rows
> > Query 1 iteration 2 took 4048.3 ms and returned 4 rows
> > Query 1 avg time: 4060.00 ms
> > [ perf record: Woken up 591 times to write data ]
> > [ perf record: Captured and wrote 149.619 MB perf.data (18567 samples) ]
> >
> > And then I waited a long time; I think I gave it up to 45 minutes.
> What
> > was it doing? It looks like the flamegraph command was calling perf (the
> > profiling command) which was then calling addr2line over and over:
> > bo

Re: [DISCUSS][Rust] Performance Measurements (was Biweekly sync call for arrow/datafusion again?)

2022-03-14 Thread Bob Tinsman
Thanks for pulling this out from the long thread...

On Sat, Mar 12, 2022 at 3:06 AM Andrew Lamb  wrote:

> Hi Bob,
>
> > - By "pipeline-breaking" I assume you mean "very slow", but can you give
> me
> details? Does this arise from some particular observation, or other
> reported issues?
>
> In general pipeline breaking means that the output of the operator can't be
> produced until it has seen *ALL* its input.
>
> For example, a sort (ORDER BY x) is a pipeline breaker because the engine
> has to see the entire input prior to being able to produce any output.
>
> However, a filter (WHERE x > 500) is not a pipeline breaker because the
> operator can produce output rows as soon as it sees any that pass the
> filter criteria.
>

Aha, I get it--so the goal is not necessarily to speed up the whole thing
but to be able to send output to the next processing stage sooner.
So IIRC besides sorts, the other types of queries mentioned were joins,
group by, and hash aggregates?

>
> > - In general, what tools are you using to analyze datafusion performance?
>
> The tools used most commonly are in the benchmark directory [1]. There is
> some other work
>
> >  - How much profiling have you done to identify bottlenecks?
>
> I would say it is done on an "as needed basis" -- namely someone runs a
> query that is important to them and then improves whatever hotspot they may
> find.
>
> However, we don't have regular runs of the same queries or automatically
> gather data over time. dianaclarke added integration for conbench in [2]
> that I think would allow for such data collection, but no one has hooked up
> the benchmarks to it yet.
>
> Getting regular runs of the performance benchmark up and running would be
> very valuable indeed, if you were looking to help.
>
Yes, I'm definitely looking to help, and maybe getting more perf
benchmarks up would be a good way of starting.
I noticed that matthewmturner was working on something to run benchmarks in
docker, which is pretty nice! [3]
Any suggestions for performance use cases would be welcome; I could add
them in.
One thing I like to do is to run the same benchmark and tweak the knobs,
such as number of rows, cardinality, etc. because the effects can vary A
LOT.
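
For instance, with the TPC-H runner in the benchmarks directory [1], I would
run the same query with only the batch size changed, something along these
lines (flag names from memory, so double-check against --help):

cargo run --release --bin tpch -- benchmark datafusion --iterations 3
--path ./data --format tbl --query 1 --batch-size 1024

cargo run --release --bin tpch -- benchmark datafusion --iterations 3
--path ./data --format tbl --query 1 --batch-size 16384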

I am tempted to venture opinions on how to do things based on my experience
building my own (closed-source) columnar query engine, but that one is an
entirely different beast, so I am not qualified to opine until I learn more.
I'm starting to follow history about various performance improvements, but
if anyone has any suggestion, like "I wish datafusion could complete X
query on 50 bazillion rows in less than 3 days", let me know. In
performance, there are so many variables that it's hard to know where to
start.

Thanks, Bob


>
> [1] https://github.com/apache/arrow-datafusion/tree/master/benchmarks
> [2] https://github.com/apache/arrow-datafusion/pull/1791
>
> [3] https://github.com/apache/arrow-datafusion/pull/1928



On Fri, Mar 11, 2022 at 4:56 PM Bob Tinsman  wrote:
>
> > I just missed the call, but I watched the recording (thank you to Andrew
> > for posting [1]). Really interesting!
> > I'm diving into Arrow because I have some previous experience with
> > in-memory query engines. I'm following discussions around improving
> > performance and adding features so I can determine how best to
> contribute.
> >
> > In particular, I was interested in some of the background for the JIT
> > implementation [2] and the row format [3] but I guess I'm missing
> context.
> > I saw the comment in #1708 [4] that "many pipeline-breaking operators are
> > inherently row-based".
> > My questions:
> > - By "pipeline-breaking" I assume you mean "very slow", but can you give
> me
> > details? Does this arise from some particular observation, or other
> > reported issues?
> >   - An example would be nice, like "select a, b, c from blah order by d"
> > with table "blah" having 1 million rows and 10 columns takes 5 minutes,
> or
> > even anecdotal evidence like mailing list discussions
> > - In general, what tools are you using to analyze datafusion performance?
> >   - The criterion benchmarks are nice but do you have anything
> higher-level
> > which exercises a broad range of workloads?
> >   - How much profiling have you done to identify bottlenecks?
> >
> > To be honest, I was kind of surprised to see using a row format to solve
> a
> > performance problem, but I figured you must have good reasons, and I'm
> > still getting my brain around datafusion's query execution model. Thanks
> > for any illumination!
> >

Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

2022-03-11 Thread Bob Tinsman
I just missed the call, but I watched the recording (thank you to Andrew
for posting [1]). Really interesting!
I'm diving into Arrow because I have some previous experience with
in-memory query engines. I'm following discussions around improving
performance and adding features so I can determine how best to contribute.

In particular, I was interested in some of the background for the JIT
implementation [2] and the row format [3] but I guess I'm missing context.
I saw the comment in #1708 [4] that "many pipeline-breaking operators are
inherently row-based".
My questions:
- By "pipeline-breaking" I assume you mean "very slow", but can you give me
details? Does this arise from some particular observation, or other
reported issues?
  - An example would be nice, like "select a, b, c from blah order by d"
with table "blah" having 1 million rows and 10 columns takes 5 minutes, or
even anecdotal evidence like mailing list discussions
- In general, what tools are you using to analyze datafusion performance?
  - The criterion benchmarks are nice but do you have anything higher-level
which exercises a broad range of workloads?
  - How much profiling have you done to identify bottlenecks?

To be honest, I was kind of surprised to see using a row format to solve a
performance problem, but I figured you must have good reasons, and I'm
still getting my brain around datafusion's query execution model. Thanks
for any illumination!

[1] https://youtu.be/5NJcqXm6uE0
[2] https://github.com/apache/arrow-datafusion/pull/1849
[3] https://github.com/apache/arrow-datafusion/pull/1782
[4] https://github.com/apache/arrow-datafusion/issues/1708

On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb  wrote:

> I am not sure if everyone saw it in the agenda[1], but we plan to have a
> meeting tomorrow. I'll plan to record it for anyone who can not make this
> time.
>
> 15:00 UTC Wednesday March 9, 2022
> Meeting Location: (in agenda)
> Matthew Turner:  focused on JIT and row representation, next Wednesday,
> March 9th,
> @yijie: JIT  overview
>
> [1]
>
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
>
> On Thu, Mar 3, 2022 at 12:50 AM Benson Muite 
> wrote:
>
> > Interested in learning more about this. Can work through the code and
> > discuss on 17 March either 4:00 or 16:00 UTC.
> >
> > Benson
> >
> > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > > through of the JIT code. I would be interested in this as well -- would
> > > anyone plan to be on the call and discuss it?
> > >
> > > I don't think I have time to prepare that content prior
> > >
> > > Andrew
> > >
> > > [1]
> > >
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> > >
> >
>


Re: [C++] Adopting a library for (distributed) tracing

2021-05-01 Thread Bob Tinsman
I agree that OpenTelemetry is the future; I have been following the 
observability space off and on and I knew about OpenTracing; I just realized 
that OpenTelemetry is its successor. [1]
I have found tracing to be a very powerful approach; at one point, I did a POC 
of a trace recorder inside a Java webapp, which shed light on some nasty 
bottlenecks. If integrated properly, it can be left on all the time, so it's 
valuable for doing root-cause analysis in production. At least in Java, there 
are already a lot of packages with OpenTelemetry hooks built in. [2]
I'm not sure what the overhead is when disabled--I think it is probably minimal 
or else it wouldn't be used so widely. But if we're not ready to jump right in, 
we could introduce our own @WithSpan annotation which by default is a no-op. To 
build an instrumented Arrow lib, you'd hook it up with a shim. Or you could 
just maintain a branch with instrumentation for people to try it out.

[1] https://lightstep.com/blog/brief-history-of-opentelemetry/
[2] 
https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md

On 2021/04/30 22:18:46, Evan Chan  wrote: 
> Dear David,
> 
> OpenTelemetry tracing is definitely the future, I guess the question is how 
> far down the stack we want to put it.   I think it would be useful for flight 
> and other higher level modules, and for DataFusion for example it would be 
> really useful.  
> As for being alpha, I don’t think it will stay that way very long, there is a 
> ton of industry momentum behind OpenTelemetry.
> 
> -Evan
> 
> > On Apr 29, 2021, at 1:21 PM, David Li  wrote:
> > 
> > Hello,
> > 
> > For Arrow Datasets, I've been working to instrument the scanner to find
> > bottlenecks. For example, here's a demo comparing the current async
> > scanner, which doesn't truly read asynchronously, to one that does; it
> > should be fairly evident where the bottleneck is:
> > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > 
> > I'd like to upstream this, but I'd like to run some questions by
> > everyone first:
> > - Does this look useful to developers working on other sub-projects?
> > - This uses OpenTelemetry[1], which is still in alpha, so are we
> >  comfortable with adopting it? Is the overhead acceptable?
> > - Is there anyone using Arrow to build services, that would find more
> >  general integration useful?
> > 
> > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > for operations like reading a single record batch. The data is saved as
> > JSON, then rendered by some JavaScript. The branch is at [2].
> > 
> > As a quick summary, OpenTelemetry implements distributed tracing, in
> > which a request is tracked as a directed acyclic graph of spans. A span
> > is just metadata (name, ID, start/end time, parent span, ...) about an
> > operation (function call, network request, ...). Typically, it's used in
> > services. Spans can reference each other across machines, so you can
> > track a request across multiple services (e.g. finding which service
> > failed/is unusually slow in a chain of services that call each other).
> > 
> > As opposed to a (sampling) profiler, this gives you application-level
> > metadata, like filenames or S3 download rates, that you can use in
> > analysis (as in the demo). It's also something you'd always keep turned
> > on (at least when running a service). If integrated with Flight,
> > OpenTelemetry would also give us a performance picture across multiple
> > machines - speculatively, something like making a request to a Flight
> > service and being able to trace all the requests it makes to S3.
> > 
> > It does have some overhead; you wouldn't annotate every function in a
> > codebase. This is rather anecdotal, but for the demo above, there was
> > essentially zero impact on runtime. Of course, that demo records very
> > little data overall, so it's not very representative.
> > 
> > Alternatives:
> > - Add a simple Span class of our own, and defer Flight until later.
> > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> >  enabled at build time. This would be messier but should alleviate any
> >  performance questions.
> > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> >  caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> >  multi-machine use case, but would otherwise work. I haven't looked
> >  into these much, but could evaluate them, especially if they seem more
> >  fit for purpose for use in other Arrow subprojects.
> > 
> > If people aren't super enthused, I'll most likely go with adding a
> > custom Span class for Datasets, and defer the question of whether we
> > should integrate Flight/Datasets with OpenTelemetry until another use
> > case arises. But recently we have seen interest in this - so I see this
> > as perhaps a chance to take care of two problems 

Re: [JS] Exploring usage of apache arrow at my company for complex table rendering

2021-04-19 Thread Bob Tinsman
I know a lot about the general subject of grouping, although this was in a 
closed-source framework I created which also uses columnar in-memory data; it 
was specific to Java but at its heart it has a lot in common with Arrow.
This framework was able to render charts, multi-level tables, and crosstabs 
with multiple groups on each axis. Ideally, this would be done with three 
different query models:

- Flat query: returns a tabular result set, but can optionally aggregate, like 
SQL group by
- Multi-level query: has multiple groups, and aggregate calculations at each 
group level, including a top level (grand total)
- Multi-axis query: Has two axes (row and column) with a list of groups on each 
axis

To this basic query model we added more features, so that it was getting more 
and more OLAP-like.
I said "ideally" this is what it looks like, but for the initial version the 
rendering metadata and query model were coupled too much. I was trying to drive 
a clear definition of this query model but it got bogged down in politics.

(The following is IMHO, please correct anything I get wrong)
The main thing that Arrow defines in common for all the implementations is the 
memory layout; there's a lot of calculation infrastructure but it differs 
between implementations. The C++ model is probably the richest, because it's 
used by Python/pandas, but Rust has DataFusion and now Ballista.

From experience, I can tell you that it's really important to keep query and
rendering separate--but that's tricky because the rendering refers to the
fields in the query and you need to keep them in sync. It's tempting to couple
them.

On 2021/03/03 18:59:54, Michael Lavina  wrote: 
> Hey Weston,
> 
> Do you have any public code examples I could take a look at? This does sound 
> very related to what I am doing.
> 
> One particular question I have related to grouping is how you define 
> row-grouping. Column grouping is fairly simple I think you can just define a 
> Struct that tells you how columns of data is grouped, but how would you go 
> about grouping rows of data for example
> 
> User Table
> 
> First Name | Last Name | Country | State | City | Occupation
> 
> // some data
> 
> I have thought of basically two ways to do this. Send some metadata array 
> i.e. groupBy that denotes how data should be grouped by and it’s a simple 
> algorithm maybe something like [country, state, city]. But then you would 
> need to store some mapping of a given rowIndex returns some rows of children 
> based of that algorithm. And I think this would require all the data to be 
> available to do the grouping.
> 
> The other way is defining the structure of the data maybe something like 
> (this could be entirely wrong I am new to Arrow sorry)
> list list>>>
> 
> but basically the idea would be if you were to retrieve the data for a given 
> index of let’s say a state it would return all the cities and vectors of data 
> related to that given state.
> 
> I also don’t know also if this is a limitations of my understanding of Arrow 
> or the ArrowJs SDK library and this might be something very easy I am just 
> not seeing it.
> 
> -Michael
> From: Weston Pace 
> Date: Friday, February 26, 2021 at 9:34 PM
> To: dev@arrow.apache.org 
> Cc: Michael Lavina 
> Subject: Re: [JS] Exploring usage of apache arrow at my company for complex 
> table rendering
> I used Arrow for this purpose in the past.  I don't have much to add
> but just a few thoughts off the top of my head...
> 
> * The line between data and metadata can be blurry - For most
> measurements we were able to store the "expected distribution" as
> metadata (e.g. this measurement should have an expected value of 10
> +/- 3) and that could be used for drawing limit lines.  For some
> measurements however the common practice in place was to store the
> upper/lower limit as separate columns because they often changed
> depending on the various independent variables.  In that case the same
> "concept" (limit) might be stored in data or metadata.
> 
> * Distinction between "data" and a "chart" - For us, we introduced a
> separate representation called the "chart" between the data and the
> rendering layer.  So using that limit line example before if we wanted
> to plot a histogram of some column then we would create a bar chart
> from the column.  This bar chart itself was also an array of numbers
> but, since these arrays were much smaller (one per bin, hard limit to
> bin count in the thousands based on # of pixels in display), and the
> structure was much more deeply nested, we ended up just using JSON for
> charts.  The "limit" metadata belonged to the data and it was
> translated into a vertical line element as part of the chart.
> 
> * Processing layer - For us it was too expensive to send the data
> across the Internet for display.  So the conversion from data -> chart
> happened with the datacenter close to the actual data.  The JS UI was
> simply responsible for chart -> 

Re: [Java] Source control of generated flatbuffers code

2021-04-15 Thread Bob Tinsman
OK, I just approved those changes. I was working on a shell script to automate 
it--nice to have, but not necessary. Better that you can get it into 4.0. 
Thanks!

On 2021/04/15 17:33:20, Micah Kornfield  wrote: 
> I took a look and added comments.   I'm not sure if Bob replied off-list,
> so hopefully no work was duplicated.
> 
> Lets try to be mindful that the project is asynchronous in nature and it
> might take a little time to reply.
> 
> Cheers,
> Micah
> 
> On Thu, Apr 15, 2021 at 10:00 AM Nate Bauernfeind <
> natebauernfe...@deephaven.io> wrote:
> 
> > > I think checking in the java files is fine and probably better than
> > > relying
> > > on a third party package.  We should make sure there are instructions on
> > > how to regenerate them along with the PR
> >
> > Micah,
> >
> > I just opened a pull-request to satisfy ARROW-12111. This is my first
> > contribution to an apache project; please let me know if there is anything
> > else that I need to do to get this past the finish line.
> >
> > https://github.com/apache/arrow/pull/10058
> >
> > Thanks,
> > Nate
> >
> > On Wed, Apr 14, 2021 at 11:45 PM Nate Bauernfeind <
> > natebauernfe...@deephaven.io> wrote:
> >
> > > Hey Bob,
> > >
> > > Someone did publish a 1.12 version of the flatc maven plugin. I double
> > > checked that the plugin and binaries look correct and legit.. but you
> > know,
> > > it's always shady to download some random executable from the internet
> > and
> > > run it. However, I have been using it to generate the arrow flatbuffer
> > > files because I _really_ wanted some features that are in flatc 1.12's
> > > runtime jar (there are performance improvements for array types in
> > > particular).
> > >
> > > You can see them here:
> > > https://search.maven.org/search?q=com.github.shinanca
> > > The repository fork is here: https://github.com/shinanca/flatc
> > >
> > > On the bright side that developer appears to have published an x86_64
> > > windows binary which might satisfy one of your earlier complaints in the
> > > thread.
> > >
> > > On the other hand, if everyone is comfortable checking in the flatc
> > > generated files (obviously with the additional documentation on how to
> > > re-generate should the fbs files change), it's a relatively small change
> > to
> > > replace the existing apache/arrow/java/format source. Based on the
> > previous
> > > discussion on this thread, it seems that the arrow dev team _is_
> > > comfortable with the check-in-the-generated-files approach.
> > >
> > > Although 4.0 is near the release phase, there are still a few blocking
> > > issues that people are trying to fix (according to the arrow-sync call
> > > earlier today). I don't mind jumping in and doing this; it appears that
> > > there might be enough time for such a small change to make it into the
> > > release if the work is performed and merged ASAP.
> > >
> > > I guess, I'm either looking for the "pull request is on the way" or the
> > > "thumbs up - we definitely want this; I'll get the code review for you
> > when
> > > it's ready" style reply =D.
> > >
> > > On Wed, Apr 14, 2021 at 10:43 PM Bob Tinsman  wrote:
> > >
> > >> I apologize for leaving this hanging, but it looks like 4.0 is leaving
> > >> the station :(
> > >> Yes, it makes sense to bump it to 1.12, but you can't do that in
> > >> isolation, because the flatc binary which is fetched as a Maven
> > dependency
> > >> is only available for 1.9. I will get back onto this and finish it this
> > >> week.
> > >>
> > >> FWIW, I was looking around and catalogued the various ways of generating
> > >> flatbuffers for all the languages--you can look at it in my branch:
> > >> https://github.com/bobtins/arrow/tree/check-in-gen-code/java/dev
> > >> Let me know if any info is wrong or missing.
> > >> The methods of generation are all over the map, and some have no script
> > >> or build file, just doc. Would there be any value in making this more
> > >> uniform?
> > >>
> > >> On 2021/04/14 16:36:47, Nate Bauernfeind 
> > >> wrote:
> > >> > It would also be nice to upgrade that java flatbuffer version from 1.9
> > >> to
> > >> > 1.12. Is anyone planning on doing this work (as listed

Re: [Java] Source control of generated flatbuffers code

2021-04-14 Thread Bob Tinsman
I apologize for leaving this hanging, but it looks like 4.0 is leaving the 
station :(
Yes, it makes sense to bump it to 1.12, but you can't do that in isolation, 
because the flatc binary which is fetched as a Maven dependency is only 
available for 1.9. I will get back onto this and finish it this week.

FWIW, I was looking around and catalogued the various ways of generating 
flatbuffers for all the languages--you can look at it in my branch: 
https://github.com/bobtins/arrow/tree/check-in-gen-code/java/dev
Let me know if any info is wrong or missing.
The methods of generation are all over the map, and some have no script or 
build file, just doc. Would there be any value in making this more uniform?
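
(For Java specifically, the manual regeneration step I have in mind is roughly:
run flatc from flatbuffers 1.12 against the .fbs files in the top-level format/
directory, e.g. something like "flatc --java -o <output dir> format/*.fbs",
with <output dir> being wherever the generated org.apache.arrow.flatbuf sources
live. The exact paths are catalogued in the branch above, so treat this as a
sketch, not the final instructions.)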

On 2021/04/14 16:36:47, Nate Bauernfeind  wrote: 
> It would also be nice to upgrade that java flatbuffer version from 1.9 to
> 1.12. Is anyone planning on doing this work (as listed in ARROW-12111)?
> 
> If I did this work today, might it be possible to get it included in the
> 4.0.0 release?
> 
> On Fri, Mar 26, 2021 at 3:25 PM bobtins  wrote:
> 
> > OK, originally this was part of
> > https://issues.apache.org/jira/browse/ARROW-12006 and I was going to just
> > add some doc on flatc, but I will make this a new bug because it's a little
> > bigger: https://issues.apache.org/jira/browse/ARROW-12111
> >
> > On 2021/03/23 23:40:50, Micah Kornfield  wrote:
> > > >
> > > > I have a concern, though. Four other languages (Java would be five)
> > check
> > > > in the generated flatbuffers code, and it appears (based on a quick
> > scan of
> > > > Git logs) that this is done manually. Is there a danger that the binary
> > > > format could change, but some language might get forgotten, and thus be
> > > > working with the old format?
> > >
> > > The format changes relatively slowly and any changes at this point should
> > > be backwards compatible.
> > >
> > >
> > >
> > > > Or is there enough interop testing that the problem would get caught
> > right
> > > > away?
> > >
> > > In most cases I would expect integration tests to catch these types of
> > > error.
> > >
> > > On Tue, Mar 23, 2021 at 4:26 PM bobtins  wrote:
> > >
> > > > I'm happy to check in the generated Java source. I would also update
> > the
> > > > Java build info to reflect this change and document how to regenerate
> > the
> > > > source as needed.
> > > >
> > > > I have a concern, though. Four other languages (Java would be five)
> > check
> > > > in the generated flatbuffers code, and it appears (based on a quick
> > scan of
> > > > Git logs) that this is done manually. Is there a danger that the binary
> > > > format could change, but some language might get forgotten, and thus be
> > > > working with the old format? Or is there enough interop testing that
> > the
> > > > problem would get caught right away?
> > > >
> > > > I'm new to the project and don't know how big of an issue this is in
> > > > practice. Thanks for any enlightenment.
> > > >
> > > > On 2021/03/23 07:39:16, Micah Kornfield  wrote:
> > > > > I think checking in the java files is fine and probably better than
> > > > > relying
> > > > > on a third party package.  We should make sure there are
> > instructions on
> > > > > how to regenerate them along with the PR
> > > > >
> > > > > On Monday, March 22, 2021, Antoine Pitrou 
> > wrote:
> > > > >
> > > > > >
> > > > > > On 22/03/2021 at 20:17, bobtins wrote:
> > > > > >
> > > > > >> TL;DR: The Java implementation doesn't have generated flatbuffers
> > code
> > > > > >> under source control, and the code generation depends on an
> > > > > >> unofficially-maintained Maven artifact. Other language
> > > > implementations do
> > > > > >> check in the generated code; would it make sense for this to be
> > done
> > > > for
> > > > > >> Java as well?
> > > > > >>
> > > > > >> I'm currently focusing on Java development; I started building on
> > > > Windows
> > > > > >> and got a failure under java/format, because I couldn't download
> > the
> > > > > >> flatbuffers compiler (flatc) to generate Java source.
> > > > > >> The artifact for the flatc binary is provided "unofficially" (not
> > by
> > > > the
> > > > > >> flatbuffers project), and there was no Windows version, so I had
> > to
> > > > jump
> > > > > >> through hoops to build it and proceed.
> > > > > >>
> > > > > >
> > > > > > While this does not answer the more general question of checking
> > in the
> > > > > > generated Flatbuffers code (which sounds like a good idea, but I'm
> > not
> > > > a
> > > > > > Java developer), note that you could workaround this by installing
> > the
> > > > > > Conda-provided flatbuffers package:
> > > > > >
> > > > > >   $ conda install flatbuffers
> > > > > >
> > > > > > which should get you the `flatc` compiler, even on Windows.
> > > > > > (see https://docs.conda.io/projects/conda/en/latest/ for
> > installing
> > > > conda)
> > > > > >
> > > > > > You may also try other package managers such as Chocolatey:
> > > > > >
> > > > > >   

[JIRA] Request contributor role

2021-03-19 Thread Bob Tinsman
I've logged a couple bugs and would like to assign myself. My id is bobtinsman 
on JIRA; here is one of the bugs I logged:
[ARROW-12006] updates to make dev on Java and Windows easier - ASF JIRA

I tried creating a new email on the archive page (Pony Mail), but it didn't
seem to work.



[JAVA] [n00b] issues encountered during build

2021-03-11 Thread Bob Tinsman
I've been mostly lurking for awhile, but I would like to start picking off some
bugs in the Java implementation. In the process of slogging through the build,
I've bumped into various issues. I'm happy to document them in java/README.md
or make any other changes that might be helpful to others. I'm pretty
experienced with Java and Maven, so I think these are not super-obvious, but
let me know if I'm missing something. A lot of these may be Windows-specific. I
normally prefer Linux but just got a new laptop and haven't set it up, but this
experience is giving me a lot of incentive to run screaming back to Linux ;-)

Environment details:
- Windows 10
- Java 8: openjdk version "1.8.0_282", OpenJDK Runtime Environment
(AdoptOpenJDK) (build 1.8.0_282-b08), OpenJDK 64-Bit Server VM (AdoptOpenJDK)
(build 25.282-b08, mixed mode)
- Cygwin environment
- Maven 3.6.2
Issues encountered thus far:
- Build does require Java 8, not "8 or later" as stated in java/README.md.
There's a reference to sun.misc.Unsafe in
memory/memory-core/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java
which of course went away in JDK 9. I can update the build instructions.
- The build won't work on Windows because the java/format POM downloads a
binary flatc executable; when I looked, there was no version for Windows, just
Linux and OSX. I wound up downloading Visual Studio and building the
flatbuffers project.
- I bumped into what looks like a spurious checkstyle error: it reports
memory/src/test/java/io/netty/buffer/TestExpandableByteBuf.java having no
linefeed at the end when it definitely does. I set up Git not to do Windows
conversions, and I checked with various editors and binary dump utilities. One
source says that because I'm running on Windows, checkstyle actually expects a
CR-LF and throws an error if it doesn't find it! I've worked around this by
disabling the check.
- The one thing that I'm stuck on now is failures on jdbc:
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   JdbcToArrowDataTypesTest.testJdbcToArrowValues:146->testDataSets:209 expected:<45935000> but was:<74735000>
[ERROR]   JdbcToArrowDataTypesTest.testJdbcToArrowValues:146->testDataSets:213 expected:<1518439535000> but was:<1518468335000>
[ERROR]   JdbcToArrowDataTypesTest.testJdbcToArrowValues:146->testDataSets:205 expected:<-365> but was:<-364>
[ERROR]   JdbcToArrowNullTest.testJdbcToArrowValues:123->testDataSets:165->testAllVectorValues:209 expected:<17574> but was:<17573>
[ERROR]   JdbcToArrowTest.testJdbcToArrowValues:138->testDataSets:206 expected:<17574> but was:<17573>
[ERROR]   JdbcToArrowVectorIteratorTest.testJdbcToArrowValuesNoLimit:107->validate:199->assertDateDayVectorValues:277 expected:<17574> but was:<17573>
[ERROR]   JdbcToArrowVectorIteratorTest.testJdbcToArrowValues:95->validate:199->assertDateDayVectorValues:277 expected:<17574> but was:<17573>
[INFO]
[ERROR] Tests run: 93, Failures: 7, Errors: 0, Skipped: 0
I attached the full build output. Looking more closely at these errors, they
seem to be due to the timezone difference; for example, the difference between
74735000 (actual value) and 45935000 (expected) is 28800000, or 8 hours in
milliseconds, which is the PST timezone offset. I see in the pom.xml that
user.timezone is set to UTC. I have seen these types of errors in tests before;
I know there are ways to insulate the test from the user's current timezone but
maybe someone knows what's going on.
Thanks for any input! 
[INFO] Scanning for projects...
[INFO] 

[INFO] Detecting the operating system and CPU architecture
[INFO] 

[INFO] os.detected.name: windows
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.0
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 0
[INFO] os.detected.classifier: windows-x86_64
[INFO] 

[INFO] Reactor Build Order:
[INFO] 
[INFO] Arrow JDBC Adapter 
[jar]
[INFO] Arrow Plasma Client
[jar]
[INFO] Arrow Flight Core  
[jar]
[INFO] Arrow Flight GRPC  
[jar]
[INFO] Arrow AVRO Adapter 
[jar]
[INFO] Arrow Algorithms   
[jar]
[INFO] Arrow Performance Benchmarks   
[jar]
[INFO] 
[INFO] < 
org.apache.arrow:arrow-jdbc >-
[INFO] 

Hello to the Arrow dev community

2020-09-22 Thread Bob Tinsman
I'd like to introduce myself, because I've had an interest in Arrow for a long
time and now I have a chance to help out. Up until now, I haven't really
contributed much in open source, although I've been an avid consumer, so I'd
like to change that!
My main areas of work have been performance optimization, Java, databases
(mostly relational), and optimizing/refactoring architecture, but I also have
some C/C++ background, and I'm a quick learner of new languages.

The reason that I'm so interested in Arrow is that I've already created two
in-memory columnar dataset implementations for two different companies, so I'm
a believer in the power of this model, although I came to it from a different
perspective. I was just watching this discussion with Wes and Jacques,
"Starting Apache Arrow" (a fireside chat between Jacques Nadeau and Wes
McKinney, discussing the past, present, and future of the project).

Wes lays out two phases of Arrow:
- Phase one: Arrow used as a common format
- Phase two: Arrow used for actual calculation
Because I was working on my own, I skipped to phase two.

I worked for an online marketing survey company called MarketTools in the early
00's. Survey results were stored in SQL Server, and we had to implement
crosstabs on the data; for example, if you wanted to see answers to survey
questions broken down by age, gender, income range, etc.
The original implementation would generate some pretty hairy SQL, which got
pretty slow if there were a lot of questions on the crosstab. I thought "why are
we asking the DB to run multiple queries on the same data when we could pull it
into memory once, then do aggregate calculations there?" That managed to produce
a 5x speedup in running the crosstabs.
In my most recent company, I created a new in-memory dataset implementation as
the basis for an interactive data analysis tool. Again I was working with mostly
relational databases. I was able to push the scalability of the in-memory
columns a lot more using dictionaries. I also developed a hybrid engine
combining SQL generation and in-memory calculation, sort of like what Spark is
doing. If I had known about Arrow, I would definitely have used it, but it
wasn't around yet. You guys have accomplished a lot--congrats on your 1.0.0
release, by the way! I'm starting out by grokking all the source and doc, and
looking at JIRA issues that I could potentially work on, but I'm looking
forward to helping out however I can.