Re: [DISCUSS][JAVA] Designs & goals for readers/writers
Yes I think text files are OK but I want to make sure that committers are reviewing patches for binary files because there have been a number of incidents in the past where I had to roll back patches to remove such files. On Tue, Jul 23, 2019, 10:37 AM Micah Kornfield wrote: > Hi Wes, > I haven't checked locally but that file at least for me renders as text > file in GitHub (with an Apache header). If we want all test data in the > testing package I can make sure to move it but I thought text files might > be ok in the main repo? > > Thanks, > Micah > > On Tuesday, July 23, 2019, Wes McKinney wrote: > >> I noticed that test data-related files are beginning to be checked in >> >> >> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/resources/schema/test.avsc >> >> I wanted to make sure this doesn't turn into a slippery slope where we >> end up with several megabytes or more of test data files >> >> On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield >> wrote: >> > >> > Hi Wes, >> > Are there currently files that need to be moved? >> > >> > Thanks, >> > Micah >> > >> > On Monday, July 22, 2019, Wes McKinney wrote: >> >> >> >> Sort of tangentially related, but while we are on the topic: >> >> >> >> Please, if you would, avoid checking binary test data files into the >> >> main repository. Use https://github.com/apache/arrow-testing if you >> >> truly need to check in binary data -- something to look out for in >> >> code reviews >> >> >> >> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield < >> emkornfi...@gmail.com> wrote: >> >> > >> >> > Hi Jacques, >> >> > Thanks for the clarifications. I think the distinction is useful. >> >> > >> >> > If people want to write adapters for Arrow, I see that as useful but >> very >> >> > > different than writing native implementations and we should try to >> create a >> >> > > clear delineation between the two. 
>> >> > >> >> > >> >> > What do you think about creating a "contrib" directory and moving >> the JDBC >> >> > and AVRO adapters into it? We should also probably provide more >> description >> >> > in pom.xml to make it clear for downstream consumers. >> >> > >> >> > We should probably come up with a name other than adapters for >> >> > readers/writer ("converters"?) and use it in the directory structure >> for >> >> > the existing Orc implementation? >> >> > >> >> > Thanks, >> >> > Micah >> >> > >> >> > >> >> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau >> wrote: >> >> > >> >> > > As I read through your responses, I think it might be useful to >> talk about >> >> > > adapters versus native Arrow readers/writers. Adapters are >> something that >> >> > > adapt an existing API to produce and/or consume Arrow data. A >> native >> >> > > reader/writer is something that understand the format directly and >> does not >> >> > > have intermediate representations or APIs the data moves through >> beyond >> >> > > those that needs to be used to complete work. >> >> > > >> >> > > If people want to write adapters for Arrow, I see that as useful >> but very >> >> > > different than writing native implementations and we should try to >> create a >> >> > > clear delineation between the two. >> >> > > >> >> > > Further comments inline. >> >> > > >> >> > > >> >> > >> Could you expand on what level of detail you would like to see a >> design >> >> > >> document? >> >> > >> >> >> > > >> >> > > A couple paragraphs seems sufficient. This is the goals of the >> >> > > implementation. We target existing functionality X. It is an >> adapter. Or it >> >> > > is a native impl. This is the expected memory and processing >> >> > > characteristics, etc. I've never been one for huge amount of >> design but >> >> > > I've seen a number of recent patches appear where this is no >> upfront >> >> > > discussion. 
Making sure that multiple buy into a design is the >> best way to >> >> > > ensure long-term maintenance and use. >> >> > > >> >> > > >> >> > >> I think this should be optional (the same argument below about >> predicates >> >> > >> apply so I won't repeat them). >> >> > >> >> >> > > >> >> > > Per my comments above, maybe adapter versus native reader clarifies >> >> > > things. For example, I've been working on a native avro read >> >> > > implementation. It is little more than chicken scratch at this >> point but >> >> > > its goals, vision and design are very different than the adapter >> that is >> >> > > being produced atm. >> >> > > >> >> > > >> >> > >> Can you clarify the intent of this objective. Is it mainly to >> tie in with >> >> > >> the existing Java arrow memory book keeping? Performance? >> Something >> >> > >> else? >> >> > >> >> >> > > >> >> > > Arrow is designed to be off-heap. If you have large variable >> amounts of >> >> > > on-heap memory in an application, it starts to make it very hard >> to make >> >> > > decisions about off-heap versus on-heap memory since those >> divisions are by >> >> > > and large static in nature. It's fine for shor
Re: [DISCUSS][JAVA] Designs & goals for readers/writers
Hi Wes, I haven't checked locally but that file at least for me renders as text file in GitHub (with an Apache header). If we want all test data in the testing package I can make sure to move it but I thought text files might be ok in the main repo? Thanks, Micah On Tuesday, July 23, 2019, Wes McKinney wrote: > I noticed that test data-related files are beginning to be checked in > > https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/ > resources/schema/test.avsc > > I wanted to make sure this doesn't turn into a slippery slope where we > end up with several megabytes or more of test data files > > On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield > wrote: > > > > Hi Wes, > > Are there currently files that need to be moved? > > > > Thanks, > > Micah > > > > On Monday, July 22, 2019, Wes McKinney wrote: > >> > >> Sort of tangentially related, but while we are on the topic: > >> > >> Please, if you would, avoid checking binary test data files into the > >> main repository. Use https://github.com/apache/arrow-testing if you > >> truly need to check in binary data -- something to look out for in > >> code reviews > >> > >> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield > wrote: > >> > > >> > Hi Jacques, > >> > Thanks for the clarifications. I think the distinction is useful. > >> > > >> > If people want to write adapters for Arrow, I see that as useful but > very > >> > > different than writing native implementations and we should try to > create a > >> > > clear delineation between the two. > >> > > >> > > >> > What do you think about creating a "contrib" directory and moving the > JDBC > >> > and AVRO adapters into it? We should also probably provide more > description > >> > in pom.xml to make it clear for downstream consumers. > >> > > >> > We should probably come up with a name other than adapters for > >> > readers/writer ("converters"?) and use it in the directory structure > for > >> > the existing Orc implementation? 
> >> > > >> > Thanks, > >> > Micah > >> > > >> > > >> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau > wrote: > >> > > >> > > As I read through your responses, I think it might be useful to > talk about > >> > > adapters versus native Arrow readers/writers. Adapters are > something that > >> > > adapt an existing API to produce and/or consume Arrow data. A native > >> > > reader/writer is something that understand the format directly and > does not > >> > > have intermediate representations or APIs the data moves through > beyond > >> > > those that needs to be used to complete work. > >> > > > >> > > If people want to write adapters for Arrow, I see that as useful > but very > >> > > different than writing native implementations and we should try to > create a > >> > > clear delineation between the two. > >> > > > >> > > Further comments inline. > >> > > > >> > > > >> > >> Could you expand on what level of detail you would like to see a > design > >> > >> document? > >> > >> > >> > > > >> > > A couple paragraphs seems sufficient. This is the goals of the > >> > > implementation. We target existing functionality X. It is an > adapter. Or it > >> > > is a native impl. This is the expected memory and processing > >> > > characteristics, etc. I've never been one for huge amount of > design but > >> > > I've seen a number of recent patches appear where this is no upfront > >> > > discussion. Making sure that multiple buy into a design is the best > way to > >> > > ensure long-term maintenance and use. > >> > > > >> > > > >> > >> I think this should be optional (the same argument below about > predicates > >> > >> apply so I won't repeat them). > >> > >> > >> > > > >> > > Per my comments above, maybe adapter versus native reader clarifies > >> > > things. For example, I've been working on a native avro read > >> > > implementation. 
It is little more than chicken scratch at this > point but > >> > > its goals, vision and design are very different than the adapter > that is > >> > > being produced atm. > >> > > > >> > > > >> > >> Can you clarify the intent of this objective. Is it mainly to tie > in with > >> > >> the existing Java arrow memory book keeping? Performance? > Something > >> > >> else? > >> > >> > >> > > > >> > > Arrow is designed to be off-heap. If you have large variable > amounts of > >> > > on-heap memory in an application, it starts to make it very hard to > make > >> > > decisions about off-heap versus on-heap memory since those > divisions are by > >> > > and large static in nature. It's fine for short lived applications > but for > >> > > long lived applications, if you're working with a large amount of > data, you > >> > > want to keep most of your memory in one pool. In the context of > Arrow, this > >> > > is going to naturally be off-heap memory. > >> > > > >> > > > >> > >> I'm afraid this might lead to a "perfect is the enemy of the good" > >> > >> situation. Starting off with a known good implementation of > conversion to > >> > >> A
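Jacques's adapter-versus-native distinction can be sketched as two different contracts. The interface names below are hypothetical illustrations, not existing Arrow Java APIs:

```java
import java.util.Iterator;

// Hypothetical contracts sketching the adapter/native distinction; these
// are illustrations only, not real Arrow Java interfaces.
public class ReaderKinds {
    // An adapter wraps an existing API (a JDBC ResultSet, an Avro
    // DatumReader, ...) and converts that library's row objects into Arrow
    // data: bytes pass through the source library's intermediate
    // representation before landing in Arrow buffers.
    interface ArrowAdapter<T> {
        Iterator<T> sourceRows();  // rows produced by the wrapped API
        boolean loadNextBatch();   // convert the next chunk of rows
    }

    // A native reader understands the file format itself and decodes its
    // bytes directly into Arrow buffers, with no intermediate row objects.
    interface NativeArrowReader {
        boolean loadNextBatch();
    }

    public static void main(String[] args) {
        NativeArrowReader exhausted = () -> false;  // trivial stand-in
        System.out.println(exhausted.loadNextBatch());  // false
    }
}
```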
Re: [DISCUSS][JAVA] Designs & goals for readers/writers
Thanks for your proposal. Agreed that Arrow readers/writers should have high performance like the Orc reader, and as mentioned above, I think the current Avro adapter should be positioned as an adapter rather than a native reader. I'm not sure whether Arrow requires library-based adapters; I've updated the current design in ARROW-5845 [1] for your information anyway. Thanks, Ji Liu [1] https://issues.apache.org/jira/browse/ARROW-5845 -- From: Jacques Nadeau Send Time: Monday, July 22, 2019 09:16 To: dev ; Micah Kornfield Subject: Re: [DISCUSS][JAVA] Designs & goals for readers/writers As I read through your responses, I think it might be useful to talk about adapters versus native Arrow readers/writers. Adapters adapt an existing API to produce and/or consume Arrow data. A native reader/writer is something that understands the format directly and does not have intermediate representations or APIs the data moves through beyond those that need to be used to complete work. If people want to write adapters for Arrow, I see that as useful but very different from writing native implementations, and we should try to create a clear delineation between the two. Further comments inline. > Could you expand on what level of detail you would like to see a design > document? > A couple of paragraphs seems sufficient. These are the goals of the implementation. We target existing functionality X. It is an adapter. Or it is a native impl. These are the expected memory and processing characteristics, etc. I've never been one for a huge amount of design, but I've seen a number of recent patches appear where there is no upfront discussion. Making sure that multiple people buy into a design is the best way to ensure long-term maintenance and use. > I think this should be optional (the same argument below about predicates > applies so I won't repeat it). > Per my comments above, maybe adapter versus native reader clarifies things. For example, I've been working on a native Avro read implementation. 
It is little more than chicken scratch at this point, but its goals, vision, and design are very different from the adapter that is being produced atm.

> Can you clarify the intent of this objective. Is it mainly to tie in with
> the existing Java Arrow memory bookkeeping? Performance? Something else?

Arrow is designed to be off-heap. If you have large, variable amounts of on-heap memory in an application, it starts to become very hard to make decisions about off-heap versus on-heap memory, since those divisions are by and large static in nature. That's fine for short-lived applications, but for long-lived applications, if you're working with a large amount of data, you want to keep most of your memory in one pool. In the context of Arrow, this is naturally going to be off-heap memory.

> I'm afraid this might lead to a "perfect is the enemy of the good"
> situation. Starting off with a known good implementation of conversion to
> Arrow can allow us both to profile hot-spots and to provide a comparison of
> implementations to verify correctness.

I'm not clear what message we're sending as a community if we produce low-performance components. The whole point of Arrow is to increase performance, not decrease it. I'm targeting good, not perfect. At the same time, from my perspective, Arrow development should not be approached in the same way that general Java app development is. If we hold a high standard, we'll have fewer total integrations initially, but I think we'll solve more real-world problems.

> There is also the question of how widely adoptable we want Arrow libraries
> to be. It isn't surprising to me that Impala's Avro reader is an order of
> magnitude faster than the stock Java one. As far as I know Impala's is a
> C++ implementation that does JIT with LLVM. We could try to use it as a
> basis for converting to Arrow but I think this might limit adoption in some
> circumstances.
> Some organizations/people might be hesitant to adopt the technology due to:
> 1. Use of JNI.
> 2. Use of LLVM to do JIT.
>
> It seems that as long as we have a reasonably general interface to
> data-sources we should be able to optimize/refactor aggressively when
> needed.

This is somewhat the crux of the problem. It goes a little bit to who our consuming audience is and what we're trying to deliver. I'll also say that trying to build a high-quality implementation on top of a low-quality implementation or library-based adapter is worse than starting from scratch. I believe this is especially true in Java, where developers are trained to trust HotSpot and that things will be good enough. That is great in a web app, but not in systems software where we (and, I expect, others) will deploy Arrow.
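The off-heap versus on-heap distinction discussed above can be illustrated with plain JDK buffers. This is a minimal, stdlib-only sketch of the concept, not how Arrow Java actually allocates memory (Arrow uses its own allocator machinery); it only shows the two kinds of memory being contrasted.

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        // On-heap: backed by a byte[] that the garbage collector manages,
        // so large volumes of data compete with the application heap.
        ByteBuffer onHeap = ByteBuffer.allocate(1024);

        // Off-heap: native memory outside the GC'd heap; the kind of single
        // memory pool the discussion says Arrow data should live in.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1024);

        System.out.println("onHeap direct?  " + onHeap.isDirect());   // false
        System.out.println("offHeap direct? " + offHeap.isDirect());  // true
    }
}
```

In real Arrow Java code the pooling happens through Arrow's allocator rather than raw `ByteBuffer`s; the sketch only makes the on-heap/off-heap split concrete.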
Re: [DISCUSS][JAVA] Designs & goals for readers/writers
As I read through your responses, I think it might be useful to talk about adapters versus native Arrow readers/writers. Adapters are components that adapt an existing API to produce and/or consume Arrow data. A native reader/writer is something that understands the format directly and does not have intermediate representations or APIs the data moves through beyond those that need to be used to complete the work.

If people want to write adapters for Arrow, I see that as useful but very different than writing native implementations, and we should try to create a clear delineation between the two.

Further comments inline.

> Could you expand on what level of detail you would like to see a design
> document?

A couple of paragraphs seems sufficient: the goals of the implementation; we target existing functionality X; it is an adapter, or it is a native impl; the expected memory and processing characteristics; etc. I've never been one for huge amounts of design, but I've seen a number of recent patches appear where there is no upfront discussion. Making sure that multiple people buy into a design is the best way to ensure long-term maintenance and use.

> I think this should be optional (the same arguments below about predicates
> apply, so I won't repeat them).

Per my comments above, maybe adapter versus native reader clarifies things. For example, I've been working on a native Avro read implementation. It is little more than chicken scratch at this point, but its goals, vision, and design are very different from the adapter that is being produced atm.

> Can you clarify the intent of this objective. Is it mainly to tie in with
> the existing Java Arrow memory bookkeeping? Performance? Something else?

Arrow is designed to be off-heap. If you have large, variable amounts of on-heap memory in an application, it starts to become very hard to make decisions about off-heap versus on-heap memory, since those divisions are by and large static in nature.
That's fine for short-lived applications, but for long-lived applications, if you're working with a large amount of data, you want to keep most of your memory in one pool. In the context of Arrow, this is naturally going to be off-heap memory.

> I'm afraid this might lead to a "perfect is the enemy of the good"
> situation. Starting off with a known good implementation of conversion to
> Arrow can allow us both to profile hot-spots and to provide a comparison of
> implementations to verify correctness.

I'm not clear what message we're sending as a community if we produce low-performance components. The whole point of Arrow is to increase performance, not decrease it. I'm targeting good, not perfect. At the same time, from my perspective, Arrow development should not be approached in the same way that general Java app development is. If we hold a high standard, we'll have fewer total integrations initially, but I think we'll solve more real-world problems.

> There is also the question of how widely adoptable we want Arrow libraries
> to be. It isn't surprising to me that Impala's Avro reader is an order of
> magnitude faster than the stock Java one. As far as I know Impala's is a
> C++ implementation that does JIT with LLVM. We could try to use it as a
> basis for converting to Arrow but I think this might limit adoption in some
> circumstances. Some organizations/people might be hesitant to adopt the
> technology due to:
> 1. Use of JNI.
> 2. Use of LLVM to do JIT.
>
> It seems that as long as we have a reasonably general interface to
> data-sources we should be able to optimize/refactor aggressively when
> needed.

This is somewhat the crux of the problem. It goes a little bit to who our consuming audience is and what we're trying to deliver. I'll also say that trying to build a high-quality implementation on top of a low-quality implementation or library-based adapter is worse than starting from scratch.
I believe this is especially true in Java, where developers are trained to trust HotSpot and that things will be good enough. That is great in a web app, but not in systems software where we (and, I expect, others) will deploy Arrow.

> > 3. Propose a generalized "reader" interface as opposed to making each
> > reader have a different way to package/integrate.
>
> This also seems like a good idea. Is this something you were thinking of
> doing, or just a proposal that someone in the community should take up
> before we get too many more implementations?

I don't have something in mind and didn't have a plan to build something; I just want to make sure we start getting consistent early, as opposed to once we have a bunch of readers/adapters.
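The "one generalized reader interface" idea above can be sketched as a single contract that every source-specific reader (Avro, CSV, JDBC, ...) would implement, so consumers are written once. All names here (`BatchReader`, `FixedReader`) are hypothetical and are not an existing Arrow Java API; Arrow vectors are omitted so the sketch stays self-contained.

```java
public class ReaderContractSketch {

    /** Hypothetical common contract each source-specific reader implements. */
    interface BatchReader extends AutoCloseable {
        /** Advances to the next batch; returns false when the source is exhausted. */
        boolean loadNextBatch() throws Exception;
        /** Number of rows in the current batch. */
        int rowCount();
    }

    /** Toy implementation backed by a fixed row count, just to show the shape of use. */
    static final class FixedReader implements BatchReader {
        private int remaining;
        private final int batchSize;
        private int current = 0;

        FixedReader(int totalRows, int batchSize) {
            this.remaining = totalRows;
            this.batchSize = batchSize;
        }

        @Override public boolean loadNextBatch() {
            if (remaining == 0) return false;
            current = Math.min(batchSize, remaining);
            remaining -= current;
            return true;
        }

        @Override public int rowCount() { return current; }
        @Override public void close() {}
    }

    public static void main(String[] args) throws Exception {
        // A consumer is written once against BatchReader, whatever the source format.
        try (BatchReader reader = new FixedReader(10, 4)) {
            int total = 0;
            while (reader.loadNextBatch()) {
                total += reader.rowCount();
            }
            System.out.println("rows read: " + total); // rows read: 10
        }
    }
}
```

The design point is that the loop in `main` never changes when the underlying format does, which is exactly the consistency the message argues for establishing early.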
Re: [DISCUSS][JAVA] Designs & goals for readers/writers
Hi Jacques,

I added more comments/questions inline, but as a TL;DR: generally these all sound like good goals, but I have a concern that, as policy, they might lead to a "boil the ocean" type approach that could potentially delay useful functionality.

Thanks,
Micah

On Sun, Jul 21, 2019 at 2:41 PM Jacques Nadeau wrote:

> I've seen a couple of recent pieces of work on generating new
> readers/writers for Arrow (Avro and discussion of CSV). I'd like to propose
> a couple of guidelines to help ensure a high quality bar:
>
> 1. Design review first - Before someone starts implementing a particular
> reader/writer, let's ask for a basic design outline in jira, google docs,
> etc.

Could you expand on what level of detail you would like to see in a design document?

> 2. High bar for implementation: Having more readers for the sake of more
> readers should not be the goal of the project. Instead, people should
> expect Arrow Java readers to be high quality and faster than other readers
> (even if the consumer has to do a final conversion to move from the Arrow
> representation to their current internal representation). As such, I
> propose the following bars as part of design work:
>
> 1. Field selection support as part of reads - Make sure that each
> implementation supports field selection (which columns to materialize) as
> part of the interface.

I think this should be optional (the same arguments below about predicates apply, so I won't repeat them).

> 2. Configurable target batch size - Different systems will want to
> control the target size of batch data.

Agree this should be supported by all readers. I view the Avro implementation as a work in progress, but I did raise this on the PRs and expect it should be done before we call the Avro work done.

> 3. Minimize use of heap memory - Most of the core existing Arrow Java
> libraries have been very focused on minimizing on-heap memory consumption.
> While there may be some, we continue to try to reduce the footprint as
> much as possible. When creating new readers/writers, I think we should
> target the same standard for new readers. For example, the current Avro
> reader PR relies heavily on the Java Avro project's reader implementation,
> which has very poor heap characteristics.

Can you clarify the intent of this objective? Is it mainly to tie in with the existing Java Arrow memory bookkeeping? Performance? Something else?

> 4. Industry leading performance - People should expect that using
> Arrow stuff is very fast. Releasing something under this banner means we
> should focus on achieving that kind of target. To pick on the Avro reader
> again here, our previous analysis has shown that the Java Avro project's
> reader (not the Arrow connected impl) is frequently an order of magnitude+
> slower than some other open source Avro readers (such as Impala's
> implementation), especially when applying any predicates or projections.

I'm afraid this might lead to a "perfect is the enemy of the good" situation. Starting off with a known good implementation of conversion to Arrow can allow us both to profile hot-spots and to provide a comparison of implementations to verify correctness.

There is also the question of how widely adoptable we want Arrow libraries to be. It isn't surprising to me that Impala's Avro reader is an order of magnitude faster than the stock Java one. As far as I know, Impala's is a C++ implementation that does JIT with LLVM. We could try to use it as a basis for converting to Arrow, but I think this might limit adoption in some circumstances. Some organizations/people might be hesitant to adopt the technology due to:

1. Use of JNI.
2. Use of LLVM to do JIT.

It seems that as long as we have a reasonably general interface to data-sources we should be able to optimize/refactor aggressively when needed.

> 5.
> (Ideally) Predicate application as part of reads - In 99% of the
> workloads we've seen, a user is frequently applying one or more predicates
> when reading data. Whatever performance you gain from a strong
> implementation for reads will be drowned out in most cases if you fail to
> apply predicates as part of reading (and thus have to materialize far more
> records than you'll need in a minute).

I agree this would probably be useful, and something that should be considered as part of a generalized reader. It doesn't seem like it should necessarily block implementations. For instance, as far as I know this isn't implemented in the C++ CSV reader (and I'm pretty sure the other file format readers we have in C++ don't support it yet either). Also, as far as I know, Apache Spark treats predicate push-downs on its data-sets as optional.

> 3. Propose a generalized "reader" interface as opposed to making each
> reader have a different way to package/integrate.
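The materialization cost argued about above can be sketched in plain Java: applying the predicate while decoding means non-matching rows are never materialized, while filtering afterward pays for every row first. The helper names are illustrative only and correspond to no real Arrow or Avro API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class PredicatePushdownSketch {

    // Predicate applied as rows are decoded: only matching rows are materialized.
    static List<Integer> readWithPushdown(int[] source, IntPredicate pred) {
        List<Integer> batch = new ArrayList<>();
        for (int v : source) {
            if (pred.test(v)) {
                batch.add(v);  // allocate only for rows the consumer wants
            }
        }
        return batch;
    }

    // Read everything, then filter: every row is materialized before the
    // predicate runs, which is the waste the thread describes.
    static List<Integer> readThenFilter(int[] source, IntPredicate pred) {
        List<Integer> all = new ArrayList<>();
        for (int v : source) {
            all.add(v);  // materializes all rows, matching or not
        }
        all.removeIf(v -> !pred.test(v));
        return all;
    }

    public static void main(String[] args) {
        int[] data = {1, 5, 10, 15, 20};
        System.out.println(readWithPushdown(data, v -> v > 9));  // [10, 15, 20]
        System.out.println(readThenFilter(data, v -> v > 9));    // [10, 15, 20]
    }
}
```

Both paths return the same rows; the difference is how many rows were allocated along the way, which is where the performance of a reader with pushdown comes from.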
[DISCUSS][JAVA] Designs & goals for readers/writers
I've seen a couple of recent pieces of work on generating new readers/writers for Arrow (Avro and discussion of CSV). I'd like to propose a couple of guidelines to help ensure a high quality bar:

1. Design review first - Before someone starts implementing a particular reader/writer, let's ask for a basic design outline in jira, google docs, etc.

2. High bar for implementation: Having more readers for the sake of more readers should not be the goal of the project. Instead, people should expect Arrow Java readers to be high quality and faster than other readers (even if the consumer has to do a final conversion to move from the Arrow representation to their current internal representation). As such, I propose the following bars as part of design work:

   1. Field selection support as part of reads - Make sure that each implementation supports field selection (which columns to materialize) as part of the interface.

   2. Configurable target batch size - Different systems will want to control the target size of batch data.

   3. Minimize use of heap memory - Most of the core existing Arrow Java libraries have been very focused on minimizing on-heap memory consumption. While there may be some, we continue to try to reduce the footprint as much as possible. When creating new readers/writers, I think we should target the same standard for new readers. For example, the current Avro reader PR relies heavily on the Java Avro project's reader implementation, which has very poor heap characteristics.

   4. Industry leading performance - People should expect that using Arrow stuff is very fast. Releasing something under this banner means we should focus on achieving that kind of target.
To pick on the Avro reader again here, our previous analysis has shown that the Java Avro project's reader (not the Arrow connected impl) is frequently an order of magnitude+ slower than some other open source Avro readers (such as Impala's implementation), especially when applying any predicates or projections.

   5. (Ideally) Predicate application as part of reads - In 99% of the workloads we've seen, a user is frequently applying one or more predicates when reading data. Whatever performance you gain from a strong implementation for reads will be drowned out in most cases if you fail to apply predicates as part of reading (and thus have to materialize far more records than you'll need in a minute).

3. Propose a generalized "reader" interface as opposed to making each reader have a different way to package/integrate.

What do other people think?
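Bars 1 and 2 of the proposal (field selection and a configurable target batch size) can be sketched as options honored by a toy in-memory reader. Everything here (`RowReader`, the plain-`Map` rows) is hypothetical and stands in for real Arrow vectors purely to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReaderBarsSketch {

    /** Toy reader over in-memory rows that honors projection and batch size. */
    static final class RowReader {
        private final List<Map<String, Object>> rows;
        private final List<String> selectedFields;  // bar 1: which columns to materialize
        private final int targetBatchSize;          // bar 2: caller-controlled batch size
        private int position = 0;

        RowReader(List<Map<String, Object>> rows, List<String> fields, int batchSize) {
            this.rows = rows;
            this.selectedFields = fields;
            this.targetBatchSize = batchSize;
        }

        /** Returns the next batch of projected rows; empty list at end of input. */
        List<Map<String, Object>> nextBatch() {
            List<Map<String, Object>> batch = new ArrayList<>();
            while (position < rows.size() && batch.size() < targetBatchSize) {
                Map<String, Object> row = rows.get(position++);
                Map<String, Object> projected = new LinkedHashMap<>();
                for (String f : selectedFields) {
                    projected.put(f, row.get(f));  // materialize only selected columns
                }
                batch.add(projected);
            }
            return batch;
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
            Map.of("id", 1, "name", "a", "extra", "x"),
            Map.of("id", 2, "name", "b", "extra", "y"),
            Map.of("id", 3, "name", "c", "extra", "z"));

        // Select two of three columns, two rows per batch.
        RowReader reader = new RowReader(rows, List.of("id", "name"), 2);
        System.out.println(reader.nextBatch());  // first batch: 2 projected rows
        System.out.println(reader.nextBatch());  // final batch: 1 projected row
    }
}
```

The point of baking both options into the constructor is the one the proposal makes: if every reader exposes them in its interface from day one, consumers do not need per-format workarounds later.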