Re: [DISCUSS] ORC separate project
On Mon, Apr 13, 2015 at 10:43 PM, Sergey Shelukhin ser...@hortonworks.com wrote:

The 2nd concern about fixing issue quickly doesn't make sense - it can happen with any dependency. What if guava or Kryo or Spark or Tez have a bug? We can still ship Hive as long as the dependency can be updated to correct version.

Agreed. Hive has many dependencies and any of them could require a quick turn around. The ORC project will largely consist of people from the Hive community and thus will be far more responsive to Hive's concerns than many of the other projects.

Thanks, Owen
Re: [DISCUSS] ORC separate project
IMHO there are 2 separate concerns, forking ORC and Hive using "new" ORC. The first one does not really require vote, as discussed on private/board - anyone can fork part of code (in this case, at least). Then, for Hive switching to "new" ORC, I'm not sure that requires a vote either. We didn't vote when we added Kryo or Spark or Tez dependency... it's just a (big) code change. 3 +1s like a branch merge will be enough, or even one +1 maybe.

The 2nd concern about fixing issue quickly doesn't make sense - it can happen with any dependency. What if guava or Kryo or Spark or Tez have a bug? We can still ship Hive as long as the dependency can be updated to correct version.

On 15/4/10, 20:05, Xuefu Zhang xzh...@cloudera.com wrote:

To Lefty's comment - Yes, anyone can take Apache code and make another project at will. However, for changes made to an existing project as part of that process, such as what Owen described for ORC in Hive, it is certainly something that Hive PMC can control or vote on. Nevertheless, that's not my immediate concern.

To Owen's explanation - Thanks. I guess my major concern is that we seemingly are breaking apart Hive's integrity and making it hard to release and maintain due to increasing number of external dependents. Let's say that Hive depends on a certain version of ORC (as TLP) and it's found that ORC has a bug that seriously impacts Hive users. We cannot release Hive as fast as we can, since doing so would need ORC community to fix the problem and make a release, for which Hive PMC has no control. On the contrary, Hive community can quickly fix the problem and make a release without waiting for other projects to make a release. I'm not sure this move (ORC as TLP) will be beneficial to vast Hive users.

If this is not convincing, let me propose that we spin off metastore also as TLP tomorrow!

Thanks, Xuefu

On Wed, Apr 8, 2015 at 8:33 AM, Owen O'Malley omal...@apache.org wrote:

On Tue, Apr 7, 2015 at 8:49 PM, Xuefu Zhang xzh...@cloudera.com wrote:

If I understood Allen's #2 comment, we are moving existing ORC code out of Hive and making it a separate project, which I definitely missed.

I'm sorry that wasn't clear. Yes, most of the code that is currently in org.apache.hadoop.hive.ql.io.orc will move to the new project. The biggest change on the Hive side will be to create a new Hive module that defines the API that storage formats like ORC need to code against if they want high performance integration with Hive's vectorization. I've started that jira at https://issues.apache.org/jira/browse/HIVE-10171 . Creating this API should help us create a clean interface for storage formats that will help ORC and other columnar formats like Trevni or Parquet. Once the ORC project has made its first release, we can create a Hive jira to replace the Hive ORC code with a reference to the ORC release jar.

Since existing Hive PMC has governance on the code, I would expect it's still the case even after the spinoff.

No, Apache doesn't allow umbrella projects where one PMC controls sub-projects. The reason is that the Apache board has found that controlling projects directly instead of indirectly through another PMC reduces the problems.

.. Owen
Re: [DISCUSS] ORC separate project
Speaking of the C++ ORC reader and writer, could they be included in the Hive project or do they have to be separate because they aren't Java code?

By the way, gmail thwarts adding [DISCUSS] to the subject line. It shows up in the mail archives, although pre- and post-DISCUSS threads are separate.

-- Lefty

On Fri, Apr 10, 2015 at 11:56 PM, Gopal Vijayaraghavan gop...@apache.org wrote:

On 4/10/15, 8:05 PM, Xuefu Zhang xzh...@cloudera.com wrote:

To Owen's explanation - Thanks. I guess my major concern is that we seemingly are breaking apart Hive's integrity and making it hard to release and maintain due to increasing number of external dependents. Let's say that Hive depends on a certain version of ORC (as TLP) and it's found that ORC has a bug that seriously impacts Hive users. We cannot release Hive as fast as we can, since doing so would need ORC community to fix the problem and make a release, for which Hive PMC has no control. On the contrary, Hive community can quickly fix the problem and make a release without waiting for other projects to make a release. I'm not sure this move (ORC as TLP) will be beneficial to vast Hive users.

You need to understand exactly what this brings about for Hive, in fact to those who do not use ORC today. With the proposed changes, competing formats like Parquet might be able to compete with ORC in terms of Hive features. That is the direct impact of standardization of a Storage-API implementation. As an independent project, new ORC features cannot use the fact that it is included in the ql/ source to introduce circular dependencies between ql.exec - orc - ql.exec.vector classes.

As far as your concern for risks goes, I would ask for a comparison against the bugs/release cycles of "STORED AS PARQUET". As a Hive contributor, I'm certain that if I find a core issue in Parquet, my patches would be welcome there. That should be beneficial to the Parquet community, but might not be aligned entirely along employer lines, since my patch might be good, but my intention would be to migrate warehouses with parquet.hive.DeprecatedParquetInputFormat Impala tables to Hive. Resolving that conflict should be ideally left to the Parquet IPMC the ASF rather than the Hive PMC (or let's do a bias check *to* Hive?).

Now - reverse that argument and replay it, except instead we're talking about the C++ ORC reader plus a non-ASF SQL competitor to Hive.

If this is not convincing, let me propose that we spin off metastore also as TLP tomorrow!

http://incubator.apache.org/projects/hcatalog.html

Cheers, Gopal
Re: [DISCUSS] ORC separate project
On 4/10/15, 8:05 PM, Xuefu Zhang xzh...@cloudera.com wrote:

To Owen's explanation - Thanks. I guess my major concern is that we seemingly are breaking apart Hive's integrity and making it hard to release and maintain due to increasing number of external dependents. Let's say that Hive depends on a certain version of ORC (as TLP) and it's found that ORC has a bug that seriously impacts Hive users. We cannot release Hive as fast as we can, since doing so would need ORC community to fix the problem and make a release, for which Hive PMC has no control. On the contrary, Hive community can quickly fix the problem and make a release without waiting for other projects to make a release. I'm not sure this move (ORC as TLP) will be beneficial to vast Hive users.

You need to understand exactly what this brings about for Hive, in fact to those who do not use ORC today. With the proposed changes, competing formats like Parquet might be able to compete with ORC in terms of Hive features. That is the direct impact of standardization of a Storage-API implementation. As an independent project, new ORC features cannot use the fact that it is included in the ql/ source to introduce circular dependencies between ql.exec - orc - ql.exec.vector classes.

As far as your concern for risks goes, I would ask for a comparison against the bugs/release cycles of "STORED AS PARQUET". As a Hive contributor, I'm certain that if I find a core issue in Parquet, my patches would be welcome there. That should be beneficial to the Parquet community, but might not be aligned entirely along employer lines, since my patch might be good, but my intention would be to migrate warehouses with parquet.hive.DeprecatedParquetInputFormat Impala tables to Hive. Resolving that conflict should be ideally left to the Parquet IPMC the ASF rather than the Hive PMC (or let's do a bias check *to* Hive?).

Now - reverse that argument and replay it, except instead we're talking about the C++ ORC reader plus a non-ASF SQL competitor to Hive.

If this is not convincing, let me propose that we spin off metastore also as TLP tomorrow!

http://incubator.apache.org/projects/hcatalog.html

Cheers, Gopal
Re: [DISCUSS] ORC separate project
To Lefty's comment - Yes, anyone can take Apache code and make another project at will. However, for changes made to an existing project as part of that process, such as what Owen described for ORC in Hive, it is certainly something that Hive PMC can control or vote on. Nevertheless, that's not my immediate concern.

To Owen's explanation - Thanks. I guess my major concern is that we seemingly are breaking apart Hive's integrity and making it hard to release and maintain due to increasing number of external dependents. Let's say that Hive depends on a certain version of ORC (as TLP) and it's found that ORC has a bug that seriously impacts Hive users. We cannot release Hive as fast as we can, since doing so would need ORC community to fix the problem and make a release, for which Hive PMC has no control. On the contrary, Hive community can quickly fix the problem and make a release without waiting for other projects to make a release. I'm not sure this move (ORC as TLP) will be beneficial to vast Hive users.

If this is not convincing, let me propose that we spin off metastore also as TLP tomorrow!

Thanks, Xuefu

On Wed, Apr 8, 2015 at 8:33 AM, Owen O'Malley omal...@apache.org wrote:

On Tue, Apr 7, 2015 at 8:49 PM, Xuefu Zhang xzh...@cloudera.com wrote:

If I understood Allen's #2 comment, we are moving existing ORC code out of Hive and making it a separate project, which I definitely missed.

I'm sorry that wasn't clear. Yes, most of the code that is currently in org.apache.hadoop.hive.ql.io.orc will move to the new project. The biggest change on the Hive side will be to create a new Hive module that defines the API that storage formats like ORC need to code against if they want high performance integration with Hive's vectorization. I've started that jira at https://issues.apache.org/jira/browse/HIVE-10171 . Creating this API should help us create a clean interface for storage formats that will help ORC and other columnar formats like Trevni or Parquet. Once the ORC project has made its first release, we can create a Hive jira to replace the Hive ORC code with a reference to the ORC release jar.

Since existing Hive PMC has governance on the code, I would expect it's still the case even after the spinoff.

No, Apache doesn't allow umbrella projects where one PMC controls sub-projects. The reason is that the Apache board has found that controlling projects directly instead of indirectly through another PMC reduces the problems.

.. Owen
Re: [DISCUSS] ORC separate project
On Mon, Apr 6, 2015 at 11:26 PM, Brock Noland br...@apache.org wrote:

Hey guys, Good discussion here. One point of order, I feel like this should be a [DISCUSS] thread.

Ok, I've edited the subject on this reply. At the very least, this will hit people's filters.

.. Owen
Re: [DISCUSS] ORC separate project
On Tue, Apr 7, 2015 at 8:49 PM, Xuefu Zhang xzh...@cloudera.com wrote:

If I understood Allen's #2 comment, we are moving existing ORC code out of Hive and making it a separate project, which I definitely missed.

I'm sorry that wasn't clear. Yes, most of the code that is currently in org.apache.hadoop.hive.ql.io.orc will move to the new project. The biggest change on the Hive side will be to create a new Hive module that defines the API that storage formats like ORC need to code against if they want high performance integration with Hive's vectorization. I've started that jira at https://issues.apache.org/jira/browse/HIVE-10171 . Creating this API should help us create a clean interface for storage formats that will help ORC and other columnar formats like Trevni or Parquet. Once the ORC project has made its first release, we can create a Hive jira to replace the Hive ORC code with a reference to the ORC release jar.

Since existing Hive PMC has governance on the code, I would expect it's still the case even after the spinoff.

No, Apache doesn't allow umbrella projects where one PMC controls sub-projects. The reason is that the Apache board has found that controlling projects directly instead of indirectly through another PMC reduces the problems.

.. Owen