Re: [DISCUSS] ORC separate project

2015-04-15 Thread Owen O'Malley
On Mon, Apr 13, 2015 at 10:43 PM, Sergey Shelukhin ser...@hortonworks.com
wrote:

 The 2nd concern about fixing issues quickly doesn't make sense - it can
 happen with any dependency. What if guava or Kryo or Spark or Tez have a
 bug? We can still ship Hive as long as the dependency can be updated to
 the correct version.


Agreed. Hive has many dependencies and any of them could require a quick
turnaround. The ORC project will largely consist of people from the Hive
community and thus will be far more responsive to Hive's concerns than many
of the other projects.

Thanks,
   Owen


Re: [DISCUSS] ORC separate project

2015-04-13 Thread Sergey Shelukhin
IMHO there are 2 separate concerns: forking ORC, and Hive using the "new" ORC.
The first one does not really require a vote, as discussed on private/board
- anyone can fork part of the code (in this case, at least). Then, for Hive
switching to the "new" ORC, I'm not sure that requires a vote either. We
didn't vote when we added the Kryo or Spark or Tez dependency... it's just a
(big) code change. 3 +1s like a branch merge will be enough, or even one
+1 maybe.

The 2nd concern about fixing issues quickly doesn't make sense - it can
happen with any dependency. What if guava or Kryo or Spark or Tez have a
bug? We can still ship Hive as long as the dependency can be updated to
the correct version.

On 15/4/10, 20:05, Xuefu Zhang xzh...@cloudera.com wrote:

To Lefty's comment -  Yes, anyone can take Apache code and make another
project at will. However, for changes made to an existing project as part
of that process, such as what Owen described for ORC in Hive, it is
certainly something that Hive PMC can control or vote on. Nevertheless,
that's not my immediate concern.

To Owen's explanation - Thanks. I guess my major concern is that we
seemingly are breaking apart Hive's integrity and making it hard to release
and maintain due to an increasing number of external dependencies. Let's say
that Hive depends on a certain version of ORC (as TLP) and it's found that
ORC has a bug that seriously impacts Hive users. We cannot release Hive as
fast as we can, since doing so would need the ORC community to fix the problem
and make a release, over which the Hive PMC has no control. On the contrary,
Hive community can quickly fix the problem and make a release without
waiting for other projects to make a release. I'm not sure this move (ORC
as TLP) will be beneficial to vast Hive users.

If this is not convincing, let me propose that we spin off metastore also as
TLP tomorrow!

Thanks,
Xuefu


On Wed, Apr 8, 2015 at 8:33 AM, Owen O'Malley omal...@apache.org wrote:

 On Tue, Apr 7, 2015 at 8:49 PM, Xuefu Zhang xzh...@cloudera.com wrote:

  If I understood Allen's #2 comment, we are moving existing ORC code out of
  Hive and making it a separate project, which I definitely missed.
 

 I'm sorry that wasn't clear. Yes, most of the code that is currently in
 org.apache.hadoop.hive.ql.io.orc will move to the new project.

 The biggest change on the Hive side will be to create a new Hive module
 that defines the API that storage formats like ORC need to code against if
 they want high performance integration with Hive's vectorization. I've
 started that jira at https://issues.apache.org/jira/browse/HIVE-10171 .
 Creating this API should help us create a clean interface for storage
 formats that will help ORC and other columnar formats like Trevni or
 Parquet.

 Once the ORC project has made its first release, we can create a Hive jira
 to replace the Hive ORC code with a reference to the ORC release jar.


  Since existing Hive PMC has governance on the code, I would expect it's
  still the case even after the spinoff.
 

 No, Apache doesn't allow umbrella projects where one PMC controls
 sub-projects. The reason is that the Apache board has found that
 controlling projects directly instead of indirectly through another PMC
 reduces the problems.

 .. Owen




Re: [DISCUSS] ORC separate project

2015-04-11 Thread Lefty Leverenz
Speaking of the C++ ORC reader and writer, could they be included in the
Hive project or do they have to be separate because they aren't Java code?

By the way, gmail thwarts adding [DISCUSS] to the subject line.  It shows
up in the mail archives, although pre- and post-DISCUSS threads are separate.

-- Lefty

On Fri, Apr 10, 2015 at 11:56 PM, Gopal Vijayaraghavan gop...@apache.org
wrote:



 On 4/10/15, 8:05 PM, Xuefu Zhang xzh...@cloudera.com wrote:

 To Owen's explanation - Thanks. I guess my major concern is that we
 seemingly are breaking apart Hive's integrity and making it hard to release
 and maintain due to an increasing number of external dependencies. Let's say
 that Hive depends on a certain version of ORC (as TLP) and it's found that
 ORC has a bug that seriously impacts Hive users. We cannot release Hive as
 fast as we can, since doing so would need the ORC community to fix the problem
 and make a release, over which the Hive PMC has no control. On the contrary,
 Hive community can quickly fix the problem and make a release without
 waiting for other projects to make a release. I'm not sure this move (ORC
 as TLP) will be beneficial to vast Hive users.

 You need to understand exactly what this brings about for Hive, in fact to
 those who do not use ORC today.

 With the proposed changes, competing formats like Parquet might be able to
 compete with ORC in terms of hive features.

 That is the direct impact of standardization of a Storage-API
 implementation.

 As an independent project, new ORC features cannot use the fact that it is
 included in the ql/ source to introduce circular dependencies between
 ql.exec -> orc -> ql.exec.vector classes.

 As far as your concern for risks goes, I would ask for a comparison against
 the bugs/release cycles of "STORED AS PARQUET".

 As a Hive contributor, I'm certain that if I find a core issue in Parquet,
 my patches would be welcome there.

 That should be beneficial to the Parquet community, but might not be
 aligned entirely along employer lines, since my patch might be good, but
 my intention would be to migrate warehouses with
 parquet.hive.DeprecatedParquetInputFormat Impala tables to Hive.

 Resolving that conflict should be ideally left to the Parquet IPMC & the
 ASF rather than the Hive PMC (or let's do a bias check *to* Hive?).

 Now - reverse that argument and replay it, except instead we're talking
 about the C++ ORC reader plus a non-ASF SQL competitor to Hive.


 If this is not convincing, let me propose that we spin off metastore also as
 TLP tomorrow!

 http://incubator.apache.org/projects/hcatalog.html

 Cheers,
 Gopal





Re: [DISCUSS] ORC separate project

2015-04-10 Thread Gopal Vijayaraghavan


On 4/10/15, 8:05 PM, Xuefu Zhang xzh...@cloudera.com wrote:

To Owen's explanation - Thanks. I guess my major concern is that we
seemingly are breaking apart Hive's integrity and making it hard to release
and maintain due to an increasing number of external dependencies. Let's say
that Hive depends on a certain version of ORC (as TLP) and it's found that
ORC has a bug that seriously impacts Hive users. We cannot release Hive as
fast as we can, since doing so would need the ORC community to fix the problem
and make a release, over which the Hive PMC has no control. On the contrary,
Hive community can quickly fix the problem and make a release without
waiting for other projects to make a release. I'm not sure this move (ORC
as TLP) will be beneficial to vast Hive users.

You need to understand exactly what this brings about for Hive, in fact to
those who do not use ORC today.

With the proposed changes, competing formats like Parquet might be able to
compete with ORC in terms of hive features.

That is the direct impact of standardization of a Storage-API
implementation.

As an independent project, new ORC features cannot use the fact that it is
included in the ql/ source to introduce circular dependencies between
ql.exec -> orc -> ql.exec.vector classes.

As far as your concern for risks goes, I would ask for a comparison against
the bugs/release cycles of "STORED AS PARQUET".

As a Hive contributor, I'm certain that if I find a core issue in Parquet,
my patches would be welcome there.

That should be beneficial to the Parquet community, but might not be
aligned entirely along employer lines, since my patch might be good, but
my intention would be to migrate warehouses with
parquet.hive.DeprecatedParquetInputFormat Impala tables to Hive.

Resolving that conflict should be ideally left to the Parquet IPMC & the
ASF rather than the Hive PMC (or let's do a bias check *to* Hive?).

Now - reverse that argument and replay it, except instead we're talking
about the C++ ORC reader plus a non-ASF SQL competitor to Hive.


If this is not convincing, let me propose that we spin off metastore also as
TLP tomorrow!

http://incubator.apache.org/projects/hcatalog.html

Cheers,
Gopal




Re: [DISCUSS] ORC separate project

2015-04-10 Thread Xuefu Zhang
To Lefty's comment -  Yes, anyone can take Apache code and make another
project at will. However, for changes made to an existing project as part
of that process, such as what Owen described for ORC in Hive, it is
certainly something that Hive PMC can control or vote on. Nevertheless,
that's not my immediate concern.

To Owen's explanation - Thanks. I guess my major concern is that we
seemingly are breaking apart Hive's integrity and making it hard to release
and maintain due to an increasing number of external dependencies. Let's say
that Hive depends on a certain version of ORC (as TLP) and it's found that
ORC has a bug that seriously impacts Hive users. We cannot release Hive as
fast as we can, since doing so would need the ORC community to fix the problem
and make a release, over which the Hive PMC has no control. On the contrary,
Hive community can quickly fix the problem and make a release without
waiting for other projects to make a release. I'm not sure this move (ORC
as TLP) will be beneficial to vast Hive users.

If this is not convincing, let me propose that we spin off metastore also as
TLP tomorrow!

Thanks,
Xuefu


On Wed, Apr 8, 2015 at 8:33 AM, Owen O'Malley omal...@apache.org wrote:

 On Tue, Apr 7, 2015 at 8:49 PM, Xuefu Zhang xzh...@cloudera.com wrote:

  If I understood Allen's #2 comment, we are moving existing ORC code out of
  Hive and making it a separate project, which I definitely missed.
 

 I'm sorry that wasn't clear. Yes, most of the code that is currently in
 org.apache.hadoop.hive.ql.io.orc will move to the new project.

 The biggest change on the Hive side will be to create a new Hive module
 that defines the API that storage formats like ORC need to code against if
 they want high performance integration with Hive's vectorization. I've
 started that jira at https://issues.apache.org/jira/browse/HIVE-10171 .
 Creating this API should help us create a clean interface for storage
 formats that will help ORC and other columnar formats like Trevni or
 Parquet.

 Once the ORC project has made its first release, we can create a Hive jira
 to replace the Hive ORC code with a reference to the ORC release jar.


  Since existing Hive PMC has governance on the code, I would expect it's
  still the case even after the spinoff.
 

 No, Apache doesn't allow umbrella projects where one PMC controls
 sub-projects. The reason is that the Apache board has found that
 controlling projects directly instead of indirectly through another PMC
 reduces the problems.

 .. Owen



Re: [DISCUSS] ORC separate project

2015-04-08 Thread Owen O'Malley
On Mon, Apr 6, 2015 at 11:26 PM, Brock Noland br...@apache.org wrote:

 Hey guys,

 Good discussion here. One point of order, I feel like this should be a
 [DISCUSS] thread.


Ok, I've edited the subject on this reply. At the very least, this will hit
people's filters.

.. Owen


Re: [DISCUSS] ORC separate project

2015-04-08 Thread Owen O'Malley
On Tue, Apr 7, 2015 at 8:49 PM, Xuefu Zhang xzh...@cloudera.com wrote:

 If I understood Allen's #2 comment, we are moving existing ORC code out of
 Hive and making it a separate project, which I definitely missed.


I'm sorry that wasn't clear. Yes, most of the code that is currently in
org.apache.hadoop.hive.ql.io.orc will move to the new project.

The biggest change on the Hive side will be to create a new Hive module
that defines the API that storage formats like ORC need to code against if
they want high performance integration with Hive's vectorization. I've
started that jira at https://issues.apache.org/jira/browse/HIVE-10171 .
Creating this API should help us create a clean interface for storage
formats that will help ORC and other columnar formats like Trevni or
Parquet.

Once the ORC project has made its first release, we can create a Hive jira
to replace the Hive ORC code with a reference to the ORC release jar.


 Since existing Hive PMC has governance on the code, I would expect it's
 still the case even after the spinoff.


No, Apache doesn't allow umbrella projects where one PMC controls
sub-projects. The reason is that the Apache board has found that
controlling projects directly instead of indirectly through another PMC
reduces the problems.

.. Owen