HBase does have a ensure-jars-have-correct-contents.sh check that they run which we could do something similar to.
On Wed, Nov 15, 2017 at 3:30 PM Stephen Moist <mo...@cloudera.com> wrote: > Is there any way to add a check to the maven build to make sure that the > classes being imported are the shaded ones? > > > > On Nov 15, 2017, at 3:02 PM, Brian Towles <btow...@cloudera.com> wrote: > > > > I can see that. I can look into that as well. It would make sense for > the > > "Sentry as a Library" you talked about as well > > > > On Wed, Nov 15, 2017 at 2:56 PM Alexander Kolbasov <ak...@cloudera.com> > > wrote: > > > >> Brian, There are cases where shading is useful for server as well - > mostly > >> for e2e tests which run server, HDFS, Hive all in the same JVM. > >> > >> On Wed, Nov 15, 2017 at 12:47 PM, Brian Towles <btow...@cloudera.com> > >> wrote: > >> > >>> So there would be two different levels of shading. > >>> > >>> The first would be the known issues shared repackaged dependencies like > >>> Guava. This would only be for libraries that we absolutely know will > >> cause > >>> collisions with implementing services and implementing third party > >>> libraries that need to have the references to the trouble causing > >> libraries > >>> shifted. These trouble causing libraries would be the only ones where > >> hard > >>> coded import path changes would need to take place for. > >>> > >>> For example a sentry-thridparty/sentry-shaded-miscellaneous module > would > >>> shared google guava shifting the package to > >>> org.apache.sentry.thirdparty.com.google.common. As well as it would > shade > >>> the BoneCP jar so that it picks up the shift of the Guava package but > not > >>> shifting the BoneCP packages unless we encounter and a known conflict. > >>> > >>> The second level os shading would be for the plugins distribution and > >> would > >>> shade and package shift only the dependencies needed for that plugin. > >> This > >>> would not need to have any import statement changes as the maven shade > >>> plugin would change the bytecode package references on packaging. As > to > >>> Kalyan's question, the run time memory used for class loading would > only > >> be > >>> for those items used by the plugin, which is really not much anyways > >>> comparatively. The class loader doesn't load up a class until its > needed. > >>> Till that point it is just a reference to a file in a jar. The list of > >>> files/classes in jars is not a heavy memory load. > >>> > >>> The main server implementation would not need to have the second level > >>> shading done to it as it runs stand alone and is unaffected by the > >>> collisions issues. > >>> > >>> @Alexander For your questions, there is a small amount of refactoring > of > >>> import statements that need to take place for the first level of > shading > >>> but there are no logic changes that take place. > >>> > >>> @Stephen Yeah I can see BouncyCastle being more of an issue but it does > >> not > >>> appear in any of the Sentry dependency chains right now. > >>> > >>> -=Brian > >>> > >>> On Wed, Nov 15, 2017 at 1:30 PM Stephen Moist <mo...@cloudera.com> > >> wrote: > >>> > >>>> +1 > >>>> > >>>> Agree with Kalyan’s concern. To me it seems simpler to shade all jars > >>>> rather than known issue jars. One thing though is what if there are > >> bug > >>>> fixes in our jars that are not included in other components, it may > >> cause > >>>> subtle error that are hard to track. I’m thinking along the lines of > >> the > >>>> KeyValue class change in behavior. So I propose limited the scope of > >>>> shaded jars to known issues. > >>>> > >>>> However one thing to note is that some jars do not like to be shaded, > >> as > >>>> an example BouncyCastle has a check to see if the jar has been > modified > >>> and > >>>> will refuse to be used in a shaded jar. > >>>> > >>>>> On Nov 15, 2017, at 1:07 PM, Kalyan Kumar Kalvagadda < > >>>> kkal...@cloudera.com> wrote: > >>>>> > >>>>> +1 > >>>>> > >>>>> It's good approach but I have a question/concern. > >>>>> Is the proposal to shade is for some specific jars OR to shade all > >> the > >>>>> third party jars? > >>>>> > >>>>> If proposal is shade all the third-party jars then if would impact > >> the > >>>> run > >>>>> time memory usage as all the classes from the third-party jars would > >> be > >>>>> loaded regardless.. > >>>>> > >>>>> -Kalyan > >>>>> > >>>>> On Wed, Nov 15, 2017 at 12:04 PM, Sergio Pena < > >>> sergio.p...@cloudera.com> > >>>>> wrote: > >>>>> > >>>>>> +1 > >>>>>> > >>>>>> I like the idea. It's hard to upgrade our libraries to newer ones > >> when > >>>>>> other components break due to Sentry being a plugin. I was once > >>> thought > >>>> of > >>>>>> using different jars versions per plugin (i.e. guava11 on hdfs, > >>> guava14 > >>>> on > >>>>>> hive and the server, etc), but that is too much to do and not good. > >> I > >>>> like > >>>>>> the hbase idea better so we can use the latest libraries that we > >> want > >>>>>> instead of depending on other components. > >>>>>> > >>>>>> - Sergio > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Wed, Nov 15, 2017 at 11:53 AM, Alexander Kolbasov < > >>>> ak...@cloudera.com> > >>>>>> wrote: > >>>>>> > >>>>>>> Agreed, this would be a very useful thing to do. I remember > >> spending > >>> a > >>>>>> lot > >>>>>>> of time trying to make Sentry work with DataNucleus 4 - the problem > >>> was > >>>>>>> that e2e tests combine Sentry with Hive in the same JVM and this > >>>> created > >>>>>> a > >>>>>>> conflict on the DataNucleus libraries and test failures. > >>>>>>> > >>>>>>> Looking at the HBase proposal it seems to require some (small) code > >>>>>> changes > >>>>>>> for package names - or I am misreading it? Do you plan to shade all > >>>> 3-rd > >>>>>>> party packages or just some? I would consider at least > >>>>>>> Guava/ZooKeepeer/DataNucleus/Jetty. > >>>>>>> > >>>>>>> - Alex > >>>>>>> > >>>>>>> On Wed, Nov 15, 2017 at 8:24 AM, Brian Towles < > >> btow...@cloudera.com> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Howdy all, > >>>>>>>> > >>>>>>>> An issue that keeps coming up seems to be the conflict of > >> dependency > >>>>>>>> versions between Sentry and the components it is plugging into. A > >>>>>> current > >>>>>>>> example of this impact is Google Guava with hive2 using v14 and > >>> Impala > >>>>>>>> using v11 while Sentry needs to have at least v14 in order to fix > >>> some > >>>>>>> bugs > >>>>>>>> in the BoneCP library. These sort of conflicts come in whenever > >> we > >>>> are > >>>>>>>> embedding a Sentry plugin or using a Sentry library into another > >>>>>> project. > >>>>>>>> > >>>>>>>> I would like to propose a mechanism for offsetting some of these > >>>> issues > >>>>>>>> using something similar to the third party shading that HBase uses > >>> for > >>>>>>>> common problems (such as Guava) > >>>>>>>> (https://github.com/apache/hbase-thirdparty with > >>>>>>>> documentation about it > >> http://hbase.apache.org/book.html#thirdparty > >>> ) > >>>>>> and > >>>>>>>> is > >>>>>>>> used by HBase and Hadoop for packaging of their downstream > >>> artifacts ( > >>>>>>>> https://github.com/apache/hbase/blob/master/hbase-shaded/ and > >>>>>>>> https://github.com/apache/hadoop/tree/trunk/hadoop- > >>>>>>>> client-modules/hadoop-client-api/ > >>>>>>>> ) > >>>>>>>> > >>>>>>>> The main benefit of this is that it would allow Sentry to be used > >> as > >>>>>>>> libraries and plugins with all of the dependencies needed for > >> Sentry > >>>> to > >>>>>>> be > >>>>>>>> abstracted away from components implementing it. Sentry could rev > >>>>>>> versions > >>>>>>>> of libraries easier and not have collisions of library versions > >>> needed > >>>>>> by > >>>>>>>> the implementing component. As well this would potentially make > >>>> Sentry > >>>>>>>> downstream usage more stable since it would be used and tested > >>> against > >>>>>> a > >>>>>>>> static set of dependencies and not using libraries based on what > >> the > >>>>>>>> implementing component has. > >>>>>>>> > >>>>>>>> On the downside, it would make the on disk size of the Sentry > >>> plugins > >>>>>> and > >>>>>>>> libraries for downstream larger. As well, the number of classes > >>> loaded > >>>>>>> into > >>>>>>>> memory would be larger since there would be potential duplication > >> of > >>>>>>> actual > >>>>>>>> class implementations or multiple versions of a class with > >> different > >>>>>>>> package names in memory. But this seems to be a common practice > >> and > >>>>>> the > >>>>>>>> lack of library version collisions and stability make up for these > >>>>>>>> downsides. > >>>>>>>> > >>>>>>>> The third party shading works by using the Maven shade plugin to > >> do > >>>>>>> package > >>>>>>>> name shifting of the third party library (Guava, BoneCP) to a > >> sentry > >>>>>>>> specific package using the version on the third party library > >> needed > >>>>>> for > >>>>>>>> Sentry. > >>>>>>>> > >>>>>>>> E.G. *com.google.common* packages could be shifted to > >>>>>>>> *org.apache.sentry.shaded.com.google.common*. > >>>>>>>> > >>>>>>>> Since the Maven Shade plugin can actually change the byte-codes of > >>> the > >>>>>>>> libraries being shaded, we can do this even for dependencies that > >>> have > >>>>>>>> shared sub-dependencies . BoneCP being the main example since it > >>> uses > >>>>>>>> Guava internal to itself. We could shade the BoneCP into the same > >>>>>> shared > >>>>>>>> sentry third party dependency jar and since the bytecode level > >>>>>>>> manipulation. > >>>>>>>> > >>>>>>>> What this means on the development side is that we would need to > >>>>>>> reference > >>>>>>>> imports from the shaded third party > >>>>>>>> > >>>>>>>> E.G. *import com.google.common.collect.Maps* becomes *import > >>>>>>>> org.apache.sentry.shaded.com.google.common.collect.Maps*. > >>>>>>>> > >>>>>>>> Ive been looking at this in the context of > >>>>>>>> https://issues.apache.org/jira/browse/SENTRY-2044, but I feel > >> this > >>>>>>> should > >>>>>>>> potentially be something that is more of an overall Sentry > >> standard > >>>>>>>> practice and larger scale implementation. > >>>>>>>> > >>>>>>>> -=Brian > >>>>>>>> > >>>>>>>> > >>>>>>>> -- > >>>>>>>> *Brian Towles* | Software Engineer > >>>>>>>> t. (512) 415- <0000000000>8105 e. btow...@cloudera.com < > >>>>>>> j...@cloudera.com> > >>>>>>>> cloudera.com <http://www.cloudera.com/> > >>>>>>>> > >>>>>>>> [image: Cloudera] <http://www.cloudera.com/> > >>>>>>>> > >>>>>>>> [image: Cloudera on Twitter] <https://twitter.com/cloudera> > >> [image: > >>>>>>>> Cloudera on Facebook] <https://www.facebook.com/cloudera> [image: > >>>>>>> Cloudera > >>>>>>>> on LinkedIn] <https://www.linkedin.com/company/cloudera> > >>>>>>>> ------------------------------ > >>>>>>>> > >>>>>>> > >>>>>> > >>>> > >>>> -- > >>> *Brian Towles* | Software Engineer > >>> t. (512) 415- <0000000000>8105 e. btow...@cloudera.com < > >> j...@cloudera.com> > >>> cloudera.com <http://www.cloudera.com/> > >>> > >>> [image: Cloudera] <http://www.cloudera.com/> > >>> > >>> [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image: > >>> Cloudera on Facebook] <https://www.facebook.com/cloudera> [image: > >> Cloudera > >>> on LinkedIn] <https://www.linkedin.com/company/cloudera> > >>> ------------------------------ > >>> > >> > > -- > > *Brian Towles* | Software Engineer > > t. (512) 415- <0000000000>8105 e. btow...@cloudera.com < > j...@cloudera.com> > > cloudera.com <http://www.cloudera.com/> > > > > [image: Cloudera] <http://www.cloudera.com/> > > > > [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image: > > Cloudera on Facebook] <https://www.facebook.com/cloudera> [image: > Cloudera > > on LinkedIn] <https://www.linkedin.com/company/cloudera> > > ------------------------------ > > -- *Brian Towles* | Software Engineer t. (512) 415- <0000000000>8105 e. btow...@cloudera.com <j...@cloudera.com> cloudera.com <http://www.cloudera.com/> [image: Cloudera] <http://www.cloudera.com/> [image: Cloudera on Twitter] <https://twitter.com/cloudera> [image: Cloudera on Facebook] <https://www.facebook.com/cloudera> [image: Cloudera on LinkedIn] <https://www.linkedin.com/company/cloudera> ------------------------------