Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from
SPARK-19498 specifically to discuss opening up sharedParams traits.


On Fri, 3 Mar 2017 at 23:17 Shouheng Yi <sho...@microsoft.com.invalid>
wrote:

> Hi Spark dev list,
>
>
>
> Thank you all so much for your input; we really appreciate the
> suggestions. After some discussion within the team, we decided to stay
> under the Apache namespace for now and attach comments explaining what we
> did and why.
>
>
>
> As the Spark dev list kindly pointed out, this is an existing issue
> documented in JIRA ticket SPARK-19498 [0]. We can follow that ticket to
> see whether any newly suggested practices should be adopted in the future
> and make the corresponding fixes.
>
>
>
> Best,
>
> Shouheng
>
>
>
> [0] https://issues.apache.org/jira/browse/SPARK-19498
>
>
>
> *From:* Tim Hunter [mailto:timhun...@databricks.com]
> *Sent:* Friday, February 24, 2017 9:08 AM
> *To:* Joseph Bradley <jos...@databricks.com>
> *Cc:* Steve Loughran <ste...@hortonworks.com>; Shouheng Yi <
> sho...@microsoft.com.invalid>; Apache Spark Dev <dev@spark.apache.org>;
> Markus Weimer <mwei...@microsoft.com>; Rogan Carr <roc...@microsoft.com>;
> Pei Jiang <pej...@microsoft.com>; Miruna Oprescu <mopre...@microsoft.com>
> *Subject:* Re: [Spark Namespace]: Expanding Spark ML under Different
> Namespace?
>
>
>
> Regarding logging, GraphFrames makes a simple wrapper this way:
>
>
>
>
> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala
>
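> For reference, a minimal sketch of such a wrapper, written against slf4j
> directly so it needs nothing from Spark's private Logging trait (the names
> here are illustrative rather than GraphFrames' exact code):
>
>   import org.slf4j.{Logger, LoggerFactory}
>
>   // Logging trait in your own namespace; mix it into your estimators and
>   // transformers instead of org.apache.spark.internal.Logging.
>   trait Logging {
>     @transient private lazy val logger: Logger =
>       LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))
>
>     protected def logInfo(msg: => String): Unit =
>       if (logger.isInfoEnabled) logger.info(msg)
>
>     protected def logWarning(msg: => String): Unit =
>       if (logger.isWarnEnabled) logger.warn(msg)
>
>     protected def logDebug(msg: => String): Unit =
>       if (logger.isDebugEnabled) logger.debug(msg)
>   }
>
> Since Spark already ships slf4j, this adds no new dependency.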
>
>
> Regarding the UDTs, they have been hidden so they can be reworked for
> Datasets; the reasons are detailed in [1]. Can you describe your use case
> in more detail? Depending on your use case, you may be better off copying
> the UDT code outside of Spark.
>
>
>
> [1] https://issues.apache.org/jira/browse/SPARK-14155
>
>
>
> On Thu, Feb 23, 2017 at 3:42 PM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
> +1 for Nick's comment about discussing APIs which need to be made public
> in https://issues.apache.org/jira/browse/SPARK-19498 !
>
>
>
> On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>
>
> On 22 Feb 2017, at 20:51, Shouheng Yi <sho...@microsoft.com.INVALID>
> wrote:
>
>
>
> Hi Spark developers,
>
>
>
> Currently my team at Microsoft is extending Spark’s machine learning
> functionality to include new learners and transformers. We would like
> users to use these within Spark pipelines so that they can mix and match
> with existing Spark learners/transformers, and overall have a native Spark
> experience. We cannot accomplish this using a non-“org.apache” namespace
> with the current implementation, and we don’t want to release code inside
> the Apache namespace because it’s confusing and there could be naming
> rights issues.
>
>
>
> This isn't actually something the ASF has a strong stance against; it's
> more left to the projects themselves. After all, the source is licensed by
> the ASF, and the license doesn't say you can't.
>
>
>
> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
> Hive team kept things package-private, though that's really a sign that
> things could be improved there.
>
>
>
> Where it is problematic is that stack traces end up blaming the wrong
> group; nobody likes getting a bug report for a bug which doesn't actually
> exist in their codebase, not least because you have to waste time just
> working that out.
>
>
>
> You also have to expect absolutely no stability guarantees, so you'd
> better set your nightly build to work against trunk.
>
>
>
> Apache Bahir does put some stuff into org.apache.spark.stream, but they've
> sort of inherited that right when they picked up the code from Spark. New
> stuff is going into org.apache.bahir.
>
>
>
>
>
> We need to extend several classes from Spark which happen to be
> “private[spark].” For example, one of our classes extends VectorUDT [0],
> which is declared as private[spark] class VectorUDT. This unfortunately
> puts us in a strange scenario that forces us to work under the namespace
> org.apache.spark.
>
>
>
> To be specific, the private classes/traits we currently need in order to
> create new Spark learners and transformers are HasInputCol, VectorUDT and
> Logging. We will expand this list as we develop more; a hypothetical
> sketch of the alternative we are trying to avoid follows below.
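>
> (For illustration, and assuming nothing beyond Spark’s public Param/Params
> API: re-implementing one of these shared params in our own namespace would
> look roughly like the sketch below, with a made-up package name. Something
> similar works for Logging, but not for VectorUDT, whose underlying
> UserDefinedType API is itself private.)
>
>   package com.microsoft.ml.param  // hypothetical package name
>
>   import org.apache.spark.ml.param.{Param, Params}
>
>   // Our own copy of the shared "inputCol" param, so that transformers
>   // defined outside org.apache.spark can still expose the familiar API.
>   trait HasInputCol extends Params {
>
>     final val inputCol: Param[String] =
>       new Param[String](this, "inputCol", "input column name")
>
>     final def getInputCol: String = $(inputCol)
>   }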
>
>
>
> I do think it's a shame that Logging went from public to private.
>
>
>
> One thing that could be done there is to copy the logging into Bahir,
> under an org.apache.bahir package, for yourself and others to use. That'd
> be beneficial to me too.
>
>
>
> For the ML stuff, that might be a place to work too, if you are going to
> open-source the code.
>
>
>
>
>
>
>
> Is there a way to avoid this namespace issue? What do other
> people/companies do in this scenario? Thank you for your help!
>
>
>
> I've hit this problem in the past. Scala code tends to force your hand
> here precisely because of that (very nice) private feature. While it gives
> a project the ability to guarantee that implementation details aren't
> picked up where they weren't intended to be, in OSS development all that
> implementation is visible, and for lower-level integration it is often
> exactly what you need to get at.
>
>
>
> What I tend to do is keep my own code in its own package and try to do as
> thin a bridge over to it from the [private] scope as possible. It's also
> important to name things obviously, say, org.apache.spark.microsoft, so
> stack traces in bug reports can be dealt with more easily.
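>
> A rough sketch of what such a bridge might look like (package and names
> made up; it assumes the shared-param traits stay visible from sub-packages
> of org.apache.spark.ml and that SQLDataTypes.VectorType remains the public
> handle on the vector type):
>
>   // Thin bridge kept under an obviously-named package; everything else
>   // lives in your own namespace and just mixes these in or calls them.
>   package org.apache.spark.ml.microsoft
>
>   import org.apache.spark.ml.linalg.SQLDataTypes
>   import org.apache.spark.ml.param.shared.HasInputCol
>   import org.apache.spark.sql.types.DataType
>
>   // Public trait re-exposing the ml-private shared param, so transformers
>   // living outside org.apache.spark can simply mix this in.
>   trait HasInputColBridge extends HasInputCol
>
>   // The VectorUDT instance is already reachable through this public field.
>   object VectorTypeBridge {
>     val vectorType: DataType = SQLDataTypes.VectorType
>   }
>
> Stack traces that go through the bridge then name microsoft explicitly,
> which keeps the blame in bug reports pointing at the right team.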
>
>
>
>
>
> [0]:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala
>
>
>
> Best,
>
> Shouheng
>
>
>
>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>
>
>
