Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-11 Thread Joseph Bradley
ard their proposed tech-talk at Spark + A.I summit in London. >>>>>>>> Well attended & well received.)

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Joseph Bradley
orked on immediately. Everything else please retarget to an >> appropriate release. >> >> == >> But my bug isn't fixed? >> == >> >> In order to make timely releases, we will typically not hold the >> release

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Joseph Bradley
>>>> >>>> The vote will be up for the next 72 hours. Please reply with your vote: >>>> >>>> +1: Yeah, let's go forward and implement the SPIP. >>>> +0: Don't really care. >>>> -1: I don't think this is a good idea

Re: [VOTE] SPIP ML Pipelines in R

2018-05-31 Thread Joseph Bradley
re. > > -1: I do not think this is a good idea for the following reasons. > > > > Thanks, > > --Hossein > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Revisiting Online serving of Spark models?

2018-05-21 Thread Joseph Bradley
_ > From: Felix Cheung > Sent: Thursday, May 10, 2018 10:10 AM > Subject: Re: Revisiting Online serving of Spark models? > To: Holden Karau , Joseph Bradley < > jos...@databricks.com> > Cc: dev > > > > Huge +1 on this! > > --

Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Joseph Bradley
er for other projects to build reliable serving > tools. > > I realize this maybe puts some of the folks in an awkward position with > their own commercial offerings, but hopefully if we make it easier for > everyone the commercial vendors can benefit as well. > > Cheers, >

SparkR test failures in PR builder

2018-05-02 Thread Joseph Bradley
..Error in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] ``` and suggested that it could be CRAN flakiness. I'm not familiar with CRAN, but do others have thoughts about how to fix this? Thanks! Joseph -- Joseph Bradley Software Engineer - M

Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread Joseph Bradley
there's nothing we can do. >>> >>> i'll update the list as soon as i hear something. sorry for the >>> inconvenience! >>> >>> shane >>> -- >>> Shane Knapp >>> UC Berkeley EECS Research / RISELab Staff Technical Lead >

Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Joseph Bradley
lidity before execution, for example, a matrix > multiply could check dimension match and fail fast. However, there might be > use cases for a column to contain variable shape tensors, I’m open to > discussion here. > > What do you all think? > -- > -- > Cheers, > Leif >

Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Joseph Bradley
>> > Wenchen >>> >>> - >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >>> > > > -- > Takuya UESHIN > Tokyo, Japan > > http://twitter.com/ueshin > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Spark.ml roadmap 2.3.0 and beyond

2018-03-20 Thread Joseph Bradley
ing can be useful too. On Thu, Dec 7, 2017 at 3:55 PM, Stephen Boesch wrote: > Thanks Joseph. We can wait for post 2.3.0. > > 2017-12-07 15:36 GMT-08:00 Joseph Bradley : > >> Hi Stephen, >> >> I used to post those roadmap JIRAs to share instructions for contribut

Re: [MLlib] QuantRegForest

2018-03-09 Thread Joseph Bradley
ou! > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engine

Re: Welcoming some new committers

2018-03-09 Thread Joseph Bradley
ith you all and helping out more in the future. Also, congrats to the >> other committers as well!! >> > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Spark.ml roadmap 2.3.0 and beyond

2017-12-07 Thread Joseph Bradley
e: >>> >>> 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813 >>> >>> 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581 >>> .. >>> >>> It seems those roadmaps were not available per se' for 2.3.0 and later? >>> Is there a different mechanism for that info? >>> >>> stephenb >>> >> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [ML] Migrating transformers from mllib to ml

2017-11-07 Thread Joseph Bradley
has not been done so far? Is it to avoid >> code duplication? If so, is it still an issue since we are going to >> deprecate mllib from 2.3 (at least this is what I read on Spark docs)? If >> no, I can work on this. >> >> Thanks, >> Marco >> >> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Joseph Bradley
>>>>> are >>>>>> also designing and implementing data source API v2. If designed properly, >>>>>> we can have the same data source API working for both streaming and >>>>>> batch. >>>>>> > >>>>>> > >>>>>> > Following the SPIP process, I'm putting this SPIP up for a vote. >>>>>> > >>>>>> > +1: Let's go ahead and design / implement the SPIP. >>>>>> > +0: Don't really care. >>>>>> > -1: I do not think this is a good idea for the following reasons. >>>>>> > >>>>>> > >>>>>> > >>>>>> >>>>>> >>>>>> - >>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Vaquar Khan >>>> +1 -224-436-0783 <(224)%20436-0783> >>>> Greater Chicago >>>> >>> >>> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: HashingTFModel/IDFModel in Structured Streaming

2017-10-20 Thread Joseph Bradley
> > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > --------- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: SparkR is now available on CRAN

2017-10-20 Thread Joseph Bradley
he corresponding Spark binaries. >>> https://issues.apache.org/jira/browse/SPARK-15799 has more details on this. >>> >>> Many thanks to everyone who helped put this together -- especially Felix >>> Cheung for making a number of fixes to meet the CRAN requirements

Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-27 Thread Joseph Bradley
This vote passes with 11 +1s (4 binding) and no +0s or -1s. +1: Sean Owen (binding), Holden Karau, Denny Lee, Reynold Xin (binding), Joseph Bradley (binding), Noman Khan, Weichen Xu, Yanbo Liang, Dongjoon Hyun, Matei Zaharia (binding), Vaquar Khan. Thanks everyone! Joseph On Sat, Sep 23, 2017 at 4:23 PM

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-30 Thread Joseph Bradley
Congrats! On Aug 29, 2017 9:55 AM, "Felix Cheung" wrote: > Congrats! > > -- > *From:* Wenchen Fan > *Sent:* Tuesday, August 29, 2017 9:21:38 AM > *To:* Kevin Yu > *Cc:* Meisam Fathi; dev > *Subject:* Re: Welcoming Saisai (Jerry) Shao as a committer > > Congratulation

Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-10 Thread Joseph Bradley
Congrats! On Aug 8, 2017 9:31 PM, "Minho Kim" wrote: > Congrats, Hyukjin and Sameer!! > > 2017-08-09 9:55 GMT+09:00 Sandeep Joshi : > >> Congratulations Hyukjin and Sameer ! >> >> On 7 Aug 2017 9:23 p.m., "Matei Zaharia" wrote: >> >>> Hi everyone, >>> >>> The Spark PMC recently voted to add Hyu

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-02 Thread Joseph Bradley
> fixes, documentation, and API tweaks that impact compatibility should be >> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. >> >> *But my bug isn't fixed!??!* >> >> In order to make timely releases, we will typically not hold the release >> unless the bug in question is a regression from 2.1.1. >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-13 Thread Joseph Bradley
>>> https://repository.apache.org/content/repositories/ >>>>> orgapachespark-1241/ >>>>> >>>>> The documentation corresponding to this release can be found at: >>>>> http://people.apache.org/~pwendell/spark-releases/spark- >>>

GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
eases/tag/release-0.5.0 *Docs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Thanks to all contributors and to the community for feedback! Joseph -- Joseph Bradley Software Eng

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-16 Thread Joseph Bradley
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-docs/ > >> > >> > >> FAQ > >> > >> How can I help test this release? > >> > >> If you are a Spark user, you can help us test this release by taking an > >>

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-08 Thread Joseph Bradley
rgeting 2.2.0?* >> >> Committers should look at those and triage. Extremely important bug >> fixes, documentation, and API tweaks that impact compatibility should be >> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. >> >> *But my bug isn't fixed!??!* >> >> In order to make timely releases, we will typically not hold the release >> unless the bug in question is a regression from 2.1.1. >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
an "RC0" that > can't pass (unless somehow these issue result in zero changes) and there's > value in that anyway. Just want to see if we're on the same page about > process, maybe even just say this is how we manage releases, with "RCs" > sta

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
ease candidate, then >>> reporting any regressions. >>> >>> *What should happen to JIRA tickets still targeting 2.2.0?* >>> >>> Committers should look at those and triage. Extremely important bug >>> fixes, documentation, and API tweaks that impact compatibility should be >>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. >>> >>> *But my bug isn't fixed!??!* >>> >>> In order to make timely releases, we will typically not hold the release >>> unless the bug in question is a regression from 2.1.1. >>> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Pull Request Made, Ignored So Far

2017-03-31 Thread Joseph Bradley
n the PR list until it’s too deep for anyone to be > expected to find otherwise. > > Best, > > John > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

GraphFrames 0.4.0 release, with Apache Spark 2.1 support

2017-03-28 Thread Joseph Bradley
ocs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: SPIP docs are live

2017-03-16 Thread Joseph Bradley
Awesome! Thanks for pushing this through, Cody. Joseph On Sun, Mar 12, 2017 at 1:18 AM, Sean Owen wrote: > http://spark.apache.org/improvement-proposals.html > > (Thanks Cody!) > > We should use this process where appropriate now, and we can refine it > further if needed

Re: Question on Spark's graph libraries roadmap

2017-03-15 Thread Joseph Bradley
t;>> >>>> Regards, >>>> _ >>>> *Md. Rezaul Karim*, BSc, MSc >>>> PhD Researcher, INSIGHT Centre for Data Analytics >>>> National University of Ireland, Galway >>>> IDA Business Park, Dangan, Galway, Ireland >>>> Web: http://www.reza-analytics.eu/index.html >>>> <http://139.59.184.114/index.html> >>>> >>>> On 10 March 2017 at 12:10, Robin East wrote: >>>> >>>> I would love to know the answer to that too. >>>> >>>> --- >>>> Robin East >>>> *Spark GraphX in Action* Michael Malak and Robin East >>>> Manning Publications Co. >>>> http://www.manning.com/books/spark-graphx-in-action >>>> >>>> >>>> >>>> >>>> >>>> On 9 Mar 2017, at 17:42, enzo wrote: >>>> >>>> I am a bit confused by the current roadmap for graph and graph >>>> analytics in Apache Spark. >>>> >>>> I understand that we have had for some time two libraries (the >>>> following is my understanding - please amend as appropriate!): >>>> >>>> . GraphX, part of Spark project. This library is based on RDD and it >>>> is only accessible via Scala. It doesn’t look that this library has been >>>> enhanced recently. >>>> . GraphFrames, independent (at the moment?) library for Spark. This >>>> library is based on Spark DataFrames and accessible by Scala & Python. Last >>>> commit on GitHub was 2 months ago. >>>> >>>> GraphFrames cam about with the promise at some point to be integrated >>>> in Apache Spark. >>>> >>>> I can see other projects coming up with interesting libraries and ideas >>>> (e.g. Graphulo on Accumulo, a new project with the goal of >>>> implementing the GraphBlas building blocks for graph algorithms on top >>>> of Accumulo). >>>> >>>> Where is Apache Spark going? >>>> >>>> Where are graph libraries in the roadmap? >>>> >>>> >>>> >>>> Thanks for any clarity brought to this matter. >>>> >>>> Enzo >>>> >>>> >>>> >>>> >>>> >>>> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Spark Improvement Proposals

2017-02-24 Thread Joseph Bradley
>> > >>> >>>>>> During the summit, I also had a lot of discussions over similar > >>> >>>>>> topics > >>> >>>>>> with multiple Committers and active users. I heard many > fantastic > >>> >

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Joseph Bradley
> implementation is visible and for lower level integration, > > What I tend to do is keep my own code in its package and try to do as > think a bridge over to it from the [private] scope. It's also important to > name things obviously, say, org.apache.spark.microsoft , so stack t

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-15 Thread Joseph Bradley
> been active in the SQL area and writes very small, surgical patches that >>> are high quality. Please join me in congratulating Takuya-san! >>> >>> >>> >>> >> > > > -- > Takuya UESHIN > Tokyo, Japan > > http://twitter.com/ueshin > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

PSA: Java 8 unidoc build

2017-02-06 Thread Joseph Bradley
others who have made many fixes for this! See these sample PRs for some issues causing failures (especially around links): https://github.com/apache/spark/pull/16741 https://github.com/apache/spark/pull/16604 Thanks, Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc

Re: Feedback on MLlib roadmap process proposal

2017-01-26 Thread Joseph Bradley
che > projects, committers do not take requests. They pursue the work they > believe needs doing, and shepherd work initiated by others (a clear bug > report, a PR) to a resolution. Things get done by doing them, or by > building influence by doing other things the project needs doing. It isn't > a mechanical, objective process, and can't be. But it does work in a > recognizable way. > >> -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: MLlib mission and goals

2017-01-24 Thread Joseph Bradley
andwidth_and_Machine_Balance_in_Current_High_Performance_Computers> > > -- > View this message in context: Re: MLlib mission and goals > <http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-mission-and-goals-tp20715p20754.html> > Sent from the Apache Spark Developers List mailing list archive > <http://apache-spark-developers-list.1001551.n3.nabble.com/> at > Nabble.com. > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Joseph Bradley
d most recently Structured Streaming. > > > > > > Holden has been a long time Spark contributor and evangelist. She has > > > written a few books on Spark, as well as frequent contributions to the > > > Python API to improve its usability and performance. > > > > > > Please join me in welcoming the two! > > > > > > > > > > > > > > > > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

MLlib mission and goals

2017-01-23 Thread Joseph Bradley
lenty of other possibilities, and it will be great to hear the community's thoughts! Thanks, Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Feedback on MLlib roadmap process proposal

2017-01-23 Thread Joseph Bradley
with the contributor > to make sure it lands with the target release. > > I'm sure Joseph can explain it better than I do ;) > > > _ > From: Mingjie Tang > Sent: Thursday, January 19, 2017 10:30 AM > Subject: Re: Feedback on MLlib roadm

Feedback on MLlib roadmap process proposal

2017-01-17 Thread Joseph Bradley
munication. * This is fairly orthogonal to the SIP discussion since this proposal is more about setting release targets than about proposing future plans. Thanks! Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
t(sc.hadoopConfiguration)* >>>>> * .delete(new Path(s"${checkpointDir.get}/${iteration - >>>>> checkpointInterval}"), true)* >>>>> } >>>>> >>>>> System.gc() // hint Spark to clean shuffle directories >>>>> } >>>>> >>>>> >>>>> Thanks >>>>> Ankur >>>>> >>>>> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung < >>>>> felixcheun...@hotmail.com> wrote: >>>>> >>>>>> Do you have more of the exception stack? >>>>>> >>>>>> >>>>>> -- >>>>>> *From:* Ankur Srivastava >>>>>> *Sent:* Wednesday, January 4, 2017 4:40:02 PM >>>>>> *To:* u...@spark.apache.org >>>>>> *Subject:* Spark GraphFrame ConnectedComponents >>>>>> >>>>>> Hi, >>>>>> >>>>>> I am trying to use the ConnectedComponent algorithm of GraphFrames >>>>>> but by default it needs a checkpoint directory. As I am running my spark >>>>>> cluster with S3 as the DFS and do not have access to HDFS file system I >>>>>> tried using a s3 directory as checkpoint directory but I run into below >>>>>> exception: >>>>>> >>>>>> Exception in thread "main"java.lang.IllegalArgumentException: Wrong >>>>>> FS: s3n://, expected: file:/// >>>>>> >>>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) >>>>>> >>>>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF >>>>>> ileSystem.java:69) >>>>>> >>>>>> If I set checkpoint interval to -1 to avoid checkpointing the driver >>>>>> just hangs after 3 or 4 iterations. >>>>>> >>>>>> Is there some way I can set the default FileSystem to S3 for Spark or >>>>>> any other option? >>>>>> >>>>>> Thanks >>>>>> Ankur >>>>>> >>>>>> >>>>> >>>> >>>> >>>> >>> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: ml word2vec finSynonyms return type

2017-01-05 Thread Joseph Bradley
; From: Asher Krim > Sent: Tuesday, January 3, 2017 11:58 PM > Subject: Re: ml word2vec finSynonyms return type > To: Felix Cheung > Cc: , Joseph Bradley < > jos...@databricks.com>, > > > > The jira: https://issues.apache.org/jira/browse/SPARK-17629 > > Add

Re: Spark Improvement Proposals

2017-01-03 Thread Joseph Bradley
>> >>> >>> to > >> >>> >>> >>> run > >> >>> >>> >>> SQL > >> >>> >>> >>> commands on stream but do we really have time to do SQL > >> >>> >>>

Re: mllib metrics vs ml evaluators and how to improve apis for users

2017-01-02 Thread Joseph Bradley
_]): RegressionEvaluation (or > classification/multiclass etc) > > > > where the evaluation class returned will have very similar fields to the > corresponding mllib RegressionMetrics class that can be called by the user. > > > > Any thoughts/ideas about spark ml evaluators/

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Joseph Bradley
this release can be found at: >>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/ >>>> >>>> >>>> *FAQ* >>>> >>>> *How can I help test this release?* >>>> >>>> If you are a Spark user, you can h

Please limit commits for branch-2.1

2016-11-21 Thread Joseph Bradley
committing API changes to master (not branch-2.1). Thanks everyone! Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Joseph Bradley
Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for examples: ml.feature.Tokenizer ml.regression.IsotonicRegression You should not need to put your library in Spark's namespace. The shared Params in SPARK-7146 are not necessar

Re: Reduce the memory usage if we do same first in GradientBoostedTrees if subsamplingRate< 1.0

2016-11-15 Thread Joseph Bradley
Thanks for the suggestion. That would be faster, but less accurate in most cases. It's generally better to use a new random sample on each iteration, based on literature and results I've seen. Joseph On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei < wangjianfe...@otcaix.iscas.ac.cn> wrote: > when
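
The contrast in the reply above (sample once and reuse it, versus draw a new random sample on each boosting iteration) can be sketched in plain Python. This is only an illustration, not Spark's GradientBoostedTrees code: the function names, the Bernoulli row sampling, and the per-iteration seeding scheme are all assumptions made for the example.

```python
import random

def subsample_indices(n_rows, subsampling_rate, seed):
    """Keep each row index independently with probability subsampling_rate."""
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < subsampling_rate]

def per_iteration_samples(n_rows, subsampling_rate, num_iterations, base_seed=0):
    """Draw a fresh subsample for every iteration (hypothetical seeding: base_seed + iteration)."""
    return [
        subsample_indices(n_rows, subsampling_rate, base_seed + iteration)
        for iteration in range(num_iterations)
    ]

def reused_sample(n_rows, subsampling_rate, num_iterations, seed=0):
    """Draw one subsample and reuse it for every iteration (the suggestion being answered)."""
    sample = subsample_indices(n_rows, subsampling_rate, seed)
    return [sample for _ in range(num_iterations)]

fresh = per_iteration_samples(n_rows=1000, subsampling_rate=0.5, num_iterations=3)
reused = reused_sample(n_rows=1000, subsampling_rate=0.5, num_iterations=3)

# Rows seen by at least one iteration under each scheme.
rows_seen_fresh = set().union(*fresh)
rows_seen_reused = set().union(*reused)
```

With fresh samples, later iterations see rows the first one missed, so more of the dataset contributes to the ensemble. Reusing one sample saves memory and time but restricts every tree to the same subset.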

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Joseph Bradley
+1 On Fri, Nov 4, 2016 at 11:20 AM, Michael Armbrust wrote: > +1 > > On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a >> majority of at

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Joseph Bradley
+1 On Thu, Nov 3, 2016 at 9:51 PM, Kousuke Saruta wrote: > +1 (non-binding) > > - Kousuke > > On 2016/11/03 9:40, Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a >> maj

Re: [ML]Random Forest Error : Size exceeds Integer.MAX_VALUE

2016-10-04 Thread Joseph Bradley
Could you please file a bug report JIRA and also include more info about what you ran? * Random forest Param settings * dataset dimensionality, partitions, etc. Thanks! On Tue, Oct 4, 2016 at 10:44 PM, Samkit Shah wrote: > Hello folks, > I am running Random Forest from ml from spark 1.6.1 on bim

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Joseph Bradley
Congrats! On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta wrote: > Congratulations Xiao! > > - Kousuke > On 2016/10/05 7:44, Bryan Cutler wrote: > > Congrats Xiao! > > On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau > wrote: > >> Congratulations :D :) Yay! >> >> On Tue, Oct 4, 2016 at 11:14 AM, Su

Re: Nominal Attribute

2016-10-03 Thread Joseph Bradley
There are plans...but not concrete ones yet: https://issues.apache.org/jira/browse/SPARK-8515 I agree categorical data handling is a pain point and that we need to improve it! On Tue, Sep 13, 2016 at 4:45 PM, Danil Kirsanov wrote: > NominalAttribute in MLib is used to represent categorical data

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Joseph Bradley
+1 On Thu, Sep 29, 2016 at 2:11 PM, Dongjoon Hyun wrote: > +1 (non-binding) > > At this time, I tested RC4 on the followings. > > - CentOS 6.8 (Final) > - OpenJDK 1.8.0_101 > - Python 2.7.12 > > /build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver > -Dpyspark -Dsparkr -DskipTe

Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Joseph Bradley
+1 for 4 months. With QA taking about a month, that's very reasonable. My main ask (especially for MLlib) is for contributors and committers to take extra care not to delay on updating the Programming Guide for new APIs. Documentation debt often collects and has to be paid off during QA, and a l

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Joseph Bradley
+1 On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee wrote: > +1 (non-binding) > On Sun, Sep 25, 2016 at 23:20 Jeff Zhang wrote: > >> +1 >> >> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> +1 >>> >>> On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee >>> wrot

Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it: https://github.com/graphframes/graphframes/releases/tag/release-0.2.0 Thanks for the reminder! Joseph On Wed, Aug 24, 2016 at 10:11 AM, Maciej Bryński wrote: > Hi, > Do you plan to add tag for this release on github ? > https://github.com/graphframes/graphframes/releases > > R

Re: Welcoming Felix Cheung as a committer

2016-08-16 Thread Joseph Bradley
Welcome Felix! On Mon, Aug 15, 2016 at 6:16 AM, mayur bhole wrote: > Congrats Felix! > > On Mon, Aug 15, 2016 at 2:57 PM, Paul Roy wrote: > >> Congrats Felix >> >> Paul Roy. >> >> On Mon, Aug 8, 2016 at 9:15 PM, Matei Zaharia >> wrote: >> >>> Hi all, >>> >>> The PMC recently voted to add Felix

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Joseph Bradley
+1 Mainly tested ML/Graph/R. Perf tests from Tim Hunter showed minor speedups from 1.6 for common ML algorithms. On Thu, Jul 21, 2016 at 9:41 AM, Ricardo Almeida < ricardo.alme...@actnowib.com> wrote: > +1 (non binding) > > Tested PySpark Core, DataFrame/SQL, MLlib and Streaming on a standalone

Re: Hello

2016-06-20 Thread Joseph Bradley
Hi Harmeet, I'll add one more item to the other advice: The community is in the process of putting together a roadmap JIRA for 2.1 for ML: https://issues.apache.org/jira/browse/SPARK-15581 This JIRA lists some of the major items and links to a few umbrella JIRAs with subtasks. I'd expect this ro

Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
One more note: When you specify the stages in the Pipeline, they need to be in topological order according to the DAG. On Sun, Jun 12, 2016 at 10:47 AM, Joseph Bradley wrote: > Hi Pranay, > > Yes, you can do this. The DAG structure should be specified via the > various Transformer
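
The ordering requirement in the note above can be illustrated without Spark. In this minimal sketch a stage is a hypothetical (name, input columns, output columns) tuple standing in for a Transformer, and the stage names and column names are made up. The check walks the stages in the order given and reports the first one whose inputs are not yet produced.

```python
from typing import List, Optional, Sequence, Set, Tuple

# Hypothetical stand-in for a pipeline stage: (name, input columns, output columns).
Stage = Tuple[str, Sequence[str], Sequence[str]]

def first_out_of_order_stage(stages: List[Stage], dataset_columns: Set[str]) -> Optional[str]:
    """Return the name of the first stage whose inputs are not available yet, or None."""
    available = set(dataset_columns)
    for name, inputs, outputs in stages:
        if any(column not in available for column in inputs):
            return name
        # Columns produced by this stage become available to later stages.
        available.update(outputs)
    return None

stages = [
    ("tokenizer", ["text"], ["words"]),
    ("hashing_tf", ["words"], ["features"]),
    ("classifier", ["features", "label"], ["prediction"]),
]

in_order = first_out_of_order_stage(stages, {"text", "label"})
swapped = first_out_of_order_stage([stages[1], stages[0], stages[2]], {"text", "label"})
```

Here `in_order` is None because each stage only reads columns from the dataset or from an earlier stage, while `swapped` names `hashing_tf`, which would run before the stage that produces its `words` input.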

Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
Hi Pranay, Yes, you can do this. The DAG structure should be specified via the various Transformers' input and output columns, where a Transformer can have multiple input and/or output columns. Most of the classification and regression Models are good examples of Transformers with multiple input

Re: Welcoming Yanbo Liang as a committer

2016-06-12 Thread Joseph Bradley
Congrats & welcome! On Tue, Jun 7, 2016 at 7:15 AM, Xiangrui Meng wrote: > Congrats!! > > On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali > wrote: > >> Congratulations Yanbo Liang! Well deserved. >> >> >> On Sun, Jun 5, 2016 at 7:10 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>

Re: Shrinking the DataFrame lineage

2016-06-12 Thread Joseph Bradley
GraphFrames? Suppose, I want to >> use aggregateMessages in the iterative loop, for implementing PageRank. >> >> >> >> Best regards, Alexander >> >> >> >> *From:* Joseph Bradley [mailto:jos...@databricks.com] >> *Sent:* Friday, May 13, 20

Re: Implementing linear albegra operations in the distributed linalg package

2016-06-10 Thread Joseph Bradley
I agree that more distributed matrix ops would be good to have, but I think there are a few things which need to happen first: * Now that the spark.ml package has local linear algebra separate from the spark.mllib package, we should migrate the distributed linear algebra implementations over to spa

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Joseph Bradley
+1 On Wed, May 18, 2016 at 10:49 AM, Reynold Xin wrote: > Hi Ovidiu-Cristian , > > The best source of truth is change the filter with target version to > 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we > get closer to 2.0 release, more will be retargeted at 2.1.0. > > >

Re: Shrinking the DataFrame lineage

2016-05-13 Thread Joseph Bradley
Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346 I don't have a great method currently, but hacks can get around it: convert the DataFrame to an RDD and back to truncate the query plan lineage. Joseph On Wed, May 11, 2016 at 12:46 PM, Ulanov, Alexander < alexander.ula...@h

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-27 Thread Joseph Bradley
Do you have code which can reproduce this performance drop in treeReduce? It would be helpful to debug. In the 1.6 release, we profiled it via the various MLlib algorithms and did not see performance drops. It's not just renumbering the partitions; it is reducing the number of partitions by a fac

Re: net.razorvine.pickle.PickleException in Pyspark

2016-04-25 Thread Joseph Bradley
Thanks for your work on this. Can we continue discussing on the JIRA? On Sun, Apr 24, 2016 at 9:39 AM, Caique Marques wrote: > Hello, everyone! > > I'm trying to implement the association rules in Python. I got implement > an association by a frequent element, works as expected (example can be

Re: Organizing Spark ML example packages

2016-04-20 Thread Joseph Bradley
Sounds good to me. I'd request we be strict during this process about requiring *no* changes to the example itself, which will make review easier. On Tue, Apr 19, 2016 at 11:12 AM, Bryan Cutler wrote: > +1, adding some organization would make it easier for people to find a > specific example >

Re: Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-12 Thread Joseph Bradley
That sounds useful. Would you mind creating a JIRA for it? Thanks! Joseph On Mon, Apr 11, 2016 at 2:06 AM, Rahul Tanwani wrote: > Hi, > > Currently the RandomForest algo takes a single maxBins value to decide the > number of splits to take. This sometimes causes training time to go very > high

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1 By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591 On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia wrote: > This sounds good to me as well. The one thing we should pay attention to > is how we update the docs so that people know to start w

Re: running lda in spark throws exception

2016-04-04 Thread Joseph Bradley
llection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > >> >> >>> > at > >> >> >>> > > >> >> >>> > > >> >> >>> > > org.apache.spark.mllib.clustering.DistributedLDAM

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-29 Thread Joseph Bradley
y RDD/DataFrame space. >>> > >>> > So, to promote a more extensive use of Pipelines, PipelineStages, and >>> > Transformers, I was thinking about moving that part to SQL/DataFrame >>> > API where they really belong. If not, I think people might miss the >

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-25 Thread Joseph Bradley
There have been some comments about using Pipelines outside of ML, but I have not yet seen a real need for it. If a user does want to use Pipelines for non-ML tasks, they still can use Transformers + PipelineModels. Will that work? On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski wrote: > Hi,

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
I'd assume that in most cases I simply won't hit it, but the depth > of the tree would be much more, than 30. > > > -- > Be well! > Jean Morozov > > On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley > wrote: > >> Hi Eugene, >> >> The maxD

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users, I want to bring attention to a proposal to merge the MLlib (spark.ml) concepts of Estimator and Model in Spark 2.0. Please comment & discuss on SPARK-14033 (not in this email thread). *TL;DR:* *Proposal*: Merge Estimator and

Re: pull request template

2016-03-15 Thread Joseph Bradley
+1 for keeping the template I figure any template will require conscientiousness & enforcement. On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen wrote: > The template is a great thing as it gets instructions even more right > in front of people. > > Another idea is to just write a checklist of items,

Re: Welcoming two new committers

2016-02-08 Thread Joseph Bradley
Congrats & welcome! On Mon, Feb 8, 2016 at 12:19 PM, Ram Sriharsha wrote: > great job guys! congrats and welcome! > > On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan wrote: > >> Welcome. >> >> On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati < >> suresh.thalam...@gmail.com> wrote: >> >>> Congratul

Re: Adding Naive Bayes sample code in Documentation

2016-01-29 Thread Joseph Bradley
JIRA created! https://issues.apache.org/jira/browse/SPARK-13089 Feel free to pick it up if you're interested. : ) Joseph On Wed, Jan 27, 2016 at 8:43 AM, Vinayak Agrawal wrote: > Hi, > I was reading through Spark ML package and I couldn't find Naive Bayes > examples documented on the spark doc

Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
Hi, This is more a question for the user list, not the dev list, so I'll CC user. If you're using mllib.clustering.LDAModel (RDD API), then can you make sure you're using a LocalLDAModel (or convert to it from DistributedLDAModel)? You can then call topicDistributions() on the new data. If you'r

Re: running lda in spark throws exception

2015-12-29 Thread Joseph Bradley
Hi Li, I'm wondering if you're running into the same bug reported here: https://issues.apache.org/jira/browse/SPARK-12488 I haven't figured out yet what is causing it. Do you have a small corpus which reproduces this error, and which you can share on the JIRA? If so, that would help a lot in de

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a problem with the Parquet dependency. What version of Parquet are you building Spark 1.5 off of? (I'm not that familiar with Parquet issues myself, but hopefully a SQL person can chime in.) On Tue, Dec 15, 2015 at 3:23 PM, Rac

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Joseph Bradley
+1 On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin wrote: > +1 > > > On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra > wrote: > >> +1 >> >> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust > > wrote: >> >>> Please vote on releasing the following candidate as Apache Spark version >>> 1.6.0! >>> >>>

Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
Hi Eugene, The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree. This simplified the implementation. I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap. There is not
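To make the node ID point concrete, here is a minimal Python sketch. It assumes heap-style numbering in which the root has ID 1 and a node with ID i has children 2i and 2i + 1; that numbering is an assumption for illustration, since the message only says that IDs correspond to positions in the binary tree. The function names are hypothetical. Under this assumption, a 32-bit signed integer can address every node down to depth 30 but not depth 31.

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def left_child(node_id):
    """ID of the left child under the assumed heap-style numbering."""
    return 2 * node_id


def right_child(node_id):
    """ID of the right child under the assumed heap-style numbering."""
    return 2 * node_id + 1


def depth_of(node_id):
    """Depth of a node, with the root (ID 1) at depth 0."""
    return node_id.bit_length() - 1


def max_id_at_depth(depth):
    """ID of the rightmost node at the given depth."""
    return 2 ** (depth + 1) - 1


def deepest_addressable_depth(limit=INT_MAX):
    """Largest depth whose rightmost node ID still fits within limit."""
    depth = 0
    while max_id_at_depth(depth + 1) <= limit:
        depth += 1
    return depth


if __name__ == "__main__":
    print(deepest_addressable_depth())
```

With position-based IDs the tree does not need explicit parent or child pointers, which is one way such a scheme can simplify an implementation; the cost is a hard cap on depth set by the integer width.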

Re: BIRCH clustering algorithm

2015-12-15 Thread Joseph Bradley
Hi Dzeno, I'm not familiar with the algorithm myself, but if you have an important use case for it, you could open a JIRA to discuss it. However, if it is a less common algorithm, I'd recommend first submitting it as a Spark package (but publicizing the package on the user list). If it gains tra

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Joseph Bradley
+1 Ran all tests locally on Mac OS X, and MLlib with large workloads on a cluster. On Sat, Dec 12, 2015 at 6:58 PM, Burak Yavuz wrote: > +1 tested SparkSQL and Streaming on some production sized workloads > > On Sat, Dec 12, 2015 at 4:16 PM, Mark Hamstra > wrote: > >> +1 >> >> On Sat, Dec 12, 2

Re: [ML] Missing documentation for the IndexToString feature transformer

2015-12-05 Thread Joseph Bradley
Thanks for reporting this! I just added a JIRA: https://issues.apache.org/jira/browse/SPARK-12159 That would be great if you could send a PR for it; thanks! Joseph On Sat, Dec 5, 2015 at 5:02 AM, Benjamin Fradet wrote: > Hi, > > I was wondering why the IndexToString >

Re: Python API for Association Rules

2015-12-02 Thread Joseph Bradley
If you're working on a feature, please comment on the JIRA first (to avoid conflicts / duplicate work). Could you please copy what you wrote to the JIRA to discuss there? Thanks, Joseph On Wed, Dec 2, 2015 at 4:51 AM, caiquermarques95 wrote: > Hello everyone! > I'm developing to the Python API

Re: Problem in running MLlib SVM

2015-12-01 Thread Joseph Bradley
. > > On Mon, Nov 30, 2015 at 6:33 PM, Joseph Bradley > wrote: > >> model.predict should return a 0/1 predicted label. The example code is >> misleading when it calls the prediction a "score." >> >> On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem wr

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
ns-1 >>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" wrote: >>> >>>> Hi Joseph, >>>> >>>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'm getting >>>> a "rawPredictionCol field does not exist exception"

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+. On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote: > > Hi folks, > > Does anyone know whether the Grid Search capability is enabled since the > issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol > column doesn't exist" when trying to perform a g

Re: Problem in running MLlib SVM

2015-11-30 Thread Joseph Bradley
model.predict should return a 0/1 predicted label. The example code is misleading when it calls the prediction a "score." On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem wrote: > You should never use the training data to measure your prediction > accuracy. Always use a fresh dataset (test data)
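The difference between a predicted label and a "score" can be sketched in plain Python. This is an illustration, not Spark's API: it assumes a linear model whose raw score is the margin w·x + b and a decision threshold of 0.0, and the names raw_score and predict_label are hypothetical.

```python
def raw_score(weights, intercept, features):
    """Raw margin w.x + b of a linear model (a real-valued score)."""
    return sum(w * x for w, x in zip(weights, features)) + intercept


def predict_label(weights, intercept, features, threshold=0.0):
    """Thresholded prediction: 1 if the margin exceeds threshold, else 0."""
    return 1 if raw_score(weights, intercept, features) > threshold else 0


if __name__ == "__main__":
    weights, intercept = [1.0, -2.0], 0.5
    for features in ([3.0, 1.0], [0.0, 1.0]):
        print(raw_score(weights, intercept, features),
              predict_label(weights, intercept, features))
```

A metric that expects real-valued scores, such as area under the ROC curve, loses information if it is given thresholded 0/1 labels, which is why calling the label a "score" is misleading.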

Re: Unhandled case in VectorAssembler

2015-11-20 Thread Joseph Bradley
Yes, please, could you send a JIRA (and PR)? A custom error message would be better. Thank you! Joseph On Fri, Nov 20, 2015 at 2:39 PM, BenFradet wrote: > Hey there, > > I noticed that there is an unhandled case in the transform method of > VectorAssembler if one of the input columns doesn't ha

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > > I have a random

Re: Unchecked contribution (JIRA and PR)

2015-11-16 Thread Joseph Bradley
Hi Sergio, Apart from apologies about limited review bandwidth (from me too!), I wanted to add: It would be interesting to hear what feedback you've gotten from users of your package. Perhaps you could collect feedback by (a) emailing the user list and (b) adding a note in the Spark Packages poin

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based methods.
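One common reading of the "sorting method" quoted above is to order the categories by their mean label and then treat the resulting rank as an ordered feature. The Python sketch below assumes that reading; the quoted message does not spell out the method, and the function name is hypothetical. It is not a Spark ML Transformer.

```python
from collections import defaultdict


def order_categories_by_mean_label(categories, labels):
    """Map each category to its rank when sorted by mean label.

    Ties are broken by the category's string form so the result is
    deterministic.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        totals[category] += label
        counts[category] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    ordered = sorted(means, key=lambda c: (means[c], str(c)))
    return {category: rank for rank, category in enumerate(ordered)}


if __name__ == "__main__":
    cats = ["a", "b", "a", "c", "b", "c"]
    ys = [1, 0, 1, 0, 0, 1]
    print(order_categories_by_mean_label(cats, ys))
```

Once the categories are ranked this way, a tree learner only needs to consider the k - 1 splits along that order instead of every subset of the k categories.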

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Joseph Bradley
That sounds useful; would you mind submitting a JIRA (and a PR if you're willing)? Thanks, Joseph On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier wrote: > Hi, > > MLUtils.loadLibSVMFile verifies that indices are 1-based and > increasing, and otherwise triggers an error. I'd like to suggest that
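The check described in the quoted message (indices must be 1-based and increasing) can be sketched in plain Python with an error message that names the offending line and indices. This is only an illustration of the idea, not the MLUtils.loadLibSVMFile implementation; parse_libsvm_line and its line_number parameter are hypothetical.

```python
def parse_libsvm_line(line, line_number=None):
    """Parse one LIBSVM-format line into (label, indices, values).

    Raises ValueError with a descriptive message if an index is below 1
    or if the indices are not strictly increasing.
    """
    where = "line %s" % line_number if line_number is not None else "input"
    tokens = line.split()
    if not tokens:
        raise ValueError("%s: empty record" % where)
    label = float(tokens[0])
    indices, values = [], []
    previous = 0  # indices are 1-based, so 0 is below any valid index
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index < 1:
            raise ValueError(
                "%s: index %d is not 1-based in %r" % (where, index, line))
        if index <= previous:
            raise ValueError(
                "%s: index %d follows %d, indices must be increasing in %r"
                % (where, index, previous, line))
        indices.append(index)
        values.append(float(value_text))
        previous = index
    return label, indices, values


if __name__ == "__main__":
    print(parse_libsvm_line("1 1:0.5 3:2.0", line_number=1))
```

Including the line number and the raw record in the message lets a user find the bad row in a large input file without re-parsing it by hand.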
