Re: Spark Local Pipelines

2017-05-18 Thread Asher Krim
pleasant and safe (due to possible train-serve skews) than it can be. Internally, the lack of this feature has caused debates about how appropriate Spark really is for production ML. Asher Krim Senior Software Engineer On Thu, May 18, 2017 at 4:24 AM, Cristian Opris <cristian.b.op...@gmail.com>

Re: Outstanding Spark 2.1.1 issues

2017-03-28 Thread Asher Krim
Hey Michael, any update on this? We're itching for a 2.1.1 release (specifically SPARK-14804, which is currently blocking us). Thanks, Asher Krim Senior Software Engineer On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust <mich...@databricks.com> wrote: > An update: I cut the tag for

Re: Spark Local Pipelines

2017-03-13 Thread Asher Krim
at was used to train I think this is one of those things that could live outside the project, because it's more not-Spark than Spark. Remember too that building a solution into the project blesses one at the expense of others. Asher Krim Senior Software Engineer On Mon, Mar 13, 2017 at 11:08 A

Spark Local Pipelines

2017-03-12 Thread Asher Krim
ussion on. Thanks, Asher Krim Senior Software Engineer

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Asher Krim
Congrats! Asher Krim Senior Software Engineer On Mon, Feb 13, 2017 at 6:24 PM, Kousuke Saruta <saru...@oss.nttdata.co.jp> wrote: > Congratulations, Takuya! > > - Kousuke > On 2017/02/14 7:38, Herman van Hövell tot Westerflier wrote: > > Congrats Takuya! > > On

Re: ml word2vec finSynonyms return type

2017-02-05 Thread Asher Krim
It took me a while, but I finally got around to this: https://github.com/apache/spark/pull/16811/files On Fri, Jan 6, 2017 at 4:03 AM, Asher Krim <ak...@hubspot.com> wrote: > Felix - I'm not sure I understand your example about pipeline models, > could you elaborate? I'm t

Re: MLlib mission and goals

2017-01-24 Thread Asher Krim
loss functions, etc. > > *(2) Consistent improvements to core algorithms* > A less exciting but still very important item will be constantly improving > the core set of algorithms in MLlib. This could mean speed, scaling, > robustness, and usability for the few algorithms which cover 90% of use > cases. > > There are plenty of other possibilities, and it will be great to hear the > community's thoughts! > > Thanks, > Joseph > > > > -- > > Joseph Bradley > > Software Engineer - Machine Learning > > Databricks, Inc. -- Asher Krim Senior Software Engineer

Spark 1.6.3 Driver OOM on createDataFrame

2017-01-22 Thread Asher Krim
is supposed to be based on RDDs. This makes these algorithms unusable for anything larger than toy examples in Spark < 2. If anyone is familiar with this bug, I would really appreciate it if they could point me in the direction of the PR that fixed it. Is a 1.6.4 release planned? Would it be possible to

Re: Possible bug - Java iterator/iterable inconsistency

2017-01-19 Thread Asher Krim
n't cause an API compatibility problem with respect to Java 8 >> lambdas, but if that's settled, I think this could be fixed without >> breaking the API. >> >> On Wed, Jan 18, 2017 at 8:50 PM Asher Krim <ak...@hubspot.com> wrote: >> >> In Spark 2 + Jav

Possible bug - Java iterator/iterable inconsistency

2017-01-18 Thread Asher Krim
using these constructs correctly? Is there a workaround other than converting the iterator to an iterable outside of the function? Thanks, -- Asher Krim Senior Software Engineer

Re: Why are ml models repartition(1)'d in save methods?

2017-01-16 Thread Asher Krim
> not crazy models. This model could probably easily be serialized as > individual vectors in this case. It would introduce a > backwards-compatibility issue but it's possible to read old and new > formats, I believe. > > On Fri, Jan 13, 2017 at 8:16 PM Asher Krim <ak...@hubspot.co

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
de that serializes models, which are quite small. >> For example a PCA model consists of a few principal component vectors. It's >> a Dataset of just one element being saved here. It's re-using the code path >> normally used to save big data sets, to output 1 file with 1 thing a

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
> n files. > > On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <ak...@hubspot.com> wrote: > >> Hi, >> >> I'm curious why it's common for data to be repartitioned to 1 partition >> when saving ml models: >> >> sqlContext.createDataFrame(Seq(data)).reparti

Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim
Am I missing some benefit of repartitioning like this? Thanks, -- Asher Krim Senior Software Engineer

Re: ml word2vec finSynonyms return type

2017-01-03 Thread Asher Krim
new method instead of changing the return type of > the existing one. > > > _____ > From: Asher Krim <ak...@hubspot.com> > Sent: Wednesday, December 28, 2016 11:52 AM > Subject: ml word2vec finSynonyms return type > To: <dev@spark.apache.org>

ml word2vec finSynonyms return type

2016-12-28 Thread Asher Krim
, so here we are.) Thanks, -- Asher Krim Senior Software Engineer

Re: [SPARK-15717][GraphX] status

2016-09-23 Thread Asher Krim
proposed fix? Would be good to know whether it fixes the >>> issue. >>> >>> On Thu, Sep 22, 2016 at 2:49 PM, Asher Krim <ak...@hubspot.com> wrote: >>> >>>> Does anyone know what the status of SPARK-15717 is? It's a simple >>>> enough

[SPARK-15717][GraphX] status

2016-09-22 Thread Asher Krim
Does anyone know what the status of SPARK-15717 is? It's a simple enough looking PR, but there has been no activity on it since June 16th. I believe that we are hitting that bug with checkpointed distributed LDA. It's a blocker for us and we would really appreciate getting it fixed. Jira: