Re: Adding abstraction in MLlib
It's good that Databricks is working on this issue! However, the current process is not very clear to outsiders.

- The last update on this ticket is August 5. If all this time has been active development, I'm concerned that such a long stretch without community feedback can let the development drift in the wrong direction.
- Even if it ends up being one great big patch, introducing the new interfaces to the community as soon as possible would let us start working on our pipeline code. It would let us write algorithms in the new paradigm instead of in the lack of any paradigm, as before, and it would let us help you port the old code to the new paradigm.

My main point: shorter iterations with more transparency. I think it would be a good idea to create a pull request with the code you have so far, even if it doesn't pass tests, just so we can comment on it before it is formulated in a design doc.

2014-09-13 0:00 GMT+04:00 Patrick Wendell pwend...@gmail.com:

We typically post design docs on JIRAs before major work starts. For instance, I'm pretty sure SPARK-1856 will have a design doc posted shortly.

On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson e...@redhat.com wrote:

Are interface designs being captured anywhere as documents that the community can follow along with as the proposals evolve? I've worked on other open source projects where design docs were published as living documents (e.g. on Google Docs or Etherpad, but the particular mechanism isn't crucial). FWIW, I found that to be a good way to work in a community environment.

- Original Message -

Hi Egor,

Thanks for the feedback! We are aware of some of the issues you mentioned and there are JIRAs created for them. Specifically, I'm pushing out the design on pipeline features and algorithm/model parameters this week. We can move our discussion to https://issues.apache.org/jira/browse/SPARK-1856 .

It would be nice to write tests against interfaces, but it definitely needs more discussion before making PRs. For example, we discussed the learning interfaces in Christoph's PR (https://github.com/apache/spark/pull/2137/), but it takes time to reach a consensus, especially on interfaces. Hopefully all of us can benefit from the discussion. The best practice is to break the proposal down into small independent pieces and discuss them on the JIRA before submitting PRs.

For performance tests, there is the spark-perf package (https://github.com/databricks/spark-perf) and we added performance tests for MLlib in v1.1, but definitely more work needs to be done.

The dev list may not be a good place for discussion of the design. Could you create JIRAs for each of the issues you pointed out, so we can track the discussion on JIRA? Thanks!

Best,
Xiangrui

On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin r...@databricks.com wrote:

Xiangrui can comment more, but I believe he and Joseph are actually working on a standardized interface and pipeline feature for the 1.2 release.

On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov pahomov.e...@gmail.com wrote:

Some architecture suggestions on this matter: https://github.com/apache/spark/pull/2371

2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com:

Sorry, I miswrote - I meant the learners part of the framework; models already exist.

2014-09-12 15:53 GMT+04:00 Christoph Sawade christoph.saw...@googlemail.com:

I totally agree, and we also discovered some drawbacks with the classification model implementations that are based on GLMs:

- There is no distinction between predicting scores, classes, and calibrated scores (probabilities).
For these models it is common to have access to all of them, and the prediction function ``predict`` should be consistent and stateless. Currently, the score is only available after removing the threshold from the model.
- There is no distinction between multinomial and binomial classification. For multinomial problems, it is necessary to handle multiple weight vectors and multiple confidences.
- Models are not serialisable, which makes it hard to use them in practice.

I started a pull request [1] some time ago. I would be happy to continue the discussion and clarify the interfaces, too!

Cheers, Christoph

[1] https://github.com/apache/spark/pull/2137/

2014-09-12 11:11 GMT+02:00 Egor Pahomov pahomov.e...@gmail.com:

Here at Yandex, while implementing gradient boosting in Spark and creating our ML tool for internal use, we found the following serious problems in MLlib:

- There is no Regression/Classification model abstraction. We were building abstract data processing pipelines that should work with just some regression, with the exact algorithm specified outside this code. There is no abstraction,
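To make the kind of separation Christoph describes concrete, here is a minimal sketch of what such an interface could look like. This is purely illustrative and not Spark's actual API; the trait and method names (ClassificationModel, predictScores, predictClasses, predictProbabilities) are hypothetical:

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // Hypothetical interface sketch: a classification model that exposes raw
  // scores, hard class predictions, and calibrated probabilities as separate,
  // stateless operations, and is serializable so it can be shipped or saved.
  trait ClassificationModel extends Serializable {
    def predictScores(features: RDD[Vector]): RDD[Vector]        // raw margins, one per class
    def predictClasses(features: RDD[Vector]): RDD[Double]       // hard class labels
    def predictProbabilities(features: RDD[Vector]): RDD[Vector] // calibrated scores in [0, 1]
  }

Splitting prediction into three explicit methods avoids the current situation where the score is only reachable by clearing the threshold on the model, and making the trait extend Serializable addresses the serialization concern directly.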
Re: Spark authenticate enablement
Spark authentication does work in standalone mode (at least it did; I haven't tested it in a while). The same shared secret has to be set on all the daemons (master and workers) and then also in the configs of any applications submitted. Since everyone shares the same secret, it's by no means ideal or a strong form of authentication.

Tom

On Thursday, September 11, 2014 4:17 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

Hi there,

I am trying to enable authentication for Spark in standalone mode. It seems like only SparkSubmit loads the properties from spark-defaults.conf; org.apache.spark.deploy.master.Master does not really load the default settings from spark-defaults.conf. Does that mean Spark authentication only works for something like the YARN mode? Or did I miss something about standalone mode?

Best Regards

Jun Feng Liu
IBM China Systems Technology Laboratory in Beijing
Phone: 86-10-82452683
E-mail: liuj...@cn.ibm.com
BLD 28, ZGC Software Park, No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China
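For reference, a minimal sketch of how the shared secret could be wired up on the application side, assuming the standard spark.authenticate and spark.authenticate.secret properties Tom refers to. How the same secret reaches the standalone master and worker daemons (e.g. via their JVM system properties) isn't spelled out in this thread, so treat the daemon side as an assumption:

  import org.apache.spark.{SparkConf, SparkContext}

  // Application-side configuration: turn on authentication and supply the
  // shared secret. The identical secret must also be set on every master and
  // worker daemon, otherwise the application cannot register with the cluster.
  val conf = new SparkConf()
    .setAppName("auth-example")
    .setMaster("spark://master-host:7077")          // illustrative standalone master URL
    .set("spark.authenticate", "true")              // enable shared-secret authentication
    .set("spark.authenticate.secret", "mySecret")   // must match the daemons' secret
  val sc = new SparkContext(conf)

The same two properties can equally be put in the application's spark-defaults.conf, which matches the observation above that SparkSubmit does read that file.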
Re: PARSING_ERROR from kryo
At 2014-09-15 08:59:48 -0700, Andrew Ash and...@andrewash.com wrote:

I'm seeing the same exception now on the Spark 1.1.0 release. Did you ever get this figured out? [...]

On Thu, Aug 21, 2014 at 2:14 PM, npanj nitinp...@gmail.com wrote:

I am getting PARSING_ERROR while running my job on the code checked out up to commit# db56f2df1b8027171da1b8d2571d1f2ef1e103b6.

The error is because I merged a GraphX PR that introduced a nondeterministic bug [1]. I reverted the faulty PR, but it was too late for the 1.1.0 release. The problem should go away if you use branch-1.1 or master. Sorry about that...

Ankur

[1] https://issues.apache.org/jira/browse/SPARK-3400
Re: why does BernoulliSampler class use a lower and upper bound?
It is also used in RDD.randomSplit. -Xiangrui

On Mon, Sep 15, 2014 at 4:23 PM, Erik Erlandson e...@redhat.com wrote:

I'm climbing under the hood in there for SPARK-3250, and I see this:

  override def sample(items: Iterator[T]): Iterator[T] = {
    items.filter { item =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }

The clause (x >= lb && x < ub) is equivalent to (x < ub - lb), which is faster and requires only one parameter (the sampling fraction). Any caller asking for BernoulliSampler(a, b) can equally well ask for BernoulliSampler(b - a). Is there some angle I'm missing?
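The randomSplit use case is the answer to the question: with both a lower and an upper bound, several samplers can carve up the same [0, 1) range into disjoint slices, so the resulting splits don't overlap. A rough sketch of the idea, not the exact Spark internals (variable names here are illustrative):

  // Turn split weights into cumulative, normalized boundaries over [0, 1].
  val weights = Array(0.6, 0.4)
  val sum = weights.sum
  val cumulative = weights.map(_ / sum).scanLeft(0.0)(_ + _)   // Array(0.0, 0.6, 1.0)

  // Adjacent pairs give each split its own [lb, ub) range; an element whose
  // rng.nextDouble() lands in that range belongs to that split and no other.
  val ranges = cumulative.sliding(2).map(a => (a(0), a(1))).toArray
  // ranges == Array((0.0, 0.6), (0.6, 1.0))

A single-parameter BernoulliSampler(fraction) could not express the second slice (0.6, 1.0), which is why the two-bound form exists.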
Wiki page for Operations/Monitoring tools?
Hi,

I'm looking for a suitable place on the wiki to add some info about a Spark monitoring tool we've built. The wiki looks nice and orderly, so I didn't want to go in and mess it up without asking where to put such info. I don't see an existing Operations or Monitoring or similar page. Should I just create a child page under https://cwiki.apache.org/confluence/display/SPARK ?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
Re: PARSING_ERROR from kryo
Hi Andrew,

No, I could not figure out the root cause. This seems to be a non-deterministic error... I didn't see the same error after rerunning the same program, but I noticed the same error in a different program. At first I thought this might be related to SPARK-2878, but @Graham replied that it looks unrelated.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/PARSING-ERROR-from-kryo-tp7944p8433.html
Re: NullWritable not serializable
Can you post the exact code for the test that worked in 1.0? I can't think of much that could have changed. The one possibility is if we had some operations that were computed locally on the driver (this happens with things like first() and take(), which will try to do the first partition locally). But generally speaking these operations should *not* work over a network, so you'll have to make sure that you only send serializable types through shuffles or collects, or use a serialization framework like Kryo that might be okay with Writables.

Matei

On September 15, 2014 at 9:13:13 PM, Du Li (l...@yahoo-inc.com) wrote:

Hi Matei,

Thanks for your reply. The Writable classes have never been serializable and this is why it is weird. I did try, as you suggested, to map the Writables to integers and strings. It didn't pass either; similar exceptions were thrown, except that the messages said IntWritable and Text are not serializable. The reason is the implicits defined in the SparkContext object that convert those values into their corresponding Writable classes before saving the data in a sequence file.

My original code was actually some test cases to try out SequenceFile-related APIs. The tests all passed when the Spark version was specified as 1.0.2, but this one failed after I changed the Spark version to 1.1.0, the new release; nothing else changed. In addition, it failed when I called rdd2.collect(), take(1), and first(), but it worked fine when calling rdd2.count(). As you can see, count() does not need to serialize and ship data while the other three methods do.

Do you recall any difference between Spark 1.0 and 1.1 that might cause this problem?

Thanks,
Du

From: Matei Zaharia matei.zaha...@gmail.com
Date: Friday, September 12, 2014 at 9:10 PM
To: Du Li l...@yahoo-inc.com.invalid, u...@spark.apache.org u...@spark.apache.org, dev@spark.apache.org dev@spark.apache.org
Subject: Re: NullWritable not serializable

Hi Du,

I don't think NullWritable has ever been serializable, so you must be doing something differently from your previous program. In this case though, just use a map() to turn your Writables to serializable types (e.g. null and String).

Matei

On September 12, 2014 at 8:48:36 PM, Du Li (l...@yahoo-inc.com.invalid) wrote:

Hi,

I was trying the following on spark-shell (built with Apache master and Hadoop 2.4.0). Both calling rdd2.collect and calling rdd3.collect threw java.io.NotSerializableException: org.apache.hadoop.io.NullWritable. I got the same problem in similar code of my app, which uses the newly released Spark 1.1.0 under Hadoop 2.4.0. Previously it worked fine with Spark 1.0.2 under either Hadoop 2.4.0 or 0.23.10. Anybody knows what caused the problem?

Thanks,
Du

  import org.apache.hadoop.io.{NullWritable, Text}
  val rdd = sc.textFile("README.md")
  val res = rdd.map(x => (NullWritable.get(), new Text(x)))
  res.saveAsSequenceFile("./test_data")
  val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
  rdd2.collect
  val rdd3 = sc.sequenceFile[NullWritable, Text]("./test_data")
  rdd3.collect
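For completeness, here is a small sketch of the workaround Matei suggests: converting the Writables to plain serializable types before any action that ships data back to the driver. This is illustrative only and assumes the same ./test_data sequence file written above:

  // Read the sequence file back, then map each (NullWritable, Text) pair to a
  // plain String, which is java.io.Serializable, before collecting.
  import org.apache.hadoop.io.{NullWritable, Text}
  val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
  val lines = rdd2.map { case (_, text) => text.toString }
  lines.collect()   // no NotSerializableException: only Strings cross the wire

count() still works on the original rdd2 because, as Du notes, it never has to serialize and ship the records themselves.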