Re: Adding abstraction in MLlib

2014-09-15 Thread Egor Pahomov
It's good that Databricks is working on this issue! However, the current
process of working on it is not very clear to outsiders.

   - The last update on this ticket was August 5. If all this time has been
   active development without feedback from the community, I am concerned the
   development may have gone in the wrong direction.
   - Even if it ends up as one big patch, introducing the new interfaces to
   the community now would allow us to start working on our pipeline code. It
   would allow us to write algorithms in the new paradigm instead of with no
   paradigm at all, as before, and it would allow us to help you move old
   code to the new paradigm.

My main point - shorter iterations with more transparency.

I think it would be a good idea to create a pull request with the code you
have so far, even if it doesn't pass tests, so that we can comment on it
before it is formulated in a design doc.


2014-09-13 0:00 GMT+04:00 Patrick Wendell pwend...@gmail.com:

 We typically post design docs on JIRAs before major work starts. For
 instance, I'm pretty sure SPARK-1856 will have a design doc posted
 shortly.

 On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson e...@redhat.com wrote:
 
  Are interface designs being captured anywhere as documents that the
 community can follow along with as the proposals evolve?
 
  I've worked on other open source projects where design docs were
 published as living documents (e.g. on google docs, or etherpad, but the
 particular mechanism isn't crucial).   FWIW, I found that to be a good way
 to work in a community environment.
 
 
  - Original Message -
  Hi Egor,
 
  Thanks for the feedback! We are aware of some of the issues you
  mentioned and there are JIRAs created for them. Specifically, I'm
  pushing out the design on pipeline features and algorithm/model
  parameters this week. We can move our discussion to
  https://issues.apache.org/jira/browse/SPARK-1856 .
 
   It would be nice to write tests against the interfaces. But it definitely
   needs more discussion before making PRs. For example, we discussed the
   learning interfaces in Christoph's PR
   (https://github.com/apache/spark/pull/2137/), but it takes time to
   reach a consensus, especially on interfaces. Hopefully all of us can
   benefit from the discussion. The best practice is to break the proposal
   down into small independent pieces and discuss them on the JIRA
   before submitting PRs.
 
  For performance tests, there is a spark-perf package
  (https://github.com/databricks/spark-perf) and we added performance
  tests for MLlib in v1.1. But definitely more work needs to be done.
 
   The dev list may not be a good place for discussion of the design;
   could you create JIRAs for each of the issues you pointed out, so we
   can track the discussion on JIRA? Thanks!
 
  Best,
  Xiangrui
 
  On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin r...@databricks.com
 wrote:
    Xiangrui can comment more, but I believe Joseph and he are actually
    working on standardized interfaces and the pipeline feature for the 1.2 release.
  
   On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov pahomov.e...@gmail.com
 
   wrote:
  
    Some architecture suggestions on this matter:
    https://github.com/apache/spark/pull/2371
  
   2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com:
  
Sorry, I miswrote - I meant the learners part of the framework; the models
already exist.
   
2014-09-12 15:53 GMT+04:00 Christoph Sawade
christoph.saw...@googlemail.com:

I totally agree, and we also discovered some drawbacks in the
classification model implementations that are based on GLMs:

- There is no distinction between predicting scores, classes, and
  calibrated scores (probabilities). For these models it is common to
  have access to all of them, and the prediction function ``predict``
  should be consistent and stateless. Currently, the score is only
  available after removing the threshold from the model.
- There is no distinction between multinomial and binomial
  classification. For multinomial problems, it is necessary to handle
  multiple weight vectors and multiple confidences.
- Models are not serialisable, which makes it hard to use them in
  practice.

I started a pull request [1] some time ago. I would be happy to
continue the discussion and clarify the interfaces, too!

Cheers, Christoph

[1] https://github.com/apache/spark/pull/2137/
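A minimal Scala sketch of the distinctions described above (hypothetical
names, not MLlib's actual API): raw scores, calibrated probabilities, and
class predictions exposed as separate, stateless methods on a serializable
model.

import org.apache.spark.mllib.linalg.Vector

// Hypothetical interface sketch only -- illustrative, not MLlib code.
trait ProbabilisticClassificationModel extends Serializable {
  def numClasses: Int
  // Uncalibrated scores, one per class (e.g. margins for a GLM).
  def predictRaw(features: Vector): Array[Double]
  // Calibrated scores, i.e. class probabilities.
  def predictProbabilities(features: Vector): Array[Double]
  // Predicted class index, derived statelessly from the raw scores.
  def predict(features: Vector): Int =
    predictRaw(features).zipWithIndex.maxBy(_._1)._2
}

A multinomial implementation would back these methods with one weight vector
per class, while a binomial one could specialize them.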
   
2014-09-12 11:11 GMT+02:00 Egor Pahomov pahomov.e...@gmail.com:

Here at Yandex, while implementing gradient boosting in Spark and building
our ML tool for internal use, we found the following serious problems in
MLlib:

   - There is no Regression/Classification model abstraction. We were
   building abstract data processing pipelines that should work with just
   some regression, with the exact algorithm specified outside this code.
   There is no abstraction, 

Re: Spark authenticate enablement

2014-09-15 Thread Tom Graves
Spark authentication does work in standalone mode (at least it did; I haven't 
tested it in a while). The same shared secret has to be set on all the daemons 
(master and workers) and then also in the configs of any applications 
submitted.  Since everyone shares the same secret, it's by no means ideal or 
strong authentication.

Tom
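A minimal sketch of the application side of what Tom describes, using the
standard spark.authenticate settings ("my-shared-secret" and the app name are
just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// The submitted application must carry the same shared secret that the
// standalone daemons were started with.
val conf = new SparkConf()
  .setAppName("AuthExample")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "my-shared-secret")
val sc = new SparkContext(conf)

The master and workers would need the same two properties as well, for
example via SPARK_DAEMON_JAVA_OPTS in spark-env.sh, since, as Jun Feng
observes below, the daemons do not pick them up from spark-defaults.conf.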


On Thursday, September 11, 2014 4:17 AM, Jun Feng Liu liuj...@cn.ibm.com 
wrote:
 


Hi there,

I am trying to enable authentication for Spark in standalone mode. It seems
that only SparkSubmit loads the properties from spark-defaults.conf;
org.apache.spark.deploy.master.Master does not really load the default
settings from spark-defaults.conf.

Does this mean Spark authentication only works in, for example, YARN mode?
Or did I miss something about standalone mode?
 
Best Regards,

Jun Feng Liu
IBM China Systems & Technology Laboratory in Beijing

Phone: 86-10-82452683
E-mail: liuj...@cn.ibm.com

BLD 28, ZGC Software Park
No. 8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
China

Re: PARSING_ERROR from kryo

2014-09-15 Thread Ankur Dave
At 2014-09-15 08:59:48 -0700, Andrew Ash and...@andrewash.com wrote:
 I'm seeing the same exception now on the Spark 1.1.0 release.  Did you ever
 get this figured out?

 [...]

 On Thu, Aug 21, 2014 at 2:14 PM, npanj nitinp...@gmail.com wrote:
 I am getting PARSING_ERROR while running my job on the code checked out up
 to commit# db56f2df1b8027171da1b8d2571d1f2ef1e103b6.

The error is because I merged a GraphX PR that introduced a nondeterministic 
bug [1]. I reverted the faulty PR, but it was too late for the 1.1.0 release. 
The problem should go away if you use branch-1.1 or master. Sorry about that...

Ankur

[1] https://issues.apache.org/jira/browse/SPARK-3400




Re: why does BernoulliSampler class use a lower and upper bound?

2014-09-15 Thread Xiangrui Meng
It is also used in RDD.randomSplit. -Xiangrui
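As a rough illustration of why the two-bound form is useful there (a sketch
of the idea, not Spark's actual implementation): randomSplit can turn its
weights into adjacent [lb, ub) ranges over the same uniform draw, so the
resulting splits are disjoint and together cover the whole RDD, which a
single sampling fraction cannot express.

// Hypothetical illustration: weights (0.6, 0.4) become the disjoint ranges
// [0.0, 0.6) and [0.6, 1.0). With the same seed, each element's draw x falls
// into exactly one range, so the splits partition the RDD exactly.
val weights = Array(0.6, 0.4)
val bounds = weights.scanLeft(0.0)(_ + _)   // Array(0.0, 0.6, 1.0)
val ranges = bounds.sliding(2).map { case Array(lb, ub) => (lb, ub) }.toArray
// ranges: Array((0.0, 0.6), (0.6, 1.0)) -- each pair would parameterize one
// sampler of the form BernoulliSampler(lb, ub) over the same random sequence.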

On Mon, Sep 15, 2014 at 4:23 PM, Erik Erlandson e...@redhat.com wrote:
 I'm climbing under the hood in there for SPARK-3250, and I see this:

 override def sample(items: Iterator[T]): Iterator[T] = {
   items.filter { item =>
     val x = rng.nextDouble()
     (x >= lb && x < ub) ^ complement
   }
 }


 The clause (x >= lb && x < ub) is equivalent to (x < ub - lb), which is faster, 
 and requires only one parameter (sampling fraction).  Any caller asking for 
 BernoulliSampler(a, b) can equally well ask for BernoulliSampler(b - a).

 Is there some angle I'm missing?






Wiki page for Operations/Monitoring tools?

2014-09-15 Thread Otis Gospodnetic
Hi,

I'm looking for a suitable place on the wiki to add some info about a Spark
monitoring tool we've built.  The wiki looks nice and orderly, so I didn't want
to go in and mess it up without asking where to put such info.  I don't see
an existing Operations or Monitoring or similar page.  Should I just
create a child page under https://cwiki.apache.org/confluence/display/SPARK ?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: PARSING_ERROR from kryo

2014-09-15 Thread npanj
Hi Andrew,

No, I could not figure out the root cause. This seems to be a non-deterministic
error... I didn't see the same error after rerunning the same program, but I
noticed the same error in a different program.

First I thought that this might be related to SPARK-2878, but @Graham replied
that it looks unrelated.






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/PARSING-ERROR-from-kryo-tp7944p8433.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: NullWritable not serializable

2014-09-15 Thread Matei Zaharia
Can you post the exact code for the test that worked in 1.0? I can't think of 
much that could've changed. The one possibility is if  we had some operations 
that were computed locally on the driver (this happens with things like first() 
and take(), which will try to do the first partition locally). But generally 
speaking these operations should *not* work over a network, so you'll have to 
make sure that you only send serializable types through shuffles or collects, 
or use a serialization framework like Kryo that might be okay with Writables.

Matei
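A minimal sketch of the Kryo route mentioned above, assuming a custom
KryoRegistrator; the class name and secretless property values are
illustrative, and whether Kryo actually round-trips a given Writable
correctly would still need to be verified.

import com.esotericsoftware.kryo.Kryo
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Register the Writable classes with Kryo so they can be shipped without
// implementing java.io.Serializable. In a real application this class would
// live in the application jar and be referenced by its fully qualified name.
class WritableRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[NullWritable])
    kryo.register(classOf[Text])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "WritableRegistrator")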

On September 15, 2014 at 9:13:13 PM, Du Li (l...@yahoo-inc.com) wrote:

Hi Matei,

Thanks for your reply. 

The Writable classes have never been serializable, and this is why it is weird. 
I did try, as you suggested, to map the Writables to integers and strings. It 
didn't pass either. Similar exceptions were thrown, except that the messages 
became "IntWritable, Text are not serializable". The reason is the implicits 
defined in the SparkContext object that convert those values into their 
corresponding Writable classes before saving the data in a sequence file.

My original code was actually some test cases to try out SequenceFile-related 
APIs. The tests all passed when the spark version was specified as 1.0.2. But 
this one failed after I changed the spark version to 1.1.0, the new release; 
nothing else changed. In addition, it failed when I called rdd2.collect(), 
take(1), and first(), but it worked fine when calling rdd2.count(). As you can 
see, count() does not need to serialize and ship data while the other three 
methods do.

Do you recall any difference between spark 1.0 and 1.1 that might cause this 
problem?

Thanks,
Du


From: Matei Zaharia matei.zaha...@gmail.com
Date: Friday, September 12, 2014 at 9:10 PM
To: Du Li l...@yahoo-inc.com.invalid, u...@spark.apache.org 
u...@spark.apache.org, dev@spark.apache.org dev@spark.apache.org
Subject: Re: NullWritable not serializable

Hi Du,

I don't think NullWritable has ever been serializable, so you must be doing 
something differently from your previous program. In this case though, just use 
a map() to turn your Writables to serializable types (e.g. null and String).

Matei

On September 12, 2014 at 8:48:36 PM, Du Li (l...@yahoo-inc.com.invalid) wrote:

Hi,

I was trying the following on spark-shell (built with apache master and hadoop 
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw 
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable. 

I got the same problem in similar code of my app, which uses the newly released 
Spark 1.1.0 under hadoop 2.4.0. Previously it worked fine with spark 1.0.2 
under either hadoop 2.4.0 or 0.23.10.

Anybody knows what caused the problem?

Thanks,
Du


import org.apache.hadoop.io.{NullWritable, Text}
val rdd = sc.textFile("README.md")
val res = rdd.map(x => (NullWritable.get(), new Text(x)))
res.saveAsSequenceFile("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
rdd2.collect
val rdd3 = sc.sequenceFile[NullWritable,Text]("./test_data")
rdd3.collect
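For reference, a small sketch of the map() workaround Matei suggests above,
reusing rdd2 from the snippet: convert the Writables to plain serializable
types before collecting, so only serializable values are sent to the driver.

// (NullWritable, Text) pairs become plain Strings before collect().
val rdd4 = rdd2.map { case (_, text) => text.toString }
rdd4.collect()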