Re: A proposal for Spark 2.0

2015-11-25 Thread Sandy Ryza
update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0. BTW just refer to Reynold's original post for the other proposed API changes. Matei On N

Re: A proposal for Spark 2.0

2015-11-24 Thread Sandy Ryza
I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of jobs in production, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Sandy Ryza
To answer your fourth question from Cloudera's perspective, we would never support a customer running Spark 2.0 on a Hadoop version < 2.6. -Sandy On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin wrote: OK, I'm not exactly asking for a vote here :) I don't think we should

Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Another +1 to Reynold's proposal. Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see

Re: A proposal for Spark 2.0

2015-11-10 Thread Sandy Ryza
Oh, and another question - should Spark 2.0 support Java 7? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote: Another +1 to Reynold's proposal. Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / develope

Re: Info about Dataset

2015-11-03 Thread Sandy Ryza
Hi Justin, The Dataset API proposal is available here: https://issues.apache.org/jira/browse/SPARK-. -Sandy On Tue, Nov 3, 2015 at 1:41 PM, Justin Uang wrote: > Hi, > > I was looking through some of the PRs slated for 1.6.0 and I noted > something called a Dataset,

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-30 Thread Sandy Ryza
+1 (non-binding) Built from source and ran some jobs against YARN. -Sandy On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan vaquar.k...@gmail.com wrote: +1 (1.5.0 RC2) Compiled on Windows with YARN. Regards, Vaquar khan +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
I see that there's a 1.5.0-rc2 tag in github now. Is that the official RC2 tag to start trying out? -Sandy On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote: PS Shixiong Zhu is correct that this one has to be fixed: https://issues.apache.org/jira/browse/SPARK-10168 For

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Sandy Ryza
Cool, thanks! On Mon, Aug 24, 2015 at 2:07 PM, Reynold Xin r...@databricks.com wrote: Nope --- I cut that last Friday but had an error. I will remove it and cut a new one. On Mon, Aug 24, 2015 at 2:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: I see that there's a 1.5.0-rc2 tag

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com wrote: This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
Сергей Лихоман sergliho...@gmail.com wrote: Thanks for the answer! Could you please answer one more question? Will we have the original RDD and the grouped RDD in memory at the same time? 2015-07-19 21:04 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com: Edit: the first line should read: val

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._1)(x._2)) -Sandy On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман
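The round trip described in this thread (with the corrected first line from the follow-up message) can be sketched on plain Scala collections standing in for RDDs — here `groupBy` plus a sum plays the role of `reduceByKey`, and the data values are made up for illustration:

```scala
object CompactRepr {
  // Compact form: (value, count) pairs, like the grouped RDD.
  def compact(data: Seq[String]): Map[String, Int] =
    data.map((_, 1)).groupBy(_._1).mapValues(_.map(_._2).sum).toMap

  // Back to the original form: repeat each value count times.
  def expand(grouped: Map[String, Int]): Seq[String] =
    grouped.toSeq.flatMap { case (value, count) => List.fill(count)(value) }

  def main(args: Array[String]): Unit = {
    val data = Seq("a", "b", "a", "a", "c", "b")
    val grouped = compact(data)
    assert(grouped("a") == 3)                      // "a" appears three times
    assert(expand(grouped).sorted == data.sorted)  // round trip restores data
  }
}
```

On a real RDD the compact step is the quoted `rdd.map((_, 1)).reduceByKey(_ + _)`; the expand step is the quoted `flatMap`.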

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
GMT+03:00 Sandy Ryza sandy.r...@cloudera.com: The user gets to choose what they want to reside in memory. If they call rdd.cache() on the original RDD, it will be in memory. If they call rdd.cache() on the compact RDD, it will be in memory. If cache() is called on both, they'll both

Re: How to Read Excel file in Spark 1.4

2015-07-13 Thread Sandy Ryza
Hi Su, Spark can't read Excel files directly. Your best bet is probably to export the contents as a CSV and use the csvFile API. -Sandy On Mon, Jul 13, 2015 at 9:22 AM, spark user spark_u...@yahoo.com.invalid wrote: Hi, I need your help to save Excel data in Hive. 1. how to read

Re: External Shuffle service over yarn

2015-06-26 Thread Sandy Ryza
Hi Yash, One of the main advantages is that, if you turn dynamic allocation on, and executors are discarded, your application is still able to get at the shuffle data that they wrote out. -Sandy On Thu, Jun 25, 2015 at 11:08 PM, yash datta sau...@gmail.com wrote: Hi devs, Can someone point
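The setup described here — dynamic allocation backed by the external shuffle service — is typically wired up with a pair of Spark properties; a minimal sketch of a `spark-defaults.conf` fragment (the min/max values are illustrative):

```
spark.shuffle.service.enabled    true
spark.dynamicAllocation.enabled  true
# Optional bounds on how far allocation may scale (illustrative values):
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  50
```

On YARN the shuffle service additionally has to be registered as an auxiliary service on each NodeManager (in yarn-site.xml), so that shuffle files outlive the executors that wrote them.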

Re: Increase partition count (repartition) without shuffle

2015-06-18 Thread Sandy Ryza
Hi Alexander, There is currently no way to create an RDD with more partitions than its parent RDD without causing a shuffle. However, if the files are splittable, you can set the Hadoop configurations that control split size to something smaller so that the HadoopRDD ends up with more
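A hedged sketch of the workaround Sandy describes — shrinking the maximum split size so a splittable file yields more partitions in the HadoopRDD; the property name is the standard Hadoop 2 split-size knob, but the path and size here are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("more-splits"))

// Cap each input split at 32 MB instead of the HDFS block size, so each
// file contributes more (smaller) partitions -- no shuffle involved.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024)

val rdd = sc.textFile("hdfs:///data/input") // now has more partitions
```

This only helps for splittable formats; gzipped text, for example, still produces one partition per file.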

Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome. On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie jie.hu...@intel.com wrote: Hi All We are happy to announce Performance portal for Apache Spark http://01org.github.io/sparkscore/ ! The Performance Portal for Apache Spark provides performance data on the Spark

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Sandy Ryza
+1 (non-binding) Built from source and ran some jobs against a pseudo-distributed YARN cluster. -Sandy On Fri, Jun 5, 2015 at 11:05 AM, Ram Sriharsha sriharsha@gmail.com wrote: +1 , tested with hadoop 2.6/ yarn on centos 6.5 after building w/ -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-05-31 Thread Sandy Ryza
+1 (non-binding) Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and ran some jobs. -Sandy On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min mvn clean

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Sandy Ryza
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that it's configurable, but I think we need to provide a decent out-of-the-box experience. For comparison, the MapReduce equivalent is 1 second. I filed https://issues.apache.org/jira/browse/SPARK-7533 for this. -Sandy On

Re: Regarding KryoSerialization in Spark

2015-04-30 Thread Sandy Ryza
Hi Twinkle, Registering the class makes it so that writeClass only writes out a couple bytes, instead of a full String of the class name. -Sandy On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, As per the code, KryoSerialization used
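A minimal sketch of the registration being described — `MyRecord` is a placeholder class, not something from the thread:

```scala
import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // With the class registered, Kryo writes a small numeric class ID per
  // record instead of the full class-name string.
  .registerKryoClasses(Array(classOf[MyRecord]))
```

For records that are small relative to their class name, skipping the repeated name string can shrink serialized data noticeably.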

Re: Using memory mapped file for shuffle

2015-04-29 Thread Sandy Ryza
to store it as a byte buffer. I want to make sure this will not cause OOM when the file size is large. -- Kannan On Tue, Apr 14, 2015 at 9:07 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Kannan, Both in MapReduce and Spark, the amount of shuffle data a task produces can exceed

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Sandy Ryza
On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza sandy.r...@cloudera.com wrote: I think there are maybe two separate things we're talking about? 1. Design discussions and in-progress design docs. My two cents are that JIRA is the best place for this. It allows tracking

Re: Design docs: consolidation and discoverability

2015-04-24 Thread Sandy Ryza
I think there are maybe two separate things we're talking about? 1. Design discussions and in-progress design docs. My two cents are that JIRA is the best place for this. It allows tracking the progression of a design across multiple PRs and contributors. A piece of useful feedback that I've

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sandy Ryza
I think one of the benefits of assignee fields that I've seen in other projects is their potential to coordinate and prevent duplicate work. It's really frustrating to put a lot of work into a patch and then find out that someone has been doing the same. It's helpful for the project etiquette to

Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Sandy Ryza
+1 Built against Hadoop 2.6 and ran some jobs against a pseudo-distributed YARN cluster. -Sandy On Wed, Apr 8, 2015 at 12:49 PM, Patrick Wendell pwend...@gmail.com wrote: Oh I see - ah okay I'm guessing it was a transient build error and I'll get it posted ASAP. On Wed, Apr 8, 2015 at 3:41

Re: RDD.count

2015-03-28 Thread Sandy Ryza
I definitely see the value in this. However, I think at this point it would be an incompatible behavioral change. People often use count in Spark to exercise their DAG. Omitting processing steps that were previously included would likely mislead many users into thinking their pipeline was

Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do new Configuration(oldConf) to get a cloned Configuration object and add any new properties to it. -Sandy On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote: Hi Nick, I don't remember the exact details of these scenarios, but
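A sketch of the copy-constructor approach Sandy mentions — clone the shared Hadoop Configuration, then layer input-specific overrides on the copy (the property key below is hypothetical):

```scala
import org.apache.hadoop.conf.Configuration

val oldConf = new Configuration()
val newConf = new Configuration(oldConf) // copies all existing properties
newConf.set("my.input.specific.property", "value") // hypothetical override
// oldConf is untouched; hand newConf to the input that needs the override.
```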

Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
2015-03-24 16:30 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com: Hi Zoltan, If running on YARN, the YARN NodeManager starts executors. I don't think there's a 100% precise way for the Spark executor to know how many resources are allotted to it. It can come close by looking

Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
Hi Zoltan, If running on YARN, the YARN NodeManager starts executors. I don't think there's a 100% precise way for the Spark executor to know how many resources are allotted to it. It can come close by looking at the Spark configuration options used to request it (spark.executor.memory and

Re: Directly broadcasting (sort of) RDDs

2015-03-22 Thread Sandy Ryza
Hi Guillaume, I've long thought something like this would be useful - i.e. the ability to broadcast RDDs directly without first pulling data through the driver. If I understand correctly, your requirement to block a matrix up and only fetch the needed parts could be implemented on top of this by

Re: multi-line comment style

2015-02-09 Thread Sandy Ryza
+1 to what Andrew said, I think both make sense in different situations and trusting developer discretion here is reasonable. On Mon, Feb 9, 2015 at 1:48 PM, Andrew Or and...@databricks.com wrote: In my experience I find it much more natural to use // for short multi-line comments (2 or 3

Re: Improving metadata in Spark JIRA

2015-02-06 Thread Sandy Ryza
JIRA updates don't go to this list, they go to iss...@spark.apache.org. I don't think many are signed up for that list, and those that are probably have a flood of emails anyway. So I'd definitely be in favor of any JIRA cleanup that you're up for. -Sandy On Fri, Feb 6, 2015 at 6:45 AM, Sean

Re: Issue with repartition and cache

2015-01-21 Thread Sandy Ryza
Hi Dirceu, Does the issue not show up if you run map(f => f(1).asInstanceOf[Int]).sum on the train RDD? It appears that f(1) is a String, not an Int. If you're looking to parse and convert it, toInt should be used instead of asInstanceOf. -Sandy On Wed, Jan 21, 2015 at 8:43 AM, Dirceu
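The distinction Sandy draws can be shown in a few lines of plain Scala — a cast only reinterprets the reference, while `toInt` actually parses the text (the field value is made up):

```scala
object CastVsParse {
  def main(args: Array[String]): Unit = {
    val field: Any = "42" // a String field pulled out of a parsed row

    // Parsing converts the text to a number.
    assert(field.toString.toInt == 42)

    // Casting fails at runtime: a String is not an Int.
    val castFailed =
      try { field.asInstanceOf[Int]; false }
      catch { case _: ClassCastException => true }
    assert(castFailed)
  }
}
```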

Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
I think clarifying these semantics is definitely worthwhile. Maybe this complicates the process with additional terminology, but the way I've used these has been: +1 - I think this is safe to merge and, barring objections from others, would merge it immediately. LGTM - I have no concerns

Re: Semantics of LGTM

2015-01-17 Thread sandy . ryza
Yeah, the ASF +1 has become partly overloaded to mean both "I would like to see this feature" and "this patch should be committed", although, at least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should unambiguously mean the latter unless qualified in some other way. I

Re: Spark Dev

2014-12-19 Thread Sandy Ryza
Hi Harikrishna, A good place to start is taking a look at the wiki page on contributing: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -Sandy On Fri, Dec 19, 2014 at 2:43 PM, Harikrishna Kamepalli harikrishna.kamepa...@gmail.com wrote: i am interested to contribute

Re: one hot encoding

2014-12-13 Thread Sandy Ryza
Hi Lochana, We haven't yet added this in 1.2. https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical feature indexing, which one-hot encoding can be built on. https://issues.apache.org/jira/browse/SPARK-1216 also tracks a version of this prior to the ML pipelines work. -Sandy
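Since the encoding wasn't yet in MLlib 1.2, a hand-rolled sketch of what one-hot encoding does (category names and ordering are illustrative; real pipelines would fix the index from training data):

```scala
object OneHot {
  // Map each distinct category to a vector with a single 1.0 at its index.
  def encode(categories: Seq[String]): Map[String, Array[Double]] = {
    val index = categories.distinct.sorted.zipWithIndex.toMap
    index.map { case (cat, i) =>
      val v = Array.fill(index.size)(0.0)
      v(i) = 1.0
      cat -> v
    }
  }

  def main(args: Array[String]): Unit = {
    val enc = encode(Seq("red", "green", "blue", "red"))
    assert(enc.size == 3)                              // three categories
    assert(enc("blue").toSeq == Seq(1.0, 0.0, 0.0))    // sorted: blue first
  }
}
```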

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sandy Ryza
+1 (non-binding). Tested on Ubuntu against YARN. On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin r...@databricks.com wrote: +1 Tested on OS X. On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark

Re: HA support for Spark

2014-12-10 Thread Sandy Ryza
I think that if we were able to maintain the full set of created RDDs as well as some scheduler and block manager state, it would be enough for most apps to recover. On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Well, it should not be mission impossible thinking there

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Sandy Ryza
+1 (non-binding) Built from source, fired up a spark-shell against a YARN cluster, ran some jobs using parallelize, ran some jobs that read files, clicked around the web UI. On Sun, Nov 30, 2014 at 1:10 AM, GuoQiang Li wi...@qq.com wrote: +1 (non-binding) -- Original

Re: Too many open files error

2014-11-19 Thread Sandy Ryza
Quizhang, This is a known issue that ExternalAppendOnlyMap can do tons of tiny spills in certain situations. SPARK-4452 aims to deal with this issue, but we haven't finalized a solution yet. Dinesh's solution should help as a workaround, but you'll likely experience suboptimal performance when

Re: Spark Hadoop 2.5.1

2014-11-14 Thread sandy . ryza
You're the second person to request this today. Planning to include this in my PR for SPARK-4338. -Sandy On Nov 14, 2014, at 8:48 AM, Corey Nolet cjno...@gmail.com wrote: In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like you've mentioned. What prompted me to write

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Sandy Ryza
Currently there are no mandatory profiles required to build Spark. I.e. mvn package just works. It seems sad that we would need to break this. On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell pwend...@gmail.com wrote: I think printing an error that says -Pscala-2.10 must be enabled is

Re: proposal / discuss: multiple Serializers within a SparkContext?

2014-11-08 Thread Sandy Ryza
a storage policy in which you can specify how data should be stored. I think that would be a great API to have in the long run. Designing it won't be trivial though. On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey all, Was messing around with Spark

proposal / discuss: multiple Serializers within a SparkContext?

2014-11-07 Thread Sandy Ryza
Hey all, Was messing around with Spark and Google FlatBuffers for fun, and it got me thinking about Spark and serialization. I know there's been work / talk about in-memory columnar formats for Spark SQL, so maybe there are ways to provide this flexibility already that I've missed? Either way, my

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Sandy Ryza
It looks like the difference between the proposed Spark model and the CloudStack / SVN model is: * In the former, maintainers / partial committers are a way of centralizing oversight over particular components among committers * In the latter, maintainers / partial committers are a way of giving

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Sandy Ryza
This seems like a good idea. An area that wasn't listed, but that I think could strongly benefit from maintainers, is the build. Having consistent oversight over Maven, SBT, and dependencies would allow us to avoid subtle breakages. Component maintainers have come up several times within the

Re: A couple questions about shared variables

2014-09-22 Thread Sandy Ryza
Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that task data is automatically broadcast, the only occasions

Re: Spark authenticate enablement

2014-09-12 Thread Sandy Ryza
Hi Jun, I believe that's correct that Spark authentication only works against YARN. -Sandy On Thu, Sep 11, 2014 at 2:14 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Hi there, I am trying to enable authentication for Spark in standalone mode. Seems like only SparkSubmit loads the

Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
After the change to broadcast all task data, is there any easy way to discover the serialized size of the data getting sent down for a task? thanks, -Sandy

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
It used to be available on the UI, no? On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin r...@databricks.com wrote: I don't think so. We should probably add a line to log it. On Thursday, September 11, 2014, Sandy Ryza sandy.r...@cloudera.com wrote: After the change to broadcast all task data

Re: Reporting serialized task size after task broadcast change?

2014-09-11 Thread Sandy Ryza
Hmm, well I can't find it now, must have been hallucinating. Do you know off the top of your head where I'd be able to find the size to log it? On Thu, Sep 11, 2014 at 6:33 PM, Reynold Xin r...@databricks.com wrote: I didn't know about that On Thu, Sep 11, 2014 at 6:29 PM, Sandy Ryza

Re: Lost executor on YARN ALS iterations

2014-09-10 Thread Sandy Ryza
That's right On Tue, Sep 9, 2014 at 2:04 PM, Debasish Das debasish.da...@gmail.com wrote: Last time it did not show up on environment tab but I will give it another shot...Expected behavior is that this env variable will show up right ? On Tue, Sep 9, 2014 at 12:15 PM, Sandy Ryza sandy.r

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
Hi Deb, The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic. -Sandy On Tue, Sep 9, 2014 at 7:32 AM,

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
billion ratings... On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Deb, The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory

Re: about spark assembly jar

2014-09-02 Thread Sandy Ryza
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging. -Sandy On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote: Hi sean owen, here are some problems when i used

Re: Lost executor on YARN ALS iterations

2014-08-20 Thread Sandy Ryza
Hi Debasish, The fix is to raise spark.yarn.executor.memoryOverhead until this goes away. This controls the buffer between the JVM heap size and the amount of memory requested from YARN (JVMs can take up memory beyond their heap size). You should also make sure that, in the YARN NodeManager
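The fix being described is usually applied on the submit command line; a sketch with illustrative sizes (the overhead property name matches the 1.x-era YARN config):

```
# The overhead is the off-heap cushion YARN grants beyond the executor heap;
# raise it until the "lost executor" kills stop. Values here are made up.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  your-app.jar
```

The container YARN requests is then roughly executor-memory plus the overhead, so the NodeManager's per-container maximum must accommodate the sum.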

Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
Hi Jun, Spark currently doesn't have that feature, i.e. it aims for a fixed number of executors per application regardless of resource usage, but it's definitely worth considering. We could start more executors when we have a large backlog of tasks and shut some down when we're underutilized.

Re: Fine-Grained Scheduler on Yarn

2014-08-08 Thread Sandy Ryza
E-mail: liuj...@cn.ibm.com, IBM, BLD 28, ZGC Software Park, No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193, China. Sandy Ryza sandy.r...@cloudera.com 2014/08/08 15:14 To Jun Feng Liu/China/IBM@IBMCN, cc Patrick

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-22 Thread Sandy Ryza
the first row for every file, or the header only for the first file. The former is not really supported out of the box by the input format I think? On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It could make sense to add a skipHeader argument

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote: If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd
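The header-dropping behavior under discussion is commonly approximated per partition; a sketch on plain collections standing in for an RDD's partitions (sample rows are made up):

```scala
object DropHeader {
  // Drop the first line of only the first partition -- the header, when the
  // file's first line lands in partition 0.
  def dropHeader(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.zipWithIndex.flatMap { case (part, i) =>
      if (i == 0) part.drop(1) else part
    }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("name,age", "ann,31"), Seq("bob,45"))
    assert(dropHeader(parts) == Seq("ann,31", "bob,45"))
  }
}
```

On a real RDD the same idea is the common idiom `rdd.mapPartitionsWithIndex { (i, it) => if (i == 0) it.drop(1) else it }`; note it only removes the header of the first file when several files are concatenated.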

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sandy Ryza
On Wed, Jul 16, 2014 at 4:19 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Ron, I just checked and this bug is fixed in recent releases of Spark. -Sandy On Sun, Jul 13, 2014 at 8:15 PM, Chester Chen ches...@alpinenow.com wrote: Ron, Which distribution and Version

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Sandy Ryza
Stephen, Often the shuffle is bound by writes to disk, so even if disks have enough space to store the uncompressed data, the shuffle can complete faster by writing less data. Reynold, This isn't a big help in the short term, but if we switch to a sort-based shuffle, we'll only need a single

Re: Changes to sbt build have been merged

2014-07-10 Thread Sandy Ryza
Woot! On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com wrote: Just a heads up, we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report any issues you find on the dev list or on JIRA. One note here for developers, going

Re: Data Locality In Spark

2014-07-08 Thread Sandy Ryza
Hi Anish, Spark, like MapReduce, makes an effort to schedule tasks on the same nodes and racks that the input blocks reside on. -Sandy On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in anishs...@yahoo.co.in wrote: Hi All My apologies for very basic question, do we have full support

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me. While we should be careful about what algorithms we include, having solid implementations of minibatch clustering and hierarchical clustering seems like a worthwhile goal, and we should reuse as much code and APIs as reasonable. On

Re: Contributing to MLlib on GLM

2014-06-17 Thread Sandy Ryza
Hi Xiaokai, I think MLLib is definitely interested in supporting additional GLMs. I'm not aware of anybody working on this at the moment. -Sandy On Tue, Jun 17, 2014 at 5:00 PM, Xiaokai Wei x...@palantir.com wrote: Hi, I am an intern at PalantirTech and we are building some stuff on top

Re: Please change instruction about Launching Applications Inside the Cluster

2014-05-30 Thread Sandy Ryza
They should be - in the sense that the docs now recommend using spark-submit and thus include entirely different invocations. On Fri, May 30, 2014 at 12:46 AM, Reynold Xin r...@databricks.com wrote: Can you take a look at the latest Spark 1.0 docs and see if they are fixed?

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Sandy Ryza
+1 On Mon, May 26, 2014 at 7:38 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
--- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It just hit me why this problem is showing up on YARN and not on standalone. The relevant difference between YARN

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Sandy Ryza
and throw exception if the code is wrapped in security manager. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza sandy.r

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Sandy Ryza
+1 On Tue, May 20, 2014 at 5:26 PM, Andrew Or and...@databricks.com wrote: +1 2014-05-20 13:13 GMT-07:00 Tathagata Das tathagata.das1...@gmail.com: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few bug fixes on top of rc9: SPARK-1875:

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sandy Ryza
Reflection lets you pick the ClassLoader, yes. I would not call setContextClassLoader. On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza sandy.r...@cloudera.com wrote: I spoke with DB offline about this a little while ago and he confirmed that he was able to access

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Sandy Ryza
an object in that way. Since the jars are already in distributed cache before the executor starts, is there any reason we cannot add the locally cached jars to classpath directly? Best, Xiangrui On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza sandy.r...@cloudera.com wrote: I spoke with DB offline

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Sandy Ryza
+1 (non-binding) * Built the release from source. * Compiled Java and Scala apps that interact with HDFS against it. * Ran them in local mode. * Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode. On Tue, May 13, 2014 at 9:09 PM, witgo wi...@qq.com

Re: Apache Spark running out of the spark shell

2014-05-03 Thread Sandy Ryza
Hi AJ, You might find this helpful - http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/ -Sandy On Sat, May 3, 2014 at 8:42 AM, Ajay Nair prodig...@gmail.com wrote: Hi, I have written a code that works just about fine in the spark shell on EC2. The ec2

Re: Any plans for new clustering algorithms?

2014-04-22 Thread Sandy Ryza
are under spark/docs. You can submit a PR for changes. -Xiangrui On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote: How do I get permissions to edit the wiki? On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com

Re: all values for a key must fit in memory

2014-04-21 Thread Sandy Ryza
Regards, Mridul On Mon, Apr 21, 2014 at 6:25 AM, Sandy Ryza sandy.r...@cloudera.com wrote: The issue isn't that the Iterator[P] can't be disk-backed. It's that, with a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read into memory at once. The ShuffledRDD

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
of this), scalable and parallelizable, well documented and with reasonable expectation of dev support Sent from my iPhone On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote: If it's not done already, would it make sense to codify this philosophy somewhere? I imagine

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
/docs. You can submit a PR for changes. -Xiangrui On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote: How do I get permissions to edit the wiki? On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote: Cannot agree more with your words. Could you add

Re: all values for a key must fit in memory

2014-04-20 Thread Sandy Ryza
An iterator does not imply data has to be memory resident. Think merge sort output as an iterator (disk backed). Tom is actually planning to work on something similar with me on this hopefully this or next month. Regards, Mridul On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza sandy.r

Re: Suggestion

2014-04-11 Thread Sandy Ryza
Hi Priya, Here's a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -Sandy On Fri, Apr 11, 2014 at 12:05 PM, priya arora arora.priya4...@gmail.com wrote: Hi, May I know how one can contribute in this project http://spark.apache.org/mllib/ or in

Re: cloudera repo down again - mqtt

2014-03-14 Thread sandy . ryza
Our guys are looking into it. I'll post when things are back up. -Sandy On Mar 14, 2014, at 7:37 AM, Tom Graves tgraves...@yahoo.com wrote: It appears the cloudera repo for the mqtt stuff is down again. Did someone ping them the last time? Can we pick this up from some other repo?

Re: Assigning JIRA's to self

2014-03-12 Thread Sandy Ryza
In the mean time, you don't need to wait for the task to be assigned to you to start work. If you're worried about someone else picking it up, you can drop a short comment on the JIRA saying that you're working on it. On Wed, Mar 12, 2014 at 3:25 PM, Konstantin Boudnik c...@apache.org wrote:

Re: YARN Maven build questions

2014-03-04 Thread Sandy Ryza
Hi Lars, Unfortunately, due to some incompatible changes we pulled in to be closer to YARN trunk, Spark-on-YARN does not work against CDH 4.4+ (but does work against CDH5) -Sandy On Tue, Mar 4, 2014 at 6:33 AM, Tom Graves tgraves...@yahoo.com wrote: What is your question about Any hints?

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Sandy Ryza
@patrick - It seems like my point about being able to inherit the root pom was addressed and there's a way to handle this. The larger point I meant to make is that Maven is by far the most common build tool in projects that are likely to share contributors with Spark. I personally know 10 people