Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Hemant Bhanawat
Hi, I have compiled a list (from online sources) of knobs/design considerations that need to be taken care of by applications running on spark streaming. Is my understanding correct? Any other important design consideration that I should take care of? - A DStream is associated with a single

userClassPathFirst and loader constraint violation

2015-05-20 Thread Sean Owen
(Marcelo you might have some insight on this one) Warning: this may just be because I'm doing something non-standard -- trying to embed Spark in a Java app and feeding it all the classpath it needs manually. But this was surprising enough I wanted to ask. I have an app that includes among other things

Regarding Connecting spark to Mesos documentation

2015-05-20 Thread Meethu Mathew
Hi List, In the documentation of Connecting Spark to Mesos http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos, is it possible to modify and write in detail the step Create a binary package using make-distribution.sh --tgz ? When we use custom compiled
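For readers following along, the step in question might be expanded roughly as below. This is a hedged sketch of building a custom distribution for Mesos; the Maven profiles shown are assumptions that depend on your cluster's Hadoop version:

```shell
# Sketch only: build a binary Spark distribution from a source checkout.
# The profiles (-Phadoop-2.4, -Phive) are example assumptions; adjust to your setup.
./make-distribution.sh --tgz -Phadoop-2.4 -Phive -DskipTests
# This produces a spark-<version>-bin-*.tgz in the source root; upload it somewhere
# reachable by the Mesos slaves and point spark.executor.uri at that location.
```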

Re: [Catalyst] RFC: Using PartialFunction literals instead of objects

2015-05-20 Thread Edoardo Vacchi
Thanks for the prompt feedback; I have further expanded on your suggestions on this JIRA https://issues.apache.org/jira/browse/SPARK-7754 On Tue, May 19, 2015 at 8:35 PM, Michael Armbrust mich...@databricks.com wrote: Overall this seems like a reasonable proposal to me. Here are a few

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-20 Thread Sean Owen
Signature, hashes, LICENSE/NOTICE, source tarball looks OK. I built for Hadoop 2.6 (-Pyarn -Phive -Phadoop-2.6) on Ubuntu from source and tests pass. The release looks OK except that I'd like to resolve the Blockers before giving a +1. I'm seeing some test failures, and wanted to cross-check with

Re: Contribute code to MLlib

2015-05-20 Thread Trevor Grant
Hey Ram, I'm not speaking to Tarek's package specifically but to the spirit of MLlib. There are a number of methods/algorithms for PCA, and I'm not sure by what criterion the current one is considered 'standard'. It is rare to find ANY machine learning algo that is 'clearly better' than any other.

Re: userClassPathFirst and loader constraint violation

2015-05-20 Thread Marcelo Vanzin
Hmm... this seems to be particular to logging (KafkaRDD.scala:89 in my tree is a log statement). I'd expect KafkaRDD to be loaded from the system class loader - or are you repackaging it in your app? I'd have to investigate more to come with an accurate explanation here... but it seems that the
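For context on the thread, user-first classloading is toggled by configuration. A minimal sketch of the relevant (experimental) properties, assuming Spark 1.3+; verify the names against the config docs for the version in use:

```
# Assumed settings: let user-supplied jars take precedence over Spark's own classes
# on the driver and executors (formerly spark.files.userClassPathFirst).
spark.driver.userClassPathFirst=true
spark.executor.userClassPathFirst=true
```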

Re: Contribute code to MLlib

2015-05-20 Thread Ram Sriharsha
Hi Trevor, good point. I didn't mean that some algorithm has to be clearly better than another in every scenario to be included in MLlib. However, even if someone is willing to be the maintainer of a piece of code, it does not make sense to accept every possible algorithm into the core library.

Re: Contribute code to MLlib

2015-05-20 Thread Ram Sriharsha
Hi Trevor, I'm attaching the MLlib contribution guidelines here: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines They speak to widely known and accepted algorithms but not to whether an algorithm has to be better than

IndexedRowMatrix semantics

2015-05-20 Thread Debasish Das
Hi, For IndexedRowMatrix and RowMatrix, both take RDD[Vector]... is it possible that it has intermixed dense and sparse vectors... basically I am considering a gemv flow when IndexedRowMatrix has dense flag true, dot flow otherwise... Thanks. Deb

Re: IndexedRowMatrix semantics

2015-05-20 Thread Joseph Bradley
I believe it works with a mix of DenseVector and SparseVector types. Joseph On Wed, May 20, 2015 at 10:06 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, For IndexedRowMatrix and RowMatrix, both take RDD[Vector]... is it possible that it has intermixed dense and sparse
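To illustrate Joseph's point: since both vector types are `org.apache.spark.mllib.linalg.Vector`s, nothing stops a single matrix from mixing them. A hedged sketch (assumes a live SparkContext `sc`; not independently verified against a cluster):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// One dense row and one sparse row in the same IndexedRowMatrix.
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 0.0, 2.0)),            // dense storage
  IndexedRow(1L, Vectors.sparse(3, Array(1), Array(3.0)))  // sparse storage
))
val mat = new IndexedRowMatrix(rows)
// Downstream code can dispatch per row: a gemv-style path when a row is a
// DenseVector, a dot-product path when it is a SparseVector.
```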

Re: Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Tathagata Das
Correcting the ones that are incorrect or incomplete. BUT this is a good list of things to remember about Spark Streaming. On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com wrote: Hi, I have compiled a list (from online sources) of knobs/design considerations that need to

Re: Contribute code to MLlib

2015-05-20 Thread Joseph Bradley
Hi Trevor, I may be repeating what Ram said, but to 2nd it, a few points: We do want MLlib to become an extensive and rich ML library; as you said, scikit-learn is a great example. To make that happen, we of course need to include important algorithms. Important is hazy, but roughly means

Re: Representing a recursive data type in Spark SQL

2015-05-20 Thread Rakesh Chalasani
Hi Jeremy: Row is a collection of 'Any', so it can be used as a recursive data type. Is this what you were looking for? Example: val x = sc.parallelize(Array.range(0, 10)).map(x => Row(Row(x), Row(x.toString))) Rakesh On Wed, May 20, 2015 at 7:23 PM Jeremy Lucas jeremyalu...@gmail.com wrote:

Re: Representing a recursive data type in Spark SQL

2015-05-20 Thread Jeremy Lucas
Hey Rakesh, To clarify, what I was referring to is when doing something like this: sqlContext.applySchema(rdd, mySchema) mySchema must be a well-defined StructType, which presently does not allow for a recursive type. On Wed, May 20, 2015 at 5:39 PM Rakesh Chalasani vnit.rak...@gmail.com

Representing a recursive data type in Spark SQL

2015-05-20 Thread Jeremy Lucas
Spark SQL has proven to be quite useful in applying a partial schema to large JSON logs and being able to write plain SQL to perform a wide variety of operations over this data. However, one small thing that keeps coming back to haunt me is the lack of support for recursive data types, whereby a
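To make the limitation concrete: a self-referential schema cannot be expressed, because every StructField must carry a fully materialized, non-recursive DataType. A hypothetical illustration of what one would like to write for a JSON tree, which does not work:

```scala
import org.apache.spark.sql.types._

// Desired schema for JSON like {"value": 1, "children": [{"value": 2, "children": []}]}.
// The definition below is only to show the problem: evaluating `treeType` would
// recurse forever, since the type must refer to itself before it exists.
lazy val treeType: StructType = StructType(Seq(
  StructField("value", IntegerType),
  StructField("children", ArrayType(treeType)) // recursive reference: unsupported
))
```

In practice the recursion has to be broken by hand, e.g. by unrolling the schema to a fixed depth or flattening the tree into (id, parentId) rows.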

Low throughput and effect of GC in SparkSql GROUP BY

2015-05-20 Thread Pramod Biligiri
Hi, Somewhat similar to Daniel Mescheder's mail yesterday on SparkSql, I have a data point regarding the performance of Group By, indicating there's excessive GC and it's impacting the throughput. I want to know if the new memory manager for aggregations

Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-20 Thread Reynold Xin
Does this turn codegen on? I think the performance is fairly different when codegen is turned on. For 1.5, we are investigating having codegen on by default, so users get much better performance out of the box. On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri pramodbilig...@gmail.com wrote:
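For anyone wanting to reproduce the comparison, codegen in the 1.x line is gated by a SQL conf. A hedged sketch (the property name `spark.sql.codegen` matches the 1.3/1.4 docs; the `logs` table is a hypothetical stand-in for the benchmark data):

```scala
// Enable expression code generation before re-running the GROUP BY workload.
// spark.sql.codegen defaults to false in 1.3/1.4; verify the flag for your version.
sqlContext.setConf("spark.sql.codegen", "true")
sqlContext.sql("SELECT key, SUM(value) FROM logs GROUP BY key").collect()
```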

Re: Performance Memory Issues When Creating Many Columns in GROUP BY (spark-sql)

2015-05-20 Thread Reynold Xin
It is a lot of columns, but I'm not sure if that's why it is running out of memory. In Spark SQL, we are not yet doing external aggregation when the number of keys is large in the aggregation hashmap. We will fix this and have external aggregation in 1.5. On Tue, May 19, 2015 at 2:43 AM,

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-20 Thread Imran Rashid
-1: discovered I accidentally removed the master/worker JSON endpoints; will restore https://issues.apache.org/jira/browse/SPARK-7760 On Tue, May 19, 2015 at 11:10 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The