Hi,
I have compiled a list (from online sources) of knobs/design considerations
that need to be taken care of by applications running on Spark Streaming.
Is my understanding correct? Are there any other important design
considerations I should take care of?
- A DStream is associated with a single
(Marcelo you might have some insight on this one)
Warning: this may just be because I'm doing something non-standard --
trying to embed Spark in a Java app and feeding it all the classpath it
needs manually. But this was surprising enough I wanted to ask.
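Concretely, the setup looks roughly like this (a simplified sketch in Scala;
the master, app name, and jar path below are placeholders rather than my real
wiring):

import org.apache.spark.{SparkConf, SparkContext}

// Simplified sketch: building a SparkContext inside a host application and
// handing it jars explicitly. The path is a placeholder, not the real list.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("embedded-spark")
  .setJars(Seq("/path/to/spark-streaming-kafka-assembly.jar"))
val sc = new SparkContext(conf)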
I have an app that includes among other things
Hi List,
In the documentation on Connecting Spark to Mesos
(http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos),
would it be possible to expand in more detail on the step "Create a binary
package using make-distribution.sh --tgz"? When we use a custom-compiled
Thanks for the prompt feedback; I have further expanded on your
suggestions in this JIRA:
https://issues.apache.org/jira/browse/SPARK-7754
On Tue, May 19, 2015 at 8:35 PM, Michael Armbrust
mich...@databricks.com wrote:
Overall this seems like a reasonable proposal to me. Here are a few
Signature, hashes, LICENSE/NOTICE, and source tarball look OK. I built
for Hadoop 2.6 (-Pyarn -Phive -Phadoop-2.6) on Ubuntu from source and
tests pass. The release looks OK except that I'd like to resolve the
Blockers before giving a +1.
I'm seeing some test failures, and wanted to cross-check with
Hey Ram,
I'm not speaking to Tarek's package specifically but to the spirit of
MLlib. There are a number of methods/algorithms for PCA; I'm not sure by
what criterion the current one is considered 'standard'.
It is rare to find ANY machine learning algo that is 'clearly better' than
any other.
Hmm... this seems to be particular to logging (KafkaRDD.scala:89 in my tree
is a log statement). I'd expect KafkaRDD to be loaded from the system class
loader - or are you repackaging it in your app?
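One quick way to check where the class is actually coming from (just a
diagnostic sketch; it assumes the stock package name and an unrelocated jar):

// Compare which class loader served KafkaRDD vs. the application's own classes.
val kafkaRddClass = Class.forName("org.apache.spark.streaming.kafka.KafkaRDD")
println("KafkaRDD loader:  " + kafkaRddClass.getClassLoader)
println("App class loader: " + getClass.getClassLoader)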
I'd have to investigate more to come up with an accurate explanation here...
but it seems that the
Hi Trevor
Good point; I didn't mean that some algorithm has to be clearly better than
another in every scenario to be included in MLlib. However, even if someone
is willing to be the maintainer of a piece of code, it does not make sense
to accept every possible algorithm into the core library.
Hi Trevor
I'm linking to the MLlib contribution guideline here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
It speaks to widely known and accepted algorithms but not to whether an
algorithm has to be better than
Hi,
IndexedRowMatrix and RowMatrix both take an RDD[Vector]. Is it possible
for that RDD to intermix dense and sparse vectors? Basically, I am
considering a GEMV flow when the IndexedRowMatrix has its dense flag set to
true, and a dot flow otherwise...
Thanks.
Deb
I believe it works with a mix of DenseVector and SparseVector types.
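For example, something like this should work (a quick sketch, assuming an
existing SparkContext sc; the values are arbitrary):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A RowMatrix whose rows mix DenseVector and SparseVector (same length).
val rows = sc.parallelize(Seq[Vector](
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.sparse(3, Seq((0, 3.0), (2, 4.0)))
))
val mat = new RowMatrix(rows)
println((mat.numRows(), mat.numCols()))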
Joseph
On Wed, May 20, 2015 at 10:06 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
IndexedRowMatrix and RowMatrix both take an RDD[Vector]. Is it
possible for that RDD to intermix dense and sparse
Correcting the ones that are incorrect or incomplete, BUT this is a good
list of things to remember about Spark Streaming.
On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com
wrote:
Hi,
I have compiled a list (from online sources) of knobs/design
considerations that need to
Hi Trevor,
I may be repeating what Ram said, but to second it, a few points:
We do want MLlib to become an extensive and rich ML library; as you said,
scikit-learn is a great example. To make that happen, we of course need to
include important algorithms. 'Important' is hazy, but roughly means
Hi Jeremy:
Row is a collection of 'Any', so it can be used as a recursive data type.
Is this what you were looking for?
Example:

import org.apache.spark.sql.Row

val x = sc.parallelize(Array.range(0, 10)).map(x => Row(Row(x), Row(x.toString)))
Rakesh
On Wed, May 20, 2015 at 7:23 PM Jeremy Lucas jeremyalu...@gmail.com wrote:
Hey Rakesh,
To clarify, what I was referring to is when doing something like this:
sqlContext.applySchema(rdd, mySchema)
mySchema must be a well-defined StructType, which presently does not allow
for a recursive type.
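For instance (hypothetical field names, just to show the shape of what
applySchema accepts; assumes sc and sqlContext are in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// A flat, fully specified schema is fine; a StructType that refers to itself
// is not expressible.
val mySchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val rdd = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))
val df = sqlContext.applySchema(rdd, mySchema)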
On Wed, May 20, 2015 at 5:39 PM Rakesh Chalasani vnit.rak...@gmail.com
Spark SQL has proven to be quite useful in applying a partial schema to
large JSON logs and being able to write plain SQL to perform a wide variety
of operations over this data. However, one small thing that keeps coming
back to haunt me is the lack of support for recursive data types, whereby a
Hi,
Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I have a
data point regarding the performance of Group By, indicating there's
excessive GC and it's impacting the throughput. I want to know if the new
memory manager for aggregations
Does this turn codegen on? I think the performance is fairly different when
codegen is turned on.
For 1.5, we are investigating having codegen on by default, so users get
much better performance out of the box.
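If you want to try it now, codegen can be switched on through the SQL conf (a
sketch for the 1.4-era setting, assuming a SQLContext named sqlContext):

// Enable expression code generation (off by default before 1.5).
sqlContext.setConf("spark.sql.codegen", "true")
// Or, equivalently, from SQL:
sqlContext.sql("SET spark.sql.codegen=true")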
On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri pramodbilig...@gmail.com
wrote:
It is a lot of columns, but I'm not sure if that's why it is running out of
memory. In Spark SQL, we are not yet doing external aggregation when the
number of keys in the aggregation hashmap is large. We will fix this and
have external aggregation in 1.5.
On Tue, May 19, 2015 at 2:43 AM,
-1
Discovered I accidentally removed the Master/Worker JSON endpoints; will
restore:
https://issues.apache.org/jira/browse/SPARK-7760
On Tue, May 19, 2015 at 11:10 AM, Patrick Wendell pwend...@gmail.com
wrote:
Please vote on releasing the following candidate as Apache Spark version
1.4.0!
The