It depends on what stack you want to run. A quick cut:
- Worker machines (DataNode, HBase RegionServers, Spark worker nodes):
  - Dual 6-core CPU
  - 64 to 128 GB RAM
  - 3 x 3 TB disks (JBOD)
- Master node (NameNode, HBase Master, Spark Master):
  - Dual 6-core CPU
  - 64
I couldn't find the classification.SVM class.
- Most probably the command is something of the order of:
  bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv
- For more details
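Alternatively, one can call MLlib directly from the shell. A minimal, untested sketch, assuming LIBSVM-formatted input (the path below is the sample file shipped with the Spark distribution; the iteration count is arbitrary):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

// Load an RDD[LabeledPoint] from a LIBSVM-format file
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Train a linear SVM with SGD; 100 iterations as a starting point
val model = SVMWithSGD.train(data, 100)
// (predicted label, true label) pairs for a quick sanity check
val scores = data.map(p => (model.predict(p.features), p.label))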
Carter,
Just as a quick, simple starting point for Spark (caveats: lots of
improvements required for scaling, graceful and efficient handling of RDDs
et al):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection.immutable.ListMap
import
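A fuller minimal version along those lines, as an untested sketch (the input path and the sort-by-count at the end are assumptions; ListMap just preserves the sorted order after collecting to the driver):

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.immutable.ListMap

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("starter"))
// Classic word count; "data/input.txt" is a placeholder path
val counts = sc.textFile("data/input.txt")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)
  .collect()
// Sort by descending count on the driver
val sorted = ListMap(counts.sortBy(-_._2): _*)
sorted.take(10).foreach(println)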
Nicholas,
Good question. Couple of thoughts from my practical experience:
- Coming from R, Scala feels more natural than other languages. The
functional succinctness of Scala is more suited for Data Science than
other languages. In short, Scala-Spark makes sense for Data Science,
- Original Message -
From: Krishna Sankar ksanka...@gmail.com
To: user@spark.apache.org
Sent: Wednesday, June 4, 2014 8:52:59 AM
Subject: Re: Trouble launching EC2 Cluster with Spark
One reason could be that the keys are in a different region. Need to create
the keys in us-east-1 (N. Virginia).
Shahab,
Interesting question. Couple of points (based on the information from
your e-mail)
1. One can support the use case in Spark as a set of transformations on
a WIP RDD over a span of time, with the final transformation outputting to a
processed RDD
- Spark streaming would be
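To make the idea concrete, a minimal, untested stateful-streaming sketch (the socket source, checkpoint path and 60-second batch interval are all hypothetical; updateStateByKey is what carries the work-in-progress state across batches):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60)) // sc: existing SparkContext
ssc.checkpoint("/tmp/checkpoint") // stateful ops need a checkpoint dir
// Hypothetical source of "key,value" lines
val records = ssc.socketTextStream("localhost", 9999)
val wip = records
  .map(line => (line.split(",")(0), 1L))
  .updateStateByKey[Long]((batch: Seq[Long], prev: Option[Long]) =>
    Some(prev.getOrElse(0L) + batch.sum)) // running per-key totals
wip.print()
ssc.start()
ssc.awaitTermination()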
Project > Properties > Java Build Path > Add External Jars
Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
Cheers
K/
On Sun, Jun 8, 2014 at 8:06 AM, Carter gyz...@hotmail.com wrote:
Hi All,
I just downloaded the Scala IDE for Eclipse. After I created a Spark
project
and
Yep, it gives tons of errors. I was able to make it work with sudo. Looks
like an ownership issue.
Cheers
k/
On Tue, Jun 10, 2014 at 6:29 PM, zhen z...@latrobe.edu.au wrote:
I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I
do not seem to be able to start the history
Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a csv file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD)
works fine locally. Now I can do groupByKey et al. Am not sure if it is
scalable & memory-efficient for millions of records.
Cheers
k/
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote:
Hi
And got the first cut:
val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
gives the total & unique counts.
The question: is it scalable & efficient? Would appreciate insights.
Cheers
k/
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com
wrote:
Ian,
Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a
wrapper around the
com.clearspring.analytics.stream.cardinality.HyperLogLogPlus.
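For reference, a minimal sketch of its use (the pairs below are a toy stand-in for the (d, a) data discussed earlier; 0.01 is the target relative accuracy):

// Toy (key, attribute) pairs
val pairs = sc.parallelize(Seq(
  ("d1", "a1"), ("d1", "a2"), ("d1", "a1"), ("d2", "a3"), ("d2", "a1")))
// Approximate distinct attribute count per key, ~1% relative error
val approxUnique = pairs.countApproxDistinctByKey(0.01)
approxUnique.collect().foreach(println)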
Cheers
k/
On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell i...@ianoconnell.com wrote:
Depending on your requirements when doing hourly
Mahesh,
- One direction could be: create a Parquet schema, then convert & save the
records to HDFS.
- This might help
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
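On Spark 1.4+ the DataFrame writer is a simpler route than the output-format plumbing in that example. A minimal, untested sketch with a hypothetical case class and output path:

import org.apache.spark.sql.SQLContext

case class Record(d: String, c: String, a: String) // hypothetical schema

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Toy records; in practice this would be the parsed input RDD
val df = sc.parallelize(Seq(Record("d1", "c1", "a1"))).toDF()
df.write.parquet("hdfs:///tmp/records.parquet") // placeholder HDFS path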
Cheers
k/
On Tue, Jun 17, 2014 at 12:52 PM,
Hi,
- I have seen similar behavior before. As far as I can tell, the root
cause is an out-of-memory error - verified this by monitoring the memory.
- I had a 30 GB file and was running on a single machine with 16 GB,
so I knew it would fail.
- But instead of raising an
Konstantin,
1. You need to install the Hadoop rpms on all nodes. If it is Hadoop 2,
the nodes would have HDFS & YARN.
2. Then you need to install Spark on all nodes. I haven't had experience
with HDP, but the tech preview might have installed Spark as well.
3. In the end, one should
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark on
all the nodes irrespective of Hadoop/YARN.
Cheers
k/
On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote:
I have a Spark app which runs well on local master. I'm now ready to
put it on a
One vector to check is the HBase libraries in the --jars, as in:
spark-submit --class <your class> --master <master url> --jars
Probably you have - if not, try a very simple app in the docker container
and make sure it works. Sometimes resource contention/allocation can get in
the way. This happened to me in the YARN container.
Also try a single worker thread.
Cheers
k/
On Sat, Jul 19, 2014 at 2:39 PM, boci
Guys,
- Need help in terms of the interesting features coming up in MLlib 1.2.
- I have a 2-part, ~3-hr hands-on tutorial at the Big Data Tech Con
- The Hitchhiker's Guide to Machine Learning with Python & Apache
Spark [2]
- At minimum, it would be good to take the last 30 min
binary packages and documentation can
be easily found on spark.apache.org, which is important for making a
hands-on tutorial.
Best,
Xiangrui
On Sat, Sep 27, 2014 at 12:15 PM, Krishna Sankar ksanka...@gmail.com
wrote:
Guys,
Need help in terms of the interesting features coming up in MLlib
to be 0.1 or 0.01?
Best,
Burak
- Original Message -
From: Krishna Sankar ksanka...@gmail.com
To: user@spark.apache.org
Sent: Wednesday, October 1, 2014 12:43:20 PM
Subject: MLlib Linear Regression Mismatch
Guys,
Obviously I am doing something wrong. Maybe 4 points are too small
Hi,
I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia.
This step only adds the support for Hadoop, YARN, Hive et al in the Spark
executable. No need to run it if one is not using them.
Cheers
k/
On Thu, Oct 2, 2014 at 12:29 PM, danilopds danilob...@gmail.com wrote:
Hi
Well done guys. MapReduce sort at that time was a good feat and Spark now
has raised the bar with the ability to sort a PB.
Like some of the folks in the list, a summary of what worked (and didn't)
as well as the monitoring practices would be good.
Cheers
k/
P.S: What are you folks planning next ?
Adding to already interesting answers:
- Is there any case where MR is better than Spark? I don't know in which
cases I should use Spark rather than MR. When is MR faster than Spark?
- Many. MR would be better (am not saying faster ;o)) for
- Very large dataset,
- Multistage
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might not be as important, as the Spark framework mediates the
distributed computation. Even if there is some declarative part in Spark,
we can still choose an inefficient computation path that is not
A very timely article
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
Cheers
k/
P.S: Now reply to ALL.
On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar ksanka...@gmail.com wrote:
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might
a) There is no absolute RMSE - it depends on the domain. Also, RMSE is the
error based on what you have seen so far, a snapshot of a slice of the
domain.
b) My suggestion is to put the system in place, see what happens when users
interact with the system, and then you can think of reducing the RMSE as
Stephen,
Scala 2.11 worked fine for me. Did the dev change and then compiled. Not
using it in production, but I go back and forth between 2.10 & 2.11.
Cheers
k/
On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman
stephen.haber...@gmail.com wrote:
Hey,
I recently compiled Spark master against
Guys,
registerTempTable("Employees")
gives me the error
Exception in thread "main" scala.ScalaReflectionException: class
org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial
classloader with boot classpath
I am also looking at this domain. We could potentially use the broadcast
capability in Spark to distribute the parameters. Haven't thought thru yet.
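As a rough illustration of the broadcast idea (the parameter vector and scoring function here are made up):

// Read-only model parameters, shipped once to every executor
val params = sc.broadcast(Array(0.1, 0.5, 0.9))
val data = sc.parallelize(1 to 100)
// Tasks read the local broadcast copy instead of re-shipping per task
val scored = data.map(x => x * params.value.sum)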
Cheers
k/
On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote:
Does it make sense to use Spark's actor system (e.g. via
Interestingly Google Chrome translates the materials.
Cheers
k/
On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas vcsub...@gmail.com wrote:
I do not understand Chinese but the diagrams on that page are very helpful.
On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote:
A good
Alec,
Good questions. Suggestions:
1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer,
Cache, Queue, App Server, App (Interface), App (backend ML) et al.
2. Then slot in the appropriate technologies - maybe even multiple
technologies for the same layer and
Yep, the command option is gone. No big deal; just add the '%pylab
inline' command
as part of your notebook.
Cheers
k/
On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com wrote:
Hello:
I tried the IPython notebook with the following command in my environment.
Without knowing the data size, computation & storage requirements ... :
- Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine.
Probably 5-10 machines.
- Don't go for the most exotic machines, otoh don't go for cheapest ones
either.
- Find a sweet spot with your
1. The RMSE varies a little bit between the versions.
2. Partitioned the training, validation, test sets like so:
- training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
- validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and
(x[3] % 10) < 8)
- test =
to set lambda to 0.1. -Xiangrui
On Mon, Feb 23, 2015 at 3:06 PM, Krishna Sankar ksanka...@gmail.com
wrote:
The RMSE varies a little bit between the versions.
Partitioned the training, validation, test sets like so:
training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
validation
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair
being the key) would work - looks like a MapReduce-with-combiners
problem. I think reduceByKey would use combiners while aggregateByKey
wouldn't.
- Could we optimize this further by using combineByKey directly? A rough sketch follows.
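Something along these lines, as an untested sketch (toy pairs standing in for the real keys):

val pairs = sc.parallelize(Seq((("a", "b"), 1), (("a", "b"), 1), (("a", "c"), 1)))
val counts = pairs.combineByKey(
  (v: Int) => v,                 // createCombiner: first value for a key
  (acc: Int, v: Int) => acc + v, // mergeValue: fold within a partition
  (a: Int, b: Int) => a + b)     // mergeCombiners: merge across partitions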
Do you have commons-csv-1.1-bin.jar in your path somewhere? I had to
download and add it.
Cheers
k/
On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer beancinemat...@gmail.com
wrote:
Afternoon all,
I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via:
`mvn -Pyarn
Thanks Olivier. Good work.
Interesting in more than one way - including training, benchmarking,
testing new releases et al.
One quick question - do you plan to make it available as an S3 bucket?
Cheers
k/
On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc
wrote:
Dear Spark
Error - ImportError: No module named Row
Cheers & enjoy the long weekend
k/
- use .cast(...).alias('...') after the DataFrame is read.
- sql.functions.udf for any domain-specific conversions (a sketch of both follows).
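A minimal, untested sketch, with hypothetical column names:

import org.apache.spark.sql.functions.{col, udf}

// df: DataFrame read via spark-csv, every column still a string (assumed)
val typed = df.select(
  col("name"),
  col("age").cast("int").alias("age"),
  col("salary").cast("double").alias("salary"))
// A domain-specific conversion wrapped as a UDF
val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
val cleaned = typed.withColumn("name", toUpper(col("name")))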
Cheers
k/
On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi experts!
I am using spark-csv to load csv data into a dataframe. By default it
You can predict and then zip the result with the points RDD to get
approximately the same as a LabeledPoint.
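A minimal, untested sketch of that (toy data; k and the iteration count are arbitrary):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Toy labeled points; KMeans clusters on the features only
val labeled = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 0.1)),
  LabeledPoint(0.0, Vectors.dense(9.0, 8.0))))
val model = KMeans.train(labeled.map(_.features), 2, 20) // k = 2, 20 iters
// Pair each original point with its cluster id: (LabeledPoint, clusterId)
val assigned = labeled.zip(model.predict(labeled.map(_.features)))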
Cheers
k/
On Thu, May 21, 2015 at 6:19 PM, anneywarlord anneywarl...@gmail.com
wrote:
Hello,
New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
in org.apache.spark.mllib.clustering.KMeans.
Interesting. Looking at the definitions, sql.functions.pow is defined only
for (col,col). Just as an experiment, create a column with value 2 and see
if that works.
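Concretely, something like this (df and the column name are hypothetical; lit builds the constant Column):

import org.apache.spark.sql.functions.{col, lit, pow}

// lit(2) makes a constant Column, matching pow's (Column, Column) signature
val squared = df.withColumn("x_squared", pow(col("x"), lit(2)))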
Cheers
k/
On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro rcors...@gmail.com wrote:
1.4 and I did set the second parameter. The DSL
A few points to consider:
a) SparkR gives the union of R_in_a_single_machine and the
distributed_computing_of_Spark:
b) It also gives the ability to wrangle with data in R, that is in the
Spark eco system
c) Coming to MLlib, the question is MLlib and R (not MLlib or R) -
depending on the scale,
Looks like reduceByKey() should work here.
Cheers
k/
On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna
leonida.gianfa...@gmail.com wrote:
Thanks a lot oubrik,
I got your point; my consideration is that sum() should already be a
built-in function for iterators in Python.
Anyway I tried
>> finished.
>>
>> 2015-09-23 1:31 GMT+02:00 Zhan Zhang <zzh...@hortonworks.com>:
>>
>>> Hi Krishna,
>>>
>>> For the time being, you can download from upstream, and it should be
>>> running OK for HDP 2.3. For HDP-specific problems, you
Hi all,
Just wanted to thank all for the dataset API - most of the time we see
only bugs in these lists ;o).
- Putting some context, this weekend I was updating the SQL chapters of
my book - it had all the ugliness of SchemaRDD,
registerTempTable, take(10).foreach(println)
and
Good question. It comes down to computational complexity, computational scale
and data volume.
1. If you can store the data in a single server or a small cluster of db
servers (say MySQL), then HDFS/Spark might be overkill
2. If you can run the computation/process the data on a single
This intrigued me as well.
- Just to be sure, I downloaded the 1.6.2 code and recompiled.
- spark-shell and pyspark both show 1.6.2 as expected.
Cheers
On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> Another possible explanation is that by
Thanks Nick. I also ran into this issue.
VG, one workaround is to drop the NaN from predictions (df.na.drop()) and
then use the dataset for the evaluator; a sketch follows. In real life, probably detect the
NaN and recommend the most popular items over some window.
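A minimal, untested sketch (predictions is the hypothetical output of ALSModel.transform; column names assumed):

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Drop rows where ALS produced NaN (e.g. cold-start users/items)
val cleaned = predictions.na.drop(Seq("prediction"))
val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
  .evaluate(cleaned)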
HTH.
Cheers
On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
Hi,
Looks like the test-dataset has different sizes for X & Y. Possible steps:
1. What is the test-data-size ?
- If it is 15,909, check the prediction variable vector - it is now
29,471, should be 15,909
- If you expect it to be 29,471, then the X Matrix is not right.
that does the data split and the
datasets where they are allocated to.
Cheers
On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
> Hi,
> Looks like the test-dataset has different sizes for X & Y. Possible
> steps:
>
>1. W