Hi guys
I am playing with Spark, and I was wondering if there is a way to share an RDD
across multiple implementations in a decoupled way,
i.e. assuming I have an RDD that comes from a stream in Spark Streaming, I want to
be able to store the same stream in two different S3 folders using two
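One decoupled way to fan a single DStream out to two S3 folders is to register two independent output operations on it. A minimal sketch, assuming a hypothetical socket source and made-up bucket/folder names:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: the source, bucket, and folder names are hypothetical.
val conf = new SparkConf().setAppName("dual-s3-sink")
val ssc = new StreamingContext(conf, Seconds(60))
val stream = ssc.socketTextStream("localhost", 9999)

// Two writes on the same DStream: each batch's RDD goes to both
// prefixes, and neither sink needs to know about the other.
stream.foreachRDD { (rdd, time) =>
  rdd.saveAsTextFile(s"s3n://my-bucket/folder-a/${time.milliseconds}")
  rdd.saveAsTextFile(s"s3n://my-bucket/folder-b/${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()
```

Each output operation is scheduled independently per batch, which keeps the two destinations decoupled.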
Ah of course. Great explanation. So I suppose you should see the desired
results with lambda = 0, although you don't generally want to set this
to 0.
On Wed, Nov 26, 2014 at 7:53 PM, Xiangrui Meng men...@gmail.com wrote:
The training RMSE may increase due to regularization. Squared loss
only
Hi,
I set up a Maven environment on a Linux machine and was able to build the POM file
in the Spark home directory. Each module was refreshed with its corresponding target
directory containing jar files.
What do I need to do in order to include all the libraries on the classpath?
Earlier, I used a single assembly jar file to
I have just read on the website that Spark 1.1.1 has been released, but when
I upgraded my project to use 1.1.1 I discovered that the artifacts are not
on Maven yet.
[info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...
[warn] module not found:
Thanks a lot for your time guys and your quick replies!
On Nov 26, 2014, at 7:53 PM, Xiangrui Meng men...@gmail.com wrote:
The training RMSE may increase due to regularization. Squared loss
only represents part of the global loss. If you watch the sum of the
squared loss and the
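For reference, the point about regularization can be made explicit: the global loss being minimized is the squared loss plus a regularization penalty, so the squared-loss term alone (and hence the training RMSE) can rise between iterations while the global loss still falls. In generic notation (mine, not from the thread):

```latex
\min_{U,V} \;
\underbrace{\sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^\top v_j \right)^2}_{\text{squared loss}}
\; + \;
\underbrace{\lambda \left( \sum_i \|u_i\|^2 + \sum_j \|v_j\|^2 \right)}_{\text{regularization}}
```

With lambda = 0 the global loss and the squared loss coincide, which is why the RMSE would then be expected to decrease monotonically (though, as noted earlier in the thread, you generally don't want lambda = 0).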
Hi TD,
We also struggled with this error for a long while. The recurring scenario
is when the job takes longer to compute than the job interval and a backlog
starts to pile up.
Hint: Check
If the DStream storage level is set to MEMORY_ONLY_SER and memory runs
out, then you will get a 'Cannot
Hi,
I have run into some trouble while converting the code to Java.
I have done the matrix operations as directed and tried to find the maximum
score for each category, but the predicted category is mostly different from
the prediction made by MLlib.
I am fetching iterators of the
Hi,
When I start the Thrift server, I get the following exception. Could
anyone help me with this?
I placed hive-site.xml in the $SPARK_HOME/conf folder and set the property
hive.metastore.sasl.enabled to 'false'.
org.apache.hive.service.ServiceException: Unable to login to kerberos with
Hi,
I just tried to submit an application from the GraphX examples directory, but it
failed:
yifan2:bin yifanli$ MASTER=local[*] ./run-example graphx.PPR_hubs
java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs
at
Hi,
I'm looking to do an iterative algorithm implementation with data coming in
from Cassandra. This might be a use case for GraphX; however, the IDs are
non-integral, and I would like to avoid a mapping (for now). I'm doing a simple
hubs-and-authorities HITS implementation, and the current
No, the feature vector is not converted. It contains the count n_i of how
often each term t_i occurs (or a TF-IDF transformation of those). You
are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
maximized.
In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...
So your n_1 counts (or
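The argmax just described can be sketched in a few lines of plain Scala (the function name and the toy numbers below are mine, purely illustrative):

```scala
import scala.math.log

// Pick the class c maximizing log(P(c)) + sum_i n_i * log(P(t_i | c)).
def predict(logPrior: Array[Double],           // log P(c), one per class
            logCondProb: Array[Array[Double]], // log P(t_i | c), per class, per term
            counts: Array[Double]): Int = {    // n_i: term counts (or TF-IDF weights)
  val scores = logPrior.indices.map { c =>
    logPrior(c) + counts.zip(logCondProb(c)).map { case (n, lp) => n * lp }.sum
  }
  scores.indexOf(scores.max)
}

// Toy example: two classes, two terms; class 0 strongly favors term 0.
val logPrior = Array(log(0.5), log(0.5))
val logCondProb = Array(Array(log(0.9), log(0.1)),
                        Array(log(0.1), log(0.9)))
predict(logPrior, logCondProb, Array(3.0, 1.0)) // term 0 dominates, so class 0
```

Working in log space this way avoids the underflow you would get from multiplying many small probabilities.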
Hi Hao,
I'm using an inner join, as broadcast join didn't work for left joins (thanks
for the links to the latest improvements).
And I'm using HiveContext, and it worked in a previous build (10/12) when
joining 15 dimension tables.
Jianshi
On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao
Hi,
We are currently running our Spark + Spark Streaming jobs on Mesos,
submitting our jobs through Marathon.
We see with some regularity that the Spark Streaming driver gets killed by
Mesos and then restarted on some other node by Marathon.
I've no clue why Mesos is killing the driver and
Hi folks,
Does anyone know how I can calculate the percentile of each element of a
variable in an RDD? I tried to calculate it through Spark SQL with subqueries,
but I think that is impossible in Spark SQL. Any ideas are welcome.
Thanks in advance,
Franco Barrientos
Data Scientist
Málaga
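One rank-based way to get at this: sort the values, and each element's percentile is its rank divided by the total count. The sketch below is plain Scala for clarity (the function name and percentile convention are my own); on an RDD the same ranks can be obtained with sortBy followed by zipWithIndex.

```scala
// Percentile of x = 100 * (rank of x) / n, using each value's position
// in the sorted order (one common convention among several).
def percentileRanks(xs: Seq[Double]): Map[Double, Double] = {
  val n = xs.size.toDouble
  xs.sorted.zipWithIndex.map { case (x, i) =>
    x -> 100.0 * (i + 1) / n
  }.toMap // for duplicate values, the highest rank wins in this sketch
}

percentileRanks(Seq(10.0, 20.0, 30.0, 40.0)) // 20.0 maps to 50.0, 40.0 to 100.0
```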
Hi,
I'm trying to use the breeze library in the spark scala shell, but I'm
running into the same problem documented here:
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html
As I'm using the shell, I don't have a pom.xml, so the
I have used Breeze fine with the Scala shell:
scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.
We're training a recommender with ALS in MLlib 1.1 against a dataset of
150M users and 4.5K items, with the total number of training records being
1.2 billion (~30 GB of data). The input data is spread across 1200 partitions
on HDFS. For the training, rank=10, and we've configured {number of user
data
Gerard,
That is a good observation. However, the strange thing I see is that if I use
MEMORY_AND_DISK_SER, the job fails even earlier. In my case, it takes 10
seconds to process the data of each batch, and the batch interval is one
minute. It still fails after 10 hours with the 'cannot compute split' error.
Bill
On Thu,
Yeah, only a few hours after I sent my message I saw some correspondence on
this other thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html,
which is the exact same issue. Glad to find that
If it regularly fails after 8 hours, then could you get me the log4j logs?
To limit the size, set the default log level to WARN and the log level for
all classes in the package o.a.s.streaming to DEBUG. Then I can take a look.
On Nov 27, 2014 11:01 AM, Bill Jay bill.jaypeter...@gmail.com wrote:
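The log-level setup described above translates to a conf/log4j.properties along these lines (a sketch; the console appender name is assumed from the default Spark template):

```properties
# Default: warnings and above only, to keep the logs small
log4j.rootCategory=WARN, console
# Debug-level detail just for the streaming package (o.a.s.streaming)
log4j.logger.org.apache.spark.streaming=DEBUG
```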
Thanks Ankur, it's really helpful. I have a few queries on optimization
techniques. Currently I used the RandomVertexCut partition strategy.
But which partition strategy should be used if I have:
1. A very large number of edges in the edge-list file, like 50,000,000, where
there are many parallel edges between the same pairs of vertices
2. No
When new data comes in on a stream, Spark uses the streaming classes to
convert it into an RDD, and, as you mention, this is followed by transformations
and finally an action. As far as I have experienced, all RDDs remain in memory
until the user destroys them or the application stops.
On 26 November 2014 at
Hi,
I was just going through two pieces of code in GraphX, namely SVDPlusPlus and
TriangleCount. In the first I see an RDD as an input to run, i.e. run(edges:
RDD[Edge[Double]],...), and in the other I see run(VD:..., ED:...).
Can anyone explain the difference between these two? In fact, SVDPlusPlus
is the
Hi Jianshi,
I couldn't reproduce that with the latest master, and I always get a
BroadcastHashJoin for managed tables (in .csv files) in my testing. Are there
any external tables in your case?
In general, there are probably a couple of things you can try first (with HiveContext):
1) ANALYZE TABLE xxx
Hi,
I am evaluating Spark for an analytic component where we do batch
processing of data using SQL.
So, I am particularly interested in Spark SQL and in creating a SchemaRDD
from an existing API [1].
This API exposes elements in a database as datasources. Using the methods
allowed by this data
Hi,
The configuration you provide is just for accessing HDFS when you give an
HDFS path. When you provide an HDFS path with the HDFS nameservice, like
hmaster155:9000 in your case, it looks inside HDFS for the file. To access a
local file, just give the local path of the file. Go to the
Hi,
I am trying to compile Spark 1.1.0 on Windows 8.1, but I get the following
exception.
[info] Compiling 3 Scala sources to
D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes...
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26:
object sbt is