Hi,
I am trying to compile Spark 1.1.0 on Windows 8.1, but I get the following
exception.
[info] Compiling 3 Scala sources to
D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes...
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26:
object sbt is not
Hi,
The configuration you provided is only used to access HDFS when you give an
HDFS path. When you provide an HDFS path with the HDFS nameservice, like
hmaster155:9000 in your case, it goes into HDFS to look for the file. For
accessing a local file, just give the local path of the file. Go to the
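For illustration, a minimal sketch of the two kinds of paths (the file and folder names here are made up):

val local = sc.textFile("file:///tmp/input.txt") // read straight from the local filesystem
val onHdfs = sc.textFile("hdfs://hmaster155:9000/user/input") // resolved through the HDFS namenode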
Hi Franco,
The Hive percentile UDAF has been added in the master branch. You can have a
look at it. I think it would work like "select percentile(col_name, 1) from
sigmoid_logs".
Thanks
Best Regards
On Thu, Nov 27, 2014 at 8:58 PM, Franco Barrientos <
franco.barrien...@exalitica.com> wrote:
> Hi folks!,
>
Hi,
I am evaluating Spark for an analytic component where we do batch
processing of data using SQL.
So, I am particularly interested in Spark SQL and in creating a SchemaRDD
from an existing API [1].
This API exposes elements in a database as datasources. Using the methods
allowed by this data s
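In case it helps, a minimal sketch of building a SchemaRDD programmatically; rowsFromApi below is a hypothetical stand-in for rows pulled from such an external API, and the column names are invented:

import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
// Describe the columns the datasource exposes
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
// Stand-in for data coming out of the external API
val rowsFromApi = sc.parallelize(Seq((1, "a"), (2, "b")))
val rowRDD = rowsFromApi.map { case (id, name) => Row(id, name) }
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
schemaRDD.registerTempTable("elements")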
Hi Jianshi,
I couldn't reproduce that with the latest master, and I always get the
BroadcastHashJoin for managed tables (in .csv files) in my testing. Are there
any external tables in your case?
In general, a couple of things you can try first (with HiveContext; a rough
sketch follows below):
1) ANALYZE TABLE xxx
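Something along these lines (a sketch; "xxx" stands for the actual table name and hiveContext for your HiveContext instance):

// Collect rough table sizes so the planner can pick a broadcast join for the small side
hiveContext.sql("ANALYZE TABLE xxx COMPUTE STATISTICS noscan")
// Optionally raise the size (in bytes) below which a table is broadcast
hiveContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")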
Hi,
I was just going through two pieces of code in GraphX, namely SVDPlusPlus and
TriangleCount. In the first I see an RDD as an input to run, i.e. run(edges:
RDD[Edge[Double]],...), and in the other I see run(VD:..., ED:...).
Can anyone explain to me the difference between these two? In fact, SVDPlusPlus
is the
When new data comes in on a stream, Spark uses its streaming classes to
convert it into RDDs, and as you mention that is followed by transformations and
finally an action. As far as I have experienced, all the RDDs remain in memory
until the user destroys them or for as long as the application is alive.
On 26 November 2014 at 20:05
Thanks Ankur, it's really helpful. I have a few queries on optimization
techniques (a small sketch of the relevant calls follows below). For the
current run I used the RandomVertexCut partitioning.
But what partitioning should be used if:
1. The number of edges in the edgeList file is very large, like 50,000,000,
and there are many parallel edges between the same pair of vertices
2. No
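Not an authoritative answer, but an illustrative sketch of the knobs involved (the path is a placeholder; GraphLoader gives each edge an attribute of 1):

import org.apache.spark.graphx._
val graph = GraphLoader.edgeListFile(sc, "hdfs:///edgeList.txt")
// EdgePartition2D tends to bound vertex replication better on very large edge lists
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
// Collapse the many parallel edges between the same vertex pair, keeping a count of them
val deduped = partitioned.groupEdges((a, b) => a + b)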
If it regularly fails after 8 hours, could you get me the log4j logs?
To limit their size, set the default log level to WARN and the level of logs for
all classes in the package o.a.s.streaming to DEBUG. Then I can take a look.
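If editing log4j.properties is awkward, the same can be done programmatically at the top of the driver (a sketch using the log4j API directly):

import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN) // keep everything else quiet
Logger.getLogger("org.apache.spark.streaming").setLevel(Level.DEBUG) // but keep streaming verbose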
On Nov 27, 2014 11:01 AM, "Bill Jay" wrote:
> Gerard,
>
> That is a good ob
Yeah, only a few hours after I sent my message I saw some correspondence on
this other thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html,
which is the exact same issue. Glad to find that t
Gerard,
That is a good observation. However, the strange thing I see is that if I use
"MEMORY_AND_DISK_SER", the job fails even earlier. In my case, it takes 10
seconds to process the data of each batch, whose interval is one minute. It
fails after 10 hours with the "cannot compute split" error.
Bill
On Thu,
We're training a recommender with ALS in MLlib 1.1 against a dataset of
150M users and 4.5K items, with the total number of training records being
1.2 billion (~30GB of data). The input data is spread across 1200 partitions
on HDFS. For the training, rank=10, and we've configured {number of user
data
I have used Breeze fine with the Scala shell:
scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.jar:/Users/v606014/.m2/repository/com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar:/Users/v606014/.m2/repository/org/jblas/jblas/1.2.3/jblas-1.2.3.jar:/Users/v606014/.m2/repository/org/scalanlp/breeze_2
Hi,
I'm trying to use the Breeze library in the Spark Scala shell, but I'm
running into the same problem documented here:
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html
As I'm using the shell, I don't have a pom.xml, so the s
Hi folks!,
Anyone know how I can calculate, for each element of a variable in an RDD,
its percentile? I tried to calculate it through Spark SQL with subqueries, but I
think that is impossible in Spark SQL. Any idea will be welcome.
Thanks in advance,
Franco Barrientos
Data Scientist
Málaga #1
Hi,
We are currently running our Spark + Spark Streaming jobs on Mesos,
submitting our jobs through Marathon.
We see with some regularity that the Spark Streaming driver gets killed by
Mesos and then restarted on some other node by Marathon.
I've no clue why Mesos is killing the driver and lookin
Hi Hao,
I'm using an inner join since the broadcast join didn't work for left joins (thanks
for the links to the latest improvements).
And I'm using HiveContext, and it worked in a previous build (10/12) when
joining 15 dimension tables.
Jianshi
On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao wrote:
> Are
No, the feature vector is not converted. It contains a count n_i of how
often each term t_i occurs (or a TF-IDF transformation of those). You
are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
maximized.
In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...
So your n_1 counts (or TF-IDF
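As a tiny self-contained illustration of that arithmetic (plain Scala, not the MLlib internals):

// logPrior(c) = log(P(c)); logTheta(c)(i) = log(P(t_i | c)); counts(i) = n_i for the document
def predict(logPrior: Array[Double], logTheta: Array[Array[Double]], counts: Array[Double]): Int = {
  val scores = logPrior.indices.map { c =>
    logPrior(c) + counts.indices.map(i => counts(i) * logTheta(c)(i)).sum
  }
  scores.indexOf(scores.max) // class with the largest log-space score
}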
Running with lambda=0 fails the ALS code since the matrices no longer stay
positive definite and the Cholesky factorization fails...
Run with a very low lambda (I tested with 1e-4) and you should see the
decrease in RMSE as you expect...
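Something like this (a sketch; ratings is your RDD[Rating], and the rank/iteration values are whatever you were already using):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
// A tiny but non-zero lambda keeps the normal equations well conditioned
val model = ALS.train(ratings, 10, 20, 1e-4) // rank = 10, iterations = 20, lambda = 1e-4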
On Thu, Nov 27, 2014 at 3:04 AM, Kostas Kloudas wrote:
> Thanks a lot for your
Hi,
When I run the program below, I see two files in HDFS because the
number of partitions is 2. But one of the files is empty. Why is that? Is
the work not distributed equally across all the tasks?
textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
Hi,
I'm looking to do an iterative algorithm implementation with data coming in
from Cassandra. This might be a use case for GraphX; however, the IDs are
non-integral, and I would like to avoid a mapping (for now). I'm doing a simple
hubs and authorities HITS implementation, and the current imple
Hi,
I just tried to submit an application from graphx examples directory, but it
failed:
yifan2:bin yifanli$ MASTER=local[*] ./run-example graphx.PPR_hubs
java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
Hi,
When I start the Thrift server, I get the following exception. Could
anyone help me with this?
I placed hive-site.xml in the $SPARK_HOME/conf folder and the property
hive.metastore.sasl.enabled is set to 'false'.
org.apache.hive.service.ServiceException: Unable to login to kerberos with
g
Hi,
I have been running into some trouble while converting the code to Java.
I have done the matrix operations as directed and tried to find the maximum
score for each category. But the predicted category is mostly different from
the prediction done by MLlib.
I am fetching iterators of the pi
Hi TD,
We also struggled with this error for a long while. The recurring scenario
is when the job takes longer to compute than the job interval and a backlog
starts to pile up.
Hint: Check
If the DStream storage level is set to "MEMORY_ONLY_SER" and memory runs
out, then you will get a 'Cannot c
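For what it's worth, the storage level can be switched per stream, e.g. (a sketch; lines stands for whichever DStream is being persisted):

import org.apache.spark.storage.StorageLevel
// Spill serialized blocks to disk rather than dropping them when executor memory fills up
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)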
Actually all components are merged into the assembly jar under
assembly/target/scala-2.10. You don’t need to configure the classpath
unless you need to include some customized jars (e.g. customized Hive
SerDes). Existing scripts can compute the classpath correctly (if they
can’t, it should be a
Thanks a lot for your time guys and your quick replies!
> On Nov 26, 2014, at 7:53 PM, Xiangrui Meng wrote:
>
> The training RMSE may increase due to regularization. Squared loss
> only represents part of the global loss. If you watch the sum of the
> squared loss and the regularization, it shou
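In other words, the quantity being minimized is roughly (a sketch, with \lambda the regularization parameter and \Omega the observed ratings):

\min_{U,V} \sum_{(i,j)\in\Omega} \left(r_{ij} - u_i^\top v_j\right)^2 + \lambda\Big(\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2\Big)

so the squared-error part on its own can rise while this total still falls.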
I have just read on the website that Spark 1.1.1 has been released, but when
I upgraded my project to use 1.1.1 I discovered that the artifacts are not
on Maven yet.
[info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...
>
> [warn] module not found: org.apache.spark#spark-streaming-
Hi,
I set up a Maven environment on a Linux machine and was able to build the POM
file in the Spark home directory. Each module was refreshed with a corresponding
target directory containing jar files.
What do I need to do in order to include all the libraries on the classpath?
Earlier, I used a single assembly jar file to inc
Ah of course. Great explanation. So I suppose you should see desired
results with lambda = 0, although you don't generally want to set this
to 0.
On Wed, Nov 26, 2014 at 7:53 PM, Xiangrui Meng wrote:
> The training RMSE may increase due to regularization. Squared loss
> only represents part of th
Hi guys
I am playing with Spark, and I was wondering if there is a way to share an RDD
across multiple implementations in a decoupled way,
i.e. assuming I have an RDD that comes from a stream in Spark Streaming, I want to
be able to store the same stream in two different S3 folders using two
differe
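One decoupled-ish sketch, assuming the goal is simply to materialize each batch in two places (stream stands for the DStream; bucket and folder names are placeholders):

stream.foreachRDD { (rdd, time) =>
  rdd.cache() // compute the batch once, write it twice
  rdd.saveAsTextFile(s"s3n://bucket/folderA/batch-${time.milliseconds}")
  rdd.saveAsTextFile(s"s3n://bucket/folderB/batch-${time.milliseconds}")
  rdd.unpersist()
}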