Hi
You can see my code here.
It's a POC to implement UIMA on Spark:
https://bitbucket.org/SigmoidDev/uimaspark
https://bitbucket.org/SigmoidDev/uimaspark/src/8476fdf16d84d0f517cce45a8bc1bd3410927464/UIMASpark/src/main/scala/
*UIMAProcessor.scala*?at=master
This is the class where the major
Hi all: I deployed a Spark client on my own machine. I put Spark at
`/home/somebody/spark`, and the cluster workers' Spark home is
`/home/spark/spark`. When I launched the jar, it showed:
`AppClient$ClientActor: Executor updated: app-20141124170955-11088/12 is now
FAILED`
Hi,
I tried with try/catch blocks. In fact, inside mapPartitionsWithIndex, a
method is invoked which does the operation. I put the operations inside that
function in a try...catch block, but that's of no use; the error still
persists. Even when I commented out all the operations and left a simple print statement
How are you submitting the job?
Thanks
Best Regards
On Mon, Nov 24, 2014 at 3:02 PM, LinQili lin_q...@outlook.com wrote:
Hi all:
I deployed a Spark client on my own machine. I put Spark at
`/home/somebody/spark`, and the cluster workers' Spark home is
`/home/spark/spark`.
While
Thank you, Marcelo and Sean. `mvn install` is a good answer for my needs.
-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: November 21, 2014 1:47
To: yiming zhang
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to incrementally compile spark examples using mvn
Hi Yiming,
On
Thanks Arush! Your example is nice and easy to understand. I am implementing
it in Java, though.
Jatin
-
Novice Big Data Programmer
Dear all,
How can one save a KMeans model after training?
Best,
Jao
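As far as I know there is no built-in model save in the 1.1-era MLlib, so one workaround is to persist the cluster centers yourself. A minimal sketch, assuming `parsedData` is your training RDD[Vector] and that k, iterations, and the output path are illustrative:

import org.apache.spark.mllib.clustering.KMeans

// parsedData: RDD[org.apache.spark.mllib.linalg.Vector] prepared beforehand
val model = KMeans.train(parsedData, 5, 20)  // k = 5, maxIterations = 20 (illustrative)
// The centers are enough to rebuild predictions later; write them out as text.
sc.parallelize(model.clusterCenters.map(_.toArray.mkString(",")))
  .saveAsTextFile("hdfs:///models/kmeans-centers")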
Hi,
I want to submit my spark program from my machine on a YARN Cluster in yarn
client mode.
How to specify al l the required details through SPARK submitter.
Please provide me some details.
-Naveen.
You can export the Hadoop configuration dir (export HADOOP_CONF_DIR=XXX) in
the environment and then submit it like:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
Hi,
When I try to execute the program from my laptop by connecting to the HDP
environment (on which Spark is also configured), I'm getting the warning
(Initial job has not accepted any resources; check your cluster UI to
ensure that workers are registered and have sufficient memory) and the job is
being
Hi Akhil,
But the driver and YARN are on different networks; how do I specify the (export
HADOOP_CONF_DIR=XXX) path?
The driver is on my Windows machine and YARN is on some Unix machine on a
different network.
-Naveen.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, November 24,
Not sure if it will work, but you can try creating a dummy Hadoop conf
directory, put those (*-site.xml) files inside it, and hopefully
Spark will pick it up and submit to that remote cluster (if there aren't
any network/firewall issues).
Thanks
Best Regards
On Mon, Nov 24, 2014 at
This can happen mainly because of the following:
- Wrong master URL (make sure you give the master URL listed in the
top left corner of the web UI, running on port 8080)
- Allocating more memory/cores than are available when creating the SparkContext.
Thanks
Best Regards
On Mon, Nov 24, 2014 at 4:13 PM,
hi,
We are building an analytics dashboard. Data will be updated every 5
minutes for now, and eventually every minute, maybe more frequently. The
amount of data coming in is not huge: maybe 30 records per minute per customer,
although we could have 500 customers. Is streaming correct for this?
Instead
Streaming would be easy to implement: all you have to do is create the
stream, do some transformations (depending on your use case), and finally write
it to your dashboard's backend. What kind of dashboards are you building?
For d3.js-based ones, you can have a websocket and write the stream output to
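A minimal sketch of that flow, assuming a socket source and a stand-in pushToDashboard function for the backend write (all names, hosts, and intervals are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// stand-in for a websocket/REST call to the dashboard backend
def pushToDashboard(customer: String, count: Int): Unit = println(s"$customer -> $count")

val conf = new SparkConf().setAppName("DashboardFeed")
val ssc = new StreamingContext(conf, Seconds(60))          // one batch per minute

val lines = ssc.socketTextStream("metrics-host", 9999)     // illustrative source
val perCustomer = lines.map(l => (l.split(",")(0), 1)).reduceByKey(_ + _)

// push every batch out to the dashboard backend
perCustomer.foreachRDD { rdd =>
  rdd.collect().foreach { case (customer, count) => pushToDashboard(customer, count) }
}

ssc.start()
ssc.awaitTermination()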
Thanks. Yes, d3 ones. Just to clarify: we could take our current system,
which is incrementally adding partitions, and overlay a Spark Streaming
layer to ingest these partitions? Then nightly we could coalesce these
partitions, for example? I presume that while we are carrying out
a coalesce, the
Wouldn't it likely be the opposite? Too much memory / too many cores being
requested relative to the resource that YARN makes available?
On Nov 24, 2014 11:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
This can happen mainly because of the following:
- Wrong master url (Make sure you give
Thanks for your response.
I gave the correct master URL. Moreover, as I mentioned in my post, I am able
to run the sample program by using spark-submit. But it is not working when
I'm running it from my machine. Any clue on this?
Thanks in advance.
OK thank you very much for that!
On 23 Nov 2014 21:49, Denny Lee [via Apache Spark User List]
ml-node+s1001560n19598...@n3.nabble.com wrote:
It sort of depends on your environment. If you are running on your local
environment, I would just download the latest Spark 1.1 binaries and you'll
be
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("/path/CFReady.txt")
val ratings = data.map(_.split('\t') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
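For completeness, a minimal sketch of the training step that normally follows this snippet (rank, iteration count, and lambda are illustrative values):

val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

// score the known (user, product) pairs to sanity-check the model
val usersProducts = ratings.map { case Rating(user, product, _) => (user, product) }
val predictions = model.predict(usersProducts)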
I'm not quite sure if I understood you correctly, but here's the thing: if
you use Spark Streaming, it is more likely to refresh your dashboard for
each batch. So for every batch your dashboard will be updated with the new
data. And yes, the end user won't feel anything while you do the
Hi Saurabh,
Here your ratesAndPreds is an RDD of type [((Int, Int), (Double, Double))],
not an Array. Now, if you want to save it to disk, you can simply call
saveAsTextFile and provide the location.
So change your last line from this:
ratesAndPreds.foreach(pw.println)
to this:
ratesAndPreds.saveAsTextFile("/path/CFOutput")
Hi,
I found that the EC2 script has been improved a lot,
and the option ebs-vol-type has been added to specify the EBS type.
However, the option does not seem to work; the command I used is the
following:
$SPARK_HOME/ec2/spark-ec2 -k sparkcv -i spark.pem -m r3.4xlarge -s 3 -t
r3.2xlarge
I want to run my Spark program on a YARN cluster. But when I tested the
broadcast function in my program, I got an error:
Exception in thread main org.apache.spark.SparkException: Error sending
message as driverActor is null [message =
UpdateBlockInfo(BlockManagerId(driver, in160-011.byted.org,
Great thanks
On Monday, November 24, 2014, Akhil Das ak...@sigmoidanalytics.com wrote:
I'm not quite sure if I understood you correctly, but here's the thing: if
you use Spark Streaming, it is more likely to refresh your dashboard for
each batch. So for every batch your dashboard will be
Hello,
I have a large data calculation in Spark, distributed across several
nodes. In the end, I want to write to a single output file.
For this I do:
output.coalesce(1, false).saveAsTextFile(filename).
What happens is that all the data from the workers flows to a single worker, and
that one
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
sbt-assembly to create an uber jar to submit to the standalone master. I'm
using the Hadoop 1 prebuilt binaries for Spark. As soon as I try to do
sc.cassandraTable(...) I get an error that's likely to be a Guava versioning
Hi Alaa Ali,
That's right, when using the PhoenixInputFormat, you can do simple 'WHERE'
clauses and then perform any aggregate functions you'd like from within
Spark. Any aggregations you run won't be quite as fast as running the
native Spark queries, but once it's available as an RDD you can
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in
replacement.
On Monday, November 24, 2014, riginos samarasrigi...@gmail.com wrote:
OK thank you very much for that!
On 23 Nov 2014 21:49, Denny Lee [via Apache Spark User List] [hidden
email]
I finally managed to get the example working, here are the details that may
help other users.
I have 2 windows nodes for the test system, PN01 and PN02. Both have the same
shared drive S: (it is mapped to C:\source on PN02).
If I run the worker and master from S:\spark-1.1.0-bin-hadoop2.4,
Thanks for your help Akhil; however, this is creating an output folder and
storing the result sets in multiple files. Also, the record count in the result
set seems to have multiplied! Is there any other way to achieve this?
Thanks!!
Regards,
Saurabh Agrawal
Vice President
Markit
Green
Hi,
I build two tables from files and join table F1 with table F2 on c5 = d4.
F1 has 46730613 rows.
F2 has 3386740 rows.
All keys d4 exist in F1.c5, so I expect to retrieve 46730613 rows, but it
returns only 3437 rows.
// --- begin code ---
val sqlContext = new
jsonFiles in your code is a SchemaRDD rather than an RDD[Array].
If it is a column in the SchemaRDD, you can first use a Spark SQL query to get
that column.
Alternatively, SchemaRDD supports some SQL-like operations, such as select / where,
which can also get a specific column.
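A minimal sketch of both routes, assuming the column is called `text`, `sqlContext` is the SQLContext that produced `jsonFiles`, and the 1.1-era DSL implicits are in scope (all names are illustrative):

// 1) plain SQL: register the SchemaRDD and select the column
jsonFiles.registerTempTable("docs")
val viaSql = sqlContext.sql("SELECT text FROM docs")

// 2) the language-integrated DSL (needs `import sqlContext._` for the symbol implicits)
val viaDsl = jsonFiles.select('text)

viaSql.map(_.getString(0)).take(5).foreach(println)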
On Nov 24, 2014, at 4:01 AM, Daniel Haviv
I faced the same problem, and a workaround is here:
https://github.com/datastax/spark-cassandra-connector/issues/292
best,
/Shahab
On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote:
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
sbt-assembly to
Hello,
I was wondering if anyone has gotten the Stanford CoreNLP Java library to
work with Spark.
My attempts to use the parser/annotator fail because of task serialization
errors since the class
StanfordCoreNLP cannot be serialized.
I've tried the remedies of registering StanfordCoreNLP
To get the results in a single file, you could do a repartition(1) and then
save it:
ratesAndPreds.repartition(1).saveAsTextFile("/path/CFOutput")
Thanks
Best Regards
On Mon, Nov 24, 2014 at 8:32 PM, Saurabh Agrawal saurabh.agra...@markit.com
wrote:
Thanks for your help Akhil, however,
The metrics page reveals that only two executors work in parallel for
each iteration. You need to increase the number of parallel tasks.
Some tips that may be helpful (a short sketch follows):
Increase spark.default.parallelism;
Use repartition() or coalesce() to increase the partition count.
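A minimal sketch of those two knobs (values and paths are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MoreParallelism")
  .set("spark.default.parallelism", "64")      // default number of tasks after shuffles
val sc = new SparkContext(conf)

val ratings = sc.textFile("hdfs:///ratings")   // illustrative input
val wide = ratings.repartition(64)             // full shuffle up to 64 partitions
val narrow = wide.coalesce(16)                 // shrink partition count without a full shuffle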
On Nov 22, 2014, at 3:18 AM, Sameer
We have gotten this to work, but it requires instantiating the CoreNLP object
on the worker side. Because of the initialization time it makes a lot of sense
to do this inside of a .mapPartitions instead of a .map, for example.
As an aside, if you're using it from Scala, have a look at
object MyCoreNLP {
@transient lazy val coreNLP = new coreNLP()
}
and then refer to it from your map/reduce/mapPartitions and it should
be fine (presuming it's thread safe); it will only be initialized once per
classloader per JVM.
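A minimal sketch of that pattern, assuming the standard StanfordCoreNLP pipeline class and an illustrative `docs` RDD of raw text:

import edu.stanford.nlp.pipeline.StanfordCoreNLP

object MyCoreNLP {
  // created lazily on the worker, once per classloader/JVM; never shipped from the driver
  @transient lazy val coreNLP = new StanfordCoreNLP()
}

val docs = sc.parallelize(Seq("Spark runs on the JVM.", "CoreNLP annotates text."))
val annotated = docs.mapPartitions { iter =>
  iter.map(text => MyCoreNLP.coreNLP.process(text))   // touches the worker-side singleton
}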
On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks
Sent from my iPad
On Nov 24, 2014, at 9:41 AM, zh8788 78343...@qq.com wrote:
Hi,
I am new to Spark. This is the first time I am posting here. Currently, I
am trying to implement ADMM optimization algorithms for Lasso/SVM.
Then I came across a problem:
Since the training data (label, feature) is large, I
I am trying to run connected components on a graph generated by reading an
edge file. It runs for a long time (3-4 hrs) and then eventually fails. I
can't find any error in the log file. The file I am testing it on has 27M
rows (edges). Is there something obviously wrong with the code?
I tested the
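For reference, a minimal sketch of the usual flow for that job (the file path is illustrative), in case it helps spot a difference from the failing version:

import org.apache.spark.graphx.GraphLoader

// each line of the edge file is "srcId dstId"
val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt")
val components = graph.connectedComponents().vertices
components.take(10).foreach(println)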
Hi,
Is there any advantage to storing data in the Parquet format and loading it using
the Spark SQL context, but never registering it as a table or using SQL on it?
Something like:
data = sqc.parquetFile(path)
results = data.map(lambda x: applyfunc(x.field))
Is this faster/more optimised
Hi, i'm trying to improve performance for Spark's Mllib, and I am having
trouble getting native netlib-java libraries installed/recognized by Spark.
I am running on a single machine, Ubuntu 14.04 and here is what I've tried:
sudo apt-get install libgfortran3
sudo apt-get install libatlas3-base
Parquet is a column-oriented format, which means that you need to read in
less data from the file system if you're only interested in a subset of
your columns. Also, Parquet pushes down selection predicates, which can
eliminate needless deserialization of rows that don't match a selection
Hello,
I'm new to Stanford CoreNLP. Could anyone share good training material and
examples (Java or Scala) on NLP?
Regards,
Rajesh
On Mon, Nov 24, 2014 at 9:38 PM, Ian O'Connell i...@ianoconnell.com wrote:
object MyCoreNLP {
@transient lazy val coreNLP = new coreNLP()
}
and then refer
This is probably not the right venue for general questions on CoreNLP - the
project website (http://nlp.stanford.edu/software/corenlp.shtml) provides
documentation and links to mailing lists/stack overflow topics.
On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com
Hi All,
I'm learning the code of Spark SQL.
I'm confused about how SchemaRDD executes each operator.
I'm tracing the code, and I found that the toRDD() function in QueryExecution
is the starting point for running a query. The toRDD function will run the
SparkPlan, which is a tree structure.
However, I didn't find any
Hello guys,
I'm using Spark 1.0.0 and Kryo serialization
In the Spark Shell, when I create a class that contains the SparkContext as
an attribute, in this way:
class AAA(val s: SparkContext) { }
val aaa = new AAA(sc)
and I execute any action using that attribute like:
val myNumber = 5
I created an application in Spark. When I run it with Spark, everything works
fine. But when I export my application with the libraries (via sbt) and
try to run it as an executable jar, I get the following error:
14/11/24 20:06:11 ERROR OneForOneStrategy: exception during creation
I have figured out the problem here. Turned out that there was a problem
with my SparkConf when I was running my application with yarn in cluster
mode. I was setting my master to be local[4] inside my application, whereas
I was setting it to yarn-cluster with spark-submit. Now I have changed my
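A minimal sketch of that fix: leave the master out of the application and let spark-submit's --master flag (yarn-cluster there, local[4] when testing) decide; the app name is illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// no setMaster() here; it is supplied by spark-submit at launch time
val conf = new SparkConf().setAppName("MyYarnApp")
val sc = new SparkContext(conf)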
Hello Folks:
Since Spark exposes Python bindings and allows you to express your logic in
Python, is there a way to leverage sophisticated libraries like
NumPy, SciPy, and scikit-learn in a Spark job and run at scale?
What's been your experience, any insights you can share in terms of what's
Did the workaround work for you? Doesn't seem to work for me.
Date: Mon, 24 Nov 2014 16:44:17 +0100
Subject: Re: Spark Cassandra Guava version issues
From: shahab.mok...@gmail.com
To: as...@live.com
CC: user@spark.apache.org
I faced the same problem, and a workaround is here:
The data is in LIBSVM format, so this line won't work:
values = [float(s) for s in line.split(' ')]
Please use the util function in MLUtils to load it as an RDD of LabeledPoint.
http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
from pyspark.mllib.util import MLUtils
Try building Spark with -Pnetlib-lgpl, which includes the JNI library
in the Spark assembly jar. This is the simplest approach. If you want
to include it as part of your project, make sure the library is inside
the assembly jar or you specify it via `--jars` with spark-submit.
-Xiangrui
On Mon,
Additionally - I strongly recommend using OpenBLAS over the Atlas build
from the default Ubuntu repositories. Alternatively, you can build ATLAS on
the hardware you're actually going to be running the matrix ops on (the
master/workers), but we've seen modest performance gains doing this vs.
Marcelo Vanzin wrote
Do you expect to be able to use the spark context on the remote task?
Not at all; what I want to create is a wrapper around the SparkContext, to be
used only on the driver node.
I would like to have in this AAA wrapper several attributes, such as the
SparkContext and other
Hi,
I want to test Kryo serialization but when starting spark-shell I'm hitting
the following error:
java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer
the kryo-2.21.jar is on the classpath so I'm not sure why it's not picking
it up.
Thanks for your help,
Daniel
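One guess worth checking, based on the class name in that error: the Kryo serializer lives under the serializer package, so the setting would normally look like the sketch below (the property value is the standard one; the app name is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("KryoTest")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")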
You are pretty close. The QueryExecution is what drives the phases from
parsing to execution. Once we have a final SparkPlan (the physical plan),
toRdd just calls execute() which recursively calls execute() on children
until we hit a leaf operator. This gives us an RDD[Row] that will compute
Akshat is correct about the benefits of parquet as a columnar format, but
I'll add that some of this is lost if you just use a lambda function to
process the data. Since your lambda function is a black box Spark SQL does
not know which columns it is going to use and thus will do a full
tablescan.
Thanks for the feedback, I filed a couple of issues:
https://github.com/databricks/spark-avro/issues
On Fri, Nov 21, 2014 at 5:04 AM, thomas j beanb...@googlemail.com wrote:
I've been able to load a different avro file based on GenericRecord with:
val person =
Hello,
On Mon, Nov 24, 2014 at 12:07 PM, aecc alessandroa...@gmail.com wrote:
This is the stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
- field (class
Can you give the full stack trace. You might be hitting:
https://issues.apache.org/jira/browse/SPARK-4293
On Sun, Nov 23, 2014 at 3:00 PM, critikaled isasmani@gmail.com wrote:
Hi,
I am trying to insert particular set of data from rdd to a hive table I
have Map[String,Map[String,Int]] in
Parquet does a lot of serial metadata operations on the driver which makes
it really slow when you have a very large number of files (especially if
you are reading from something like S3). This is something we are aware of
and that I'd really like to improve in 1.3.
You might try the (brand new
Can you clarify what is the Spark master URL you are using ? Is it 'local'
or is it a cluster ? If it is 'local' then rebuilding Spark wouldn't help
as Spark is getting pulled in from Maven and that'll just pick up the
released artifacts.
Shivaram
On Mon, Nov 24, 2014 at 1:08 PM, agg212
Yes, I'm running this in the Shell. In my compiled Jar it works perfectly,
the issue is I need to do this on the shell.
Any available workarounds?
I checked sqlContext; they use it in the same way I would like to use my
class: they make the class Serializable with transient. Does this affect
Andrei, Ashish,
To be clear, I don't think it's *counting* the entire file. It just seems
from the logging and the timing that it is doing a get of the entire file,
then figures out it only needs some certain blocks, does another get of
only the specific block.
Regarding # partitions - I think I
On Mon, Nov 24, 2014 at 1:56 PM, aecc alessandroa...@gmail.com wrote:
I checked sqlContext, they use it in the same way I would like to use my
class, they make the class Serializable with transient. Does this affects
somehow the whole pipeline of data moving? I mean, will I get performance
Hi all,
I am getting started with Spark.
I would like to use it for a spike on anomaly detection in a massive stream
of metrics.
Can Spark easily handle this use case ?
Thanks,
Natu
Ok, great, I'm gonna do it that way, thanks :). However, I still don't
understand why this object should be serialized and shipped?
aaa.s and sc are both the same object org.apache.spark.SparkContext@1f222881
However this :
aaa.s.parallelize(1 to 10).filter(_ == myNumber).count
Needs to be
Can you verify that it's reading the entire file on each worker using
network monitoring stats? If it does, that would be a bug in my opinion.
On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote:
Andrei, Ashish,
To be clear, I don't think it's *counting* the entire file. It
Is there any timeline where Spark SQL goes beyond alpha version?
Thanks,
That's an interesting question for which I do not know the answer.
Probably a question for someone with more knowledge of the internals
of the shell interpreter...
On Mon, Nov 24, 2014 at 2:19 PM, aecc alessandroa...@gmail.com wrote:
Ok, great, I'm gonna do do it that way, thanks :). However I
These libraries could be used in PySpark easily. For example, MLlib
uses Numpy heavily, it can accept np.array or sparse matrix in SciPy
as vectors.
On Mon, Nov 24, 2014 at 10:56 AM, Rohit Pujari rpuj...@hortonworks.com wrote:
Hello Folks:
Since spark exposes python bindings and allows you to
Neat hack! This is cute and actually seems to work. The fact that it works
is a little surprising and somewhat unintuitive.
On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell i...@ianoconnell.com wrote:
object MyCoreNLP {
@transient lazy val coreNLP = new coreNLP()
}
and then refer to it
Hi Gurus,
Sorry for my naive question. I am new.
I seem to have read somewhere that Spark is still batch learning, but Spark
Streaming could allow online learning.
I could not find this on the website now.
http://spark.apache.org/docs/latest/streaming-programming-guide.html
I know MLLib uses
Hi,
On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com
wrote:
I seemed to read somewhere that spark is still batch learning, but spark
streaming could allow online learning.
Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently
can do online learning
Hi,
On Sat, Nov 22, 2014 at 12:13 AM, EH eas...@gmail.com wrote:
Unfortunately whether it is possible to have both Spark and HDFS running on
the same machine is not under our control. :( Right now we have Spark and
HDFS running in different machines. In this case, is it still possible to
Thank you Tobias!
On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com
wrote:
I seemed to read somewhere that spark is still batch learning, but spark
streaming could allow online learning.
Hello!
Does anyone know why I may be receiving negative final accumulator values?
Thanks!
I am running it in local mode. How can I use the built version (in local mode)
so that I can use the native libraries?
Hi,
Is there any document that provides guidelines, with examples, that
illustrate when different performance optimizations would be useful? I am
interested in knowing the guidelines for using optimizations like cache(),
persist(), repartition(), coalesce(), and broadcast variables. I
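While waiting for pointers to a fuller guide, a minimal sketch of two of those techniques, assuming sc is an existing SparkContext (data and paths are illustrative):

// cache() an RDD that is reused by several actions, so it is not recomputed each time
val logs = sc.textFile("hdfs:///logs").cache()
val errors = logs.filter(_.contains("ERROR")).count()
val warnings = logs.filter(_.contains("WARN")).count()

// broadcast a small lookup table instead of shipping it with every task closure
val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))
val resolved = logs.map(line => countryNames.value.getOrElse(line.take(2), "unknown"))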
Hi Folks!
I'm running a Spark job on a cluster with 9 slaves and 1 master (250 GB RAM,
32 cores, and 1 TB of storage each).
This job generates 1.200 TB of data in an RDD with 1200 partitions.
When I call saveAsTextFile("hdfs://..."), Spark creates 1200 files named
part-000* in the HDFS folder.
Int overflow? If so, you can use BigInt like this:
scala> import org.apache.spark.AccumulatorParam
scala> :paste
// Entering paste mode (ctrl-D to finish)
implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
  def addInPlace(t1: BigInt, t2: BigInt): BigInt = t1 + t2
  def zero(initialValue: BigInt): BigInt = BigInt(0)
}
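And a short usage sketch under that implicit (the numbers are illustrative):

val acc = sc.accumulator(BigInt(0))          // picks up BigIntAccumulatorParam implicitly
sc.parallelize(1 to 1000000).foreach(n => acc += BigInt(n))
println(acc.value)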
Hi All,
I started exploring Spark two months ago. I'm looking at some concrete
features of both Spark and GraphX so that I can decide what to
use, based on which gets the highest performance.
According to the documentation, GraphX runs 10x faster than normal Spark. So I
ran PageRank
Great! Worked like a charm :)
On Mon, Nov 24, 2014 at 9:56 PM, Shixiong Zhu zsxw...@gmail.com wrote:
Int overflow? If so, you can use BigInt like this:
scala> import org.apache.spark.AccumulatorParam
scala> :paste
// Entering paste mode (ctrl-D to
You can try recompiling spark with that option, and doing an sbt/sbt
publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT
(assuming you're building from the 1.1 branch) - sbt or maven (whichever
you're compiling your app with) will pick up the version of spark that you
just
An in-memory cache may blow up the size of the RDD.
It is common for an RDD to take more space in memory than on disk.
There are options to configure and optimize storage space efficiency in
Spark; take a look at this: https://spark.apache.org/docs/latest/tuning.html
2014-11-25 10:38 GMT+08:00
Hi,
I'm going to debug some Spark applications on our testing platform, and it
would be helpful if we could see the event log on the worker node.
I've tried to turn on spark.eventLog.enabled and set the spark.eventLog.dir
parameter on the worker node. However, it doesn't work.
I do
Hello,
What exactly are you trying to see? Workers don't generate any events
that would be logged by enabling that config option. Workers generate
logs, and those are captured and saved to disk by the cluster manager,
generally, without you having to do anything.
On Mon, Nov 24, 2014 at 7:46 PM,
You can set the same parameters when launching an application: if you use
spark-submit, try --conf to pass those settings, or set them from SparkConf;
you can also set the logs for both the driver and the workers.
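A minimal sketch of the SparkConf route (the app name and event log directory are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DebugRun")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-events")   // a directory the history server can read
val sc = new SparkContext(conf)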
-
--Harihar
Hmm, I tried exactly the same commit and build command locally, but
couldn't reproduce this.
Usually this kind of error is caused by classpath misconfiguration.
Could you please try this to ensure the corresponding Guava classes are
included in the assembly jar you built?
jar tf
Hello,
I am reading around 1000 input files from disk into an RDD and generating
Parquet. It always produces the same number of Parquet files as input
files. I tried to merge them using
rdd.coalesce(n) and/or rdd.repartition(n).
I also tried using:
int MB_128 = 128*1024*1024;
Hi,
When I submit a spark application like this:
./bin/spark-submit --class org.apache.spark.examples.SparkKMeans
--deploy-mode client --master spark://karthik:7077
$SPARK_HOME/examples/*/scala-*/spark-examples-*.jar /k-means 4 0.001
Which part of the spark framework code deals with the name of
For the "never register a table" part, you actually can use Spark SQL
without registering a table, via its DSL. Say you're going to extract an
`Int` field named `key` from the table and double it:
import org.apache.spark.sql.catalyst.dsl._
val data = sqc.parquetFile(path)
val double =
Hi,
I think it should be accessible via the SparkConf in the SparkContext.
Something like sc.getConf.get("spark.app.name")?
Thanks,
Deng
On Tue, Nov 25, 2014 at 12:40 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
When I submit a spark application like this:
./bin/spark-submit
Hi,
Is it necessary for every vertex to have an attribute when we load a graph
to GraphX?
In other words, if I have an edge list file containing pairs of vertices,
i.e., "1 2" means that there is an edge between node 1 and node 2. Now,
when I run PageRank on this data it returns NaN.
Can I use
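For reference, a minimal sketch of the usual loading path for an edge-list file like that (the path and tolerance are illustrative):

import org.apache.spark.graphx.GraphLoader

// each line is "srcId dstId"; loaded vertices get a default attribute of 1
val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt")
val ranks = graph.pageRank(0.0001).vertices
ranks.take(10).foreach(println)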
Could it be because my edge list file is in the form (1 2), where there
is an edge between node 1 and node 2?
On Tue, Nov 18, 2014 at 4:13 PM, Ankur Dave ankurd...@gmail.com wrote:
At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Yes the above command works,
Here's the tuning guidelines if you haven't seen it already.
http://spark.apache.org/docs/latest/tuning.html
You could try the following to get it loaded (a short sketch of these settings follows):
- Use Kryo serialization
http://spark.apache.org/docs/latest/tuning.html#data-serialization
- Enable RDD compression
- Set the storage level to
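A minimal sketch of those settings, assuming the truncated last tip was the usual MEMORY_ONLY_SER recommendation (paths and names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("LoadBigFile")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///big-input")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)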
Thanks for the reply Michael, here is the stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in
stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0
(TID 3, localhost): scala.MatchError: MapType(StringType,StringType,true)
(of class
This is what I got from jar tf:
org/spark-project/guava/common/base/Preconditions.class
org/spark-project/guava/common/math/MathPreconditions.class
com/clearspring/analytics/util/Preconditions.class
parquet/Preconditions.class
I seem to have the class that was reported missing, but I am missing this