I'm not caching the data. With each iteration I mean each 128 MB
that an executor has to process.
The code is pretty simple.
final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new
Okay. So the queries Tableau runs on the persisted data will go through
Spark SQL, to improve performance and to take advantage of Spark SQL.
Thanks again Denny
From: Denny Lee denny.g@gmail.com
Sent: Thursday, February 5, 2015 1:27 PM
To: Ashutosh
I'm getting "SequenceFile doesn't work with GzipCodec without native-hadoop
code!" Where do I get those libs, and where do I put them in Spark?
Also, can I save a plain text file (like saveAsTextFile) as gzip?
Thanks.
On Wed, Feb 4, 2015 at 11:10 PM, Kane Kim kane.ist...@gmail.com wrote:
How to save
I don't see anything that says you must explicitly restart them to load the
new settings, but usually there is some sort of signal trapped [or brute
force full restart] to get a configuration reload for most daemons. I'd
take a guess and use the $SPARK_HOME/sbin/{stop,start}-slaves.sh scripts on
Yes, the driver has to be able to accept incoming connections. All the
executors connect back to the driver sending heartbeats, map status,
metrics. It is critical and I don't know of a way around it. You could look
into using something like the
https://github.com/spark-jobserver/spark-jobserver
I read somewhere about Gatling. Can that be used to profile Spark jobs?
On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com
wrote:
Which Spark Job server are you talking about?
On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Can Spark Job
Not sure about Spark standalone mode, but on Spark-on-YARN it should work. You
can check the following link:
http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
Thanks.
Zhan Zhang
On Feb 5, 2015, at 5:02 PM, Cheng Lian
lian.cs@gmail.com wrote:
Please note
I want to write a Spark Streaming consumer for Kafka in Java. I want to
process the data in real time as well as store the data in HDFS in
year/month/day/hour/ format. I am not sure how to achieve this. Should I
write separate Kafka consumers, one for writing data to HDFS and one for
Spark
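For reference, a minimal sketch (Scala; assuming a DStream[String] named messages built with KafkaUtils, and placeholder paths) showing that one streaming job can do both: process each batch and persist it under a time-bucketed HDFS directory, so separate consumers are not required.
import java.text.SimpleDateFormat
import java.util.Date

// messages is assumed to be a DStream[String] from KafkaUtils.
val fmt = new SimpleDateFormat("yyyy/MM/dd/HH")
messages.foreachRDD { (rdd, time) =>
  // Bucket each batch by its batch time, e.g. .../2015/02/05/13
  val dir = "hdfs:///data/kafka/" + fmt.format(new Date(time.milliseconds))
  rdd.saveAsTextFile(dir)
  // ...real-time processing of the same rdd can happen here as well...
}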
Which Spark Job server are you talking about?
On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Can Spark Job Server be used for profiling Spark jobs?
When you say profiling, what are you trying to figure out? Why your Spark
job is slow? Gatling seems to be a load-generating framework, so I'm not
sure how you'd use it (I've never used it before). Spark runs on the JVM, so
you can use any JVM profiling tool, like YourKit.
Kostas
On Thu, Feb 5,
What's the dump info from jstack?
Yours respectfully, Xuefeng Wu (吴雪峰)
On Feb 6, 2015, at 10:20 AM, Michael Albert m_albert...@yahoo.com.INVALID
wrote:
My apologies for following up on my own post, but I thought this might be of
interest.
I terminated the java process corresponding to the executor which had
Could you find the shuffle files? Or were the files deleted by other processes?
Yours respectfully, Xuefeng Wu (吴雪峰)
On Feb 5, 2015, at 11:14 PM, Yifan LI iamyifa...@gmail.com wrote:
Hi,
I am running a GraphX application with heavy memory/CPU overhead. I think the
memory is sufficient, and I set the RDDs'
Yes, I want to know the reason the job is slow.
I will look at YourKit.
Can you point me to a tutorial on how to use it?
Thank You
On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis kos...@cloudera.com
wrote:
When you say profiling, what are you trying to figure out? Why your
Hi,
My spark-env.sh has the following entries with respect to classpath:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/hive/lib/*:/etc/hive/conf/
-Skanda
On Sun, Feb 1, 2015 at 11:45 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi Skanda,
How do you set up your SPARK_CLASSPATH?
I add the
Hello,
Did you check the following?
http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/
http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-td6551.html
--
Emre Sevinç
On Fri, Feb 6, 2015 at 2:16 AM, Subacini B
Hi,
I am playing with the following example code:
public class SparkTest {
public static void main(String[] args){
String appName = "This is a test application";
String master = "spark://lix1.bh.com:7077";
SparkConf
I observed the same issue as well.
On Fri, Feb 6, 2015 at 12:19 AM, Praveen Garg praveen.g...@guavus.com
wrote:
Hi,
While moving from Spark 1.1 to Spark 1.2, we are facing an issue where
shuffle read/write has increased significantly. We also tried running
the job by rolling back to
Yes, right now there is no way to automatically know how many stages a job
will generate. Like Mark said, RDD#toDebugString will give you some info
about the RDD DAG, and from that you can determine, based on the dependency
types (wide vs. narrow), if there is a stage boundary.
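For example, a minimal sketch (input path is a placeholder):
// Needed for reduceByKey on Spark 1.2-era versions.
import org.apache.spark.SparkContext._

// reduceByKey introduces a wide (shuffle) dependency, hence a stage boundary;
// toDebugString shows the boundary as an indentation change in the lineage.
val counts = sc.textFile("hdfs:///tmp/input")
  .map(line => (line, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)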
On Thu, Feb 5, 2015
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit
On Thu, Feb 5, 2015 at 9:18 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Yes, I want to know the reason the job is slow.
I will look at YourKit.
Can you point me to some
Standalone mode does not support talking to a kerberized HDFS. If you want
to talk to a kerberized (secure) HDFS cluster, I suggest you use Spark on
YARN.
On Wed, Feb 4, 2015 at 2:29 AM, Jander g jande...@gmail.com wrote:
Hope someone helps me. Thanks.
On Wed, Feb 4, 2015 at 6:14 PM, Jander g
Oh yeah, they picked up changes after restart, thanks!
On Thu, Feb 5, 2015 at 8:13 PM, Charles Feduke charles.fed...@gmail.com
wrote:
I don't see anything that says you must explicitly restart them to load
the new settings, but usually there is some sort of signal trapped [or
brute force full
Hi,
If your message is a String, you will have to change the Encoder and
Decoder to StringEncoder and StringDecoder.
If your message is byte[], you can use DefaultEncoder and DefaultDecoder.
Also, don't forget to add import statements depending on your encoder and decoder.
import
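A minimal sketch of the imports and the call (Scala), assuming an existing StreamingContext ssc and placeholder Kafka settings:
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "localhost:2181",  // placeholder
  "group.id" -> "my-group")                 // placeholder
// String keys/values decoded with StringDecoder instead of DefaultDecoder.
val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
  .map(_._2)  // the stream yields (key, message) pairs; keep the message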
I have a single node Spark standalone cluster. Will this also work for my
cluster?
Thank You
On Fri, Feb 6, 2015 at 11:02 AM, Mark Hamstra m...@clearstorydata.com
wrote:
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit
On Thu, Feb 5, 2015 at 9:18
Hi,
Can Spark Job Server be used for profiling Spark jobs?
Hi all,
Looking at the Spark MetricsServlet.
What is the URL exposing the driver and executor JSON response?
I found master and worker successfully, but can't find the URL that returns
JSON for the other two sources.
Thanks!
Judy
Hi Guys,
I’m getting this error in KafkaWordCount:
TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234):
java.lang.ClassCastException: [B cannot be cast to java.lang.String
Hi Kevin,
We seem to be facing the same problem as well. Were you able to find
anything after that? The ticket does not seem to have progressed anywhere.
Regards,
Anubhav
On 5 January 2015 at 10:37, 정재부 itsjb.j...@samsung.com wrote:
Sure, here is a ticket.
Hi Florin,
I might be wrong but timestamp looks like a keyword in SQL that the engine
gets confused with. If it is a column name of your table, you might want to
change it. (
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types)
I'm constantly working with CSV files with spark.
Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same issue?
On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi Guys,
I’m getting this error in KafkaWordCount:
TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234):
I don’t think so Sean.
On Feb 5, 2015, at 16:57, Sean Owen so...@cloudera.com wrote:
Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same
issue?
On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi Guys,
I’m getting this error in
No, you can compress SequenceFile with gzip. If you are reading outside
Hadoop then SequenceFile may not be a great choice. You can use the gzip
codec with TextOutputFormat if you need to.
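For example, a minimal sketch (output path is a placeholder):
import org.apache.hadoop.io.compress.GzipCodec

// saveAsTextFile accepts a compression codec; each part file is gzipped.
rdd.saveAsTextFile("hdfs:///tmp/out", classOf[GzipCodec])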
On Feb 5, 2015 8:28 AM, Kane Kim kane.ist...@gmail.com wrote:
I'm getting SequenceFile doesn't work with
Do you mean disable the web UI? spark.ui.enabled=false
Sure, it's useful with master = local[*] too.
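A minimal sketch (app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-compute")        // placeholder
  .setMaster("local[*]")              // use all local cores
  .set("spark.ui.enabled", "false")   // no web UI needed for embedded use
val sc = new SparkContext(conf)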
On Thu, Feb 5, 2015 at 9:30 AM, Shuai Zheng szheng.c...@gmail.com wrote:
Hi All,
It might sound weird, but I think Spark is perfect for use as a
multi-threading library in some cases.
Does pyspark support ZeroMQ?
I see that Java supports it, but I am not sure about Python.
regards
--
Aleksandar Kacanski
I solved it with this code:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("/opt/testAppSpark/data/height-weight.txt").map {
  line => Vectors.dense(line.split(' ').map(_.toDouble))
}.cache()
val cluster =
I'm submitting this on a cluster, with my usual setting of export
YARN_CONF_DIR=/etc/hadoop/conf
It is working again after a small change to the code so I will see if I can
reproduce the error (later today).
On Thu, Feb 5, 2015 at 9:17 AM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
Are
I am evaluating Spark for our production usage. Our production cluster is
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment
running with Hadoop.
What I have in mind is to test a very complex Hive query, which joins between 6
tables, lots of nested structure with
I was wondering if there was any chance of getting a more distributed word2vec
implementation. I seem to be running out of memory from big local data
structures such as
val syn1Global = new Array[Float](vocabSize * vectorSize)
Is there any chance of getting a version where these are all
Hello!
I received the following errors in the workerLog.log files:
ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660]
-> [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with
[akka.tcp://sparkExecutor@stream4:47929]] [
Hi Xiangrui,
I'm sorry. I didn't recognize your mail.
What I did is a workaround only working for my special case.
It does not scale and only works for small data sets but that is fine
for me so far.
Kind Regards,
Niklas
def securlyZipRdds[A, B: ClassTag](rdd1: RDD[A], rdd2: RDD[B]):
RDD[(A,
Hi All,
I want to develop a server-side application:
User submits request → server runs a Spark application and returns (this
might take a few seconds).
So I want the server to keep a long-lived context; I don't know
whether this is reasonable or not.
Basically I try to have a
Hello!
I'm using spark-csv 2.10 with Java from the maven repository
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>0.1.1</version>
I would like to use Spark-SQL to filter out my data. I'm using the
following code:
JavaSchemaRDD cars = new
Hi,
I am trying to submit a job from a Windows system to a YARN cluster running
on Linux (the HDP2.2 sandbox). I have copied the relevant Hadoop directories
as well as the yarn-site.xml and mapred-site.xml to the Windows file system.
Further, I have added winutils.exe to $HADOOP_HOME/bin.
I can
Hi,
I am running a GraphX application with heavy memory/CPU overhead. I think the
memory is sufficient, and I set the RDDs' StorageLevel to MEMORY_AND_DISK.
But I found there were some tasks failed due to following errors:
java.io.FileNotFoundException:
Hi All,
It might sound weird, but I think Spark is perfect for use as a
multi-threading library in some cases. Local mode will naturally spin up
multiple threads when required. Because it is more restricted, there is less
chance of a potential bug in the code (because it is more data oriented,
The challenge I have is this. There are two streams of data where an event
might look like this in stream1: (time, hashkey, foo1) and in stream2:
(time, hashkey, foo2).
The result after joining should be (time, hashkey, foo1, foo2). The join
happens on hashkey and the time difference can be ~30
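One possible sketch, assuming two DStreams keyed as described and a 30-second window (all names are placeholders, not the poster's actual code):
import org.apache.spark.streaming.Seconds
// Needed for join on pair DStreams on Spark 1.2-era versions.
import org.apache.spark.streaming.StreamingContext._

// Key both streams by hashkey, window them, then join on the key.
val s1 = stream1.map { case (time, hashkey, foo1) => (hashkey, (time, foo1)) }
  .window(Seconds(30))
val s2 = stream2.map { case (time, hashkey, foo2) => (hashkey, (time, foo2)) }
  .window(Seconds(30))
val joined = s1.join(s2)
  .map { case (hashkey, ((t1, foo1), (_, foo2))) => (t1, hashkey, foo1, foo2) }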
Hi,
I am trying to get the final cluster centers after running the KMeans
algorithm in MLlib in order to characterize the clusters. But the
KMeansModel does not have any public method to retrieve this info. There
appears to be only a private method called clusterCentersWithNorm. I guess
I could
Unless I misunderstood your question, you’re looking for the val clusterCenters
in
http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel,
no?
Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466
On Feb 5, 2015, at
Yes, you can submit multiple actions from different threads to the same
SparkContext. It is safe.
Indeed what you want to achieve is quite common. Expose some operations
over a SparkContext through HTTP.
I have used spray for this and it just worked fine.
At bootstrap of your web app, start a
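A minimal sketch of two actions submitted concurrently against one shared SparkContext (assuming an existing sc; data is a placeholder):
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val data = sc.parallelize(1 to 1000000).cache()

// Both actions run on the same SparkContext from different threads; this is safe.
val countF = Future { data.count() }
val sumF   = Future { data.map(_.toLong).reduce(_ + _) }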
This example helps a lot. :)
But I am thinking about the case below:
Assume I have a SparkContext as a global variable.
Then if I use multiple threads to access/use it, will it mess up?
For example:
My code:
public static List<Tuple2<Integer, Double>> run(JavaSparkContext sparkContext,
Finally I gave up after too many failed retries.
From the log on the worker side, it looks like it failed with a JVM OOM, as below:
15/02/05 17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[Driver Heartbeater,5,main]java.lang.OutOfMemoryError: Java
heap
There's a kMeansModel.clusterCenters() available if you are looking to get the
centers from a KMeansModel.
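A minimal sketch, assuming data is an RDD[Vector] (k and the iteration count are placeholders):
import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(data, 2, 20)  // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)  // one center Vector per cluster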
From: SK skrishna...@gmail.com
To: user@spark.apache.org
Sent: Thursday, February 5, 2015 5:35 PM
Subject: K-Means final cluster centers
Hi,
I am trying to get the final cluster
I have a file badFullIPs.csv of bad IP addresses used for filtering. In
yarn-client mode, I simply read it off the edge node, transform it, and then
broadcast it:
val badIPs = fromFile(edgeDir + "badfullIPs.csv")
val badIPsLines = badIPs.getLines
val badIpSet = badIPsLines.toSet
I am trying to use the StreamingContext getOrCreate method in my app.
I started by running the example ( RecoverableNetworkWordCount
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala
), which worked as
Hello Everyone,
I wanted to hear the community's thoughts on what (open-source) tools
have been used to visualize data from Spark/Spark Streaming? I've taken a
look at Zeppelin, but had some trouble working with it.
Couple questions:
1) I've looked at a couple blog posts and it seems like
Now that Kafka 0.8.2.0 has been released, adding external/kafka module
works.
FYI
On Sun, Jan 18, 2015 at 7:36 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. there was no 2.11 Kafka available
That's right. Adding external/kafka module resulted in:
[ERROR] Failed to execute goal on project
I have the following code:
SparkConf conf = new
SparkConf().setAppName("streamer").setMaster("local[2]");
conf.set("spark.driver.allowMultipleContexts", "true");
JavaStreamingContext ssc = new JavaStreamingContext(conf, new
Duration(batch_interval));
hi all,
just wondering if there is a reason why it is not possible to add intercepts
for streaming regression models? I understand that the run method in the
underlying GeneralizedLinearModel does not take an intercept as a parameter
either. Any reason for that?
thanks,
Hi,
My guess is that Spark is not picking up the jar where the function is
stored. You might have to add it to the SparkContext or the classpath manually.
You can also register the function:
hc.registerFunction("myfunct", myfunct)
and then use it in the query.
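A minimal sketch with a hypothetical UDF (the names toUpper and people are placeholders; hc is the HiveContext):
val toUpper = (s: String) => s.toUpperCase  // hypothetical UDF
hc.registerFunction("toUpper", toUpper)
hc.sql("SELECT toUpper(name) FROM people").collect().foreach(println)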
Could you clarify what you mean by "build another Spark and work through
Spark Submit"?
If you are referring to utilizing Spark with Thrift, you could start
the Spark service and then have your spark-shell, spark-submit, and/or
Thrift service aim at the master you have started.
On Thu Feb 05
As a Spark newbie, I've come across this thread. I'm playing with Word2Vec in
our Hadoop cluster and here's my issue with classic Java serialization of
the model: I don't have SSH access to the cluster master node.
Here's my code for computing the model:
val input =
You can use Akka, which is the underlying multithreading library Spark uses.
On Thu, Feb 5, 2015 at 9:56 PM, Shuai Zheng szheng.c...@gmail.com wrote:
Nice. I just tried it and it works. Thanks very much!
And I notice there is below in the log:
15/02/05 11:19:09 INFO Remoting: Remoting started;
While a spark-submit job is setting up, the yarnAppState goes into Running
mode, then I get a flurry of typical looking INFO-level messages such as
INFO BlockManagerMasterActor: ...
INFO YarnClientSchedulerBackend: Registered executor: ...
Then, spark-submit quits without any error message and
Nice. I just tried it and it works. Thanks very much!
And I notice there is below in the log:
15/02/05 11:19:09 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@NY02913D.global.local:8162]
15/02/05 11:19:10 INFO AkkaUtils: Connecting to HeartbeatReceiver:
Does anyone have an idea where I can find the detailed log of that lost
executor (why it was lost)?
Thanks in advance!
On 05 Feb 2015, at 16:14, Yifan LI iamyifa...@gmail.com wrote:
Hi,
I am running a GraphX application with heavy memory/CPU overhead. I think the
memory is sufficient, and I set
If you want to design something like the Spark shell, have a look at:
http://zeppelin-project.org/
It's open source and may already do what you need. If not, its source code
will be helpful in answering your questions about how to integrate with
long-running jobs.
On Thu Feb 05 2015 at
Here's another lightweight example of running a SparkContext in a common
java servlet container: https://github.com/calrissian/spark-jetty-server
On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke charles.fed...@gmail.com
wrote:
If you want to design something like Spark shell have a look at:
Li, I cannot tell you the reason for this exception, but I have seen these
kinds of errors when using the HASH-based shuffle manager (the default
until v1.2). Try the SORT shuffle manager.
Hopefully that will help.
Thanks
Ankur
Does anyone have an idea where I can find the detailed log of that lost
Hi Jenny,
You may try to use --files $SPARK_HOME/conf/hive-site.xml
--driver-class-path hive-site.xml when submitting your application. The
problem is that when running in cluster mode, the driver is actually
running in a random container directory on a random executor node. By
using
Please note that Spark 1.2.0 only supports Hive 0.13.1 or 0.12.0;
no other versions are supported.
Best,
Cheng
On 1/25/15 12:18 AM, guxiaobo1982 wrote:
Hi,
I built and started a single node standalone Spark 1.2.0 cluster along
with a single node Hive 0.14.0 instance installed by
I submit Spark jobs from a machine behind a firewall, and I can't open any
incoming connections to that box. Does the driver absolutely need to accept
incoming connections? Is there any workaround for this case?
Thanks.
Hi Deb, from what I found online, changing parallelism is one option. But the
failed stage already had 200 tasks, which is quite large for a single 24-core
box. I know querying that amount of data on one box is kind of over the top,
but I do want to know how to configure it to use less memory, even if it could mean
Hi All,
We have filenames with a timestamp, say ABC_1421893256000.txt, and the
timestamp needs to be extracted from the file name for further processing. Is
there a way to get the input file name picked up by a Spark Streaming job?
Thanks in advance
Subacini
Hi Shao,
When I changed to StringDecoder I got this compile error:
[error]
/sata_disk/workspace/spark-1.2.0/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaW
ordCount.scala:78: not found: type StringDecoder
[error] KafkaUtils.createStream[String, String,
Did you include the Kafka jars? StringDecoder is under kafka/serializer. You
can refer to the unit test KafkaStreamSuite in Spark to see how to use this API.
Thanks
Jerry
From: Eduardo Costa Alfaia [mailto:e.costaalf...@unibs.it]
Sent: Friday, February 6, 2015 9:44 AM
To: Shao, Saisai
Cc: Sean
Greetings!
Again, thanks to all who have given suggestions. I am still trying to diagnose
a problem where I have processes that run for one or several hours but
intermittently stall or hang. By stall I mean that there is no CPU usage on
the workers or the driver, nor network activity, nor do I
I don't think your screenshots came through in the email. Nonetheless,
queueStream will not work with getOrCreate. It's mainly for testing (by
generating your own RDDs) and not really useful for production usage (where
you really need checkpoint-based recovery).
TD
On Thu, Feb 5, 2015 at 4:12
Hi Jon,
You'll need to put the file on HDFS (or whatever distributed filesystem
you're running on) and load it from there.
-Sandy
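A minimal sketch, assuming the file has been copied to HDFS first (the path is a placeholder):
val badIpSet = sc.textFile("hdfs:///data/badFullIPs.csv")
  .map(_.trim)
  .collect()   // pull the small set back to the driver
  .toSet
val badIpsBC = sc.broadcast(badIpSet)   // ship once to every executor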
On Thu, Feb 5, 2015 at 3:18 PM, YaoPau jonrgr...@gmail.com wrote:
I have a file badFullIPs.csv of bad IP addresses used for filtering. In
yarn-client mode, I
Hi Ayoub,
The doc page isn't wrong, but it's indeed confusing.
spark.sql.parquet.compression.codec is used when you're writing a Parquet
file with something like data.saveAsParquetFile(...). However, you are
using Hive DDL in the example code. All Hive DDLs and commands like
SET are
Hi,
I think you should change the `DefaultDecoder` type parameter to
`StringDecoder`; it seems you want to decode the message into a String.
`DefaultDecoder` returns Array[Byte], not String, so the class cast
will fail.
Thanks
Jerry
-Original Message-
From:
Hi,
I have a 170 GB tab-delimited data set which I am converting into the
RDD[LabeledPoint] format. I am then taking a 60% sample of this data set to be
used for training a GBT model.
I got the "Size exceeds Integer.MAX_VALUE" error, which I fixed by
repartitioning the data set to 1000
Thanks Akhil and Mark. I can of course count events (assuming I can deduce
the shuffle boundaries), but like I said the program isn't simple and I'd
have to do this manually every time I change the code. So I'd rather find a
way of doing this automatically if possible.
On 4 February 2015 at 19:41,
RDD#toDebugString will help.
On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass jw...@crossref.org wrote:
Thanks Akhil and Mark. I can of course count events (assuming I can deduce
the shuffle boundaries), but like I said the program isn't simple and I'd
have to do this manually every time I change the
Hi
Are you trying to run a Spark job from inside Eclipse, and want the job to
access Hive configuration options, to access Hive tables?
Thanks
Arush
On Tue, Feb 3, 2015 at 7:24 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
I know two options, one for spark_submit, the other one for
Hi,
I got strange behavior. When I am creating a SchemaRDD for a nested directory,
sometimes it works and sometimes it does not. My question is whether
nested directories are supported or not.
My code is as below.
val fileLocation = "hdfs://localhost:9000/apps/hive/warehouse/hl7"
val parquetRDD =
1. For what reasons is Spark using the above ports? What internal component
is triggering them? - Akka (guessing from the error log) is used to schedule
tasks and to notify executors; the ports used are random by default.
2. How can I get rid of these errors? - Probably the ports are not open on
Hi,
I am trying to do an HBase scan and read it into a Spark RDD using pyspark. I
have successfully written data to hbase from pyspark, and been able to read a
full table from hbase using the python example code. Unfortunately I am unable
to find any example code for doing an HBase scan and
The problem got resolved after removing all the configuration files from
all the slave nodes. Earlier we were running in the standalone mode and
that led to duplicating the configuration on all the slaves. Once that was
done it ran as expected in cluster mode. Although performance is not up to
Hi,
In fact, this pull request https://github.com/apache/spark/pull/3920 does an
HBase scan. However, it is not merged yet.
You can also take a look at the example code at
http://spark-packages.org/package/20, which uses Scala and Python to
read data from HBase.
Hope this is helpful.
Cheers
Gen
And the Job page of the web UI will give you an idea of stages completed
out of the total number of stages for the job. That same information is
also available as JSON. Statically determining how many stages a job
logically comprises is one thing, but dynamically determining how many
stages
My apologies for following up on my own post, but I thought this might be of
interest.
I terminated the java process corresponding to the executor which had opened the
stderr file mentioned below (kill pid). Then my Spark job completed without
error (it was actually almost finished).
Now I am
Hi,
I'm trying to change a setting as described here:
http://spark.apache.org/docs/1.2.0/ec2-scripts.html
export SPARK_WORKER_CORES=6
Then I ran ~/spark-ec2/copy-dir /root/spark/conf to distribute it to the
slaves, but without any effect. Do I have to restart the workers?
How do I do that with spark-ec2?
Any idea why, if I use more containers, I get a lot of them stopped because of GC?
2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
I'm not caching the data. With each iteration I mean each 128 MB
that an executor has to process.
The code is pretty simple.
final Conversor c = new
Hi,
You can also check out the Spark Kernel project:
https://github.com/ibm-et/spark-kernel
It can plug into the upcoming IPython 3.0 notebook (providing a Scala/Spark
language interface) and provides an API to submit code snippets (like the
Spark Shell) and get results directly back, rather
I modified the code based on CassandraCQLTest to get the area code count
based on time zone. I got an error on creating the new map RDD. Any help is
appreciated. Thanks.
... val arecodeRdd = sc.newAPIHadoopRDD(job.getConfiguration(),
classOf[CqlPagingInputFormat],
Hi, I am working on a text mining project and I want to use the
NaiveBayesClassifier of MLlib to classify some stream items. So I have two
Spark contexts, one of which is a streaming context. Everything works fine if
I comment out the train and predict methods, although it obviously doesn't do
Hi,
Could you share the code snippet.
Thanks,
Vishnu
On Thu, Feb 5, 2015 at 11:22 PM, aanilpala aanilp...@gmail.com wrote:
Hi, I am working on a text mining project and I want to use the
NaiveBayesClassifier of MLlib to classify some stream items. So I have two
Spark contexts, one of which is a
Is it possible that value.get("area_code") or value.get("time_zone")
returned null?
On Thu, Feb 5, 2015 at 10:58 AM, oxpeople vincent.y@bankofamerica.com
wrote:
I modified the code based on CassandraCQLTest to get the area code count
based on time zone. I got an error on creating the new map RDD.
Kundan,
So I think your configuration here is incorrect. We need to adjust memory
and #executors. So for your case you have:
Cluster setup
5 nodes
16 GB RAM
8 cores.
The number of executors should be the total number of nodes in your cluster
- in your case 5. As for --executor-cores, it should
Why don't you just map the RDD's rows to lines and then call saveAsTextFile()?
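A minimal sketch, assuming a 1.x SchemaRDD (where Row behaves like a Seq), tab-separated output, and that a single partition, hence one part file, is acceptable (the path is a placeholder):
schemaRdd
  .map(row => row.mkString("\t"))  // Row -> text line
  .coalesce(1)                     // one partition => one output file
  .saveAsTextFile("hdfs:///tmp/single-out")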
On 3.2.2015. 11:15, Hafiz Mujadid wrote:
I want to write a whole SchemaRDD to a single file in HDFS but am facing the
following exception
Hi,
I'm learning Spark and testing the Spark MLlib library with the K-means
algorithm.
So,
I created a file height-weight.txt like this:
65.0 220.0
73.0 160.0
59.0 110.0
61.0 120.0
...
And the code (executed in spark-shell):
import org.apache.spark.mllib.clustering.KMeans
import