Hi,
I'm using the new 1.4.0 installation, and ran a job there. The job finished
and everything seems fine. However, when I open the application page, I can see that the
job is marked as KILLED:
Removed Executors
ExecutorID Worker Cores Memory State Logs
0
I feel he wanted to ask about workers. In that case, please launch workers
on Node 3, 4, 5 (and/or Node 8, 9, 10 etc.).
You need to go to each worker and start the worker daemon with the master URL:Port
(typically 7077) as a parameter (so workers can talk to the master).
You should be able to see 1 master and N
Akhil, thank you for the response. I want to explore more.
Suppose the application is just monitoring an HDFS folder and writing the word
count of each streaming batch back to HDFS.
When I kill the application _before_ Spark takes a checkpoint, after
recovery, Spark will resume the processing
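For reference, here is a minimal sketch (not the original code) of that kind of checkpointed HDFS word-count job, assuming hypothetical paths; StreamingContext.getOrCreate is what drives the recovery from the checkpoint:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "hdfs:///tmp/wordcount-checkpoint"   // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Count words in any new files that appear under the monitored folder
    val words = ssc.textFileStream("hdfs:///data/incoming").flatMap(_.split(" "))
    val counts = words.map((_, 1)).reduceByKey(_ + _)
    counts.saveAsTextFiles("hdfs:///data/wordcounts/batch")
    ssc
  }

  // On restart this rebuilds the context from the checkpoint if one exists,
  // otherwise it calls createContext to build a fresh one
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()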
Thanks Akhil, it solved the problem.
best
/Shahab
On Fri, Jun 12, 2015 at 8:50 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Looks like your Spark is not able to pick up the HADOOP_CONF. To fix this,
you can actually add jets3t-0.9.0.jar to the classpath
I'm assuming that by spark-client you mean the Spark driver program. In that
case you can pick any machine (say Node 7), create your driver program on
it, and use spark-submit to submit it to the cluster; or, if you create the
SparkContext within your driver program (specifying all the properties),
then
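As a rough sketch of that second option, assuming a hypothetical standalone master running on node1:7077; every property that would otherwise go to spark-submit is set on the SparkConf directly:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MyDriverApp")            // hypothetical application name
    .setMaster("spark://node1:7077")      // hypothetical master URL
    .set("spark.executor.memory", "2g")
    .set("spark.cores.max", "4")
  val sc = new SparkContext(conf)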
Hi Akhil,
Thanks for your response.
I have 10 cores in total across all my 3 machines and I am running 5-10 receivers.
I have tried to test the number of records processed per second by varying the
number of receivers.
If I have 10 receivers (i.e. one receiver for each core), then I am not
Something like this?
val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)
It's Scala btw; the Python API should
I'm loading these settings from a properties file:
spark.executor.memory=256M
spark.cores.max=1
spark.shuffle.consolidateFiles=true
spark.task.cpus=1
spark.deploy.defaultCores=1
spark.driver.cores=1
spark.scheduler.mode=FAIR
Once the job is submitted to Mesos, I can go to the Spark UI for that
My Mesos cluster has 1.5 CPUs and 17GB free. If I set:
conf.set("spark.mesos.coarse", "true");
conf.set("spark.cores.max", "1");
in the SparkConf object, the job will run in the Mesos cluster fine.
But if I comment out those settings above so that it defaults to fine-grained
mode, the task never finishes.
Hi,
I use Spark 1.4. I tried to create a DataFrame from the RDD below, but got
scala.reflect.internal.MissingRequirementError
val partitionedTestDF2 = pairVarRDD.toDF("column1", "column2", "column3")
//pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a case class
How can I fix this?
Exception in
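For comparison, a minimal sketch of the pattern that usually avoids MissingRequirementError, assuming a hypothetical shape for Record4Dim_2; the usual advice is to define the case class at the top level (not inside a method or another class) and import the SQLContext implicits before calling toDF:

  import org.apache.spark.sql.SQLContext

  // Defined at the top level so Scala reflection can locate it
  case class Record4Dim_2(column1: String, column2: Int, column3: Double)

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val pairVarRDD = sc.parallelize(Seq(Record4Dim_2("a", 1, 1.0)))
  val partitionedTestDF2 = pairVarRDD.toDF("column1", "column2", "column3")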
You may want to have a look at these if you (or the original author) are using
JDK 8.0:
https://issues.apache.org/jira/browse/SPARK-4193
https://issues.apache.org/jira/browse/SPARK-4543
Cheers,
On Sat, Apr 4, 2015 at 10:39 PM mas mas.ha...@gmail.com wrote:
Hi All,
I am trying to build spark
name := "SparkLeaning"
version := "1.0"
scalaVersion := "2.10.4"
//scalaVersion := "2.11.2"
libraryDependencies ++= Seq(
  //"org.apache.hive" % "hive-jdbc" % "0.13.0",
  //"io.spray" % "spray-can" % "1.3.1",
  //"io.spray" % "spray-routing" % "1.3.1",
  "io.spray" % "spray-testkit" % "1.3.1" % "test",
  "io.spray" %% "spray-json" %
How far did you get?
On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
We use Scoobi + MR to perform joins, and we particularly use the blockJoin()
API of Scoobi
/** Perform an equijoin with another distributed list where this list is
considerably smaller
* than the
Hi Davies,
I have tried the recent 1.4 and 1.5-snapshot to 1) open the parquet file and save it
again, or 2) apply a schema to an RDD and save the DataFrame as parquet, but now I get
this error (right in the beginning):
java.lang.OutOfMemoryError: Java heap space
at
Hey,
I had some similar issues in the past when I used Java 8. Are you using
Java 7 or 8? (It's just an idea, because I had a similar issue.)
On Mon 15 Jun 2015 at 6:52 am Wwh 吴 wwyando...@hotmail.com wrote:
name := "SparkLeaning"
version := "1.0"
scalaVersion := "2.10.4"
//scalaVersion := "2.11.2"
JavaDStream<String> inputStream = ssc.queueStream(rddQueue);
Can this rddQueue be dynamic in nature? If yes, then how do I
keep it running until rddQueue is finished?
Is there any other way to get rddQueue from a dynamically updatable normal Queue?
--
Thanks Regards,
SERC-IISC
Anshu Shukla
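For what it's worth, a minimal Scala sketch of the same idea (the Java queueStream API is analogous); queueStream keeps polling the queue on every batch interval, so RDDs enqueued after the stream starts are still picked up. The producer loop below is hypothetical:

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(2))
  val rddQueue = new mutable.Queue[RDD[String]]()

  // The stream reads from rddQueue for as long as the context runs,
  // so the queue can keep being filled from another thread
  val inputStream = ssc.queueStream(rddQueue, oneAtATime = true)
  inputStream.count().print()
  ssc.start()

  // Hypothetical producer: keep enqueueing new RDDs after the stream has started
  for (i <- 1 to 10) {
    rddQueue.synchronized { rddQueue += sc.parallelize(Seq("event-" + i)) }
    Thread.sleep(1000)
  }
  ssc.stop()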
Hi Proust,
Is it possible to see the query you are running, and can you run EXPLAIN
EXTENDED to show the physical plan
for the query? To generate the plan you can do something like this from
$SPARK_HOME/bin/beeline:
0: jdbc:hive2://localhost:10001> explain extended select * from
YourTableHere;
Hi all,
I was looking for an explanation of the number of partitions for a joined
RDD.
The documentation for Spark 1.3.1 says:
For distributed shuffle operations like reduceByKey and join, the largest
number of partitions in a parent RDD.
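As a small illustration of that rule, assuming neither parent RDD has an explicit partitioner and spark.default.parallelism is not set, the joined RDD ends up with the larger parent's partition count:

  val left = sc.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 4)
  val right = sc.parallelize(Seq(("a", "x"), ("c", "y")), numSlices = 8)

  val joined = left.join(right)
  println(joined.partitions.length)   // expected to be 8 here

  // An explicit numPartitions argument (or a custom Partitioner) overrides that default
  println(left.join(right, 16).partitions.length)   // 16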
Thanks a lot Akhil. After trying some suggestions in the tuning guide, there
seems to be no improvement at all.
Below is the job detail when running locally (8 cores), which took 3 min
to complete the job; we can see it is the map operation that took most of the time,
looks like the mapPartitions took too long
Hi,
I have this in my spark-defaults.conf (same for hdfs):
spark.eventLog.enabled true
spark.eventLog.dir file:/tmp/spark-events
spark.history.fs.logDirectory file:/tmp/spark-events
While the app is running, there is a “.inprogress” directory. However when the
job
I tried running the job in a standalone cluster and I'm getting this:
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
worker-node/0.0.0.0;
Even the following POM is also not working:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <parent>
I am trying to create a new RDD based on a given PairRDD. I have a PairRDD with
a few keys, but each key has a large number (about 100k) of values. I want to somehow
repartition and make each `Iterable<v>` into an RDD[v] so that I can further
apply map, reduce, sortBy etc. effectively on those values. I am sensing
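One common workaround, sketched minimally here and assuming a hypothetical pairRDD: RDD[(String, Int)] with only a handful of distinct keys, is to collect the keys on the driver and build one filtered RDD per key:

  import org.apache.spark.rdd.RDD

  // Hypothetical input: few distinct keys, ~100k values per key in the real data
  val pairRDD: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  pairRDD.cache()   // each per-key RDD below rescans pairRDD, so cache it first

  val keys = pairRDD.keys.distinct().collect()
  val perKeyRdds: Map[String, RDD[Int]] =
    keys.map(k => k -> pairRDD.filter(_._1 == k).values).toMap

  // Each per-key RDD can now be mapped, reduced, or sorted independently
  perKeyRdds("a").sortBy(identity).take(10)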
How do I decide how many partitions I should break my data up into, and how many
executors I should have? I guess memory and cores will be allocated based on
the number of executors I have.
Thanks
When you submit a job, Spark breaks it down into stages, as per the DAG. The
stages run transformations or actions on the RDDs. Each RDD consists of
N partitions. The tasks created by Spark to execute a stage are equal in number
to the number of partitions. Every task is executed on the cores utilized
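A small illustration of that relationship, with hypothetical paths and partition counts:

  // Reading with (at least) 8 partitions means the first stage runs ~8 tasks
  val rdd = sc.textFile("hdfs:///data/input", minPartitions = 8)   // hypothetical path
  println(rdd.partitions.length)

  // Repartitioning changes the task count of the stages after the shuffle
  val wider = rdd.repartition(32)
  println(wider.partitions.length)   // 32 tasks for the next stage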
Hello,
I have a data set of about 80 million users and 12,000 items (very sparse).
I can get the training part working no problem (the model has 20 factors).
However, when I try using predict-all for 80 million x 10 items, the job
does not complete.
When I use a smaller data set, say 500k or a
Looking at the logs of the executor, it looks like it fails to find the file;
e.g. for task 10323.0
15/06/16 13:43:13 ERROR output.FileOutputCommitter: Hit IOException trying
to rename
Have you found an answer to this? I am also looking for the exact same solution.
Hi guys,
Using Spark 1.4, I'm trying to save a DataFrame as a table, a really simple
test, but I'm getting a bunch of NPEs.
The code I'm running is very simple:
qc.read.parquet("/user/sparkuser/data/staged/item_sales_basket_id.parquet").write.format("parquet").saveAsTable("is_20150617_test2")
Logs of
I saw it once but I was not clear how to reproduce it. The jira I created
is https://issues.apache.org/jira/browse/SPARK-7837.
More information will be very helpful. Were those errors from speculative
tasks or regular tasks (the first attempt of the task)? Is this error
deterministic (can you
Sorry, cut-and-paste error; the resulting data set I want is this:
({(101,S)=3},piece_of_data_1))
({(101,S)=3},piece_of_data_2))
({(101,S)=1},piece_of_data_3))
({(109,S)=2},piece_of_data_3))
Hi Raj,
Since the number of executor cores is equivalent to the number of tasks
that can be executed in parallel in the executor, in effect, the 6G of
executor memory configured for an executor is being shared by 6 tasks, plus
you need to factor in the memory allocated for caching and task execution. I would
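As a back-of-the-envelope sketch of that sharing, using the 6G/6-core executor above and the Spark 1.x default memory fractions (spark.storage.memoryFraction = 0.6, spark.shuffle.memoryFraction = 0.2); the figures are illustrative only:

  val executorMemoryMb = 6 * 1024      // --executor-memory 6G
  val executorCores    = 6             // tasks running concurrently
  val storageFraction  = 0.6           // spark.storage.memoryFraction (1.x default)
  val shuffleFraction  = 0.2           // spark.shuffle.memoryFraction (1.x default)

  val cacheMb        = executorMemoryMb * storageFraction    // ~3686 MB reserved for cached RDDs
  val shuffleMb      = executorMemoryMb * shuffleFraction    // ~1229 MB for shuffle buffers
  val perTaskShuffle = shuffleMb / executorCores             // ~205 MB of shuffle memory per task

  println(f"cache=$cacheMb%.0f MB, shuffle per task=$perTaskShuffle%.0f MB")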
Hi folks,
Help me! I ran into a very weird problem. I really need some help!! Here is my
situation:
Case: Assign keys to two datasets (one is 96GB with 2.7 billion records and
one 1.5GB with 30k records) via MapPartitions first, and
Hey Yin,
Thanks for the link to the JIRA. I'll add details to it. But I'm able to
reproduce it, at least in the same shell session; every time I do a write I
get a random number of tasks failing on the first run with the NPE.
Using dynamic allocation of executors in YARN mode. No speculative
Hi,
I'm getting this error while running Spark as a Java project using Maven:
15/06/15 17:11:38 INFO SparkContext: Running Spark version 1.3.0
15/06/15 17:11:38 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/06/15
Hello list,
The method *insertIntoJDBC(url: String, table: String, overwrite: Boolean)*
provided by Spark DataFrame allows us to copy a DataFrame into a JDBC DB
table. Similar functionality is provided by the *createJDBCTable(url:
String, table: String, allowExisting: Boolean)* method. But if you
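For context, a minimal sketch of how those two calls are used in Spark 1.3/1.4, assuming an existing DataFrame df and a hypothetical JDBC URL and table name (the JDBC driver jar still has to be on the classpath):

  // hypothetical connection string
  val url = "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret"

  // Create a new table matching df's schema and write its rows into it
  df.createJDBCTable(url, "people_copy", allowExisting = false)

  // Write df's rows into a table that already exists
  df.insertIntoJDBC(url, "people_copy", overwrite = false)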
Hello All,
I have a Spark job that throws java.lang.OutOfMemoryError: GC overhead
limit exceeded.
The job is trying to process a file of size 4.5G.
I've tried the following Spark configuration:
--num-executors 6 --executor-memory 6G --executor-cores 6 --driver-memory 3G
I tried increasing more
Hi,
Though my project has nothing to do with Akka, I'm getting this error:
Exception in thread main com.typesafe.config.ConfigException$Missing: No
configuration setting found for key 'akka.version'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at
I am trying to measure how Spark standalone cluster performance scales out with
multiple machines. I did a test of training an SVM model, which is heavy in
memory computation. I measured the run time for a Spark standalone cluster of 1-3
nodes; the results are as follows:
1 node: 35 minutes
2 nodes: 30.1
Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes, each
of them has a Spark worker.
The problem is that Spark runs 869 tasks to read 3 rows: select bar from
foo.
I've tried these properties:
# try to avoid 769 tasks per dummy select foo from bar query
Hi June,
As I understand your problem, you are running Spark 1.3 and want to use
Tachyon with it. What you need to do is simply build the latest Spark and
Tachyon and set some configuration in Spark. In fact Spark 1.3 has
spark/core/pom.xml; you have to find the core folder in your Spark home
and
I have written a custom receiver for converting the tuples in the Dynamic
Queue/EventGen to a DStream. But I don't know why it is only processing
data for some time (3-4 sec.) and then shows the Queue as empty. Any
suggestions please...
--code //
public class JavaCustomReceiver extends
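Not the original code, but a minimal Scala sketch of the same idea, assuming a hypothetical eventGen queue; the important part is that the loop started in onStart keeps polling (and calling store) until the receiver is stopped, rather than returning once the queue momentarily looks empty:

  import java.util.concurrent.ConcurrentLinkedQueue
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class QueueReceiver(eventGen: ConcurrentLinkedQueue[String])
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

    def onStart(): Unit = {
      // Poll on a separate thread so onStart returns quickly
      new Thread("queue-receiver") {
        override def run(): Unit = {
          while (!isStopped()) {
            val tuple = eventGen.poll()
            if (tuple != null) store(tuple)   // push the tuple into the DStream
            else Thread.sleep(100)            // queue temporarily empty, keep polling
          }
        }
      }.start()
    }

    def onStop(): Unit = { /* the polling loop exits once isStopped() returns true */ }
  }

It would then be wired in with ssc.receiverStream(new QueueReceiver(eventGen)).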
Hi
Has anyone experienced a problem with uploading to S3 with the s3n protocol and
Spark's newHadoopApi, where the job completes successfully (there is a _SUCCESS
marker), but in reality one of the parts of the file is missing?
Thanks in advance
PS: we are trying s3a now (which needs an upgrade to Hadoop 2.7)
There are a lot of variables to consider. I'm not an expert on Spark, and
my ML knowledge is rudimentary at best, but here are some questions whose
answers might help us to help you:
- What type of Spark cluster are you running (e.g., Stand-alone, Mesos,
YARN)?
- What does the HTTP UI
Hi,
Spark on YARN should help in the memory management for Spark jobs.
Here is a good starting point:
https://spark.apache.org/docs/latest/running-on-yarn.html
YARN integrates well with HDFS and should be a good solution for a large
cluster.
What specific features are you looking for that HDFS
I just wanted to clarify - when I said you hit your maximum level of
parallelism, I meant that the default number of partitions might not be
large enough to take advantage of more hardware, not that there was no way
to increase your parallelism - the documentation I linked gives a few
suggestions
“turn (keep turning) your HDFS file (Batch RDD) into a stream of messages
(outside spark streaming)” – what I meant by that was “turn the Updates to your
HDFS dataset into Messages” and send them as such to spark streaming
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Monday, June
I use the attached program to test checkpointing. It's quite simple.
When I run the program a second time, it will load the checkpoint data; that's
expected, however I see an NPE in the driver log.
Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very
much!
== logs ==
Have a look here https://spark.apache.org/docs/latest/tuning.html
Thanks
Best Regards
On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng pf...@cn.ibm.com wrote:
Hi, Spark Experts
I have played with Spark for several weeks; after some testing, a reduce
operation on a DataFrame costs 40s on a
Then go for the second option I suggested - simply turn (keep turning) your
HDFS file (Batch RDD) into a stream of messages (outside spark streaming) –
then spark streaming consumes and aggregates the messages FOR THE RUNTIME
LIFETIME of your application in some of the following ways:
1.
I think it should be fine, that's the whole point of check-pointing (in
case of driver failure etc).
Thanks
Best Regards
On Mon, Jun 15, 2015 at 6:54 AM, Haopu Wang hw...@qilinsoft.com wrote:
Hi, can someone help to confirm the behavior? Thank you!
-Original Message-
From: Haopu
Hi, if your data is not so huge you can use both Cloudera's and HDP's free
stacks. Cloudera Express is 100% open source and free.
-
Software Developer
SigmoidAnalytics, Bangalore
Check this link
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
Hope this will solve your problem.
-
Software Developer
Sigmoid