As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that
Spark is behaving differently when reading Parquet directories that contain
a .metadata directory.
It seems that in Spark 1.2.x it would just ignore the .metadata directory,
but now that I'm using Spark 1.3, reading these
Hi All,
I am curious to know whether anyone has successfully deployed a Spark cluster
using supervisord?
- http://supervisord.org/
Currently I am using the cluster launch scripts, which are working great;
however, every time I reboot my VM or development environment I need to
re-launch the
Assuming you are talking about a standalone cluster:
IMHO, with workers you won't get any problems and it's straightforward,
since they are usually foreground processes.
With the master it's a bit more complicated: ./sbin/start-master.sh goes to the
background, which is not good for supervisor, but anyway I think
Hi all,
I'm playing around with manipulating images via Python and want to
utilize Spark for scalability. That said, I'm just learning Spark and my
Python is a bit rusty (been doing PHP coding for the last few years). I
think I have most of the process figured out. However, the script fails
(bcc: user@spark, cc:cdh-user@cloudera)
If you're using CDH, Spark SQL is currently unsupported and mostly
untested. I'd recommend against trying to use it in CDH. You could try an upstream
version of Spark instead.
On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote:
As part of
Try with a larger number of partitions in parallelize.
On 4 Jun 2015 06:28, Justin Spargur jmspar...@gmail.com wrote:
Hi all,
I'm playing around with manipulating images via Python and want to
utilize Spark for scalability. That said, I'm just learning Spark and my
Python is a bit rusty
Hi all
We recently merged support for launching YARN clusters using Spark EC2
scripts as a part of
https://issues.apache.org/jira/browse/SPARK-3674. To use this you can pass
in hadoop-major-version as yarn to the spark-ec2 script, and this will
set up Hadoop 2.4 HDFS, YARN, and Spark built for YARN.
Hello User group,
I have an RDD of LabeledPoint composed of sparse vectors, as shown below. In
the next step, I am standardizing the columns with the Standard Scaler. The
data has 2450 columns and ~110M rows. It took 1.5hrs to complete the
standardization with 10 nodes and 80 executors. The
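For reference, the setup is roughly the following (a minimal sketch, since the actual code was trimmed; `data` stands for the RDD of LabeledPoint):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
// data: RDD[LabeledPoint] with sparse feature vectors (placeholder name)
// withMean = false keeps the sparse vectors sparse; fit() computes the column statistics
val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))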
Tasks are scheduled on executors based on data locality. Things work as
you would expect in the example you brought up.
Through dynamic allocation, the number of executors can change throughout
the lifetime of an application. 10 executors (or 5 executors with 2 cores
each) are not needed for a
Hi Doug,
Actually, sqlContext.table does not support a database name in either Spark 1.3
or Spark 1.4. We will support it in a future version.
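In the meantime, going through SQL does reach a qualified table name (a minimal sketch, matching the workaround Doug mentions below):
val df = sqlContext.sql("SELECT * FROM db.tbl")   // works even though sqlContext.table("db.tbl") does not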
Thanks,
Yin
On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog doug.sparku...@dugos.com
wrote:
Hi,
sqlContext.table(“db.tbl”) isn’t working for me, I get a
Hi Yasemin,
If you can convert your user IDs to Integers in pre-processing (which works
unless you have a couple billion users), that would work. Otherwise...
In Spark 1.3: You may need to modify ALS to use Long instead of Int.
In Spark 1.4: spark.ml.recommendation.ALS (in the Pipeline API) exposes
ALS.train
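For the pre-processing route, a rough sketch (the input RDD and its field layout are hypothetical, not from the thread):
import org.apache.spark.mllib.recommendation.{ALS, Rating}
// rawRatings: RDD[(String, Int, Double)] of (userId, productId, rating) -- hypothetical shape
val userIdToInt = rawRatings.map(_._1).distinct().zipWithIndex().mapValues(_.toInt)
// safe only while the number of distinct users fits in a 32-bit Int
val ratings = rawRatings.map { case (u, p, r) => (u, (p, r)) }
  .join(userIdToInt)
  .map { case (_, ((p, r), uid)) => Rating(uid, p, r) }
val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda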
The short answer is yes.
How you do it depends on a number of factors. Assuming you want to build an RDD
from the responses and then analyze the responses using Spark core (not Spark
Streaming), here is one simple way to do it:
1) Implement a class or function that connects to a web service and
When training a RandomForest model, the Strategy class (in
mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping
to use this to cut down on processing time for large datasets (more than 2MM
rows and 9K predictors), but I've found that the runtime stays approximately
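For reference, this is the kind of configuration I mean (a sketch; the dataset and parameter values are placeholders):
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
val strategy = Strategy.defaultStrategy("Classification")
strategy.subsamplingRate = 0.1   // each tree should see ~10% of the rows
strategy.maxDepth = 10
// trainingData: RDD[LabeledPoint] -- placeholder
val model = RandomForest.trainClassifier(trainingData, strategy, 50, "sqrt", 42)   // numTrees, featureSubsetStrategy, seed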
Hi,
I'm trying to use property substitution in my log4j.properties, so that
I can choose where to write spark logs at runtime.
The problem is that the system property passed to spark-shell
doesn't seem to be getting propagated to log4j.
Here is my log4j.properties (partial) with a parameter
Did you try this?
Create an sbt project like:
// Create your context
val sconf = new SparkConf().setAppName("Sigmoid").setMaster("spark://sigmoid:7077")
val sc = new SparkContext(sconf)
// Do some computations
sc.parallelize(1 to 1).take(10).foreach(println)
//Now return the exit status
Hi,
I want to run my Spark app on a cluster;
I use the Cloudera Live single-node VM.
How must I build the job for the spark-submit script?
And do I have to upload it to HDFS?
Best regards,
Paul
Have you done sc.stop() ? :)
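For example (a minimal sketch of a driver that always stops its context):
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("MyApp"))   // "MyApp" is a placeholder
try {
  // ... job logic ...
} finally {
  sc.stop()   // finalizes the event log so the application shows up in the history page
}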
On 3 Jun 2015 14:05, amghost zhengweita...@outlook.com wrote:
I run spark application in spark standalone cluster with client deploy
mode.
I want to check out the logs of my finished application, but I always get
a
page telling me Application history not found -
As far as I know, Spark doesn't support multiple outputs.
On Wed, Jun 3, 2015 at 2:15 PM, ayan guha guha.a...@gmail.com wrote:
Why do you need to do that if the filter and the content of the resulting RDDs are
exactly the same? You may as well declare them as one RDD.
On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏)
Hi,
I was running a Spark job to insert overwrite a Hive table and got
Permission denied. My question is: why did the Spark job do the insert as
user 'hive' rather than as the user who ran the job? How can I fix the problem?
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
You need to look into your executor/worker logs to see what's going on.
Thanks
Best Regards
On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no
wrote:
Hi,
What can be the cause of this ERROR cluster.YarnScheduler: Lost executor?
How can I fix it?
Best,
Patcharee
I checked RDD#randomSplit; it is more like multiple one-to-one
transformations than a one-to-multiple transformation.
I wrote some sample code as follows; it would generate 3 stages. Although
we can use cache here to make it better, if Spark could support multiple
outputs, only 2 stages
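Roughly like this (a reconstruction of the idea, since the snippet was trimmed; the predicate and paths are hypothetical):
val parent = sc.textFile("hdfs:///input")                   // hypothetical source
parent.cache()                                              // without this, each child output recomputes the parent
val matched    = parent.filter(line => line.contains("A"))  // hypothetical predicate
val notMatched = parent.filter(line => !line.contains("A"))
matched.saveAsTextFile("hdfs:///out/matched")
notMatched.saveAsTextFile("hdfs:///out/not-matched")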
Node down, or container preempted? You need to check the executor log /
node manager log for more info.
On Wed, Jun 3, 2015 at 2:31 PM, patcharee patcharee.thong...@uni.no wrote:
Hi,
What can be the cause of this ERROR cluster.YarnScheduler: Lost executor?
How can I fix it?
Best,
You can build Spark from the 1.4 release branch yourself:
https://github.com/apache/spark/tree/branch-1.4
-
Daniel Emaasit,
Ph.D. Research Assistant
Transportation Research Center (TRC)
University of Nevada, Las Vegas
Las Vegas, NV 89154-4015
Cell: 615-649-2489
www.danielemaasit.com
--
I run into errors while trying to build Spark from the 1.4 release branch:
https://github.com/apache/spark/tree/branch-1.4. Any help will be
much appreciated. Here is the log file. (FYI, I installed all the
dependencies, like Java 7 and Maven 3.2.5.)
C:\Program Files\Apache Software
Hi again,
Below is the log from executor
FetchFailed(BlockManagerId(4, compute-10-0.local, 38594), shuffleId=0,
mapId=117, reduceId=117, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to
compute-10-0.local/10.10.255.241:38594
at
Hi guys, I am new to Spark. I am using SparkSubmit to submit Spark jobs,
but for my use case I don't want it to exit with System.exit. Is there
any other Spark client that is API-friendly, other than SparkSubmit, and
doesn't exit with System.exit? Please correct me if I am missing
In the sense here, Spark actually does have operations that make multiple
RDDs, like randomSplit. However, there is no equivalent of a partition
operation that gives the elements that matched and did not match at once.
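For example, randomSplit returns several RDDs from a single call (a minimal sketch; `data` is a placeholder):
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)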
On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang zjf...@gmail.com wrote:
As far
I think when you do an ssc.stop it will stop your entire application; and by
'update a transformation function' do you mean modifying the driver program?
In that case, even if you checkpoint your application, it won't be able to
recover from its previous state.
A simpler approach would be to add certain
Hi Akhil, sorry, I may not be conveying the question properly. Actually, we
are looking to launch a Spark job from a long-running workflow manager,
which invokes the Spark client via SparkSubmit. Unfortunately, the client, upon
successful completion of the application, exits with a System.exit(0) or
This is the log I can get:
15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch
(2/3) for 4 outstanding blocks after 5000 ms
15/06/02 16:37:36 INFO client.TransportClientFactory: Found inactive
connection to compute-10-3.local/10.10.255.238:33671, creating a new one.
15/06/02
Run it as a standalone application. Create an sbt project and do sbt run?
Thanks
Best Regards
On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri
pavan.kolam...@gmail.com wrote:
Hi guys, I am new to Spark. I am using SparkSubmit to submit Spark jobs,
but for my use case I don't want it
Hi everybody,
I'm new to Spark, apologies if my question is very basic.
I have a need to send millions of requests to a web service and analyse and
store the responses in an RDD. I can easily express the analysing part using
Spark's filter/map/etc. primitives but I don't know how to make the
I tried some other data sources as well, and it actually works for some
Parquet sources. There are potentially some specific problems with that first
Parquet source I tried, and it is not a Spark 1.4 problem. I'll get back with
more info if I find any new information.
Thanks,
Anders
On Tue, Jun 2,
Which version of spark? Looks like you are hitting this one
https://issues.apache.org/jira/browse/SPARK-4516
Thanks
Best Regards
On Wed, Jun 3, 2015 at 1:06 PM, patcharee patcharee.thong...@uni.no wrote:
This is log I can get
15/06/02 16:37:31 INFO shuffle.RetryingBlockFetcher: Retrying
I run into errors while trying to build Spark from the 1.4 release branch:
https://github.com/apache/spark/tree/branch-1.4. Any help will be much
appreciated. Here is the log file from my windows 8.1 PC. (F.Y.I, I
installed all the dependencies like Java 7, Maven 3.2.5 and set
the environment
Why do you need to do that if the filter and the content of the resulting RDDs are
exactly the same? You may as well declare them as one RDD.
On 3 Jun 2015 15:28, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I want to do this
val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId
!=
Hi,
What can be the cause of this ERROR cluster.YarnScheduler: Lost
executor? How can I fix it?
Best,
Patcharee
I think you could check the YARN NodeManager log or the Spark executor
logs to see the details. The exception stacks you listed above are just the
symptom, not the cause. Normally there are a few situations that will lead to
executor loss:
1. Killed by YARN because of memory
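If it does turn out to be YARN killing containers over memory, one knob worth knowing (an assumption about the cause, not a diagnosis) is the executor memory overhead:
import org.apache.spark.SparkConf
// extra off-heap headroom (MB) per executor before YARN kills the container (Spark 1.x property name)
val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")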
Dmitry was concerned about the “serialization cost”, NOT the “memory footprint” –
hence option a) is still viable, since a Broadcast is performed only ONCE for
the lifetime of the Driver instance.
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, June 3, 2015 2:44 PM
To: Evo Eftimov
Cc:
When Spark reads Parquet files (sqlContext.parquetFile), it creates a
DataFrame. I would like to know whether the resulting DataFrame has a columnar
structure (many rows of a column coalesced together in memory) or the row-wise
structure that a Spark RDD has. The section Spark SQL and DataFrames
I had the same issue a couple days ago. It's a bug in 1.3.0 that is fixed
in 1.3.1 and up.
https://issues.apache.org/jira/browse/SPARK-6036
The event logs are moved from in-progress to complete only after the Master
tries to build the history page for the logs.
The
I think the short answer to the question is, no, there is no alternate API
that will not use the System.exit calls. You can craft a workaround like the one
being suggested in this thread. For comparison, we are doing programmatic
submission of applications in a long-running client application. To get
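One concrete route for that kind of programmatic submission is the SparkLauncher API (available since Spark 1.3); a hedged sketch, with placeholder paths and class names:
import org.apache.spark.launcher.SparkLauncher
// runs the application in a child process, so the calling JVM never hits SparkSubmit's System.exit
val process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")    // placeholder
  .setMainClass("com.example.MyApp")        // placeholder
  .setMaster("yarn-client")
  .launch()
val exitCode = process.waitFor()            // plain java.lang.Process from here on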
When Spark reads Parquet files (sqlContext.parquetFile), it creates a
DataFrame. I would like to know whether the resulting DataFrame has a columnar
structure (many rows of a column coalesced together in memory) or the row-wise
structure that a Spark RDD has. The section Spark SQL and DataFrames
Hi everybody,
Is there anything in Spark that shares the philosophy of Storm's field grouping?
I'd like to manage data partitioning across the workers by sending tuples
sharing the same key to the very same worker in the cluster, but I did not
find any method to do that.
Suggestions?
:)
Hmm, a Spark Streaming app's code doesn't execute in the linear fashion
assumed in your previous code snippet - to achieve your objectives you
should do something like the following.
In terms of your second objective - saving the initialization and
serialization of the params - you can:
a) broadcast
Hi,
I want to use Spark's ALS in my project. I have user IDs
like 30011397223227125563254, and the Rating object of ALS
wants an Integer as the user ID, so the ID field does not fit into a 32-bit
Integer. How can I solve that? Thanks.
Best,
yasemin
--
hiç ender hiç
I think you're exactly right. I once had 100 iterations in a single Pregel
call, and got into the lineage problem right there. I had to modify the
Pregel function and checkpoint both the graph and the newVerts RDD there to
cut off the lineage. If you draw out the dependency graph among the g,
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically
the takeaway is that all objects passed into the code processing RDDs must
be serializable. So if I've got a few objects that I'd rather initialize
once and deinitialize once, outside of the logic processing the RDDs, I'd
Considering the memory footprint of the param, as mentioned by Dmitry, option b
seems better.
Cheers
On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov evo.efti...@isecc.com wrote:
Hmmm a spark streaming app code doesn't execute in the linear fashion
assumed in your previous code snippet - to achieve your
Great.
You should monitor vital performance / job clogging stats of the Spark
Streaming Runtime, not “kafka topics” -- anything specific you were thinking
of?
On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote:
Makes sense especially if you have a cloud with “infinite”
I got the same error message when using Maven 3.3.
On Jun 3, 2015 8:58 AM, Ted Yu yuzhih...@gmail.com wrote:
I used the same command on Linux but didn't reproduce the error.
Can you include -X switch on your command line ?
Also consider upgrading maven to 3.3.x
Cheers
On Wed, Jun 3,
So Evo, option b) is to singleton the Param, as in your modified snippet,
i.e. instantiate it once per RDD.
But if I understand correctly, option a) is broadcast, meaning instantiation
happens in the Driver once before any transformations and actions,
correct? That's where my serialization costs
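To make sure I'm reading the two options correctly, a sketch (HeavyParam, the stream, and their use are hypothetical):
class HeavyParam extends Serializable { def process(x: String): String = x }   // hypothetical stand-in for the expensive param
// stream: a hypothetical DStream[String]
// option a) broadcast: built once in the Driver, shipped (serialized) once to each executor
val paramBc = sc.broadcast(new HeavyParam())
stream.foreachRDD { rdd => rdd.foreach(x => paramBc.value.process(x)) }
// option b) executor-side singleton: referenced statically, so it is built lazily on each executor
object Params { lazy val heavy = new HeavyParam() }
stream.foreachRDD { rdd => rdd.foreach(x => Params.heavy.process(x)) }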
If we have a hand-off between the older consumer and the newer consumer, I
wonder if we need to manually manage the offsets in Kafka so as not to miss
some messages as the hand-off is happening.
Or if we let the new consumer run for a bit then let the old consumer know
the 'new guy is in town'
Would it be possible to implement Spark autoscaling somewhat along these
lines? --
1. If we sense that a new machine is needed, by watching the data load in
Kafka topic(s), then
2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
and get a machine);
3. Create a shadow/mirror
Makes sense, especially if you have a cloud with “infinite” resources / nodes,
which allows you to double, triple, etc., in the background / in parallel, the
resources of the currently running cluster.
I was thinking more about the scenario where you have e.g. 100 boxes and want
to / can add e.g. 20
I used the same command on Linux but didn't reproduce the error.
Can you include -X switch on your command line ?
Also consider upgrading maven to 3.3.x
Cheers
On Wed, Jun 3, 2015 at 2:36 AM, Daniel Emaasit daniel.emaa...@gmail.com
wrote:
I run into errors while trying to build Spark from
Yes, I think you're right. Since this is a change to the ASF hosted
site, I can make this change to the .md / .html directly rather than
go through the usual PR.
On Wed, Jun 3, 2015 at 6:23 PM, linkstar350 . tweicomepan...@gmail.com wrote:
Hi, I'm Taira.
I notice that this example page may be
Hi, I'm Taira.
I noticed that this example page may contain a mistake.
https://spark.apache.org/examples.html
Word Count (Java)
JavaRDD<String> textFile = spark.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
public
Hi,
I've got a Spark Streaming driver job implemented and in it, I register a
streaming listener, like so:
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(params.getBatchDurationMillis()));
jssc.addStreamingListener(new JobListener(jssc));
where
I am pasting some of the exchanges I had on this topic via the mailing list
directly so it may help someone else too. (Don't know why those responses
don't show up here).
---
Thanks Imran. It does help clarify. I believe I had it right all along then
but was
I think what we'd want to do is track the ingestion rate in the consumer(s)
via Spark's aggregation functions and such. If we're at a critical level
(load too high / load too low) then we issue a request into our
Provisioning Component to add/remove machines. Once it comes back with an
OK, each
The default of 0 means no limit. Each batch will grab as much as is
available, i.e. a range of offsets spanning from the end of the previous
batch to the highest available offsets on the leader.
If you set spark.streaming.kafka.maxRatePerPartition > 0, the number you
set is the maximum number of
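For example (a minimal sketch; the value is arbitrary):
import org.apache.spark.SparkConf
// cap reads at 1000 records per second from each Kafka partition
val conf = new SparkConf().set("spark.streaming.kafka.maxRatePerPartition", "1000")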
I have an existing, operating Spark cluster that was launched with the spark-ec2
script. I'm trying to add a new slave by following these instructions:
Stop the cluster
On the AWS console, "Launch More Like This" from one of the slaves
Start the cluster
Although the new instance is added to the same security
Try something like the following.
Create a function to make the HTTP call, e.g.
using org.apache.commons.httpclient.HttpClient as in below.
def getUrlAsString(url: String): String = {
val client = new org.apache.http.impl.client.DefaultHttpClient()
val request = new
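Filling in the rest as a hedged sketch (the body of the function and the driver-side usage below are assumptions, not the original message; the URL list and partition count are placeholders):
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.http.util.EntityUtils

def getUrlAsString(url: String): String = {
  val client = new DefaultHttpClient()
  try {
    val response = client.execute(new HttpGet(url))
    EntityUtils.toString(response.getEntity)        // read the body into a String
  } finally {
    client.getConnectionManager.shutdown()          // release the connection
  }
}

// urls: Seq[String] -- the endpoints to hit (placeholder)
val responses = sc.parallelize(urls, 500)           // spread the calls over many partitions
  .map(url => (url, getUrlAsString(url)))           // each HTTP call runs on a worker
responses.cache()
// then analyse with ordinary RDD operations, e.g. a hypothetical error check:
val failures = responses.filter { case (_, body) => body.contains("error") }.count()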
Hi all,
Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client mode with
Hive support.
Build command:
./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn -Pmapr4
-Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided
-Dhadoop.version=2.5.1-mapr-1501
Hi Joseph,
I thought about converting the IDs, but there will be a birthday problem. The
probability of a hash collision
(http://preshing.com/20110504/hash-collision-probabilities/) is important
for me because of the number of users. I don't know how I can modify ALS to use
Integer.
yasemin
2015-06-04 2:28
Thank you. I confirmed the page.
2015-06-04 1:35 GMT+09:00 Sean Owen so...@cloudera.com:
Yes, I think you're right. Since this is a change to the ASF hosted
site, I can make this change to the .md / .html directly rather than
go through the usual PR.
On Wed, Jun 3, 2015 at 6:23 PM,
The fit part is very slow; the transform is not slow at all.
The number of partitions was 210 vs. 80 executors.
Spark 1.4 sounds great, but as my company is using Qubole we are dependent upon
them to upgrade from version 1.3.1. Until that happens, can you think of any
other reasons why it
I am trying to understand what the treeReduce function for an RDD does, and
how it is different from the normal reduce function. My current
understanding is that treeReduce tries to split up the reduce into multiple
steps. We do a partial reduce on different nodes, and then a final reduce is
done
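For concreteness, this is how I'm comparing the two calls (a small sketch; `nums` is just a toy example):
val nums = sc.parallelize(1 to 1000000, 100)
val total1 = nums.reduce(_ + _)        // per-partition partial reduces, final merge on the driver
val total2 = nums.treeReduce(_ + _, 2) // also merges partial results in intermediate tasks (depth 2)
                                       // before a smaller final merge on the driver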
Which part of StandardScaler is slow, fit or transform? Fit has a shuffle, but a
very small one, and transform doesn't shuffle. I guess you don't have enough
partitions, so please repartition your input dataset to a number at least
larger than the # of executors you have.
In Spark 1.4's new ML pipeline
Hi Kaspar,
This is definitely doable, but in my opinion, it's important to remember
that, at its core, Spark is based around a functional programming paradigm
- you're taking input sets of data and, by applying various
transformations, you end up with a dataset that represents your answer.
Hi Jonathan,
Maybe you can try BigDataBench: http://prof.ict.ac.cn/BigDataBench/ . It
provides lots of workloads, including both Hadoop- and Spark-based workloads.
Zhen Jia
hodgesz wrote
Hi Spark Experts,
I am curious what people are using to benchmark
Can you do a count() before fit to force materialization of the RDD? I think
something before fit is slow.
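Something like this (a rough sketch; the numbers and the `data` RDD are placeholders):
import org.apache.spark.mllib.feature.StandardScaler
val repartitioned = data.repartition(400).cache()   // well above the 80 executors
repartitioned.count()                               // forces the upstream work, isolating fit()'s own cost
val scaler = new StandardScaler(withMean = false, withStd = true).fit(repartitioned.map(_.features))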
On Wednesday, June 3, 2015, Piero Cinquegrana pcinquegr...@marketshare.com
wrote:
The fit part is very slow, transform not at all.
The number of partitions was 210 vs number of executors 80.
Thanks Akhil, Richard, and Oleg for your quick responses.
@Oleg, we actually tried the same thing, but unfortunately when we throw an
exception the Akka framework catches all exceptions, thinks the job failed,
and reruns the Spark jobs indefinitely. Since in OneForOneStrategy in Akka,
the max no. of
Hi,
sqlContext.table("db.tbl") isn't working for me, I get a NoSuchTableException.
But I can access the table via
sqlContext.sql("select * from db.tbl")
So I know it has the table info from the metastore.
Anyone else see this?
I'll keep digging.
I compiled via make-distribution -Pyarn
This happens automatically when you use the byKey operations, e.g. reduceByKey,
updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys
on a specific node and sends new tuples with that key to that node.
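For example (a minimal sketch; the DStream and its key field are hypothetical):
// pairs with the same key are always routed to the partition/executor that holds their state
val counts = events.map(e => (e.userId, 1))   // events: hypothetical DStream, userId: hypothetical key field
  .reduceByKey(_ + _)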
Matei
On Jun 3, 2015, at 6:31 AM, allonsy luke1...@gmail.com wrote: