I'm doing a performance analysis for Spark on Mesos, and I can see that the
coarse-grained backend simply launches tasks in waves sized to the number of
cores available. But it seems that in fine-grained mode the Mesos executor takes 1
core for itself (so -1 core per Mesos slave). Shouldn't fine- and
Hi, all
We use Spark SQL to insert data from a text table into a partitioned table
and found that if we give more cores to the executors the insert performance gets
worse.
executors num | total-executor-cores | average time for insert task
3
Hi,
Akhil Das suggested using ssh tunnelling (ssh -L 4040:127.0.0.1:4040
master-ip, then opening localhost:4040 in a browser), and this solved my
problem, so it made me think that the settings of my cluster were wrong.
So I checked the inbound rules for the security group of my cluster and I
From: Wangfei (X)
Sent: 11 June 2015 17:33
To: user@spark.apache.org
Subject:
Hi, all
We use Spark SQL to insert data from a text table into a partitioned table
and found that if we give more cores to the executors the insert performance gets
worse.
Hi Eric,
We are also running into the same issue. Were you able to find a
suitable solution to this problem?
Best Regards
Hi Marchelo,
The collected data end up in, say, class C; c.id is the id of each of
those records. But that id might appear more than once across those 1mil XML
files, so I am doing a reduceByKey(). Even if I had multiple binaryFiles
RDDs, wouldn't I have to ++ those in order to correctly
No, this is just a random queue name I picked when submitting the job; there's
no specific configuration for it. I am not logged in, so don't have the
default fair scheduler configuration in front of me, but I don't think
that's the problem. The cluster is completely idle, there aren't any jobs
being
Hi
I am struggling to find out how to run a Scala script on Datastax Spark.
(SPARK_HOME/bin/spark-shell -i test.scala is deprecated.)
I don't want to use the Scala prompt.
Thanks
AT
Hi all:
We are trying to use Spark to do some real-time data processing. I
need to do some SQL-like queries and analytical tasks on the real-time data
against historical normalized data stored in databases. Has anyone
done this kind of work or design? Any suggestion or material
If I want to restart my consumers into an updated cluster topology after
the cluster has been expanded or contracted, would I need to call stop() on
them, then call start() on them, or would I need to instantiate and start
new context objects (new JavaStreamingContext(...)) ? I'm thinking of
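For reference, a minimal Scala sketch of the new-context option (the
JavaStreamingContext API is analogous); note that a stopped StreamingContext
cannot be started again, so a fresh context is required:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// stop the old context (and its SparkContext) gracefully
ssc.stop(stopSparkContext = true, stopGracefully = true)

val conf = new SparkConf().setAppName("consumer")
val newSsc = new StreamingContext(conf, Seconds(10))
// re-create the input DStreams against the updated cluster topology here
newSsc.start()
newSsc.awaitTermination()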
I have a flatMap that takes a line from an input file, splits it by tab into
words, then returns an array of tuples consisting of every combination of 2
words. I then go on to count the frequency of each combination across the
whole file using a reduceByKey (where the key is the tuple).
I am
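A minimal Scala sketch of the pattern described (file name and delimiter are
assumptions):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pair-counts"))

// every combination of 2 words per line, counted across the whole file
val counts = sc.textFile("input.tsv")
  .flatMap { line =>
    val words = line.split("\t")
    for {
      i <- words.indices
      j <- (i + 1) until words.length
    } yield ((words(i), words(j)), 1)
  }
  .reduceByKey(_ + _)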
John, I took the liberty of reopening because I have sufficient JIRA
permissions (not sure if you do). It would be good if you could add relevant
comments/investigations there.
On Thu, Jun 11, 2015 at 8:34 AM, John Omernik j...@omernik.com wrote:
Hey all, from my other post on Spark 1.3.1 issues,
Now I am profiling the executor.
There seems to be a memory leak.
20 mins after the run there were:
157k byte[] allocated for 75MB.
519k java.lang.ref.Finalizer for 31MB,
291k java.util.zip.Inflater for 17MB
487k java.util.zip.ZStreamRef for 11MB
An hour after the run I got :
186k byte[]
Hi,
I am using a Kafka-Spark cluster for a real-time aggregation analytics use case
in production.
Cluster details:
6 nodes, each running Spark and Kafka processes.
Node 1 - 1 Master, 1 Worker, 1 Driver, 1 Kafka process
Nodes 2,3,4,5,6 - 1 Worker process each
Another option would be to close sc and open new context with your custom
configuration
On Jun 11, 2015 01:17, bhomass bhom...@gmail.com wrote:
you need to register using spark-default.xml as explained here
Or launch the spark-shell with --conf spark.kryo.registrator=foo.bar.MyClass
2015-06-11 14:30 GMT+02:00 Igor Berman igor.ber...@gmail.com:
Another option would be to close sc and open new context with your custom
configuration
On Jun 11, 2015 01:17, bhomass bhom...@gmail.com wrote:
you need
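A minimal sketch of the close-and-reopen option, reusing the foo.bar.MyClass
registrator from the example above:

import org.apache.spark.{SparkConf, SparkContext}

sc.stop()  // close the existing context first

val conf = new SparkConf()
  .setAppName("custom-kryo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "foo.bar.MyClass")
val newSc = new SparkContext(conf)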
Hi Gaurav,
Seems like you could use a broadcast variable for this if I understand your
use case. Create it in the driver based on the CommandLineArguments and
then use it in the workers.
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
So something like:
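(The snippet was cut off in the digest; below is a hedged guess at the shape,
with hypothetical parseArgs and process helpers and an existing rdd.)

// built once on the driver, shipped to every executor via the broadcast
val cmdArgs = parseArgs(args)
val argsBc = sc.broadcast(cmdArgs)
val result = rdd.map(record => process(record, argsBc.value))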
Hi Jeroen,
No problem. I think there's some magic involved with how the Spark
classloader(s) works, especially with regards to the HBase dependencies. I
know there's probably a more light-weight solution that doesn't require
customizing the Spark setup, but that's the most straight-forward way
Hi Spark Users,
I'm trying to load a genuinely big file (50GB when compressed as a gzip file,
stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this
file cannot fit in my memory. However, it looks like no RDD will be
received until I copy this big file to a prior-specified
Hello all,
I was wondering if it is possible to have a high latency batch processing
job coexists within Spark Streaming job? If it's possible then could we
share the state of the batch job with the Spark Streaming job?
For example when the long-running batch computation is complete, could we
Hello,
I am running a streaming app in Spark 1.2.1. When running local everything
works fine. When I try on yarn-cluster it fails and I see ClassCastException
in the log (see below). I can run Spark (non-streaming) apps in the cluster
with no problem.
Any ideas here? Thanks in advance!
WARN
...and keeps on increasing.
Maybe there is a bug in some code that zips/unzips data.
109k instances of byte[] followed by 1 mil instances of Finalizer, with
~500k Deflaters and ~500k Inflaters and 1 mil ZStreamRef
I assume that's due to either binaryFiles or saveAsObjectFile
On 11/06/15
Hi,
I tried to read a CSV file from Amazon S3, but I get the following
exception, which I have no clue how to solve. I tried both Spark 1.3.1
and 1.2.1, but no success. Any idea how to solve this is appreciated.
best,
/Shahab
the code:
val hadoopConf=sc.hadoopConfiguration;
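(The code was cut off here; the following is a sketch of the usual s3n
credential setup for Spark 1.2/1.3, with bucket, path, and keys as
placeholders.)

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "<ACCESS_KEY>")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "<SECRET_KEY>")
val csv = sc.textFile("s3n://my-bucket/path/data.csv")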
after 2h of running, now I got a 10GB long[], 1.3mil instances of long[]
So probably information about the files again.
Hey all, from my other post on Spark 1.3.1 issues, I think we found an
issue related to a previously closed Jira (
https://issues.apache.org/jira/browse/SPARK-1403). Basically it looks like
the thread context class loader is NULL, which is causing the NPE in MapR,
and that's similar to the posted Jira.
I see. It makes a lot of sense now. It is not unique to Spark, but it would
be great if it were mentioned in the Spark documentation.
I have been using Hadoop for a while and I was not aware of it!
Zheng zheng
On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs wrbri...@gmail.com wrote:
To be fair, this is
Hi,
Does
df.write.partitionBy(partitions).format(format).mode(overwrite).saveAsTable(tbl)
support the ORC file format?
I tried df.write.partitionBy("zone", "z", "year",
"month").format("orc").mode("overwrite").saveAsTable("tbl"), but after
the insert my table tbl schema has been changed to something I did not
This is a kryo issue. https://github.com/EsotericSoftware/kryo/issues/124.
It has to do with the lengths of the fieldnames. This issue is fixed in
Kryo 2.23.
What's weird is this doesn't break on Hive itself, only when using
SparkSQL. Attached is the full stacktrace. It might be how SparkSQL is
I've seen plenty of examples for creating HDFS files from pyspark but
haven't been able to figure out how to delete files from pyspark. Is there
an API I am missing for filesystem management? Or should I be including the
HDFS python modules?
Thanks,
Siegfried
What would be closest equivalent in MLLib to Oracle Data Miner's Attribute
Importance mining function?
http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920
Attribute importance is a supervised function that ranks attributes
according to their significance in
Hey Corey,
Yes, when shuffles are smaller than the memory available to the OS, most
often the outputs never get stored to disk. I believe the same holds
for the YARN shuffle service, because the write path is actually the
same, i.e. we don't fsync the writes and force them to disk. I would
guess in
Hi Jerry,
Take a look at this example:
https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2
The offsets are needed because, as RDDs get generated within Spark, the
offsets move further along. With direct Kafka mode the current offsets are
no longer persisted in Zookeeper
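A minimal Scala sketch of the offset-tracking pattern from the linked page
(directKafkaStream is the DStream created by KafkaUtils.createDirectStream):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array[OffsetRange]()

directKafkaStream.transform { rdd =>
  // capture the offsets while the RDD is still a KafkaRDD
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}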
Thanks, Jerry. That's what I suspected based on the code I looked at. Any
pointers on what is needed to build in this support would be great. This is
critical to the project we are currently working on.
Thanks!
On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com
wrote:
OK, I
Hi Ningjun,
This is probably a configuration difference between WIN01 and WIN02.
Execute: ipconfig/all on the windows command line on both machines and compare
them.
Also if you have a localhost entry in the hosts file, it should not have the
wrong sequence: See the first answer in this
OK, I get it. I think the Python-based Kafka direct API currently does not
provide such an equivalent to Scala's; maybe we should figure out how to add
this to the Python API as well.
2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:
Hi Jerry,
Take a look at this example:
Hmm, you have a good point. So should I load the file by `sc.textFile()`
and specify a high number of partitions, and the file is then split into
partitions in memory across the cluster?
On Thu, Jun 11, 2015 at 9:27 PM ayan guha guha.a...@gmail.com wrote:
Why do you need to use stream in this
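A sketch of that approach; note (my assumption, based on standard Hadoop
behavior) that a single .gz file is not splittable, so it is read as one
partition and must be repartitioned after loading:

// the path and partition count are placeholders
val lines = sc.textFile("hdfs:///data/big.gz").repartition(200)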
This is a cross post of a StackOverflow question that has not been answered.
I have just started a 50pt bounty on it there.
http://stackoverflow.com/questions/30744294/error-using-json4s-with-apache-spark-in-spark-shell
You can answer it there if you prefer
I am trying to use the case class extraction
Congratulations on the release of 1.4!
I have been trying out the direct Kafka support in python but haven't been
able to figure out how to get the offsets from the RDD. Looks like the
documentation is yet to be updated to include Python examples (
Thanks for your reply! In my use case, it would stream from only one stdin.
Also I'm working with Scala.
It would be great if you could talk about multi stdin case as well! Thanks.
From: Tathagata Das t...@databricks.commailto:t...@databricks.com
Date: Thursday, June 11, 2015 at 8:11 PM
To:
Hi,
What do you mean by getting the offsets from the RDD? From my
understanding, the offsetRange is a parameter you supply to KafkaRDD, so why
do you still want to read back the one you previously set?
Thanks
Jerry
2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:
Congratulations on the
Hi guys: I'm trying to run 24 apps simultaneously in one Spark
cluster.
However, every time my cluster can only run 17 apps at the same
time. The other apps disappear: no logs, no failures. Any ideas will be
appreciated.
Here is my code:
object GeneCompare5 {
  def
Hi Josh,
That worked! Thank you so much! (I can't believe it was something so obvious
;) )
If you care about such a thing you could answer my question here for bounty:
http://stackoverflow.com/questions/30639659/apache-phoenix-4-3-1-and-4-4-0-hbase-0-98-on-spark-1-3-1-classnotfoundexceptio
Hello,
I have a large CSV file in which consecutive records (with the same RecordID)
are contextually related. I should treat these consecutive records as ONE
complete record. Also, the RecordID will be reset to 1 at some time, when the
CSV dumper system thinks it's necessary.
I'd like to get some
I would guess in such shuffles the bottleneck is serializing the data
rather than raw IO, so I'm not sure explicitly buffering the data in the
JVM process would yield a large improvement.
Good guess! It is very hard to beat the performance of retrieving shuffle
outputs from the OS buffer
Let me try to add some clarity to the different thought directions
going on in this thread.
1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES?
If there are no rate limits set up, the most reliable way to detect
whether the current Spark cluster is insufficient to handle the data
Are you going to receive data from one stdin from one machine, or many
stdins on many machines?
On Thu, Jun 11, 2015 at 7:25 PM, foobar heath...@fb.com wrote:
Hi, I'm new to Spark Streaming, and I want to create a application where
Spark Streaming could create DStream from stdin. Basically I
Yes, Tathagata, thank you.
For #1, the 'need detection', one idea we're entertaining is timestamping
the messages coming into the Kafka topics. The consumers would check the
interval between the time they get the message and that message origination
timestamp. As Kafka topics start to fill up
Do you have the event logging enabled?
TD
On Thu, Jun 11, 2015 at 11:24 AM, scar0909 scar0...@gmail.com wrote:
I have the same problem. i realized that the master spark becomes
unresponsive when we kill the leader zookeeper (of course i assigned the
leader election task to the zookeeper).
The simplest way would be issuing an os.system call with an HDFS rm command
from the driver, assuming it has HDFS connectivity, like a gateway node. The
executors will have nothing to do with it.
On 12 Jun 2015 08:57, Siegfried Bilstein sbilst...@gmail.com wrote:
I've seen plenty of examples for creating HDFS files
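In Scala, the equivalent of that os.system call would be something like (the
path is a placeholder):

import scala.sys.process._

// shell out to the hdfs CLI from the driver; requires hdfs on the PATH
val exitCode = Seq("hdfs", "dfs", "-rm", "-r", "/tmp/old-output").!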
Another approach not mentioned is to use a function to get the RDD that is
to be joined. Something like this.
Not sure, but you can try something like this also:
kvDstream.foreachRDD { rdd =>
  // fetch (or refresh) the RDD that is to be joined with each batch
  val refRdd = getOrUpdateRDD(params...)
  rdd.join(refRdd)
}
The
Hi all,
I wonder if anyone has used the MapReduce Job History server to show Spark jobs.
I can see my Spark jobs (Spark running on Yarn cluster) on Resource manager
(RM).
I start Spark History server, and then through Spark's web-based user
interface I can monitor the cluster (and track cluster and
Hi, I'm new to Spark Streaming, and I want to create an application where
Spark Streaming could create a DStream from stdin. Basically I have a command
line utility that generates stream data, and I'd like to pipe data into the
DStream. What's the best way to do that? I thought rdd.pipe() could help,
but
Hi Nathan,
I am also facing the issue with Spark 1.3. Did you find any workaround for
this issue? Please help
Thanks
Sathish
On Thu, Apr 16, 2015 at 6:03 AM Nathan McCarthy
nathan.mccar...@quantium.com.au wrote:
Its JTDS 1.3.1; http://sourceforge.net/projects/jtds/files/jtds/1.3.1/
I
Hi there, if I set spark.scheduler.mode to FAIR, how many apps does
Spark support running simultaneously in one cluster? Is there an upper limit?
Thanks.
Thanks & best regards!
San.Luo
That's Spark on YARN with Kerberos.
In Spark 1.3 you can submit work to a Kerberized Hadoop cluster; once the
tokens you passed up with your app submission expire (~72 hours) your job can't
access HDFS any more.
That's been addressed in Spark 1.4, where you can now specify a kerberos keytab
for
hey guys
Using Hive and Impala daily and intensively. Want to transition to spark-sql
in CLI mode.
Currently in my sandbox I am using Spark (standalone mode) from the CDH
distribution (starving-developer version 5.3.3).
3-datanode Hadoop cluster, 32GB RAM per node, 8 cores per node.
I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with
the same contents are considered as different values because their reference
values are different.
I didn't see any way to pass in a custom comparer. I could convert the byte[]
into a String with an explicit charset, but
Here is an example:

val sc = new SparkContext(new SparkConf)
// access hive tables
val hqlc = new HiveContext(sc)
import hqlc.implicits._
// access files on hdfs
val sqlc = new SQLContext(sc)
import sqlc.implicits._
sqlc.jsonFile("xxx").registerTempTable("xxx")
// access other DB
sqlc.jdbc(url,
Hi,
Did you resolve this? I have the same questions.
Hi All,
I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is
the fifth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 210 developers and more
than 1,000 commits!
A huge thanks go to all of the individuals and organizations
I am using Spark on a machine with limited disk space. I am using it to
analyze very large (100GB to 1TB per file) data sets stored in HDFS. When I
analyze these datasets, I will run groups, joins and cogroups. All of these
operations mean lots of shuffle files written to disk.
Unfortunately
From
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
:
Look at the number of partitions in the parent RDD and then keep
multiplying that by 1.5 until performance stops improving.
FYI
On Thu, Jun 11, 2015 at 7:57 AM, matthewrj mle...@gmail.com wrote:
I have a
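A trivial sketch of that heuristic (names are illustrative):

// grow the partition count by 1.5x per trial run until performance stops improving
val parentPartitions = parentRdd.partitions.length
val candidate = math.ceil(parentPartitions * 1.5).toInt
val tuned = parentRdd.repartition(candidate)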
I've observed interesting behavior in Spark 1.3.1, the reason for which is
not clear.
Doing something as simple as sc.textFile(...).takeSample(...) always
results in two stages.
[Screenshot: "Spark's takeSample() results in two stages" -
http://apache-spark-user-list.1001560.n3.nabble.com/file/n23280/Capture.jpg]
I think if you wrap the byte[] into an object and implement equals and
hashcode methods, you may be able to do this. There will be the overhead of
extra object, but conceptually it should work unless I am missing
something.
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
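A minimal Scala sketch of the wrapper idea (the class name is illustrative):

// wraps byte[] so that equality is by content rather than by reference
class BytesKey(val bytes: Array[Byte]) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: BytesKey => java.util.Arrays.equals(bytes, that.bytes)
    case _              => false
  }
  override def hashCode: Int = java.util.Arrays.hashCode(bytes)
}

// usage: rdd.map { case (b, v) => (new BytesKey(b), v) }.reduceByKey(...)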
Makes sense – I suspect what you suggested should work.
However, I think the overhead between this and using `String` would be similar
enough to warrant just using `String`.
Mark
From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: June-11-15 12:58 PM
To: Mark Tse
Cc: user@spark.apache.org
I've looked a bit into what DataFrames are, and it seems that most posts on
the subject are related to SQL, but they do seem to be very efficient. My
main question is: are DataFrames also beneficial for non-SQL computations?
For instance I want to:
- sort k/v pairs (in particular, is the naive
Be careful shoving arbitrary binary data into a string; invalid UTF
characters can cause significant computational overhead in my experience.
On Jun 11, 2015 10:09 AM, Mark Tse mark@d2l.com wrote:
Makes sense – I suspect what you suggested should work.
However, I think the overhead
Hello, have you found a solution? I have the same issue
I have the same problem. I realized that the Spark master becomes
unresponsive when we kill the ZooKeeper leader (of course I assigned the
leader-election task to ZooKeeper). Please let me know if you have any
developments.
I load a list of ids from a text file as NLineInputFormat, and when I do
distinct(), it returns an incorrect number.

JavaRDD<Text> idListData = jvc
    .hadoopFile(idList, NLineInputFormat.class,
        LongWritable.class, Text.class).values().distinct()

I should have
Guess: it has something to do with the Text object being reused by Hadoop?
You can't in general keep around refs to them since they change. So you may
have a bunch of copies of one object at the end that become just one in
each partition.
On Thu, Jun 11, 2015, 8:36 PM Crystal Xing
Hello,
Since the other queues are fine, I reckon there may be a limit on the max
apps or memory for this queue in particular.
I don't suspect fair-scheduler limits either, but on this queue we may be
seeing / hitting a maximum.
Could you try to get the configs for the queue? That should provide
That is a little scary.
So you mean in general, we shouldn't use hadoop's writable as Key in RDD?
Zheng zheng
On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote:
Guess: it has something to do with the Text object being reused by Hadoop?
You can't in general keep around refs
Yep you need to use a transformation of the raw value; use toString for
example.
On Thu, Jun 11, 2015, 8:54 PM Crystal Xing crystalxin...@gmail.com wrote:
That is a little scary.
So you mean in general, we shouldn't use hadoop's writable as Key in RDD?
Zheng zheng
On Thu, Jun 11, 2015 at
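Sketched in Scala (the Java version is analogous): map the reused Text object
to an immutable String before calling distinct():

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.NLineInputFormat

val ids = sc
  .hadoopFile(idList, classOf[NLineInputFormat], classOf[LongWritable], classOf[Text])
  .values
  .map(_.toString)  // copy out of the mutable, reused Text
  .distinct()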
To be fair, this is a long-standing issue due to optimizations for object reuse
in the Hadoop API, and isn't necessarily a failing in Spark - see this blog
post
(https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/)
from 2011 documenting
I have a properties file that is hosted at a url. I would like to be able
to use the url in the --properties-file parameter when submitting a job to
mesos using spark-submit via chronos
I would rather do this than use a file on the local server.
This doesn't seem to work though when submitting
That's not supported. You could use wget / curl to download the file to a
temp location before running spark-submit, though.
On Thu, Jun 11, 2015 at 12:48 PM, Gary Ogden gog...@gmail.com wrote:
I have a properties file that is hosted at a url. I would like to be able
to use the url in the
Yes, DataFrames are for much more than SQL, and I would recommend using them
wherever possible. It is much easier for us to do optimizations when we
have more information about the schema of your data, and as such, most of
our ongoing optimization effort will focus on making DataFrames faster.
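For the sort example from the earlier question, a rough sketch of both forms
(column names are made up):

// the same sort expressed on an RDD of k/v pairs and on a DataFrame
val pairs = sc.parallelize(Seq(("a", 2), ("b", 1), ("c", 3)))
val sortedRdd = pairs.sortBy(_._2)

val df = sqlContext.createDataFrame(pairs).toDF("k", "v")
val sortedDf = df.sort("v")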
I start the Spark master on Windows using
bin\spark-class.cmd org.apache.spark.deploy.master.Master
Then I go to http://localhost:8080/ to find the master URL; it is
spark://WIN02:7077
Here WIN02 is my machine name. Why is it missing the domain name? If I start
the Spark master on other
Any idea why this happens?
On Wed, Jun 10, 2015 at 9:28 AM, Ashish Nigam ashnigamt...@gmail.com
wrote:
BTW, I am using spark streaming 1.2.0 version.
On Wed, Jun 10, 2015 at 9:26 AM, Ashish Nigam ashnigamt...@gmail.com
wrote:
I did not change driver program. I just shutdown the context