Probably a naive question: can you try the same in the Hive CLI and see if your
SQL works? It looks like a Hive issue to me, as Spark is faithfully
delegating the query to Hive.
On 29 May 2015 03:22, Abhishek Tripathi trackissue...@gmail.com wrote:
Hi,
I'm using the CDH 5.4.0 quick start VM and tried
I have a simple problem:
I have the mean number of people at one place per hour (time-series-like data), and now
I want to know whether the weather conditions have an impact on that mean number.
I would do this with a variance analysis such as ANOVA in SPSS, or by analysing the
summary of the resulting regression model.
How is it done in Spark?
Which version of Spark? In 1.4, window queries will be available for this kind
of scenario.
One thing I can suggest is to keep daily aggregates materialised, partitioned
by key and sorted by the key-day combination using the
repartitionAndSortWithinPartitions method.
It allows you to use a custom partitioner and a custom sort order.
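A minimal sketch of that idea (the KeyDay type, partition count, and sample values are made up for illustration):

import org.apache.spark.Partitioner

case class KeyDay(key: String, day: Int)

// Partition on the key only, so every day for a given key lands in the same partition
class KeyOnlyPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(k: Any): Int = k match {
    case KeyDay(key, _) => ((key.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// Sort within each partition by the key-day combination
implicit val keyDayOrdering: Ordering[KeyDay] = Ordering.by(kd => (kd.key, kd.day))

val daily = sc.parallelize(Seq((KeyDay("a", 20150527), 10L), (KeyDay("a", 20150528), 12L)))
val sorted = daily.repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(8))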
Which would imply that if there were a load-manager type of service, it
could signal to the driver(s) that they need to acquiesce, i.e. process
what's at hand and terminate. Then bring up a new machine and restart
the driver(s). Same deal with removing machines from the cluster: send a
signal
Thanks, Sandy. I was digging through the code in deploy.yarn.Client and
literally found that property right before I saw your reply. I'm on 1.2.x
right now, which doesn't have the property. I guess I need to upgrade sooner
rather than later.
On Thu, May 28, 2015 at 3:56 PM, Sandy Ryza
Hi all, I'm using Spark 1.3.1 with Hive 0.13.1. When running a UDF that accesses
a Hive struct array, the query fails with:
Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
Serialization trace:
fieldName
Hi,
Suddenly Spark jobs started failing with the following error:
Exception in thread "main" java.io.FileNotFoundException:
/user/spark/applicationHistory/application_1432824195832_1275.inprogress (No
such file or directory)
Full trace here:
[21:50:04 x...@hadoop-client01.dev:~]$ spark-submit --class
Assuming that I have the following data frame:
flag | price
-----|-----------------
1    | 47.808764653746
1    | 47.808764653746
1    | 31.9869279512204
1    | 47.7907893713564
1    | 16.7599200038239
1    | 16.7599200038239
1    | 20.3916014172137
How can I create a data frame with an extra indexed column?
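One way (a sketch, assuming Spark 1.3-style APIs and a DataFrame named df) is to zip the underlying RDD with an index and rebuild the DataFrame with one extra column:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Pair every row with a stable 0-based index
val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
// Extend the original schema with the new column
val schemaWithIndex = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
val indexedDf = sqlContext.createDataFrame(indexedRows, schemaWithIndex)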
Thanks for the valuable information. The blog states:
"The cores property controls the number of concurrent tasks an executor can
run. --executor-cores 5 means that each executor can run a maximum of five
tasks at the same time."
So, I guess the max number of executor-cores I can assign is the
I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave
20150528-063307-780930314-5050-8152-S5
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
So, no matter what I provide
Hi guys,
Does SPARK_EXECUTOR_CORES assume hyper-threading? For example, if I
have 4 cores with 2 threads per core, should SPARK_EXECUTOR_CORES be
4*2 = 8, or just 4?
Thanks,
I don't think the number of CPU cores controls the "number of parallel tasks".
The number of tasks corresponds first and foremost to the number of (DStream)
RDD partitions.
The Spark documentation doesn't mention what is meant by "Task" in terms of
standard multithreading terminology, i.e. a
I am submitting jobs to my YARN cluster via yarn-cluster mode, and I'm
noticing that the JVM that fires up to allocate the resources, etc. is not
going away after the application master and executors have been allocated.
Instead, it just sits there printing one-second status updates to the
console.
Hi Corey,
As of this PR https://github.com/apache/spark/pull/5297/files, this can be
controlled with spark.yarn.submit.waitAppCompletion.
-Sandy
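For reference, a sketch of how that would look on the command line with Spark 1.4+ (the class and jar names here are placeholders):

spark-submit --master yarn-cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --class com.example.MyApp myapp.jar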
On Thu, May 28, 2015 at 11:48 AM, Corey Nolet cjno...@gmail.com wrote:
I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm
Could you try commenting out some lines in
`extract_sift_features_opencv` to find which line causes the crash?
If the bytes coming from sequenceFile() are corrupt, it's easy to crash a
C library in Python (OpenCV).
On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga sammiest...@gmail.com wrote:
Hi
It's not only about cores. Keep in mind that spark.executor.cores also affects
the available memory for each task:
From
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
"The memory available to each task is (spark.executor.memory *
spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores."
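As an illustration (made-up numbers, using the 1.x defaults spark.shuffle.memoryFraction=0.2 and spark.shuffle.safetyFraction=0.8): with spark.executor.memory=4g and spark.executor.cores=5, each task gets roughly 4096 MB * 0.2 * 0.8 / 5, which is about 131 MB.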
Hi all,
As the author of the dynamic allocation feature I can offer a few insights
here.
Gerard's explanation was both correct and concise: dynamic allocation is
not intended to be used in Spark Streaming at the moment (1.4 or before).
This is because of two things:
(1) the number of receivers is
Hi,
I have a daily batch job that computes a daily aggregate of several counters
represented by some object.
After the daily aggregation is done, I want to compute blocks of multi-day
aggregations (3, 7, 30 days, etc.).
To do so I need to add the new daily aggregation to the current block and then
subtract from the current
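A minimal sketch of that add/subtract idea (the Counters class and its fields are invented for illustration):

// Aggregate object with + and - so a block can be rolled forward
case class Counters(requests: Long, errors: Long) {
  def +(o: Counters) = Counters(requests + o.requests, errors + o.errors)
  def -(o: Counters) = Counters(requests - o.requests, errors - o.errors)
}

// Roll an N-day block: add the newest day, drop the day leaving the window
def rollBlock(block: Counters, newest: Counters, oldest: Counters): Counters =
  block + newest - oldest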
Hi ,
I'm using the CDH 5.4.0 quick start VM and tried to build Spark with Hive
compatibility so that I can run Spark SQL and access temp tables remotely.
I used the command below to build Spark; the build was successful, but when I
try to access Hive data from Spark SQL, I get an error.
Thanks,
Abhi
I am working on the Databricks Reference applications, porting them to my
company's platform, and extending them to emit RDF. I have already gotten
them working with the extension on EC2, and have the Log Analyzer
application working on our platform. But the Twitter Language Classifier
application
Hi Somnath,
Are there step-by-step instructions for using Eclipse to develop
Spark applications? I think many people need them. Thanks!
Best Regards
Nan Xiao
On Thu, May 28, 2015 at 3:15 PM, Somnath Pandeya
somnath_pand...@infosys.com wrote:
Try the Scala Eclipse plugin to "eclipsify" the Spark project
I have used VoltDB and Spark. The use cases for the two are quite different.
VoltDB is intended for transactions and also supports queries on the
same (custom-to-VoltDB) store. Spark SQL is NOT suitable for transactions; it
is designed for querying immutable data (which may exist in several
Hello All,
I am using Spark Streaming 1.3. I want to capture a few custom metrics based
on accumulators. I followed an approach somewhat similar to this:
val instrumentation = new SparkInstrumentation("example.metrics")
val numReqs = sc.accumulator(0L)
Hi Mohit,
Thanks for your reply.
If my use case is purely querying read-only data (no transaction
scenarios), at what scale is one of them a better option than the other? I
am aware that for a scale that can be supported on a single node, VoltDB is
a better choice. However, when the scale grows
Hello,
I am trying to save data from Spark to Cassandra.
I have a ScalaEsRDD (because I take data from Elasticsearch) that
contains a lot of key/value pairs like this:
(AU16r4o_kbhIuSky3zFO, Map(@timestamp -> 2015-05-21T21:35:54.035Z,
timestamp -> 2015-05-21 23:35:54,035, loglevel -> INFO,
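In case it helps, a sketch with the spark-cassandra-connector (the keyspace, table, and column names are made up; the idea is to map each key/Map pair to the columns you want first):

import com.datastax.spark.connector._

// Flatten each (id, fields) pair into a row tuple, then save
val rows = esRdd.map { case (id, fields) =>
  (id, fields.getOrElse("loglevel", ""), fields.getOrElse("@timestamp", ""))
}
rows.saveToCassandra("logs_ks", "logs", SomeColumns("id", "loglevel", "ts"))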
Hi all,
I want to use Eclipse on Windows to build a Spark environment, but I find
that the reference
page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup)
doesn't contain any guide for Eclipse.
Could anyone give tutorials or links about how to use
Try the Scala Eclipse plugin to "eclipsify" the Spark project and import Spark as an
Eclipse project
-Somnath
-Original Message-
From: Nan Xiao [mailto:xiaonan830...@gmail.com]
Sent: Thursday, May 28, 2015 12:32 PM
To: user@spark.apache.org
Subject: How to use Eclipse on Windows to build Spark
I do it this way:
- Launch a new instance by clicking on the slave instance and choosing "Launch
more like this"
- Once it's launched, SSH into it and add the master's public key to
.ssh/authorized_keys
- Add the slave's internal IP to the master's conf/slaves file
- Run sbin/start-all.sh and it will
hi,
I'm working with a Spark standalone system on EC2, and I'm having problems
resizing the cluster (meaning adding or removing slaves).
The basic EC2 scripts
(http://spark.apache.org/docs/latest/ec2-scripts.html) only provide a script
for launching the cluster, not for adding slaves to it. On the
Hi Zhang,
Could you paste your code in a gist? I'm not sure what you are doing inside the
code that fills up memory.
Thanks
Best Regards
On Thu, May 28, 2015 at 10:08 AM, Ji ZHANG zhangj...@gmail.com wrote:
Hi,
Yes, I'm using createStream, but the storageLevel param is left at its default
It would be great if you guys could test out Spark 1.4.0 RC2 (RC3 coming out
soon) with Scala 2.11 and report issues.
TD
On Tue, May 26, 2015 at 9:15 AM, Koert Kuipers ko...@tresata.com wrote:
We are still running into issues with spark-shell not working on 2.11, but
we are running on somewhat
Hi!
I have a JSON file with some data; I'm able to create a DataFrame out of it, and
the schema for the particular part of it I'm interested in looks like the following:
val json: DataFrame = sqlc.load("entities_with_address2.json", "json")
root
 |-- attributes: struct (nullable = true)
 |    |-- Address2:
Hi all,
I want to use Eclipse on Windows to build a Spark environment, but I find
that the reference
page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup)
doesn't contain any guide for Eclipse.
Could anyone give tutorials or links about how to use
Hi,
I wrote a simple test job; it only does very basic operations, for example:
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic ->
1)).map(_._2)
val logs = lines.flatMap { line =>
  try {
    Some(parse(line).extract[Impression])
  } catch {
    case _:
hi,
Thanks for your answer!
I have a few more questions:
1) The file /root/spark/conf/slaves has the full DNS names of the servers
(ec2-52-26-7-137.us-west-2.compute.amazonaws.com); did you add the
internal IPs there?
2) You call start-all. Isn't that too aggressive? Let's say I have 20
slaves up, and I
1. Up to you; you can add either the internal IP or the external IP. It won't be
a problem as long as they are in the same network.
2. If you only want to start a particular slave, then you can do:
sbin/start-slave.sh <worker#> <master-spark-URL>
Thanks
Best Regards
On Thu, May 28, 2015 at 1:52
hi,
Is there any way in bash (from an EC2 Spark server) to list all the servers
in my security group (or better, in a given security group)?
I tried using:
wget -q -O - http://instance-data/latest/meta-data/security-groups
security_group_xxx
but now I want all the servers in the security group
You can use the Python boto library for that; in fact, the spark-ec2 script uses it
underneath. Here's the call
(https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L706)
spark-ec2 makes to get all machines under a given security group.
Thanks
Best Regards
On Thu, May 28, 2015 at 2:22 PM,
Hi,
tl;dr: At the moment (with a BIG disclaimer *), elastic scaling of Spark
Streaming processes is not supported.
Longer version:
I assume that you are talking about Spark Streaming, as the discussion is
about handling Kafka streaming data.
Then you have two things to consider: the Streaming
Can you replace your counting part with this?
logs.filter(_.s_id > 0).foreachRDD(rdd => logger.info(rdd.count()))
Thanks
Best Regards
On Thu, May 28, 2015 at 1:02 PM, Ji ZHANG zhangj...@gmail.com wrote:
Hi,
I wrote a simple test job; it only does very basic operations, for example:
Thank you, Gerard.
We're looking at the receiver-less setup with Kafka Spark Streaming, so I'm
not sure how to apply your comments to that case (not that we have to use
the receiver-less approach, but it seems to offer some advantages over the
receiver-based one).
As far as the number of Kafka receivers is fixed
@DG: The key metrics should be:
- Scheduling delay: its ideal state is to remain constant over time,
and ideally to be less than the time of the micro-batch window
- Average job processing time: should remain less than the
micro-batch window
- Number of lost jobs
begs for your help
My program runs for 500 iterations, but it usually fails at about 150
iterations.
It's hard to explain the details of my program, but I think my program is
OK, since it sometimes runs successfully. I just want to know in which
situations this exception will happen.
The detailed error information is
Thanks, Evo. Per the last part of your comment, it sounds like we will
need to implement a job manager that will be in control of starting the
jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
marking them as ones to relaunch, and scaling the cluster up/down by
Hi,
I am trying to extract the filenames from which a DStream is generated by
parsing the output of the toDebugString method on its RDDs.
I am implementing the following code in spark-shell:
import org.apache.spark.streaming.{StreamingContext, Seconds}
val ssc = new StreamingContext(sc,Seconds(10))
val lines =
You can always spin up new boxes in the background and bring them into the cluster
fold when fully operational, and time that with the job relaunch and param change.
Kafka offsets are managed automatically for you by the Kafka clients, which keep
them in ZooKeeper; don't worry about that as long as you
You must start the StreamingContext by calling ssc.start()
On Thu, May 28, 2015 at 6:57 PM, Animesh Baranawal
animeshbarana...@gmail.com wrote:
Hi,
I am trying to extract the filenames from which a DStream is generated by
parsing the output of the toDebugString method on its RDDs.
I am implementing the
Hi.
I have 2 DataFrames, with 1 and 12 partitions respectively. When I do an inner
join between these DataFrames, the result contains 200 partitions. Why?
df1.join(df2, df1("id") === df2("id"), "inner") // returns 200 partitions
Thanks!!!
--
Regards.
Miguel Ángel
I also started the streaming context by running ssc.start(), but still, apart
from logs, nothing of g gets printed.
-- Forwarded message --
From: Animesh Baranawal animeshbarana...@gmail.com
Date: Thu, May 28, 2015 at 6:57 PM
Subject: SPARK STREAMING PROBLEM
To: user@spark.apache.org
val sqlContext = new HiveContext(sc)
val schemaRdd = sqlContext.sql("some complex SQL")
It mostly works, but I have been having issues with tables that contain a large
amount of data:
https://issues.apache.org/jira/browse/SPARK-6910
On May
Hey,
Is there a way to do a distinct operation on each partition only?
My program generates quite a few duplicate tuples, and it would be nice to
remove some of these as an optimisation
without having to reshuffle the data (see the sketch below).
I've also noticed that plans generated with a unique transformation have
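One way to get a partition-local distinct without a shuffle is mapPartitions with a per-partition set; a sketch (assuming an RDD[String] here) that only removes duplicates sharing a partition:

val locallyDistinct = rdd.mapPartitions { iter =>
  val seen = scala.collection.mutable.HashSet.empty[String]
  iter.filter(seen.add) // add returns true only the first time a value is seen
}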
The problem lies in the way you are doing the processing.
After the g.foreach(x => { println(x); println() }), are you
calling ssc.start()? Up to that point, all you have done is set up the
computation steps; Spark has not started any real processing. So when
you do g.foreach, what it iterates
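A minimal sketch of the required ordering (the path is a placeholder):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("/some/dir")
lines.foreachRDD(rdd => println(rdd.toDebugString)) // only sets up the computation
ssc.start()            // nothing actually runs until this call
ssc.awaitTermination()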
in the spark assembly:
I0528 13:36:22.958067 26890 exec.cpp:206] Executor registered on slave
20150528-063307-780930314-5050-8152-S5
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
So
That's due to the config setting spark.sql.shuffle.partitions, which defaults
to 200.
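For example (a sketch; 12 is an arbitrary value), it can be overridden per SQLContext before running the join:

sqlContext.setConf("spark.sql.shuffle.partitions", "12")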
From: Masf
Date: Thursday, May 28, 2015 at 10:02 AM
To: user@spark.apache.org
Subject: Dataframe Partitioning
Hi.
I have 2 DataFrames, with 1 and 12 partitions respectively. When I do
Hi guys,
I'm using Spark Streaming with Kafka... On my local machine (started as a Java
application without using spark-submit) it works: it connects to Kafka and does
the job (*). I tried to put it into a Spark Docker container (Hadoop 2.6, Spark
1.3.1; tried spark-submit with local[5] and yarn-client too), but I'm
Evo, good points.
On dynamic resource allocation, I'm surmising this only works within a
particular cluster setup. So it improves the usage of current cluster
resources, but it doesn't make the cluster itself elastic. At least, that's
my understanding.
Memory + disk would be good, and
Hello,
I was wondering if there is any documented comparison of SparkSQL with
MemSQL/VoltDB kind of in-memory SQL databases. MemSQL etc. too allow
queries to be run in a clustered environment. What is the major
differentiation?
Regards,
Ashish
Hello,
I'm evaluating Spark / Spark Streaming.
I use Spark Streaming to receive messages from a Kafka topic.
As soon as I have a JavaReceiverInputDStream, I have to process each message;
for each one I have to search in MongoDB to find whether a document exists.
If I find the document, I have to
Probably you should ALWAYS keep the RDD storage policy at MEMORY_AND_DISK; it
will be your insurance policy against system crashes due to memory leaks. While
there is free RAM, Spark Streaming (Spark) will NOT resort to disk, and of
course resorting to disk from time to time (i.e. when there is no
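For a receiver-based Kafka stream, a sketch of overriding the storage level (the ZK quorum, group, and topic below are placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",
  Map("topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)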
Hi,
Unfortunately, they're still growing, in both the driver and the executors.
I ran the same job in local mode and everything is fine.
On Thu, May 28, 2015 at 5:26 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you replace your counting part with this?
logs.filter(_.s_id > 0).foreachRDD(rdd =>
On Wed, May 27, 2015 at 02:06:16PM -0700, Ted Yu wrote:
Looks like the exception was caused by resolved.get(prefix ++ a) returning None:
a = StructField(a.head, resolved.get(prefix ++ a).get, nullable = true)
There are three occurrences of resolved.get() in createSchema(); None should
Hi sparkers,
I am working on a PySpark application which uses the OpenCV library. It
runs fine when running the code locally, but when I try to run it on Spark
on the same machine, it crashes the worker.
The code can be found here:
https://gist.github.com/samos123/885f9fe87c8fa5abf78f
This is the
I am using Spark-CSV to load 50 GB of around 10,000 CSV files into a couple
of unified DataFrames. Since this process is slow, I have written this snippet:
targetList.foreach { target =>
  // this is using sqlContext.load by getting the list of files, then
  // loading them according to schema files