I meant a restart by the user, as Ayan said.
I was thinking of a case where e.g. a Spark conf setting is wrong and the job
failed in Stage 1, in my example, and we want to rerun the job with the
right conf without rerunning Stage 0. Having this re-start capability may
cause some chaos if it would
Thanks Petar,
I will purchase. Thanks for the input.
Thanks
Siva
On Tue, Jul 28, 2015 at 4:39 PM, Carol McDonald cmcdon...@maprtech.com
wrote:
I agree, I found this book very useful for getting started with Spark and
Eclipse
On Tue, Jul 28, 2015 at 11:10 AM, Petar Zecevic
Hi all,
I have a streaming job which reads messages from Kafka via directStream,
transforms them and writes them out to Cassandra.
I also use a large broadcast variable (~300MB).
I tried to implement a stopping mechanism for this job, something like this:
When I test it in local mode, the job
Hi All,
I have a fairly complex HiveQL data processing job which I am trying to convert
to Spark SQL to improve performance. Below is what it does.
Select around 100 columns including Aggregates
From a FACT_TABLE
Joined to the summary of the same FACT_TABLE
Joined to 2 smaller DIMENSION tables.
The
This question was answered with sample code a couple of days ago, please
look back.
On Sat, Jul 25, 2015 at 11:43 PM, Zoran Jeremic zoran.jere...@gmail.com
wrote:
Hi,
I discovered what the problem is here. The Twitter public stream is limited to
1% of overall tweets (https://goo.gl/kDwnyS), so
Hi, I do an sc.parallelize with a list of 512k items. But sometimes not all
executors are used, i.e. they don't have work to do and nothing is logged
after:
15/07/29 16:35:22 WARN internal.ThreadLocalRandom: Failed to generate a seed
from SecureRandom within 3 seconds. Not enough entrophy?
You probably should increase the file handle limit for the user that all the
processes (Spark master and workers) are running as,
e.g.
http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
On 29 July 2015 at 18:39, saif.a.ell...@wellsfargo.com wrote:
Hello,
I’ve seen a couple
Thank you both, I will take a look, but
1. For high-shuffle tasks, is it right for the system to have the size
and thresholds this high? I hope there are no bad consequences.
2. I will try to get around the need for admin access and see if I can get anything with
only user rights
From: Ted Yu
Hi everyone. I'm running into an issue with SparkContexts when running on
Yarn. The issue is observable when I reproduce these steps in the
spark-shell (version 1.4.1):
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee
Note the pointer address of sc.
(Then
If you don't mind using SBT with your Scala instead of Maven, you can see
the example I created here: https://github.com/deanwampler/spark-workshop
It can be loaded into Eclipse or IntelliJ
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
Can you give an example with my extract?
Mélanie Gallois
2015-07-29 16:55 GMT+02:00 Young, Matthew T matthew.t.yo...@intel.com:
The built-in Spark JSON functionality cannot read normal JSON arrays. The
format it expects is a bunch of individual JSON objects without any outer
array syntax,
{"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2}]}
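For reference, reading that one-object-per-line layout with Spark SQL would look
roughly like the following (untested sketch against Spark 1.4; the path is a
placeholder, and the select() columns are just taken from the example above):
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// each line of the input file holds one complete JSON object, no surrounding [ ]
val df = sqlContext.read.json("hdfs:///path/to/events.json")
df.printSchema()
df.select("IFAM", "KTM").show()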
Please increase the limit for open files:
http://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux
On Jul 29, 2015, at 8:39 AM, saif.a.ell...@wellsfargo.com
saif.a.ell...@wellsfargo.com wrote:
Hello,
I’ve seen a couple emails on this issue but
Hi
I am new to Spark Streaming and am writing code for a Twitter connector. When I
run this code more than once, it gives the following exception. I have
to create a new HDFS directory for checkpointing each time to make it run
successfully, and moreover it doesn't get stopped.
ERROR
bq. it seems like we never get to the clearActiveContext() call by the end
Looking at the stop() method, there is only one early return
after the stopped.compareAndSet() call.
Is there any clue from driver log ?
Cheers
On Wed, Jul 29, 2015 at 9:38 AM, Andres Perez and...@tresata.com wrote:
Hi
I made a naive implementation like this:
SparkConf conf = new SparkConf();
conf.setMaster("local[4]").setAppName("Stub");
final JavaSparkContext ctx = new JavaSparkContext(conf);
JavaRDD<String> input = ctx.textFile(path_to_file);
// explode each line into a list of column values
JavaRDD<List<String>> rowValues =
Can you send me the subject of that email? I can't find any email
suggesting a solution to that problem. There is an email *Twitter4j streaming
question*, but it doesn't have any sample code. It just confirms what I
explained earlier, that without filtering Twitter will limit you to 1% of
tweets, and if you
You might look at the edX courses on Apache Spark or ML with Spark. There
are probably some homework problems or quiz questions that might be
relevant. I haven't looked at the courses myself, but that's where I would go
first.
If you start parallel Twitter streams, you will be in breach of their TOS.
They allow a small number of parallel streams in practice, but if you do it
on massive scale they'll ban you (I'm speaking from experience ;) ).
If you really need that level of data, you need to talk to a company called
Hi
I am using Spark Streaming 1.3 and using checkpointing,
but the job is failing to recover from the checkpoint on restart.
For a broadcast variable it says:
1.WARN TaskSetManager: Lost task 15.0 in stage 7.0 (TID 1269, hostIP):
java.lang.ClassCastException: [B cannot be cast to
This isn't totally correct. Spark SQL does support JSON arrays and will
implicitly flatten them. However, complete objects or arrays must exist
one per line and cannot be split with newlines.
On Wed, Jul 29, 2015 at 7:55 AM, Young, Matthew T matthew.t.yo...@intel.com
wrote:
The built-in
Rather than using the accumulator directly, what you can do is something like
this to lazily create an accumulator and use it (it will get lazily recreated
if the driver restarts from checkpoint):
dstream.transform { rdd =>
val accum = SingletonObject.getOrCreateAccumulator() // singleton object
method to
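A fuller, self-contained sketch of that pattern (untested; the object name,
accumulator name and filter condition are illustrative, not from the original mail):
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.streaming.dstream.DStream

object DroppedRecordsCounter {
  @volatile private var instance: Accumulator[Long] = null
  def getOrCreateAccumulator(sc: SparkContext): Accumulator[Long] = {
    if (instance == null) synchronized {
      if (instance == null) instance = sc.accumulator(0L, "droppedRecords")
    }
    instance
  }
}

// assuming dstream: DStream[String]
val validated = dstream.transform { rdd =>
  // lazily created on first use, and re-created if the driver restarts from checkpoint
  val accum = DroppedRecordsCounter.getOrCreateAccumulator(rdd.sparkContext)
  rdd.filter { r =>
    val ok = r != null && r.nonEmpty   // placeholder validity check
    if (!ok) accum += 1L
    ok
  }
}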
What Hive version are you using? Do you run it on Tez? Are you using the
ORC format? Do you use compression? Snappy? Do you use Bloom filters? Do
you insert the data sorted on the right columns? Do you use partitioning?
Did you increase the replication factor for often-used tables or
Hi, all,
I am just setting up to run Spark in standalone mode, as a (Univa) Grid
Engine job. I have been able to set up the appropriate environment
variables such that the master launches correctly, etc. In my setup, I
generate GE job-specific conf and log dirs.
However, I am finding that the
Hi
I wanted to know what the difference is between
RandomForestModel and RandomForestClassificationModel
in MLlib. Will they yield the same results for a given dataset?
Hi Tim,
thanks for clarifying this!
Regarding the dispatcher and cluster mode, it seems that there’s still no way to
get to the actual Spark executor UI, only driver configuration and execution
stats.
--
Anton Kirillov
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Wednesday,
Maybe you forgot to close a reader or writer object.
On 29 July 2015 18:04:59 CEST, saif.a.ell...@wellsfargo.com wrote:
Thank you both, I will take a look, but
1. For high-shuffle tasks, is it right for the system to have
the size and thresholds this high? I hope there are no bad
Hi, I have Spark Streaming code which streams from a Kafka topic. It used to work
fine, but suddenly it started throwing the following exception:
Exception in thread "main" org.apache.spark.SparkException:
org.apache.spark.SparkException: Couldn't find leader offsets for Set()
at
Hi Anton,
For client mode we haven't populated the web UI link; we only did so for cluster
mode.
If you like you can open a JIRA, and it should be an easy ticket for anyone
to work on.
Tim
On Wed, Jul 29, 2015 at 4:27 AM, Anton Kirillov antonv.kiril...@gmail.com
wrote:
Hi everyone,
I’m trying to
You can put the database files in a central location accessible to all the
workers and build the GeoIP object once per-partition when you go to do a
mapPartitions across your dataset, loading from the central location.
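A rough sketch of what that could look like (untested; GeoIpReader, the /shared
path, and the records RDD are placeholders, not the actual MaxMind API):
val enriched = records.mapPartitions { iter =>
  // build the lookup object once per partition instead of once per record
  val reader = new GeoIpReader("/shared/GeoLite2-City.mmdb")
  iter.map { rec =>
    val location = reader.lookup(rec.ip)   // hypothetical lookup call
    (rec, location)
  }
}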
___
From: Filli Alem [alem.fi...@ti8m.ch]
Sent: Wednesday, July 29, 2015
Hi,
I would like to use ip2Location databases during my spark jobs (MaxMind).
So far I haven't found a way to properly serialize the database offered by the
Java API of the database.
The CSV version isn't easy to handle as it consists of multiple files.
Any recommendations on how to do this?
The subject was 'How to restart Twitter spark stream'.
It may not be exactly what you are looking for, but I thought it did touch
on some aspect of your question.
On Wed, Jul 29, 2015 at 10:26 AM, Zoran Jeremic zoran.jere...@gmail.com
wrote:
Can you send me the subject of that email? I can't find any email
Actually, I posted that question :)
I already implemented the solution that Akhil suggested there, and that
solution uses the sample tweets API, which returns only 1% of the tweets.
It would not work in my scenario of use. For the hashtags I'm interested
in, I need to catch every single tweet, not
Hi Ted. Taking a look at the logs, I get the feeling like there may be an
uncaught exception blowing up the SparkContext.stop method, causing it to
not reach the line where it gets set as inactive. The line referenced below
in SparkContext (SparkContext.scala:1644) is the call:
You can generate dependency tree using:
mvn dependency:tree
and grep for 'org.scala-lang' in the output to see if there is any clue.
Cheers
On Wed, Jul 29, 2015 at 5:14 PM, Benjamin Ross br...@lattice-engines.com
wrote:
Hello all,
I’m new to both spark and scala, and am running into an
If you run it on YARN with Kerberos set up, you authenticate yourself by kinit
before launching the job.
Thanks.
Zhan Zhang
On Jul 28, 2015, at 8:51 PM, Anh Hong
hongnhat...@yahoo.com.INVALID wrote:
Hi,
I'd like to remotely run spark-submit from a local
If you are changing the SparkConf, that means you have to recreate the
SparkContext, doesn't it? So you have to stop the previous SparkContext, which
deletes all the information about stages that have been run. So the better
approach is to indeed save the data of the last stage explicitly and then
try
There is a known issue that Kafka cannot return a leader if there is no data
in the topic. I think it was raised in another thread in this forum. Is
that the issue?
On Wed, Jul 29, 2015 at 10:38 AM, unk1102 umesh.ka...@gmail.com wrote:
Hi I have Spark Streaming code which streams from Kafka
Hello all,
I'm new to both Spark and Scala, and am running into an annoying error
attempting to prototype some Spark functionality. From forums I've read
online, this error should only present itself if there's a version mismatch
between the version of Scala used to compile Spark and the Scala
Hey Ted,
Thanks for the quick response. Sadly, all of those are 2.10.x:
─$ mvn dependency:tree | grep -A 2 -B 2 org.scala-lang
[INFO] | | \- org.tukaani:xz:jar:1.0:compile
[INFO] | \-
Anyone know how to set log level in spark-submit ? Thanks
This may be what you want
val conf = new SparkConf().setMaster("local").setAppName("test")
val sc = new SparkContext(conf)
val inputRdd = sc.parallelize(Array(("key_1", "a"), ("key_1", "b"),
("key_2", "c"), ("key_2", "d")))
val result = inputRdd.groupByKey().flatMap(e => {
val key = e._1
val valuesWithIndex =
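Continuing that sketch, the body could be completed along these lines (untested):
val result = inputRdd.groupByKey().flatMap { e =>
  val key = e._1
  // zipWithIndex numbers the values within each group, starting at 0
  // (note: the order of values within a group is not guaranteed after groupByKey)
  e._2.zipWithIndex.map { case (value, idx) => (key, idx + 1, value) }
}
// result: RDD[(String, Int, String)], e.g. (key_1,1,a), (key_1,2,b), (key_2,1,c), ...
result.collect().foreach(println)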
1. How to do it in Java?
2. Can broadcast objects also be created in the same way after checkpointing?
3. Is it safe if I disable checkpointing and write offsets at the end of each batch
to HDFS in my code, and somehow specify in my job to use these offsets for
creating the Kafka stream the first time? How can I
Hi, thanks for the response. Like I already mentioned in the question, the Kafka
topic is valid and it has data; I can see data in it using another Kafka
consumer.
On Jul 30, 2015 7:31 AM, Cody Koeninger c...@koeninger.org wrote:
The last time someone brought this up on the mailing list, the issue
Put a log4j.properties file in conf/. You can copy
log4j.properties.template as a good base
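For example, the line controlling the level in that file is something like this
(set it to whatever level you want):
# conf/log4j.properties, copied from conf/log4j.properties.template
log4j.rootCategory=WARN, console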
On Wednesday, July 29, 2015, canan chen ccn...@gmail.com wrote:
Anyone know how to set log level in spark-submit ? Thanks
Hi Zhan, I'm running a standalone Spark cluster and execute spark-submit from a
local host outside the cluster. Besides Kerberos, do you know of any other existing
method? Is there any JIRA opened on this enhancement request?
Regards, Anh.
On Wednesday, July 29, 2015 4:15 PM, Zhan Zhang
If we want to stop the application after a fixed time period, how will it
work? (How to give the duration in the logic? In my case sleep(t.s.) is not
working.) So I used to kill the coarse-grained job at each slave by script.
Please suggest something.
On Thu, Jul 30, 2015 at 5:14 AM, Tathagata Das
The last time someone brought this up on the mailing list, the issue
actually was that the topic(s) didn't exist in Kafka at the time the spark
job was running.
On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das t...@databricks.com wrote:
There is a known issue that Kafka cannot return leader
I have some data like this: RDD[(String, String)] = (("key-1", "a"),
("key-1", "b"), ("key-2", "a"), ("key-2", "c"), ("key-3", "b"), ("key-4", "d")) and I want to
group the data by key, and for each group, add an index field to the
group members, so that in the end I can transform the data to the below: RDD[(String, Int,
String)] =
Yes, that should work. What I mean is: is there any option in the spark-submit
command that I can specify for the log level?
On Thu, Jul 30, 2015 at 10:05 AM, Jonathan Coveney jcove...@gmail.com
wrote:
Put a log4j.properties file in conf/. You can copy
log4j.properties.template as a good base
El
Is there a relationship between the data and the index? I.e. with a,b,c to 1,2,3?
On 30 Jul 2015 12:13, askformore askf0rm...@163.com wrote:
I have some data like this: RDD[(String, String)] = (("key-1", "a"), (
"key-1", "b"), ("key-2", "a"), ("key-2", "c"), ("key-3", "b"), ("key-4", "d")) and I want
to group the data by
Did you get a thread dump? We have experienced similar problems during
shuffle operations due to a deadlock in InetAddress. Specifically, look for
a runnable thread at something like
java.net.Inet6AddressImpl.lookupAllHostAddr(Native
Method).
Our solution has been to put a timeout around the code
A side question: Any reason why you're using window(Seconds(10), Seconds(10))
instead of new StreamingContext(conf, Seconds(10)) ?
Making the micro-batch interval 10 seconds instead of 1 will provide you
the same 10-second window with less complexity. Of course, this might just
be a test for the
What is the size of each RDD? What is the size of your cluster, and which Spark
configurations have you tried out?
On Tue, Jul 28, 2015 at 9:54 PM, ponkin alexey.pon...@ya.ru wrote:
Hi, Alice
Did you find solution?
I have exactly the same problem.
Take a look at Zoomdata. They are Spark-based and they offer BI features
with good performance.
Paolo
Sent from my Windows Phone
From: Ruslan Dautkhanov dautkha...@gmail.com
Sent: 29/07/2015 06:18
To: renga.kannan renga.kan...@gmail.com
Thanks for the suggestion, Akashsihag.
I've tried this solution and unfortunately it is also giving the same error.
Hi,
I am new to Spark Streaming and am writing code for a Twitter connector.
I am facing the following exception.
ERROR StreamingContext: Error starting the context, marking it as stopped
org.apache.spark.SparkException:
org.apache.spark.streaming.dstream.WindowedDStream@532d0784 has not been
Dear All,
I'm using -
= Spark 1.2.0
= Hive 0.13.1
= Mesos 0.18.1
= Spring
= JDK 1.7
I've written a Scala program which
= instantiates a Spark and Hive context
= parses an XML file which provides the where clauses for queries
= generates full fledged hive queries to be run on hive
Hi, Sarath
Did you try to use and increase spark.executor.extraJavaOptions with -XX:PermSize=
and -XX:MaxPermSize= ?
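For instance, something along these lines when building the conf (the values are
just examples, not recommendations):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:PermSize=128m -XX:MaxPermSize=512m")
  .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512m")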
fightf...@163.com
From: Sarath Chandra
Date: 2015-07-29 17:39
To: user@spark.apache.org
Subject: PermGen Space Error
Dear All,
I'm using -
= Spark 1.2.0
= Hive 0.13.1
= Mesos
Hi Rahul,
Where did you see such a recommendation?
I personally define the number of partitions with the following formula:
partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )
where
nextPrimeNumberAbove(x) - a prime number greater than x
K - a multiplier to calculate start
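In code that formula could look something like this (illustrative only; the
executor and core counts are whatever you pass to spark-submit, and K is your
own multiplier):
def nextPrimeNumberAbove(x: Int): Int = {
  def isPrime(n: Int) = n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)
  Iterator.from(x + 1).find(isPrime).get
}
val k = 4                 // example multiplier
val numExecutors = 10     // example --num-executors
val executorCores = 4     // example --executor-cores
val partitions = nextPrimeNumberAbove(k * numExecutors * executorCores)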
Hello,
Please help me with links or documents for Apache Spark interview questions
and answers, and also for the tools related to it, for which questions could be
asked.
Thank you all.
Sincerely,
Abhishek
Hi, I have tried to use a lambda expression in a Spark task, and it throws a
java.lang.IllegalArgumentException: Invalid lambda deserialization
exception. The exception is thrown when I use code like
transform(pRDD -> pRDD.map(t -> t._2)). The code snippet is below.
JavaPairDStream<String, Integer>
Yes.
As mentioned at the end of my mail, I tried with both the 256 and 512 options,
but the issue persists.
I'm giving following parameters to spark configuration -
spark.core.connection.ack.wait.timeout=600
spark.akka.timeout=1000
spark.akka.framesize=50
spark.executor.memory=2g
spark.task.cpus=2
IMHO, you need to take into account the size of your data too.
If your cluster is relatively small, you may cause memory pressure on your
executors if you try to repartition to some number of partitions tied to the
number of cores.
It is better to take some max between the initial number of partitions (assuming your
data is
Hi everyone,
I’m trying to get access to the Spark web UI from the Mesos Master, but with no success:
the host name is displayed properly, but the link is not active, just text. Maybe
it’s a well-known issue or I misconfigured something, but this problem is
really annoying.
Jobs are executed in client
Yes, I think this was asked because you didn't say what flags you set
before, and it's worth verifying they're the correct ones.
Although I'd be kind of surprised if 512m isn't enough, did you try more?
You could also try -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled
Also verify
Hi Sean,
Thanks for your response.
A few lame questions:
- Where to check if the driver and executor have started with the
provided options?
- As said I'm using mesos cluster. So it should appear in the task logs
under frameworks, correct?
- What if provided options not appearing in those
Hello,
I'm using 4 GB for the driver memory. The checkpoint is between 1 GB and 10
GB, depending on whether I'm reprocessing all the data from the beginning or just
getting the latest offset from the real-time processing. Is there any best
practice to be aware of with driver memory relative to checkpoint size?
Stefan Panayotov
Sent from my Windows Phone
From: Stefan Panayotov spanayo...@msn.com
Sent: 7/29/2015 8:20 AM
To: user-subscr...@spark.apache.org
Subject: Executing spark code in Zeppelin
Hi,
I faced a problem with
How to initiate graceful shutdown from outside of the Spark Streaming
driver process? Both for the local and cluster mode of Spark Standalone as
well as EMR.
Does sbin/stop-all.sh stop the context gracefully? How is it done? Is there
a signal sent to the driver process?
For EMR, is there a way
Hi All,
I'm running Spark 1.4.1 on an 8-core machine with 16 GB RAM. I have a 500MB
CSV file with 10 columns and I need to separate it into multiple
CSV/Parquet files based on one of the fields in the CSV file. I've loaded
the CSV file using spark-csv and applied the below transformations. It
I'm trying to read a JSON file which is like:
[
{"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
Hi All,
I have a 5GB CSV dataset with 69 columns. I need to find the count of
distinct values in each column. What is the optimized way to do this
using Spark and Scala?
Example CSV format :
a,b,c,d
a,c,b,a
b,b,c,d
b,b,c,a
c,b,b,a
Output expecting :
(a,2),(b,2),(c,1) #- First column
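One straightforward way to get that (untested sketch with plain RDDs; the path and
column count are placeholders):
val rows = sc.textFile("data.csv").map(_.split(","))
val numColumns = 69
val perColumnCounts = (0 until numColumns).map { i =>
  // count occurrences of each distinct value in column i
  // (one Spark job per column; this could also be combined into a single pass)
  rows.map(r => (r(i), 1L)).reduceByKey(_ + _).collect()
}
// perColumnCounts(0) would give e.g. Array((a,2), (b,2), (c,1)) for the first column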
StreamingContext.stop(stopGracefully = true) stops the streaming context
gracefully.
Then you can safely terminate the Spark cluster. They are two different
steps and they need to be done separately, ensuring that the driver process has
been completely terminated before the Spark cluster is the
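In the driver that would be roughly (sketch):
// let in-flight batches finish, then stop the underlying SparkContext too
ssc.stop(stopSparkContext = true, stopGracefully = true)
// only after this returns, tear down the cluster (stop-all.sh, terminate EMR, etc.)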
1. Same way, using static fields in a class.
2. Yes, same way.
3. Yes, you can do that. To differentiate first time v/s continue,
you have to build your own semantics. For example, if the location in HDFS
where you are supposed to store the offsets does not have any data, that means it's
probably
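For example, the first-run check could be as simple as this sketch (the offsets
path is a made-up placeholder):
import org.apache.hadoop.fs.{FileSystem, Path}
val offsetsPath = new Path("/user/myapp/kafka-offsets")
val fs = FileSystem.get(sc.hadoopConfiguration)
val firstRun = !fs.exists(offsetsPath)
// if firstRun, start from auto.offset.reset; otherwise read the stored offsets
// from offsetsPath and pass them as fromOffsets when creating the direct stream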
Hi Abhishek,
Please learn Spark; there are no shortcuts to success.
Regards,
Vaquar Khan
On 29 Jul 2015 11:32, Mishra, Abhishek abhishek.mis...@xerox.com wrote:
Hello,
Please help me with links or some document for Apache Spark interview
questions and answers. Also for the tools related to
The built-in Spark JSON functionality cannot read normal JSON arrays. The
format it expects is a bunch of individual JSON objects without any outer array
syntax, with one complete JSON object per line of the input file.
AFAIK your options are to read the JSON in the driver and parallelize it
Hello Vaquar,
I have working knowledge and experience with Spark. I just wanted to test or do a
mock round to evaluate myself. Thank you for the reply.
Please share something if you have anything on the same.
Sincerely,
Abhishek
From: vaquar khan [mailto:vaquar.k...@gmail.com]
Sent: Wednesday, July 29,
Is there an available release?
Is anyone using it in production?
Is the project being actively developed and maintained?
Thanks!
@Ashwin you don't need to append the topic to your data if you're using the
direct stream. You can get the topic from the offset range, see
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
(search for offsetRange)
If you're using the receiver based stream, you'll need to
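Getting the topic out of the offset ranges with the direct stream looks roughly
like this (sketch based on the docs linked above; directStream is a placeholder
for your own stream):
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
directStream.foreachRDD { rdd =>
  // each RDD produced by the direct stream carries its Kafka offset ranges
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { osr =>
    println(s"topic=${osr.topic} partition=${osr.partition} " +
      s"from=${osr.fromOffset} until=${osr.untilOffset}")
  }
}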
Hi Silvio,
There are no errors reported. Just the Zeppelin paragraph refuses to do
anything.
I tested this by just adding comment characters to the code snippet.
Also, I discovered the limit is 3400. Once I go over that it stops reacting.
BTW, thanks for the advice - I sent email to the