<vincent.gromakow...@gmail.com> wrote:
> Checkpointing is only used for failure recovery, not for app upgrades. You
> need to manually code the unload/load and save it to a persistent store.
>
On Tue, Dec 18, 2018 at 17:29, Priya Matpadi wrote:
>
>> Using checkpointing
Using checkpointing for graceful updates is my understanding as well, based
on the writeup in
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing,
and some prototyping. Have you faced any missed events?
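For what it's worth, a minimal sketch of the pattern that guide section describes, assuming a Kafka source (spark-sql-kafka on the classpath) and made-up broker/topic/paths; restarting the same query with the same checkpointLocation resumes from the recorded offsets:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CheckpointedQuery").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .load()

val query = events.writeStream
  .format("parquet")
  .option("path", "hdfs:///output/events")                    // hypothetical sink path
  .option("checkpointLocation", "hdfs:///checkpoints/events") // offsets and state recovered from here
  .start()

query.awaitTermination()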
On Mon, Dec 17, 2018
If you are deploying your Spark application on a YARN cluster:
1. ssh into the master node.
2. List the currently running applications and retrieve the application_id:
yarn application --list
3. Kill the application using the application_id (of the form
application_x_) from the output of the list command:
yarn application -kill <application_id>
Hello All,
I am trying to extend SparkListener so that, after a job ends, I can retrieve
the job name, check whether it succeeded or failed, and write the status to a
log file.
I couldn't find a way to fetch the job name in the onJobEnd method.
Thanks,
Padma CH
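Not sure if this fits your case, but a rough sketch of one workaround: SparkListenerJobEnd only carries the job id, time and result, so capture a name in onJobStart (the "spark.job.description" property set via sc.setJobDescription, with a fallback) and look it up in onJobEnd. The println is a placeholder for your log-file writer:

import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.mutable

class JobStatusListener extends SparkListener {
  private val jobNames = mutable.Map[Int, String]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val name = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.job.description")))
      .getOrElse(s"job-${jobStart.jobId}")
    jobNames(jobStart.jobId) = name
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val name = jobNames.remove(jobEnd.jobId).getOrElse(s"job-${jobEnd.jobId}")
    val status = if (jobEnd.jobResult == JobSucceeded) "SUCCESS" else "FAILURE"
    println(s"$name finished with status $status") // replace with your log file writer
  }
}

It can be registered with sc.addSparkListener(new JobStatusListener()) or through the spark.extraListeners configuration.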
Hello Team,
I am trying to write a Dataset as a Parquet file in Append mode, partitioned
by a few columns. However, since the job is time consuming, I would like to
enable DirectFileOutputCommitter (i.e. bypass the writes to a temporary
folder).
The Spark version I am using is 2.3.1.
Can someone
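While waiting for a better answer: the direct committers were removed in the 2.x line, so on 2.3.1 a commonly suggested alternative (not equivalent, just what I have seen used) is the v2 file output commit algorithm, which renames task output into the final directory at task commit instead of at job commit. Partition columns and the path below are made up, and note that task retries or speculation can leave partial files with v2:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AppendParquet")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// ds is the Dataset being written -- assumed name
ds.write
  .mode("append")
  .partitionBy("year", "month")          // hypothetical partition columns
  .parquet("hdfs:///warehouse/events")   // hypothetical output path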
-- Forwarded message --
From: Priya PM <pmpr...@gmail.com>
Date: Fri, May 26, 2017 at 8:54 PM
Subject: Re: Spark checkpoint - nonstreaming
To: Jörn Franke <jornfra...@gmail.com>
Oh, how do I do it? I don't see it mentioned anywhere in the documentation.
I have follow
root 133 May 26 16:26 rdd-28
[root@priya-vm 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# cd rdd-28/
[root@priya-vm rdd-28]# ls
part-0 part-1 _partitioner
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpoint-nonstreaming-tp28712.html
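For reference, the non-streaming flow that produces a directory like rdd-28 above is roughly this (the checkpoint path and the example RDD are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo"))
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // the rdd-NN directories are written under here

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.checkpoint()   // only marks the RDD; nothing is written yet
rdd.count()        // the first action materializes the RDD and saves it to the checkpoint dir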
Hi All,
I have video surveillance data and it needs to be processed in Spark. I am
looking at Spark + OpenCV. How do I load .mp4 video into an RDD? Can we do
this directly, or does the video need to be converted to a SequenceFile?
Thanks,
Padma CH
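One approach I have seen sketched (not something I have run against real surveillance footage): read the raw .mp4 bytes with sc.binaryFiles and do the OpenCV decoding per file inside a map. The path is made up and the OpenCV decode step is left as a comment, since it depends on how you wrap VideoCapture:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.input.PortableDataStream

val sc = new SparkContext(new SparkConf().setAppName("VideoIngest"))

// each element is (filePath, stream over the whole .mp4 file)
val videos = sc.binaryFiles("hdfs:///surveillance/*.mp4")   // hypothetical path

val sizes = videos.map { case (path: String, stream: PortableDataStream) =>
  val bytes = stream.toArray()   // whole file in memory; fine for short clips only
  // OpenCV frame extraction (e.g. writing bytes to a temp file and using VideoCapture) would go here
  (path, bytes.length)
}
sizes.collect().foreach(println)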
I mean that model training on incoming data is taken care of by Spark. For
detected anomalies, I need to send an alert. Could we do this with Spark
itself, or would another framework like Akka or a REST API be needed?
Thanks,
Padma CH
On Tue, Jul 12, 2016 at 7:30 PM, Marcin Tustin <mtus...@handybook.com>
wrote:
Hi All,
I am building a real-time anomaly detection system in which I am using
k-means to detect anomalies. Now, in order to send an alert to a mobile
device or by email, how do I send it using Spark itself?
Thanks,
Padma CH
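Spark itself has no alerting API, so the usual shape is to detect inside the streaming job and call out to whatever channel you have (SMTP, an SMS gateway, or an HTTP POST to a REST endpoint). A bare sketch, where predictions, isAnomaly and sendAlert are all placeholders for your own pieces:

// predictions: DStream[(Vector, Int)] of (point, assigned cluster) -- assumed shape
predictions.foreachRDD { rdd =>
  rdd.filter { case (point, clusterId) => isAnomaly(point, clusterId) }   // your own rule
     .foreachPartition { anomalies =>
       // open one mail session / HTTP client per partition, not per record
       anomalies.foreach { case (point, clusterId) => sendAlert(point, clusterId) }
     }
}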
Has anyone resolved this?
Thanks,
Padma CH
On Wed, Jun 22, 2016 at 4:39 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:
> Hi All,
>
> I am running Spark Application with 1.8TB of data (which is stored in Hive
> tables format). I am reading the data using HiveCont
Hi All,
I am running a Spark application with 1.8 TB of data (which is stored in Hive
tables). I am reading the data using HiveContext and processing it.
The cluster has 5 nodes in total, 25 cores per machine and 250 GB per node. I
am launching the application with 25 executors with 5 cores each
Hi
Can someone throw light on this? The issue does not happen frequently;
sometimes the job halts with the above messages.
Regards,
Padma Ch
On Fri, May 27, 2016 at 8:47 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Priya:
> Have you checked the executor logs on hostname1 and hostname2
Hello Team,
I am trying to join 2 RDDs, where one is of size 800 MB and the
other is 190 MB. During the join step, my job halts and I don't see any
progress in the execution.
This is the message I see on console -
INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
r dataset to all nodes and do the matching with the
> bigger dataset there.
> This highly depends on the data in your data set. How they compare in size
> etc.
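To make the "ship the smaller dataset to all nodes" idea concrete, a sketch using a broadcast variable instead of a shuffle join (key and value types are assumptions about your data):

// small: RDD[(String, SmallValue)] (~190 MB), big: RDD[(String, BigValue)] -- assumed shapes
val smallMap = sc.broadcast(small.collectAsMap())

val joined = big.mapPartitions { iter =>
  val lookup = smallMap.value
  iter.flatMap { case (key, bigValue) =>
    lookup.get(key).map(smallValue => (key, (bigValue, smallValue)))
  }
}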
>
>
>
> On 25 May 2016, at 13:27, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
> Why do i need to deplo
Ch
On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Alternatively depending on the exact use case you may employ solr on
> Hadoop for text analytics
>
> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com> wrote:
> >
Let's say I have an RDD A of strings {"hi","bye","ch"} and another RDD B of
strings {"padma","hihi","chch","priya"}. For every string in RDD A I need
to check the matches found in RDD B; for example, for the string "hi" I ha
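Since both RDDs here are tiny, one simple sketch is to collect and broadcast B and scan it for each string in A. I am assuming "matches" means substring containment; adjust the predicate if it means something else:

val a = sc.parallelize(Seq("hi", "bye", "ch"))
val b = sc.parallelize(Seq("padma", "hihi", "chch", "priya"))

val bLocal = sc.broadcast(b.collect())
// "hi" -> Array("hihi"), "ch" -> Array("chch"), "bye" -> Array()
val matches = a.map(s => (s, bLocal.value.filter(_.contains(s))))
matches.collect().foreach(println)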
ame#save writing as
> parquet, orc, ...?
>
> // maropu
>
> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Hi, yes, I have joined using a DataFrame join. Now, to save this into HDFS,
>> I am converting the joined datafram
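In case the truncated part is about the save call itself: converting the joined DataFrame to an RDD first should not be needed; the 1.6-era writer API goes straight to HDFS (the path below is made up, and the ORC variant assumes Hive/ORC support is on the classpath):

// joinedDF is the result of the DataFrame join -- assumed name
joinedDF.write
  .mode("append")                         // or "overwrite"
  .parquet("hdfs:///user/priya/joined")   // hypothetical HDFS path

// ORC variant:
// joinedDF.write.mode("append").orc("hdfs:///user/priya/joined_orc")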
ollowed by all columns from the
>> FinancialData table.
>>
>>
>> Not very useful
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <http
Hi All,
I have two RDDs A and B, where A is of size 30 MB and B is of size 7
MB. A.cartesian(B) is taking too much time. Is there any bottleneck in the
cartesian operation?
I am using Spark version 1.6.0.
Regards,
Padma Ch
Hi All,
I am working with Kafka and Spark Streaming, and I want to write the
streaming output to a single file. dstream.saveAsTextFiles() is creating
files in different folders. Is there a way to write to a single folder? Or,
if the output is written to different folders, how do I merge them?
Thanks,
Padma
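As far as I know there is no built-in single-file sink for DStreams; a common workaround is to write one part file per batch from foreachRDD (sketched below with a made-up output path) and, if a truly single file is needed, merge the batch folders afterwards, e.g. with Hadoop's FileUtil.copyMerge or hadoop fs -getmerge:

dstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // still one folder per batch, but only a single part file inside each
    rdd.coalesce(1).saveAsTextFile(s"hdfs:///out/batch-${time.milliseconds}")
  }
}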
Hi All,
I am using streaming k-means to train my model on streaming data. Now I
want to visualize the clusters. What reporting tool would be used for
this? Would Zeppelin be used to visualize the clusters
Regards,
Padma Ch
Hi Xi Shen,
Changing the initialization step from "kmeans||" to "random" decreased
the execution time from 2 hrs to 6 min. However, by default the number of
runs is 1. If I set the number of runs to 10, I again see an increase in
job execution time.
How should I proceed on this?
By the way how
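For reference, a sketch of how those knobs map onto the MLlib builder (k and the iteration count are placeholders). setRuns(10) repeats the whole clustering 10 times and keeps the best model, so a roughly proportional increase in runtime is expected rather than a sign that something is wrong:

import org.apache.spark.mllib.clustering.KMeans

val kmeans = new KMeans()
  .setK(40)                          // placeholder
  .setMaxIterations(20)              // placeholder
  .setInitializationMode("random")   // instead of the default "k-means||"
  .setRuns(10)                       // 10 independent runs, best result kept

val model = kmeans.run(vectors)      // vectors: RDD[org.apache.spark.mllib.linalg.Vector]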
Hi All,
I am trying to run Spark k-means on a data set which is close to 1 GB.
Most often I see a BlockFetchFailed exception, which I suspect is because
of running out of memory.
Here are the configuration details:
Total cores: 12
Total workers: 3
Memory per node: 6 GB
When running the job, I am giving
Hi Team,
I am running the k-means algorithm on the KDD 1999 data set (
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). I am running the
algorithm for different values of k, such as 5, 10, 15, 40. The data set
is 709 MB. I have placed the file in HDFS with a block size of 128 MB (6
blocks).
Hi All,
I am running the k-means clustering algorithm. Now, I am running the
algorithm as -
val conf = new SparkConf
val sc = new SparkContext(conf)
.
.
val kmeans = new KMeans()
val model = kmeans.run(RDD[Vector])
.
.
.
The 'kmeans' object gets created on the driver. Now, does kmeans.run() get
Hi,
I am trying to build a real-time anomaly detection system using Spark,
Kafka, Cassandra and Akka. I have the network intrusion dataset (KDD Cup 1999).
How can I build the system using this? I understand that a certain part of
the data I am considering as historical data for my model training and
d use "lsof" on
> one of the spark executors (perhaps run it in a for loop, writing the
> output to separate files) until it fails and see which files are being
> opened, if there's anything that seems to be taking up a clear majority
> that might key you in on the culprit.
>
which would convey the same.
On Wed, Jan 6, 2016 at 8:19 PM, Annabel Melongo <melongo_anna...@yahoo.com>
wrote:
> Priya,
>
> It would be helpful if you put the entire trace log along with your code
> to help determine the root cause of the error.
>
> Thanks
>
>
> O
the "too many open files"
> exception.
>
>
> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Can some one throw light on this ?
>
> Regards,
> Padma Ch
>
> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch &l
Can someone throw light on this?
Regards,
Padma Ch
On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:
> Chris, we are using Spark version 1.3.0. We have not set the
> spark.streaming.concurrentJobs parameter; it takes the default value.
>
ing execution time - check total number of open files using lsof
>> command. Need root permissions. If it is cluster not sure much !
>> 2) which exact line in the code is triggering this error ? Can you paste
>> that snippet ?
>>
>>
>> On Wednesday 23 December 2015
ulimit -n 65000
fs.file-max = 65000 (in /etc/sysctl.conf)
Thanks,
Padma Ch
On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma <yash...@gmail.com> wrote:
> Could you share the ulimit for your setup please ?
>
> - Thanks, via mobile, excuse brevity.
> On Dec 22, 2015
Jakob,
I increased settings like fs.file-max in /etc/sysctl.conf and also
increased the user limit in /etc/security/limits.conf, but I still see the
same issue.
On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky wrote:
> It might be a good idea to see how many files are open
Hi All,
When running the streaming application, I am seeing the below error:
java.io.FileNotFoundException:
/data1/yarn/nm/usercache/root/appcache/application_1450172646510_0004/blockmgr-a81f42cd-6b52-4704-83f3-2cfc12a11b86/02/temp_shuffle_589ddccf-d436-4d2c-9935-e5f8c137b54b
(Too many open
Hi All,
I have the following scenario when writing rows to Cassandra from Spark
Streaming -
in a 1 sec batch, I have 3 tickets with the same ticket number (primary key)
but with different envelope numbers (i.e. envelope 1, envelope 2, envelope
3). I am writing these messages to Cassandra using
Hi All,
I have the following use case for Spark Streaming -
There are 2 streams of data say - FlightBookings and Ticket
For each ticket, I need to associate it with relevant Booking info. There
are distinct applications for Booking and Ticket. The Booking streaming
application processes the
Hi All,
I am seeing an exception when trying to subtract 2 RDDs.
Let's say rdd1 has messages like -
* pnr, bookingId, BookingObject*
101, 1, BookingObject1 // - event number is 0
102, 1, BookingObject2// - event number is 0
103, 2,
across all test suites ???
On Thu, Nov 5, 2015 at 12:36 AM, Bryan Jeffrey <bryan.jeff...@gmail.com>
wrote:
> Priya,
>
> If you're trying to get unit tests running local spark contexts, you can
> just set up your spark context with 'spark.driver.allowMultipleContexts'
> set
Hello All,
How do I use multiple SparkContexts when executing multiple test suites of
Spark code?
Can someone throw light on this?
One more question: if I have a function which takes an RDD as a parameter,
how do we mock an RDD?
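One pattern I have used (sketched below with ScalaTest, which is an assumption about the test framework) is to share a single local SparkContext per suite and stop it in afterAll, rather than allowing multiple concurrent contexts; and instead of mocking an RDD, just build a tiny real one with sc.parallelize over fixture data:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class MyFunctionSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("function that takes an RDD works on a tiny fixture") {
    val input = sc.parallelize(Seq("a", "b", "a"))   // no mocking needed
    assert(input.count() === 3)                      // call your RDD-taking function here instead
  }
}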
On Thu, Oct 29, 2015 at 5:20 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:
> How do we do it for Cassandra..can we use the same Mocking ?
> EmbeddedCassandra Server
ck.executeUpdate _).expects()
>
>
> On Thu, Oct 29, 2015 at 12:27 PM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> For my Spark Streaming code, which writes the results to Cassandra DB,
>> I need to write Unit test cases. what are the available test frameworks to
>> mock the connection to Cassandra DB ?
>>
>
>
Hi All,
For my Spark Streaming code, which writes the results to Cassandra, I
need to write unit test cases. What are the available test frameworks to
mock the connection to the Cassandra DB?
Hello All,
I have two Cassandra RDDs. I am using joinWithCassandraTable, which is
doing a cartesian join, because of which we are getting unwanted rows.
How do I perform an inner join on Cassandra RDDs? If I use a normal
join, I have to read the entire table, which is costly.
Is there any
Hi All,
When processing streams of data (with a batch interval of 1 sec), there is a
possible concurrency issue, i.e. two messages M1 and M2 (an updated
version of M1) with the same key are processed by 2 threads in parallel.
To resolve this concurrency issue, I am applying a HashPartitioner to the RDD.
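If it helps, a sketch of that idea with a version number used to keep only the latest message per key inside the batch, so two versions of the same key are never written independently (the key type and the eventNumber field are assumptions about your messages):

import org.apache.spark.HashPartitioner

// messages: DStream[(String, Message)] keyed by the primary key -- assumed shape
val latestPerKey = messages.transform { rdd =>
  // reduceByKey itself hash-partitions by key; partitionBy shown to mirror the HashPartitioner idea
  rdd.partitionBy(new HashPartitioner(rdd.sparkContext.defaultParallelism))
     .reduceByKey((m1, m2) => if (m1.eventNumber >= m2.eventNumber) m1 else m2)
}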
2015 at 6:54 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Which Spark release are you using ?
>
> Can you show the snippet of your code around CassandraSQLContext#sql() ?
>
> Thanks
>
> On Mon, Sep 28, 2015 at 6:21 AM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>
Hi All,
I am trying to use DataFrames (which contain data from Cassandra) inside
rdd.foreach. This is throwing the following exception.
Is CassandraSQLContext accessible within an executor?
15/09/28 17:22:40 ERROR JobScheduler: Error running job streaming job
144344116 ms.0
ode and the
spark version is 1.3.0
This has thrown
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_5_piece0 of broadcast_5
Regards,
Padma Ch
On Thu, Sep 24, 2015 at 11:25 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:
> It should, i don't see any reason for it t
cord from cassandra, you
>>> need to do cassandraRdd.map(func), which will run distributed on the spark
>>> workers
>>>
>>> *Romi Kuntsman*, *Big Data Engineer*
>>> http://www.totango.com
>>>
>>> On Mon, Sep 21, 2015 at 3:29 PM, Priya
; stream.filter(...).map(x=>(key,
> )).joinWithCassandra().map(...).saveToCassandra()
>
> I'm not sure about exactly 10 messages, spark streaming focus on time not
> count..
>
>
> On Tue, Sep 22, 2015 at 2:14 AM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> I h
, 2015 at 3:06 PM, Petr Novak <oss.mli...@gmail.com> wrote:
> add @transient?
>
> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Hello All,
>>
>> How can i pass sparkContext as a parameter to a method in an obje
orkers ???
Regards,
Padma Ch
On Mon, Sep 21, 2015 at 5:10 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> You can use broadcast variable for passing connection information.
>
> Cheers
>
> On Sep 21, 2015, at 4:27 AM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>
Hello All,
How can I pass the SparkContext as a parameter to a method in an object?
Passing the SparkContext is giving me a Task not serializable exception.
How can I achieve this?
Thanks,
Padma Ch
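For context, a sketch of the usual distinction (a guess about your code, since it isn't shown): passing the SparkContext into driver-side methods is fine; the Task not serializable error appears when the context gets captured inside a closure that is shipped to executors:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Jobs {
  // fine: sc is only used on the driver to build RDDs
  def buildBase(sc: SparkContext): RDD[Int] = sc.parallelize(1 to 100)

  // not fine: referencing sc inside map() drags it into the task closure
  // def broken(sc: SparkContext, rdd: RDD[Int]) = rdd.map(x => x + sc.defaultParallelism)
}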
Hello All,
Instead of declaring the SparkContext in main, I declared it as an object
variable, as follows -
object sparkDemo
{
val conf = new SparkConf
val sc = new SparkContext(conf)
def main(args:Array[String])
{
val baseRdd = sc.parallelize()
.
.
.
}
}
But this piece of code is giving
; true. What is the possible solution for this?
Is this a bug in Spark 1.3.0? Would changing the scheduling mode to standalone
or Mesos mode work fine?
Please someone share your views on this.
On Sat, Sep 12, 2015 at 11:04 PM, Priya Ch <learnings.chitt...@gmail.com>
wrote:
> Hello A
Hello All,
When I push messages into Kafka and read them in the streaming application, I
see the following exception -
I am running the application on YARN and am nowhere broadcasting the message
within the application. I am simply reading the message, parsing it,
populating fields in a class and then
Hello All,
I am using foreachRDD in my code as -
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // look up the Cassandra table
    // save updated rows to the Cassandra table
  }
}
This foreachRDD is causing executor lost failures. What is the behavior of
this foreachRDD?
Thanks,
Padma Ch
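One likely contributor (a guess, since the executor logs aren't shown) is doing the lookup per record; the usual reshaping is foreachPartition so one session is shared across a partition. The sketch below uses the DataStax connector's CassandraConnector; keyspace, table and the CQL are placeholders:

import com.datastax.spark.connector.cql.CassandraConnector

dstream.foreachRDD { rdd =>
  val connector = CassandraConnector(rdd.sparkContext.getConf)   // built on the driver, serializable
  rdd.foreachPartition { records =>
    connector.withSessionDo { session =>
      records.foreach { record =>
        // one shared session per partition instead of per-record connections
        session.execute(s"SELECT * FROM my_ks.my_table WHERE id = '$record'")   // hypothetical lookup
        // ... apply the update and write back here
      }
    }
  }
}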
Hi All,
In Spark, each action results in launching a job. Let's say my Spark app
looks like this -
val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
val rdd1 = baseRDD.map(x => x + 2)
val rdd2 = rdd1.filter(x => x % 2 == 0)
val count = rdd2.count
val firstElement = rdd2.first
println("Count is " + count)
euse the data, you need to call rdd2.cache
>>
>>
>>
>> On Sun, Sep 6, 2015 at 2:33 PM, Priya Ch <learnings.chitt...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> In Spark, each action results in launching a job. Lets say my spark
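To make the rdd2.cache advice concrete against the snippet above (same example, one added call):

val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
val rdd1 = baseRDD.map(x => x + 2)
val rdd2 = rdd1.filter(x => x % 2 == 0).cache()   // materialized by the first action, reused afterwards

val count = rdd2.count()          // job 1: computes and caches rdd2
val firstElement = rdd2.first()   // job 2: reads the cached partitions instead of recomputing the lineage
println("Count is " + count)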
Hi All,
I have a Spark Streaming application which writes the processed results to
Cassandra. In local mode, the code seems to work fine. The moment I start
running in distributed mode using YARN, I see executor lost failures. I
increased the executor memory to occupy the entire node's memory, which is
Hi All,
I have the following scenario:
There exists a booking table in Cassandra, which holds fields like
bookingid, passengerName, contact, etc.
Now in my Spark Streaming application, there is one class, Booking, which
acts as a container and holds all the field details -
class
Hi All,
Thank you very much for the detailed explanation.
I have a scenario like this -
I have an RDD of ticket records and another RDD of booking records. For each
ticket record, I need to check whether any link exists in the booking table.
val ticketCachedRdd = ticketRdd.cache
ticketRdd.foreach{
transformation. For
more information, see SPARK-5063.
On Mon, Aug 17, 2015 at 8:13 PM, Preetam preetam...@gmail.com wrote:
The error could be because of the missing brackets after the word cache -
.ticketRdd.cache()
On Aug 17, 2015, at 7:26 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi
Hi All,
I have a question about writing an RDD to Cassandra. Instead of writing the
entire RDD to Cassandra, I want to write individual statements into Cassandra,
because there is a need to perform ETL on each message (which requires
checking with the DB).
How could I insert statements individually?
to do is *transform* the rdd before writing it, e.g. using
the .map function.
On Thu, Aug 13, 2015 at 11:30 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi All,
I have a question in writing rdd to cassandra. Instead of writing entire
rdd to cassandra, i want to write individual
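A minimal sketch of that ".map before writing" suggestion with the DataStax connector, where the keyspace, table and the etl function are placeholders for your own:

import com.datastax.spark.connector._

// etl(record) does the per-message checks and enrichment against the DB -- hypothetical
rdd.map(record => etl(record))
   .saveToCassandra("my_keyspace", "my_table")   // columns inferred from the element's fields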
key.
Hope that helps.
Greetings,
Juan
2015-07-30 10:50 GMT+02:00 Priya Ch learnings.chitt...@gmail.com:
Hi All,
Can someone throw some insight on this?
On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi TD,
Thanks for the info. I have the scenario
Hi All,
Can someone throw some insight on this?
On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi TD,
Thanks for the info. I have the scenario like this.
I am reading the data from kafka topic. Let's say kafka has 3 partitions
for the topic. In my
. This will guard against multiple attempts to
run the task that inserts into Cassandra.
See
http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
TD
On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi All,
I have
Hi All,
I have a problem when writing streaming data to Cassandra. Our existing
product is on an Oracle DB, in which, while writing data, locks are maintained
so that duplicates in the DB are avoided.
But as Spark has a parallel processing architecture, if more than 1 thread is
trying to write the same
Hi All,
I configured a Kafka cluster on a single node and I have a streaming
application which reads data from a Kafka topic using KafkaUtils. When I
execute the code in local mode from the IDE, the application runs fine.
But when I submit the same to the Spark cluster in standalone mode, I end up
Please find the attached worker log.
I can see a stream closed exception.
On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote:
Could you attach the executor log? That may help identify the root
cause. -Xiangrui
On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt
Hi All,
The Word2Vec and TF-IDF algorithms in Spark MLlib 1.1.0 are working only in
local mode and not in distributed mode. A null pointer exception is
thrown. Is this a bug in Spark 1.1.0?
Following is the code:
def main(args:Array[String])
{
val conf=new SparkConf
val sc=new
Hi All,
I have Akka remote actors running on 2 nodes. I submitted the Spark
application from node1. In the Spark code, in one of the RDDs, I am sending a
message to the actor running on node1. My Spark code is as follows:
class ActorClient extends Actor with Serializable
{
import context._
val
. If yes, then try setting the level of parallelism and
number of partitions while creating/transforming the RDD.
Thanks
Best Regards
On Fri, Nov 14, 2014 at 5:17 PM, Priya Ch [hidden email] wrote:
Hi All,
We have set up 2 node cluster (NODE
Hi Spark users/experts,
In the Spark source code (Master.scala, Worker.scala), when registering the
worker with the master, I see the usage of persistenceEngine. When we don't
specify spark.deploy.recoveryMode explicitly, what is the default value
used? This recovery mode is used to persist and
Hi Team,
When I am trying to use DenseMatrix from the Breeze library in Spark, it is
throwing the following error:
java.lang.NoClassDefFoundError: breeze/storage/Zero
Can someone help me on this ?
Thanks,
Padma Ch
Hi,
I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to
disk with rdd.persist(DISK_ONLY). But unfortunately I couldn't find the
location where the RDD has been written to disk. I set
SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to some other location rather than
using the default
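For reference, a sketch of the persist call together with the setting that controls where the blocks land (the directory below is just an example; on workers SPARK_LOCAL_DIRS takes precedence over spark.local.dir). The block files only appear once an action materializes the RDD, as rdd_<rddId>_<partition> files under each executor's local directory:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("PersistDemo")
  .set("spark.local.dir", "/data/spark-local")   // example location for shuffle and DISK_ONLY blocks

val sc = new SparkContext(conf)
val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()   // blocks are written to disk only when an action runs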
Hi,
I am trying to build the jars using the command:
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
Execution of the above command throws the following error:
[INFO] Spark Project Core . FAILURE [ 0.295 s]
[INFO] Spark Project Bagel