Hello All,
I am trying to extend SparkListener so that, after a job ends, I can
retrieve the job name, check whether the job succeeded or failed, and write
the result to a log file.
I couldn't find a way to fetch the job name in the onJobEnd method.
Thanks,
Padma CH
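A sketch of the workaround I am attempting (it assumes the application sets a name via sc.setJobDescription, which surfaces as the "spark.job.description" property at job start; otherwise it falls back to the jobId):

```scala
import org.apache.spark.scheduler.{JobSucceeded, SparkListener,
  SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.concurrent.TrieMap

class JobNameListener extends SparkListener {
  // jobId -> name captured at job start
  private val jobNames = TrieMap.empty[Int, String]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val name = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.job.description")))
      .getOrElse(s"job-${jobStart.jobId}")
    jobNames.put(jobStart.jobId, name)
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val name = jobNames.remove(jobEnd.jobId).getOrElse("unknown")
    val status = if (jobEnd.jobResult == JobSucceeded) "SUCCESS" else "FAILURE"
    println(s"$name -> $status") // replace with a write to the log file
  }
}
```

The listener would be registered with sc.addSparkListener(new JobNameListener).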
Hello Team,
I am trying to write a Dataset as a Parquet file in Append mode,
partitioned by a few columns. However, since the job is time consuming, I
would like to enable DirectFileOutputCommitter (i.e., bypass the writes to
the temporary folder).
The Spark version I am using is 2.3.1.
Can someone p
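For context on what I have found so far: DirectFileOutputCommitter was, as far as I can tell, removed before Spark 2.0 because it can lose data on task retries, so it is unavailable in 2.3.1. The closest supported option seems to be the v2 commit algorithm, which renames task output directly into the destination instead of doing a second job-level rename. A sketch (column names and paths are placeholders):

```scala
// Sketch, assuming Spark 2.3.1: enable the faster v2 FileOutputCommitter
// algorithm. Note this is still not atomic -- a failed job can leave
// partial files in the destination.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("append-parquet")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

df.write                          // `df` stands in for the Dataset
  .mode("append")
  .partitionBy("year", "month")   // placeholder partition columns
  .parquet("/warehouse/events")   // placeholder path
```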
Hi All,
I have video surveillance data and this needs to be processed in Spark. I
am going through Spark + OpenCV. How do I load .mp4 files into an RDD?
Can we do this directly, or does the video need to be converted to a
SequenceFile first?
Thanks,
Padma CH
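To make the question concrete, here is the kind of sketch I have in mind (sc.binaryFiles is a real Spark API; the frame-decoding library is an assumption and not shown):

```scala
// Each element is (path, PortableDataStream); files are not split, so this
// suits many medium-sized clips rather than one huge video file.
val videos = sc.binaryFiles("hdfs:///surveillance/*.mp4") // placeholder path

val decoded = videos.map { case (path, stream) =>
  val bytes = stream.toArray() // whole file into executor memory
  // Hand `bytes` to a decoder here, e.g. JavaCV/FFmpeg (external library,
  // not shown). Converting to a SequenceFile first mainly helps when there
  // are many small files, to reduce NameNode pressure.
  (path, bytes.length)
}
```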
wouldn't necessarily "use spark" to send the alert. Spark is in an
> important sense one library among many. You can have your application use
> any other library available for your language to send the alert.
>
> Marcin
>
> On Tue, Jul 12, 2016 at 9:25 AM, Priya Ch
>
Hi All,
I am building a real-time anomaly detection system where I am using
k-means to detect anomalies. Now, in order to send an alert to a mobile
phone or an email alert, how do I send it using Spark itself?
Thanks,
Padma CH
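To illustrate the suggestion above, a sketch where the alert is ordinary library code run from the driver inside foreachRDD (the webhook URL and the `anomalies` stream are placeholders):

```scala
anomalies.foreachRDD { rdd =>
  val found = rdd.take(10) // pull a small sample back to the driver
  if (found.nonEmpty) {
    // Any alerting library works here (JavaMail, an SMS gateway, ...);
    // this uses a bare HTTP POST to a hypothetical webhook endpoint.
    val conn = new java.net.URL("https://alerts.example.com/hook")
      .openConnection().asInstanceOf[java.net.HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.getOutputStream.write(found.mkString("\n").getBytes("UTF-8"))
    conn.getResponseCode // consume the response so the request completes
    conn.disconnect()
  }
}
```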
Has anyone resolved this?
Thanks,
Padma CH
On Wed, Jun 22, 2016 at 4:39 PM, Priya Ch
wrote:
> Hi All,
>
> I am running a Spark application with 1.8TB of data (which is stored in
> Hive tables). I am reading the data using HiveContext and processing it.
> The cluster ha
Hi All,
I am running a Spark application with 1.8TB of data (which is stored in
Hive tables). I am reading the data using HiveContext and processing it.
The cluster has 5 nodes in total, 25 cores per machine and 250 GB per node.
I am launching the application with 25 executors with 5 cores each
> On Thu, May 26, 2016 at 8:00 PM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> If your job gets stuck or fails, one of the best practices is to
>> increase the number of partitions.
>> Also, you'd be better off using DataFrame instead of RDD in terms of join
>> optimization.
Hello Team,
I am trying to join 2 RDDs, where one is 800 MB in size and the
other is 190 MB. During the join step, my job halts and I don't see any
progress in the execution.
This is the message I see on console -
INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
locations
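Following up for the archive, a sketch of the two suggestions in the reply (raise the partition count, and prefer the DataFrame API, which can broadcast the ~190 MB side and avoid the shuffle entirely); paths and the join key are placeholders:

```scala
import org.apache.spark.sql.functions.broadcast

val big   = spark.read.parquet("/data/big")    // ~800 MB, placeholder path
val small = spark.read.parquet("/data/small")  // ~190 MB, placeholder path

// Hint the optimizer to ship `small` to every executor instead of
// shuffling both sides.
val joined = big.join(broadcast(small), Seq("key"))

// If staying with plain RDDs, raising parallelism can unstick the shuffle:
// bigRdd.join(smallRdd, 400)
```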
e matching with the
> bigger dataset there.
> This highly depends on the data in your data set. How they compare in size
> etc.
>
>
>
> On 25 May 2016, at 13:27, Priya Ch wrote:
>
> Why do I need to deploy Solr for text analytics... I have files placed in
> HDFS. j
On Wed, May 25, 2016 at 4:49 PM, Jörn Franke wrote:
>
> Alternatively depending on the exact use case you may employ solr on
> Hadoop for text analytics
>
> > On 25 May 2016, at 12:57, Priya Ch wrote:
> >
> > Lets say i have rdd A of strings as {"hi","
g cartesian. Till
now the application has taken 30 mins, and on top of that I see executor
lost issues.
Thanks,
Padma Ch
On Wed, May 25, 2016 at 4:22 PM, Jörn Franke wrote:
> What is the use case of this ? A Cartesian product is by definition slow
> in any system. Why do you need this? How long does y
iting as
> parquet, orc, ...?
>
> // maropu
>
> On Wed, May 25, 2016 at 7:10 PM, Priya Ch
> wrote:
>
>> Hi , Yes I have joined using DataFrame join. Now to save this into hdfs
>> .I am converting the joined dataframe to rdd (dataframe.rdd) and using
>> saveAsTextFile, tr
from the
>> FinancialData table.
>>
>>
>> Not very useful
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/pro
Hi All,
I have two RDDs A and B, where A is 30 MB and B is 7 MB in size;
A.cartesian(B) is taking too much time. Is there any bottleneck in the
cartesian operation?
I am using Spark version 1.6.0.
Regards,
Padma Ch
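A sketch of the workaround I am considering: cartesian materializes |A| x |B| pairs across a shuffle, which is inherently heavy, but since B (~7 MB) fits in memory, broadcasting it and expanding locally avoids the shuffle entirely (variable names are placeholders):

```scala
// Collect the small RDD to the driver and broadcast it once per executor.
val bLocal = sc.broadcast(rddB.collect())

// Expand each element of A against the in-memory copy of B -- same pairs
// as rddA.cartesian(rddB), but with no shuffle.
val pairs = rddA.flatMap(a => bLocal.value.map(b => (a, b)))
```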
Hi All,
I am working with Kafka and Spark Streaming, and I want to write the
streaming output to a single file. dstream.saveAsTextFiles() is creating
files in different folders. Is there a way to write to a single folder? Or,
if it is written to different folders, how do I merge them?
Thanks,
Padma C
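A sketch of the two workarounds I am aware of (paths are placeholders; FileUtil.copyMerge is the Hadoop 2.x API and was dropped in newer Hadoop releases):

```scala
// 1) One part-file per batch: coalesce each micro-batch to a single
//    partition before writing. Still one directory per batch by design.
dstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty())
    rdd.coalesce(1).saveAsTextFile(s"/output/batch-${time.milliseconds}")
}

// 2) Merge the batch directories afterwards on the driver:
// FileUtil.copyMerge(fs, new Path("/output"), fs, new Path("/merged.txt"),
//                    false, hadoopConf, null)
```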
Hi All,
I am using streaming k-means to train my model on streaming data. Now I
want to visualize the clusters. What reporting tool would be used for
this? Could Zeppelin be used to visualize the clusters?
Regards,
Padma Ch
Hi Xi Shen,
Changing the initialization step from "k-means||" to "random" decreased
the execution time from 2 hrs to 6 min. However, by default the number of
runs is 1. If I set the number of runs to 10, I again see an increase in
job execution time.
How should I proceed on this?
By the way how
Hi All,
I am trying to run Spark k-means on a data set which is close to 1 GB.
I often see a BlockFetchFailed exception, which I suspect is caused by
running out of memory.
Here are the configuration details -
Total cores:12
Total workers:3
Memory per node: 6GB
When running the job, I an giving th
Hi Team,
I am running the k-means algorithm on the KDD 1999 data set (
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). I am running the
algorithm for different values of k, such as 5, 10, 15, 40. The data set
is 709 MB. I have placed the file in HDFS with a block size of 128 MB (6
blocks).
Hi All,
I am running k-means clustering algorithm. Now, when I am running the
algorithm as -
val conf = new SparkConf
val sc = new SparkContext(conf)
.
.
val kmeans = new KMeans()
val model = kmeans.run(RDD[Vector])
.
.
.
The 'kmeans' object gets created on the driver. Now does *kmeans.run() *get
e
Hi,
I am trying to build a real-time anomaly detection system using Spark,
Kafka, Cassandra and Akka. I have a network intrusion dataset (KDD 1999
cup). How can I build the system using this? I understood that a certain
part of the data I am considering as historical data for my model training
and
o
which would convey the same.
On Wed, Jan 6, 2016 at 8:19 PM, Annabel Melongo
wrote:
> Priya,
>
> It would be helpful if you put the entire trace log along with your code
> to help determine the root cause of the error.
>
> Thanks
>
>
> On Wednesday, Januar
f" on
> one of the spark executors (perhaps run it in a for loop, writing the
> output to separate files) until it fails and see which files are being
> opened, if there's anything that seems to be taking up a clear majority
> that might key you in on the culprit.
>
> O
iles"
> exception.
>
>
> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Can some one throw light on this ?
>
> Regards,
> Padma Ch
>
> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch
> wrote:
>
> Chris
Can someone throw some light on this?
Regards,
Padma Ch
On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch
wrote:
> Chris, we are using Spark version 1.3.0. We have not set the
> spark.streaming.concurrentJobs
> parameter; it takes the default value.
>
> Vijay,
>
> From the tac
s using lsof
>> command. Need root permissions. If it is cluster not sure much !
>> 2) which exact line in the code is triggering this error ? Can you paste
>> that snippet ?
>>
>>
>> On Wednesday 23 December 2015, Priya Ch
>> wrote:
>>
>>&g
ulimit -n 65000
fs.file-max = 65000 ( in etc/sysctl.conf file)
Thanks,
Padma Ch
On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma wrote:
> Could you share the ulimit for your setup please ?
>
> - Thanks, via mobile, excuse brevity.
> On Dec 22, 2015 6:39 PM, "Priya Ch&qu
Jakob,
I increased settings like fs.file-max in /etc/sysctl.conf and also
increased the user limit in /etc/security/limits.conf, but I still see the
same issue.
On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky wrote:
> It might be a good idea to see how many files are open and try increasing
> th
Hi All,
When running streaming application, I am seeing the below error:
java.io.FileNotFoundException:
/data1/yarn/nm/usercache/root/appcache/application_1450172646510_0004/blockmgr-a81f42cd-6b52-4704-83f3-2cfc12a11b86/02/temp_shuffle_589ddccf-d436-4d2c-9935-e5f8c137b54b
(Too many open files
Hi All,
I have the following scenario when writing rows to Cassandra from Spark
Streaming -
in a 1 sec batch, I have 3 tickets with the same ticket number (primary
key) but with different envelope numbers (i.e., envelope 1, envelope 2,
envelope 3). I am writing these messages to Cassandra using saveToc
Hi All,
I have the following use case for Spark Streaming -
There are 2 streams of data, say FlightBookings and Ticket.
For each ticket, I need to associate it with the relevant Booking info.
There are distinct applications for Booking and Ticket. The Booking
streaming application processes the in
Hi All,
I am seeing an exception when trying to subtract 2 RDDs.
Let's say rdd1 has messages like -
* pnr, bookingId, BookingObject*
101, 1, BookingObject1 // - event number is 0
102, 1, BookingObject2// - event number is 0
103, 2, Bo
conf, Seconds(seconds))
> }
>
> Regards,
>
> Bryan Jeffrey
>
>
> On Wed, Nov 4, 2015 at 9:49 AM, Ted Yu wrote:
>
>> Are you trying to speed up tests where each test suite uses single
>> SparkContext
>> ?
>>
>> You may want to read
Hello All,
How do we use multiple SparkContexts when executing multiple test suites
of Spark code?
Can someone throw some light on this?
One more question: if I have a function which takes an RDD as a parameter,
how do we mock an RDD?
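For anyone with the same question, the pattern I am trying (a sketch; Spark 1.x allows only one active SparkContext per JVM, so the usual approach is to share one local context across suites rather than mock it, and an RDD can be "mocked" with sc.parallelize):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One lazily created, shared local context for all test suites.
object TestSpark {
  lazy val sc = new SparkContext(
    new SparkConf().setMaster("local[2]").setAppName("test"))
}

// In a suite, assuming a function under test myFunc: RDD[Int] => Int
// (the name is a placeholder):
val input = TestSpark.sc.parallelize(Seq(1, 2, 3))
assert(myFunc(input) == expected)
```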
On Thu, Oct 29, 2015 at 5:20 PM, Priya Ch
wrote:
> How do we do it for Cassandra? Can we use the same mocking?
> An EmbeddedCassandra server is available with CassandraUnit. Can this be use
On Thu, Oct 29, 2015 at 12:27 PM, Priya Ch
> wrote:
>
>> Hi All,
>>
>> For my Spark Streaming code, which writes the results to Cassandra DB,
>> I need to write Unit test cases. what are the available test frameworks to
>> mock the connection to Cassandra DB ?
>>
>
>
Hi All,
For my Spark Streaming code, which writes its results to a Cassandra DB, I
need to write unit test cases. What test frameworks are available to mock
the connection to the Cassandra DB?
Hello All,
I have two Cassandra RDDs. I am using joinWithCassandraTable, which is
doing a cartesian join, because of which we are getting unwanted rows.
How do I perform an inner join on Cassandra RDDs? If I use a normal
join, I have to read the entire table, which is costly.
Is there any speci
Hi All,
When processing streams of data (with a batch interval of 1 sec), there is
a possible concurrency issue: two messages M1 and M2 (an updated version
of M1) with the same key are processed by 2 threads in parallel.
To resolve this concurrency issue, I am applying a HashPartitioner on the
RDD.
(
Sep 28, 2015 at 6:54 PM, Ted Yu wrote:
> Which Spark release are you using ?
>
> Can you show the snippet of your code around CassandraSQLContext#sql() ?
>
> Thanks
>
> On Mon, Sep 28, 2015 at 6:21 AM, Priya Ch
> wrote:
>
>> Hi All,
>>
>> I am trying to use da
Hi All,
I am trying to use DataFrames (which contain data from Cassandra) inside
rdd.foreach. This is throwing the following exception:
Is CassandraSQLContext accessible within executors?
15/09/28 17:22:40 ERROR JobScheduler: Error running job streaming job
144344116 ms.0
org.apache.spark.Sp
ode and the
spark version is 1.3.0
This has thrown
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_5_piece0 of broadcast_5
Regards,
Padma Ch
On Thu, Sep 24, 2015 at 11:25 AM, Akhil Das
wrote:
> It should, i don't see any reason for it to not run in cluster mode.
>
..)).joinWithCassandra().map(...).saveToCassandra()
>
> I'm not sure about exactly 10 messages, spark streaming focus on time not
> count..
>
>
> On Tue, Sep 22, 2015 at 2:14 AM, Priya Ch
> wrote:
>
>> I have scenario like this -
>>
>> I read dstream
.map(func), which will run distributed on the spark
>>> workers
>>>
>>> *Romi Kuntsman*, *Big Data Engineer*
>>> http://www.totango.com
>>>
>>> On Mon, Sep 21, 2015 at 3:29 PM, Priya Ch
>>> wrote:
>>&
orkers ???
Regards,
Padma Ch
On Mon, Sep 21, 2015 at 5:10 PM, Ted Yu wrote:
> You can use broadcast variable for passing connection information.
>
> Cheers
>
> On Sep 21, 2015, at 4:27 AM, Priya Ch
> wrote:
>
> can i use this sparkContext on executors ??
> In my applic
ep 21, 2015 at 3:06 PM, Petr Novak wrote:
> add @transient?
>
> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch
> wrote:
>
>> Hello All,
>>
>> How can i pass sparkContext as a parameter to a method in an object.
>> Because passing sparkContext is giving me Ta
Hello All,
How can I pass a SparkContext as a parameter to a method in an object?
Passing the SparkContext is giving me a TaskNotSerializable exception.
How can I achieve this?
Thanks,
Padma Ch
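A sketch of the @transient fix suggested in the thread (all names are placeholders): SparkContext is not serializable, so it must never be captured by a closure shipped to executors; marking the field @transient keeps it out of serialization while driver-side methods can still use it.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class Helper(@transient val sc: SparkContext) extends Serializable {
  // Runs on the driver, so using sc here is fine.
  def buildRdd(data: Seq[Int]): RDD[Int] = sc.parallelize(data)

  // The map closure is shipped to executors; since sc is @transient it is
  // not dragged along, avoiding TaskNotSerializable.
  def double(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * 2)
}
```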
Hello All,
Instead of declaring the SparkContext in main, I declared it as an object
variable, as -
object sparkDemo
{
val conf = new SparkConf
val sc = new SparkContext(conf)
def main(args:Array[String])
{
val baseRdd = sc.parallelize()
.
.
.
}
}
But this piece of code is giving
; true. What is the possible solution for this ?
Is this a bug in Spark 1.3.0? Changing the scheduling mode to Stand-alone
or Mesos mode would work fine ??
Please someone share your views on this.
On Sat, Sep 12, 2015 at 11:04 PM, Priya Ch
wrote:
> Hello All,
>
> When I push messages into
Hello All,
When I push messages into Kafka and read them into the streaming
application, I see the following exception -
I am running the application on YARN and am nowhere broadcasting the
message within the application. I am just reading the message, parsing it
and populating fields in a class and then prin
Hello All,
I am using foreachRDD in my code as -
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // look up with Cassandra table
    // save updated rows to Cassandra table
  }
}
This foreachRDD is causing executor lost failures. What is the behavior of
this foreachRDD?
Thanks,
Padma Ch
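A sketch of the alternative I am considering: foreachPartition with one connection per partition instead of per-record work, so executors do not open a lookup per record (the connector calls are commented placeholders, not a confirmed API):

```scala
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Open one Cassandra session per partition, e.g. via the
    // spark-cassandra-connector's CassandraConnector (assumption):
    // val session = CassandraConnector(conf).openSession()
    records.foreach { record =>
      // look up and upsert with the shared session
    }
    // session.close()
  }
}
```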
ady computed by
> rdd2.count, because it is already available. If some partitions are not
> available due to GC, then only those partitions are recomputed.
>
> On Sun, Sep 6, 2015 at 5:11 PM, Jeff Zhang wrote:
>
>> If you want to reuse the data, you need to call rdd2.cache
>
Hi All,
In Spark, each action results in launching a job. Let's say my Spark app
looks like this -
val baseRdd = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
val rdd1 = baseRdd.map(x => x + 2)
val rdd2 = rdd1.filter(x => x % 2 == 0)
val count = rdd2.count
val firstElement = rdd2.first
println("Count is " + count)
println(
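To make the question concrete, my understanding (which I'd like confirmed) is that each action launches its own job, so rdd2's lineage runs twice unless it is cached before the first action:

```scala
rdd2.cache()                    // mark for caching before any action
val count = rdd2.count          // job 1: computes rdd2 and fills the cache
val firstElement = rdd2.first   // job 2: served from cached partitions
```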
Hi All,
I have a Spark Streaming application which writes the processed results
to Cassandra. In local mode, the code seems to work fine. The moment I
start running in distributed mode using YARN, I see executor lost
failures. I increased the executor memory to occupy the entire node's
memory, which is aro
Hi All,
I have the following scenario:
There exists a booking table in Cassandra, which holds fields like
bookingid, passengerName, contact, etc.
Now in my Spark Streaming application, there is one class, Booking, which
acts as a container and holds all the field details -
class Booking
map transformation. For
more information, see SPARK-5063.
On Mon, Aug 17, 2015 at 8:13 PM, Preetam wrote:
> The error could be because of the missing brackets after the word cache -
> .ticketRdd.cache()
>
> > On Aug 17, 2015, at 7:26 AM, Priya Ch
> wrote:
> >
> > Hi All,
Hi All,
Thank you very much for the detailed explanation.
I have a scenario like this -
I have an RDD of ticket records and another RDD of booking records. For
each ticket record, I need to check whether any link exists in the booking
table.
val ticketCachedRdd = ticketRdd.cache
ticketRdd.foreach{
ticke
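A sketch of what I am trying to achieve with a join instead of a per-record lookup (field names are placeholders; nesting an RDD lookup inside foreach does not work, since RDD operations cannot run inside other RDD operations):

```scala
// Key both sides by the linking field, then join once.
val ticketsByPnr  = ticketRdd.keyBy(_.pnr)   // `pnr` is an assumed field
val bookingsByPnr = bookingRdd.keyBy(_.pnr)

// (pnr, (ticket, Option[booking])) -- None means no link exists.
val linked = ticketsByPnr.leftOuterJoin(bookingsByPnr)
```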
o do is *transform* the rdd before writing it, e.g. using
> the .map function.
>
>
> On Thu, Aug 13, 2015 at 11:30 AM, Priya Ch
> wrote:
>
>> Hi All,
>>
>> I have a question in writing rdd to cassandra. Instead of writing entire
>> rdd to cassandra, i want t
Hi All,
I have a question about writing an RDD to Cassandra. Instead of writing
the entire RDD to Cassandra, I want to write individual statements into
Cassandra because there is a need to perform ETL on each message (which
requires checking with the DB).
How could I insert statements individually? Usin
combine the messages with the same primary key.
>
> Hope that helps.
>
> Greetings,
>
> Juan
>
>
> 2015-07-30 10:50 GMT+02:00 Priya Ch :
>
>> Hi All,
>>
>> Can someone throw insights on this ?
>>
>> On Wed, Jul 29, 2015 at 8:29 AM, Priya
Hi All,
Can someone throw some insight on this?
On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch
wrote:
>
>
> Hi TD,
>
> Thanks for the info. I have the scenario like this.
>
> I am reading the data from kafka topic. Let's say kafka has 3 partitions
> for the topic. I
s will guard against multiple attempts to
> run the task that inserts into Cassandra.
>
> See
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>
> TD
>
> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch
> wrote:
>
>>
Hi All,
I have a problem when writing streaming data to Cassandra. Our existing
product is on an Oracle DB, in which, while writing data, locks are
maintained so that duplicates in the DB are avoided.
But as Spark has a parallel processing architecture, if more than 1 thread
is trying to write the same d
Hi All,
I configured a Kafka cluster on a single node, and I have a streaming
application which reads data from a Kafka topic using KafkaUtils. When I
execute the code in local mode from the IDE, the application runs fine.
But when I submit the same to the Spark cluster in standalone mode, I end
up with
Please find the attached worker log.
I can see a stream closed exception in it.
On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng wrote:
> Could you attach the executor log? That may help identify the root
> cause. -Xiangrui
>
> On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch
> wr
Hi All,
The Word2Vec and TF-IDF algorithms in Spark MLlib 1.1.0 are working only
in local mode and not in distributed mode; a null pointer exception is
thrown. Is this a bug in Spark 1.1.0?
*Following is the code:*
def main(args:Array[String])
{
val conf=new SparkConf
val sc=new Sp
Hi All,
I have Akka remote actors running on 2 nodes. I submitted the Spark
application from node1. In the Spark code, in one of the RDDs, I am
sending a message to the actor running on node1. My Spark code is as
follows:
class ActorClient extends Actor with Serializable
{
import context._
val curre
be
> doing the processing. If yes, then try setting the level of parallelism and
> number of partitions while creating/transforming the RDD.
>
> Thanks
> Best Regards
>
> On Fri, Nov 14, 2014 at 5:17 PM, Priya Ch <[hidden email]
> <http://user/SendEmail.jtp?type=node&nod
Hi All,
We have set up a 2-node cluster (NODE-DSRV05 and NODE-DSRV02), each
having 32 GB RAM, 1 TB hard disk capacity and 8 CPU cores. We have set up
HDFS, which has 2 TB capacity, with a block size of 256 MB. When we try
to process a 1 GB file on Spark, we see the following exception
14/11/
Hi Spark users/experts,
In the Spark source code (Master.scala & Worker.scala), when registering
the worker with the master, I see the usage of *persistenceEngine*. When
we don't specify spark.deploy.recoveryMode explicitly, what is the default
value used? This recovery mode is used to persist and re
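Partially answering my own question after reading Master.scala (so treat this as my reading of the code, not documentation): when spark.deploy.recoveryMode is unset, the default is NONE, which selects the no-op BlackHolePersistenceEngine, meaning nothing is persisted and master state is rebuilt from worker re-registration. Enabling persistence would look like this in spark-defaults.conf:

```
spark.deploy.recoveryMode      ZOOKEEPER
spark.deploy.zookeeper.url     zk1:2181,zk2:2181
spark.deploy.zookeeper.dir     /spark
```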
the classpath? In Spark
>> 1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze
>> version you used is different from the one comes with Spark, you might
>> see class not found. -Xiangrui
>>
>> On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch
>> wrote:
>&
Hi Team,
When I try to use DenseMatrix from the Breeze library in Spark, it throws
the following error:
java.lang.NoClassDefFoundError: breeze/storage/Zero
Can someone help me on this ?
Thanks,
Padma Ch
Hi,
I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to
disk as rdd.persist(DISK_ONLY), but unfortunately I couldn't find the
location where the RDD has been written to disk. I specified
SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to some other location rather than
using the default /
Hi,
I am trying to build jars using the command :
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
Execution of the above command is throwing the following error:
[INFO] Spark Project Core . FAILURE [ 0.295 s]
[INFO] Spark Project Bagel .