Hello,
We have a streaming job that consistently fails with the trace below. This is
on an AWS EMR 4.2/Spark 1.5.2 cluster.
This ticket looks related: SPARK-8112 (Received block event count through
the StreamingListener can be negative), although it appears to have been
fixed in 1.5.
This seems to be a problem with Kafka brokers being in a bad state. We're
restarting Kafka to resolve.
--
Nick
From: Ted Yu
Sent: Friday, January 22, 2016 10:38 AM
To: Afshartous, Nick
Cc: user@spark.apache.org
Subject: Re: Spark
Any help? Not sure what I am doing wrong.
Best Regards,
Ram
From: Ram VISWANADHA
Date: Friday, January 22, 2016 at 10:25 AM
To: user@spark.apache.org
Subject: StackOverflow when
Yes, you should query Kafka if you want to know the latest available
offsets.
There's code to make this straightforward in KafkaCluster.scala, but the
interface isn't public. There's an outstanding pull request to expose the
API at
https://issues.apache.org/jira/browse/SPARK-10963
but frankly
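For reference, a minimal sketch of querying the latest offset directly
through Kafka 0.8's SimpleConsumer API (broker host, topic, and partition
below are placeholder values):

import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
import kafka.common.TopicAndPartition
import kafka.consumer.SimpleConsumer

// Ask the partition's leader broker for its latest available offset;
// OffsetRequest.LatestTime (-1) selects the newest offset.
val consumer = new SimpleConsumer("broker-host", 9092, 10000, 64 * 1024, "offset-lookup")
val tp = TopicAndPartition("my-topic", 0)
val request = OffsetRequest(
  Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
val latest = consumer.getOffsetsBefore(request)
  .partitionErrorAndOffsets(tp).offsets.head
consumer.close()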
Hi,
I am deploying Spark 1.6.0 using yarn-client mode in our YARN cluster.
Everything works fine, except the first job is extremely slow due to
executor heartbeat RPC timeout:
WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat
I think this might be related to our
Thanks a lot for the help! I'll definitely check out KafkaCluster.scala.
I'll probably first try to use that API from Java, and later try to build
the subproject.
thanks,
Charles
On Fri, Jan 22, 2016 at 12:26 PM, Cody Koeninger wrote:
> Yes, you should query Kafka if you
Thanks for the tip. I will try it. But this is the kind of thing Spark is
supposed to figure out and handle, or at least not get stuck on forever.
Sent from my Verizon Wireless 4G LTE smartphone
Original message
From: Muthu Jayakumar
Date: 01/22/2016
Does increasing the number of partitions help? You could try something like
3 times what you currently have.
Another trick I used was to partition the problem into multiple DataFrames,
run them sequentially, persist the results, and then run a union on the
results.
Hope this helps.
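A minimal sketch of that trick (df1/df2/df3 and transform are placeholders
for your own split inputs and per-chunk logic):

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

def transform(df: DataFrame): DataFrame = df    // placeholder per-chunk logic
val chunks: Seq[DataFrame] = Seq(df1, df2, df3) // placeholder pre-split inputs
val processed = chunks.map { df =>
  val out = transform(df)
  out.persist(StorageLevel.MEMORY_AND_DISK)
  out.count()   // force materialization before starting the next chunk
  out
}
val result = processed.reduce(_ unionAll _)     // Spark 1.x DataFrame union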
On Fri,
Hi,
I have been using DirectKafkaInputDStream in Spark Streaming to consume Kafka
messages and it's been working very well. Now I have the need to batch process
messages from Kafka, for example, retrieve all messages every hour and process
them, output to destinations like Hive or HDFS. I
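One possible shape for such an hourly batch (a sketch against the Spark 1.x
spark-streaming-kafka API; broker, topic, and the offset window are
placeholders you would track per run):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092")
val fromOffset = 0L       // placeholder: where the previous run stopped
val untilOffset = 1000L   // placeholder: latest offset at the start of this run
val ranges = Array(OffsetRange("my-topic", 0, fromOffset, untilOffset))
// Yields an RDD of (key, message) pairs covering exactly that offset window.
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)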
Looked at:
https://spark.apache.org/docs/latest/configuration.html
I don't think Spark supports per-stage speculation.
On Fri, Jan 22, 2016 at 10:15 AM, Adam McElwee wrote:
> I've used speculative execution a couple times in the past w/ good
> results, but I have one stage in
I've used speculative execution a couple times in the past w/ good results,
but I have one stage in my job with a non-idempotent operation in a
`foreachPartition` block. I don't see a way to disable speculative retry on
certain stages, but does anyone know of any tricks to help out here?
Spark
Hi,
I am getting this StackOverflowError when fetching recommendations from ALS.
Any help is much appreciated
int features = 100;
double alpha = 0.1;
double lambda = 0.001;
boolean implicit = true;
int iterations = 10;
ALS als = new ALS()
    .setRank(features)
    .setAlpha(alpha)
    .setLambda(lambda)
    .setImplicitPrefs(implicit)
    .setIterations(iterations);
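A commonly suggested mitigation for StackOverflowError in iterative ALS runs
is checkpointing, which truncates the growing RDD lineage; a minimal sketch
(the directory is a placeholder and must be on a fault-tolerant store):

// In Scala; JavaSparkContext has the same setCheckpointDir method.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")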
Hello
I installed Spark in a folder and start bin/sparkR from the console. Then I
execute the commands below and everything works fine. I can see the data as well.
hivecontext <<- sparkRHive.init(sc) ;
df <- loadDF(hivecontext, "/someHdfsPath", "orc")
showDF(df)
But when I run the same in RStudio, it throws the
Hi, I have a spark-streaming application which uses sparkOnHBase lib to do
streamBulkPut()
Without checkpointing everything works fine. But recently, upon enabling
checkpointing, I got the following exception -
16/01/22 01:32:35 ERROR executor.Executor: Exception in task 0.0 in stage 39.0
(TID
There have been optimizations in this area, such as:
https://issues.apache.org/jira/browse/SPARK-8125
You can also look at the parent issue.
Which Spark release are you using ?
> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta
> wrote:
>
>
> Hi,
>
> I have a SPARK
Hi Ted,
I am using Spark 1.5.2 as currently available in AWS EMR 4.x. The data is in
TSV format.
I do not see any effect of the work already done on this for the data
stored in Hive, as it takes around 50 mins just to collect the table
metadata over a 40-node cluster, and the time is much the same
Hi
Anyone have any idea on ClassTag in the Spark context?
On Fri, Jan 22, 2016 at 12:42 PM, Nagu Kothapalli wrote:
> Hi All
>
> Facing an issue with the CustomInputDStream object in Java
>
>
>
> public CustomInputDStream(StreamingContext ssc_, ClassTag classTag)
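On the ClassTag question: in Scala a ClassTag is normally supplied
implicitly, and can be materialized explicitly as below; from Java, the same
factory is reachable via ClassTag$.MODULE$.apply(SomeClass.class). A minimal
sketch:

import scala.reflect.{ClassTag, classTag}

// Materialize a ClassTag for a concrete type; a constructor such as the
// quoted CustomInputDStream(ssc_, classTag) could then be passed this value.
val tag: ClassTag[String] = classTag[String]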
(Apologies if this comes through twice; I sent it once before I'd
confirmed my mailing list subscription.)
I've been having lots of trouble with DataFrames whose columns have dots in
their names today. I know that in many places, backticks can be used to
quote column names, but the problem I'm running into now is that I can't
drop a column that has *no* dots in its name when there are *other* columns
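A possible workaround (a sketch, assuming a DataFrame df where "c" is the
dot-free column to drop): rebuild the projection with backticked names
instead of calling drop:

// Keep every column except "c", quoting each name so that dots elsewhere
// in the schema don't confuse analysis.
val keep = df.columns.filter(_ != "c").map(name => df.col(s"`$name`"))
val dropped = df.select(keep: _*)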
Hi Silvio,
Can you go into a little detail on how the back pressure works? Does it
block the receiver? Or does it temporarily save the incoming messages in
mem/disk? I have a custom actor receiver that uses store() to save data
to Spark. Would back pressure make the store() call block?
On 1/17/16,
Hi,
I have a Spark table (created from hiveContext) with a couple hundred
partitions and a few thousand files.
When I run a query on the table, Spark spends a lot of time (as seen in
the pyspark output) collecting these files from the several partitions.
After this the query starts running.
Is
Hi,
I am very new to Spark & Spark Streaming. I am planning to use Spark
Streaming for real-time processing.
I have created a streaming context and am checkpointing to an HDFS
directory for recovery purposes in case of executor & driver failures.
I am creating a DStream with an offset map
Me too. I had to shrink my dataset to get it to work. For us at least Spark
seems to have scaling issues.
Sent from my Verizon Wireless 4G LTE smartphone
Original message
From: "Sanders, Isaac B"
Date: 01/21/2016 11:18 PM (GMT-05:00)
To:
Hi,
I have a streaming application with
- 1 sec interval
- accepts data from a simulation through a MulticastSocket
The simulation sends out data using multiple clients/threads every 1 sec
interval. The input rate accepted by the streaming job looks strange.
- When clients = 10,000 the event rate
In SQLConf.scala, I found this:
val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
  key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
  defaultValue = Some(32),
  doc = "The degree of parallelism for schema merging and partition discovery of " +
    "Parquet data
Hi Srini,
If you want to get a value, you can do it like the following example in
Scala (other languages work similarly):
mat.rows.collect().apply(i)
val cov = mat.computeCovariance()
cov.apply(i, j)
Here mat is a RowMatrix and cov is a Matrix.
Hi,
Does anybody know how we can get the application status of a Spark app
using the API?
Currently it gives only the completed status as true/false.
I am trying to build an application-manager kind of thing where one can see
the apps deployed and their status, and do some actions based on
Emlyn,
Have you considered using pools?
http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools
I haven't tried it myself, but it looks like the pool setting is applied
per thread, which means it's possible to configure the fair scheduler so
that more than one job is on a
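A sketch of the per-thread pool assignment described on that page ("pool1"
is a placeholder that must match a pool in your fair scheduler allocation
file):

// Jobs submitted from this thread are scheduled in "pool1".
sc.setLocalProperty("spark.scheduler.pool", "pool1")
// ... run jobs ...
sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool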
The classpath is formed differently on the driver and the executors.
Cheers
On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale wrote:
> Is this issue only when the computations are in distributed mode?
> If I do (pseudo code):
> rdd.collect.call_to_hbase I don't get this error,
I tried --jars, which supposedly does that, but it did not work.
On Fri, Jan 22, 2016 at 4:33 PM Ajinkya Kale wrote:
> Hi Ted,
> Is there a way for the executors to have the hbase-protocol jar on their
> classpath?
>
> On Fri, Jan 22, 2016 at 4:00 PM Ted Yu
Is this issue only when the computations are in distributed mode?
If I do (pseudo code):
rdd.collect.call_to_hbase I don't get this error,
but if I do:
rdd.call_to_hbase.collect it throws this error.
On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale wrote:
> Unfortunately
Hi Ted,
Is there a way for the executors to have the hbase-protocol jar on their
classpath?
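One thing that may be worth trying (a sketch, with a placeholder path that
must exist on every worker node): put the jar on the executor classpath
explicitly.

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraClassPath", "/opt/jars/hbase-protocol.jar")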
On Fri, Jan 22, 2016 at 4:00 PM Ted Yu wrote:
> The classpath is formed differently on the driver and the executors.
>
> Cheers
>
> On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale
Vivek,
By default, Cassandra uses ¼ of the system memory, so in your case, it will be
around 8GB, which is fine.
If you have more Cassandra related question, it is better to post it on the
Cassandra mailing list. Also feel free to email me directly.
Mohammed
Author: Big Data Analytics with
New to Spark Streaming. My question is: I want to load the XML files into a
database [Cassandra] using Spark Streaming. Any suggestions please? Thanks in
advance.
--
Best Regards,
Sreeharsha Eedupuganti
Data Engineer
innData Analytics Private Limited
I am working on a proof of concept using Spark. I set up a small cluster on
AWS. I am looking for some part-time help with administration.
Kind Regards
Andy
This problem is fixed by restarting R from RStudio. Now I see:
16/01/22 08:08:38 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/01/22 08:08:38 ERROR RBackendHandler: on org.apache.spark.sql.hive.HiveContext failed
Error in value[[3L]](cond) : Spark SQL is not built with
Correction. I have to use spark.yarn.am.memoryOverhead because I'm in
yarn-client mode. I set it to 13% of the executor memory.
Also quite helpful was increasing the total overall executor memory.
It will be great when the Tungsten enhancements make their way into RDDs.
Thanks!
Arun
On Thu, Jan
Hi Andy,
Sorry this is in Scala but you may be able to do something similar? I use
Joda's DateTime class. I ran into a lot of difficulties with the serializer,
but if you are an admin on the box you'll have fewer issues by adding in some
Kryo serializers.
import org.joda.time
val
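A sketch of what such a registration can look like (JodaDateTimeSerializer
is a placeholder for whichever Kryo serializer implementation you pull in,
e.g. from the kryo-serializers library):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.DateTime

class JodaRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register DateTime with a dedicated serializer.
    kryo.register(classOf[DateTime], new JodaDateTimeSerializer())
  }
}
// Then point Spark at it:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", classOf[JodaRegistrator].getName)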
Offsets are stored in the checkpoint. If you want to manage offsets
yourself, don't restart from the checkpoint; specify the starting offsets
when you create the stream.
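A sketch of that second path (the Spark 1.x direct stream API; ssc,
kafkaParams, and the offset values are placeholders for the ones you manage
yourself):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Start exactly at stored offsets, bypassing the checkpoint entirely.
val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 12345L)
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))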
Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once
Regarding the small files
Option B is good too: keep the date as a timestamp and format it later.
Thanks,
-Durgesh
> On Jan 22, 2016, at 9:50 AM, Spencer, Alex (Santander)
> wrote:
>
> Hi Andy,
>
> Sorry this is in Scala but you may be able to do something similar? I use
> Joda's
Is it possible to reproduce the condition below with test code?
Thanks
On Fri, Jan 22, 2016 at 7:31 AM, Afshartous, Nick
wrote:
>
> Hello,
>
>
> We have a streaming job that consistently fails with the trace below.
> This is on an AWS EMR 4.2/Spark 1.5.2 cluster.
>
>
Related thread:
http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+when+using+Joda+DateTime
FYI
On Fri, Jan 22, 2016 at 6:50 AM, Spencer, Alex (Santander) <
alex.spen...@santander.co.uk.invalid> wrote:
> Hi Andy,
>
> Sorry this is in Scala but you may be able to do something similar? I use
>
Hi Yanbo
I recently coded up the trivial example from
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
but I do not get the same results. I'll put my code up on GitHub over the
weekend if anyone is interested.
Andy
From: Yanbo Liang
Date:
I am looking for an easy-to-use visualization tool for a KMeansModel
produced as a result of clustering.
Thanks
Ashutosh
If you turn on GC logging (like "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
you would be able to see why some jobs run for a long time.
The tuning guide (http://spark.apache.org/docs/latest/tuning.html) provides
some insight on this. Setting explicit partitioning helped in my case when
I was using
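A sketch of passing those flags to the executors, as the tuning guide
suggests:

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")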
Vivek:
I searched for 'cassandra gc pause' and found a few hits.
e.g.:
http://search-hadoop.com/m/qZFqM1c5nrn1Ihwf6=Re+GC+pauses+affecting+entire+cluster+
Keep in mind the effect of GC on shared nodes.
FYI
On Fri, Jan 22, 2016 at 7:09 PM, Mohammed Guller
wrote:
> For
Hi all - I'm running the Spark LDA algorithm on a dataset of roughly 3
million terms with a resulting RDD of approximately 20 GB on a 5-node
cluster with 10 executors (3 cores each) and 14GB of memory per executor.
As the application runs, I'm seeing progressively longer execution times
for the
This may be useful; you can try connectors:
https://academy.datastax.com/demos/getting-started-apache-spark-and-cassandra
https://spark-summit.org/2015/events/cassandra-and-spark-optimizing-for-data-locality/
Thanks,
-Durgesh
> On Jan 22, 2016, at 8:37 PM,
>
+ Spark standalone cluster
On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP)
> wrote:
We have the setup on Google Cloud Platform. Each node has 8 CPUs + 30GB
memory: 10 nodes for Spark, another 9 nodes for Cassandra.
From your description, putting the Cassandra daemon on the Spark cluster
should be feasible.
One aspect to be measured is how much locality can be achieved in this
setup - Cassandra is a distributed NoSQL store.
Cheers
On Fri, Jan 22, 2016 at 6:13 PM, wrote:
> + spark
Thanks Ted, also what is the suggested memory setting for the Cassandra process?
Regards
Vivek
On Sat, Jan 23, 2016 at 7:57 am, Ted Yu
> wrote:
> From your description, putting the Cassandra daemon on the Spark cluster
> should be feasible.
One aspect to be
I am not a Cassandra developer :-)
Can you use http://search-hadoop.com/ or ask on the Cassandra mailing list?
Cheers
On Fri, Jan 22, 2016 at 6:35 PM, wrote:
> Thanks Ted, also what is the suggested memory setting for the Cassandra
> process?
>
> Regards
> Vivek
> On Sat,
Hi Andrew,
Here is another option.
You can define a custom schema to specify the correct type for the time
column, as shown below:
import org.apache.spark.sql.types._
val customSchema = StructType(
  StructField("a", IntegerType, false) ::
  StructField("b", LongType, false) ::
Can you give us a bit more information?
How much memory does each node have ?
What's the current heap allocation for the Cassandra process and the executors?
Which Spark / Cassandra release are you using?
Thanks
On Fri, Jan 22, 2016 at 5:37 PM, wrote:
> Hi All,
> What is the
Hi All,
What is the right Spark/Cassandra cluster setup - having the Cassandra
cluster and the Spark cluster on different nodes, or should they be on the
same nodes?
We have them on different nodes, and performance tests show very bad
results for the Spark streaming jobs.
Please let us know.
Regards