Hi all,
I'm running Spark 1.2.0 on a 20-node YARN EMR cluster. I've noticed that
whenever I run a heavy computation job in parallel with other running
jobs, I get these kinds of exceptions:
[task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager - Lost task 820.0 in
Hi All,
We are joining large tables using Spark SQL and running into shuffle
issues. We have explored multiple options: using coalesce to reduce the number
of partitions, tuning various parameters like the disk buffer, reducing the data in
chunks, etc., all of which seem to help, by the way. What I would like to know is,
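For reference, here is a sketch of the kind of partition tuning we have been applying (Spark 1.2 API; the table names, paths, and the partition count of 200 are illustrative, not our actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("join-tuning"))
val sqlContext = new SQLContext(sc)

// Fewer shuffle partitions means fewer, larger shuffle files per reducer.
sqlContext.setConf("spark.sql.shuffle.partitions", "200")

val left = sqlContext.parquetFile("hdfs:///data/left")
val right = sqlContext.parquetFile("hdfs:///data/right")

// Coalesce the inputs so each task handles a bigger chunk of data.
left.coalesce(200).registerTempTable("l")
right.coalesce(200).registerTempTable("r")

val joined = sqlContext.sql("SELECT l.id, r.value FROM l JOIN r ON l.id = r.id")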
A friend told me that I should add the hive-site.xml file to the --files
option of the spark-submit command, but how can I run and debug my program inside
Eclipse?
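For local debugging I am trying something like the following sketch (assuming Spark 1.2, with hive-site.xml on the project classpath, e.g. in src/main/resources; the SHOW TABLES query is just a smoke test):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object LocalHiveDebug {
  def main(args: Array[String]): Unit = {
    // local[*] lets the job run inside Eclipse without a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("hive-debug").setMaster("local[*]"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").collect().foreach(println)
    sc.stop()
  }
}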
-- Original --
From: guxiaobo1982 <guxiaobo1...@qq.com>
Sent: Sunday, Feb 1, 2015 4:18 PM
Hi,
In order to let a local spark-shell connect to a remote standalone Spark
cluster and access the Hive tables there, I must put the hive-site.xml file into
the local Spark installation's conf path, but spark-shell can't even import the
default settings there; I found two errors:
<property>
Dictionary encoding of strings from Parquet has now been added and will be in 1.3.
This should reduce UTF8-to-String decoding significantly.
https://issues.apache.org/jira/browse/SPARK-5309
Hi,
I want to send streaming data to a Kafka topic. I have RDD data which I
converted into a JavaDStream; now I want to send it to the Kafka topic. I don't
want the Kafka sending code, I just need a foreachRDD implementation. My code
looks like this:
public void publishtoKafka(ITblStream t)
{
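What I am after is roughly this shape - a Scala sketch of the foreachRDD pattern (my real code is Java with JavaDStream, but the pattern is the same; the Kafka 0.8 producer calls, topic, and broker list below are placeholders):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

def publishToKafka(stream: DStream[String], topic: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Create the producer inside the partition so it is not
      // serialized with the closure and shipped to executors.
      val props = new Properties()
      props.put("metadata.broker.list", "broker1:9092")
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      val producer = new Producer[String, String](new ProducerConfig(props))
      records.foreach(r => producer.send(new KeyedMessage[String, String](topic, r)))
      producer.close()
    }
  }
}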
Cool! For all the times I had been modifying hive-site.xml I had only
dropped in the integer values - learn something new every day, eh?!
On Sun Feb 01 2015 at 9:36:23 AM Ted Yu yuzhih...@gmail.com wrote:
Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java :
I may be missing something here, but typically the hive-site.xml
configurations do not require you to place units within the configuration
itself. Both the retry.delay and socket.timeout values are in seconds, so
you should only need to place the integer value (which is interpreted as seconds).
On Sun Feb
Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java :
METASTORE_CLIENT_CONNECT_RETRY_DELAY("hive.metastore.client.connect.retry.delay",
"1s",
new TimeValidator(TimeUnit.SECONDS),
"Number of seconds for the client to wait between consecutive connection attempts"),
It
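In hive-site.xml terms - my reading of the HiveConf entry above, with 5 as an example value - that means a bare integer:

<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <!-- interpreted as seconds; no "s" suffix needed here -->
  <value>5</value>
</property>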
I am running on m3.xlarge instances on AWS with 12 GB worker memory and 10
GB executor memory.
On Sun, Feb 1, 2015, 12:41 PM Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
What is the machine configuration you are running it on?
On Mon, Feb 2, 2015 at 1:46 AM, Ankur Srivastava
Spark 1.2
SchemaRDD has a schema with decimal columns created like:
x1 = new StructField("a", DecimalType(14,4), true)
x2 = new StructField("b", DecimalType(14,4), true)
Registering it as a SQL temp table and running SQL queries on these columns,
including SUM etc., works fine, so the schema Decimal does
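The working part, as a sketch (imports as of Spark 1.2 - they move to org.apache.spark.sql.types in 1.3; the data and table name are made up):

import org.apache.spark.sql.{Row, SQLContext, StructType, StructField, DecimalType}

val sqlContext = new SQLContext(sc)
val schema = StructType(Seq(
  StructField("a", DecimalType(14, 4), true),
  StructField("b", DecimalType(14, 4), true)))

val rowRDD = sc.parallelize(Seq(
  Row(new java.math.BigDecimal("1.5000"), new java.math.BigDecimal("2.5000"))))

val schemaRDD = sqlContext.applySchema(rowRDD, schema)
schemaRDD.registerTempTable("t")
sqlContext.sql("SELECT SUM(a), SUM(b) FROM t").collect().foreach(println)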
Hi,
I plan to have logstash send log events (as key-value pairs) to Spark Streaming,
using Spark on Cassandra.
Being completely fresh to Spark, I have a couple of questions:
- is that a good idea at all, or would it be better to put e.g. Kafka in
between to handle traffic peaks
(IOW: how and
I have been using a logstash alternative, fluentd, to ingest the data into HDFS.
I had to configure fluentd to not append the data, so that Spark Streaming is
able to pick up the new logs.
-Liming
On 2 Feb, 2015, at 6:05 am, NORD SC jan.algermis...@nordsc.com wrote:
Hi,
I plan to
I think I found the issue causing it.
I was calling schemaRDD.coalesce(n).saveAsParquetFile to reduce the number
of partitions in the Parquet file - that is when the stack trace happens.
If I compress the partitions before creating the schemaRDD, then the
schemaRDD.saveAsParquetFile call works for
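For anyone hitting the same thing, this is the call shape that fails for me, plus an alternative I have not fully verified (n and the paths are placeholders):

// Fails for me: coalesce immediately before the Parquet write.
schemaRDD.coalesce(10).saveAsParquetFile("hdfs:///out/failing")

// Untested alternative: repartition does a full shuffle but also
// ends up with the reduced partition count.
schemaRDD.repartition(10).saveAsParquetFile("hdfs:///out/alternative")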
Here is the relevant snippet of code in my main program:
===
sparkConf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")
val summaryDataClass = classOf[SummaryData]
val summaryViewClass
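For completeness, a sketch of how the full registration looks with the registerKryoClasses helper added in Spark 1.2 (SummaryView is implied by the truncated line above; both classes are from my app):

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // With registrationRequired, any unregistered class throws at runtime.
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[SummaryData], classOf[SummaryView]))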
Hi Deep,
What is your configuration and what is the size of the 2 data sets?
Thanks
Arush
On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
I did not check the console because once the job starts I cannot run
anything else and have to force-shutdown the system. I
The configuration is 16 GB RAM and a 1 TB HD; I have a single-node Spark cluster.
Even after setting driver memory to 5g and executor memory to 3g, I get
this error. The size of the data set is 350 KB, and the set on which it works
well is hardly a few KB.
On Mon, Feb 2, 2015 at 1:18 PM, Arush Kharbanda
Hi,
Is there any better operation than union? I am using union and the cluster
is getting stuck with a large data set.
Thank you
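One alternative I am considering, in case it helps others (not sure it is better): when unioning many RDDs, pairwise RDD.union builds a deep lineage chain, while SparkContext.union creates a single UnionRDD in one step. A sketch, where rdds is a placeholder Seq of RDDs of the same type:

// Pairwise union in a loop builds a deep lineage chain:
val chained = rdds.reduce(_ union _)

// SparkContext.union flattens all inputs into one UnionRDD:
val flat = sc.union(rdds)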
Hi Deep,
what do you mean by stuck?
Jerry
On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Is there any better operation than union? I am using union and the cluster
is getting stuck with a large data set.
Thank you
Hi,
I keep facing this error when I run my application:
java.io.IOException: Connection from s1/:43741 closed
at
Hi Deep,
How do you know the cluster is not responsive because of Union?
Did you check the spark web console?
Best Regards,
Jerry
On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
The cluster hangs.
On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam
Thanks for the notification!
For now, I'll use the Kryo serializer without registering classes until the
bug fix has been merged into the next version of Spark (I guess that will
be 1.3, right?).
arun
On Sun, Feb 1, 2015 at 10:58 PM, Shixiong Zhu zsxw...@gmail.com wrote:
It's a bug that has
The cluster hangs.
On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Deep,
what do you mean by stuck?
Jerry
On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Is there any better operation than union? I am using union and the
Hi,
I want to know about your moving-average implementation. I am also doing some
time-series analysis of CPU performance, so I tried simple regression, but the
result is not good: the RMSE is 10, and when I extrapolate it just shoots up
linearly. I think I should first smooth out the data and then try
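For context, the smoothing I have in mind is a plain moving average; a minimal Scala sketch over an in-memory series (the window size of 5 and the data are arbitrary):

// Replace each point with the mean of a sliding window.
val series = Seq(1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0)
val window = 5
val smoothed = series.sliding(window).map(_.sum / window).toSeq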
Hi Tomer,
Are you able to look in your NodeManager logs to see if the NodeManagers
are killing any executors for exceeding memory limits? If you observe
this, you can solve the problem by bumping up
spark.yarn.executor.memoryOverhead.
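For example (1024 is only a starting point; the property is in megabytes):

// Headroom YARN allows beyond the executor heap before killing the container.
sparkConf.set("spark.yarn.executor.memoryOverhead", "1024")

or equivalently --conf spark.yarn.executor.memoryOverhead=1024 on spark-submit.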
-Sandy
On Sun, Feb 1, 2015 at 5:28 AM, Tomer Benyamini
I did not check the console because once the job starts I cannot run
anything else and have to force-shutdown the system. I commented out parts of
the code and tested; I suspect it is because of the union. So, I want to change it
to something else and see if the problem persists.
Thank you
On Mon, Feb 2,
It's a bug that has been fixed in https://github.com/apache/spark/pull/4258,
but the fix has not yet been merged.
Best Regards,
Shixiong Zhu
2015-02-02 10:08 GMT+08:00 Arun Lists lists.a...@gmail.com:
Here is the relevant snippet of code in my main program:
===