running 2 spark applications in parallel on yarn

2015-02-01 Thread Tomer Benyamini
Hi all, I'm running Spark 1.2.0 on a 20-node YARN EMR cluster. I've noticed that whenever I'm running a heavy computation job in parallel with other jobs, I get these kinds of exceptions: * [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager - Lost task 820.0 in

Is pair rdd join more efficient than regular rdd

2015-02-01 Thread Sunita Arvind
Hi All, We are joining large tables using Spark SQL and running into shuffle issues. We have explored multiple options - using coalesce to reduce the number of partitions, tuning various parameters like disk buffer, reducing the data in chunks, etc. - all of which seem to help, by the way. What I would like to know is,
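
For reference, a minimal Scala sketch of the pair-RDD side of the comparison, against the Spark 1.2 API; the paths, delimiter, and partition count below are placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.2

    val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))
    // Pair-RDD join: both sides are keyed up front, so the shuffle moves
    // only (key, value) pairs rather than whole rows.
    val left = sc.textFile("hdfs:///data/left").map { line =>
      val f = line.split(","); (f(0), f(1))
    }
    val right = sc.textFile("hdfs:///data/right").map { line =>
      val f = line.split(","); (f(0), f(1))
    }
    // coalesce afterwards reduces the number of output partitions, one of
    // the tunings mentioned above.
    left.join(right).coalesce(64).saveAsTextFile("hdfs:///data/joined")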

Re: Can't access remote Hive table from spark

2015-02-01 Thread guxiaobo1982
A friend told me that I should add the hive-site.xml file to the --files option of the spark-submit command, but how can I run and debug my program inside Eclipse? -- Original -- From: guxiaobo1982 guxiaobo1...@qq.com Sent: Sunday, Feb 1, 2015 4:18 PM
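
One way to get a similar effect inside an IDE, sketched under the assumption that the metastore URI is the only setting needed from hive-site.xml (host and port are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(
      new SparkConf().setAppName("hive-debug").setMaster("local[*]"))
    val hiveContext = new HiveContext(sc)
    // Mirrors the hive.metastore.uris property normally read from hive-site.xml.
    hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")
    hiveContext.sql("SHOW TABLES").collect().foreach(println)

Alternatively, placing hive-site.xml on the project's classpath should let HiveContext pick it up without any --files option.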

spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread guxiaobo1982
Hi, In order to let a local spark-shell connect to a remote Spark standalone cluster and access Hive tables there, I must put the hive-site.xml file into the local Spark installation's conf path, but spark-shell can't even import the default settings there. I found two errors: property

Re: Spark SQL Parquet - data are reading very very slow

2015-02-01 Thread Mick Davies
Dictionary encoding of Strings from Parquet has now been added and will be in 1.3. This should reduce UTF8-to-String decoding significantly. https://issues.apache.org/jira/browse/SPARK-5309

how to send JavaDStream RDD using foreachRDD using Java

2015-02-01 Thread sachin Singh
Hi, I want to send streaming data to a Kafka topic. I have RDD data which I converted into a JavaDStream; now I want to send it to a Kafka topic. I don't want the Kafka sending code, just the foreachRDD implementation. My code looks like: public void publishtoKafka(ITblStream t) {
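
A minimal Scala sketch of the foreachRDD skeleton being asked for (the Java API is analogous); the producer trait and helper below are hypothetical stand-ins, since the actual Kafka sending code is deliberately omitted:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical stand-in for a Kafka producer; plug the real client in here.
    trait MsgProducer { def send(topic: String, msg: String): Unit; def close(): Unit }
    def createProducer(): MsgProducer = ???

    def publishToKafka(stream: DStream[String], topic: String): Unit = {
      stream.foreachRDD { rdd =>
        // Create the (non-serializable) producer per partition, on the
        // executors, rather than once on the driver.
        rdd.foreachPartition { records =>
          val producer = createProducer()
          records.foreach(r => producer.send(topic, r))
          producer.close()
        }
      }
    }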

Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Denny Lee
Cool! For all the times I had been modifying the hive-site.xml I had only ever put in the integer values - learn something new every day, eh?! On Sun Feb 01 2015 at 9:36:23 AM Ted Yu yuzhih...@gmail.com wrote: Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java :

Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Denny Lee
I may be missing something here, but typically the hive-site.xml configurations do not require you to place an 's' within the configuration value itself. Both the retry.delay and socket.timeout values are in seconds, so you should only need to place the integer value (which is in seconds). On Sun Feb

Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Ted Yu
Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java : METASTORE_CLIENT_CONNECT_RETRY_DELAY("hive.metastore.client.connect.retry.delay", "1s", new TimeValidator(TimeUnit.SECONDS), "Number of seconds for the client to wait between consecutive connection attempts"), It

Re: Error when running spark in debug mode

2015-02-01 Thread Ankur Srivastava
I am running on m3.xlarge instances on AWS with 12 GB worker memory and 10 GB executor memory. On Sun, Feb 1, 2015, 12:41 PM Arush Kharbanda ar...@sigmoidanalytics.com wrote: What is the machine configuration you are running it on? On Mon, Feb 2, 2015 at 1:46 AM, Ankur Srivastava

Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
Spark 1.2 SchemaRDD has a schema with decimal columns created like x1 = new StructField("a", DecimalType(14,4), true) x2 = new StructField("b", DecimalType(14,4), true). Registering as a SQL temp table and doing SQL queries on these columns, including SUM etc., works fine, so the schema Decimal does

Logstash as a source?

2015-02-01 Thread NORD SC
Hi, I plan to have logstash send log events (as key-value pairs) to Spark Streaming, using Spark on Cassandra. Being completely fresh to Spark, I have a couple of questions: - is that a good idea at all, or would it be better to put e.g. Kafka in between to handle traffic peaks (IOW: how and

Re: Logstash as a source?

2015-02-01 Thread Tsai Li Ming
I have been using a logstash alternative - fluentd - to ingest the data into HDFS. I had to configure fluentd not to append the data, so that Spark Streaming will be able to pick up the new logs. -Liming On 2 Feb, 2015, at 6:05 am, NORD SC jan.algermis...@nordsc.com wrote: Hi, I plan to
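
A minimal sketch of that pattern, assuming fluentd writes complete new files into an HDFS directory (the path and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("log-ingest"), Seconds(30))
    // textFileStream only sees files created after the stream starts, which
    // is why appending to existing files does not work.
    val logs = ssc.textFileStream("hdfs:///logs/incoming")
    logs.count().print()
    ssc.start()
    ssc.awaitTermination()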

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
I think I found the issue causing it. I was calling schemaRDD.coalesce(n).saveAsParquetFile to reduce the number of partitions in the parquet file - in which case the stack trace happens. If I coalesce the partitions before creating the schemaRDD, then the schemaRDD.saveAsParquetFile call works for
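
A runnable sketch of the ordering that matters here, against the Spark 1.2 API, using a simple string schema for brevity (in the original post the columns are DecimalType(14,4); data and path are illustrative): coalesce the row RDD before applySchema, rather than coalescing the SchemaRDD.

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val schema = StructType(Seq(StructField("a", StringType, true)))
    val rowRDD = sc.parallelize(Seq(Row("1.2345"), Row("2.3456")), 8)
    // Reduce partitions on the plain row RDD, then build the SchemaRDD and
    // save it directly; coalescing the SchemaRDD itself triggered the
    // reported stack trace.
    val schemaRDD = sqlContext.applySchema(rowRDD.coalesce(2), schema)
    schemaRDD.saveAsParquetFile("hdfs:///out/example.parquet")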

ClassNotFoundException when registering classes with Kryo

2015-02-01 Thread Arun Lists
Here is the relevant snippet of code in my main program: === sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") sparkConf.set("spark.kryo.registrationRequired", "true") val summaryDataClass = classOf[SummaryData] val summaryViewClass
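
For context, a hedged sketch of full registration with registrationRequired turned on, against the Spark 1.2 API; the case classes below are stand-ins for the poster's SummaryData and SummaryView:

    import org.apache.spark.SparkConf

    case class SummaryData(id: Long, total: Double)
    case class SummaryView(name: String)

    val sparkConf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
    // registerKryoClasses (added in 1.2) populates spark.kryo.classesToRegister.
    sparkConf.registerKryoClasses(Array(classOf[SummaryData], classOf[SummaryView]))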

Re: Union in Spark

2015-02-01 Thread Arush Kharbanda
Hi Deep, What is your configuration and what is the size of the 2 data sets? Thanks Arush On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: I did not check the console because once the job starts I cannot run anything else and have to force shutdown the system. I

Re: Union in Spark

2015-02-01 Thread Deep Pradhan
The configuration is 16 GB RAM and a 1 TB HD; I have a single-node Spark cluster. Even after setting driver memory to 5g and executor memory to 3g, I get this error. The size of the data set is 350 KB, and the set with which it works well is hardly a few KBs. On Mon, Feb 2, 2015 at 1:18 PM, Arush Kharbanda

Union in Spark

2015-02-01 Thread Deep Pradhan
Hi, Is there any better operation than union? I am using union, and the cluster is getting stuck with a large data set. Thank you
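
For reference, a minimal union sketch; union itself is a narrow, shuffle-free operation that simply concatenates partitions, so a hang is more likely caused by whatever action follows it:

    val a = sc.parallelize(1 to 1000)
    val b = sc.parallelize(1001 to 2000)
    // No shuffle here: the result's partitions are a's followed by b's.
    val u = a.union(b)
    println(u.count())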

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than Union. I am using union and the cluster is getting stuck with a large data set. Thank you

Connection closed/reset by peers error

2015-02-01 Thread Kartheek.R
Hi, I keep facing this error when I run my application: java.io.IOException: Connection from s1/:43741 closed at

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, How do you know the cluster is not responsive because of Union? Did you check the spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam

Re: ClassNotFoundException when registering classes with Kryo

2015-02-01 Thread Arun Lists
Thanks for the notification! For now, I'll use the Kryo serializer without registering classes until the bug fix has been merged into the next version of Spark (I guess that will be 1.3, right?). arun On Sun, Feb 1, 2015 at 10:58 PM, Shixiong Zhu zsxw...@gmail.com wrote: It's a bug that has

Re: Union in Spark

2015-02-01 Thread Deep Pradhan
The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than Union. I am using union and the

Re: Window comparison matching using the sliding window functionality: feasibility

2015-02-01 Thread ashu
Hi, I want to know about your moving-average implementation. I am also doing some time-series analysis of CPU performance. I tried simple regression, but the result is not good: the RMSE is 10, but when I extrapolate it just shoots up linearly. I think I should first smooth out the data and then try
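
One way to smooth a series before fitting, sketched with MLlib's sliding helper (a developer API available in this era of Spark; the data and window size are illustrative):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val series = sc.parallelize(Seq(1.0, 2.0, 4.0, 3.0, 5.0, 6.0))
    val window = 3
    // Each output element is the mean of `window` consecutive points.
    val smoothed = series.sliding(window).map(w => w.sum / window)
    smoothed.collect().foreach(println)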

Re: running 2 spark applications in parallel on yarn

2015-02-01 Thread Sandy Ryza
Hi Tomer, Are you able to look in your NodeManager logs to see if the NodeManagers are killing any executors for exceeding memory limits? If you observe this, you can solve the problem by bumping up spark.yarn.executor.memoryOverhead. -Sandy On Sun, Feb 1, 2015 at 5:28 AM, Tomer Benyamini
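
The tuning Sandy describes, as a SparkConf sketch; the value is in megabytes and purely illustrative, not a recommendation:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("heavy-computation")
      // Extra off-heap headroom YARN allows per executor container, in MB.
      .set("spark.yarn.executor.memoryOverhead", "1024")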

Re: Union in Spark

2015-02-01 Thread Deep Pradhan
I did not check the console because once the job starts I cannot run anything else and have to force-shutdown the system. I commented out parts of the code and tested. I suspect it is because of union, so I want to change it to something else and see if the problem persists. Thank you On Mon, Feb 2,

Re: ClassNotFoundException when registering classes with Kryo

2015-02-01 Thread Shixiong Zhu
It's a bug that has been fixed in https://github.com/apache/spark/pull/4258 but has not yet been merged. Best Regards, Shixiong Zhu 2015-02-02 10:08 GMT+08:00 Arun Lists lists.a...@gmail.com: Here is the relevant snippet of code in my main program: ===