Spark Streaming S3 Error

2016-05-20 Thread Benjamin Kim
I am trying to stream files from an S3 bucket using CDH 5.7.0's version of Spark 1.6.0, but it does not work. I keep getting this error: Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand stack Exception Details: Location:

What factors decide the number of executors when doing a Spark SQL insert in Mesos?

2016-05-20 Thread SRK
Hi, What factors decide the number of executors when doing a Spark SQL insert? Right now, when I submit my job in Mesos, I see only 2 executors getting allocated all the time. Thanks!
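
For context, a hedged sketch of the settings usually involved in how many executors and how many insert tasks a Spark SQL job gets on Mesos; the values are purely illustrative.

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune to the cluster and the job.
val conf = new SparkConf()
  .set("spark.cores.max", "40")               // total cores the app may claim from Mesos
  .set("spark.executor.memory", "4g")         // memory requested per executor
  .set("spark.sql.shuffle.partitions", "200") // number of tasks on the shuffle/insert side
```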

set spark 1.6 with Hive 0.14 ?

2016-05-20 Thread kali.tumm...@gmail.com
Hi All, Is there a way to ask Spark and Spark SQL to use Hive 0.14 instead of the built-in Hive 1.2.1? I am testing spark-sql locally by downloading Spark 1.6 from the internet. I want to execute my Hive queries in Spark SQL using Hive version 0.14. Can I go back to the previous version just for a
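
For the metastore side there are documented spark.sql.hive.metastore.* settings; note they swap the metastore client version, while the Hive support compiled into Spark stays at 1.2.1. A hedged sketch:

```scala
import org.apache.spark.SparkConf

// 0.14.0 is within the supported client range; "maven" tells Spark to download
// the matching Hive client jars (a local classpath can be given instead).
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.version", "0.14.0")
  .set("spark.sql.hive.metastore.jars", "maven")
```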

Re: Memory issues when trying to insert data in the form of ORC using Spark SQL

2016-05-20 Thread swetha kasireddy
Also, the Spark SQL insert seems to take only two tasks per stage. That might be the reason why it does not have sufficient memory. Is there a way to increase the number of tasks when doing the SQL insert? Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total | Input | Output | Shuffle Read | Shuffle
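
One way to get more, smaller write tasks is to repartition the source DataFrame before the insert. A minimal sketch with placeholder table names and an illustrative partition count, assuming a spark-shell style sqlContext:

```scala
// staging_records / records_orc are placeholder table names; 48 is illustrative.
val df = sqlContext.table("staging_records")
df.repartition(48)          // spread the insert over more tasks than the default 2
  .write
  .mode("append")
  .format("orc")
  .saveAsTable("records_orc")
```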

Memory issues when trying to insert data in the form of ORC using Spark SQL

2016-05-20 Thread SRK
Hi, I see some memory issues when trying to insert the data in the form of ORC using Spark SQL. Please find the query and exception below. Any idea as to why this is happening? sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING) PARTITIONED BY (datePartition

RE: Can not set spark dynamic resource allocation

2016-05-20 Thread David Newberger
Hi All, The error you are seeing looks really similar to SPARK-13514 to me, though I could be wrong: https://issues.apache.org/jira/browse/SPARK-13514 Can you check yarn.nodemanager.local-dirs in your YARN configuration for "file://"? Cheers! David Newberger

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Cui, Weifeng
Sorry, here is the NodeManager log. application_1463692924309_0002 is my test. Hope this will help. http://pastebin.com/0BPEcgcW

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Marcelo Vanzin
Hi Weifeng, That's the Spark event log, not the YARN application log. You get the latter using the "yarn logs" command.

Logstash to collect Spark logs

2016-05-20 Thread Ashish Kumar Singh
We are trying to collect Spark logs using Logstash, for parsing application logs and collecting useful info. We can read the NodeManager logs but are unable to read Spark application logs using Logstash. Current setup for Spark logs and Logstash: 1- Spark runs on YARN. 2- Using log4j SocketAppenders to

Re: Dataset API and avro type

2016-05-20 Thread Michael Armbrust
What is the error? I would definitely expect it to work with kryo at least.
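
For reference, a minimal sketch of the kryo fallback being suggested, in a spark-shell style session; MyAvroRecord here is only a stand-in for the real Avro-generated class.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Stand-in for an Avro-generated class (which has no built-in encoder).
case class MyAvroRecord(id: String, payload: String)

// Kryo-serialize whole objects instead of relying on a reflective encoder.
implicit val avroEncoder: Encoder[MyAvroRecord] = Encoders.kryo[MyAvroRecord]

val avroRdd = sc.parallelize(Seq(MyAvroRecord("1", "a"), MyAvroRecord("2", "b")))
val ds = sqlContext.createDataset(avroRdd)
println(ds.count())
```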

Re: Wide Datasets (v1.6.1)

2016-05-20 Thread Michael Armbrust
> > I can provide an example/open a Jira if there is a chance this will be > fixed. > Please do! Ping me on it. Michael

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Cui, Weifeng
Here is the application log for this spark job. http://pastebin.com/2UJS9L4e Thanks, Weifeng

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Aulakh, Sahib
Yes, it is YARN. We have configured the Spark shuffle service with the YARN NodeManager, but something must be off. We will send you the application log on Pastebin.

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Ted Yu
Since yarn-site.xml was cited, I assume the cluster runs YARN.

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Rodrick Brown
Is this YARN or Mesos? For the latter you need to start an external shuffle service.

Re: Can not set spark dynamic resource allocation

2016-05-20 Thread Ted Yu
Can you retrieve the log for application_1463681113470_0006 and pastebin it? Thanks

Can not set spark dynamic resource allocation

2016-05-20 Thread Cui, Weifeng
Hi guys, Our team has a Hadoop 2.6.0 cluster with Spark 1.6.1. We want to set up dynamic resource allocation for Spark and we followed the link below. After the changes, all Spark jobs failed. https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation This test was
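
For comparison, a hedged sketch of the settings that page asks for; values are illustrative, and the YARN side is shown as comments because it lives in yarn-site.xml, not in Spark code.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // requires the external shuffle service
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")

// On YARN, each NodeManager also needs the auxiliary service in yarn-site.xml:
//   yarn.nodemanager.aux-services                     = mapreduce_shuffle,spark_shuffle
//   yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService
// and the spark-<version>-yarn-shuffle.jar on the NodeManager classpath.
```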

Re: rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Mail.com
Yes. > On May 20, 2016, at 10:11 AM, Sahil Sareen wrote: > Did you see this only for big files?

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-20 Thread Ramaswamy, Muthuraman
No, I haven't enabled Kerberos. Just the calls as specified in the Stack Overflow thread on how to use the schema-registry-based serializer. ~Muthu
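
For readers following along, a hedged sketch of how that direct stream is usually wired up with the Confluent decoder; the broker, registry URL and topic are placeholders, and this assumes the Confluent serializer jars are on the classpath.

```scala
import io.confluent.kafka.serializers.KafkaAvroDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("avro-direct").setMaster("local[2]"), Seconds(10))

// Placeholder endpoints; the registry URL is what lets the decoder fetch schemas.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "schema.registry.url"  -> "http://schema-registry:8081")

// Both key and value decoded through the schema-registry-aware decoder.
val stream = KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder](
  ssc, kafkaParams, Set("my-topic"))

stream.map { case (_, value) => value.toString }.print()
ssc.start()
ssc.awaitTermination()
```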

Wide Datasets (v1.6.1)

2016-05-20 Thread Don Drake
I have been working to create a DataFrame that contains a nested structure. The first attempt is to create an array of structures. I've written previously on this list how it doesn't work in DataFrames in 1.6.1, but it does in 2.0. I've continued my experimenting and have it working in

Problems finding the original objects after HashingTF()

2016-05-20 Thread Pasquinell Urbani
Hi all, I'm following a TF-IDF example but I'm having some issues that I'm not sure how to fix. The input is the following: val test = sc.textFile("s3n://.../test_tfidf_products.txt") test.collect.mkString("\n") which prints: test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[370] at
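
For reference, a minimal sketch of the usual RDD-based TF-IDF flow; it assumes each line is one document and uses naive whitespace tokenisation. Because hashing is one-way, the original documents are kept next to their vectors with zip.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// One document per line, tokenised naively on spaces.
val docs: RDD[Seq[String]] = sc.textFile("s3n://.../test_tfidf_products.txt")
  .map(_.split(" ").toSeq)

val tf: RDD[Vector] = new HashingTF().transform(docs)
tf.cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

// HashingTF cannot be inverted (a hashed index does not map back to a term),
// so keep each original document paired with its vector explicitly.
val docsWithTfidf: RDD[(Seq[String], Vector)] = docs.zip(tfidf)
```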

Re: Spark.default.parallelism can not set reduce number

2016-05-20 Thread Ovidiu-Cristian MARCU
You can check org.apache.spark.sql.internal.SQLConf for other default settings as well. val SHUFFLE_PARTITIONS = SQLConfigBuilder("spark.sql.shuffle.partitions") .doc("The default number of partitions to use when shuffling data for joins or aggregations.") .intConf

StackOverflowError in Spark SQL

2016-05-20 Thread Jeff Jones
I'm running Spark 1.6.0 in a standalone cluster. Periodically I've seen StackOverflowErrors when running queries; an example is below. In the past I've been able to avoid such situations by ensuring we don't have too many arguments in 'in' clauses or too many unioned queries, both of which seem to

Re: rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Mail.com
Hi Sahil, I have seen this with high GC time. Do you ever get this error with small-volume files? Pradeep

Re: rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Sahil Sareen
I'm not sure if this happens on small files or big ones as I have a mix of them always. Did you see this only for big files?

Re: Splitting RDD by partition

2016-05-20 Thread Sun Rui
I think the latter approach is better, since it can avoid unnecessary computation by filtering out the unneeded partitions. It is better to cache the previous RDD so that it won't be computed twice.
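
A minimal sketch of that "filter by partition index" idea, with a toy placeholder RDD; the parent is cached so the two halves do not recompute it.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Keep only the partitions whose index falls in [from, until).
def partitionRange[T: ClassTag](r: RDD[T], from: Int, until: Int): RDD[T] =
  r.mapPartitionsWithIndex(
    (idx, iter) => if (idx >= from && idx < until) iter else Iterator.empty,
    preservesPartitioning = true)

val rdd   = sc.parallelize(1 to 100, 8).cache()   // placeholder parent RDD
val mid   = rdd.partitions.length / 2
val left  = partitionRange(rdd, 0, mid)
val right = partitionRange(rdd, mid, rdd.partitions.length)
```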

rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Sahil Sareen
Hey all, I'm using Spark 1.6.1 and occasionally seeing executors lost, which hurts my application performance, due to these errors. Can someone please list out all the possible problems that could cause this? Full log: 16/05/19 02:17:54 ERROR ContextCleaner: Error cleaning broadcast 266685
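
The timeouts behind this message are tunable; a hedged sketch of the ones commonly raised (values illustrative, and raising them treats the symptom rather than the GC or skew behind it).

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "300s")           // default 120s, matching the error message
  .set("spark.executor.heartbeatInterval", "30s") // keep well below spark.network.timeout
```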

Re: Starting executor without a master

2016-05-20 Thread Mathieu Longtin
Correct, what I do to start workers is the equivalent of start-slaves.sh. It ends up running the same command on the worker servers as start-slaves does. It definitely uses all workers, and workers starting later pick up work as well. If you have a long-running job, you can add workers

Re: Spark.default.parallelism can not set reduce number

2016-05-20 Thread Takeshi Yamamuro
You need to use `spark.sql.shuffle.partitions`. // maropu
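
In other words, spark.default.parallelism only applies to RDD shuffles; DataFrame/SQL shuffles take their reducer count from spark.sql.shuffle.partitions. A small sketch with the value from the question:

```scala
// Per-session; it can also go in spark-defaults.conf or be passed with
// --conf spark.sql.shuffle.partitions=20 on spark-submit.
sqlContext.setConf("spark.sql.shuffle.partitions", "20")
```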

Spark.default.parallelism can not set reduce number

2016-05-20 Thread 喜之郎
Hi all. I set spark.default.parallelism to 20 in spark-defaults.conf and sent this file to all nodes. But I found the reduce number is still the default value, 200. Does anyone else encounter this problem? Can anyone give some advice? [Stage 9:>

Re: Is there a way to merge parquet small files?

2016-05-20 Thread Takeshi Yamamuro
Many small files can cause technical issues in both HDFS and Spark; that said, they do not generate many stages and tasks in recent versions of Spark. // maropu On Fri, May 20, 2016 at 2:41 PM, Gavin Yue wrote: > For logs file I would suggest save as gziped text file
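
When compaction is still wanted, one common approach is to read the small files and rewrite them with fewer partitions. A hedged sketch with placeholder paths and an illustrative target file count:

```scala
val merged = sqlContext.read.parquet("hdfs:///data/events/small_files")
merged.coalesce(16)              // roughly one output file per remaining partition
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/events/merged")
```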

Dataset API and avro type

2016-05-20 Thread Han JU
Hello, I'm looking at the Dataset API in 1.6.1 and also in the upcoming 2.0. However, it does not seem to work with Avro data types: object Datasets extends App { val conf = new SparkConf() conf.setAppName("Dataset") conf.setMaster("local[2]") conf.setIfMissing("spark.serializer",

Fwd: Spark CacheManager Thread-safety

2016-05-20 Thread Pietro Gentile
Hi all, I have a series of doubts about the CacheManager used by SQLContext to cache DataFrames. My use case requires different threads persisting/reading DataFrames concurrently. Using Spark, I realized that persistence really does not work in parallel mode. I would like it if I'm persisting a data

Spark CacheManager Thread-safety

2016-05-20 Thread Pietro Gentile
Hi all, I have a series of doubts about the CacheManager used by SQLContext to cache DataFrames. My use case requires different threads persisting/reading DataFrames concurrently. Using Spark, I realized that persistence really does not work in parallel mode. I would like it if I'm persisting a data

Re: Splitting RDD by partition

2016-05-20 Thread shlomi
Another approach I found: First, I make a PartitionsRDD class which only takes a certain range of partitions - case class PartitionsRDDPartition(val index:Int, val origSplit:Partition) extends Partition {} class PartitionsRDD[U: ClassTag](var

Re: Starting executor without a master

2016-05-20 Thread Mich Talebzadeh
OK, this is basically from my notes for Spark standalone. The worker process is the slave process. You start workers as you showed: $SPARK_HOME/sbin/start-slaves.sh Now that picks up the worker host node names from the $SPARK_HOME/conf/slaves file. So you still have to tell

Re: Query about how to estimate cpu usage for spark

2016-05-20 Thread Mich Talebzadeh
Please note that CPU usage varies with time; it is not a fixed value. First, have a look at the Spark GUI that runs under port 4040, under the Jobs tab. Then use jps to identify the Spark process: jps | grep SparkSubmit. Using the process name, start jmonitor on the OS and specify the SparkSubmit process. It