Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-24 Thread Georg Heiler
Maybe a UDF to flatten is an interesting option as well: http://stackoverflow.com/q/42888711/2587904. Would a UDAF be more performant? shyla deshpande wrote on Fri, 24 Mar 2017 at 04:04: > Thanks a million Yong. Great help!!! It solved my problem. > > On Thu, Mar
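
A minimal sketch of the flatten idea, assuming a DataFrame df with an array-of-arrays column named "nested" (both names are made up for illustration):

    import org.apache.spark.sql.functions.udf

    // Flatten a Seq[Seq[Int]] column into a single Seq[Int]
    val flattenUdf = udf((xs: Seq[Seq[Int]]) => xs.flatten)

    val flattened = df.withColumn("flat", flattenUdf(df("nested")))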

skipping header in multiple files

2017-03-24 Thread nayan sharma
Hi, I wanted to skip all the headers of the CSVs present in a directory. After searching on Google I got to know that it can be done using sc.wholeTextFiles. Can anyone suggest how to do that in Scala? Thanks & Regards, Nayan Sharma
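
A sketch of that approach, assuming each file is small enough to hold in memory (wholeTextFiles loads files whole) and the header is the first line of every file; the path is made up:

    // Each element is (filePath, fileContent); drop the first line of every file
    val rows = sc.wholeTextFiles("/data/csv-dir")
      .flatMap { case (_, content) => content.split("\n").drop(1) }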

spark-submit config via file

2017-03-24 Thread , Roy
Hi, I am trying to deploy a Spark job using spark-submit, which takes a bunch of parameters like spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode cluster --executor-memory 3072m --executor-cores 4 --files streaming.conf spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf
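
One way to move those flags into a file is spark-submit's --properties-file option; a sketch, with the file name and contents assumed:

    # job.conf (example values)
    spark.master            yarn
    spark.submit.deployMode cluster
    spark.executor.memory   3072m
    spark.executor.cores    4

    spark-submit --class StreamingEventWriterDriver \
      --properties-file job.conf \
      spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar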

[Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread bsikander
Spark version: 1.6.2 Hadoop: 2.6.0 Cluster: All VMs are deployed on AWS. 1 Master (t2.large) 1 Secondary Master (t2.large) 5 Workers (m4.xlarge) Zookeeper (t2.large) Recently, 2 of our workers went down with an out-of-memory exception. java.lang.OutOfMemoryError: GC overhead limit exceeded (max

Spark 2.0.2 : Hang at "org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)"

2017-03-24 Thread Ravindra
Hi All, My Spark job hangs here... Looking into the thread dump I noticed that it hangs (stack trace given below) on the count action on a dataframe (given below). The data is very small, actually no more than 10 rows. I noticed some JIRAs about this issue but all are resolved-closed in

Re: Spark 2.0.2 : Hang at "org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)"

2017-03-24 Thread Ravindra
Also noticed that there are 8 "dispatcher-event-loop" threads (0 through 7) and 8 "map-output-dispatcher" threads (0 through 7), all waiting at the same location in the code, that is: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Yong Zhang
I am not 100% sure, but normally a "dispatcher-event-loop" OOM means the driver OOMed. Are you sure it was your workers that OOMed? Yong From: bsikander Sent: Friday, March 24, 2017 5:48 AM To: user@spark.apache.org Subject: [Worker Crashing]

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Yong Zhang
I have never experienced a worker OOM, and very rarely see it reported online. So my guess is that you will have to generate a heap dump file and analyze it. Yong From: Behroz Sikander Sent: Friday, March 24, 2017 9:15 AM To: Yong Zhang Cc: user@spark.apache.org
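
A heap dump of a running worker can be taken with the JDK's jmap tool; a sketch, with the PID left for you to fill in:

    jps | grep Worker                                  # find the worker JVM's pid
    jmap -dump:format=b,file=worker-heap.hprof <pid>   # open the .hprof in MAT or jhat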

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Yong Zhang
Not sure if anyone else here can help you. But if I were you, I would set SPARK_DAEMON_MEMORY to 2g, to bump the worker to 2G. Even though the worker's responsibility is very limited, in today's world, who knows. Give 2g a try and see if the problem goes away. BTW, in our production, I
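
SPARK_DAEMON_MEMORY is picked up from conf/spark-env.sh on each worker machine; a sketch:

    # conf/spark-env.sh
    export SPARK_DAEMON_MEMORY=2g   # heap for the master/worker daemon JVMs (default is 1g)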

Re: spark-submit config via file

2017-03-24 Thread Yong Zhang
Of course it is possible. You can always set any configuration in your application through the API, instead of passing it in through the CLI. val sparkConf = new SparkConf().setAppName(properties.getProperty("appName")).setMaster(properties.getProperty("master")).set(xxx, properties.getProperty("xxx")) Your error is
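
A fuller sketch of that idea, loading a properties file and feeding it into SparkConf; the file name and property keys are assumptions:

    import java.io.FileInputStream
    import java.util.Properties
    import org.apache.spark.SparkConf

    val properties = new Properties()
    properties.load(new FileInputStream("streaming.conf"))

    val sparkConf = new SparkConf()
      .setAppName(properties.getProperty("appName"))
      .setMaster(properties.getProperty("master"))
      .set("spark.executor.memory", properties.getProperty("executorMemory"))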

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Behroz Sikander
Yeah, we also didn't find anything related to this online. Are you aware of any memory leaks in the 1.6.2 worker which might be causing this? Do you know of any documentation which explains all the tasks that a worker performs? Maybe we can get some clue from there. Regards, Behroz On

unable to stream kafka messages

2017-03-24 Thread kaniska
Hi, Currently encountering the following exception while working with the below-mentioned code snippet: > Please suggest the correct approach for reading the stream into a sql > schema. > If I add 'tweetSchema' while reading the stream, it errors out with the message - > we can not change static schema

Re: how to read object field within json file

2017-03-24 Thread Michael Armbrust
I'm not sure you can parse this as an Array, but you can hint to the parser that you would like to treat source as a map instead of as a struct. This is a good strategy when you have dynamic columns in your data. Here is an example of the schema you can use to parse this JSON and also how to use
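
A sketch of that schema hint, assuming a top-level field named "source" whose keys are dynamic; the string value type and file path are assumptions:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions.{col, explode}

    val schema = new StructType()
      .add("source", MapType(StringType, StringType))

    val parsed = spark.read.schema(schema).json("/path/to/file.json")
    parsed.select(explode(col("source"))).show()   // one row per key/value pair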

Re: EXT: Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
Thanks for the reply. Yeah, I found the same doc and am able to use multiple cores in spark-shell; however, when I use pyspark, it appears to use only one core. I am wondering if this is something I didn't configure correctly or something not supported in pyspark. On Fri, Mar 24, 2017 at 3:52 PM,

Re: how to read object field within json file

2017-03-24 Thread Yong Zhang
I missed the part about passing in a schema to force the "struct" into a Map, then using explode. Good option. Yong From: Michael Armbrust Sent: Friday, March 24, 2017 3:02 PM To: Yong Zhang Cc: Selvam Raman; user Subject: Re: how to read object

Re: unable to stream kafka messages

2017-03-24 Thread Michael Armbrust
Encoders can only map data into an object if those columns already exist. When we are reading from Kafka, we just get a binary blob and you'll need to help Spark parse that first. Assuming your data is stored in JSON it should be pretty straightforward. streams = spark .readStream()
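
The same parse step as a sketch in Scala, assuming Spark 2.1+ with the spark-sql-kafka-0-10 package; the broker address, topic name, and schema are all assumptions:

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    val tweetSchema = new StructType()   // assumed schema
      .add("id", LongType)
      .add("text", StringType)

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "tweets")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")             // binary blob -> string
      .select(from_json(col("json"), tweetSchema).as("tweet"))  // string -> struct
      .select("tweet.*")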

Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
Hi, I am wondering: does pyspark standalone (local) mode support multiple cores/executors? Thanks, Li

Re: EXT: Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Kadam, Gangadhar (GE Aviation, Non-GE)
In local mode, all processes are executed inside a single JVM. An application is started in local mode by setting the master to local, local[*], or local[n]. Settings such as spark.executor.cores are not applicable in local mode because there is only one embedded executor. In Standalone
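
So in local mode the core count is controlled by the master string itself; for example:

    pyspark --master "local[4]"   # one JVM, 4 worker threads
    pyspark --master "local[*]"   # one thread per logical core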

KMean clustering resulting Skewed Issue

2017-03-24 Thread Reth RM
Hi, I am using Spark k-means for clustering records consisting of news documents; the vectors are created by applying tf-idf. The dataset I am using for testing right now is the ground-truth-classified http://qwone.com/~jason/20Newsgroups/ The issue is that all the documents are getting assigned to the same
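
For reference, a minimal tf-idf + k-means pipeline on the RDD API that matches this setup; the path, k, and iteration count are assumptions:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    // One whitespace-tokenized document per line
    val docs = sc.textFile("/data/20news").map(_.split("\\s+").toSeq)

    val tf = new HashingTF().transform(docs)
    tf.cache()
    val tfidf = new IDF().fit(tf).transform(tf)

    val model = KMeans.train(tfidf, 20, 50)   // k = 20 newsgroups, 50 iterations

Since k-means uses Euclidean distance, normalizing the tf-idf vectors (e.g., with mllib's Normalizer) before clustering is often worth trying when all points collapse into one cluster.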

Re: unable to stream kafka messages

2017-03-24 Thread kaniska Mandal
Hi Michael, Thanks much for the suggestion. I was wondering - what's the best way to deserialize the 'value' field? On Fri, Mar 24, 2017 at 11:47 AM, Michael Armbrust wrote: > Encoders can only map data into an object if those columns already exist. > When we are

Application kill from UI do not propagate exception

2017-03-24 Thread Noorul Islam Kamal Malmiyoda
Hi all, I am trying to trap the UI kill event of a Spark application from the driver. Somehow the exception thrown is not propagated to the driver main program. See for example the spark-shell session below. Is there a way to get hold of this event and shut down the driver program? Regards, Noorul
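
One avenue that may be worth checking, sketched below: register a SparkListener and react to onApplicationEnd. Whether that callback actually fires on the UI kill path is an assumption to verify:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

    sc.addSparkListener(new SparkListener {
      override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
        // hypothetical shutdown hook for the driver program
        println(s"Application ended at ${end.time}; shutting down driver")
      }
    })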

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Behroz Sikander
Thank you for the response. Yes, I am sure because the driver was working fine. Only 2 workers went down with OOM. Regards, Behroz On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang wrote: > I am not 100% sure, but normally "dispatcher-event-loop" OOM means the > driver OOM.