K-means clustering resulting in skewed clusters

2017-03-24 Thread Reth RM
Hi, I am using Spark k-means for clustering records that consist of news documents; the vectors are created by applying TF-IDF. The dataset I am using for testing right now is the ground-truth classified http://qwone.com/~jason/20Newsgroups/ The issue is that all the documents are getting assigned to the same
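A minimal sketch of that kind of pipeline, assuming one document per line and illustrative paths, column names, and parameters (this is not the poster's actual code). Printing the per-cluster counts makes the skew visible; it is often worth checking whether the TF-IDF vectors are normalized before running k-means.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("news-kmeans").getOrCreate()

// Hypothetical input: one news document per line; the path is illustrative.
val docs = spark.read.textFile("/data/20news/*").toDF("text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans = new KMeans().setK(20).setSeed(1L)  // 20 newsgroups -> k = 20

val model = new Pipeline().setStages(Array(tokenizer, tf, idf, kmeans)).fit(docs)

// Cluster sizes: if one cluster holds nearly everything, the assignment is skewed.
model.transform(docs).groupBy("prediction").count().orderBy("prediction").show()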

Re: unable to stream kafka messages

2017-03-24 Thread kaniska Mandal
Hi Michael, Thanks much for the suggestion. I was wondering - what's the best way to deserialize the 'value' field? On Fri, Mar 24, 2017 at 11:47 AM, Michael Armbrust wrote: > Encoders can only map data into an object if those columns already exist. > When we are

Re: how to read object field within json file

2017-03-24 Thread Yong Zhang
I missed the part about passing in a schema to force the "struct" to a Map and then using explode. Good option. Yong From: Michael Armbrust Sent: Friday, March 24, 2017 3:02 PM To: Yong Zhang Cc: Selvam Raman; user Subject: Re: how to read object

Re: EXT: Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
Thanks for the reply. Yeah, I found the same doc and am able to use multiple cores in spark-shell; however, when I use pyspark, it appears to use only one core. I am wondering if this is something I didn't configure correctly or something not supported in pyspark. On Fri, Mar 24, 2017 at 3:52 PM,

Re: EXT: Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Kadam, Gangadhar (GE Aviation, Non-GE)
In local mode all processes are executed inside a single JVM. An application is started in local mode by setting master to local, local[*] or local[n]. Settings such as spark.executor.cores are not applicable in local mode because there is only one embedded executor. In Standalone
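For reference, a minimal sketch (Scala here, but the pyspark builder looks the same) of controlling the thread count in local mode; the standalone settings in the comments use a hypothetical master URL.

import org.apache.spark.sql.SparkSession

// Local mode: one JVM, one embedded executor; local[4] gives it 4 worker threads
// (local[*] uses all available cores).
val spark = SparkSession.builder()
  .master("local[4]")
  .appName("local-mode-demo")
  .getOrCreate()

// On a standalone cluster you would instead point at the master URL and size the
// executors explicitly, e.g. (hypothetical host):
//   .master("spark://master-host:7077")
//   .config("spark.executor.cores", "2")
//   .config("spark.executor.memory", "2g")

println(spark.sparkContext.defaultParallelism)  // 4 here: mirrors the local thread count
spark.stop()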

Multiple cores/executors in Pyspark standalone mode

2017-03-24 Thread Li Jin
Hi, I am wondering, does pyspark standalone (local) mode support multiple cores/executors? Thanks, Li

Re: unable to stream kafka messages

2017-03-24 Thread Michael Armbrust
Encoders can only map data into an object if those columns already exist. When we are reading from Kafka, we just get a binary blob and you'll need to help Spark parse that first. Assuming your data is stored in JSON it should be pretty straightforward. streams = spark.readStream()
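A minimal sketch of that approach, assuming the Kafka payload is JSON; the schema, topic, and broker address below are made up for illustration (from_json is available from Spark 2.1 onward).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-json").getOrCreate()
import spark.implicits._

// Hypothetical schema for the JSON sitting in the Kafka 'value' field.
val tweetSchema = new StructType()
  .add("id", LongType)
  .add("text", StringType)
  .add("user", StringType)

val tweets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumption
  .option("subscribe", "tweets")                        // assumption
  .load()
  .select(from_json($"value".cast("string"), tweetSchema).as("tweet"))  // parse the binary blob
  .select("tweet.*")

val query = tweets.writeStream.format("console").start()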

Re: how to read object field within json file

2017-03-24 Thread Michael Armbrust
I'm not sure you can parse this as an Array, but you can hint to the parser that you would like to treat source as a map instead of as a struct. This is a good strategy when you have dynamic columns in your data. Here is an example of the schema you can use to parse this JSON and also how to use
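A sketch of what that can look like, with a made-up record layout, path, and key/value column names; the point is the MapType entry in the schema plus explode to turn the dynamic keys into rows.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("json-map").getOrCreate()
import spark.implicits._

// Hypothetical layout: "source" is an object with dynamic keys, each pointing
// at a record of fixed shape.
val sourceValue = new StructType()
  .add("name", StringType)
  .add("score", DoubleType)

val schema = new StructType()
  .add("id", StringType)
  .add("source", MapType(StringType, sourceValue))

val df = spark.read.schema(schema).json("/data/input.json")  // path is illustrative

// explode turns every (key, value) entry of the map into its own row.
df.select($"id", explode($"source").as(Seq("sourceKey", "sourceValue")))
  .select("id", "sourceKey", "sourceValue.name", "sourceValue.score")
  .show()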

unable to stream kafka messages

2017-03-24 Thread kaniska
Hi, Currently I am encountering the following exception while working with the below-mentioned code snippet: > Please suggest the correct approach for reading the stream into a sql > schema. > If I add 'tweetSchema' while reading the stream, it errors out with the message - > we can not change static schema

Application kill from UI does not propagate exception

2017-03-24 Thread Noorul Islam Kamal Malmiyoda
Hi all, I am trying to trap the UI kill event of a Spark application from the driver. Somehow the exception thrown is not propagated to the driver main program. See for example using spark-shell below. Is there a way to get hold of this event and shut down the driver program? Regards, Noorul

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Yong Zhang
Not sure if anyone else here can help you. But if I were you, I would adjust SPARK_DAEMON_MEMORY to 2g to bump the worker to 2G. Even though the worker's responsibility is very limited, in today's world, who knows. Give 2g a try to see if the problem goes away. BTW, in our production, I

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Behroz Sikander
Yeah, we also didn't find anything related to this online. Are you aware of any memory leaks in the worker in Spark 1.6.2 which might be causing this? Do you know of any documentation that explains all the tasks a worker performs? Maybe we can get some clue from there. Regards, Behroz On

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Yong Zhang
I have never experienced a worker OOM and have rarely seen it reported online. So my guess is that you have to generate the heap dump file to analyze it. Yong From: Behroz Sikander Sent: Friday, March 24, 2017 9:15 AM To: Yong Zhang Cc: user@spark.apache.org

Re: spark-submit config via file

2017-03-24 Thread Yong Zhang
Of course it is possible. You can always set any configuration in your application through the API, instead of passing it in through the CLI. val sparkConf = new SparkConf().setAppName(properties.get("appName")).setMaster(properties.get("master")).set(xxx, properties.get("xxx")) Your error is
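A minimal sketch of that pattern, loading a plain key=value properties file into a SparkConf before the session is built (the file name and keys are illustrative). For spark.* keys, spark-submit's --properties-file flag is the CLI-side equivalent.

import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical file with lines like:
//   spark.master=yarn
//   spark.app.name=StreamingEventWriter
//   spark.executor.memory=3072m
val props = new Properties()
props.load(new FileInputStream("streaming.conf"))

val conf = new SparkConf()
props.asScala.foreach { case (key, value) => conf.set(key, value) }

val spark = SparkSession.builder().config(conf).getOrCreate()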

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Behroz Sikander
Thank you for the response. Yes, I am sure because the driver was working fine. Only 2 workers went down with OOM. Regards, Behroz On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang wrote: > I am not 100% sure, but normally "dispatcher-event-loop" OOM means the > driver OOM.

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread Yong Zhang
I am not 100% sure, but normally "dispatcher-event-loop" OOM means the driver OOM. Are you sure your workers OOM? Yong From: bsikander Sent: Friday, March 24, 2017 5:48 AM To: user@spark.apache.org Subject: [Worker Crashing]

spark-submit config via file

2017-03-24 Thread , Roy
Hi, I am trying to deploy a Spark job using spark-submit, which takes a bunch of parameters, like spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode cluster --executor-memory 3072m --executor-cores 4 --files streaming.conf spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf

Re: Spark 2.0.2 : Hang at "org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)"

2017-03-24 Thread Ravindra
Also noticed that there are eight "dispatcher-event-loop-0..7" threads and eight "map-output-dispatcher-0..7" threads, all waiting at the same location in the code, that is: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)

Spark 2.0.2 : Hang at "org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)"

2017-03-24 Thread Ravindra
Hi All, My Spark job hangs here... Looking into the thread dump, I noticed that it hangs (stack trace given below) on the count action on a dataframe (given below). The data is very small, actually not more than 10 rows. I noticed some JIRAs about this issue, but all are resolved-closed in

[Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread bsikander
Spark version: 1.6.2 Hadoop: 2.6.0 Cluster: all VMs are deployed on AWS. 1 Master (t2.large), 1 Secondary Master (t2.large), 5 Workers (m4.xlarge), Zookeeper (t2.large). Recently, 2 of our workers went down with an out-of-memory exception. java.lang.OutOfMemoryError: GC overhead limit exceeded (max

skipping header in multiple files

2017-03-24 Thread nayan sharma
Hi, I want to skip the headers of all the CSVs present in a directory. After searching on Google I learned that it can be done using sc.wholeTextFiles. Can anyone suggest how to do that in Scala? Thanks & Regards, Nayan Sharma
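A sketch of the wholeTextFiles approach (the directory path is illustrative); note that it reads each file fully into memory, so it only suits reasonably small files. If the CSVs share a schema, the Spark 2.x CSV reader with the header option is usually the simpler route.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skip-headers").getOrCreate()
val sc = spark.sparkContext

// wholeTextFiles yields one (path, fullContent) pair per file, so the first
// line of each file can be dropped independently of the others.
val rows = sc.wholeTextFiles("/data/csv-dir/*.csv")
  .flatMap { case (_, content) => content.split("\n").drop(1) }

rows.take(5).foreach(println)

// Spark 2.x alternative: let the CSV data source skip the header of every file:
// spark.read.option("header", "true").csv("/data/csv-dir")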

Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-24 Thread Georg Heiler
Maybe a UDF to flatten is an interesting option as well. http://stackoverflow.com/q/42888711/2587904 Would a UDAF be more performant? shyla deshpande wrote on Fri., March 24, 2017 at 04:04: > Thanks a million, Yong. Great help!!! It solved my problem. > > On Thu, Mar
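For what a flatten UDF can look like, here is a minimal sketch with made-up data and column names; a UDAF aggregates across rows, so for per-row restructuring like this a plain UDF is usually sufficient.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("flatten-udf").getOrCreate()
import spark.implicits._

// Hypothetical shape: a column holding an array of arrays to be flattened.
val flattenUdf = udf((nested: Seq[Seq[String]]) => nested.flatten)

val df = Seq(
  (1, Seq(Seq("a", "b"), Seq("c"))),
  (2, Seq(Seq("d")))
).toDF("id", "nested")

df.withColumn("flat", flattenUdf($"nested")).show(false)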