Rename hive orc table caused no content in spark

2016-05-07 Thread yansqrt3
Hi, I'm trying to rename an ORC table (doing it in either Hive or Spark makes no difference). After the rename, all of the table's content becomes invisible in Spark, while it is still available in Hive. The problem can always be recreated with very simple steps: spark shell
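A minimal sketch of the kind of repro being described (table name and sample data are hypothetical; assumes a spark-shell with Hive support on Spark 1.x):

```scala
// Hypothetical repro: create a Hive ORC table, insert some rows, rename it,
// then query it from Spark again.
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
import hiveCtx.implicits._

hiveCtx.sql("CREATE TABLE test_orc (id INT, name STRING) STORED AS ORC")
sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")
  .write.mode("append").insertInto("test_orc")

hiveCtx.sql("SELECT * FROM test_orc").show()      // rows are visible

hiveCtx.sql("ALTER TABLE test_orc RENAME TO test_orc2")
hiveCtx.sql("SELECT * FROM test_orc2").show()     // reportedly empty in Spark,
                                                  // while Hive still shows the rows
```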

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
I have 20 executors, 6 cores each. Total 5 stages. It fails on the 5th one. All of them have 6474 tasks. The 5th stage is a count operation that also performs an aggregateByKey as part of its lazy evaluation. I am setting: spark.driver.memory=10G, spark.yarn.am.memory=2G and spark.driver.maxResultSize=9G

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ashish Dubey
The driver maintains the complete metadata of the application (scheduling of executors and the messaging that controls execution). This code seems to be failing in that code path only. That said, there is JVM overhead based on the number of executors, stages and tasks in your app. Do you know

Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Ashish Dubey
How big is your file, and can you also share the code snippet? On Saturday, May 7, 2016, Johnny W. wrote: > hi spark-user, > > I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a > dataframe from a parquet data source with a single parquet file, it yields > a

sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Johnny W.
Hi spark-user, I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a DataFrame from a parquet data source with a single parquet file, it yields a stage with lots of small tasks. The number of tasks seems to depend on how many executors I have rather than on how many parquet
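For what it's worth, a small way to inspect this (paths are hypothetical) is to read the file and look at the partition count of the resulting DataFrame, since that is what determines the task count of the scan stage:

```scala
// Minimal sketch (Spark 1.6): read a single parquet file and inspect how many
// partitions -- and therefore scan tasks -- the DataFrame has.
val df = sqlContext.read.parquet("/data/single-file.parquet")
println(s"partitions: ${df.rdd.getNumPartitions}")

// One way to collapse many tiny partitions into fewer, larger ones downstream:
val fewer = df.coalesce(4)
println(s"after coalesce: ${fewer.rdd.getNumPartitions}")
```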

Re: Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread kmurph
Hi Simon, Thanks. I did actually have "SPARK_WORKER_CORES=8" in spark-env.sh - it's commented as 'to set the number of cores to use on this machine'. I'm not sure how this interacts with SPARK_EXECUTOR_INSTANCES and SPARK_EXECUTOR_CORES, but I removed it and still see no scaleup with increasing

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Right, but these logs are from the Spark driver, and the driver seems to use Akka. ERROR [sparkDriver-akka.actor.default-dispatcher-17] akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver] I saw

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ted Yu
bq. at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129) It is Akka that uses JavaSerializer. Cheers On Sat, May 7, 2016 at 11:13 AM, Nirav Patel wrote: > Hi, > > I thought I was using kryo serializer for shuffle. I could verify it from > spark UI -

How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Hi, I thought I was using the kryo serializer for shuffle. I could verify from the Spark UI Environment tab that spark.serializer is org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator is com.myapp.spark.jobs.conf.SparkSerializerRegistrator. But when I see the following error in the driver logs, it
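For reference, a minimal sketch of how this configuration is usually applied (the registrator class name is taken from the post; whether shuffle data actually goes through it is the question at hand). As noted in the replies, Akka's control-plane messages are serialized by Akka itself, independently of spark.serializer:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: configure kryo for shuffle/data serialization. Akka remoting messages
// (such as the one in the error above) still go through Akka's own JavaSerializer.
val conf = new SparkConf()
  .setAppName("kryo-shuffle-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.myapp.spark.jobs.conf.SparkSerializerRegistrator")

val sc = new SparkContext(conf)
```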

Re: Updating Values Inside Foreach Rdd loop

2016-05-07 Thread HARSH TAKKAR
Hi Ted, following is my use case. I have a prediction algorithm where I need to update some records to predict the target. For example, I have an equation Y = mX + c. I need to change the value of Xi for some records and calculate sum(Yi); if the prediction is not close to the target value, then repeat the
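A hedged sketch of how this kind of iterative adjustment can be expressed without updating records inside a foreach (the record fields, target, and adjustment rule are made up for illustration; RDDs are immutable, so each iteration derives a new RDD instead of mutating the old one):

```scala
// Sketch: Y = m*X + c; scale X until sum(Y) is close to a target value.
case class Record(x: Double)

def sumY(records: org.apache.spark.rdd.RDD[Record], m: Double, c: Double): Double =
  records.map(r => m * r.x + c).sum()

var records = sc.parallelize(Seq(Record(1.0), Record(2.0), Record(3.0)))
val (m, c)  = (2.0, 1.0)
val target  = 100.0
val tol     = 0.5

var current = sumY(records, m, c)
while (math.abs(current - target) > tol) {
  val factor = target / current
  // derive a *new* RDD with adjusted X values rather than updating in place
  records = records.map(r => Record(r.x * factor))
  current = sumY(records, m, c)
}
```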

Finding max value in spark streaming sliding window

2016-05-07 Thread Mich Talebzadeh
Hi, What is the easiest way of finding max(price) in the code below? object CEP_AVG { def main(args: Array[String]) { // Create a local StreamingContext with two working threads and a batch interval of 10 seconds. val sparkConf = new SparkConf().setAppName("CEP_AVG").
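One way this is often done (a sketch, not the original job; the socket source and window sizes here are made up) is with reduceByWindow over a DStream of prices:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("CEP_AVG").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Hypothetical source: one price per line on a socket.
val prices = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

// Max price over a 60-second window, sliding every 10 seconds.
val maxPrice = prices.reduceByWindow((a, b) => math.max(a, b), Seconds(60), Seconds(10))
maxPrice.print()

ssc.start()
ssc.awaitTermination()
```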

Re: createDirectStream with offsets

2016-05-07 Thread Eric Friedman
Thanks Cody. It turns out that there was an even simpler explanation (the flaw you pointed out was accurate too). I had mutable.Map instances being passed where KafkaUtils wants immutable ones. On Fri, May 6, 2016 at 8:32 AM, Cody Koeninger wrote: > Look carefully at the
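For anyone hitting the same thing, a sketch of the shape of the fix (topic, broker, and offset values are hypothetical; `ssc` is an existing StreamingContext): KafkaUtils.createDirectStream takes immutable Maps, so mutable ones need a .toMap on the way in.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.collection.mutable

// Maps built up incrementally elsewhere in the app...
val kafkaParams = mutable.Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = mutable.Map(TopicAndPartition("events", 0) -> 12345L)

// ...converted to immutable Maps at the call site.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc,
  kafkaParams.toMap,
  fromOffsets.toMap,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
)
```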

Re: Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread Mich Talebzadeh
Check how much free memory you have on your host with /usr/bin/free. As heuristic values, start with these: export SPARK_EXECUTOR_CORES=4 ## Number of cores for the workers (Default: 1). export SPARK_EXECUTOR_MEMORY=8G ## Memory per Worker (e.g. 1000M, 2G) (Default: 1G) export
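A rough programmatic equivalent of those starting values, for completeness (same heuristic numbers, not tuned for any particular workload):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Same starting heuristics expressed as Spark properties.
val conf = new SparkConf()
  .setAppName("executor-sizing-example")
  .set("spark.executor.cores", "4")    // cores per executor
  .set("spark.executor.memory", "8g")  // memory per executor
val sc = new SparkContext(conf)
```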

Locality aware tree reduction

2016-05-07 Thread Ayman Khalil
Hello, Is there a way to instruct treeReduce() to reduce RDD partitions on the same node locally? In my case, I'm using treeReduce() to reduce map results in parallel. My reduce function is just arithmetically adding map results (i.e. no notion of aggregation by key). As far as I understand, a
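As a point of reference, a sketch of the usage in question (the numbers are arbitrary): treeReduce exposes a depth parameter that controls how many rounds of partial aggregation happen before results reach the driver, and each partition is already combined locally on its executor, but as far as I know there is no knob that forces the intermediate rounds to stay node-local.

```scala
// Sketch: arithmetic, key-less reduction with an extra level of partial aggregation.
val values = sc.parallelize(1 to 1000000, numSlices = 100).map(_.toDouble)
val total  = values.treeReduce((a, b) => a + b, depth = 3)
```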

Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread kmurph
Hi, I'm running Spark 1.6.1 on a single machine, initially a small one (8 cores, 16GB RAM), passing "--master local[*]" to spark-submit, and I'm trying to see scaling with increasing cores, unsuccessfully. Initially I'm setting SPARK_EXECUTOR_INSTANCES=1 and increasing the cores for each executor.
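For context, a sketch of the relevant knob in this mode (the core count is arbitrary): with --master local[*] everything runs in a single JVM acting as both driver and the only executor, so SPARK_EXECUTOR_INSTANCES does not apply; the N in local[N] is what controls how many tasks run concurrently.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of passing --master local[8] to spark-submit: one JVM, 8 task threads.
val conf = new SparkConf()
  .setAppName("local-scaling-test")
  .setMaster("local[8]")
val sc = new SparkContext(conf)
```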

Re: Found Data Quality check package for Spark

2016-05-07 Thread Rick Moritz
Hi Divya, I haven't actually used the package yet, but maybe you should check out the Gitter room where the creator is quite active. You can find it at https://gitter.im/FRosner/drunken-data-quality . There you should be able to get the information you need. Best, Rick On 6 May 2016 12:34,

Re: Spark structured streaming is Micro batch?

2016-05-07 Thread madhu phatak
Hi, Thank you for all those answers. Below is the code I am trying out: val records = sparkSession.read.format("csv").stream("/tmp/input") val re = records.write.format("parquet").trigger(ProcessingTime(100.seconds)).option("checkpointLocation", "/tmp/checkpoint")
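For comparison, here is the same pipeline sketched against the structured streaming API as it settled in Spark 2.0 (readStream/writeStream); the paths are the ones from the snippet above plus a hypothetical output path, and streaming file sources need an explicit schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-to-parquet").master("local[2]").getOrCreate()

// Streaming file sources require a schema up front; this one is made up.
val schema = new StructType().add("id", IntegerType).add("value", StringType)

val records = spark.readStream.format("csv").schema(schema).load("/tmp/input")

// Each ~100-second micro-batch is written out as parquet.
val query = records.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoint")
  .trigger(ProcessingTime("100 seconds"))
  .start("/tmp/output")                      // hypothetical output directory

query.awaitTermination()
```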