Union of RDDs Hung

2017-12-12 Thread Vikash Pareek
The version I am using is 1.6.1.

Re: Joining streaming data with static table data.

2017-12-11 Thread Vikash Pareek
batch (you can understand it as a stream of DataFrames). While joining with reference data: if it is static data, load it once and persist it; if it is dynamic data, keep refreshing it at a regular interval.
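
For illustration, a minimal sketch of that pattern, assuming Spark 1.6-style APIs, an existing SparkContext sc, and a DStream of (key, value) pairs called stream; the paths and column names are placeholders:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Static reference data: load once and keep it cached for every micro-batch.
    val refDf = sqlContext.read.parquet("/path/to/reference").cache()

    stream.foreachRDD { rdd =>
      import sqlContext.implicits._
      val batchDf = rdd.toDF("key", "value")    // one DataFrame per micro-batch
      val joined = batchDf.join(refDf, "key")   // join against the persisted reference data
      joined.write.mode("append").parquet("/path/to/output")
    }

For dynamic reference data, refDf could instead be reloaded on a timer or inside foreachRDD at a chosen interval.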

How to avoid creating meta files (.crc files)

2017-10-09 Thread Vikash Pareek
Hi Users, Is there any way to avoid creating .crc files when writing an RDD with the saveAsTextFile method? My use case is: I have mounted S3 on the local file system using S3FS and am saving an RDD to the mount point. Looking at S3, I found one .crc file for each part file and even for the _SUCCESS file.
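
One possible workaround (a sketch, not a guaranteed fix): the .crc files come from Hadoop's checksum filesystem, and FileSystem exposes a switch to turn write checksums off. This assumes the write goes through the local file:// filesystem via the S3FS mount; the mount point below is a placeholder.

    import java.net.URI
    import org.apache.hadoop.fs.FileSystem

    // FileSystem instances are cached per URI/conf, so this setting should also apply
    // to the instance saveAsTextFile obtains internally.
    val fs = FileSystem.get(new URI("file:///"), sc.hadoopConfiguration)
    fs.setWriteChecksum(false)   // suppress .crc sidecar files

    rdd.saveAsTextFile("file:///mnt/s3/output")   // /mnt/s3 is the hypothetical S3FS mount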

Re: Spark ignores --master local[*]

2017-09-12 Thread Vikash Pareek
Your VM might not have more than 1 core available to run the Spark job. Check with the *nproc* command to see how many cores are available on the VM and the *top* command to see how many cores are free.
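
A quick way to check the same thing from the spark-shell (just a sketch; nproc and top from a terminal tell the same story):

    println("JVM-visible cores:         " + Runtime.getRuntime.availableProcessors)
    println("Spark default parallelism: " + sc.defaultParallelism)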

Re: How does spark work?

2017-09-12 Thread Vikash Pareek
Obviously, you can't store 900GB of data in 80GB of memory. There is a concept in Spark called disk spill: when your data grows beyond what fits in memory, it is spilled to disk. Also, Spark doesn't use the whole memory for storing data; only a fraction of the memory is used for
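
The split between execution and storage memory is governed by two settings; a minimal sketch, assuming the Spark 1.6+ unified memory manager (the values shown are the 1.6 defaults, not a recommendation):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.75")        // share of the heap usable for execution + storage
      .set("spark.memory.storageFraction", "0.5")  // portion of the above protected for cached data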

Unable to save an RDd on S3 with SSE-KMS encryption

2017-09-12 Thread Vikash Pareek
I am trying to save an RDD on S3 with server-side encryption using a KMS key (SSE-KMS), but I am getting the following exception: *Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 695E32175EBA568A, AWS Error
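
For reference, a hedged sketch of enabling SSE-KMS through the s3a connector; this assumes a recent enough hadoop-aws (roughly 2.8+), since older s3/s3n clients do not support SSE-KMS and the request can be rejected with a 400. The bucket name and KMS key ARN below are placeholders.

    sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    sc.hadoopConfiguration.set(
      "fs.s3a.server-side-encryption.key",
      "arn:aws:kms:us-east-1:111122223333:key/your-key-id")   // hypothetical key ARN

    rdd.saveAsTextFile("s3a://your-bucket/output")   // hypothetical bucket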

Re: How does HashPartitioner distribute data in Spark?

2017-06-24 Thread Vikash Pareek
partition* instead of a uniform distribution? 2. In the case of rdd2 there is no key-value pair, so how is hash partitioning going to work, i.e. *what is the key* used to calculate the hashcode?
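
For context, a sketch of the partitioning logic in question (equivalent in spirit to Spark's HashPartitioner, not copied from its source): the partition index is the key's hashCode modulo the number of partitions, kept non-negative, with null keys mapped to partition 0.

    def getPartition(key: Any, numPartitions: Int): Int = key match {
      case null => 0
      case _ =>
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod
    }

    // A hash partitioner is applied to pair RDDs, e.g.:
    // pairRdd.partitionBy(new org.apache.spark.HashPartitioner(4))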

Re: Number Of Partitions in RDD

2017-06-23 Thread Vikash Pareek
Local mode.

How does HashPartitioner distribute data in Spark?

2017-06-23 Thread Vikash Pareek
I am trying to understand how Spark partitioning works. To understand this, I have the following piece of code on Spark 1.6:

    def countByPartition1(rdd: RDD[(String, Int)]) = {
      rdd.mapPartitions(iter => Iterator(iter.length))
    }
    def countByPartition2(rdd: RDD[String]) = {
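
The snippet ends mid-definition; for reference, a completed sketch of the two helpers (the body of countByPartition2 is an assumption, mirroring the first):

    import org.apache.spark.rdd.RDD

    def countByPartition1(rdd: RDD[(String, Int)]): RDD[Int] =
      rdd.mapPartitions(iter => Iterator(iter.length))

    def countByPartition2(rdd: RDD[String]): RDD[Int] =
      rdd.mapPartitions(iter => Iterator(iter.length))

    // countByPartition1(pairRdd).collect() then reports how many records land in each partition.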

Re: Number Of Partitions in RDD

2017-06-02 Thread Vikash Pareek
Spark 1.6.1.

Number Of Partitions in RDD

2017-06-01 Thread Vikash Pareek
Hi, I am creating an RDD from a text file by specifying the number of partitions, but it gives me a different number of partitions than the one specified.

    scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
    people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at
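
A small sketch to see what actually happens (same path as in the question); note that the second argument to textFile is only a minimum number of partitions, a hint passed down to Hadoop's InputFormat, so the resulting count can differ from the value passed in:

    val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
    println(people.partitions.length)   // the actual number of partitions created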

Re: Message getting lost in Kafka + Spark Streaming

2017-06-01 Thread Vikash Pareek
t("Accumulator's value is: " + accum)* And I am getting all the messages in "*println("Messges sent to Kafka: " + keyedMessage.message)*" received by the stream, and accumulator value is also same as number of incoming messages. Best Regards, [image: InfoO

Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Vikash Pareek
I am facing an issue related to Spark Streaming with Kafka; my use case is as follows: 1. A Spark Streaming (DirectStream) application reads data/messages from a Kafka topic and processes them. 2. On the basis of the processed message, the app writes the processed message to different Kafka topics, e.g. if
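
For concreteness, a hedged sketch of the described pipeline using Spark 1.6 with the spark-streaming-kafka (Kafka 0.8) API; the broker address, topic names, and routing logic are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("input-topic"))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // create a Kafka producer per partition and route each processed message
        // to one of several output topics depending on its content ...
      }
    }

    ssc.start()
    ssc.awaitTermination()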

Re: Hive on Spark is not populating correct records

2017-05-04 Thread Vikash Pareek
After lots of experiments, I have figured out that it was a potential bug in Cloudera's Hive on Spark. Hive on Spark does not produce consistent output for aggregate functions. Hopefully, it will be fixed in the next release.

Re: Setting spark.yarn.stagingDir in 1.6

2017-03-15 Thread Vikash Pareek
> 2.0.0, it can be done by setting spark.yarn.stagingDir.
>
> But in production, we have Spark 1.6. Can anyone please suggest how it can
> be done in Spark 1.6.
>
> Thanks and Regards,
> Saurav Sinha

Hive on Spark is not populating correct records

2016-11-24 Thread Vikash Pareek
Hi, Not sure whether this is the right place to discuss this issue. I am running the following Hive query multiple times, with the execution engine as Hive on Spark and as Hive on MapReduce. With Hive on Spark: the results (count) were different on every execution. With Hive on MapReduce: the results (count) were the same on

Re: When queried through hiveContext, does hive executes these queries using its execution engine (default is map-reduce), or spark just reads the data and performs those queries itself?

2016-06-08 Thread Vikash Pareek
, columns, datatypes etc.; thus Spark has enough information about the Hive tables and their data to understand the target data and execute the query on its own execution engine. Overall, Spark replaces the MapReduce model completely with its in-memory (RDD) computation engine.
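
A minimal example of that flow (Spark 1.x API; the table name is a placeholder): the metadata comes from the Hive metastore, but the scan and aggregation run on Spark's own engine rather than launching a MapReduce job.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.sql("SELECT col, COUNT(*) AS cnt FROM some_table GROUP BY col")
    df.show()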

Re: Number of executors change during job running

2016-05-02 Thread Vikash Pareek
Hi Bill, You can try DirectStream and increase the number of partitions on the Kafka topic; the input DStream will then have the same partitioning as the Kafka topic without re-partitioning. Can you please share your event timeline chart from the Spark UI? You need to tune your configuration to match your computation. The Spark UI

Re: Number of executors in spark-1.6 and spark-1.5

2016-04-10 Thread Vikash Pareek
node? No, I am not starting another slave node; I just changed *spark-env.sh* for each slave node, i.e. set SPARK_WORKER_INSTANCES=2.

Number of executors in spark-1.6 and spark-1.5

2016-04-10 Thread Vikash Pareek
Hi, I have upgraded a 5-node Spark cluster from spark-1.5 to spark-1.6 (to use the mapWithState function). After moving to spark-1.6, I am seeing strange behaviour from Spark: jobs are not using multiple executors on different nodes at a time, meaning there is no parallel processing if each node has

StackOverflow in updateStateByKey

2016-03-28 Thread Vikash Pareek
Hi, In my use case I need to maintain history data for a key. For this I am using updateStateByKey, in which state is maintained as a mutable Scala collection (ArrayBuffer). Each element in the ArrayBuffer is an incoming record. The Spark version is 1.6. As the number of elements (records) increases in the
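
For reference, a hedged sketch of this pattern on Spark 1.6; Record, the checkpoint directory, and the keyed input stream are placeholders. Checkpointing is mandatory for updateStateByKey, and it also truncates the RDD lineage that otherwise grows with every batch:

    import scala.collection.mutable.ArrayBuffer

    ssc.checkpoint("/tmp/checkpoints")   // hypothetical checkpoint directory

    type Record = String
    val updateFunc = (newValues: Seq[Record], state: Option[ArrayBuffer[Record]]) => {
      val history = state.getOrElse(ArrayBuffer.empty[Record])
      history ++= newValues                 // append this batch's records to the key's history
      Some(history)
    }

    val historyByKey = keyedStream.updateStateByKey[ArrayBuffer[Record]](updateFunc)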