The Spark version I am using is 1.6.1.
Best Regards,
Vikash Pareek
batch (you can think of it as a stream of DataFrames).
While joining with reference data: if it is static, load it once and persist
it; if it is dynamic, keep refreshing it at a regular interval.
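A minimal sketch of the static case (Spark 1.6 Scala API; the stream source,
paths, and field layout below are just placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Static reference data: load once and persist so it is not re-read every batch.
val refRdd = sc.textFile("hdfs:///data/reference.csv")
  .map { line => val f = line.split(","); (f(0), f(1)) }
  .persist()

// Key each micro-batch and join it against the persisted reference RDD.
val events = ssc.socketTextStream("localhost", 9999)
  .map { line => val f = line.split(","); (f(0), f(1)) }
val joined = events.transform(batchRdd => batchRdd.join(refRdd))

For dynamic reference data the same shape applies, except the reference RDD
would be reloaded (and re-persisted) on a schedule instead of loaded once.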
Best Regards,
Vikash Pareek
Hi Users,
Is there any way to avoid the creation of .crc files when writing an RDD with
the saveAsTextFile method?
My use case is: I have mounted S3 on the local file system using S3FS and I am
saving an RDD to the mount point. Looking at S3, I found one .crc file for
each part file and even for the _SUCCESS file.
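One thing I am considering trying (not sure it works through an S3FS mount) is
to turn off checksum files on the local, checksummed FileSystem before saving,
and to suppress the _SUCCESS marker; a rough sketch, with the output path as a
placeholder:

import java.net.URI
import org.apache.hadoop.fs.FileSystem

val outputPath = "file:///mnt/s3bucket/output"  // placeholder for the S3FS mount point

// The local file system is a ChecksumFileSystem; ask it not to write .crc sidecar files.
// Note: the flag is per FileSystem instance, so on a cluster each executor JVM keeps its
// own cached instance and the effect may not reach beyond local mode.
val fs = FileSystem.get(new URI(outputPath), sc.hadoopConfiguration)
fs.setWriteChecksum(false)
fs.setVerifyChecksum(false)

// Optionally skip the _SUCCESS marker file as well.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

rdd.saveAsTextFile(outputPath)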
Your VM might not have more than 1 core available to run the Spark job.
Check with the *nproc* command to see how many cores are available on the VM
and the *top* command to see how many cores are free.
Obviously, you can't store 900 GB of data in 80 GB of memory.
There is a concept in Spark called disk spill: when your data grows and can no
longer fit in memory, it is spilled out to disk.
Also, Spark doesn't use the whole memory for storing data; some fraction of
the memory is reserved for execution and internal use.
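For reference, that split is configurable; a rough sketch of the relevant
settings in Spark 1.6+ (the values here are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-fraction-example")
  // Fraction of the heap shared by execution and storage (default 0.75 in Spark 1.6).
  .set("spark.memory.fraction", "0.6")
  // Portion of the above protected for cached data (default 0.5).
  .set("spark.memory.storageFraction", "0.5")
val sc = new SparkContext(conf)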
I am trying to save an RDD on S3 with server-side encryption using a KMS key
(SSE-KMS), but I am getting the following exception:
*Exception in thread "main"
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS
Service: Amazon S3, AWS Request ID: 695E32175EBA568A, AWS Error
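For context, this is roughly what I am trying. I am not sure these are the
right property names for my Hadoop version; my understanding is that SSE-KMS
needs the newer s3a connector (Hadoop 2.8+) and AWS Signature V4, which might
explain the 400. The key ARN and bucket are placeholders:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
hadoopConf.set("fs.s3a.server-side-encryption.key",
  "arn:aws:kms:us-east-1:111122223333:key/placeholder-key-id")

rdd.saveAsTextFile("s3a://my-bucket/encrypted-output")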
partition* instead of a uniform distribution?
2. In the case of rdd2, there is no key-value pair, so how is hash
partitioning going to work, i.e. *what is the key* used to calculate the
hashcode?
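To make the question concrete, this is the kind of partitioning I am asking
about (my understanding, which may be wrong, is that only key-value RDDs go
through HashPartitioner, while a plain RDD[String] is simply split into
roughly equal chunks):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))

// The partition index comes from the key's hashCode modulo the number of partitions.
println(new HashPartitioner(4).getPartition("a"))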
Best Regards,
Vikash Pareek
Team Lead, InfoObjects Inc. <http://www.infoobjects.com/>
Local mode
I am trying to understand how Spark partitioning works.
To understand this, I have the following piece of code on Spark 1.6:
def countByPartition1(rdd: RDD[(String, Int)]) = {
  rdd.mapPartitions(iter => Iterator(iter.length))
}
def countByPartition2(rdd: RDD[String]) = {
  rdd.mapPartitions(iter => Iterator(iter.length))
}
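And I am calling them like this (the inputs are just examples; the printed
counts depend on how the RDDs are created):

val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)), 4)
println(countByPartition1(rdd1).collect().toList)  // e.g. List(1, 1, 1, 1)

val rdd2 = sc.parallelize(Seq("a", "b", "c", "d"), 4)
println(countByPartition2(rdd2).collect().toList)  // e.g. List(1, 1, 1, 1)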
Spark 1.6.1
Hi,
I am creating an RDD from a text file by specifying the number of partitions,
but it gives me a different number of partitions than the specified one.
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at
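My understanding (please correct me if I am wrong) is that the second argument
to textFile is only a minimum; the actual count comes from the Hadoop input
splits, which I am checking like this:

val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
println(people.partitions.length)  // actual number of partitions, decided by the input splits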
t("Accumulator's value is: " + accum)*
And I am getting all the messages in "*println("Messges sent to Kafka: " +
keyedMessage.message)*" received by the stream, and accumulator value is
also same as number of incoming messages.
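Roughly, the counting pattern I am using looks like this (a simplified sketch;
processedStream and the Kafka send are stand-ins for the real code):

val sentToKafka = sc.accumulator(0L, "Messages sent to Kafka")

processedStream.foreachRDD { rdd =>
  rdd.foreach { keyedMessage =>
    // ... send keyedMessage to Kafka here ...
    sentToKafka += 1L
  }
}

println("Accumulator's value is: " + sentToKafka.value)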
Best Regards,
Vikash Pareek
I am facing an issue related to Spark streaming with Kafka; my use case is as
follows:
1. A Spark streaming (DirectStream) application reads data/messages from a
Kafka topic and processes them.
2. On the basis of the processed message, the app writes the processed message
to different Kafka topics, for e.g. if
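A rough sketch of the read-process-write flow with the Spark 1.6 direct stream
API (brokers, topic names, and the "processing" step are placeholders):

import java.util.Properties
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("input-topic"))

stream.map { case (_, value) => value.toUpperCase }  // stand-in for the real processing
  .foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker1:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      records.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
      producer.close()
    }
  }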
After lots of experiments, I have figured out that it is a potential bug in
the Cloudera distribution with Hive on Spark.
Hive on Spark does not produce consistent output for aggregate functions.
Hopefully, it will be fixed in the next release.
2.0.0, it can be done by setting spark.yarn.stagingDir.
>
> But in production, we have spark 1.6. Can anyone please suggest how it can
> be done in spark 1.6.
>
> Thanks and Regards,
> Saurav Sinha
--
Best Regards,
Vikash Pareek
Hi,
Not sure whether this is the right place to discuss this issue.
I am running the following Hive query multiple times, with the execution
engine set to Hive on Spark and then to Hive on MapReduce.
With Hive on Spark: the result (count) was different on every execution.
With Hive on MapReduce: the result (count) was the same on every execution.
columns, datatypes, etc. Thus, Spark has enough information about the Hive
tables and their data to understand the target data and execute the query on
its own execution engine.
Overall, Spark replaced the MapReduce model completely with its in-memory
(RDD) computation engine.
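For example, in Spark 1.6 the same Hive tables can be queried directly through
Spark's engine via HiveContext (the table and column names here are just
examples; hive-site.xml must be on the classpath):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val counts = hiveContext.sql("SELECT some_col, count(*) FROM my_hive_table GROUP BY some_col")
counts.show()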
- Vikash Pareek
Hi Bill,
You can try DirectStream and increase the number of partitions on the Kafka
topic; the input DStream will then have the same number of partitions as the
Kafka topic, without any re-partitioning.
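A quick way to confirm the 1:1 mapping between Kafka partitions and RDD
partitions (assuming the stream was created with KafkaUtils.createDirectStream;
the variable name is illustrative):

directStream.foreachRDD { rdd =>
  println(s"Batch has ${rdd.partitions.length} partitions (one per Kafka topic partition)")
}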
Can you please share your event timeline chart from the Spark UI? You need to
tune your configuration as per the computation. The Spark UI
No, I am not starting another slave node. I just changed *spark-env.sh* on
each slave node, i.e. set SPARK_WORKER_INSTANCES=2.
Best Regards,
Vikash Pareek
Software Developer, *InfoObjects Inc.*
m: +918800206898 a: E5, Jhalana Institutional Area, Jaipur
s: vikaspareek1991 e: vikash.par...@inf
Hi,
I have upgraded a 5-node Spark cluster from Spark 1.5 to Spark 1.6 (to use the
mapWithState function).
Since moving to Spark 1.6, I am seeing strange behaviour: jobs are not using
multiple executors on different nodes at the same time, i.e. there is no
parallel processing if each node having
Hi,
In my use case I need to maintain history data for a key. For this I am using
updateStateByKey, in which the state is maintained as a mutable Scala
collection (ArrayBuffer). Each element in the ArrayBuffer is an incoming record.
The Spark version is 1.6.
As the number of elements (records) increases in the
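The pattern I am using looks roughly like this (a simplified sketch;
keyedStream is a DStream[(String, String)] standing in for the real input, and
checkpointing must be enabled for updateStateByKey):

import scala.collection.mutable.ArrayBuffer

def updateHistory(newRecords: Seq[String],
                  state: Option[ArrayBuffer[String]]): Option[ArrayBuffer[String]] = {
  val history = state.getOrElse(ArrayBuffer.empty[String])
  history ++= newRecords  // append every incoming record to the key's history
  Some(history)
}

val historyStream = keyedStream.updateStateByKey(updateHistory)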