Hi Akhil,
Yes, that's what we are planning on doing at the end of the day. At the
moment I am doing performance testing before the job hits production,
testing on 4 cores to get baseline figures, and have deduced that in order
to grow to 10-15 million keys we'll need a batch interval of ~20 secs.
Hi all,
We are having a few issues with the performance of updateStateByKey
operation in Spark Streaming (1.2.1 at the moment) and any advice would be
greatly appreciated. Specifically, on each tick of the system (which is set
at 10 secs) we need to update a state tuple where the key is the
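For reference, the update function that updateStateByKey expects has the shape (Seq[V], Option[S]) => Option[S], and it is invoked for every key on every batch, which is why total state size drives the feasible batch interval. A minimal sketch in plain Scala, assuming a simple running-count state (the function name and state type are illustrative, not from the original job):

```scala
// Hypothetical update function of the shape updateStateByKey expects:
// (Seq[V], Option[S]) => Option[S]. Here the state is a running count
// per key; returning None drops the key from the state.
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
  if (newValues.isEmpty && state.isEmpty) None      // no data, no state: drop key
  else Some(state.getOrElse(0) + newValues.sum)     // fold new values into state
}

// In a streaming job this would be wired up roughly as:
//   val stateDStream = pairs.updateStateByKey(updateCount _)
```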
repartition the data by partition columns first.
Cheng
On 7/15/15 7:05 PM, Nikos Viorres wrote:
Hi,
I am trying to test partitioning for DataFrames with Parquet, so I
attempted df.write().partitionBy(some_column).parquet(path) on a
small dataset of 20,000 records, which when saved as Parquet locally with
gzip takes 4 MB of disk space.
However, on my dev machine with
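One thing worth checking with partitionBy is the cardinality of the partition column: the writer produces one sub-directory per distinct value, so even a tiny dataset can fan out into many small files, which is a common cause of slow partitioned writes. A plain-Scala model of that directory scheme, with no Spark dependency (partitionPaths and the sample column names are hypothetical, purely to illustrate the layout):

```scala
// Sketch of the directory layout produced by
// df.write().partitionBy("some_column").parquet(path): one sub-directory
// per distinct value of the partition column. Pure Scala model, no Spark.
def partitionPaths(rows: Seq[Map[String, String]],
                   partitionCol: String,
                   basePath: String): Map[String, Int] =
  rows.groupBy(r => s"$basePath/$partitionCol=${r(partitionCol)}")
      .map { case (dir, rs) => (dir, rs.size) }
```

With a high-cardinality partition column, 20,000 records can spread across thousands of directories, each holding one tiny Parquet file.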
Hi,
I am using KafkaUtils.createRDD to retrieve data from Kafka for batch
processing, and
when invoking KafkaUtils.createRDD with an OffsetRange where
OffsetRange.fromOffset == OffsetRange.untilOffset for a particular
partition, I get an empty RDD.
Documentation is clear that until is exclusive and
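Since untilOffset is exclusive, a range with fromOffset == untilOffset spans zero messages, so the empty RDD is the documented behaviour rather than an error. A small sketch of the arithmetic (OffsetRangeModel is a hypothetical stand-in for Spark's OffsetRange, used here only to show the count):

```scala
// Model of the documented OffsetRange semantics: fromOffset is inclusive,
// untilOffset is exclusive, so the range holds untilOffset - fromOffset
// messages. An equal pair therefore describes an empty range.
final case class OffsetRangeModel(fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}
```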