Hello, everybody!
Maybe it's not the cause of your problem, but I noticed this line in your
comments:
*java version "1.8.0_51"*
It's strongly advised to use Java 1.8.0_66 or later.
I myself use Java 1.8.0_101.
On Tue, Sep 20, 2016 at 1:09 AM, janardhan shetty
wrote:
> Yes
As Cody said, Spark is not going to help you here.
There are two issues you need to look at here: messages delivered two (or even
more) times and processed by different processes, and the failure of any
component (including the message broker). Keep in mind that duplicated messages
can even
I am trying to prototype using a single SqlContext instance and using it to append
DataFrames, partitioned by a field, to the same HDFS folder from multiple threads.
(Each thread works with a DataFrame having a different partition-column
value.)
I get the exception: 16/09/24 16:45:12 ERROR
Spark alone isn't going to solve this problem, because you have no reliable
way of making sure a given worker has a consistent shard of the messages
seen so far, especially if there's an arbitrary amount of delay between
duplicate messages. You need a DHT or something equivalent.
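To make the point above concrete, here is a minimal sketch of idempotent processing keyed by message ID. This is not from the thread: the names are illustrative, and a plain dict stands in for the external shared store (Redis, Cassandra, a DHT) that every worker would actually need to consult so they all see the same "already processed" state:

```python
# Sketch: deduplicate at-least-once deliveries by message ID.
# 'seen' is a plain dict here; in a real deployment it must be an
# external shared store, because any single worker may not see both
# copies of a duplicated message.
seen = {}

def process_once(msg_id, payload, handler):
    """Run handler(payload) only the first time msg_id is seen."""
    if msg_id in seen:
        return False  # duplicate delivery; skip it
    seen[msg_id] = True
    handler(payload)
    return True

results = []
process_once("m1", "hello", results.append)
process_once("m1", "hello", results.append)  # duplicate, ignored
process_once("m2", "world", results.append)
# results is now ["hello", "world"]
```

The check-then-write on the shared store would also need to be atomic (e.g. Redis SETNX) to close the race between two workers seeing the same message at the same moment.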
On Sep 24, 2016
We have too many (large) files. We have about 30k partitions with about 4
years' worth of data, and we need to process the entire history in a one-time
monolithic job.
I would like to know how Spark decides the number of executors requested.
I've seen test cases where the max executor count is Integer's
Hi Dan,
If you use Spark <= 1.6, you can also do
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
to quickly link the spark-csv jars into the Spark shell. Otherwise, as Holden
suggested, link it in your Maven/sbt dependencies. The Spark guys assume
that their users have a good
Do you have too many small files you are trying to read? The number of
executors is very high
On 24 Sep 2016 10:28, "Yash Sharma" wrote:
> Have been playing around with configs to crack this. Adding them here
> where they may be helpful to others :)
> Number of executors and
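The configs being discussed are the standard spark-submit executor and dynamic-allocation settings. The values below are placeholders, not the ones from the thread:

```shell
# With dynamic allocation, Spark requests executors between the min/max
# bounds based on pending tasks; without it, --num-executors is fixed.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --executor-memory 4g \
  --executor-cores 4 \
  your-app.jar
```

With dynamic allocation disabled, `--num-executors` (or `spark.executor.instances`) sets a fixed count instead.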
The scale factor was only to scale up the number of ratings in the dataset
for performance testing purposes, to illustrate the scalability of Spark
ALS.
It is not something you would normally do on your training dataset.
On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote
Hi Guys,
I have a bunch of data coming into my Spark Streaming cluster from a message
queue (not Kafka). This message queue only guarantees at-least-once delivery,
so some of the messages that arrive in the Spark Streaming cluster may
actually be duplicates, and I am trying