Hi,
I'm using Spark 1.5.0. I wrote a custom Hadoop InputFormat to load data
from a third-party data source; the data type mapping is taken care of in
my code, but when I issued the query below:
SELECT * FROM ( SELECT count(*) as failures from test WHERE state !=
'success' ) as tmp WHERE (
If multiple users are looking at the same data set, then sharing the
SparkContext is a good choice.
But my use cases are different: users are looking at different data (I use
a custom Hadoop InputFormat to load data from my data source based on the
user input), and the data might not have any overlap.
Hi,
I have a question regarding yarn-cluster mode and
spark.driver.allowMultipleContexts for the use cases below.
I have a long-running backend server where I will create a short-lived
Spark job in response to each user request, based on the fact that by
default multiple SparkContexts cannot be created
> .set("spark.driver.allowMultipleContexts", "true"))
> ./core/src/test/scala/org/apache/spark/SparkContextSuite.scala
>
> FYI
>
> On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu <anfernee...@gmail.com> wrote:
> > I have a long running backend server where I will create a short-lived
> > Spark job in response to each user request, based on the fact that by
> > default multiple SparkContexts cannot be created
Hi Spark experts,
First of all, happy Thanksgiving!
Now to my question: I have implemented a custom Hadoop InputFormat to
load millions of entities from my data source into Spark (as a JavaRDD,
then transformed to a DataFrame). The approach I took in implementing the
custom Hadoop RDD is loading all
Hi,
I have a pretty large data set (2M entities) in my RDD. The data has
already been partitioned by a specific key; the key has a range (type
long). Now I want to create a bunch of key buckets. For example, if the
key has range
1 -> 100,
I will break the whole range into the buckets below:
1
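The bucketing described above can be sketched in plain Java (a minimal illustration, not from the original thread; the method name `buckets` and the fixed width of 10 are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class KeyBuckets {
    // Split the inclusive key range [min, max] into fixed-width buckets,
    // returning each bucket as a two-element array {lo, hi}.
    static List<long[]> buckets(long min, long max, long width) {
        List<long[]> out = new ArrayList<>();
        for (long lo = min; lo <= max; lo += width) {
            out.add(new long[] { lo, Math.min(lo + width - 1, max) });
        }
        return out;
    }

    public static void main(String[] args) {
        // Range 1 -> 100 with width 10 gives buckets 1-10, 11-20, ..., 91-100.
        for (long[] b : buckets(1, 100, 10)) {
            System.out.println(b[0] + " - " + b[1]);
        }
    }
}
```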
Thanks, Yong, for your response.
Let me see if I understand what you're suggesting: when I load the whole
data set into Spark (I'm using a custom Hadoop InputFormat), I will add an
extra field to each element in the RDD, like bucket_id.
For example:
Key:
1 - 10, bucket_id=1
11-20,
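The key-to-bucket_id mapping sketched above can be computed arithmetically; a minimal sketch (the method name `bucketId` is hypothetical, and the range/width values are just the example's):

```java
public class BucketId {
    // Map a key in [min, ...] to a 1-based bucket_id, given a fixed bucket
    // width: with min=1, width=10, keys 1-10 -> 1, 11-20 -> 2, and so on.
    static int bucketId(long key, long min, long width) {
        return (int) ((key - min) / width) + 1;
    }

    public static void main(String[] args) {
        System.out.println(bucketId(1, 1, 10));    // 1
        System.out.println(bucketId(10, 1, 10));   // 1
        System.out.println(bucketId(11, 1, 10));   // 2
        System.out.println(bucketId(100, 1, 10));  // 10
    }
}
```

In the RDD, this value would be attached as the extra bucket_id field on each element at load time.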
Hi,
I just want to understand the cost of DataFrame.registerTempTable(String):
is it just a trivial operation (like creating an object reference) in the
master (driver) JVM? And can I have multiple tables with different names
referencing the same DataFrame?
Thanks
--
--Anfernee
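Conceptually, registering a temp table amounts to storing a name-to-DataFrame mapping in a driver-side catalog; no data is copied. A plain-Java analogy (these class names are illustrative stand-ins, not Spark's actual internals):

```java
import java.util.HashMap;
import java.util.Map;

public class TempTableCatalog {
    // Hypothetical stand-in for a DataFrame (really a reference to a logical plan).
    static class DataFrameRef {
        final String plan;
        DataFrameRef(String plan) { this.plan = plan; }
    }

    private final Map<String, DataFrameRef> catalog = new HashMap<>();

    // Registering just stores a name -> reference; it is a cheap driver-side
    // operation, nothing is materialized.
    void registerTempTable(String name, DataFrameRef df) {
        catalog.put(name, df);
    }

    DataFrameRef lookup(String name) {
        return catalog.get(name);
    }

    public static void main(String[] args) {
        TempTableCatalog c = new TempTableCatalog();
        DataFrameRef df = new DataFrameRef("scan(test)");
        c.registerTempTable("t1", df);
        c.registerTempTable("t2", df);
        // Two different names can reference the very same DataFrame.
        System.out.println(c.lookup("t1") == c.lookup("t2")); // true
    }
}
```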
Hi,
Here's my situation: I have a kind of offline dataset that I have loaded
into Spark as an RDD, but I want to form a virtual data stream feeding
into Spark Streaming. My code looks like this:
// sort offline data by time; the dataset spans 2 hours
1) JavaRDD sortedByTime =
Hi,
Here's my situation: I have a kind of offline dataset, but I want to form
a virtual data stream feeding into Spark Streaming. My code looks like
this:
// sort offline data by time
1) JavaRDD sortedByTime = offlineDataRDD.sortBy( );
// compute a list of JavaRDD, each element
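The step after sorting — carving the sorted data into per-interval batches — can be sketched in plain Java with timestamps standing in for records (method and parameter names are illustrative; in the real job each batch would be parallelized into a JavaRDD and enqueued for JavaStreamingContext.queueStream):

```java
import java.util.ArrayList;
import java.util.List;

public class VirtualStream {
    // Partition timestamps (already sorted, in millis) into consecutive
    // windows of `windowMs`, starting at the first timestamp. Empty windows
    // are kept so the replayed stream preserves gaps in the data.
    static List<List<Long>> batchesByWindow(List<Long> sortedTs, long windowMs) {
        List<List<Long>> batches = new ArrayList<>();
        if (sortedTs.isEmpty()) return batches;
        long windowStart = sortedTs.get(0);
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            // Advance windows until this record falls inside the current one.
            while (ts >= windowStart + windowMs) {
                batches.add(current);
                current = new ArrayList<>();
                windowStart += windowMs;
            }
            current.add(ts);
        }
        batches.add(current);
        return batches;
    }
}
```

Each element of the returned list corresponds to one micro-batch of the virtual stream.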
Sorry, I have to re-send this as I did not get an answer.
Here's the problem I'm facing: I'm using the Spark 1.5.0 release, and I
have a standalone Java application which periodically submits Spark jobs
to my YARN cluster. BTW, I'm not using 'spark-submit' or
'org.apache.spark.launcher' to submit
Sorry, I have to re-send this as I did not get an answer.
Here's the problem I'm facing: I have a standalone Java application which
periodically submits Spark jobs to my YARN cluster. BTW, I'm not using
'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These
jobs are
Hi,
Here's the problem I'm facing: I have a standalone Java application which
periodically submits Spark jobs to my YARN cluster. BTW, I'm not using
'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These
jobs complete successfully and I can see them on the YARN RM web UI, but
when I want to
Hi Spark experts,
I'm coming across these terminologies and having some confusion; could you
please help me understand them better?
For instance, I have implemented a Hadoop InputFormat to load my external
data into Spark; in turn my custom InputFormat will create a bunch of
InputSplits, my
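For intuition on the terms: getSplits() decides how the input is carved up, each InputSplit becomes one RDD partition, and Spark runs one task per partition. A pure-Java sketch of the split computation, deliberately omitting the real Hadoop InputFormat/InputSplit interfaces (the class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Hypothetical split: a [start, start+length) slice of the external data.
    static class Split {
        final long start, length;
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    // Analogue of InputFormat.getSplits(): carve `totalRecords` into chunks
    // of at most `splitSize`. Each split later maps to one RDD partition,
    // and Spark schedules one task per partition.
    static List<Split> getSplits(long totalRecords, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < totalRecords; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, totalRecords - start)));
        }
        return splits;
    }
}
```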