Hi,
I use the HCatalog Streaming Mutation API to write data to a Hive transactional
table, and then I use SparkSQL to read data from the Hive transactional
table. I get the right result.
However, SparkSQL takes more time to read the Hive ORC bucketed transactional
table, because SparkSQL
Hi Paras,
Check out the link: Spark Scala: DateDiff of two columns by hour or
minute
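For reference, the hour/minute arithmetic the linked question is about can be sketched in plain Scala with java.time (the object name and sample timestamps below are made up for illustration); in Spark SQL the usual approach is the same subtraction done on `unix_timestamp` values of the two columns:

```scala
import java.time.{Duration, LocalDateTime}

object TimestampDiff {
  // Whole-minute difference between two timestamps. In a Spark dataframe
  // the equivalent column expression is typically
  //   (unix_timestamp(col("end")) - unix_timestamp(col("start"))) / 60
  def minutesBetween(start: LocalDateTime, end: LocalDateTime): Long =
    Duration.between(start, end).toMinutes

  // Whole-hour difference, same idea divided by 3600.
  def hoursBetween(start: LocalDateTime, end: LocalDateTime): Long =
    Duration.between(start, end).toHours
}
```

For example, 09:00 to 11:30 on the same day gives 150 minutes and 2 whole hours.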
Hi Khaled,
I have attached the spark streaming config below in (a).
In the case of the 100 vcore run (see the initial email), I used 50 executors,
where each executor has 2 vcores and 3g memory. For the 70 vcore case, 35
executors; for the 80 vcore case, 40 executors.
In the yarn config (yarn-site.xml, (b)
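The 100 vcore sizing described above (50 executors x 2 vcores x 3g memory) would be expressed with the standard spark-submit flags roughly as follows; the application jar name is a placeholder:

```shell
# Sketch of the 100 vcore configuration: 50 executors x 2 cores = 100 vcores.
spark-submit \
  --master yarn \
  --num-executors 50 \
  --executor-cores 2 \
  --executor-memory 3g \
  your-streaming-app.jar   # placeholder jar name
```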
I just discovered https://issues.apache.org/jira/browse/SPARK-25738 with
some more testing. I only marked it as critical, but it seems pretty bad --
I'll defer to others' opinions
On Sat, Oct 13, 2018 at 4:15 PM Dongjoon Hyun
wrote:
> Yes. From my side, it's -1 for RC3.
>
> Bests,
> Dongjoon.
>
>
Hi Peter,
What parameters are you putting in your Spark streaming configuration? What
are you setting as the number of executor instances, and how many cores per
executor are you setting in your Spark job?
Best,
Khaled
On Mon, Oct 15, 2018 at 9:18 PM Peter Liu wrote:
> Hi there,
>
> I have a
Hi there,
I have a system with 80 vcores and a relatively light Spark streaming
workload. Overcommitting the vcore resource (i.e. > 80) in the config (see (a)
below) seems to help improve the average Spark batch time (see (b)
below).
Is there any best practice guideline on resource overcommit
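For what it's worth, overcommit of this kind is usually done by advertising more vcores to YARN than the node physically has. A yarn-site.xml sketch (the property name is the standard YARN one; the value 160 is only an example of a 2x overcommit on an 80 vcore box):

```xml
<!-- yarn-site.xml: advertise more vcores than physical cores (example value) -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>160</value>
</property>
```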
Hi Fokko
Spark fires it off for many other things. It does so for ML pipelines, and
it does make this information available for data frames.
We use S3 in this case; I just simplified the example. It is important to
know what process took what action. Only Spark knows this, and it does
supply this
Hi Bolke,
I would argue that Spark is not the right level of abstraction for doing
this. I would create a wrapper around the particular filesystem:
http://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html
That way you can write a wrapper around the LocalFileSystem if data
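The wrapper idea can be illustrated in plain Scala with a made-up `Storage` trait rather than the real `org.apache.hadoop.fs.FileSystem` API (all names below are hypothetical): a decorator records every path that is read or written, which is exactly the lineage information being discussed.

```scala
import scala.collection.mutable.{ListBuffer, Map => MutMap}

// Hypothetical stand-in for a filesystem interface.
trait Storage {
  def read(path: String): String
  def write(path: String, data: String): Unit
}

// Simple in-memory backend for demonstration.
class InMemoryStorage extends Storage {
  private val files = MutMap.empty[String, String]
  def read(path: String): String = files(path)
  def write(path: String, data: String): Unit = files(path) = data
}

// Decorator that tracks lineage: every accessed path is recorded
// before delegating to the underlying storage.
class LineageStorage(underlying: Storage) extends Storage {
  val reads  = ListBuffer.empty[String]
  val writes = ListBuffer.empty[String]
  def read(path: String): String = { reads += path; underlying.read(path) }
  def write(path: String, data: String): Unit = { writes += path; underlying.write(path, data) }
}
```

With the real Hadoop API, the same pattern would wrap (or subclass) the concrete `FileSystem` implementation and delegate `open`/`create` while recording the paths.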
Hi,
Apologies upfront if this should have gone to user@, but it seems like a
developer question, so here goes.
We are trying to improve a listener to track lineage across our platform. This
requires tracking where data comes from and where it goes to. E.g.
sc.setLogLevel("INFO");
val data =
I realize it is unlikely all data will be local to tasks, so placement will
not be optimal and there will be some network traffic, but is this the same
as a shuffle?
In CoalesceRDD it shows a NarrowDependency, which I thought meant it could
be implemented without a shuffle.
On Mon, Oct 15, 2018
This is not fully correct. If you have fewer files, then you need to move some
data to other nodes, because not all the data is there for writing (this is even
the case for the same node, but then it is easier from a network perspective).
Hence a shuffle is needed.
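To make the narrow-dependency point concrete, here is a toy sketch (not Spark's actual PartitionCoalescer, and the partition contents are made up): each output "partition" is the union of whole input partitions, so individual records are never redistributed by key the way a shuffle would do, even though the merged data may still travel over the network to wherever the new task runs.

```scala
// Toy model of narrow-dependency coalesce: group whole parent partitions
// into fewer output partitions without splitting any of them apart.
object CoalesceSketch {
  def coalesce[T](partitions: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] =
    partitions
      .grouped(math.ceil(partitions.size.toDouble / numPartitions).toInt)
      .map(_.flatten)   // each output partition is a union of parents
      .toSeq
}
```

For example, four parent partitions coalesced to two simply concatenates pairs of parents; a repartition, by contrast, would hash every record to a new partition.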
> Am 15.10.2018 um 05:04
Thanks John,
Actually, I need the full date and time difference, not just the date
difference, which I guess is not supported.
Let me know if it's possible, or if any UDF is available for the same.
Thanks and Regards,
Paras
From: John Zhuge
Sent: Friday, October 12, 2018
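A UDF may not be strictly necessary for the full breakdown: the difference can be decomposed into days, hours, and minutes from a single duration. A plain-Scala sketch (the object name and sample values are illustrative; in Spark the same decomposition can be applied to the `unix_timestamp` difference of the two columns):

```scala
import java.time.{Duration, LocalDateTime}

object FullDiff {
  // Break the difference between two timestamps into (days, hours, minutes).
  def fullDiff(start: LocalDateTime, end: LocalDateTime): (Long, Long, Long) = {
    val d = Duration.between(start, end)
    (d.toDays, d.toHours % 24, d.toMinutes % 60)
  }
}
```

For example, from 2018-10-10 08:00 to 2018-10-12 10:30 the result is 2 days, 2 hours, 30 minutes.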