I think many advanced Spark users already have custom Catalyst rules to
deal with the query plan directly, so it makes a lot of sense to
standardize the logical plan. However, instead of exploring possible
operations ourselves, I think we should follow the SQL standard.
ReplaceTable, RTAS:
Most
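(Editor's aside: for readers less familiar with this area, a minimal sketch of how such a custom rule is usually wired in today, assuming the experimental optimizer hook; the rule below is a placeholder, not part of the proposal.)

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Placeholder rule: a real rule would pattern-match on plan nodes and rewrite them.
    object NoOpRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    object CustomRuleDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("custom-rule-demo")
          .master("local[*]")
          .getOrCreate()
        // Extra optimizer rules operate directly on the logical plan.
        spark.experimental.extraOptimizations = Seq(NoOpRule)
        spark.range(10).filter("id > 5").show()
        spark.stop()
      }
    }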
Thanks for responding!
I’ve been coming up with a list of the high-level operations that are
needed. I think all of them come down to 5 questions about what’s happening:
- Does the target table exist?
- If it does exist, should it be dropped?
- If not, should it get created?
- Should
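(Editor's aside: a rough sketch of how the first three questions map onto today's DataFrame write modes; the table names are hypothetical, and this is one reading of the thread, not the proposal itself.)

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object WriteModeDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("write-mode-demo")
          .master("local[*]")
          .getOrCreate()
        val df = spark.range(10).toDF("id")   // stand-in data

        // Target must not exist yet: create it, then write (CTAS-like).
        df.write.mode(SaveMode.ErrorIfExists).saveAsTable("target_ctas")

        // Whatever exists is dropped and recreated with the new data (RTAS-like).
        df.write.mode(SaveMode.Overwrite).saveAsTable("target_rtas")

        spark.stop()
      }
    }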
Sure. Obviously, there is going to be some overlap as the project
transitions to being part of mainline Spark development. As long as you are
consciously working toward moving discussions to this dev list, all
is good.
On Mon, Feb 5, 2018 at 1:56 PM, Matt Cheah wrote:
> I think in this ca
I think in this case, the original design that was proposed before the document
was implemented on the Spark on K8s fork, which we took some time to build
separately before proposing that the fork be merged into the main line.
Specifically, the timeline of events was:
We started building Sp
That's good, but you should probably stop and consider whether the
discussions that led up to this document's creation could have taken place
on this dev list -- because if they could have, then they probably should
have as part of the whole spark-on-k8s project becoming part of mainline
spark deve
Hi everyone,
While we were building the Spark on Kubernetes integration, we realized that
some of the abstractions we introduced for building the driver application in
spark-submit, and building executor pods in the scheduler backend, could be
improved for better readability and clarity. We
In that case, I'd recommend tracking down the node where the files were
created and reporting it to EMR.
On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang wrote:
> Thanks for the response, Ryan.
>
> We have transient EMR clusters, and we do rerun the cluster whenever it
> fails. However, in t
Hi, Ryan,
Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we have encountered this issue. We have a wide table with
134 columns in the file. The issue seems to impact only one column and is very hard
to detect. It seems you have encountered this issue be
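(Editor's aside: a brute-force detection sketch that scans each column separately to isolate the one that fails to decode; the S3 path is a placeholder, and this re-reads the data once per column, so it is slow on a 134-column table.)

    import scala.util.Try
    import org.apache.spark.sql.SparkSession

    object ParquetColumnCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-column-check")
          .master("local[*]")
          .getOrCreate()
        val df = spark.read.parquet("s3://bucket/table/")   // placeholder path

        // Force each column to be decoded on its own; a corrupted column usually
        // surfaces as a ParquetDecodingException for that read alone.
        val badColumns = df.columns.filter { c =>
          Try(df.select(c).rdd.count()).isFailure
        }
        println(s"Columns that failed to decode: ${badColumns.mkString(", ")}")

        spark.stop()
      }
    }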
Thanks for the response, Ryan.
We have transient EMR clusters, and we do rerun the cluster whenever it fails.
However, in this particular case, the cluster succeeded without reporting
any errors. I was able to null out the corrupted column and recover the
rest of the 133 columns. I do
We ensure the bad node is removed from our cluster and reprocess to replace
the data. We only see this once or twice a year, so it isn't a significant
problem.
We've discussed options for adding write-side validation, but it is
expensive and still unreliable if you don't trust the hardware.
rb
O
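(Editor's aside: a rough sketch of the kind of read-back validation mentioned above; the path is a placeholder, and re-reading the output roughly doubles the I/O, which is the expense being discussed.)

    import org.apache.spark.sql.SparkSession

    object WriteThenValidate {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("write-then-validate")
          .master("local[*]")
          .getOrCreate()

        val df = spark.range(1000000).toDF("id")   // stand-in for the real output
        val out = "s3://bucket/output/"            // placeholder path

        df.write.parquet(out)

        // Re-read and force every column to be decoded; corrupt pages should fail
        // here rather than in a downstream job. Going through the row-level RDD
        // avoids answering the count from Parquet footer metadata alone.
        val readBack = spark.read.parquet(out).rdd.count()
        require(readBack == df.count(), s"row count mismatch: expected ${df.count()}, read $readBack")

        spark.stop()
      }
    }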
If you can still access the logs, then you should be able to find where the
write task ran. Maybe you can get an instance ID and open a ticket with
Amazon. Otherwise, it will probably start failing the HW checks when the
instance hardware is reused, so I wouldn't worry about it.
The _SUCCESS file
Hi, Ryan,
Many thanks for your quick response.
We run Spark on transient EMR clusters. Nothing in the logs or EMR events
suggests any issues with the cluster or the nodes. We also see the _SUCCESS
file on S3. If we see the _SUCCESS file, does that suggest all the data is good?
How can we prevent
Dong,
We see this from time to time as well. In my experience, it is almost
always caused by a bad node. You should try to find out where the file was
written and remove that node as soon as possible.
As far as finding out what is wrong with the file, that's a difficult task.
Parquet's encoding i
Thank you very much. I had overlooked the differences between the two.
The public API part is understandable.
Coming to the second part - I see that it creates an instance of UnionRDD with
all RDDs as parents, thereby preventing a long lineage chain.
Is my understanding correct?
On 5 February 2018 at
Hi,
We are running on Spark 2.2.1, generating Parquet files with the following
pseudo code:
df.write.parquet(...)
We have recently noticed Parquet file corruption when reading the files back
in Spark or Presto, with errors like the following:
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not r
First, the public API cannot be changed except when there is a major
version change, and there is no way that we are going to do Spark 3.0.0
just for this change.
Second, the change would be a mistake since the two union methods are quite
different. The method in RDD only ever works on t
There is one on RDD but `SparkContext.union` prevents lineage from growing.
Check https://stackoverflow.com/q/34461804
Sent with [ProtonMail](https://protonmail.com) Secure Email.
Original Message
On February 5, 2018 5:04 PM, Suchith J N wrote:
> Hi,
>
> Seems like simple cl
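(Editor's aside: to make the lineage point concrete, a small self-contained sketch, my own example: chaining RDD.union nests one UnionRDD inside another, while SparkContext.union builds a single UnionRDD with all the inputs as direct parents.)

    import org.apache.spark.{SparkConf, SparkContext}

    object UnionLineageDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("union-demo").setMaster("local[*]"))
        val rdds = (1 to 100).map(i => sc.parallelize(Seq(i)))

        // Chaining the RDD method nests UnionRDDs roughly 100 levels deep.
        val chained = rdds.reduce(_ union _)

        // The SparkContext method produces one UnionRDD with 100 parents.
        val flat = sc.union(rdds)

        println(chained.toDebugString)   // deep lineage
        println(flat.toDebugString)      // flat lineage
        sc.stop()
      }
    }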
Hi,
Seems like simple clean up - Why do we have union() on RDDs in
SparkContext? Shouldn't it reside in RDD? There is one in RDD, but it seems
like a wrapper around this.
Regards,
Suchith