There's a new parquet RC up this week which would be good to pull in.
On Thu, 2 May 2024 at 03:20, Jungtaek Lim
wrote:
> +1 love to see it!
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau
> wrote:
>
>> +1 :) yay previews
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote:
>>
>>> +1
>>>
>>>
Those are some quite good improvements, but committing to storing all your
data in an unstable format is, well, "bold". For temporary data as part of
a workflow, though, it could be appealing.
Now, assuming you are going to be working with s3, you might want to start
with merging PARQUET-2117 into
ASF will be unhappy about this, and Stack Overflow exists. Otherwise:
Apache, Confluent and LinkedIn exist; LI is the option I'd point at
On Mon, 18 Mar 2024 at 10:59, Mich Talebzadeh
wrote:
> Some of you may be aware that the Databricks community
> have just launched a knowledge
I consider the context info as more important than just logging; at the hadoop
level we do it to attach things like task/job IDs, Kerberos principals etc.
to all store requests.
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/auditing.html
So worrying about how to pass and manage that at
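For reference, a minimal sketch of that audit-context idea, assuming Hadoop 3.3.2+ where CommonAuditContext (HADOOP-17511) is available; the keys and values here are illustrative:

```scala
import org.apache.hadoop.fs.audit.CommonAuditContext

// Entries in the thread's audit context get attached to subsequent store requests.
val ctx = CommonAuditContext.currentAuditContext()
ctx.put("job.id", "job_20240501_0042")
ctx.put("principal", "alice@EXAMPLE.COM")
```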
Now, if you are ruthless it'd make sense to randomise the order of results
if someone left out the order by, to stop complacency.
like that time Sun changed the ordering in which methods were returned by a
Class.getMethods() call and everyone's JUnit test cases failed if they'd
assumed that ordering
I would recommend cutting them.
+ historically they've fixed the version of the aws-sdk JAR used in spark
releases, meaning the s3a connector through spark rarely used the same SDK
release as the one qualified through the hadoop SDK update process, so if
there were incompatibilities, it'd be up to the spark
On Thu, 1 Jun 2023 at 00:58, Andrew Melo wrote:
> Hi all
>
> I've been developing for some time a Spark DSv2 plugin "Laurelin" (
> https://github.com/spark-root/laurelin
> ) to read the ROOT (https://root.cern) file format (which is used in high
> energy physics). I've recently presented my work
I'd say everyone should, *and* an HTTP UA should be set in all the clients
which make calls of object stores, as it helps with fielding issues there.
The s3a and abfs clients do provide the ability to add params there; please
set them in your deployments
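A sketch of those UA settings, using the documented s3a/abfs option names; the prefix value is illustrative:

```scala
import org.apache.hadoop.conf.Configuration

// Both connectors let you prepend an identifier to the HTTP User-Agent they send.
val conf = new Configuration()
conf.set("fs.s3a.user.agent.prefix", "myapp/1.4")    // s3a: prefixed onto the AWS SDK UA
conf.set("fs.azure.user.agent.prefix", "myapp/1.4")  // abfs equivalent
```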
On Fri, 16 Jun 2023 at 21:53, Dongjoon Hyun wrote:
> Please vote
On Sat, 27 May 2023 at 14:39, 张铎(Duo Zhang) wrote:
> For hbase, the problem is we rely on protobuf 2.5 for our coprocessors.
>
> See HBASE-27436.
>
> Cheng Pan wrote on Wed, May 24, 2023 at 10:00:
>
>> +CC dev@hbase
>>
>> Thanks,
>> Cheng Pan
>>
>> On
+1. the shaded one which is in use also needs upgrading.
> Thanks,
> Cheng Pan
>
>
> On May 17, 2023 at 04:10:43, Dongjoon Hyun wrote:
>
>> Thank you for sharing, Steve.
>>
>> Dongjoon
>>
>> On Tue, May 16, 2023 at 11:44 AM Steve Loughr
I have some bad news here, which is that even though hadoop cut protobuf 2.5
support, the hbase team put it back in (HADOOP-17046). I don't know if the
shaded hadoop client has removed that dependency on protobuf 2.5.
In HADOOP-18487 I want to allow hadoop to cut that dependency, with hbase
having to add
This is truly wonderful.
1. I have an internal patch related to committer stuff I could submit now.
2. if someone wants to look at where FileSystem.open() is used *and you
have the file length, file path, or simply know whether you plan to do
random or sequential IO*, switch to openFile() (a sketch follows below). on
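A sketch of that open() to openFile() switch, assuming Hadoop 3.3.5+ for the fs.option.openfile.read.policy option; the path is illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("s3a://bucket/data/part-0000.parquet")
val fs = path.getFileSystem(new Configuration())
val status = fs.getFileStatus(path)            // reuse one you already have if possible
val in = fs.openFile(path)
  .withFileStatus(status)                      // lets s3a skip its HEAD probe
  .opt("fs.option.openfile.read.policy", "random") // columnar data; "sequential" for scans
  .build()                                     // CompletableFuture[FSDataInputStream]
  .get()
```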
> test
> ```
>
> Yang Jie
>
> From: "Yang,Jie(INF)"
> Date: Tuesday, December 6, 2022 18:27
> To: Steve Loughran
> Cc: Hyukjin Kwon, Apache Spark Dev <dev@spark.apache.org>
> Subject: Re: ma
stleProvider CNFE
>
>
>
> Steve, does the lower version of the scala plugin work for you? If that
> solves it, we could temporarily downgrade for now.
>
>
>
> On Mon, 5 Dec 2022 at 22:23, Steve Loughran
> wrote:
>
> trying to build spark master w/ hadoop trunk and the maven s
trying to build spark master w/ hadoop trunk and the maven sbt plugin is
failing. This doesn't happen with the 3.3.5 RC0;
I note that the only mention of this anywhere was me in March.
Clearly something in hadoop trunk has changed in a way which is
incompatible.
Has anyone else tried such a
the api doesn't get used in the hadoop libraries; not sure about other
dependencies.
probably makes sense to say on the jira that there's no need to panic here;
I've had to start doing that as some of the security scanners appear to
overreact
https://issues.apache.org/jira/browse/HDFS-16766
On
v1 on gcs isn't safe either as promotion from task attempt to
successful task is a dir rename; fast and atomic on hdfs, O(files) and
nonatomic on GCS.
if I can get that hadoop 3.3.5 rc out soon, the manifest committer will be
there to test https://issues.apache.org/jira/browse/MAPREDUCE-7341
On Wed, 5 Oct 2022 at 21:59, Chao Sun wrote:
> +1
>
> > and specifically may allow us to finally move off of the ancient version
> of Guava (?)
>
> I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.
>
hadoop branch-2 has guava dependencies; not sure which one
A key lesson
that sounds suspiciously like something I'd write :)
the move to java8 happened in HADOOP-11858; 3.0.0
HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8" has been
open since 2019 and I just closed as WONTFIX.
Most of the big production hadoop 2 clusters use java7, because that is
ocker file which works
> fine in our environment (with Hadoop 3.1)
>
>
> Regards
> Pralabh Kumar
>
>
>
> On Mon, Jun 13, 2022 at 3:25 PM Steve Loughran
> wrote:
>
>>
>>
>> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar
>> wrote:
>>
On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar wrote:
> Hi Dev team
>
> I have a spark32 image with Java 11 (running Spark on K8s). While
> reading a huge parquet file via spark.read.parquet(""), I am getting
> the following error. The same error is mentioned in Spark docs
>
On Fri, 3 Jun 2022 at 18:46, Martin Grund
wrote:
> Hi Everyone,
>
> We would like to start a discussion on the "Spark Connect" proposal.
> Please find the links below:
>
> *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
> *SPIP Document* -
>
Pralabh, did you follow the URL provided in the exception message? I put a
lot of effort into improving the diagnostics, where the wiki articles are
part of the troubleshooting process
https://issues.apache.org/jira/browse/HADOOP-7469
it's really disappointing when people escalate the problem to
I should back up Dongjoon's comments with the observation that hadoop 2.10.x
is the only branch-2 release which gets any security updates; on branch-3 it
is 3.2.x and 3.3.x which do. Dongjoon's colleague Chao Sun was the release
manager on the 3.3.2 release, so it got thoroughly tested with Spark.
log4j 1.2.17 is not vulnerable. There is an existing CVE there from a log
aggregation servlet; Cloudera products ship a patched release with that
servlet stripped... ASF projects are not allowed to do that.
But: some recent Cloudera Products do include log4j 2.x, so colleagues of
mine are busy
without commenting on any other part of this, note that it was in some hive
commit operations where a race condition in rename surfaced
https://issues.apache.org/jira/browse/HADOOP-16721
if you get odd errors about parent dirs not existing during renames,
that'll be it... upgrade to Hadoop 3.3.1
one of the issues here is that Parquet creates files with overwrite=false;
other output formats do not do this, so implicitly overwrite the output of
previous attempts. Which is fine if you are confident that each task
attempt (henceforth: TA) is writing to an isolated path.
the next iteration of
I haven't looked into much. I'd expect to only
> see that once since it seems to properly reuse a single FileContext
> instance.
>
> Adam
>
> On Fri, Aug 20, 2021 at 2:22 PM Steve Loughran
> wrote:
>
>>
>> ooh, this is fun,
>>
>> v2 isn't safe to use
ooh, this is fun,
v2 isn't safe to use unless every task attempt generates files with exactly
the same names and it is okay to intermingle the output of two task
attempts.
This is because task commit can fail partway through (or worse, the
process can pause for a full GC), and a second attempt
My little committer project, an intermediate manifest committer for Azure
and GCS, is reaching the stage where it's ready for others to look at
https://github.com/apache/hadoop/pull/2971
Goals
1. Correctness even on GCS, which doesn't have atomic dir rename (so v1
isn't safe). It does use
On Mon, 14 Jun 2021 at 19:07, Mich Talebzadeh
wrote:
>
>
> Now I am trying to read it in Hive
>
> 0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
>
(can't reply to user@, so pulling @dev instead. sorry)
(can't reply to user@, so pulling @dev instead)
There is no fundamental reason why the hadoop-cloud POM and artifact isn't
built/released by the ASF spark project; I think the effort it took to get
the spark-hadoop-cloud module in at all
Machine readable logs are always good, especially if you can read the
entire logs into an SQL query.
It might be good to use some specific differentiation between
hint/warn/fatal errors in the numbering so that any automated analysis of
the logs can identify the class of an error even if it's an
If you are using kerberized HDFS, the spark principal (or whoever is running the
cluster) has to be declared as a proxy user.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html
Once done, you call the
val ugi = UserGroupInformation.createProxyUser("joe",
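To make the truncated snippet above concrete, a minimal sketch of the proxy-user pattern; "joe" is illustrative:

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

// The kerberos-authenticated service identity impersonates "joe".
val realUser = UserGroupInformation.getLoginUser
val ugi = UserGroupInformation.createProxyUser("joe", realUser)
val fs = ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
  override def run(): FileSystem = FileSystem.get(new Configuration())
})
// all IO through fs now executes as "joe"
```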
On Thu, 11 Mar 2021 at 19:58, Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:
> I agree with you to extend the documentation around this. Moreover I
> support to have specific unit tests for this.
>
> > There is clearly some demand for Spark to automatically clean up
> checkpoints on
On Mon, 7 Dec 2020 at 07:36, Chang Chen wrote:
> Since S3A now works perfectly with S3Guard turned off, could the Magic
> Committer work with S3Guard off? If yes, will performance degrade? Or
> if HADOOP-17400 is fixed, will it have comparable performance?
>
Yes, works really well.
* It
as sent to hadoop-general.
TL;DR. S3 is consistent; S3A now works perfectly with S3Guard turned off,
if not, file a JIRA. rename still isn't real, so don't rely on that or
create(path, overwrite=false) for atomic operations
---
If you've missed the announcement, AWS S3 storage is now
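A sketch of why create(path, overwrite=false) can't be used as a lock on s3a: the existence probe runs in create(), but the object only manifests at close(), so two workers can both pass the probe and both "succeed". The path is illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val lock = new Path("s3a://bucket/app/.lock")
val fs = lock.getFileSystem(new Configuration())
val out = fs.create(lock, false) // the overwrite=false check happens here...
out.close()                      // ...but the object only becomes visible here
```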
I'm staring at https://issues.apache.org/jira/browse/HADOOP-17372 and a
stack trace which claims that a com.amazonaws class doesn't implement an
interface which it very much does
2020-11-10 05:27:33,517 [ScalaTest-main-running-S3DataFrameExampleSuite]
WARN fs.FileSystem
On Mon, 12 Oct 2020 at 19:06, Sean Owen wrote:
> I don't have a good answer, Steve may know more, but from looking at
> dependency:tree, it looks mostly like it's hadoop-common that's at issue.
> Without -Phive it remains 'provided' in the assembly/ module, but -Phive
> causes it to come back
s. Could you give us some
> examples specifically?
>
> > Can I suggest some common prefix for third-party-classes put into the
> spark package tree, just to make clear that they are external contributions?
>
> Bests,
> Dongjoon.
>
>
> On Mon, Sep 21, 2020 at 6:29 AM St
I've just been stack-trace-chasing the 404-in-task-commit code:
https://issues.apache.org/jira/browse/HADOOP-17216
And although it's got an org.apache.spark. prefix, it's
actually org.apache.spark.sql.delta, which lives in github, so the
code/issue tracker lives elsewhere.
I understand why
On Wed, 22 Jul 2020 at 18:50, Holden Karau wrote:
> Wonderful. To be clear the patch is more to start the discussion about how
> we want to do it and less what I think is the right way.
>
>
be happy to give a quick online tour of ongoing work on S3A enhancements
some time next week, get feedback
On Wed, 22 Jul 2020 at 00:51, Holden Karau wrote:
> Hi Folks,
>
> In Spark SQL there is the ability to have Spark do its partition
> discovery/file listing in parallel on the worker nodes and also avoid
> locality lookups. I'd like to expose this in core, but given the Hadoop
> APIs it's a bit
On Sun, 12 Jul 2020 at 01:45, gpongracz wrote:
> As someone who mainly operates in AWS it would be very welcome to have the
> option to use an updated version of hadoop using pyspark sourced from pypi.
>
> Acknowledging the issues of backwards compatibility...
>
> The most vexing issue is the
al.io.cloud.PathOutputCommitProtocol");
> hadoopConfiguration.set("spark.sql.parquet.output.committer.class",
> "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter");
> hadoopConfiguration.set("fs.s3a.connection.maximum",
> Integer.toString(coreCount
https://issues.apache.org/jira/browse/MAPREDUCE-7282
"MR v2 commit algorithm is dangerous, should be deprecated and not the
default"
someone do a PR to change the default & if it doesn't break too much I'll
merge it
On Mon, 29 Jun 2020 at 13:20, Steve Loughran wrote:
> v2 do
Here's a class which lets you provide a function on a row-by-row basis to
declare location
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala
needs to be in o.a.spark as something
you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the
same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is
doomed. Using a different aws sdk jar is a bit risky, though more recent
upgrades have all been fairly low stress
On Fri, 19 Jun 2020 at 05:39, murat
rowse/SPARK-20107?focusedCommentId=15945177=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945177
>
> I also did discuss this a bit with Steve Loughran and his opinion was that
> v2 should just be deprecated all together. I believe he was going to bring
> that up
On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch
wrote:
> Hi,
>
> Sorry to dig out this thread but this bug is still present.
>
> The fix proposed in this thread (creating a new FileSystem implementation
> which sorts listed files) was rejected, with the suggestion that it is the
>
On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian wrote:
> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All the other spark-* artifacts published to
>
On Thu, Nov 21, 2019 at 12:53 AM Dongjoon Hyun
wrote:
> Thank you for much thoughtful clarification. I agree with your all options.
>
> Especially, for Hive Metastore connection, `Hive isolated client loader`
> is also important with Hive 2.3 because Hive 2.3 client cannot talk with
> Hive 2.1
The ASF PR team would like something like "Spark now supports ARM" in
press releases. And don't forget: they do like to be involved in the
launch of the final release.
On Fri, Nov 15, 2019 at 9:46 AM bo zhaobo
wrote:
> Hi @Sean Owen ,
>
> Thanks for your idea.
>
> We may use the bad
need some more
> efforts to investigate as well.
>
> On Fri, 15 Nov 2019, 20:56 Steve Loughran,
> wrote:
>
>> Junit5: Display names.
>>
>> Goes all the way to the XML.
>>
>>
>> https://junit.org/junit5/docs/current/user-guide/#writ
Can I take this moment to remind everyone that the version of hive which
spark has historically bundled (the org.spark-project one) is an orphan
project put together to deal with Hive's shading issues and a source of
unhappiness in the Hive project. Whatever gets shipped should do its best
to
Junit5: Display names.
Goes all the way to the XML.
https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names
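For anyone who hasn't used it, a sketch of what that looks like; the suite and test are illustrative, and the display name is what surfaces in the XML reports:

```scala
import org.junit.jupiter.api.{DisplayName, Test}

class CommitterSuite {
  @Test
  @DisplayName("task commit is atomic even when the first attempt is retried")
  def taskCommitIsAtomic(): Unit = {
    // assertions go here
  }
}
```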
On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu
wrote:
> Should we also add a guideline for non Scala tests? Other languages (Java,
> Python, R) don't support
> spark.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl",
"org.apache.hadoop.fs.s3a.S3AFileSystem")
This is some superstition which seems to get carried through Stack Overflow
articles. You do not need to declare the implementation class for s3a://
any more than you have to do for
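That is: with hadoop-aws and its matching aws-sdk on the classpath, this is all that's needed (spark-shell style, bucket name illustrative); no fs.s3a.impl setting anywhere:

```scala
// The s3a filesystem implementation is discovered from the classpath.
val df = spark.read.parquet("s3a://my-bucket/input/")
```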
On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas
wrote:
> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
> wrote:
>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>
> I second this. If we need to
I'd move spark's branch-2 line to 2.9.x as
(a) spark's version of httpclient hits a bug in the AWS SDK used in
hadoop-2.8 unless you revert that patch
https://issues.apache.org/jira/browse/SPARK-22919
(b) there's only one future version of 2.8.x planned, which is expected once
myself or someone
On Mon, Oct 28, 2019 at 3:40 PM Sean Owen wrote:
> There will be a "Hadoop 3.x" version of 3.0, as it's essential to get
> a JDK 11-compatible build. you can see the hadoop-3.2 profile.
> hadoop-aws is pulled in in the hadoop-cloud module I believe, so bears
> checking whether the profile
What is the current default value? As the 2.x releases are becoming EOL:
2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
inevitably be surprises.
One issue about using an older version is that any
On Fri, Oct 25, 2019 at 2:56 AM Dongjoon Hyun
wrote:
>
> All versions of JDK8 are not the same naturally. For example, Hadoop
> community also have the following document although they are not specifying
> the minimum versions.
>
> -
>
appens during FileInputFormat scans, so is how I'm going
to tune IOPs there. It might also be good to have those bits of the hadoop
MR classes which spark uses to log internally @ debug, so everything gets
this logging if they ask for it.
Happy to take contribs there as Hadoop JIRAs &
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen wrote:
> I think the problem is calling globStatus to expand all 300K files.
> This is a general problem for object stores and huge numbers of files.
> Steve L. may have better thoughts on real solutions. But you might
> consider, if possible, running a
Zookeeper client is/was netty 3, AFAIK, so if you want to use it for
anything, it ends up on the CP
On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu
wrote:
> Yep, historical reasons. And Netty 4 is under another namespace, so we can
> use Netty 3 and Netty 4 in the same JVM.
>
> On Tue, Sep 3,
Fairly neglected hadoop patch, FWIW;
https://issues.apache.org/jira/browse/HADOOP-11890
FB have been running HDFS on IPv6 for a while, but their codebase has
diverged; getting the stuff into trunk is going to take effort. At least
the JDK has moved on and should be better
On Wed, Jul 17, 2019
Where is this list of files coming from?
If you made the list, then yes, the expectation is generally "supply a list
of files which are present" on the basis that general convention is
"missing files are considered bad"
Though you could try setting spark.sql.files.ignoreCorruptFiles=true to see
in upstream community, it brings confidence to end users and
> customers when they plan to deploy these projects on ARM.
>
> This is absolutely long-term work; let's make it step by step: CI,
> testing, issue resolving.
>
> Steve Loughran wrote on Thu, Jun 27, 2019 at 9:22 PM:
>
>>
leveldb and native codecs are invariably a problem here, as is anything
else doing misaligned IO. Protobuf has also had "issues" in the past;
see https://issues.apache.org/jira/browse/HADOOP-16100
I think any AArch64 work is going to have to define very clearly what "works"
is defined as; spark
be aware that older java 8 versions count the # of cores in the host, not
those allocated for the container they run in
https://bugs.openjdk.java.net/browse/JDK-8140793
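What that bug bites, as a sketch: on affected JDK 8 builds the first line reports the host's core count from inside a container, so anything sized off it oversubscribes:

```scala
import java.util.concurrent.Executors

val cores = Runtime.getRuntime.availableProcessors() // host cores on affected JDK 8 builds
val pool = Executors.newFixedThreadPool(cores)       // oversized pool in a small container
```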
On Tue, Jun 18, 2019 at 8:13 PM Ilya Matiach
wrote:
> Hi Andrew,
>
> I tried to do something similar to that in the LightGBM
>
hadoop is still on 1.7.7 branch. A move to 1.9 would probably be as painful
as a move to 1.8.x, so submit a patch for hadoop trunk. Last PR there
wasn't quite ready and I didn't get any follow up to the "what is this
going to break" question
https://issues.apache.org/jira/browse/HADOOP-13386
On Tue, Apr 2, 2019 at 9:39 PM Ankur Gupta wrote:
> Hi Steve,
>
> Thanks for your feedback. From your email, I could gather the following
> two important points:
>
> 1. Report failures to something (cluster manager) which can opt to
>    destroy the node and request a new one
> 2.
On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin wrote:
> We tried enabling blacklisting for some customers and in the cloud, very
> quickly they end up having 0 executors due to various transient errors. So
> unfortunately I think the current implementation is terrible for cloud
> deployments, and
you might want to look at the work on FPGA resources; again it should just
be a resource available by a scheduler. Key thing is probably just to keep
the docs generic
https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingFPGA.html
I don't know where you get those FPGAs to play
e to care about the persistence format *or which
app created the data*
What does Arrow do in this world, incidentally?
On 2 Jan 2019, at 11:48, Steve Loughran <ste...@hortonworks.com> wrote:
On 17 Dec 2018, at 17:44, Zoltan Ivanfi <z...@cloudera.com.INVALID> wrote:
Hi,
On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan <cloud0...@gmail.com> wrote:
Shall we include Parquet and ORC? If they don't support it, it's hard for
general query engines like Spark to support it.
> On 16 Oct 2018, at 22:06, t4 wrote:
>
> has anyone got spark jars working with hadoop 3.1 that they can share? I am
> looking to be able to use the latest hadoop-aws fixes from v3.1
we do, but we do with
* a patched hive JAR
* building spark with
Randomized testing can, in theory, help you explore a far larger area of the
environment of an app than you could explicitly explore, such as "does
everything work in the Turkish locale where "I".toLowerCase() != "i"", etc.
Good: faster tests, especially on an essentially-non-finite set of options
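The Turkish-locale trap spelled out, runnable as-is:

```scala
import java.util.Locale

assert("I".toLowerCase(Locale.ROOT) == "i")
assert("I".toLowerCase(new Locale("tr")) == "\u0131") // dotless ı, not "i"
```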
On 2 Oct 2018, at 04:44, tigerquoll <tigerqu...@outlook.com> wrote:
Hi Steve,
I think that passing a kerberos keytab around is one of those bad ideas that
is entirely appropriate to re-question every single time you come across it.
It has been used already in spark when interacting with
On 30 Sep 2018, at 19:37, Jacek Laskowski <ja...@japila.pl> wrote:
scala> spark.range(1).write.saveAsTable("demo")
2018-09-30 17:44:27 WARN ObjectStore:568 - Failed to get database global_temp,
returning NoSuchObjectException
2018-09-30 17:44:28 ERROR FileOutputCommitter:314 - Mkdirs
On 11 Aug 2018, at 17:33, chandan prakash <chandanbaran...@gmail.com> wrote:
Hi All,
I was going through this pull request about new CheckpointFileManager
abstraction in structured streaming coming in 2.4 :
https://issues.apache.org/jira/browse/SPARK-23966
> On 25 Sep 2018, at 07:52, tigerquoll wrote:
>
> To give some Kerberos specific examples, the spark-submit args:
> --conf spark.yarn.keytab=path_to_keytab --conf
> spark.yarn.principal=princi...@realm.com
>
> are currently not passed through to the data sources.
>
>
>
I'm not sure why
t more memory with -J-Xmx2g or
whatever. If you're running ./build/mvn and letting it run zinc we might need
to increase the memory that it requests in the script.
On Tue, Aug 14, 2018 at 2:56 PM Steve Loughran <ste...@hortonworks.com> wrote:
Is anyone else getting the sql module
Is anyone else getting the sql module maven build on master branch failing when
you use zinc for incremental builds?
[warn] ^
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
scala.tools.nsc.backend.icode.GenICode$Scope.<init>(GenICode.scala:2225)
at
CSV with schema inference is a full read of the data, so that could be one of
the problems. Do it at most once, print out the schema and use it from then on
during ingress & use something else for persistence
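A sketch of the "infer at most once" advice; paths are illustrative:

```scala
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")  // this option is the full read of the data
  .csv("s3a://bucket/sample/")
println(inferred.schema)          // print/record the schema once

val df = spark.read
  .option("header", "true")
  .schema(inferred.schema)        // reuse it: no second inference scan
  .csv("s3a://bucket/full/")
```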
On 6 Aug 2018, at 05:44, makatun <d.i.maka...@gmail.com> wrote:
FYI, there's some initial exploring of what it would take to move the HDFS wire
protocol from HTrace to OpenTracing for tracing, and wire up the other
stores too
https://issues.apache.org/jira/browse/HADOOP-15566
If anyone has any input/insight or code review capacity, it'd be
following up after a ref to this in
https://issues.apache.org/jira/browse/HADOOP-15559
the AWS SDK is a very fast moving project, with a release cycle of ~2 weeks,
but it's in the state Fred Brooks described, "the number of bugs is constant,
they just move around"; bumping up an AWS release
> On 21 May 2018, at 17:20, Marcelo Vanzin wrote:
>
> Is there a way to trigger it conditionally? e.g. only if the diff
> touches java files.
>
what about adding it as another command alongside "jenkins
test this please", something like "lint this
sorry, not noticed this followup. Been busy with other issues
On 3 Apr 2018, at 11:19, cane wrote:
Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause
data loss.
I checked the comment of this API:
We should
On 5 Apr 2018, at 18:04, Matei Zaharia wrote:
Java 9/10 support would be great to add as well.
Be aware that the work moving hadoop core to java 9+ is still a big piece of
work being undertaken by Akira Ajisaka & colleagues at NTT
> On 3 Apr 2018, at 11:19, cane wrote:
>
> Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause
> data loss.
> I checked the comment of this API:
>
> We should make sure our tasks are idempotent when speculation is enabled,
> i.e. do
> * not
On 3 Apr 2018, at 01:30, Saisai Shao wrote:
Yes, the main blocking issue is that the hive version used in Spark (1.2.1.spark)
doesn't support running on Hadoop 3. Hive will check the Hadoop version at
runtime [1]. Besides this I think some
On 14 Feb 2018, at 19:56, Tayyebi, Ameen wrote:
Newbie question:
I want to add system/integration tests for the new functionality. There are a
set of existing tests around Spark Catalog that I can leverage. Great. The
provider I’m writing is
On 14 Feb 2018, at 13:51, Tayyebi, Ameen wrote:
Thanks a lot Steve. I'll go through the JIRAs you linked in detail. I took a
quick look and am sufficiently scared for now. I had run into that warning from
the S3 stream before. Sigh.
things
might be coming in transitively
https://issues.apache.org/jira/browse/HADOOP-14799
On 13 Feb 2018, at 18:18, PJ Fanning wrote:
Hi Sujith,
I didn't find the nimbusds dependency in any spark 2.2 jars. Maybe I missed
something. Could you tell us
1.11.199 because it didn't have any issues that
we hadn't already got under control
(https://github.com/aws/aws-sdk-java/issues/1211)
Like I said: upgrades bring fear
From: Steve Loughran <ste...@hortonworks.com>
Date: Tuesday, February 13, 2018
On 13 Feb 2018, at 19:50, Tayyebi, Ameen wrote:
The biggest challenge is that I had to upgrade the AWS SDK to a newer version
so that it includes the Glue client, since Glue is a new service. So far, I
haven't seen any jar hell issues, but
On 12 Feb 2018, at 20:21, Ryan Blue wrote:
I wouldn't say we have a primary failure mode that we deal with. What we
concluded was that all the schemes we came up with to avoid corruption couldn't
cover all cases. For example, what about when
I'd advocate 2.7 over 2.6, primarily due to Kerberos and JVM versions
2.6 is not even qualified for Java 7, let alone Java 8: you've got no
guarantees that things work on the min Java version Spark requires.
Kerberos is always the failure point here, as well as various libraries (jetty)
which
On 12 Feb 2018, at 19:35, Dong Jiang wrote:
I got no error messages from EMR. We write directly from dataframe to S3. There
doesn't appear to be an issue with the S3 file; we can still download the
parquet file and read most of the columns, just one