Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-20 Thread Sean Owen
(Only the PMC can veto a release) That doesn't look like a regression. I get that it's important, but I don't see that it should block this release. On Tue, Mar 19, 2019 at 11:00 PM Darcy Shen wrote: > > -1 > > please backport SPARK-27160, a correctness issue about ORC native reader. > > see

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-19 Thread Sean Owen
This looks like a great level of detail. The broad strokes look good to me. I'm happy with just about any story around what to do with Mesos GPU support now, but it might at least deserve a mention: does the existing Mesos config simply become a deprecated alias for the

Re: [build system] jenkins wedged again, rebooting master node

2019-03-17 Thread Sean Owen
It's wedged again since this morning. Something's clearly gone wrong-er than usual; any recent changes that could be a culprit? On Fri, Mar 15, 2019 at 12:08 PM shane knapp wrote: > > well, that box rebooted in record time! we're back up and building. > > and as always, i'll keep a close eye

Re: [build system] jenkins wedged again, rebooting master node

2019-03-15 Thread Sean Owen
It's not responding again. Is there any way to kick it harder? I know it's well understood but this means not much can be merged in Spark On Fri, Mar 15, 2019 at 12:08 PM shane knapp wrote: > > well, that box rebooted in record time! we're back up and building. > > and as always, i'll keep a

Re: PR process

2019-03-15 Thread Sean Owen
Your best bet is to try to ping people who wrote the code that is changing. Jose, have you looked at this part, or Cody? On Fri, Mar 15, 2019 at 8:13 AM Tomas Bartalos wrote: > > Hello, > > I've contributed a PR https://github.com/apache/spark/pull/23749/. I think it > is an interesting feature

Scala type checking thread-safety issue, and global locks to resolve it

2019-03-14 Thread Sean Owen
This is worth a look: https://github.com/apache/spark/pull/24085 Scala has a surprising thread-safety bug in the "<:<" operator that's used to check subtypes, which can lead to incorrect results in non-trivial situations. The fix on the table is to introduce a global lock to protect a lot of the
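
The actual fix is Scala code in the PR above; purely to illustrate the pattern being proposed, here is a minimal Python sketch of routing a non-thread-safe check through one global lock (names are made up for illustration):

```python
import threading

# Illustration only: serialize every caller of a non-thread-safe check
# through a single global lock, trading some concurrency for correctness.
_subtype_lock = threading.Lock()

def unsafe_is_subtype(a, b):
    # Stand-in for Scala's `<:<`; Python's own check is thread-safe,
    # so this is purely a demonstration of the locking pattern.
    return issubclass(a, b)

def is_subtype(a, b):
    with _subtype_lock:
        return unsafe_is_subtype(a, b)

print(is_subtype(bool, int))  # True
```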

Re: [discuss] 2.4.1-rcX release, k8s client PRs, build system infrastructure update

2019-03-13 Thread Sean Owen
I'm OK with this take. The problem with back-porting the client update to 2.4.x at all is that it drops support for some old-but-not-that-old K8S versions, which feels surprising in a maintenance release. That said, maybe it's OK, and a little more OK for a 2.4.2 in several months' time. On Wed,

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-12 Thread Sean Owen
I don't think we'd fail the current RC for this change, no. On Tue, Mar 12, 2019 at 3:51 AM Jakub Wozniak wrote: > > Hello, > > Any more thoughts on this one? > Will that be let in 2.4.1 or rather not? > > Thanks in advance, > Jakub > > > On 8 Mar 2019, at 11:26, Jakub Wozniak wrote: > > Hi, >

Re: Java 11 support

2019-03-11 Thread Sean Owen
> On Tue, Nov 6, 2018 at 9:16 AM Felix Cheung wrote: >> >> +1 for Spark 3, definitely >> Thanks for the updates >> >> >> >> From: Sean Owen >> Sent: Tuesday, November 6, 2018 9:11 AM >> To: Felix Cheung

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Sean Owen
From https://issues.apache.org/jira/browse/SPARK-25588, I'm reading that: - this is a Parquet-Avro version conflict thing - a downstream app wants different versions of Parquet and Avro than Spark uses, which triggers it - it doesn't work in 2.4.0 It's not a regression from 2.4.0, which is the

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-08 Thread Sean Owen
That's weird. I see the commit but can't find it in the branch. Was it pushed, or lost in a force push of 2.4 along the way? The change is there, just under a different commit in the 2.4 branch. It doesn't necessarily invalidate the RC as it is a valid public tagged commit and all that. I just

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
On Thu, Mar 7, 2019 at 3:40 PM Sean Owen wrote: >> >> Are you just applying a function to every row in the DataFrame? you >> don't need pandas at all. Just get the RDD of Row from it and map a >> UDF that makes another Row, and go back to DataFrame. Or make a UDF >>
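
A minimal sketch of the Row round trip described above, with made-up column and function names:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(x=1), Row(x=2), Row(x=3)])

# DataFrame -> RDD[Row] -> map a plain Python function -> DataFrame,
# no pandas involved at any point.
def transform(row):
    return Row(x=row.x * 10)

result = spark.createDataFrame(df.rdd.map(transform))
result.show()
```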

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
batch inference in spark, > https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108 > > Which i have to do the groupBy in order to use the apply function, i'm > wondering why not just enable apply to df ? > > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen wrote:

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
@pandas_udf(df.schema, PandasUDFType.MAP) > def do_nothing(pandas_df): > return pandas_df > > > new_df = df.mapPartition(do_nothing) > ``` > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP? > > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen wrote: >>

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you looking for @pandas_udf in Python? Or just mapPartition? Those exist already On Thu, Mar 7, 2019, 1:43 PM peng yu wrote: > There is a nice map_partition function in R `dapply`. so that user can > pass a row to udf. > > I'm wondering why we don't have that in python? > > I'm trying to
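
Both existing routes, sketched against the 2.4-era API under discussion (the grouped-map pandas UDF form is the one debated in the rest of this thread):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

# Route 1: grouped-map pandas UDF; each group arrives as a pandas DataFrame.
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def identity(pdf):
    return pdf

df.groupBy("x").apply(identity).show()

# Route 2: plain mapPartitions on the underlying RDD, no pandas at all.
print(df.rdd.mapPartitions(lambda rows: rows).count())
```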

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Sean Owen
>>> up with old client versions on branches still supported like 2.4.x in the >>> future. >>> That gives us no choice but to upgrade, if we want to be on the safe side. >>> We have tested 3.0.0 with 1.11 internally and it works but I don't know what >>>

Re: Two spark applications listen on same port on same machine

2019-03-06 Thread Sean Owen
Two drivers can't be listening on port 4040 at the same time -- on the same machine. The OS wouldn't allow it. Are they actually on different machines or somehow different interfaces? or are you saying the reported port is wrong? On Wed, Mar 6, 2019 at 12:23 PM Moein Hosseini wrote: > I've
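
For context: the driver UI binds 4040 and, if it is already taken, retries 4041, 4042, and so on (bounded by spark.port.maxRetries); it can also be pinned per application. A small sketch:

```python
from pyspark.sql import SparkSession

# Pin the UI port explicitly rather than relying on the automatic
# 4040 -> 4041 -> ... fallback when another driver holds 4040.
spark = (SparkSession.builder
         .appName("second-driver")
         .config("spark.ui.port", "4050")
         .getOrCreate())
print(spark.sparkContext.uiWebUrl)
```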

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Sean Owen
If the old client is basically unusable with the versions of K8S people mostly use now, and the new client still works with older versions, I could see including this in 2.4.1. Looking at https://github.com/fabric8io/kubernetes-client#compatibility-matrix it seems like the 4.1.1 client is needed

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Sean Owen
The problem is that that's a major dependency upgrade in a maintenance release. It didn't seem to work when we applied it to master. I don't think it would block a release. On Wed, Mar 6, 2019 at 6:32 AM Stavros Kontopoulos wrote: > > We need to resolve this

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Sean Owen
It sounds like there's a discussion about the details coming, which is fine and good. That should maybe also have a VOTE. The debate here is then merely about what and when to call things a SPIP, but that's not important. On Mon, Mar 4, 2019 at 10:23 AM Xiangrui Meng wrote: > I think the two

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Sean Owen
To be clear, those goals sound fine to me. I don't think voting on those two broad points is meaningful, but, does no harm per se. If you mean this is just a check to see if people believe this is broadly worthwhile, then +1 from me. Yes it is. That means we'd want to review something more

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-03 Thread Sean Owen
I think treating SPIPs as this high-level takes away much of the point of VOTEing on them. I'm not sure that's even what Reynold is suggesting elsewhere; we're nowhere near discussing APIs here, just what 'accelerator aware' even generally means. If the scope isn't specified, what are we trying to

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-03 Thread Sean Owen
I'm for this in general, at least a +0. I do think this has to have a story for what to do with the existing Mesos GPU support, which sounds entirely like the spark.task.gpus config here. Maybe it's just a synonym? that kind of thing. Requesting different types of GPUs might be a bridge too far,
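
The two knobs being compared, as named in this thread; the SPIP's final property names were not settled at this point, so treat spark.task.gpus as a placeholder rather than a real API:

```python
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.mesos.gpus.max", "2")  # long-standing Mesos-only GPU request
        .set("spark.task.gpus", "1"))      # working name from the thread, hypothetical
```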

Re: SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Sean Owen
> use cases that require GPU scheduling on Mesos cluster, however, we can still > add Mesos support in the future if we observe valid use cases. > > Thanks! > > Xingbo > > On Fri, Mar 1, 2019 at 10:39 PM, Sean Owen wrote: >> >> Two late breaking questions: >> >> This basic

Re: SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Sean Owen
Two late breaking questions: This basically requires Hadoop 3.1 for YARN support? Mesos support is listed as a non goal but it already has support for requesting GPUs in Spark. That would be 'harmonized' with this implementation even if it's not extended? On Fri, Mar 1, 2019, 7:48 AM Xingbo

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Sean Owen
This is a fine thing to VOTE on. Committers (and community, non-binding) can VOTE on what we like; we just don't do it often where not required because it's a) overkill overhead over simple lazy consensus, and b) it can be hard to say what the binding VOTE binds if it's not a discrete commit or

Re: Request review for long-standing PRs

2019-02-26 Thread Sean Owen
Mr Torres, can you give these a pass please? On Tue, Feb 26, 2019 at 4:38 PM Jungtaek Lim wrote: > > Hi devs, > > sorry to bring this again to mailing list, but you know, ping in Github PR > just doesn't work. > > I have long-standing (created last year) PRs on SS area which already got > over

Re: Request review for long-standing PRs

2019-02-26 Thread Sean Owen
Those aren't bad changes, but they add a lot of code and complexity relative to benefit. I think it's positive that you've gotten people to spend time reviewing them, quite a lot. I don't know whether they should be merged. This isn't a 'bug' though; not all changes should be committed. Simple and

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Sean Owen
Sure, I don't read anyone making these statements though? Let's assume good intent, that "foo should happen" as "my opinion as a member of the community, which is not solely up to me, is that foo should happen". I understand it's possible for a person to make their opinion over-weighted; this

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread Sean Owen
I just commented on the PR -- I personally don't think it's worth removing support for, say, max("foo") over max(col("foo")) or max($"foo") in Scala. We can make breaking changes in Spark 3 but this seems like it would unnecessarily break a lot of code. The string arg is more concise in Python and
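
The two call styles at issue, side by side in PySpark; both resolve to the same aggregate:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as max_

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (5,), (3,)], ["foo"])

df.select(max_("foo")).show()       # string column name, the concise form
df.select(max_(col("foo"))).show()  # explicit Column
```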

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-22 Thread Sean Owen
To your other message: I already see a number of PMC members here. Who's the other entity? The PMC is the thing that says a thing is a release, sure, but this discussion is properly a community one. And here we are, this is lovely to see. (May I remind everyone to casually, sometime, browse the

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Sean Owen
That looks like a change to restore some behavior that was removed in 2.2. It's not directly relevant to a release vote on 2.4.1. See the existing discussion at https://github.com/apache/spark/pull/22144#issuecomment-432258536 It may indeed be a good thing to change but just continue the

Re: merge script stopped working; Python 2/3 input() issue?

2019-02-15 Thread Sean Owen
wrote: > > BTW the main script has this that the website script does not: > > if sys.version < '3': > input = raw_input # noqa > > On Fri, Feb 15, 2019 at 3:55 PM Sean Owen wrote: > > > > I'm seriously confused on this one. The spark-website merge script >

Re: merge script stopped working; Python 2/3 input() issue?

2019-02-15 Thread Sean Owen
working for me, the website one is broken. > > I think it was caused by this dude changing raw_input to input recently: > > commit 8b6e7dceaf5d73de3f92907ceeab8925a2586685 > Author: Sean Owen > Date: Sat Jan 19 19:02:30 2019 -0600 > > More minor style fixes for merge script

merge script stopped working; Python 2/3 input() issue?

2019-02-15 Thread Sean Owen
I'm seriously confused on this one. The spark-website merge script just stopped working for me. It fails on the call to input() that expects a y/n response, saying 'y' isn't defined. Indeed, it seems like Python 2's input() tries to evaluate the input, rather than return a string. Python 3
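
The root cause in miniature, together with the shim quoted in the reply above:

```python
import sys

# Python 2's input() is effectively eval(raw_input()), so typing a bare `y`
# is looked up as a variable name and raises NameError; Python 3's input()
# just returns the text. The usual compatibility shim:
if sys.version_info[0] < 3:
    input = raw_input  # noqa: F821

answer = input("merge? (y/n): ")
if answer.strip().lower() == "y":
    print("merging...")
```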

Re: Compatibility on build-in DateTime functions with Hive/Presto

2019-02-15 Thread Sean Owen
year("1912") == 1912 makes sense; month("1912") == 1 is odd but not wrong. On the one hand, some answer might be better than none. But then, we are trying to match Hive semantics where the SQL standard is silent. Is this actually defined behavior in a SQL standard, or, what does MySQL do? On Fri,

Re: Time to cut an Apache 2.4.1 release?

2019-02-14 Thread Sean Owen
(That may be so, but it may still be correct to revert a change in Spark if necessary to not be exposed to it in the short term. I have no idea whether that's the right thing here or not, just answering the point about why we'd care about a bug in another project. Also, not clear which Hive

Re: Apache Spark git repo moved to gitbox.apache.org

2019-02-13 Thread Sean Owen
https://issues.apache.org/jira/browse/INFRA-17842 Hopefully, it > can be fixed soon. We should let all the committers follow the same way; otherwise, it could break the commit history easily. > > Xiao > > > > > On Mon, Dec 10, 2018 at 8:30 AM, Sean Owen wrote: >> >> P

Re: Time to cut an Apache 2.4.1 release?

2019-02-11 Thread Sean Owen
I support a 2.4.1 release now, yes. SPARK-23539 is a non-trivial improvement, so probably would not be back-ported to 2.4.x. SPARK-26154 does look like a bug whose fix could be back-ported, but that's a big change. I wouldn't hold up 2.4.1 for it, but it could go in if otherwise ready. On Mon,

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-10 Thread Sean Owen
The HiveExternalCatalogVersionsSuite is hard to make robust as it downloads several huge Spark archives. It does try several mirrors and fall back to archive.apache, but, still, plenty of scope for occasional errors. We need to keep this restricted to only testing a few recent Spark versions. On

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-09 Thread Sean Owen
> > I could be wrong though. > > > > From: Ryan Blue > Sent: Friday, February 8, 2019 4:39 PM > To: Sean Owen > Cc: Jungtaek Lim; dev > Subject: Re: [DISCUSS] Change default executor log URLs for YARN > > I'm not sure that many people need thi

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-08 Thread Sean Owen
(There isn't a hard time limit on votes; they just need to give at _least_ 72 hours. Indeed just wait until next week for more votes.) On Fri, Feb 8, 2019 at 7:07 PM Xiao Li wrote: > > Hi, Takeshi, > > Many PMCs are on vacation or offsite during this week. If possible, could you > extend it to

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Sean Owen
Is a flag needed? You know me, I think flags are often failures of design, or disagreement punted to the user. I can understand retaining old behavior under a flag where the behavior change could be problematic for some users or facilitate migration, but this is just a change to some UI links, no?

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Sean Owen
I think that's a reasonable argument, that it provides links to potentially several logs of interest. It reduces the UI clutter a little at the cost of one more hop to get to logs. I don't feel strongly about it but think that's a reasonable thing to do. On Fri, Feb 8, 2019 at 4:57 PM Jungtaek

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-07 Thread Sean Owen
2.4 and EOL on branch 2.3 is coming sooner (in some > months), I wonder if we still want to tackle it in any way. > > On Thu, Feb 7, 2019 at 2:21 PM, Sean Owen wrote: >> >> +1 from me. I built and tested the source release on the same env and >> this time not seeing failure

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-06 Thread Sean Owen
+1 from me. I built and tested the source release on the same env and this time not seeing failures. Good, no idea what happened. I updated Fix Version on JIRAs that were marked as 2.3.4 but went in before the RC2 tag. I'm kinda concerned that this test keeps failing in branch 2.3:

Re: Array indexing functions

2019-02-05 Thread Sean Owen
Is it standard SQL or implemented in Hive? Because UDFs are so relatively easy in Spark we don't need tons of builtins like an RDBMS does. On Tue, Feb 5, 2019, 7:43 AM Petar Zečević wrote: > Hi everybody, > I finally created the JIRA ticket and the pull request for the two array > indexing functions: >
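
To make the "UDFs are easy" point concrete, a throwaway indexing UDF; this is illustrative only, not the two functions proposed in the PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([10, 20, 30],)], ["xs"])

# One-liner stand-in for an array-indexing builtin:
# position of v in xs, or -1 if absent.
index_of = udf(lambda xs, v: xs.index(v) if v in xs else -1, IntegerType())
df.select(index_of("xs", lit(20)).alias("pos")).show()
```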

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Sean Owen
> From: Xiao Li > Sent: Wednesday, January 16, 2019 9:37 AM > To: Ryan Blue > Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev > Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4 > > Thanks for your feedback! > > Working with Yuming to reduce the

Re: Feature request: split dataset based on condition

2019-02-03 Thread Sean Owen
> use it as a semi-action, not just a transformation. Let's think that we have > something like map-partition which accepts multiple lambdas, each one > collecting the rows for its dataset (or something like it). Is it possible? > > On Sat, Feb 2, 2019 at 5:59 PM Sean Owen wrote: >

Re: Feature request: split dataset based on condition

2019-02-02 Thread Sean Owen
I think the problem is that you can't produce multiple Datasets from one source in one operation - consider that reproducing one of them would mean reproducing all of them. You can write a method that would do the filtering multiple times but it wouldn't be faster. What do you have in mind that's
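
What "filtering multiple times" looks like in practice; caching the source at least keeps the second pass from recomputing upstream work. A sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).toDF("x").cache()  # cache so each filter reads from memory

evens = df.filter(df.x % 2 == 0)  # pass 1 over the cached data
odds = df.filter(df.x % 2 != 0)   # pass 2, unavoidable as explained above

print(evens.count(), odds.count())
```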

Re: Make .unpersist() non-blocking by default?

2019-01-28 Thread Sean Owen
so we can > add it to the upgrade guide. > > On Mon, Jan 28, 2019 at 8:47 AM Sean Owen wrote: >> >> Interesting notion at https://github.com/apache/spark/pull/23650 : >> >> .unpersist() takes an optional 'blocking' argument. If true, the call >> waits until the reso

Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-28 Thread Sean Owen
More analysis at https://github.com/apache/spark/pull/23634 It's not a regression, though it does relate to correctness, although somewhat niche. TD, Jose et al, is this a Blocker? and is the fix probably reliable enough to commit now? On Mon, Jan 28, 2019 at 10:59 AM Sandeep Katta wrote: > > I

Make .unpersist() non-blocking by default?

2019-01-28 Thread Sean Owen
Interesting notion at https://github.com/apache/spark/pull/23650 : .unpersist() takes an optional 'blocking' argument. If true, the call waits until the resource is freed. Otherwise it doesn't. The default looks pretty inconsistent: - RDD: true - Broadcast: true - Dataset / DataFrame: false -
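
Passing the flag explicitly sidesteps the inconsistent defaults listed above (those are the Scala-side defaults). A sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).cache()
df.count()  # materialize the cached blocks

df.unpersist(blocking=True)  # wait until the blocks are actually freed
```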

Re: Why outdated third-parties exist on documentation?

2019-01-28 Thread Sean Owen
Nobody's actively monitoring the list or anything. It's also not clear when something has been discontinued; it may still be usable and widely used with no recent project activity. For example ganglia is still, I think, widely used. If you see a project that has totally disappeared or formally

Re: [build system] speeding up maven build building only changed modules compared to master branch

2019-01-26 Thread Sean Owen
Sounds interesting; would it be able to handle R and Python modules built by this project? The home-grown solution here does, I think, and that is helpful. On Sat, Jan 26, 2019, 6:57 AM vaclavkosar wrote: > I think it would be a good idea to use the gitflow-incremental-builder maven > plugin for Spark builds.

Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-24 Thread Sean Owen
removing 6 builds for the following branches: > > 1.6 > 2.0 > 2.1 > 2.3 > 2.4 > master > > in fact, do we even need ANY builds for 1.6, 2.0 and 2.1? > > On Thu, Jan 24, 2019 at 5:57 PM Sean Owen wrote: > >> I think we can just remove this job. >

Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-24 Thread Sean Owen
I think we can just remove this job. On Thu, Jan 24, 2019 at 6:44 PM shane knapp wrote: > > On Sun, Jan 13, 2019 at 11:22 AM Felix Cheung > wrote: >> >> Eh, yeah, like the one with signing, I think doc build is mostly useful when >> a) right before we do a release or during the RC resets; b)

Re: moving the spark jenkins job builder repo from dbricks --> spark

2019-01-24 Thread Sean Owen
Are these docs builds creating the SNAPSHOT docs builds at https://dist.apache.org/repos/dist/dev/spark/ ? I think from a thread last month, these aren't used and should probably just be stopped. On Thu, Jan 24, 2019 at 3:34 PM shane knapp wrote: > > revisiting this thread from october... sorry

Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-23 Thread Sean Owen
I'm not clear if it's a correctness bug from that description, and if it's not a regression, no it does not need to go into 2.3.3. If it's a real bug, sure it can be merged to 2.3.x. On Wed, Jan 23, 2019 at 7:54 AM Anton Okolnychyi wrote: > > Recently, I came across this bug: >

Re: Make proactive check for closure serializability optional?

2019-01-22 Thread Sean Owen
Agree, I'm not pushing for it unless there's other evidence. The closure check does entail serialization, not just checking serializability, note. I don't like flags either but this one sounded like it could actually be something a user wanted to vary, globally, for runs of the same code. On Tue,

Re: Make proactive check for closure serializability optional?

2019-01-21 Thread Sean Owen
Mon, Jan 21, 2019 at 10:04 AM Sean Owen wrote: >> >> The ClosureCleaner proactively checks that closures passed to >> transformations like RDD.map() are serializable, before they're >> executed. It does this by just serializing it with the JavaSerializer. >> >>

Make proactive check for closure serializability optional?

2019-01-21 Thread Sean Owen
The ClosureCleaner proactively checks that closures passed to transformations like RDD.map() are serializable, before they're executed. It does this by just serializing it with the JavaSerializer. That's a nice feature, although there's overhead in always trying to serialize the closure ahead of
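
The JVM-side check has a rough PySpark analogue: serialize the closure eagerly so a bad capture fails at the call site rather than later on an executor. A sketch using the cloudpickle module PySpark ships:

```python
import threading
from pyspark import cloudpickle

def check_serializable(fn):
    cloudpickle.dumps(fn)  # raises here if fn captures something unserializable
    return fn

check_serializable(lambda x: x + 1)  # fine

lock = threading.Lock()  # thread locks cannot be serialized
try:
    check_serializable(lambda x: (lock, x))
except Exception as e:
    print("caught at the definition site:", e)
```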

Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-20 Thread Sean Owen
OK, if it passes tests, I'm +1 on the release. Can anyone else verify the tests pass? What is the reason for a new RC? I didn't see any other issues reported. On Sun, Jan 20, 2019 at 8:03 PM Takeshi Yamamuro wrote: > > Hi, all > > Thanks for the checks, Sean and Felix. > I'll start the next

Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-20 Thread Sean Owen
wrote: > > Hi, sean, > > I run these tests again though, these tests passed on my AWS env. > But, I notice these streaming tests are a little flaky though... > > > On Sat, Jan 19, 2019 at 1:51 AM Sean Owen wrote: >> >> The release itself looks OK. I'm

Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-18 Thread Sean Owen
The release itself looks OK. I'm getting, as before, a lot of errors on the machine I'm building on. Is anyone else seeing this? if not I'm going to scrap the env and try something new. Errors like: - event ordering *** FAILED *** The code passed to failAfter did not complete within 10

Re: How to implement model versions in MLlib?

2019-01-16 Thread Sean Owen
at, I think you could > implement either way without issues if the code is written carefully - but > logically, if I had to choose, I would prefer having a separate versioning > mechanism for models. > Thank you, Ilya > > > > -Original Message- > From: Sean Owen

How to implement model versions in MLlib?

2019-01-16 Thread Sean Owen
I know some implementations of model save/load in MLlib use an explicit version 1.0, 2.0, 3.0 mechanism. I've also seen that some just decide based on the version of Spark that wrote the model. Is one or the other preferred? See https://github.com/apache/spark/pull/23549#discussion_r248318392

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
It's almost certainly needed just to get off the fork of Hive we're not supposed to have. Yes it's going to impact dependencies, so would need to happen at Spark 3. Separately, its usage could be reduced or removed -- this I don't know much about. But it doesn't really make it harder or easier.

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
Unless it's going away entirely, and I don't think it is, we at least have to do this to get off the fork of Hive that's being used now. I do think we want to keep Hive from getting into the core though -- see comments on PR. On Tue, Jan 15, 2019 at 11:44 AM Xiao Li wrote: > > Hi, Yuming, > >

Re: [mllib] Document frequency

2019-01-14 Thread Sean Owen
Yes that seems OK to me. On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri wrote: > > Thanks for the response. So do I go ahead and create a jira ticket? > Can then send a pull request for the same with the changes. > > On Mon, Jan 14, 2019 at 8:18 PM Sean Owen wrote: >> >>

Re: [mllib] Document frequency

2019-01-14 Thread Sean Owen
I think that's reasonable. The caller probably has the number of docs already but sure, it's one long and is already computed. This would have to be added to Pyspark too. On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri wrote: > > Hello. > > As part of `org.apache.spark.ml.feature.IDFModel`, I think
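
What the proposal amounts to on the Python side; docFreq and numDocs are the attributes under discussion here, hypothetical until the change lands:

```python
from pyspark.ml.feature import IDF
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0]),), (Vectors.dense([1.0, 1.0]),)], ["features"])

model = IDF(inputCol="features", outputCol="tfidf").fit(df)
print(model.idf)  # inverse document frequencies, exposed today

# The ask in this thread (hypothetical attributes until merged):
# print(model.docFreq, model.numDocs)
```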

Re: Ask for reviewing on Structured Streaming PRs

2019-01-13 Thread Sean Owen
interest on both contributors and > committers for such module, but SS is not. Maybe either other committers who > weren't familiar with should try to get familiar and cover the area, or the > area needs more committers. > > -Jungtaek Lim (HeartSaVioR) > > On Sun, Jan 13, 2019 at 11:

Re: Ask for reviewing on Structured Streaming PRs

2019-01-13 Thread Sean Owen
Jungtaek, the best strategy is to find who wrote the code you are modifying (use Github history or git blame) and ping them directly on the PR. I don't know this code well myself. It also helps if you can address why the functionality is important, and describe compatibility implications. Most

Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-13 Thread Sean Owen
Jenkins... > > > > From: Dongjoon Hyun > Sent: Saturday, January 12, 2019 4:32 PM > To: Sean Owen > Cc: dev > Subject: Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ? > > +1 for removing old docs there. > It seems that we need to upgrad

Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-12 Thread Sean Owen
I'm not sure it matters a whole lot, but we are encouraged to keep dist.apache.org free of old files. I see tons of old -docs snapshot builds at https://dist.apache.org/repos/dist/dev/spark/ -- can I just remove anything not so current?

Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-10 Thread Sean Owen
Is that the right link? that is marked as a minor bug, maybe. From what you describe it's not a regression from 2.2.2 either. On Thu, Jan 10, 2019 at 6:37 AM Takeshi Yamamuro wrote: > > Hi, Dongjoon, > > We don't need to include https://github.com/apache/spark/pull/23456 in this > release? >

Re: Remove non-Tungsten mode in Spark 3?

2019-01-09 Thread Sean Owen
like a good idea from the standpoint > of reducing cognitive load, and documentation > > On Fri, Jan 4, 2019 at 7:03 AM Sean Owen wrote: > >> OK, maybe leave in tungsten for 3.0. >> I did a quick check, and removing StaticMemoryManager saves a few hundred >> line

Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-09 Thread Sean Owen
fails on Jenkins. > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/591/ > > > On Wed, Jan 9, 2019 at 3:37 PM Sean Owen wrote: >> >> BTW did you run with the same profiles, I wonder; I test with, >> generall

Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-09 Thread Sean Owen
flakiness at 2.2.x era. > For me, those tests pass. (I ran twice before starting a vote and during > this voting from the source tar file) > > Bests, > Dongjoon > > On Wed, Jan 9, 2019 at 1:42 PM Sean Owen wrote: >> >> I wonder if anyone else is seeing the fol

Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-09 Thread Sean Owen
I wonder if anyone else is seeing the following issues, or whether it's specific to my environment: With -Phive-thriftserver, it compiles fine. However during tests, I get ... [error]

Re: Remove non-Tungsten mode in Spark 3?

2019-01-04 Thread Sean Owen
that run with UDFs that > allocate a lot of heap memory, it might not be as good). > > I can see us removing the legacy mode since it's been legacy for a long > time and perhaps very few users need it. How much code does it remove > though? > > > On Thu, Jan 03, 2019 at 2:55 P

Remove non-Tungsten mode in Spark 3?

2019-01-03 Thread Sean Owen
Just wondering if there is a good reason to keep around the pre-tungsten on-heap memory mode for Spark 3, and make spark.memory.offHeap.enabled always true? It would simplify the code somewhat, but I don't feel I'm so aware of the tradeoffs. I know we didn't deprecate it, but it's been off by
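
For reference, the off-heap mode in question is opt-in today and requires an explicit size; a sketch of turning it on:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "1g")
         .getOrCreate())
```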

Re: Apache Spark 2.2.3 ?

2019-01-03 Thread Sean Owen
Yes, that one's not going to be back-ported to 2.3. I think it's fine to proceed with a 2.2 release with what's there now and call it done. Note that Spark 2.3 would be EOL around September of this year. On Thu, Jan 3, 2019 at 2:31 PM Dongjoon Hyun wrote: > Thank you for additional support for

Re: Apache Spark 2.2.3 ?

2019-01-01 Thread Sean Owen
I agree with that logic, and if you're volunteering to do the legwork, I don't see a reason not to cut a final 2.2 release. On Tue, Jan 1, 2019 at 9:19 PM Dongjoon Hyun wrote: > > Hi, All. > > Apache Spark community has a policy maintaining the feature branch for 18 > months. I think it's time

Trigger full GC during executor idle time?

2018-12-31 Thread Sean Owen
https://github.com/apache/spark/pull/23401 Interesting PR; I thought it was not worthwhile until I saw a paper claiming this can speed things up to the tune of 2-6%. Has anyone considered this before? Sean

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Sean Owen
essure myself there too. > > > On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen wrote: > >> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra >> noise. >> >> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin >> wrote: >> >> Hmm,

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
ton. > > On Tue, Dec 11, 2018 at 1:51 AM, Sean Owen wrote: > >> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra >> noise. >> >> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin >> wrote: >> > >> > Hmm, it also seems that gith

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
that (if we can't do it ourselves). > On Mon, Dec 10, 2018 at 9:13 AM Sean Owen wrote: > > > > Update for committers: now that my user ID is synced, I can > > successfully push to remote https://github.com/apache/spark directly. > > Use that as the 'apache' rem

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
instead of using "Close Stale PRs" pull requests. On Mon, Dec 10, 2018 at 10:30 AM Sean Owen wrote: > > Per the thread last week, the Apache Spark repos have migrated from > https://git-wip-us.apache.org/repos/asf to > https://gitbox.apache.org/repos/asf > > > Non-

Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
Per the thread last week, the Apache Spark repos have migrated from https://git-wip-us.apache.org/repos/asf to https://gitbox.apache.org/repos/asf Non-committers: This just means repointing any references to the old repository to the new one. It won't affect you if you were already referencing

Re: Why not setup a Gitter chatroom for Spark contributors

2018-12-09 Thread Sean Owen
I think this has come up before, and the issue is really that it adds yet another channel for people to follow to get 100% of the discussion about the project. I don't believe the project would bless an official chat channel, but, anyone can run an unofficial one of course. On Sun, Dec 9, 2018 at

Fwd: [NOTICE] Mandatory relocation of Apache git repositories on git-wip-us.apache.org

2018-12-07 Thread Sean Owen
See below: Apache projects are migrating to a new git infrastructure, and are seeking projects to volunteer to move earlier than later. I believe Spark should volunteer. This should mostly affect committers, who would need to point to the new remote. It could affect downstream consumers of the

Re: Cannot run program ".../jre/bin/javac": error=2, No such file or directory

2018-12-01 Thread Sean Owen
javac is in $JAVA_HOME/bin/javac on Mac OS installations. It has always worked fine on my Mac and for many other developers. You probably have an env problem, like: that's not actually where java is, or this isn't the JAVA_HOME actually reaching your build. On Sat, Dec 1, 2018 at 9:53 PM wuyi
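
A quick way to check for exactly this kind of env problem; note a JRE has no javac, so JAVA_HOME must point at a full JDK:

```python
import os

java_home = os.environ.get("JAVA_HOME")
print("JAVA_HOME =", java_home)
if java_home:
    javac = os.path.join(java_home, "bin", "javac")
    print("javac present:", os.path.isfile(javac))  # False if JAVA_HOME is a JRE
```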

Re: A user of thincrs has selected this issue. Deadline: Xxx, Xxx X, XXXX XX:XX

2018-12-01 Thread Sean Owen
Some automated tool or something. Unclear from https://www.linkedin.com/company/thincrs I'll reply to ask them to not add automated comments to JIRA. On Sat, Dec 1, 2018 at 8:22 AM Hyukjin Kwon wrote: > > Just out of curiosity, does any one know what kind of account it is? >

Re: invent 2018

2018-11-23 Thread Sean Owen
I will be there on Nov 27/28; you can ping me directly. On Fri, Nov 23, 2018 at 6:20 AM Shmuel Blitz wrote: > > Hi, > > Are any of you, committers or experienced contributors, attending re:invent > next week? > > If so, I would love to meet with you and get some hands-on introduction to >

Re: Automated formatting

2018-11-21 Thread Sean Owen
> I imagine tracking down the corner cases in the config, especially > around interactions with scalastyle, may take a bit of work. Happy to > do it, but not if there's significant concern about style related > changes in PRs. On Wed, Nov 21, 2018 at 2:42 PM Sean Owen wrote:

Re: Automated formatting

2018-11-21 Thread Sean Owen
scalafmt on the most recent PR I looked at, and it caught > stuff as basic as newlines before curly brace in existing code. > I've had different reviewers for PRs that were literal backports or > cut & paste of each other come up with different formatting nits. > > > On Wed, Nov

Re: Scala lint failing unexpectedly

2018-11-21 Thread Sean Owen
I don't see any of the CI builds failing like this. There's an Await.result in the file, but it's suppressed already, and I don't see it at line 269. I don't see an issue like this in recent branches either. You're sure you are working off, say, master, and/or you're looking at the code that it's

Re: Automated formatting

2018-11-21 Thread Sean Owen
I think reformatting the whole code base might be too much. If there are some more targeted cleanups, sure. We do have some links to style guides buried somewhere in the docs, although the conventions are pretty industry standard. I *think* the code is pretty consistently formatted now, and would

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-20 Thread Sean Owen
and I don't > think this will be too disruptive for users since it is already a breaking > change. > > On Tue, Nov 20, 2018 at 7:05 AM Sean Owen wrote: >> >> One more data point -- from looking at the SBT build yesterday, it >> seems like most plugin updates req
