Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Ruifeng Zheng
Do you mean in-memory processing? It works fine if all partitions are small. But when some partitions don’t fit in memory, it will cause OOM. From: Reynold Xin Date: Thursday, February 1, 2018, 3:14 PM To: Ruifeng Zheng Cc: Subject: Re:
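To illustrate the memory concern, here is a minimal sketch of sorting one partition without holding all of it in memory at once: sort fixed-size chunks, spill each sorted run to a temporary file, then merge the runs lazily. This is an illustration only, not Spark's own implementation; the names `external_sort` and `chunk_size` are hypothetical, and the partition is simulated as a plain Python iterable.

```python
import heapq
import pickle
import tempfile
from itertools import islice
from typing import IO, Iterable, Iterator, List


def _read_run(f: IO[bytes]) -> Iterator[int]:
    """Stream records back from one spilled, already-sorted run."""
    f.seek(0)
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_sort(records: Iterable[int], chunk_size: int) -> Iterator[int]:
    """Sort records while keeping at most chunk_size of them in memory
    during the chunk-sorting phase (hypothetical helper)."""
    it = iter(records)
    runs: List[IO[bytes]] = []
    while True:
        # Only one chunk is materialized and sorted at a time.
        chunk = sorted(islice(it, chunk_size))
        if not chunk:
            break
        run = tempfile.TemporaryFile()
        for record in chunk:
            pickle.dump(record, run)
        runs.append(run)
    try:
        # heapq.merge reads one record per run at a time.
        yield from heapq.merge(*(_read_run(run) for run in runs))
    finally:
        for run in runs:
            run.close()


result = list(external_sort([5, 1, 4, 9, 7, 3, 8, 2], chunk_size=3))
print(result)
```

The trade-off is extra disk I/O in exchange for bounded memory use; the purely in-memory variant is simpler and faster when every partition is small.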

Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Reynold Xin
You can just do that with mapPartitions pretty easily, can’t you? On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng wrote: > Hi all: > > 1, Dataset API supports operation “sortWithinPartitions”, but in > RDD API there is no counterpart (I know there is >
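A minimal sketch of the mapPartitions approach suggested here, assuming each partition fits in memory. The helper name `sort_partition` is hypothetical, and the partitions are simulated as plain Python lists so that the snippet runs without Spark.

```python
from typing import Iterable, Iterator, List


def sort_partition(records: Iterable[int]) -> Iterator[int]:
    """Sort one partition's records entirely in memory (hypothetical helper)."""
    # sorted() materializes the whole partition, which is why this
    # approach can run out of memory on very large partitions.
    return iter(sorted(records))


# Simulate an RDD with three partitions as a list of lists.
partitions: List[List[int]] = [[5, 1, 4], [9, 7], [3, 8, 2]]

# Apply the function to each partition independently; no records move
# between partitions, so the partitioning itself is left untouched.
sorted_partitions = [list(sort_partition(p)) for p in partitions]
print(sorted_partitions)
```

With PySpark, the same function could be passed as `rdd.mapPartitions(sort_partition, preservesPartitioning=True)`, which avoids the shuffle that “repartitionAndSortWithinPartitions” would trigger.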

[Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Ruifeng Zheng
Hi all: 1. The Dataset API supports the operation “sortWithinPartitions”, but the RDD API has no counterpart (I know there is “repartitionAndSortWithinPartitions”, but I don’t want to repartition the RDD), so I have to convert the RDD to a Dataset to use this function. Would it make sense to add a

Re: data source v2 online meetup

2018-01-31 Thread Xiao Li
Hi, Ryan, wow, your Iceberg already uses the data source V2 API! That is pretty cool! I am just afraid these new APIs are not stable. We might deprecate or change some data source v2 APIs in the next version (2.4). Sorry for the inconvenience this might cause. Thanks for your feedback always,

Re: no-reopen-closed?

2018-01-31 Thread Xiao Li
Thanks! Xiao 2018-01-28 11:23 GMT-08:00 Sean Owen : > Nothing would. The ASF would have to ban the account. There wouldn't be a > total solution in any event but this workflow helps solve several of these > rare corner cases. > > On Sun, Jan 28, 2018, 1:19 PM Koert Kuipers

Re: python tests related to pandas are skipped in jenkins

2018-01-31 Thread Yin Huai
I created https://issues.apache.org/jira/browse/SPARK-23292 for this issue. On Wed, Jan 31, 2018 at 8:17 PM, Yin Huai wrote: > btw, seems we also have the same skipping logic for pyarrow. But, I have > not looked into if tests related to pyarrow are skipped or not. > > On

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-31 Thread Yin Huai
seems we are not running tests related to pandas in pyspark tests (see my email "python tests related to pandas are skipped in jenkins"). I think we should fix this test issue and make sure all tests are good before cutting RC3. On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal

Re: python tests related to pandas are skipped in jenkins

2018-01-31 Thread Yin Huai
btw, seems we also have the same skipping logic for pyarrow. But, I have not looked into if tests related to pyarrow are skipped or not. On Wed, Jan 31, 2018 at 8:15 PM, Yin Huai wrote: > Hello, > > I was running python tests and found that pyspark.sql.tests. >

python tests related to pandas are skipped in jenkins

2018-01-31 Thread Yin Huai
Hello, I was running python tests and found that pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types does not run with Python 2 because the test uses

FOSDEM mini-office hour?

2018-01-31 Thread Holden Karau
Hi Spark Friends, If any folks are around for FOSDEM this year I was planning on doing a coffee office hour on the last day after my talks. Maybe like 6pm? I'm also going to see if any BEAM folks are around and interested :) Cheers,

StateStoreRestoreExec in case of executor/driver failures

2018-01-31 Thread Yogesh Mahajan
Hi, In the case of the structured streaming StateStore, suppose I plug in an embedded store like RocksDB to maintain per-partition aggregation state inside my executors, with state maintained per partitionId and operatorId. How do I reuse/map to the old state for the cases of executor failures or

Re: data source v2 online meetup

2018-01-31 Thread Ryan Blue
Thanks for suggesting this, I think it's a great idea. I'll definitely attend and can talk about the changes that we've made to DataSourceV2 to enable our new table format, Iceberg. On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin

data source v2 online meetup

2018-01-31 Thread Reynold Xin
The data source v2 API is one of the larger changes in Spark 2.3. What has already been committed is only the first version, and we'd need more work post-2.3 to improve and stabilize it. I think at this point we should stop making changes to it in branch-2.3, and instead focus on

Re: Max number of streams supported ?

2018-01-31 Thread Michael Armbrust
-dev +user > Similarly for structured streaming, would there be any limit on the number of > streaming sources I can have? > There is no fundamental limit, but each stream will have a thread on the driver that is doing coordination of execution. We comfortably run 20+ streams on a single

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-31 Thread Sameer Agarwal
Just a quick status update on RC3 -- SPARK-23274 was resolved yesterday and tests have been quite healthy throughout this week and the last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202

Re: Why are DataFrames always read with nullable=True?

2018-01-31 Thread Marek Novotny
Hi, I would like to ask whether there is still a plan to solve this nullability problem when reading data from Parquet files. I've noticed that the related JIRA ticket SPARK-19950 is still in progress and the PR #17293