Unsubscribe

2018-08-20 Thread Happy??????

Re: [DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Russell Spitzer
I'm not sure I follow what the discussion topic is here. > For example, a Cassandra catalog or a JDBC catalog that exposes tables in those systems will definitely not support users marking tables with the “parquet” data source. I don't understand why a Cassandra catalog wouldn't be able to

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread Manu Zhang
Hi Makatun, For 2, I guess `cache` will break up the logical plan and force it to be analyzed. For 3, I have a similar observation here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015. Each `withColumn` forces the logical plan to be analyzed again, which is not free.
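The cost pattern described above can be sketched with a toy model (this is not Spark's actual analyzer, just an illustration of the assumption that each `withColumn` re-analyzes the whole plan while a single `select` analyzes it once):

```python
# Toy model: plan "analysis" costs ~ one unit per column. Chaining N
# withColumn-style calls pays that cost after every call; a single
# select-style call pays it once at the end.

def analyze(columns):
    """Stand-in for logical-plan analysis; cost grows with column count."""
    return len(columns)

def add_columns_one_by_one(columns, new_cols):
    cost = 0
    for c in new_cols:
        columns = columns + [c]
        cost += analyze(columns)   # every call triggers a fresh analysis
    return columns, cost

def add_columns_in_one_select(columns, new_cols):
    columns = columns + list(new_cols)
    return columns, analyze(columns)  # one call, one analysis

base = [f"c{i}" for i in range(100)]
new = [f"n{i}" for i in range(100)]
_, cost_loop = add_columns_one_by_one(base, new)
_, cost_select = add_columns_in_one_select(base, new)
print(cost_loop, cost_select)  # 15050 200 -- the loop's cost is quadratic
```

Under this model, adding C columns one at a time costs O(C^2) analysis work, which matches the polynomial blow-up the thread observes on wide DataFrames.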

Re: Spark Kafka adapter questions

2018-08-20 Thread Ted Yu
After a brief check, I found KAFKA-5649, where an almost identical error was reported. There is also KAFKA-3702, which is related but currently open. I will dig some more to see what I can find. Cheers On Mon, Aug 20, 2018 at 3:53 PM Basil Hariri wrote: > I am pretty sure I got those changes with

RE: Spark Kafka adapter questions

2018-08-20 Thread Basil Hariri
I am pretty sure I got those changes with the jar I compiled (I pulled from master on 8/8, and it looks like SPARK-18057 was resolved on 8/3), but no luck; here is a copy-paste of the error I’m seeing. The semantics for Event Hubs’ Kafka head are highlighted for reference – we connect to port 9093
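Since the connection goes to port 9093, it may be worth double-checking the client settings Event Hubs' Kafka endpoint typically requires. This is a sketch based on Azure's published guidance, not on details from this thread; the namespace name and connection string are placeholders:

```properties
# Assumed Kafka client settings for an Event Hubs Kafka endpoint.
# "mynamespace" and the connection string are placeholders.
bootstrap.servers=mynamespace.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="$ConnectionString" \
  password="Endpoint=sb://mynamespace.servicebus.windows.net/;...";
```

A mismatch in `security.protocol` or `sasl.mechanism` is a common cause of handshake errors against SASL_SSL-only endpoints like this one.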

Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread Li Jin
Thanks for looking into this Shane. If we can only have a single python 3 version, I agree 3.6 would be better than 3.5. Otherwise, ideally I think it would be nice to test all supported 3.x versions (latest micros should be fine). On Mon, Aug 20, 2018 at 7:07 PM shane knapp wrote: > initially,

Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread shane knapp
initially, i'd like to just choose one version to have the primary tests against, but i'm also not opposed to supporting more of a matrix. the biggest problem i see w/this approach, however, is that of build monitoring and long-term ownership. this is why we have a relatively restrictive current

Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread Bryan Cutler
Thanks for looking into this Shane! If we are choosing a single python 3.x, I think 3.6 would be good. It might still be nice to test against other versions too, so we can catch any issues. Is it possible to have more exhaustive testing as part of a nightly or scheduled build? As a point of

Re: best way to run one python test?

2018-08-20 Thread Hyukjin Kwon
In my experience it's usually okay, but it's still an informal way, as far as I can tell. On Mon, 20 Aug 2018, 11:24 pm Imran Rashid, wrote: > thanks, that helps! > > So run-tests.py adds a bunch more env variables: > https://github.com/apache/spark/blob/master/python/run-tests.py#L74-L97 > >

Re: [DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Ryan Blue
Thanks for posting this discussion to the dev list, it would be great to hear what everyone thinks about the idea that USING should be a catalog-specific storage configuration. Related to this, I’ve updated the catalog PR, #21306, to include an

Re: best way to run one python test?

2018-08-20 Thread Imran Rashid
thanks, that helps! So run-tests.py adds a bunch more env variables: https://github.com/apache/spark/blob/master/python/run-tests.py#L74-L97 Those don't matter in most cases, I guess? On Sun, Aug 19, 2018 at 11:54 PM Hyukjin Kwon wrote: > There's an informal way to test specific tests. For
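The "informal way" boils down to standard `unittest` mechanics: a single test can be selected by method name instead of running the whole suite. A minimal, self-contained sketch of that mechanism (a real PySpark test would additionally need the env variables that run-tests.py sets, as discussed above; `ExampleTests` here is a made-up stand-in):

```python
import unittest

class ExampleTests(unittest.TestCase):
    def test_one(self):
        self.assertEqual(1 + 1, 2)
    def test_two(self):
        self.assertTrue(True)

# Build a suite containing exactly one test, selected by method name,
# instead of discovering and running everything in the module.
suite = unittest.TestSuite([ExampleTests("test_one")])
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun)  # 1
```

The same selection-by-name idea is what lets you point the test runner at one module, class, or method rather than the full matrix.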

Re: Why is repartitionAndSortWithinPartitions slower than MapReduce?

2018-08-20 Thread 周浥尘
In addition to my previous email, environment: Spark 2.1.2, Hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6. 周浥尘 wrote on Mon, Aug 20, 2018 at 8:52 PM: > Hi team, > > I found the Spark method *repartitionAndSortWithinPartitions* spends > twice as much time as using MapReduce in some cases. > I want to repartition

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread antonkulaga
makatun, did you try to test something more complex, like dataframe.describe or PCA?

Why is repartitionAndSortWithinPartitions slower than MapReduce?

2018-08-20 Thread 周浥尘
Hi team, I found the Spark method *repartitionAndSortWithinPartitions* spends twice as much time as using MapReduce in some cases. I want to repartition the dataset according to split keys and save them to files in ascending order. As the doc says, repartitionAndSortWithinPartitions “is more efficient
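For readers unfamiliar with the operation being compared: a pure-Python sketch of what repartitionAndSortWithinPartitions is assumed to do (routing each record to a partition by key, then sorting only inside each partition), which mirrors the shuffle-then-sort step MapReduce performs between map and reduce. Function names here are illustrative, not Spark's:

```python
# Sketch of repartition-and-sort-within-partitions semantics:
# records are routed to partitions by key, then each partition is
# sorted independently -- there is no global sort across partitions.

def repartition_and_sort(records, num_partitions, partition_func=None):
    partition_func = partition_func or (lambda k: hash(k) % num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_func(key)].append((key, value))
    for p in partitions:
        p.sort(key=lambda kv: kv[0])   # ascending by key, per partition
    return partitions

data = [(3, "c"), (1, "a"), (2, "b"), (4, "d")]
parts = repartition_and_sort(data, 2, partition_func=lambda k: k % 2)
print(parts)  # [[(2, 'b'), (4, 'd')], [(1, 'a'), (3, 'c')]]
```

Each output partition can then be written to its own file, already in ascending key order, which is the use case described in the message.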

Apache Airflow (incubator) PMC binding vote needed

2018-08-20 Thread t4
Hi, can any member of the Apache Incubator PMC provide a vote for Apache Airflow 1.10 to be released? Thanks https://lists.apache.org/thread.html/fb09a91f1cef4a63df4d5474e2189248aa65a609a6237d8eefcd8eb7@%3Cdev.airflow.apache.org%3E

Unsubscribe

2018-08-20 Thread Michael Styles

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-20 Thread makatun
Hi Marco, many thanks for pointing out the related Spark commit. According to the description, it introduces indexed (instead of linear) search over columns in LogicalPlan.resolve(...). We have performed the tests on the current Spark master branch and would like to share the results. There are some
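The difference between the two lookup strategies mentioned above can be shown with a toy model (not Spark's actual resolver): resolving each of C columns via a linear scan costs O(C) per lookup, hence O(C^2) over a whole plan, while a prebuilt name-to-position index makes each lookup roughly O(1).

```python
# Toy model of linear vs indexed column resolution.

columns = [f"col_{i}" for i in range(1000)]

def resolve_linear(name):
    for i, c in enumerate(columns):   # scans until the name is found
        if c == name:
            return i
    raise KeyError(name)

index = {c: i for i, c in enumerate(columns)}  # built once up front

def resolve_indexed(name):
    return index[name]               # constant-time hash lookup

# Both strategies agree on the result; only the cost differs.
assert resolve_linear("col_999") == resolve_indexed("col_999") == 999
```

This is why switching the resolver to an indexed search flattens the polynomial growth observed on wide DataFrames.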

[DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Hyukjin Kwon
Hi all, I have been trying to follow `USING` syntax support, since it looks currently unsupported whereas the `format` API supports this. I have been trying to understand why, and talked with Ryan. Ryan knows all the details, and he and I thought it's good to post here - I just started to look
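To make the proposal concrete, here is a toy sketch (not Spark's API) of the idea the thread debates: `USING` names a storage format whose meaning is defined by the catalog, so a path-based catalog may accept "parquet" while a Cassandra-backed catalog rejects it. All class and method names below are hypothetical:

```python
# Toy model: each catalog decides which USING formats it supports.

class Catalog:
    def __init__(self, name, supported_formats):
        self.name = name
        self.supported_formats = supported_formats
        self.tables = {}

    def create_table(self, table, using):
        if using not in self.supported_formats:
            raise ValueError(f"{self.name} does not support USING {using}")
        self.tables[table] = using

spark_catalog = Catalog("spark", {"parquet", "orc", "json"})
cassandra_catalog = Catalog("cassandra", {"cassandra"})

spark_catalog.create_table("t1", using="parquet")        # accepted
try:
    cassandra_catalog.create_table("t2", using="parquet")
except ValueError as e:
    print(e)  # cassandra does not support USING parquet
```

Under this reading, `USING` is a catalog-specific storage configuration rather than a global registry of data sources, which is the interpretation the Cassandra example in the thread pushes back on.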