[VOTE] Spark 2.2.1 (RC1)

2017-11-14 Thread Felix Cheung
Please vote on releasing the following candidate as Apache Spark version 2.2.1. The vote is open until Monday, November 20, 2017 at 23:00 UTC and passes if a majority of at least 3 PMC +1 votes are cast.
[ ] +1 Release this package as Apache Spark 2.2.1
[ ] -1 Do not release this package because

Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Hyukjin Kwon
Thanks, all, for the feedback.
> 1. when merging NullType with another type, the result should always be that type.
> 2. when merging StringType with another type, the result should always be StringType.
> 3. when merging integral types, the priority from high to low: DecimalType, LongType,

Re: Questions about the future of UDTs and Encoders

2017-11-14 Thread mlopez
Hello everyone! I'm a developer at a security ratings company. We've been moving to Spark for our data analytics and nearly every dataset we have contains IP addresses or variable-length subnets. Katherine's descriptions of use cases and attempts to emulate networking types overlap with ours. I

[spark-kinesis] [SPARK-20168] Requesting some attention for a review

2017-11-14 Thread Yash Sharma
Hi Team, could I please draw some attention to the pull request on Spark-Kinesis operability? We have iterated on the patch for the past few months, and it would be great to have a final review of it. I think it's very close now. I would love to work on improvements if any. This patch

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-14 Thread Bago Amirbekian
There is a known issue with VectorAssembler which causes it to fail in streaming if any of the input columns are of VectorType and don't have size information: https://issues.apache.org/jira/browse/SPARK-22346. This can be fixed by adding size information to the vector columns; I've made a PR to

Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Reynold Xin
Most of those thoughts from Wenchen make sense to me. Rather than a list, can we create a table? X-axis is data type, and Y-axis is also data type, and the intersection explains what the coerced type is? Can we also look at what Hive, standard SQL (Postgres?) do? Also, this shouldn't be

Re: SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Dongjoon Hyun
Hi, Mark. That is one of the reasons why I left it out of the previous PR (below), and I'm focusing on the second approach: use OrcFileFormat with convertMetastoreOrc. https://github.com/apache/spark/pull/19470 [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table

Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Wenchen Fan
My 2 cents:
1. when merging NullType with another type, the result should always be that type.
2. when merging StringType with another type, the result should always be StringType.
3. when merging integral types, the priority from high to low: DecimalType, LongType, IntegerType. This is because
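
The three merge rules listed above can be sketched in a few lines. This is only an illustration of the rules as quoted in the thread, not Spark's actual implementation: the string constants and the `merge_types` function are hypothetical names, and combinations the quoted rules do not cover raise an error rather than guess at a result.

```python
# Hypothetical sketch of the three merge rules quoted in the thread.
# Type names are plain strings here; this is NOT Spark's implementation.
NULL = "NullType"
STRING = "StringType"
INTEGER = "IntegerType"
LONG = "LongType"
DECIMAL = "DecimalType"

# Rule 3: priority from low to high among the types the thread lists.
_PRIORITY = [INTEGER, LONG, DECIMAL]


def merge_types(left, right):
    """Return the common type of two partition-column types."""
    # Rule 1: NullType merged with another type yields that other type.
    if left == NULL:
        return right
    if right == NULL:
        return left
    # Rule 2: StringType merged with another type yields StringType.
    if left == STRING or right == STRING:
        return STRING
    # Rule 3: pick the higher-priority type (Decimal > Long > Integer).
    if left in _PRIORITY and right in _PRIORITY:
        return max(left, right, key=_PRIORITY.index)
    # Assumption: anything else is outside the quoted rules.
    raise ValueError("no rule quoted for %s and %s" % (left, right))
```

For example, `merge_types("IntegerType", "LongType")` picks `LongType` under rule 3, while any merge involving `StringType` (other than with `NullType`, which rule 1 handles first with the same outcome) yields `StringType`.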

Re: [discuss][PySpark] Can we drop support old Pandas (<0.19.2) or what version should we support?

2017-11-14 Thread Li Jin
I think this makes sense. PySpark/Pandas interops in 2.3 are new anyway, so I don't think we need to support the new functionality with older versions of pandas (Takuya's reason 3). One thing I am not sure about is how complicated it is to support pandas < 0.19.2 with old non-Arrow interops and require

[discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Hyukjin Kwon
Hi dev, I would like to post a proposal about partitioned column type inference (related to the 'spark.sql.sources.partitionColumnTypeInference.enabled' configuration). This thread focuses on the type coercion (finding the common type) in partitioned columns, in particular, when the different form

Re: Cutting the RC for Spark 2.2.1 release

2017-11-14 Thread Felix Cheung
Now I’m seeing an error with Closing nexus staging repository.
staged_repo_id=orgapachespark-1254
< HTTP/1.1 401 Unauthorized
< Date: Tue, 14 Nov 2017 12:32:57 GMT
< Server: Nexus/2.13.0-01
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
* Authentication problem. Ignoring this.
<

SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Mark Petruska
Hi, I'm very new to Spark development, and would like to get guidance from more experienced members. Sorry, this email will be long as I try to explain the details. I started to investigate the issue SPARK-22267; added some test cases to

Re: [discuss][PySpark] Can we drop support old Pandas (<0.19.2) or what version should we support?

2017-11-14 Thread Hyukjin Kwon
+0 to drop it, as I said in the PR. I am seeing that it makes it hard to get the cool changes through, and it is slowing them down from getting pushed. My only worry is users who depend on lower pandas versions (Pandas 0.19.2 seems to have been released less than a year ago. Around the same time, Spark 2.1.0

[discuss][PySpark] Can we drop support old Pandas (<0.19.2) or what version should we support?

2017-11-14 Thread Takuya UESHIN
Hi all, I'd like to raise a discussion about the Pandas version. We were originally discussing it at https://github.com/apache/spark/pull/19607 but we'd like to ask for feedback from the community. Currently we don't explicitly specify the Pandas version we support, but we need to decide what