Re: Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread t4
https://issues.apache.org/jira/browse/SPARK-23576 ?

Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Matei Zaharia
Hi Steffen, Thanks for sharing your results about MLlib — this sounds like a useful tool. However, I wanted to point out that some of the results may be expected for certain machine learning algorithms, so it might be good to design those tests with that in mind. For example: > - The

Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread Long, Andrew
Hello Friends, I’ve encountered a bug where Spark silently corrupts data when reading from a Parquet Hive table where the table schema does not match the file schema. I’d like to take a shot at adding some extra validations to the code to handle this corner case and I was wondering if anyone

Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-22 Thread Ankur Gupta
Thanks for your responses, Saisai and Marco. I agree that the "rename" operation can be time-consuming on object storage, which can potentially delay the shutdown. I also agree that customers/users have a way to use log appenders to write log files and then send them along with YARN application logs

Spark github sync works now

2018-08-22 Thread Xiao Li
FYI. The Spark GitHub sync was 10 hours behind this morning. You might get failed merges because of this. I just triggered a re-sync. It should work now. Thanks, Xiao

Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz
Given the popularity of related SO questions: - https://stackoverflow.com/q/41670103/1560062 - https://stackoverflow.com/q/42465568/1560062 - https://stackoverflow.com/q/41670103/1560062 it is probably more "nobody thought about asking" than "it is not used often". On Wed, 22 Aug 2018

Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Mike Hynes
Hi Reynold/Ivan, People familiar with pandas and R dataframes will likely have used the dataframe "melt" idiom, which is the functionality I believe you are referring to: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html I have had to write this function myself in my own

[MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Steffen Herbold
Dear developers, I am writing to you because I applied an approach for the automated testing of classification algorithms to Spark MLlib and would like to forward the results to you. The approach is a combination of smoke testing and metamorphic testing. The smoke tests try to find problems by

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-22 Thread makatun
Manu, thank you very much for your response. 1. Your post helps to further optimize the spark jobs for wide data. (https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015) The suggested change of code: df.select(df.columns.map { col => df(col).isNotNull }: _*)

Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-22 Thread Marco Gaido
I agree with Saisai. You can also configure log4j to append to destinations other than the console. Many companies have their own systems for collecting and monitoring logs, and they just customize the log4j configuration. I am not sure how necessary this change would be. Thanks, Marco On Wed, 22 Aug