Hi Andrew,

If you still see this with Spark 1.6.0, it would be very helpful if you
could file a bug about it at https://issues.apache.org/jira/browse/SPARK
with as much detail as you can. This issue could be a nasty source of
silent data corruption: intermediate data may lose its first 8 characters
in a way that is not obvious in the final output.

Thanks!
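(A tiny plain-Scala illustration, with hypothetical values based on the
"education" column discussed below, of why such corruption would be
silent: once the leading 8 characters are gone, distinct categories
collapse into the same key and any grouping quietly merges them.)

    val categories = Seq("primary", "secondary", "tertiary")
    val corrupted  = categories.map(_.drop(8))   // Seq("", "y", "")
    corrupted.groupBy(identity).mapValues(_.size)
    // Map("" -> 2, "y" -> 1): "primary" and "tertiary" are now one group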
On Fri, Jan 29, 2016 at 7:53 AM, Jonathan Kelly <jonathaka...@gmail.com> wrote:

> Just FYI, Spark 1.6 was released on emr-4.3.0 a couple of days ago:
> https://aws.amazon.com/blogs/aws/emr-4-3-0-new-updated-applications-command-line-export/
>
> On Thu, Jan 28, 2016 at 7:30 PM Andrew Zurn <awz...@gmail.com> wrote:
>
>> Hey Daniel,
>>
>> Thanks for the response.
>>
>> After playing around for a bit, it looks like it's probably something
>> similar to the first situation you mentioned, with the Parquet format
>> causing issues. Both a programmatically created dataset and a dataset
>> pulled off the internet (rather than out of S3 and put into HDFS/Hive)
>> behaved with DataFrames as one would expect (printed everything,
>> grouped properly, etc.).
>>
>> It looks more than likely that there is an outstanding bug that causes
>> issues with data coming from S3 and converted into the Parquet format
>> (I found an article highlighting that it was around in 1.4, and I
>> guess it wouldn't be out of the realm of things for it to still
>> exist). Link to the article:
>> https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
>>
>> Hopefully a little more stability will come with the upcoming Spark
>> 1.6 release on EMR (I think that is happening sometime soon).
>>
>> Thanks again for the advice on where to dig further. Much appreciated.
>>
>> Andrew
>>
>> On Tue, Jan 26, 2016 at 9:18 AM, Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>> Have you tried setting spark.emr.dropCharacters to a lower value? (It
>>> defaults to 8.)
>>>
>>> :) Just joking, sorry! Fantastic bug.
>>>
>>> What data source do you have for this DataFrame? I could imagine, for
>>> example, that it's a Parquet file and that on EMR you are running with
>>> a wrong version of the Parquet library that messes up strings. It
>>> should be easy enough to try a different data format. You could also
>>> try what happens if you just create the DataFrame programmatically,
>>> e.g. sc.parallelize(Seq("asdfasdfasdf")).toDF.
>>>
>>> To understand better at which point the characters are lost, you could
>>> try grouping by a string attribute. I see "education" ends up either
>>> as "" (empty string) or "y" in the printed output. But are the
>>> characters already lost when you group by the attribute? Will there be
>>> a single "" category, or will you have separate categories for
>>> "primary" and "tertiary"?
>>>
>>> I think the correct output through the RDD suggests that the issue
>>> happens at the very end. So it will probably also happen with
>>> different data sources, and grouping will create separate groups for
>>> "primary" and "tertiary" even though they are printed as the same
>>> string at the end. You should also check the data from "take(10)" to
>>> rule out any issues with printing. You could try the same "groupBy"
>>> trick after "take(10)". Or you could print the lengths of the strings.
>>>
>>> Good luck!
>>>
>>> On Tue, Jan 26, 2016 at 3:53 AM, awzurn <awz...@gmail.com> wrote:
>>>
>>>> Sorry for the bump, but wondering if anyone else has seen this
>>>> before. We're hoping to either resolve this soon, or move forward
>>>> with filing it as an issue.
>>>>
>>>> Thanks in advance,
>>>>
>>>> Andrew Zurn
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-Spark-SQL-Drops-First-8-Characters-of-String-on-Amazon-EMR-tp26022p26065.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
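For reference, a minimal sketch (Scala, Spark 1.x shell) of the
diagnostics Daniel suggests above. The DataFrame name df and the column
name "education" are assumptions standing in for the original poster's
data:

    // Assumes a Spark 1.x shell where sc and sqlContext are predefined;
    // df stands in for the DataFrame read from the suspect S3/Parquet data.
    import sqlContext.implicits._

    // 1. Rule out the data source: a programmatically created DataFrame
    //    should print its strings intact.
    val probe = sc.parallelize(Seq("asdfasdfasdf")).toDF("value")
    probe.show()

    // 2. Group by the string column: separate groups for "primary" and
    //    "tertiary" that print identically would mean the characters are
    //    lost only at display time, not in the stored data.
    df.groupBy("education").count().show()

    // 3. Print string lengths from take(10) to rule out a printing issue.
    df.select("education").take(10).foreach { row =>
      val s = row.getString(0)
      println(s"'$s' has length ${s.length}")
    }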