Have you tried setting spark.emr.dropCharacters to a lower value? (It
defaults to 8.)

:) Just joking, sorry! Fantastic bug.

What data source do you have for this DataFrame? I could imagine, for
example, that it's a Parquet file and on EMR you are running with the
wrong version of the Parquet library, which messes up strings. It should
be easy enough to try a different data format. You could also try what
happens if you just create the DataFrame programmatically, e.g.
sc.parallelize(Seq("asdfasdfasdf")).toDF.
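
A quick way to run that test in spark-shell (Spark 1.x, where the
sqlContext implicits are already imported; the column name "s" is an
arbitrary choice):

    // Build the DataFrame in memory, bypassing the file-based data
    // source entirely.
    val df = sc.parallelize(Seq("asdfasdfasdf", "primary", "tertiary")).toDF("s")
    df.show()
    // If the first 8 characters disappear here too, the data source
    // (and any Parquet version mismatch) is off the hook.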

To understand better at which point the characters are lost, you could try
grouping by a string attribute. I see "education" ends up either as ""
(empty string) or "y" in the printed output. But are the characters already
lost when you try grouping by the attribute? Will there be a single ""
category, or will you have separate categories for "primary" and "tertiary"?
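
A minimal sketch of that check, assuming the DataFrame is called df and
the column really is named "education" (both taken from the thread, not
verified):

    // Group by the suspect string column; the distinct keys show
    // whether the values are still intact at this stage.
    df.groupBy("education").count().show()
    // Separate counts for "primary" and "tertiary" would mean the data
    // survives grouping and is only mangled later in the output path.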

I think the correct output through the RDD suggests that the issue happens
at the very end. So it will probably also happen with different data
sources, and grouping will create separate groups for "primary" and
"tertiary" even though they are printed as the same string at the end. You
should also check the data from "take(10)" to rule out any issues with
printing. You could try the same "groupBy" trick after "take(10)", or you
could print the lengths of the strings.
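
Something along these lines, under the same assumptions about df and the
"education" column:

    // Collect a handful of rows and inspect the raw strings directly,
    // sidestepping show()'s formatting altogether.
    df.take(10).foreach { row =>
      val s = row.getAs[String]("education")
      println(s"'$s' has length ${s.length}")
    }
    // Full-length strings here point at the printing path; strings that
    // are already 8 characters short mean the data itself is damaged.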

Good luck!

On Tue, Jan 26, 2016 at 3:53 AM, awzurn <awz...@gmail.com> wrote:

> Sorry for the bump, but wondering if anyone else has seen this before.
> We're hoping to either resolve this soon or take the next steps to file
> this as an issue.
>
> Thanks in advance,
>
> Andrew Zurn