It’s pretty simple, really:

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}
/**
 * A SparkML Transformer that will transform an
 * entity of type T into a JSON-formatted string.
 * Created by Tristan Nixon <tris...@memeticlabs.org> on 3/11/16.
 */
class JsonSerializationTransformer[T](override val uid: String)
  extends UnaryTransformer[T, String, JsonSerializationTransformer[T]] {

  def this() = this(Identifiable.randomUID("JsonSerializationTransformer"))

  val mapper = new ObjectMapper
  // add additional mapper configuration code here, like this:
  // mapper.setAnnotationIntrospector(new JaxbAnnotationIntrospector)
  // or this:
  // mapper.getSerializationConfig.withFeatures( SerializationFeature.WRITE_DATES_AS_TIMESTAMPS )

  override protected def createTransformFunc: (T) => String =
    mapper.writeValueAsString

  // use the StringType singleton; the StringType constructor is private
  override protected def outputDataType: DataType = StringType
}

and you would use it like any other transformer:

val jsontrans = new JsonSerializationTransformer[Document]
  .setInputCol("myEntityColumn")
  .setOutputCol("myOutputColumn")

val dfWithJson = jsontrans.transform( entityDF )

Note that this implementation is for Jackson 2.x. If you want to use Jackson 1.x, it’s a bit trickier, because its ObjectMapper class is not Serializable, and so you need to initialize it per-partition rather than having it just be a standard property (a rough sketch of that pattern is appended at the bottom of this message).

> On Mar 11, 2016, at 12:49 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
> Hi Tristan,
>
> Mind sharing the relevant code? I'd like to learn the way you use a
> Transformer to do so. Thanks!
>
> Jacek
>
> On 11.03.2016 at 7:07 PM, "Tristan Nixon" <st...@memeticlabs.org> wrote:
>
> I have a similar situation in an app of mine. I implemented a custom ML
> Transformer that wraps the Jackson ObjectMapper - this gives you full control
> over how your custom entities / structs are serialized.
>
>> On Mar 11, 2016, at 11:53 AM, Caires Vinicius <caire...@gmail.com> wrote:
>>
>> Hmm. I think my problem is a little more complex. I'm using
>> https://github.com/databricks/spark-redshift
>> and when I read from a JSON file I get this schema:
>>
>> root
>>  |-- app: string (nullable = true)
>>  |-- ct: long (nullable = true)
>>  |-- event: struct (nullable = true)
>>  |    |-- attributes: struct (nullable = true)
>>  |    |    |-- account: string (nullable = true)
>>  |    |    |-- accountEmail: string (nullable = true)
>>  |    |    |-- accountId: string (nullable = true)
>>
>> I want to transform the column event into a String (formatted as JSON).
>>
>> I was trying to use a udf, but without success.
>>
>> On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon <st...@memeticlabs.org> wrote:
>> Have you looked at DataFrame.write.json( path )?
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>
>> > On Mar 11, 2016, at 7:15 AM, Caires Vinicius <caire...@gmail.com> wrote:
>> >
>> > I have a DataFrame with a nested StructField and I want to convert it to a
>> > JSON String. Is there any way to accomplish this?
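P.S. for the Jackson 1.x case mentioned above: here is a minimal, untested sketch of one way to keep the non-Serializable mapper out of the serialized closure. Rather than literal per-partition initialization, it uses a @transient lazy val, which is rebuilt once per executor JVM on first use and achieves the same effect; the class name JsonSerializationTransformer1x is just made up for illustration.

import org.codehaus.jackson.map.ObjectMapper // Jackson 1.x
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class JsonSerializationTransformer1x[T](override val uid: String)
  extends UnaryTransformer[T, String, JsonSerializationTransformer1x[T]] {

  def this() = this(Identifiable.randomUID("JsonSerializationTransformer1x"))

  // @transient excludes the mapper from serialization; lazy re-creates it
  // on first use in each JVM, so the non-Serializable Jackson 1.x
  // ObjectMapper never has to travel over the wire.
  @transient private lazy val mapper = new ObjectMapper

  override protected def createTransformFunc: (T) => String =
    (entity: T) => mapper.writeValueAsString(entity)

  override protected def outputDataType: DataType = StringType
}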
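And since Caires asked about doing this with a udf: below is a rough, untested sketch of that route using json4s (which ships with Spark). It assumes the struct values reach the udf as Rows that still carry their schema (they normally do, as GenericRowWithSchema); rowToJValue and structToJson are hypothetical names, and it only handles the string/long/struct types from the schema above, with a naive fallback for anything else.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{LongType, StringType, StructType}
import org.json4s.JsonAST._
import org.json4s.jackson.JsonMethods.{compact, render}

// Recursively convert a schema-carrying Row into a json4s AST
def rowToJValue(row: Row): JValue = {
  val fields = row.schema.fields.zipWithIndex.map { case (f, i) =>
    val value: JValue =
      if (row.isNullAt(i)) JNull
      else f.dataType match {
        case _: StructType => rowToJValue(row.getStruct(i))
        case _: StringType => JString(row.getString(i))
        case _: LongType   => JInt(row.getLong(i))
        case _             => JString(String.valueOf(row.get(i))) // naive fallback
      }
    f.name -> value
  }
  JObject(fields: _*)
}

val structToJson = udf { row: Row => compact(render(rowToJValue(row))) }

// e.g., on a DataFrame df with the schema Caires posted:
// val withJson = df.withColumn("event", structToJson(df("event")))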