Hi Ted,

Finally got round to creating this:
https://issues.apache.org/jira/browse/SPARK-13773
I hope you don't mind me selecting you as the shepherd for this ticket.

Regards,

James

On 7 March 2016 at 17:50, James Hammerton <ja...@gluru.co> wrote:

> Hi Ted,
>
> Thanks for getting back - I realised my mistake... I was clicking the
> little drop-down menu on the right-hand side of the Create button (it
> looks as if it's part of the button) - when I clicked directly on the
> word "Create" I got a form that made more sense and allowed me to choose
> the project.
>
> Regards,
>
> James
>
> On 7 March 2016 at 13:09, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Have you tried clicking on the Create button from an existing Spark
>> JIRA? e.g.
>> https://issues.apache.org/jira/browse/SPARK-4352
>>
>> Once you're logged in, you should be able to select Spark as the
>> Project.
>>
>> Cheers
>>
>> On Mon, Mar 7, 2016 at 2:54 AM, James Hammerton <ja...@gluru.co> wrote:
>>
>>> Hi,
>>>
>>> So I managed to isolate the bug and I'm ready to try raising a JIRA
>>> issue. I joined the Apache JIRA project so I can create tickets.
>>>
>>> However, when I click Create from the Spark project home page on JIRA,
>>> it asks me to click on one of the following service desks: Kylin,
>>> Atlas, Ranger, Apache Infrastructure. There doesn't seem to be an
>>> option for me to raise an issue for Spark?!
>>>
>>> Regards,
>>>
>>> James
>>>
>>> On 4 March 2016 at 14:03, James Hammerton <ja...@gluru.co> wrote:
>>>
>>>> Sure thing, I'll see if I can isolate this.
>>>>
>>>> Regards,
>>>>
>>>> James
>>>>
>>>> On 4 March 2016 at 12:24, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> If you can reproduce the following with a unit test, I suggest you
>>>>> open a JIRA.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mar 4, 2016, at 4:01 AM, James Hammerton <ja...@gluru.co> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I've come across some strange behaviour with Spark 1.6.0.
>>>>>
>>>>> In the code below, the filtering by "eventName" only seems to work
>>>>> if I call .cache on the resulting DataFrame.
>>>>>
>>>>> If I don't do this, the code crashes inside the UDF because it
>>>>> processes an event that the filter should get rid of.
>>>>>
>>>>> Any ideas why this might be the case?
>>>>>
>>>>> The code is as follows:
>>>>>
>>>>>> val df = sqlContext.read.parquet(inputPath)
>>>>>> val filtered = df.filter(df("eventName").equalTo(Created))
>>>>>> val extracted = extractEmailReferences(sqlContext, filtered.cache)
>>>>>> // Caching seems to be required for the filter to work
>>>>>> extracted.write.parquet(outputPath)
>>>>>
>>>>> where extractEmailReferences does this:
>>>>>
>>>>>> def extractEmailReferences(sqlContext: SQLContext, df: DataFrame): DataFrame = {
>>>>>>   val extracted = df.select(df(EventFieldNames.ObjectId),
>>>>>>     extractReferencesUDF(df(EventFieldNames.EventJson),
>>>>>>       df(EventFieldNames.ObjectId), df(EventFieldNames.UserId)) as "references")
>>>>>>   extracted.filter(extracted("references").notEqual("UNKNOWN"))
>>>>>> }
>>>>>
>>>>> and extractReferencesUDF:
>>>>>
>>>>>> def extractReferencesUDF = udf(extractReferences(_: String, _: String, _: String))
>>>>>>
>>>>>> def extractReferences(eventJson: String, objectId: String, userId: String): String = {
>>>>>>   import org.json4s.jackson.Serialization
>>>>>>   import org.json4s.NoTypeHints
>>>>>>   implicit val formats = Serialization.formats(NoTypeHints)
>>>>>>
>>>>>>   val created = Serialization.read[GMailMessage.Created](eventJson)
>>>>>>   // This is where the code crashes if the .cache isn't called
>>>>>
>>>>> Regards,
>>>>>
>>>>> James
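A plausible explanation for behaviour like this (not confirmed in the thread itself) is that Spark's Catalyst optimizer does not guarantee that a UDF in a projection is evaluated only after earlier filter predicates; operations can be reordered or pushed around, so the UDF may see rows the "eventName" filter was expected to remove, and caching the filtered DataFrame materialises it, which happens to act as a barrier. A workaround that does not depend on plan ordering is to make the UDF total: return the "UNKNOWN" sentinel on malformed input instead of throwing, and let the existing notEqual("UNKNOWN") filter discard those rows. The sketch below is illustrative only: parseEvent is a hypothetical, dependency-free stand-in for Serialization.read[GMailMessage.Created], not the thread's actual parser.

```scala
import scala.util.Try

object DefensiveUdfSketch {
  // Hypothetical stand-in for Serialization.read[GMailMessage.Created]:
  // any parse step that throws on unexpected input. Parsing an Int keeps
  // the sketch free of Spark and json4s dependencies.
  def parseEvent(eventJson: String): Int = eventJson.trim.toInt

  // Total version of the extraction: rows that fail to parse map to the
  // "UNKNOWN" sentinel instead of crashing, so a downstream
  // filter(notEqual("UNKNOWN")) discards them regardless of the order in
  // which the optimizer evaluates the filter and the UDF.
  def extractSafe(eventJson: String): String =
    Try(parseEvent(eventJson)).map(n => "ref-" + n).getOrElse("UNKNOWN")

  def main(args: Array[String]): Unit = {
    println(extractSafe("42"))       // ref-42
    println(extractSafe("not-json")) // UNKNOWN
  }
}
```

In the thread's own code, the same shape would mean wrapping the Serialization.read call in scala.util.Try inside extractReferences and returning "UNKNOWN" on failure.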