Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Ok, that makes sense. So this is (a) more efficient, since as far as I can see it is updating the HLL registers directly in the buffer for each value, and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is it currently possible to specify an UnsafeRow as a buffer in a UDAF?

Re: HyperLogLogUDT

2015-09-12 Thread Yin Huai
Hi Nick, The buffer exposed to the UDAF interface is just a view of the underlying buffer (this underlying buffer is shared by different aggregate functions, and every function takes one or more slots). If you need a UDAF, extending UserDefinedAggregateFunction is the preferred approach.
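The slot-sharing idea can be sketched in plain Scala. This is an illustrative model only, not Spark's actual internals; the `SharedBuffer` and `BufferView` classes and the slot layout are hypothetical, made up for the example:

```scala
// An illustrative model: several aggregate functions share one underlying
// buffer, and each function sees only a view over its own slots.
class SharedBuffer(size: Int) {
  val slots: Array[Any] = Array.fill[Any](size)(null)
}

// Translates a function-local slot ordinal to an absolute position
// in the shared buffer.
class BufferView(underlying: SharedBuffer, offset: Int, length: Int) {
  def update(i: Int, value: Any): Unit = {
    require(i >= 0 && i < length, s"slot $i out of range")
    underlying.slots(offset + i) = value
  }
  def apply(i: Int): Any = underlying.slots(offset + i)
}

val shared = new SharedBuffer(4)
val countView = new BufferView(shared, 0, 1) // e.g. a COUNT takes one slot
val hllView   = new BufferView(shared, 1, 3) // another function takes three

countView.update(0, 42L)
hllView.update(0, "hll-registers")
```

Each function updates only its own view, but all writes land in the same underlying row, which is why the buffer handed to a UDAF is a view rather than a standalone object.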

Spark Streaming..Exception

2015-09-12 Thread Priya Ch
Hello All, When I push messages into Kafka and read them into the streaming application, I see the following exception. I am running the application on YARN and am not broadcasting the message anywhere within the application. I am simply reading a message, parsing it and populating fields in a class and then

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Inspired by this post: http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/, I've started putting together something based on the Spark 1.5 UDAF interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141 Some questions - 1. How do I get the

Re: SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-12 Thread Jörn Franke
I am not sure what you are trying to achieve here. Have you thought about using Flume? Additionally, maybe something like rsync? On Sat, Sep 12, 2015 at 0:02, Varadhan, Jawahar wrote: > Hi all, > I have coded a custom receiver which receives kafka messages.

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
I should add that surely the idea behind a UDT is exactly that (a) it fits automatically into DFs and Tungsten, and (b) it can be used efficiently when writing one's own UDTs and UDAFs? On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath wrote: > Can I ask why you've

Re: Spark 1.5.x: Java files in src/main/scala and vice versa

2015-09-12 Thread Sean Owen
There are actually 33 instances of a Java file in src/main/scala -- I opened https://issues.apache.org/jira/browse/SPARK-10576 to track a discussion and decision. On Fri, Sep 11, 2015 at 3:10 PM, lonikar wrote: > It does not cause any problem when building using maven. But

Re: spark dataframe transform JSON to ORC meet “column ambigous exception”

2015-09-12 Thread Ted Yu
Is it possible that Canonical_URL occurs more than once in your JSON? Can you check your JSON input? Thanks On Sat, Sep 12, 2015 at 2:05 AM, Fengdong Yu wrote: > Hi, > > I am using the Spark 1.4.1 data frame, read JSON data, then save it to orc. the > code is very

Re: spark dataframe transform JSON to ORC meet “column ambigous exception”

2015-09-12 Thread Fengdong Yu
Hi Ted, I checked the JSON; there are no duplicated keys in the JSON. Azuryy Yu Sr. Infrastructure Engineer cell: 158-0164-9103 wechat: azuryy On Sat, Sep 12, 2015 at 5:52 PM, Ted Yu wrote: > Is it possible that Canonical_URL occurs more than once in your json ? > > Can you

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
Hello Nick, I have been working on a (UDT-less) implementation of HLL++. You can find the PR here: https://github.com/apache/spark/pull/8362. This currently implements the dense version of HLL++, which is a further development of HLL. It returns a Long, but it shouldn't be too hard to return a Row

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Can I ask why you've done this as a custom implementation rather than using StreamLib, which is already implemented and widely used? It seems more portable to me to use a library - for example, I'd like to export the grouped data with raw HLLs to, say, Elasticsearch, and then do further on-demand

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
I am typically all for code re-use. The reason for writing this is to prevent the indirection of a UDT and work directly against memory. A UDT will work fine at the moment because we still use GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you would use an UnsafeRow as an
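The "work directly against memory" point can be illustrated with a self-contained sketch of dense HLL registers kept in a flat byte array. This is a simplified model written for this digest, not the code from the PR; the precision and hash choice are illustrative:

```scala
import scala.util.hashing.MurmurHash3

val p = 12                 // precision: 2^12 = 4096 one-byte registers
val m = 1 << p

def emptyRegisters: Array[Byte] = new Array[Byte](m)

// Update one register in place: the low p bits of the hash pick the
// register; the rank is the leading-zero count of the remaining bits + 1.
def add(regs: Array[Byte], value: String): Unit = {
  val h = MurmurHash3.stringHash(value)
  val idx = h & (m - 1)
  val rank = (Integer.numberOfLeadingZeros(h >>> p) - p + 1).toByte
  if (rank > regs(idx)) regs(idx) = rank
}

// Merging two HLL states is an element-wise max over the registers --
// a tight loop over flat memory, with no per-row object (de)serialization.
def merge(into: Array[Byte], other: Array[Byte]): Unit = {
  var i = 0
  while (i < m) {
    if (other(i) > into(i)) into(i) = other(i)
    i += 1
  }
}
```

Because the whole state is a fixed-width run of bytes, both `add` and `merge` could in principle operate on a slice of an UnsafeRow's backing memory without going through a UDT's serialize/deserialize round trip, which is the indirection being avoided here.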

Re: spark dataframe transform JSON to ORC meet “column ambigous exception”

2015-09-12 Thread Ted Yu
Can you take a look at SPARK-5278, where ambiguity is shown between field names which differ only by case? Cheers On Sat, Sep 12, 2015 at 3:40 AM, Fengdong Yu wrote: > Hi Ted, > I checked the JSON, there aren't duplicated key in JSON. > > > Azuryy Yu > Sr.
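Since SPARK-5278 is about fields that collide only by case, a quick self-contained check of a record's keys for such collisions can be sketched as follows (the field names below are examples, not taken from the reporter's data):

```scala
// Group field names by their lowercased form and keep only groups with
// more than one spelling -- those are the case-only collisions that a
// case-insensitive analyzer would report as ambiguous.
def caseCollisions(fieldNames: Seq[String]): Map[String, Seq[String]] =
  fieldNames.groupBy(_.toLowerCase).filter { case (_, names) => names.size > 1 }

val fields = Seq("Canonical_URL", "canonical_url", "title")
val dupes = caseCollisions(fields)
// dupes has a single entry, keyed by "canonical_url"
```

Running this over the top-level keys of each JSON record would surface fields like "Canonical_URL" vs "canonical_url" even when no key is literally duplicated.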

[Question] ORC - EMRFS Problem

2015-09-12 Thread Cazen Lee
Good Day! I think there are some problems between ORC and AWS EMRFS. When I was trying to read "upper 150M" ORC files from S3, an ArrayIndexOutOfBounds exception occurred. I'm sure that it's an AWS-side issue because there was no exception when trying from HDFS or S3NativeFileSystem. Parquet runs

Re: Code generation for GPU

2015-09-12 Thread kiran lonikar
Thanks. Yes, that's exactly what I would like to do: copy large amounts of data to GPU RAM, perform the computation, and get bulk rows back for a map/filter or reduce result. It is true that non-trivial operations benefit more. Even streaming data to GPU RAM and interleaving computation with data transfer

Re: Code generation for GPU

2015-09-12 Thread kiran lonikar
Thanks for pointing to the YARN JIRA. For now, it would be good for my talk since it brings out that the Hadoop and big data community is already aware of GPUs and making an effort to exploit them. Good luck with your talk. That fear is lurking in my mind too :) On 10-Sep-2015 2:08 pm, "Steve Loughran"