Re: test failed due to OOME

2015-11-02 Thread Ted Yu
Looks like SparkListenerSuite doesn't OOM on QA runs, unlike the Jenkins
builds.

I wonder if this is due to a difference between the machines running QA
tests and the machines running Jenkins builds.

On Fri, Oct 30, 2015 at 1:19 PM, Ted Yu  wrote:

> I noticed that the SparkContext created in each sub-test is not stopped
> upon finishing the sub-test.
>
> Would stopping each SparkContext make a difference in terms of heap memory
> consumption ?
>
> Cheers
>
> On Fri, Oct 30, 2015 at 12:04 PM, Mridul Muralidharan 
> wrote:
>
>> It is giving OOM at 32GB ? Something looks wrong with that ... that is
>> already on the higher side.
>>
>> Regards,
>> Mridul
>>
>>
>> On Fri, Oct 30, 2015 at 11:28 AM, shane knapp 
>> wrote:
>> > here's the current heap settings on our workers:
>> > InitialHeapSize == 2.1G
>> > MaxHeapSize == 32G
>> >
>> > system ram:  128G
>> >
>> > we can bump it pretty easily...  it's just a matter of deciding if we
>> > want to do this globally (super easy, but will affect ALL maven builds
>> > on our system -- not just spark) or on a per-job basis (this doesn't
>> > scale that well).
>> >
>> > thoughts?
>> >
>> > On Fri, Oct 30, 2015 at 9:47 AM, Ted Yu  wrote:
>> >> This happened recently on Jenkins:
>> >>
>> >>
>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.3,label=spark-test/3964/console
>> >>
>> >> On Sun, Oct 18, 2015 at 7:54 AM, Ted Yu  wrote:
>> >>>
>> >>> From
>> >>>
>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3846/console
>> >>> :
>> >>>
>> >>> SparkListenerSuite:
>> >>> - basic creation and shutdown of LiveListenerBus
>> >>> - bus.stop() waits for the event queue to completely drain
>> >>> - basic creation of StageInfo
>> >>> - basic creation of StageInfo with shuffle
>> >>> - StageInfo with fewer tasks than partitions
>> >>> - local metrics
>> >>> - onTaskGettingResult() called when result fetched remotely ***
>> FAILED ***
>> >>>   org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task
>> >>> 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
>> stage
>> >>> 0.0 (TID 0, localhost): java.lang.OutOfMemoryError: Java heap space
>> >>>  at java.util.Arrays.copyOf(Arrays.java:2271)
>> >>>  at
>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>> >>>  at
>> >>>
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> >>>  at
>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>> >>>  at
>> >>>
>> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
>> >>>  at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
>> >>>  at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:182)
>> >>>  at
>> >>>
>> org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:52)
>> >>>  at
>> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
>> >>>  at
>> >>>
>> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:49)
>> >>>  at
>> >>>
>> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
>> >>>  at
>> >>>
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
>> >>>  at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>> >>>  at
>> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>> >>>  at
>> >>>
>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>> >>>  at
>> >>>
>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>> >>>  at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>> >>>  at
>> >>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>>  at
>> >>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>>  at java.lang.Thread.run(Thread.java:745)
>> >>>
>> >>>
>> >>> Should more heap be given to the test suite?
>> >>>
>> >>>
>> >>> Cheers
>> >>
>> >>
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>
>
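As a side note on Ted's suggestion above, here is a minimal sketch of stopping the SparkContext after each sub-test, assuming a plain ScalaTest suite (the suite and test names are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class ExampleSuite extends FunSuite with BeforeAndAfterEach {
  @transient private var sc: SparkContext = _

  override def beforeEach(): Unit = {
    // each sub-test gets its own local context
    sc = new SparkContext(new SparkConf().setMaster("local").setAppName("example"))
  }

  override def afterEach(): Unit = {
    // stop the context so driver-side state does not pile up across sub-tests
    if (sc != null) {
      sc.stop()
      sc = null
    }
  }

  test("example sub-test") {
    assert(sc.parallelize(1 to 10).count() == 10L)
  }
}

Whether this actually changes the heap profile of SparkListenerSuite is exactly the open question in the thread; the sketch only illustrates the per-sub-test stop being asked about.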


Re: Ability to offer initial coefficients in ml.LogisticRegression

2015-11-02 Thread YiZhi Liu
Hi Tsai,

Would it be appropriate for me to create a JIRA and try to work on it?

2015-10-23 10:40 GMT+08:00 YiZhi Liu :
> Thank you Tsai.
>
> Holden, would you mind posting the JIRA issue id here? I searched but
> found nothing. Thanks.
>
> 2015-10-23 1:36 GMT+08:00 DB Tsai :
>> There is a JIRA for this. I know Holden is interested in this.
>>
>>
>> On Thursday, October 22, 2015, YiZhi Liu  wrote:
>>>
>>> Would someone mind giving some hint?
>>>
>>> 2015-10-20 15:34 GMT+08:00 YiZhi Liu :
>>> > Hi all,
>>> >
>>> > I noticed that in ml.classification.LogisticRegression, users are not
>>> > allowed to set initial coefficients, while it is supported in
>>> > mllib.classification.LogisticRegressionWithSGD.
>>> >
>>> > Sometimes we know that specific coefficients are close to the final
>>> > optimum. For example, we usually pick yesterday's output model as the
>>> > initial coefficients, since the data distribution between two days'
>>> > training samples shouldn't change much.
>>> >
>>> > Is there any concern that keeps this feature from being supported?
>>> >
>>> > --
>>> > Yizhi Liu
>>> > Senior Software Engineer / Data Mining
>>> > www.mvad.com, Shanghai, China
>>>
>>>
>>>
>>> --
>>> Yizhi Liu
>>> Senior Software Engineer / Data Mining
>>> www.mvad.com, Shanghai, China
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>> --
>> - DB
>>
>> Sent from my iPhone
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China



-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
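For reference, here is a rough sketch of the mllib-side warm start discussed in this thread, i.e. seeding a new run with yesterday's coefficients; `todaysData` and `yesterdaysWeights` are hypothetical names, and the ml (Pipeline) API in 1.5 offers no equivalent setter, which is exactly the gap the thread is about.

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object WarmStartSketch {
  def train(todaysData: RDD[LabeledPoint], yesterdaysWeights: Vector) = {
    val lr = new LogisticRegressionWithSGD()
    lr.optimizer.setNumIterations(50)
    // run() accepts explicit initial weights (e.g. yesterdaysModel.weights),
    // so optimization starts from yesterday's model instead of a zero vector
    lr.run(todaysData, yesterdaysWeights)
  }
}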



Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-02 Thread Luciano Resende
I am getting the same results using closer.lua versus closer.cgi; both seem
to download a page where the user can choose the closest mirror. I tried
adding parameters to follow the redirect, without much success. There
already seems to be a JIRA with Infra for a similar request:
https://issues.apache.org/jira/browse/INFRA-10240.

A workaround is to use a url pointing to the mirror directly.

curl -O -L
http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

I second the lack of documentation on what is available with these scripts;
I'll see if I can find the source and look for other options.


On Sun, Nov 1, 2015 at 8:40 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I think the lua one at
>
> https://svn.apache.org/repos/asf/infrastructure/site/trunk/content/dyn/closer.lua
> has replaced the cgi one from before. It also looks like the lua one
> supports `action=download` with a filename argument. So you could
> just do something like
>
> wget
> http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&action=download
>
> Thanks
> Shivaram
>
> On Sun, Nov 1, 2015 at 3:18 PM, Nicholas Chammas
>  wrote:
> > Oh, sweet! For example:
> >
> >
> http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1
> >
> > Thanks for sharing that tip. Looks like you can also use as_json (vs.
> > asjson).
> >
> > Nick
> >
> >
> > On Sun, Nov 1, 2015 at 5:32 PM Shivaram Venkataraman
> >  wrote:
> >>
> >> On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas
> >>  wrote:
> >> > OK, I’ll focus on the Apache mirrors going forward.
> >> >
> >> > The problem with the Apache mirrors, if I am not mistaken, is that you
> >> > cannot use a single URL that automatically redirects you to a working
> >> > mirror
> >> > to download Hadoop. You have to pick a specific mirror and pray it
> >> > doesn’t
> >> > disappear tomorrow.
> >> >
> >> > They don’t go away, especially http://mirror.ox.ac.uk , and in the US,
> >> > apache.osuosl.org, OSU being where a lot of the ASF servers are
> >> > kept.
> >> >
> >> > So does Apache offer no way to query a URL and automatically get the
> >> > closest
> >> > working mirror? If I’m installing HDFS onto servers in various EC2
> >> > regions,
> >> > the best mirror will vary depending on my location.
> >> >
> >> Not sure if this is officially documented somewhere but if you pass
> >> 'asjson=1' you will get back a JSON which has a 'preferred' field set
> >> to the closest mirror.
> >>
> >> Shivaram
> >> > Nick
> >> >
> >> >
> >> > On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman
> >> >  wrote:
> >> >>
> >> >> I think that getting them from the ASF mirrors is a better strategy
> in
> >> >> general as it'll remove the overhead of keeping the S3 bucket up to
> >> >> date. It works in the spark-ec2 case because we only support a
> limited
> >> >> number of Hadoop versions from the tool. FWIW I don't have write
> >> >> access to the bucket and also haven't heard of any plans to support
> >> >> newer versions in spark-ec2.
> >> >>
> >> >> Thanks
> >> >> Shivaram
> >> >>
> >> >> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <
> ste...@hortonworks.com>
> >> >> wrote:
> >> >> >
> >> >> > On 1 Nov 2015, at 03:17, Nicholas Chammas
> >> >> > 
> >> >> > wrote:
> >> >> >
> >> >> > https://s3.amazonaws.com/spark-related-packages/
> >> >> >
> >> >> > spark-ec2 uses this bucket to download and install HDFS on
> clusters.
> >> >> > Is
> >> >> > it
> >> >> > owned by the Spark project or by the AMPLab?
> >> >> >
> >> >> > Anyway, it looks like the latest Hadoop install available on there
> is
> >> >> > Hadoop
> >> >> > 2.4.0.
> >> >> >
> >> >> > Are there plans to add newer versions of Hadoop for use by
> spark-ec2
> >> >> > and
> >> >> > similar tools, or should we just be getting that stuff via an
> Apache
> >> >> > mirror?
> >> >> > The latest version is 2.7.1, by the way.
> >> >> >
> >> >> >
> >> >> > you should be grabbing the artifacts off the ASF and then verifying
> >> >> > their
> >> >> > SHA1 checksums as published on the ASF HTTPS web site
> >> >> >
> >> >> >
> >> >> > The problem with the Apache mirrors, if I am not mistaken, is that
> >> >> > you
> >> >> > cannot use a single URL that automatically redirects you to a
> working
> >> >> > mirror
> >> >> > to download Hadoop. You have to pick a specific mirror and pray it
> >> >> > doesn't
> >> >> > disappear tomorrow.
> >> >> >
> >> >> >
> >> >> > They don't go away, especially http://mirror.ox.ac.uk , and in
> >> >> > the US, apache.osuosl.org, OSU being where a lot of the ASF
> >> >> > servers are kept.
> >> >> >
> >> >> > full list with availability stats
> >> >> >
> >> >> > http://www.apache.org/mirrors/
> >> >> >
> >> >> >
>
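As an illustration of the asjson=1 / 'preferred' tip earlier in this thread, here is a rough Scala sketch (not an official client; the object name and the regex-based extraction are just for illustration, and a real client should use a proper JSON parser):

import scala.io.Source

object PreferredMirror {
  def main(args: Array[String]): Unit = {
    val artifact = "hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz"
    // closer.cgi returns JSON when asjson=1 is passed, including a
    // 'preferred' field pointing at the closest mirror
    val json = Source.fromURL(
      s"http://www.apache.org/dyn/closer.cgi/$artifact?asjson=1").mkString
    val preferred = "\"preferred\"\\s*:\\s*\"([^\"]+)\"".r
      .findFirstMatchIn(json)
      .map(_.group(1))
    preferred match {
      case Some(base) => println(s"preferred mirror: $base")
      case None       => println("could not find a 'preferred' field")
    }
  }
}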
> 

[BUILD SYSTEM] quick jenkins downtime, november 5th 7am

2015-11-02 Thread shane knapp
i'd like to take jenkins down briefly thursday morning to install some
plugin updates.

this will hopefully be short (~1hr), but could easily become longer, as
the jenkins plugin ecosystem is fragile and updates like this are
known to cause things to explode.  the only reason i'm contemplating
this is that i'm having some issues with the git plugin on new github
pull request builder builds.

i'll send updates as things progress.

thanks,

shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: test failed due to OOME

2015-11-02 Thread Patrick Wendell
I believe this is some bug in our tests. For some reason we are using way
more memory than necessary. We'll probably need to log into Jenkins and
heap dump some running tests and figure out what is going on.

On Mon, Nov 2, 2015 at 7:42 AM, Ted Yu  wrote:

> Looks like SparkListenerSuite doesn't OOM on QA runs, unlike the Jenkins
> builds.
>
> I wonder if this is due to a difference between the machines running QA
> tests and the machines running Jenkins builds.
>
> On Fri, Oct 30, 2015 at 1:19 PM, Ted Yu  wrote:
>
>> I noticed that the SparkContext created in each sub-test is not stopped
>> upon finishing the sub-test.
>>
>> Would stopping each SparkContext make a difference in terms of heap
>> memory consumption ?
>>
>> Cheers
>>
>> On Fri, Oct 30, 2015 at 12:04 PM, Mridul Muralidharan 
>> wrote:
>>
>>> It is giving OOM at 32GB ? Something looks wrong with that ... that is
>>> already on the higher side.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Fri, Oct 30, 2015 at 11:28 AM, shane knapp 
>>> wrote:
>>> > here's the current heap settings on our workers:
>>> > InitialHeapSize == 2.1G
>>> > MaxHeapSize == 32G
>>> >
>>> > system ram:  128G
>>> >
>>> > we can bump it pretty easily...  it's just a matter of deciding if we
>>> > want to do this globally (super easy, but will affect ALL maven builds
>>> > on our system -- not just spark) or on a per-job basis (this doesn't
>>> > scale that well).
>>> >
>>> > thoughts?
>>> >
>>> > On Fri, Oct 30, 2015 at 9:47 AM, Ted Yu  wrote:
>>> >> This happened recently on Jenkins:
>>> >>
>>> >>
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.3,label=spark-test/3964/console
>>> >>
>>> >> On Sun, Oct 18, 2015 at 7:54 AM, Ted Yu  wrote:
>>> >>>
>>> >>> From
>>> >>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3846/console
>>> >>> :
>>> >>>
>>> >>> SparkListenerSuite:
>>> >>> - basic creation and shutdown of LiveListenerBus
>>> >>> - bus.stop() waits for the event queue to completely drain
>>> >>> - basic creation of StageInfo
>>> >>> - basic creation of StageInfo with shuffle
>>> >>> - StageInfo with fewer tasks than partitions
>>> >>> - local metrics
>>> >>> - onTaskGettingResult() called when result fetched remotely ***
>>> FAILED ***
>>> >>>   org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task
>>> >>> 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
>>> stage
>>> >>> 0.0 (TID 0, localhost): java.lang.OutOfMemoryError: Java heap space
>>> >>>  at java.util.Arrays.copyOf(Arrays.java:2271)
>>> >>>  at
>>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>> >>>  at
>>> >>>
>>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>> >>>  at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>> >>>  at
>>> >>>
>>> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
>>> >>>  at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
>>> >>>  at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:182)
>>> >>>  at
>>> >>>
>>> org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:52)
>>> >>>  at
>>> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
>>> >>>  at
>>> >>>
>>> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:49)
>>> >>>  at
>>> >>>
>>> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
>>> >>>  at
>>> >>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
>>> >>>  at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> >>>  at
>>> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>>> >>>  at
>>> >>>
>>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>>> >>>  at
>>> >>>
>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>>> >>>  at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>> >>>  at
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> >>>  at
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >>>  at java.lang.Thread.run(Thread.java:745)
>>> >>>
>>> >>>
>>> >>> Should more heap be given to the test suite?
>>> >>>
>>> >>>
>>> >>> Cheers
>>> >>
>>> >>
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>>
>>
>>
>


Re: Getting Started

2015-11-02 Thread Romi Kuntsman
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Fri, Oct 30, 2015 at 1:25 PM, Saurabh Shah 
wrote:

> Hello, my name is Saurabh Shah and I am a second-year undergraduate
> student at DA-IICT, Gandhinagar, India. I have lately been contributing
> to open source organizations, and I find your organization the most
> appropriate one to work on.
>
> I would appreciate guidance on installing your codebase and on how to
> get started contributing to your organization.
>
>
> Thanking You,
>
> Saurabh Shah.
>


Re: Lead operator not working as aggregation operator

2015-11-02 Thread Shagun Sodhani
I was referring to this jira issue :
https://issues.apache.org/jira/browse/TAJO-919

On Mon, Nov 2, 2015 at 4:03 PM, Shagun Sodhani 
wrote:

> Hi! I was trying out window functions in SparkSql (using hive context)
> and I noticed that while this
> 
> mentions that *lead* is implemented as an aggregate operator, it seems
> not to be the case.
>
> I am using the following configuration:
>
> Query : SELECT lead(max(`expenses`)) FROM `table` GROUP BY `customerId`
> Spark Version: 10.4
> SparkSql Version: 1.5.1
>
> I am using the standard example of a (`customerId`, `expenses`) schema where
> each customer has multiple values for expenses (though I am setting age as
> Double and not Int as I am trying out maths functions).
>
>
> *java.lang.NullPointerException at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFLeadLag.evaluate(GenericUDFLeadLag.java:57)*
>
> The entire error stack can be found here .
>
> Can someone confirm if this is an actual issue or some oversight on my
> part?
>
> Thanks!
>


Re: Lead operator not working as aggregation operator

2015-11-02 Thread Herman van Hövell tot Westerflier
Hi,

This is more a question for the User list.

Lead and Lag imply ordering of the whole dataset, and this is not
supported. You can use Lead/Lag in an ordered window function and you'll be
fine:

*select lead(max(expenses)) over (order by customerId) from tbl group by
customerId*

HTH

Met vriendelijke groet/Kind regards,

Herman van Hövell tot Westerflier

QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
hvanhov...@questtec.nl
+31 6 420 590 27


2015-11-02 11:33 GMT+01:00 Shagun Sodhani :

> Hi! I was trying out window functions in SparkSql (using hive context)
> and I noticed that while this
> 
> mentions that *lead* is implemented as an aggregate operator, it seems
> not to be the case.
>
> I am using the following configuration:
>
> Query : SELECT lead(max(`expenses`)) FROM `table` GROUP BY `customerId`
> Spark Version: 10.4
> SparkSql Version: 1.5.1
>
> I am using the standard example of a (`customerId`, `expenses`) schema where
> each customer has multiple values for expenses (though I am setting age as
> Double and not Int as I am trying out maths functions).
>
>
> *java.lang.NullPointerException at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFLeadLag.evaluate(GenericUDFLeadLag.java:57)*
>
> The entire error stack can be found here .
>
> Can someone confirm if this is an actual issue or some oversight on my
> part?
>
> Thanks!
>
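For completeness, the same idea expressed through the DataFrame API, as a sketch only (`df` is a hypothetical DataFrame with customerId and expenses columns, pasted e.g. into a shell backed by the HiveContext the thread already uses): aggregate first, then apply lead() over an ordered window, mirroring the SQL above.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, max}

def leadOfMaxExpenses(df: DataFrame): DataFrame = {
  // the ordering is supplied by the window, not by the aggregate itself
  val byCustomer = Window.orderBy("customerId")
  df.groupBy("customerId")
    .agg(max("expenses").as("maxExpenses"))
    .select(col("customerId"),
            lead(col("maxExpenses"), 1).over(byCustomer).as("nextMax"))
}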


Lead operator not working as aggregation operator

2015-11-02 Thread Shagun Sodhani
Hi! I was trying out window functions in SparkSql (using hive context) and
I noticed that while this

mentions that *lead* is implemented as an aggregate operator, it seems not
to be the case.

I am using the following configuration:

Query : SELECT lead(max(`expenses`)) FROM `table` GROUP BY `customerId`
Spark Version: 10.4
SparkSql Version: 1.5.1

I am using the standard example of a (`customerId`, `expenses`) schema where
each customer has multiple values for expenses (though I am setting age as
Double and not Int as I am trying out maths functions).


*java.lang.NullPointerException at
org.apache.hadoop.hive.ql.udf.generic.GenericUDFLeadLag.evaluate(GenericUDFLeadLag.java:57)*

The entire error stack can be found here .

Can someone confirm if this is an actual issue or some oversight on my part?

Thanks!


Re: Lead operator not working as aggregation operator

2015-11-02 Thread Shagun Sodhani
I get the part about using it with a window, but most other window operators
also work as aggregate operators, and in this case it is specifically
mentioned in the JIRA issue as well. I asked on the dev list rather than the
user list because it was already mentioned in the issue.

On Mon, Nov 2, 2015 at 4:15 PM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hi,
>
> This is more a question for the User list.
>
> Lead and Lag imply ordering of the whole dataset, and this is not
> supported. You can use Lead/Lag in an ordered window function and you'll be
> fine:
>
> *select lead(max(expenses)) over (order by customerId) from tbl group by
> customerId*
>
> HTH
>
> Met vriendelijke groet/Kind regards,
>
> Herman van Hövell tot Westerflier
>
> QuestTec B.V.
> Torenwacht 98
> 2353 DC Leiderdorp
> hvanhov...@questtec.nl
> +31 6 420 590 27
>
>
> 2015-11-02 11:33 GMT+01:00 Shagun Sodhani :
>
>> Hi! I was trying out window functions in SparkSql (using hive context)
>> and I noticed that while this
>> 
>> mentions that *lead* is implemented as an aggregate operator, it seems
>> not to be the case.
>>
>> I am using the following configuration:
>>
>> Query : SELECT lead(max(`expenses`)) FROM `table` GROUP BY `customerId`
>> Spark Version: 10.4
>> SparkSql Version: 1.5.1
>>
>> I am using the standard example of a (`customerId`, `expenses`) schema
>> where each customer has multiple values for expenses (though I am setting
>> age as Double and not Int as I am trying out maths functions).
>>
>>
>> *java.lang.NullPointerException at
>> org.apache.hadoop.hive.ql.udf.generic.GenericUDFLeadLag.evaluate(GenericUDFLeadLag.java:57)*
>>
>> The entire error stack can be found here .
>>
>> Can someone confirm if this is an actual issue or some oversight on my
>> part?
>>
>> Thanks!
>>
>
>