Re: LIMIT issue of SparkSQL

2016-10-29 Thread Asher Krim
We have also found LIMIT to take an unacceptable amount of time when
reading Parquet-formatted data from S3.
LIMIT was not strictly needed for our use case, so we worked around it.

-- 
Asher Krim
Senior Software Engineer

On Fri, Oct 28, 2016 at 5:36 AM, Liz Bai  wrote:

> Sorry for the late reply.
> The size of the raw data is 20G and it is composed of two columns. We
> generated it by this [link].
> The test queries are very simple,
> 1). select ColA from Table limit 1
> 2). select ColA from Table
> 3). select ColA from Table where ColB=0
> 4). select ColA from Table where ColB=0 limit 1
> We found that if we use `result.collect()`, the query stops early upon
> getting adequate results for queries 1) and 4).
> However, when we run `result.write.parquet`, there is no early stop and it
> scans much more data than `result.collect()`.
>
> Below is the detailed testing summary:
>
> Query                                        Method of Saving Results   Run Time
> select ColA from Table limit 1               result.write.Parquet       1m 56s
> select ColA from Table                       result.write.Parquet       1m 40s
> select ColA from Table where ColB=0 limit 1  result.write.Parquet       1m 32s
> select ColA from Table where ColB=0          result.write.Parquet       1m 21s
> select ColA from Table limit 1               result.collect()           18s
> select ColA from Table where ColB=0 limit 1  result.collect()           18s
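
For reference, a minimal sketch of the two ways of materializing the limited query that were timed above (the input/output paths and the `spark` session setup are placeholders, not taken from the original test):

    // Assumed 20G, two-column Parquet input registered as `Table`.
    val df = spark.read.parquet("s3a://bucket/20g-two-column-table")
    df.createOrReplaceTempView("Table")
    val limited = spark.sql("select ColA from Table limit 1")

    // Early stop observed: collect() returns as soon as one row is available.
    val rows = limited.collect()

    // No early stop observed: writing the same limited result as Parquet
    // scanned far more data in the runs above.
    limited.write.parquet("s3a://bucket/limit-1-result")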
>
> Thanks.
>
> Best,
> Liz
>
> On 27 Oct 2016, at 2:16 AM, Michael Armbrust 
> wrote:
>
> That is surprising then, you may have found a bug.  What timings are you
> seeing?  Can you reproduce it with data you can share? I'd open a JIRA if
> so.
>
> On Tue, Oct 25, 2016 at 4:32 AM, Liz Bai  wrote:
>
>> We used Parquet as the data source. The query is like “select ColA from table
>> limit 1”. Attached is its query plan. (However, its run time is just
>> the same as “select ColA from table”.)
>> We expected an early stop upon getting 1 result, rather than scanning all
>> records and finally collecting them with the limit in the final phase.
>> Btw, I agree with Mich’s concern: `limit push down` is impossible when
>> table joins are involved. But some cases, such as “Filter + Projection + Limit”,
>> will benefit from `limit push down`.
>> May I know if there is a detailed solution for this?
>>
>> Thanks so much.
>>
>> Best,
>> Liz
>>
>> 
>>
>> On 25 Oct 2016, at 5:54 AM, Michael Armbrust 
>> wrote:
>>
>> It is not about limits on specific tables.  We do support that.  The case
>> I'm describing involves pushing limits across system boundaries.  It is
>> certainly possible to do this, but the current datasource API does not provide
>> this information (other than the implicit limit that is pushed down to the
>> consumed iterator of the data source).
>>
>> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> This is an interesting point.
>>>
>>> As far as I know, in any database (practically all RDBMSs: Oracle, SAP,
>>> etc.), the LIMIT affects the collection part of the result set.
>>>
>>> The result set is produced fully by the query, which may involve
>>> multiple joins on multiple underlying tables.
>>>
>>> Applying LIMIT to each underlying table of the query does not
>>> make sense and would not be industry standard, AFAIK.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 October 2016 at 06:48, Michael Armbrust 
>>> wrote:
>>>
 - dev + user

 Can you give more info about the query?  Maybe a full explain()?  Are
 you using a datasource like JDBC?  The API does not currently push down
 limits, but the documentation talks about how you can use a query instead
 of a table if that is what you are looking to do.
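
A sketch of that workaround, assuming a hypothetical JDBC URL and credentials: the limit is written into the query that is handed to the source, so only the limited result crosses the datasource boundary.

    // Pass a subquery (with its own limit) where a table name would normally go.
    val limited = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")        // placeholder connection
      .option("dbtable", "(select ColA from A limit 500) t")  // query instead of a table
      .option("user", "user")
      .option("password", "password")
      .load()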

 On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai  wrote:

> Hi all,
>
> Let me clarify the problem:
>
> Suppose we have a simple table `A` with 100 000 000 records
>
> Problem:
> When we execute the SQL query `select * from A limit 500`,
> it scans through all 100 000 000 records.
> The normal behaviour should be that once 500 records are found, the engine
> stops scanning.
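
A quick way to see which limit operators end up in the physical plan (a sketch; it assumes `A` is registered as a view, and the exact operator names vary by Spark version), which is what the detailed observation below refers to:

    spark.sql("select * from A limit 500").explain(true)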
>
> Detailed observation:
> We found that there are “GlobalLimit / LocalLimit” physical operators
> https://github.com/apache/spark/blob/branch-2.0/sql/core/src
> 

Out Of Memory issue

2016-10-29 Thread Kürşat Kurt
Hi;

 

While training NaiveBayes classification, I am getting an OOM.

What is wrong with these parameters?

Here is the spark-submit command: ./spark-submit --class main.scala.Test1
--master local[*] --driver-memory 60g  /home/user1/project_2.11-1.0.jar

PS: The OS is Ubuntu 14.04 and the system has 64GB RAM and a 256GB SSD, running Spark 2.0.1.
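
For context, a minimal sketch of the kind of training call being described (the input path, data format, and parameters are assumptions, not taken from the original job):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.util.MLUtils

    // Hypothetical LIBSVM-formatted training data; lambda/modelType are illustrative.
    val training = MLUtils.loadLibSVMFile(sc, "/home/user1/training.libsvm")
    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

The Kryo and ExternalAppendOnlyMap frames in the log below are consistent with a by-key aggregation of (label, feature vector) pairs spilling to disk during training.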

 

16/10/29 23:32:21 INFO BlockManagerInfo: Removed broadcast_10_piece0 on
89.*:35416 in memory (size: 4.0 MB, free: 31.7 GB)

16/10/29 23:32:21 INFO BlockManagerInfo: Removed broadcast_10_piece1 on
89.*:35416 in memory (size: 2.4 MB, free: 31.7 GB)

16/10/29 23:33:00 INFO ExternalAppendOnlyMap: Thread 123 spilling in-memory
map of 31.8 GB to disk (1 time so far)

16/10/29 23:34:42 INFO ExternalAppendOnlyMap: Thread 123 spilling in-memory
map of 31.8 GB to disk (2 times so far)

16/10/29 23:36:58 INFO ExternalAppendOnlyMap: Thread 123 spilling in-memory
map of 31.8 GB to disk (3 times so far)

16/10/29 23:41:27 WARN TaskMemoryManager: leak 21.2 GB memory from
org.apache.spark.util.collection.ExternalAppendOnlyMap@43ab2e76

16/10/29 23:41:28 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 31)
java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Input.readDoubles(Input.java:885)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.read(DefaultArraySerializers.java:222)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.read(DefaultArraySerializers.java:205)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:759)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:132)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:159)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:515)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:535)
    at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:1004)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$readNextHashCode(ExternalAppendOnlyMap.scala:336)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:409)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:407)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:407)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:302)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.foreach(ExternalAppendOnlyMap.scala:302)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.to(ExternalAppendOnlyMap.scala:302)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.toBuffer(ExternalAppendOnlyMap.scala:302)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)

16/10/29 23:41:28 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-7,5,main]
java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Input.readDoubles(Input.java:885)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.read(DefaultArraySerializers.java:222)
    at

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Cody Koeninger
I tested your claims that "it used to work that way", and was unable
to reproduce them.  As far as I can tell, streams have always failed
the very first time you start them in that situation.  As Chris and I
pointed out, there are good reasons for that.

If you don't want to operationalize topic creation, just start the
stream again after it fails the very first time you start it with a
new topic.  If you don't want to operationalize monitoring whether
streams actually started, especially when they fail within seconds, I
don't know what more I can say.
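
For reference, a minimal sketch of the direct-stream setup under discussion (broker list, topic name, and batch interval are placeholders); the stack trace quoted further down shows the topic-metadata check firing inside createDirectStream, before any batch runs:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(5))                  // assumes an existing SparkContext `sc`
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val topics = Set("topicX")                                      // topic that may not exist yet

    // With spark-streaming-kafka-0-8, partition metadata is fetched up front,
    // which is where "Does the topic exist?" is raised if the topic is missing.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)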

On Sat, Oct 29, 2016 at 8:52 AM, Dmitry Goldenberg
 wrote:
> Cody,
>
> Thanks for your comments.
>
> The way I'm reading the Kafka documentation
> (https://kafka.apache.org/documentation) is that auto.create.topics.enable
> is set to true by default. Right now it's not set in our server.properties
> on the Kafka broker side so I would imagine that the first request to
> publish a document into topic X would cause X to be created, as
> auto.create.topics.enable is presumably defaulted to true.
>
> Basically, I used to be able to start a streaming Kafka job first, without
> the topic X already existing, then let the producer publish the first (and
> all subsequent) documents and the consumer would get the documents from that
> point.
>
> This mode is not working anymore. Despite auto.create.topics.enable
> presumably defaulting to true (?), I'm getting the "Does the topic exist"
> exception.
>
> Not a big problem but raises the question of, when would the topic be
> "auto-created" if not on the first document being published to it?
>
> It was nice when it was working because we didn't have to operationalize
> topic creation. Not a big deal but now we'll have to make sure we execute
> the 'create-topics' type of task or shell script at install time.
>
> This seems like a Kafka doc issue potentially, to explain what exactly one
> can expect from the auto.create.topics.enable flag.
>
> -Dmitry
>
>
> On Sat, Oct 8, 2016 at 1:26 PM, Cody Koeninger  wrote:
>>
>> So I just now retested this with 1.5.2, and 2.0.0, and the behavior is
>> exactly the same across spark versions.
>>
>> If the topic hasn't been created, you will get that error on startup,
>> because the topic doesn't exist and thus doesn't have metadata.
>>
>> If you have auto.create.topics.enable set to true on the broker
>> config, the request will fairly quickly lead to the topic being
>> created after the fact.
>>
>> All you have to do is hit up-arrow-enter and re-submit the spark job,
>> the second time around the topic will exist.  That seems pretty low
>> effort.
>>
>> I'd rather stick with having an early error for those of us that
>> prefer to run with auto.create set to false (because it makes sure the
>> topic is actually set up the way you want, reduces the likelihood of
>> spurious topics being created, etc).
>>
>>
>>
>> On Sat, Oct 8, 2016 at 11:44 AM, Dmitry Goldenberg
>>  wrote:
>> > Hi,
>> >
>> > I am trying to start up a simple consumer that streams from a Kafka
>> > topic,
>> > using Spark 2.0.0:
>> >
>> > spark-streaming_2.11
>> > spark-streaming-kafka-0-8_2.11
>> >
>> > I was getting an error as below until I created the topic in Kafka. From
>> > integrating Spark 1.5, I never used to hit this check; we were able to
>> > start
>> > all of our Spark Kafka consumers, then start the producers, and have
>> > Kafka
>> > automatically create the topics once the first message for a given topic
>> > was
>> > published.
>> >
>> > Is there something I might be doing to cause this topic existence check
>> > in
>> > KafkaCluster.scala to kick in? I'd much rather be able to not have to
>> > pre-create the topics before I start the consumers.  Any
>> > thoughts/comments
>> > would be appreciated.
>> >
>> > Thanks.
>> > - Dmitry
>> >
>> > 
>> >
>> > Exception in thread "main" org.apache.spark.SparkException:
>> > java.nio.channels.ClosedChannelException
>> >
>> > java.nio.channels.ClosedChannelException
>> >
>> > org.apache.spark.SparkException: Error getting partition metadata for
>> > ''. Does the topic exist?
>> >
>> > at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:373)
>> > at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:373)
>> > at scala.util.Either.fold(Either.scala:98)
>> > at org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:372)
>> > at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>> > at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>> > at
>> > 

Re: spark dataframe rolling window for user define operation

2016-10-29 Thread ayan guha
Avg is an aggregation function. You need to write XYZ as a user-defined
aggregate function (UDAF).
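
A minimal sketch of what such a UDAF could look like in Spark 2.0, using an alternating +/- sum as a stand-in for XYZ (the object name and sign convention are illustrative, the input is assumed to be a non-null Double column, and an order-dependent aggregate like this relies on rows arriving in frame order):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    object AlternatingSum extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
      def bufferSchema: StructType =
        StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
      def dataType: DataType = DoubleType
      def deterministic: Boolean = false   // result depends on row order within the frame

      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = 0.0
        buffer(1) = 0L
      }
      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val sign = if (buffer.getLong(1) % 2 == 0) 1.0 else -1.0   // +, -, +, - ...
        buffer(0) = buffer.getDouble(0) + sign * input.getDouble(0)
        buffer(1) = buffer.getLong(1) + 1
      }
      def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
        val sign = if (b1.getLong(1) % 2 == 0) 1.0 else -1.0       // keep sign parity across partial buffers
        b1(0) = b1.getDouble(0) + sign * b2.getDouble(0)
        b1(1) = b1.getLong(1) + b2.getLong(1)
      }
      def evaluate(buffer: Row): Any = buffer.getDouble(0)
    }

    // Usage mirroring the avg example from the question:
    // val wSpec1 = Window.orderBy("c1").rowsBetween(-20, 20)
    // val dfWithAlternate = df.withColumn("alter", AlternatingSum(df("c2")).over(wSpec1))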

On Sat, Oct 29, 2016 at 9:28 PM, Manjunath, Kiran 
wrote:

> Is there a way to get a user-defined operation to be used for a rolling window
> operation?
>
>
>
> Like – Instead of
>
>
>
> val wSpec1 = Window.orderBy("c1").rowsBetween(-20, +20)
>
> var dfWithMovingAvg = df.withColumn( "Avg",avg(df("c2")).over(wSpec1))
>
>
>
> Something like
>
>
>
> val wSpec1 = Window.orderBy("c1").rowsBetween(-20, +20)
>
> var dfWithAlternate = df.withColumn( "alter", XYZ(df("c2")).over(wSpec1))
>
>
>
> Where the XYZ function can be +, -, +, - alternately
>
>
>
>
>
> PS: I have posted the same question at
> http://stackoverflow.com/questions/40318010/spark-dataframe-rolling-window-user-define-operation
>
>
>
> Regards,
>
> Kiran
>



-- 
Best Regards,
Ayan Guha


Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Dmitry Goldenberg
Cody,

Thanks for your comments.

The way I'm reading the Kafka documentation (
https://kafka.apache.org/documentation) is that auto.create.topics.enable
is set to true by default. Right now it's not set in our server.properties
on the Kafka broker side so I would imagine that the first request to
publish a document into topic X would cause X to be created, as
auto.create.topics.enable is presumably defaulted to true.

Basically, I used to be able to start a streaming Kafka job first, without
the topic X already existing, then let the producer publish the first (and
all subsequent) documents and the consumer would get the documents from
that point.

This mode is not working anymore. Despite auto.create.topics.enable
presumably defaulting to true (?), I'm getting the "Does the topic exist"
exception.

Not a big problem but raises the question of, when would the topic be
"auto-created" if not on the first document being published to it?

It was nice when it was working because we didn't have to operationalize
topic creation. Not a big deal but now we'll have to make sure we execute
the 'create-topics' type of task or shell script at install time.

This seems like a Kafka doc issue potentially, to explain what exactly one
can expect from the auto.create.topics.enable flag.

-Dmitry


On Sat, Oct 8, 2016 at 1:26 PM, Cody Koeninger  wrote:

> So I just now retested this with 1.5.2, and 2.0.0, and the behavior is
> exactly the same across spark versions.
>
> If the topic hasn't been created, you will get that error on startup,
> because the topic doesn't exist and thus doesn't have metadata.
>
> If you have auto.create.topics.enable set to true on the broker
> config, the request will fairly quickly lead to the topic being
> created after the fact.
>
> All you have to do is hit up-arrow-enter and re-submit the spark job,
> the second time around the topic will exist.  That seems pretty low
> effort.
>
> I'd rather stick with having an early error for those of us that
> prefer to run with auto.create set to false (because it makes sure the
> topic is actually set up the way you want, reduces the likelihood of
> spurious topics being created, etc).
>
>
>
> On Sat, Oct 8, 2016 at 11:44 AM, Dmitry Goldenberg
>  wrote:
> > Hi,
> >
> > I am trying to start up a simple consumer that streams from a Kafka
> topic,
> > using Spark 2.0.0:
> >
> > spark-streaming_2.11
> > spark-streaming-kafka-0-8_2.11
> >
> > I was getting an error as below until I created the topic in Kafka. From
> > integrating Spark 1.5, I never used to hit this check; we were able to
> start
> > all of our Spark Kafka consumers, then start the producers, and have
> Kafka
> > automatically create the topics once the first message for a given topic
> was
> > published.
> >
> > Is there something I might be doing to cause this topic existence check
> in
> > KafkaCluster.scala to kick in? I'd much rather be able to not have to
> > pre-create the topics before I start the consumers.  Any
> thoughts/comments
> > would be appreciated.
> >
> > Thanks.
> > - Dmitry
> >
> > 
> >
> > Exception in thread "main" org.apache.spark.SparkException:
> > java.nio.channels.ClosedChannelException
> >
> > java.nio.channels.ClosedChannelException
> >
> > org.apache.spark.SparkException: Error getting partition metadata for
> > ''. Does the topic exist?
> >
> > at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:373)
> > at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:373)
> > at scala.util.Either.fold(Either.scala:98)
> > at org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:372)
> > at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
> > at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> > at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
> > at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
> > at com.citi.totalconduct.consumer.kafka.spark.KafkaSparkStreamingDriver.createContext(KafkaSparkStreamingDriver.java:253)
> > at com.citi.totalconduct.consumer.kafka.spark.KafkaSparkStreamingDriver.execute(KafkaSparkStreamingDriver.java:166)
> > at com.citi.totalconduct.consumer.kafka.spark.KafkaSparkStreamingDriver.main(KafkaSparkStreamingDriver.java:305)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> 

Re: Spark 2.0 with Hadoop 3.0?

2016-10-29 Thread Steve Loughran

On 27 Oct 2016, at 23:04, adam kramer wrote:

Is the version of Spark built for Hadoop 2.7 and later only for 2.x releases?

Is there any reason why Hadoop 3.0 is a non-starter for use with Spark
2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which
would resolve our driver dependency issues.

what version problems are you having there?


There's a patch to move to AWS SDK 10.10, but that has a Jackson 2.6.6+
dependency; that is something I'd like to do in Hadoop branch-2 as well, as
it is Time to Move On (HADOOP-12705). FWIW, all Jackson 1.9 dependencies have
been ripped out, leaving only that 2.x version problem.

https://issues.apache.org/jira/browse/HADOOP-13050

The HADOOP-13345 s3guard work will pull in a (provided) dependency on dynamodb; 
looks like the HADOOP-13449 patch moves to SDK 1.11.0.

I think we are likely to backport that to branch-2 as well, though it'd help 
the dev & test there if you built and tested your code against trunk early —not 
least to find any changes in that transitive dependency set.


Thanks,
Adam






spark-submit fails after setting userClassPathFirst to true

2016-10-29 Thread sudhir patil
After I set spark.driver.userClassPathFirst=true, my spark-submit --master
yarn-client fails with the error below, and it works fine if I remove the
userClassPathFirst setting. I need this setting to avoid class
conflicts in another job, so I am trying to make it work in a
simple job first and then try it with the job that has class conflicts.

From a quick search, it looks like this error occurs when the driver cannot find
the YARN and Hadoop related config, so I exported SPARK_CONF_DIR and HADOOP_CONF_DIR and
also added the config files via the --jars option, but I still get the same error.

Any ideas on how to fix this?

org.apache.spark.SparkException: Unable to load YARN support
    at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:399)
    at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:394)
    at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:394)
    at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:411)
    at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2119)
    at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:289)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:462)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
    at com.citi.ripcurl.timeseriesbatch.BatchContext.<init>(BatchContext.java:27)
    at com.citi.ripcurl.timeseriesbatch.example.EqDataQualityExample.runReportQuery(EqDataQualityExample.java:28)
    at com.citi.ripcurl.timeseriesbatch.example.EqDataQualityExample.main(EqDataQualityExample.java:70)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.security.ShellBasedUnixGroupsMapping not org.apache.hadoop.security.GroupMappingServiceProvider


Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-29 Thread kant kodali
Another thing I forgot to mention is that it happens after running for
several hours, say 4 to 5 hours. I am not sure why it is creating so many
threads. Is there any way to control them?
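
Not an answer to why, but a small driver-side diagnostic sketch (names are illustrative) that can be called periodically to see how many threads the JVM holds and which thread-name prefixes keep growing:

    import java.lang.management.ManagementFactory
    import scala.collection.JavaConverters._

    // Log the overall thread count plus the most common thread-name prefixes.
    val mx = ManagementFactory.getThreadMXBean
    println(s"live threads: ${mx.getThreadCount}, peak: ${mx.getPeakThreadCount}")

    val byPrefix = Thread.getAllStackTraces.keySet.asScala.toSeq
      .groupBy(_.getName.takeWhile(c => !c.isDigit))   // crude grouping by name prefix
      .mapValues(_.size)
      .toSeq
      .sortBy(-_._2)

    byPrefix.take(10).foreach { case (prefix, n) => println(f"$n%6d  $prefix") }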

On Fri, Oct 28, 2016 at 12:47 PM, kant kodali  wrote:

>  "dag-scheduler-event-loop" java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:714)
>     at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
>     at scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.push(ForkJoinPool.java:1072)
>     at scala.concurrent.forkjoin.ForkJoinTask.fork(ForkJoinTask.java:654)
>     at scala.collection.parallel.ForkJoinTasks$WrappedTask$
>
> This is the error produced by the Spark driver program, which is running in
> client mode by default. Some people say to just increase the heap size by
> passing the --driver-memory 3g flag, but the message "unable to create new
> native thread" really says that the JVM is asking the OS to create a new
> thread and the OS couldn't allocate any more. The number of threads a JVM can
> create by requesting the OS is platform dependent, but it is typically around
> 32K threads on a 64-bit JVM. So I am wondering why Spark is even creating so
> many threads and how I can control this number.


spark dataframe rolling window for user define operation

2016-10-29 Thread Manjunath, Kiran
Is there a way to get a user-defined operation to be used for a rolling window
operation?

Like – Instead of

val wSpec1 = Window.orderBy("c1").rowsBetween(-20, +20)
var dfWithMovingAvg = df.withColumn( "Avg",avg(df("c2")).over(wSpec1))

Something like

val wSpec1 = Window.orderBy("c1").rowsBetween(-20, +20)
var dfWithAlternate = df.withColumn( "alter",XYZ(df("c2")).over(wSpec1))

Where the XYZ function can be +, -, +, - alternately


PS : I have posted the same question at 
http://stackoverflow.com/questions/40318010/spark-dataframe-rolling-window-user-define-operation

Regards,
Kiran