BUILD FAILURE at spark-sql_2.11?!

2016-01-27 Thread Jacek Laskowski
Hi,

Tried to build the sources today with Scala 2.11 twice and it failed.
No local changes. Restarted zinc.

Can anyone else confirm it?

Since the error is buried in the logs, I'm asking now without offering
more information (before I track down the cause), so that either I or *the
issue* gets corrected, whichever comes first :). Thanks!

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: BUILD FAILURE at spark-sql_2.11?!

2016-01-27 Thread Jacek Laskowski
Hi,

My very rough investigation showed that the commit that may have
broken the build is
https://github.com/apache/spark/commit/555127387accdd7c1cf236912941822ba8af0a52
(nongli committed with rxin 7 hours ago).

Found a fix and am building the sources again...

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jan 27, 2016 at 9:21 AM, Jacek Laskowski  wrote:
> Hi,
>
> Tried to build the sources today with Scala 2.11 twice and it failed.
> No local changes. Restarted zinc.
>
> Can anyone else confirm it?
>
> Since the error is buried in the logs I'm asking now without offering
> more information (before I catch the cause) so I or *the issue* get
> corrected (whatever first :)). Thanks!
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: BUILD FAILURE at spark-sql_2.11?!

2016-01-27 Thread Jacek Laskowski
Hi,

Pull request submitted
https://github.com/apache/spark/pull/10946/files. Please review and
merge.

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jan 27, 2016 at 9:47 AM, Jacek Laskowski  wrote:
> Hi,
>
> My very rough investigation has showed that the commit to may have
> broken the build was
> https://github.com/apache/spark/commit/555127387accdd7c1cf236912941822ba8af0a52
> (nongli committed with rxin 7 hours ago).
>
> Found a fix and building the source again...
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jan 27, 2016 at 9:21 AM, Jacek Laskowski  wrote:
>> Hi,
>>
>> Tried to build the sources today with Scala 2.11 twice and it failed.
>> No local changes. Restarted zinc.
>>
>> Can anyone else confirm it?
>>
>> Since the error is buried in the logs I'm asking now without offering
>> more information (before I catch the cause) so I or *the issue* get
>> corrected (whatever first :)). Thanks!
>>
>> Pozdrawiam,
>> Jacek
>>
>> Jacek Laskowski | https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark
>> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>> Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: BUILD FAILURE at spark-sql_2.11?!

2016-01-27 Thread Jean-Baptiste Onofré

Thanks Jacek,

I have the same issue here.

Regards
JB

On 01/27/2016 10:15 AM, Jacek Laskowski wrote:

Hi,

Pull request submitted
https://github.com/apache/spark/pull/10946/files. Please review and
merge.

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jan 27, 2016 at 9:47 AM, Jacek Laskowski  wrote:

Hi,

My very rough investigation showed that the commit that may have
broken the build is
https://github.com/apache/spark/commit/555127387accdd7c1cf236912941822ba8af0a52
(nongli committed with rxin 7 hours ago).

Found a fix and am building the sources again...

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jan 27, 2016 at 9:21 AM, Jacek Laskowski  wrote:

Hi,

Tried to build the sources today with Scala 2.11 twice and it failed.
No local changes. Restarted zinc.

Can anyone else confirm it?

Since the error is buried in the logs, I'm asking now without offering
more information (before I track down the cause), so that either I or *the
issue* gets corrected, whichever comes first :). Thanks!

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: BUILD FAILURE at spark-sql_2.11?!

2016-01-27 Thread Ted Yu
Strangely both Jenkins jobs showed green status:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.11/
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-MAVEN-SCALA-2.11/

On Wed, Jan 27, 2016 at 12:47 AM, Jacek Laskowski  wrote:

> Hi,
>
> My very rough investigation has showed that the commit to may have
> broken the build was
>
> https://github.com/apache/spark/commit/555127387accdd7c1cf236912941822ba8af0a52
> (nongli committed with rxin 7 hours ago).
>
> Found a fix and building the source again...
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jan 27, 2016 at 9:21 AM, Jacek Laskowski  wrote:
> > Hi,
> >
> > Tried to build the sources today with Scala 2.11 twice and it failed.
> > No local changes. Restarted zinc.
> >
> > Can anyone else confirm it?
> >
> > Since the error is buried in the logs I'm asking now without offering
> > more information (before I catch the cause) so I or *the issue* get
> > corrected (whatever first :)). Thanks!
> >
> > Pozdrawiam,
> > Jacek
> >
> > Jacek Laskowski | https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark
> > ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> > Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Generate Amplab queries set

2016-01-27 Thread Akhil Das
Have a look at the TPC-H queries; I found this repository with the queries:
https://github.com/ssavvides/tpch-spark
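
If it helps, below is a rough, untested sketch of one way to generate a random
mix of scan/aggregate/join queries. The table and column names are assumptions
taken from the published AMPLab benchmark schema (rankings, uservisits);
adjust them to however you registered the data. It also assumes an existing
SQLContext (e.g. the one in spark-shell) with those temporary tables
registered.

import scala.util.Random

// Three query templates, one per query type (scan, aggregate, join).
val scan =
  "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100"
val aggregate =
  "SELECT SUBSTR(sourceIP, 1, 8) AS prefix, SUM(adRevenue) " +
  "FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)"
val join =
  """SELECT r.pageURL, r.pageRank, v.adRevenue
    |FROM rankings r JOIN uservisits v ON r.pageURL = v.destURL
    |WHERE r.pageRank > 10""".stripMargin

// Draw 20 queries at random to get a mixed workload.
val templates = Seq(scan, aggregate, join)
val mixedQueries = Seq.fill(20)(templates(Random.nextInt(templates.size)))

// Run them; count() just forces execution of each query.
mixedQueries.foreach(q => sqlContext.sql(q).count())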

Thanks
Best Regards

On Fri, Jan 22, 2016 at 1:35 AM, sara mustafa 
wrote:

> Hi,
> I have downloaded the Amplab benchmark dataset from
> s3n://big-data-benchmark/pavlo/text/tiny, but I don't know how to generate
> a
> set of random mixed queries of different types like scan,aggregate and
> join.
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Generate-Amplab-queries-set-tp16071.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Using distinct count in over clause

2016-01-27 Thread Akhil Das
Does it support OVER? I couldn't find it in the documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

Thanks
Best Regards

On Fri, Jan 22, 2016 at 2:31 PM, 汪洋  wrote:

> I think it cannot be right.
>
> On Jan 22, 2016, at 4:53 PM, 汪洋  wrote:
>
> Hi,
>
> Do we support distinct count in the over clause in spark sql?
>
> I ran a sql like this:
>
> select a, count(distinct b) over ( order by a rows between unbounded
> preceding and current row) from table limit 10
>
> Currently, it returns an error that says: expression 'a' is neither present
> in the group by, nor is it an aggregate function. Add to group by or wrap in
> first() if you don't care which value you get.;
>
> Yang
>
>
>


Re: Using distinct count in over clause

2016-01-27 Thread Herman van Hövell tot Westerflier
Hi,

We currently do not support distinct clauses in window functions. Nor is
such functionality planned.

Spark 2.0 uses native Spark UDAFs (instead of Hive window functions) and
allows you to use your own UDAFs; it is trivial to implement a distinct
count/sum in that case.
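
For reference, a rough sketch of what such a user-defined distinct count could
look like with the UserDefinedAggregateFunction API (available since Spark
1.5). It simply buffers the values it has seen, so it assumes modest per-group
cardinality; it is only an illustration, not what Spark does internally:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Counts distinct string values by keeping the set of seen values in the buffer.
class DistinctCount extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("seen", ArrayType(StringType)) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[String]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val seen = buffer.getSeq[String](0)
      val v = input.getString(0)
      if (!seen.contains(v)) buffer(0) = seen :+ v
    }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = (buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)).distinct

  def evaluate(buffer: Row): Any = buffer.getSeq[String](0).size.toLong
}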

Kind regards,

Herman van Hövell

2016-01-27 13:25 GMT+01:00 Akhil Das :

> Does it support over? I couldn't find it in the documentation
> http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
>
> Thanks
> Best Regards
>
> On Fri, Jan 22, 2016 at 2:31 PM, 汪洋  wrote:
>
>> I think it cannot be right.
>>
>> On Jan 22, 2016, at 4:53 PM, 汪洋  wrote:
>>
>> Hi,
>>
>> Do we support distinct count in the over clause in spark sql?
>>
>> I ran a sql like this:
>>
>> select a, count(distinct b) over ( order by a rows between unbounded
>> preceding and current row) from table limit 10
>>
>> Currently, it returns an error that says: expression 'a' is neither present
>> in the group by, nor is it an aggregate function. Add to group by or wrap in
>> first() if you don't care which value you get.;
>>
>> Yang
>>
>>
>>
>


Re: timeout in shuffle problem

2016-01-27 Thread Hamel Kothari
Are you running on YARN? Another possibility here is that your shuffle
managers are facing GC pain and becoming less responsive, thus missing
timeouts. Can you try increasing the memory on the node managers and see if
that helps?

On Sun, Jan 24, 2016 at 4:58 PM Ted Yu  wrote:

> Cycling past bits:
>
> http://search-hadoop.com/m/q3RTtU5CRU1KKVA42&subj=RE+shuffle+FetchFailedException+in+spark+on+YARN+job
>
> On Sun, Jan 24, 2016 at 5:52 AM, wangzhenhua (G) 
> wrote:
>
>> Hi,
>>
>> I have a timeout problem in shuffle; it happened after shuffle write and
>> at the start of shuffle read. Logs on the driver and executors are shown
>> below. The Spark version is 1.5. Looking forward to your replies. Thanks!
>>
>> logs on driver only have warnings:
>>
>> WARN TaskSetManager: Lost task 38.0 in stage 27.0 (TID 127459, linux-162): 
>> FetchFailed(BlockManagerId(66, 172.168.100.12, 23028), shuffleId=9, 
>> mapId=55, reduceId=38, message=
>>
>> org.apache.spark.shuffle.FetchFailedException: 
>> java.util.concurrent.TimeoutException: Timeout waiting for task.
>>
>> at 
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)
>>
>> at 
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:306)
>>
>> at 
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at 
>> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>>
>> at 
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at 
>> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:173)
>> at org.apache.spark.sql.execution.TungstenSort.org
>> $apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
>>
>> at 
>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>>
>> at 
>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apa

RE: spark hivethriftserver problem on 1.5.0 -> 1.6.0 upgrade

2016-01-27 Thread james.gre...@baesystems.com

Thanks Yin,  here are the logs:



INFO  SparkContext - Added JAR file:/home/jegreen1/mms/zookeeper-3.4.6.jar at 
http://10.39.65.122:38933/jars/zookeeper-3.4.6.jar with timestamp 1453907484092
INFO  SparkContext - Added JAR 
file:/home/jegreen1/mms/mms-http-0.2-SNAPSHOT.jar at 
http://10.39.65.122:38933/jars/mms-http-0.2-SNAPSHOT.jar with timestamp 
1453907484093
INFO  Executor - Starting executor ID driver on host localhost
INFO  Utils - Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 41220.
INFO  NettyBlockTransferService - Server created on 41220
INFO  BlockManagerMaster - Trying to register BlockManager
INFO  BlockManagerMasterEndpoint - Registering block manager localhost:41220 
with 511.1 MB RAM, BlockManagerId(driver, localhost, 41220)
INFO  BlockManagerMaster - Registered BlockManager
INFO  HiveContext - Initializing execution hive, version 1.2.1
INFO  ClientWrapper - Inspected Hadoop version: 2.6.0
INFO  ClientWrapper - Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for 
Hadoop version 2.6.0
WARN  HiveConf - HiveConf of name hive.enable.spark.execution.engine does not 
exist
INFO  HiveMetaStore - 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
INFO  ObjectStore - ObjectStore, initialize called
INFO  Persistence - Property hive.metastore.integral.jdo.pushdown unknown - 
will be ignored
INFO  Persistence - Property datanucleus.cache.level2 unknown - will be ignored
WARN  HiveConf - HiveConf of name hive.enable.spark.execution.engine does not 
exist
INFO  ObjectStore - Setting MetaStore object pin classes with 
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
INFO  Datastore - The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
INFO  Datastore - The class "org.apache.hadoop.hive.metastore.model.MOrder" is 
tagged as "embedded-only" so does not have its own datastore table.
INFO  Datastore - The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
INFO  Datastore - The class "org.apache.hadoop.hive.metastore.model.MOrder" is 
tagged as "embedded-only" so does not have its own datastore table.
INFO  MetaStoreDirectSql - Using direct SQL, underlying DB is DERBY
INFO  ObjectStore - Initialized ObjectStore
WARN  ObjectStore - Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
WARN  ObjectStore - Failed to get database default, returning 
NoSuchObjectException
INFO  HiveMetaStore - Added admin role in metastore
INFO  HiveMetaStore - Added public role in metastore
INFO  HiveMetaStore - No user is added in admin role, since config is empty
INFO  HiveMetaStore - 0: get_all_databases
INFO  audit - ugi=jegreen1  ip=unknown-ip-addr  cmd=get_all_databases
INFO  HiveMetaStore - 0: get_functions: db=default pat=*
INFO  audit - ugi=jegreen1  ip=unknown-ip-addr  cmd=get_functions: 
db=default pat=*
INFO  Datastore - The class 
"org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as 
"embedded-only" so does not have its own datastore table.
WARN  NativeCodeLoader - Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable
INFO  SessionState - Created local directory: 
/tmp/9b102c97-c3f4-4d92-b722-0a2e257d3b5b_resources
INFO  SessionState - Created HDFS directory: 
/tmp/hive/jegreen1/9b102c97-c3f4-4d92-b722-0a2e257d3b5b
INFO  SessionState - Created local directory: 
/tmp/jegreen1/9b102c97-c3f4-4d92-b722-0a2e257d3b5b
INFO  SessionState - Created HDFS directory: 
/tmp/hive/jegreen1/9b102c97-c3f4-4d92-b722-0a2e257d3b5b/_tmp_space.db
WARN  HiveConf - HiveConf of name hive.enable.spark.execution.engine does not 
exist
INFO  HiveContext - default warehouse location is /user/hive/warehouse
INFO  HiveContext - Initializing HiveMetastoreConnection version 1.2.1 using 
Spark classes.
INFO  ClientWrapper - Inspected Hadoop version: 2.6.0
INFO  ClientWrapper - Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for 
Hadoop version 2.6.0
WARN  HiveConf - HiveConf of name hive.enable.spark.execution.engine does not 
exist
INFO  metastore - Trying to connect to metastore with URI 
thrift://dkclusterm2.imp.net:9083
INFO  metastore - Connected to metastore.
INFO  SessionState - Created local directory: 
/tmp/7e230580-37af-47d3-81cc-eb4829b8da62_resources
INFO  SessionState - Created HDFS directory: 
/tmp/hive/jegreen1/7e230580-37af-47d3-81cc-eb4829b8da62
INFO  SessionState - Created local directory: 
/tmp/jegreen1/7e230580-37af-47d3-81cc-eb4829b8da62
INFO  SessionState - Created HDFS directory: 
/tmp/hive/jegreen1/7e230580-37af-47d3-81cc-eb4829b8da62/_tmp_space.db
INFO  ParquetRelation - Listing 
hdfs://dkclusterm1.imp.net:8020/user/jegreen1/ex208 on driver
INFO  Spa

Adding Naive Bayes sample code in Documentation

2016-01-27 Thread Vinayak Agrawal
Hi,
I was reading through the Spark ML package and I couldn't find Naive Bayes
examples documented on the Spark documentation page.
http://spark.apache.org/docs/latest/ml-classification-regression.html

However, the API exists and can be used.
https://spark.apache.org/docs/1.5.2/api/python/pyspark.ml.html#module-pyspark.ml.classification

Can the examples be added in the latest documentation?
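
For reference, a minimal, untested sketch of what such a Scala example could
look like (it assumes an existing SQLContext, e.g. in spark-shell, and the
sample_libsvm_data.txt file that ships with Spark):

import org.apache.spark.ml.classification.NaiveBayes

// Load the sample data in LibSVM format and split it for training/testing.
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

// Train with default smoothing and score the held-out split.
val model = new NaiveBayes().fit(training)
val predictions = model.transform(test)
predictions.select("label", "prediction").show(5)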

-- 
Vinayak Agrawal


"To Strive, To Seek, To Find and Not to Yield!"
~Lord Alfred Tennyson


Multiple spark contexts

2016-01-27 Thread Jakob Odersky
A while ago, I remember reading that multiple active Spark contexts
per JVM was a possible future enhancement.
I was wondering if this is still on the roadmap, what the major
obstacles are and if I can be of any help in adding this feature?

regards,
--Jakob

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Multiple spark contexts

2016-01-27 Thread Ashish Soni
There is a property you need to set:
spark.driver.allowMultipleContexts=true
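
A minimal sketch of setting it (Spark 1.x SparkConf API); as far as I know the
flag only relaxes the runtime check, it does not make multiple contexts an
officially supported mode:

import org.apache.spark.{SparkConf, SparkContext}

// Build a second context in the same JVM with the safety check relaxed.
val conf = new SparkConf()
  .setAppName("second-context")
  .setMaster("local[*]")  // placeholder master
  .set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf)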

Ashish

On Wed, Jan 27, 2016 at 1:39 PM, Jakob Odersky  wrote:

> A while ago, I remember reading that multiple active Spark contexts
> per JVM was a possible future enhancement.
> I was wondering if this is still on the roadmap, what the major
> obstacles are and if I can be of any help in adding this feature?
>
> regards,
> --Jakob
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Multiple spark contexts

2016-01-27 Thread Reynold Xin
There are no major obstacles, just a million tiny obstacles that would take
forever to fix.


On Wed, Jan 27, 2016 at 10:39 AM, Jakob Odersky  wrote:

> A while ago, I remember reading that multiple active Spark contexts
> per JVM was a possible future enhancement.
> I was wondering if this is still on the roadmap, what the major
> obstacles are and if I can be of any help in adding this feature?
>
> regards,
> --Jakob
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Multiple spark contexts

2016-01-27 Thread Herman van Hövell tot Westerflier
Just out of curiosity, what is the use case for having multiple active
contexts in a single JVM?

Kind regards,

Herman van Hövell

2016-01-27 19:41 GMT+01:00 Ashish Soni :

> There is a property you need to set which is
> spark.driver.allowMultipleContexts=true
>
> Ashish
>
> On Wed, Jan 27, 2016 at 1:39 PM, Jakob Odersky  wrote:
>
>> A while ago, I remember reading that multiple active Spark contexts
>> per JVM was a possible future enhancement.
>> I was wondering if this is still on the roadmap, what the major
>> obstacles are and if I can be of any help in adding this feature?
>>
>> regards,
>> --Jakob
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Multiple spark contexts

2016-01-27 Thread Nicholas Chammas
There is a lengthy discussion about this on the JIRA:
https://issues.apache.org/jira/browse/SPARK-2243

On Wed, Jan 27, 2016 at 1:43 PM Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Just out of curiosity, what is the use case for having multiple active
> contexts in a single JVM?
>
> Kind regards,
>
> Herman van Hövell
>
> 2016-01-27 19:41 GMT+01:00 Ashish Soni :
>
>> There is a property you need to set which is
>> spark.driver.allowMultipleContexts=true
>>
>> Ashish
>>
>> On Wed, Jan 27, 2016 at 1:39 PM, Jakob Odersky  wrote:
>>
>>> A while ago, I remember reading that multiple active Spark contexts
>>> per JVM was a possible future enhancement.
>>> I was wondering if this is still on the roadmap, what the major
>>> obstacles are and if I can be of any help in adding this feature?
>>>
>>> regards,
>>> --Jakob
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: Spark 2.0.0 release plan

2016-01-27 Thread Michael Armbrust
We do maintenance releases on demand when there is enough to justify doing
one.  I'm hoping to cut 1.6.1 soon, but have not had time yet.

On Wed, Jan 27, 2016 at 8:12 AM, Daniel Siegmann <
daniel.siegm...@teamaol.com> wrote:

> Will there continue to be monthly releases on the 1.6.x branch during the
> additional time for bug fixes and such?
>
> On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers  wrote:
>
>> thanks thats all i needed
>>
>> On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen  wrote:
>>
>>> I think it will come significantly later -- or else we'd be at code
>>> freeze for 2.x in a few days. I haven't heard anyone discuss this
>>> officially but had batted around May or so instead informally in
>>> conversation. Does anyone have a particularly strong opinion on that?
>>> That's basically an extra 3 month period.
>>>
>>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>>>
>>> On Tue, Jan 26, 2016 at 10:00 PM, Koert Kuipers 
>>> wrote:
>>> > Is the idea that spark 2.0 comes out roughly 3 months after 1.6? So
>>> > quarterly release as usual?
>>> > Thanks
>>>
>>
>>
>


spark job scheduling

2016-01-27 Thread Niranda Perera
hi all,

I have a few questions on spark job scheduling.

1. As I understand, the smallest unit of work an executor can perform. In
the 'fair' scheduler mode, let's say  a job is submitted to the spark ctx
which has a considerable amount of work to do in a task. While such a 'big'
task is running, can we still submit another smaller job (from a separate
thread) and get it done? or does that smaller job has to wait till the
bigger task finishes and the resources are freed from the executor?

2. When a job is submitted without setting a scheduler pool, the default
scheduler pool is assigned to it, which employs FIFO scheduling. but what
happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
without specifying a scheduler pool (which has FAIR scheduling)? would the
jobs still run in FIFO mode with the default pool?
essentially, for us to really set FAIR scheduling, do we have to assign a
FAIR scheduler pool?

best

-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: Re: timeout in shuffle problem

2016-01-27 Thread wangzhenhua (G)
external shuffle service is not enabled
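
(For reference, enabling it on the application side is usually just the
setting below; a sketch assuming Spark 1.5-era configuration. On YARN the
NodeManagers must additionally run Spark's spark_shuffle auxiliary service,
which is cluster configuration rather than application code.)

import org.apache.spark.{SparkConf, SparkContext}

// Application-side switch for the external shuffle service.
val conf = new SparkConf()
  .setAppName("shuffle-service-example")
  .set("spark.shuffle.service.enabled", "true")
val sc = new SparkContext(conf)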


best regards,
-zhenhua

From: Hamel Kothari
Date: 2016-01-27 22:21
To: Ted Yu; wangzhenhua 
(G)
CC: dev
Subject: Re: timeout in shuffle problem
Are you running on YARN? Another possibility here is that your shuffle managers 
are facing GC pain and becoming less responsive, thus missing timeouts. Can you 
try increasing the memory on the node managers and see if that helps?

On Sun, Jan 24, 2016 at 4:58 PM Ted Yu  wrote:
Cycling past bits:
http://search-hadoop.com/m/q3RTtU5CRU1KKVA42&subj=RE+shuffle+FetchFailedException+in+spark+on+YARN+job

On Sun, Jan 24, 2016 at 5:52 AM, wangzhenhua (G)  wrote:
Hi,

I have a timeout problem in shuffle; it happened after shuffle write and at
the start of shuffle read. Logs on the driver and executors are shown below.
The Spark version is 1.5. Looking forward to your replies. Thanks!

logs on driver only have warnings:
WARN TaskSetManager: Lost task 38.0 in stage 27.0 (TID 127459, linux-162): 
FetchFailed(BlockManagerId(66, 172.168.100.12, 23028), shuffleId=9, mapId=55, 
reduceId=38, message=
org.apache.spark.shuffle.FetchFailedException: 
java.util.concurrent.TimeoutException: Timeout waiting for task.
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:306)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:173)
at 
org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.Map

Re: spark job scheduling

2016-01-27 Thread Chayapan Khannabha
I think the smallest unit of work is a "Task", and an "Executor" is
responsible for getting the work done? I would like to understand more about
the scheduling system too. Scheduling strategies like FAIR or FIFO do have a
significant impact on Spark cluster architecture design decisions.

Best,

Chayapan (A)

On Thu, Jan 28, 2016 at 10:07 AM, Niranda Perera 
wrote:

> hi all,
>
> I have a few questions on spark job scheduling.
>
> 1. As I understand, the smallest unit of work an executor can perform. In
> the 'fair' scheduler mode, let's say  a job is submitted to the spark ctx
> which has a considerable amount of work to do in a task. While such a 'big'
> task is running, can we still submit another smaller job (from a separate
> thread) and get it done? or does that smaller job has to wait till the
> bigger task finishes and the resources are freed from the executor?
>
> 2. When a job is submitted without setting a scheduler pool, the default
> scheduler pool is assigned to it, which employs FIFO scheduling. but what
> happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
> without specifying a scheduler pool (which has FAIR scheduling)? would the
> jobs still run in FIFO mode with the default pool?
> essentially, for us to really set FAIR scheduling, do we have to assign a
> FAIR scheduler pool?
>
> best
>
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


Re: spark job scheduling

2016-01-27 Thread Niranda Perera
Sorry, I have made typos. Let me rephrase:

1. As I understand it, the smallest unit of work an executor can perform is a
'task'. In the 'FAIR' scheduler mode, let's say a job is submitted to the
Spark context which has a considerable amount of work to do in a single task.
While such a 'big' task is running, can we still submit another smaller job
(from a separate thread) and get it done? Or does that smaller job have to
wait till the bigger task finishes and the resources are freed from the
executor?
(Essentially, what I'm asking is: in the FAIR scheduler mode, jobs are
scheduled fairly, but at the task granularity they are still FIFO?)

2. When a job is submitted without setting a scheduler pool, the 'default'
scheduler pool is assigned to it, which employs FIFO scheduling. But what
happens when we have spark.scheduler.mode set to FAIR and I submit jobs
without specifying a scheduler pool (which has FAIR scheduling)? Would the
jobs still run in FIFO mode with the default pool?
Essentially, for us to really get FAIR scheduling, do we also have to assign
a FAIR scheduler pool to the job?

On Thu, Jan 28, 2016 at 8:47 AM, Chayapan Khannabha 
wrote:

> I think the smallest unit of work is a "Task", and an "Executor" is
> responsible for getting the work done? Would like to understand more about
> the scheduling system too. Scheduling strategy like FAIR or FIFO do have
> significant impact on a Spark cluster architecture design decision.
>
> Best,
>
> Chayapan (A)
>
> On Thu, Jan 28, 2016 at 10:07 AM, Niranda Perera  > wrote:
>
>> hi all,
>>
>> I have a few questions on spark job scheduling.
>>
>> 1. As I understand, the smallest unit of work an executor can perform. In
>> the 'fair' scheduler mode, let's say  a job is submitted to the spark ctx
>> which has a considerable amount of work to do in a task. While such a 'big'
>> task is running, can we still submit another smaller job (from a separate
>> thread) and get it done? or does that smaller job has to wait till the
>> bigger task finishes and the resources are freed from the executor?
>>
>> 2. When a job is submitted without setting a scheduler pool, the default
>> scheduler pool is assigned to it, which employs FIFO scheduling. but what
>> happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
>> without specifying a scheduler pool (which has FAIR scheduling)? would the
>> jobs still run in FIFO mode with the default pool?
>> essentially, for us to really set FAIR scheduling, do we have to assign a
>> FAIR scheduler pool?
>>
>> best
>>
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: spark job scheduling

2016-01-27 Thread Chayapan Khannabha
I would start at this wiki page
https://spark.apache.org/docs/1.2.0/job-scheduling.html

Although I'm sure this depends a lot on your cluster environment and the
deployed Spark version.

IMHO
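
For what it's worth, the two pieces usually involved look roughly like this
(a sketch against the standard SparkContext API; the pool name and the
allocation file path are placeholders, not something taken from the docs
above):

import org.apache.spark.{SparkConf, SparkContext}

// 1. Enable FAIR scheduling for the application.
val conf = new SparkConf()
  .setAppName("fair-scheduling-example")
  .set("spark.scheduler.mode", "FAIR")
  // Optional: a fairscheduler.xml that defines named pools and their weights.
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// 2. In the thread that submits the smaller jobs, bind that thread to a pool.
sc.setLocalProperty("spark.scheduler.pool", "my_pool")  // "my_pool" is a placeholder
// ... run actions here; jobs from this thread go to that pool ...
sc.setLocalProperty("spark.scheduler.pool", null)       // revert to the default pool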

On Thu, Jan 28, 2016 at 10:27 AM, Niranda Perera 
wrote:

> Sorry I have made typos. let me rephrase
>
> 1. As I understand, the smallest unit of work an executor can perform, is
> a 'task'. In the 'FAIR' scheduler mode, let's say a job is submitted to the
> spark ctx which has a considerable amount of work to do in a single task.
> While such a 'big' task is running, can we still submit another smaller job
> (from a separate thread) and get it done? or does that smaller job has to
> wait till the bigger task finishes and the resources are freed from the
> executor?
> (essentially, what I'm asking is, in the FAIR scheduler mode, jobs are
> scheduled fairly, but at the task granularity they are still FIFO?)
>
> 2. When a job is submitted without setting a scheduler pool, the 'default'
> scheduler pool is assigned to it, which employs FIFO scheduling. but what
> happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
> without specifying a scheduler pool (which has FAIR scheduling)? would the
> jobs still run in FIFO mode with the default pool?
> essentially, for us to really set FAIR scheduling, do we have to assign a
> FAIR scheduler pool also to the job?
>
> On Thu, Jan 28, 2016 at 8:47 AM, Chayapan Khannabha 
> wrote:
>
>> I think the smallest unit of work is a "Task", and an "Executor" is
>> responsible for getting the work done? Would like to understand more about
>> the scheduling system too. Scheduling strategy like FAIR or FIFO do have
>> significant impact on a Spark cluster architecture design decision.
>>
>> Best,
>>
>> Chayapan (A)
>>
>> On Thu, Jan 28, 2016 at 10:07 AM, Niranda Perera <
>> niranda.per...@gmail.com> wrote:
>>
>>> hi all,
>>>
>>> I have a few questions on spark job scheduling.
>>>
>>> 1. As I understand, the smallest unit of work an executor can perform.
>>> In the 'fair' scheduler mode, let's say  a job is submitted to the spark
>>> ctx which has a considerable amount of work to do in a task. While such a
>>> 'big' task is running, can we still submit another smaller job (from a
>>> separate thread) and get it done? or does that smaller job has to wait till
>>> the bigger task finishes and the resources are freed from the executor?
>>>
>>> 2. When a job is submitted without setting a scheduler pool, the default
>>> scheduler pool is assigned to it, which employs FIFO scheduling. but what
>>> happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
>>> without specifying a scheduler pool (which has FAIR scheduling)? would the
>>> jobs still run in FIFO mode with the default pool?
>>> essentially, for us to really set FAIR scheduling, do we have to assign
>>> a FAIR scheduler pool?
>>>
>>> best
>>>
>>> --
>>> Niranda
>>> @n1r44 
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>
>
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


Re: spark job scheduling

2016-01-27 Thread Jakob Odersky
Nitpick: the up-to-date version of said wiki page is
https://spark.apache.org/docs/1.6.0/job-scheduling.html (not sure how
much it changed though)

On Wed, Jan 27, 2016 at 7:50 PM, Chayapan Khannabha  wrote:
> I would start at this wiki page
> https://spark.apache.org/docs/1.2.0/job-scheduling.html
>
> Although I'm sure this depends a lot on your cluster environment and the
> deployed Spark version.
>
> IMHO
>
> On Thu, Jan 28, 2016 at 10:27 AM, Niranda Perera 
> wrote:
>>
>> Sorry I have made typos. let me rephrase
>>
>> 1. As I understand, the smallest unit of work an executor can perform, is
>> a 'task'. In the 'FAIR' scheduler mode, let's say a job is submitted to the
>> spark ctx which has a considerable amount of work to do in a single task.
>> While such a 'big' task is running, can we still submit another smaller job
>> (from a separate thread) and get it done? or does that smaller job has to
>> wait till the bigger task finishes and the resources are freed from the
>> executor?
>> (essentially, what I'm asking is, in the FAIR scheduler mode, jobs are
>> scheduled fairly, but at the task granularity they are still FIFO?)
>>
>> 2. When a job is submitted without setting a scheduler pool, the 'default'
>> scheduler pool is assigned to it, which employs FIFO scheduling. but what
>> happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
>> without specifying a scheduler pool (which has FAIR scheduling)? would the
>> jobs still run in FIFO mode with the default pool?
>> essentially, for us to really set FAIR scheduling, do we have to assign a
>> FAIR scheduler pool also to the job?
>>
>> On Thu, Jan 28, 2016 at 8:47 AM, Chayapan Khannabha 
>> wrote:
>>>
>>> I think the smallest unit of work is a "Task", and an "Executor" is
>>> responsible for getting the work done? Would like to understand more about
>>> the scheduling system too. Scheduling strategy like FAIR or FIFO do have
>>> significant impact on a Spark cluster architecture design decision.
>>>
>>> Best,
>>>
>>> Chayapan (A)
>>>
>>> On Thu, Jan 28, 2016 at 10:07 AM, Niranda Perera
>>>  wrote:

 hi all,

 I have a few questions on spark job scheduling.

 1. As I understand, the smallest unit of work an executor can perform.
 In the 'fair' scheduler mode, let's say  a job is submitted to the spark 
 ctx
 which has a considerable amount of work to do in a task. While such a 'big'
 task is running, can we still submit another smaller job (from a separate
 thread) and get it done? or does that smaller job has to wait till the
 bigger task finishes and the resources are freed from the executor?

 2. When a job is submitted without setting a scheduler pool, the default
 scheduler pool is assigned to it, which employs FIFO scheduling. but what
 happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
 without specifying a scheduler pool (which has FAIR scheduling)? would the
 jobs still run in FIFO mode with the default pool?
 essentially, for us to really set FAIR scheduling, do we have to assign
 a FAIR scheduler pool?

 best

 --
 Niranda
 @n1r44
 +94-71-554-8430
 https://pythagoreanscript.wordpress.com/
>>>
>>>
>>
>>
>>
>> --
>> Niranda
>> @n1r44
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org