Re: Resultant RDD after a group by query always returns 200 partitions

2016-08-16 Thread Rishi Mishra
That's the default number of shuffle partitions in Spark. You can tune it using
spark.sql.shuffle.partitions.
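
For example (a minimal sketch, assuming the Spark 1.x SQLContext from your
snippet; the property can also go into spark-defaults.conf or be passed via
--conf on submit):

    // lower the shuffle partition count before running the group-by query
    sqlCtx.setConf("spark.sql.shuffle.partitions", "8")

    val dataFrame = sqlCtx.sql("select count(*) from table1 group by a, b")
    println(dataFrame.rdd.partitions.length)  // now reports 8 instead of 200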

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Tue, Aug 16, 2016 at 11:31 AM, Niranda Perera 
wrote:

> Hi,
>
> I ran the following simple code in a pseudo distributed spark cluster.
>
> object ScalaDummy {
>   def main(args: Array[String]) {
>     val appName = "testApp " + Calendar.getInstance().getTime
>     val conf: SparkConf = new SparkConf()
>       .setMaster("spark://nira-wso2:7077")
>       .setAppName(appName)
>
>     val propsFile: String =
>       Thread.currentThread.getContextClassLoader.getResource("spark-defaults.conf").getPath
>     val properties: Map[String, String] = Utils.getPropertiesFromFile(propsFile)
>     conf.setAll(properties)
>
>     val sqlCtx: SQLContext = new SQLContext(new JavaSparkContext(conf))
>
>     sqlCtx.createDataFrame(Seq(
>       (83, 0, 38),
>       (26, 0, 79),
>       (43, 81, 24))).toDF("a", "b", "c").registerTempTable("table1")
>
>     val dataFrame: DataFrame = sqlCtx.sql("select count(*) from table1 group by a, b")
>     val partitionCount = dataFrame.rdd.partitions.length
>     dataFrame.show
>
>     println("Press any key to exit!")
>     Console.readLine()
>   }
> }
>
> I observed that the resultant dataFrame encapsulated a MapPartitionsRDD
> with 200 partitions. Please see the attached screenshots.
>
> I'm wondering where these 200 partitions came from. Is this the default
> number of partitions for MapPartitionsRDD, or is there any way to control
> this?
> The result only contains 3 rows, so IMO having 200 partitions here is not
> very efficient, because at least 197 tasks are run on empty partitions,
> from what I could see.
>
> Am I doing something wrong here or is this the expected behavior after a
> 'group by' query?
>
> Look forward to hearing from you!
>
> Best
> --
> Niranda Perera
> @n1r44 
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


Re: Resultant RDD after a group by query always returns 200 partitions

2016-08-16 Thread colzer
In addition, you can also set spark.sql.adaptive.enabled=true (the default is
false) to enable adaptive query execution.
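
For example (a minimal sketch assuming a Spark 2.x SparkSession; the exact
set of adaptive-execution settings varies between releases, so check the
docs for your version):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("adaptive-shuffle-example")
      .master("local[*]")
      .config("spark.sql.adaptive.enabled", "true")   // adjust post-shuffle partitions at runtime
      .config("spark.sql.shuffle.partitions", "200")  // initial count; small partitions get coalesced
      .getOrCreate()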



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Resultant-RDD-after-a-group-by-query-always-returns-200-partitions-tp18647p18650.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Executors go OOM when using JDBC relation provider

2016-08-16 Thread Niranda Perera
Hi,

I have been using a standalone Spark cluster (v1.4.x) with the following
configuration: 2 nodes, each running a worker with 1 core and 4g of memory.
So my app had 2 executors, with 2 cores and 8g of memory in total.

I have a table in a MySQL database which has around 10 million rows. It has
around 10 columns of integer, string and date types (say table1 with
columns c1 to c10).

I ran the following queries:


   1. select count(*) from table1 - completes within seconds
   2. select c1, count(*) from table1 group by c1 - completes within seconds,
      but takes longer than the 1st query
   3. select c1, c2, count(*) from table1 group by c1, c2 - same behavior
      as Q2
   4. select c1, c2, c3, c4, count(*) from table1 group by c1, c2, c3, c4 -
      took a few minutes to finish
   5. select c1, c2, c3, c4, count(*) from table1 group by c1, c2, c3, c4, c5 -
      executor goes OOM within a few minutes! (this has one more column in the
      group by clause)

It seemed like the query time grows exponentially as more group-by columns
are added. Is this the expected behavior?
I was monitoring the MySQL process list, and observed that the data was
transmitted to the executors within a few seconds without an issue.
NOTE: I am not using any partition columns here, so AFAIU there is
essentially only a single partition for the JDBC RDD.
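
For reference, a partitioned read would look roughly like the following (a
minimal sketch against the Spark 1.4 DataFrameReader.jdbc API; the url,
credentials, bounds and the choice of c1 as the partition column are
placeholders, not values from my setup):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "dbpass")

    // sqlCtx is the app's SQLContext; the scan is split into 8 parallel
    // reads over ranges of c1 between the given bounds
    val table1 = sqlCtx.read.jdbc(
      "jdbc:mysql://db-host:3306/mydb",  // JDBC url
      "table1",                          // table to read
      "c1",                              // partition column
      0L,                                // lower bound of c1
      1000000L,                          // upper bound of c1
      8,                                 // number of partitions
      props)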

I ran the same query (query 5) in the MySQL console and got a result
within 3 minutes! So I'm wondering what the issue could be here. This OOM
exception is actually a blocker!

Is there any other tuning I should do? It certainly worries me to see
that MySQL gave a significantly faster result than Spark here!

Look forward to hearing from you!

Best

-- 
Niranda Perera
@n1r44 
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/


GraphFrames 0.2.0 released

2016-08-16 Thread Tim Hunter
Hello all,
I have released version 0.2.0 of the GraphFrames package. Apart from a few
bug fixes, it is the first release published for Spark 2.0 and for both
Scala 2.10 and 2.11. Please let us know if you have any comments or questions.

It is available as a Spark package:
https://spark-packages.org/package/graphframes/graphframes

The source code is available as always at
https://github.com/graphframes/graphframes


What is GraphFrames?

GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
algorithms available in GraphX, users can write highly expressive queries
by leveraging the DataFrame API, combined with a new API for motif finding.
The user also benefits from DataFrame performance optimizations within the
Spark SQL engine.
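
As a quick illustration of the motif-finding API (a minimal sketch; the toy
data is made up, and the spark-packages coordinate in the comment is an
assumption, so check the package page for the exact artifact):

    import org.apache.spark.sql.SparkSession
    import org.graphframes.GraphFrame

    // assumes the package is on the classpath, e.g. via
    // spark-shell --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
    val spark = SparkSession.builder().appName("graphframes-motif").getOrCreate()
    import spark.implicits._

    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
    val edges = Seq(("a", "b", "follows"), ("b", "a", "follows"), ("b", "c", "follows"))
      .toDF("src", "dst", "relationship")

    val g = GraphFrame(vertices, edges)

    // find pairs of vertices that follow each other
    g.find("(x)-[e1]->(y); (y)-[e2]->(x)").show()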

Cheers

Tim


Re: GraphFrames 0.2.0 released

2016-08-16 Thread Shagun Sodhani
Hi Tim.

Could you share a link to the release docs as well?

Thanks,
Shagun
https://twitter.com/shagunsodhani

On Tue, Aug 16, 2016 at 10:02 PM, Tim Hunter 
wrote:

> Hello all,
> I have released version 0.2.0 of the GraphFrames package. Apart from a few
> bug fixes, it is the first release published for Spark 2.0 and for both
> Scala 2.10 and 2.11. Please let us know if you have any comments or questions.
>
> It is available as a Spark package:
> https://spark-packages.org/package/graphframes/graphframes
>
> The source code is available as always at
> https://github.com/graphframes/graphframes
>
>
> What is GraphFrames?
>
> GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
> algorithms available in GraphX, users can write highly expressive queries
> by leveraging the DataFrame API, combined with a new API for motif finding.
> The user also benefits from DataFrame performance optimizations within the
> Spark SQL engine.
>
> Cheers
>
> Tim
>
>
>
>


Re: Welcoming Felix Cheung as a committer

2016-08-16 Thread Joseph Bradley
Welcome Felix!

On Mon, Aug 15, 2016 at 6:16 AM, mayur bhole 
wrote:

> Congrats Felix!
>
> On Mon, Aug 15, 2016 at 2:57 PM, Paul Roy  wrote:
>
>> Congrats Felix
>>
>> Paul Roy.
>>
>> On Mon, Aug 8, 2016 at 9:15 PM, Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The PMC recently voted to add Felix Cheung as a committer. Felix has
>>> been a major contributor to SparkR and we're excited to have him join
>>> officially. Congrats and welcome, Felix!
>>>
>>> Matei
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> "Change is slow and gradual. It requires hardwork, a bit of
>> luck, a fair amount of self-sacrifice and a lot of patience."
>>
>> Roy.
>>
>
>


Re: GraphFrames 0.2.0 released

2016-08-16 Thread Jacek Laskowski
Hi Tim,

AWESOME. Thanks a lot for releasing it. That makes me even more eager
to see it in Spark's codebase (and replacing the current RDD-based
API)!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 16, 2016 at 9:32 AM, Tim Hunter  wrote:
> Hello all,
> I have released version 0.2.0 of the GraphFrames package. Apart from a few
> bug fixes, it is the first release published for Spark 2.0 and for both
> Scala 2.10 and 2.11. Please let us know if you have any comments or questions.
>
> It is available as a Spark package:
> https://spark-packages.org/package/graphframes/graphframes
>
> The source code is available as always at
> https://github.com/graphframes/graphframes
>
>
> What is GraphFrames?
>
> GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
> algorithms available in GraphX, users can write highly expressive queries by
> leveraging the DataFrame API, combined with a new API for motif finding. The
> user also benefits from DataFrame performance optimizations within the Spark
> SQL engine.
>
> Cheers
>
> Tim
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[master] ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)

2016-08-16 Thread Jacek Laskowski
Hi,

I'm working with today's build and am facing the issue:

scala> Seq(A(4)).toDS
16/08/16 19:26:26 ERROR RetryingHMSHandler:
AlreadyExistsException(message:Database default already exists)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
...

res1: org.apache.spark.sql.Dataset[A] = [id: int]

scala> spark.version
res2: String = 2.1.0-SNAPSHOT

See the complete stack trace at
https://gist.github.com/jaceklaskowski/a969fdd5c2c9cdb736bf647b01257a3e.

I'm quite positive that it didn't happen a day or two ago.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [master] ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)

2016-08-16 Thread Yin Huai
Hi Jacek,

We try to create the default database if it does not exist. Hive actually
relies on that AlreadyExistsException to determine whether a db already
exists, and ignores the error, in order to implement the logic of "CREATE
DATABASE IF NOT EXISTS". So that message does not mean anything bad
happened. I think we can avoid this error log by changing
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L84-L91.
Basically, we would check whether the default db already exists and only
call create database if it does not. Do you want to try it?
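
Roughly, the idea would be something like this (just a sketch, not a tested
change; the identifier names are my recollection of that part of the file
and may not match exactly):

    // guard the metastore call so "default" is created only when it is
    // actually missing, instead of always calling createDatabase and
    // relying on the AlreadyExistsException being swallowed
    if (!externalCatalog.databaseExists("default")) {
      externalCatalog.createDatabase(defaultDbDefinition, ignoreIfExists = true)
    }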

Thanks,

Yin

On Tue, Aug 16, 2016 at 7:33 PM, Jacek Laskowski  wrote:

> Hi,
>
> I'm working with today's build and am facing the issue:
>
> scala> Seq(A(4)).toDS
> 16/08/16 19:26:26 ERROR RetryingHMSHandler:
> AlreadyExistsException(message:Database default already exists)
> at org.apache.hadoop.hive.metastore.HiveMetaStore$
> HMSHandler.create_database(HiveMetaStore.java:891)
> ...
>
> res1: org.apache.spark.sql.Dataset[A] = [id: int]
>
> scala> spark.version
> res2: String = 2.1.0-SNAPSHOT
>
> See the complete stack trace at
> https://gist.github.com/jaceklaskowski/a969fdd5c2c9cdb736bf647b01257a3e.
>
> I'm quite positive that it didn't happen a day or two ago.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [master] ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)

2016-08-16 Thread Jacek Laskowski
On Tue, Aug 16, 2016 at 10:51 PM, Yin Huai  wrote:

> Do you want to try it?

Yes, indeed! I'd be more than happy. Guide me if you don't mind. Thanks.

Should I create a JIRA for this?

Jacek

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org