GroupBy on DataFrame taking too much time

2016-01-10 Thread Gaini Rajeshwar
Hi All, I have a table named *customer* (customer_id, event, country, ...) in a PostgreSQL database. This table has more than 100 million rows. I want to know the number of events from each country. To achieve that I am doing a groupBy using Spark, as follows. val dataframe1 =
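A minimal Scala sketch of the step being described (Spark 1.x; the JDBC URL is a placeholder, not from the thread, and sc is the usual spark-shell SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Load the customer table over JDBC (connection details are placeholders).
    val customers = sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "customer")
      .load()

    // Group by country; only the per-country counts come back to the driver.
    val countsByCountry = customers.groupBy("country").count()
    countsByCountry.show()

With 100M+ rows the JDBC read itself is often the bottleneck; the partitionColumn/lowerBound/upperBound/numPartitions options let Spark parallelize the scan across executors.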

Getting an error while submitting spark jar

2016-01-10 Thread Sree Eedupuganti
The way I am submitting the jar:

    hadoop@localhost:/usr/local/hadoop/spark$ ./bin/spark-submit \
      --class mllib.perf.TesRunner \
      --master spark://localhost:7077 \
      --executor-memory 2G \
      --total-executor-cores 100 \
      /usr/local/hadoop/spark/lib/mllib-perf-tests-assembly.jar \
      1000

Spark 1.6 udf/udaf alternatives in dataset?

2016-01-10 Thread Muthu Jayakumar
Hello there, While looking at the features of Dataset, it seems to provide an alternative to udf and udaf. Any documentation or sample code snippets would be helpful for rewriting existing UDFs as Dataset mapping steps. Also, while extracting a value into a Dataset using as[U]
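A hedged Scala sketch of the contrast being asked about (Spark 1.6; the Person case class and the df DataFrame with matching columns are hypothetical):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.udf

    case class Person(name: String, age: Int)

    // DataFrame route: wrap the function in a udf and apply it to a column.
    val addOne = udf((age: Int) => age + 1)
    val viaUdf = df.withColumn("agePlusOne", addOne(df("age")))

    // Dataset route: convert with as[U], then use a plain typed map instead.
    val ds = df.as[Person]
    val viaMap = ds.map(p => p.copy(age = p.age + 1))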

pre-install 3-party Python package on spark cluster

2016-01-10 Thread taotao.li
I have a Spark cluster, from machine-1 to machine-100, and machine-1 acts as the master. Then one day my program needs to use a third-party Python package which is not installed on every machine of the cluster. So here comes my problem: to make that third-party Python package usable on the master and slaves,

Re: [discuss] dropping Python 2.6 support

2016-01-10 Thread Dmitry Kniazev
Sasha, it is more complicated than that: many RHEL 6 OS utilities rely on Python 2.6. Upgrading it to 2.7 breaks the system. For large enterprises migrating to another server OS means re-certifying (re-testing) hundreds of applications, so yes, they do prefer to stay where they are until the

Re: Create a n x n graph given only the vertices no

2016-01-10 Thread praveen S
Is it possible in GraphX to create/generate an n x n graph given only the vertices? On 8 Jan 2016 23:57, "praveen S" wrote: > Is it possible in graphx to create/generate a graph n x n given n > vertices? >

Re: Create a n x n graph given only the vertices no

2016-01-10 Thread Prem Sure
You mean without edge data? I don't think so. The other way is possible, by calling fromEdges on Graph (this would assign the vertices mentioned by edges a default value). Please share your need/requirement in detail if possible. On Sun, Jan 10, 2016 at 10:19 PM, praveen S
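A sketch of the edge-first approach Prem describes, materializing all n x n directed edges (minus self-loops) so that fromEdges can build the graph; n and the attribute values are arbitrary, and sc is the spark-shell SparkContext:

    import org.apache.spark.graphx.{Edge, Graph}

    val n = 5L
    // Materialize every directed edge i -> j, skipping self-loops.
    val edges = sc.parallelize(
      for (i <- 0L until n; j <- 0L until n if i != j) yield Edge(i, j, 1))
    // fromEdges gives every vertex referenced by an edge the default value 0.
    val graph = Graph.fromEdges(edges, 0)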

Negative Number of Workers used memory in Spark UI

2016-01-10 Thread Ricky
In the Spark UI, the Workers' used memory shows a negative number, as in the attached picture. Spark version: 1.4.0. How can I solve this problem? Appreciate your help! (screenshot attachment: Workers used-memory column)

parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-10 Thread Gavin Yue
Hey, I am trying to convert a bunch of JSON files into Parquet, which would output over 7000 Parquet files. But there are too many files, so I want to repartition based on id down to 3000. But I got a GC error like this one:
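A sketch of the intended pipeline (paths are placeholders; the summary-metadata setting is the Hadoop property named in the subject line):

    // Disable the per-directory _metadata/_common_metadata summary files.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    val df = sqlContext.read.json("/path/to/json")
    // Spark 1.6: repartition by the id column down to 3000 partitions.
    df.repartition(3000, df("id")).write.parquet("/path/to/parquet")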

Too many tasks killed the scheduler

2016-01-10 Thread Gavin Yue
Hey, I have 10 days of data; each day has a Parquet directory with over 7000 partitions. So when I union the 10 days and do a count, it submits over 70K tasks. Then the job failed silently with one container exiting with code 1. The union with 5 or 6 days of data is fine. In the spark-shell, it just
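One commonly suggested variant, sketched here as an assumption rather than a known fix: read all ten directories in a single load instead of unioning ten DataFrames, and shrink the partition count before counting (paths are placeholders):

    val days = (1 to 10).map(d => f"/data/2016-01-$d%02d")
    val df = sqlContext.read.parquet(days: _*)
    // coalesce avoids a shuffle and caps the number of count tasks.
    df.coalesce(3000).count()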

Re: pyspark: calculating row deltas

2016-01-10 Thread Femi Anthony
Can you clarify what you mean with an actual example? For example, if your data frame looks like this:

    ID Year Value
    1  2012 100
    2  2013 101
    3  2014 102

What's your desired output? Femi On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter wrote: > > Hi, > >

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Sure, for a dataframe that looks like this

    ID Year Value
    1  2012 100
    1  2013 102
    1  2014 106
    2  2012 110
    2  2013 118
    2  2014 128

I'd like to get back

    ID Year Value
    1  2013   2
    1  2014   4
    2  2013   8
    2  2014  10

i.e. the Value for an ID,Year combination is the Value for the

Re: pyspark: calculating row deltas

2016-01-10 Thread Blaž Šnuderl
This can be done using spark.sql and window functions. Take a look at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter wrote: > > Sure, for a dataframe that looks like this > > ID Year
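A sketch of the lag-based version (shown in Scala; the PySpark Window API is analogous, and df is the dataframe from the thread above):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    // Previous Value within each ID, ordered by Year; delta is the difference.
    val w = Window.partitionBy("ID").orderBy("Year")
    val deltas = df
      .withColumn("delta", df("Value") - lag("Value", 1).over(w))
      .filter("delta is not null")

The first Year of each ID has no predecessor, so its delta is null; the filter drops those rows, matching the desired output above.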

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks cheers On 10 January 2016 at 22:35, Blaž Šnuderl wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > > On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter

Re: adding jars - hive on spark cdh 5.4.3

2016-01-10 Thread sandeep vura
Upgrade to CDH 5.5 for spark. It should work On Sat, Jan 9, 2016 at 12:17 AM, Ophir Etzion wrote: > It didn't work. assuming I did the right thing. > in the properties you could see > >

Re: Best IDE Configuration

2016-01-10 Thread Ted Yu
For python, there is https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8 which was mentioned in http://search-hadoop.com/m/q3RTt2Eu941D9H9t1 FYI On Sat, Jan 9, 2016 at 11:24 AM, Ted Yu wrote: > Please take a look at: > https://cwiki.apache.org/confluence/display/SPARK/