GroupBy on DataFrame taking too much time

2016-01-10 Thread Gaini Rajeshwar
Hi All,

I have a table named customer (customer_id, event, country, ...) in a
PostgreSQL database. The table has more than 100 million rows.

I want to know the number of events from each country. To achieve that I am
doing a groupBy using Spark as follows.

val dataframe1 = sqlContext.load("jdbc", Map("url" ->
  "jdbc:postgresql://localhost/customerlogs?user=postgres&password=postgres",
  "dbtable" -> "customer"))

dataframe1.groupBy("country").count().show()

The above code seems to fetch the complete customer table before doing the
groupBy. Because of that it is throwing the following error:

16/01/11 12:49:04 WARN HeartbeatReceiver: Removing executor 0 with no
recent heartbeats: 170758 ms exceeds timeout 12 ms
16/01/11 12:49:04 ERROR TaskSchedulerImpl: Lost executor 0 on 10.2.12.59
: Executor heartbeat timed out after 170758 ms
16/01/11 12:49:04 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
10.2.12.59): ExecutorLostFailure (executor 0 exited caused by one of the
running tasks) Reason: Executor heartbeat timed out after 170758 ms

I am using Spark 1.6.0.

Is there any way I can solve this?

Thanks,
Rajeshwar Gaini.
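
For reference, one common way around this is to partition the JDBC read so the
table is not pulled through a single connection. A minimal Spark 1.6 sketch,
where the partition column, bounds, and partition count are assumptions to
adjust:

import java.util.Properties

// Connection properties mirroring the URL above; the values are placeholders.
val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "postgres")

// Partition the scan on a numeric column so the 100M+ rows are read by many
// tasks instead of being fetched through one JDBC connection.
val dataframe1 = sqlContext.read.jdbc(
  "jdbc:postgresql://localhost/customerlogs",
  "customer",
  "customer_id",   // partition column, assumed numeric
  1L,              // lowerBound (placeholder)
  100000000L,      // upperBound (placeholder)
  100,             // numPartitions
  props)

dataframe1.groupBy("country").count().show()

Alternatively, the aggregation can be pushed down to PostgreSQL by passing a
subquery as the table, e.g. "dbtable" -> "(select country, count(*) as cnt
from customer group by country) as t".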


Getting an error while submitting spark jar

2016-01-10 Thread Sree Eedupuganti
This is how I am submitting the jar:

hadoop@localhost:/usr/local/hadoop/spark$ ./bin/spark-submit \
>   --class mllib.perf.TesRunner \
>   --master spark://localhost:7077 \
>   --executor-memory 2G \
>   --total-executor-cores 100 \
>   /usr/local/hadoop/spark/lib/mllib-perf-tests-assembly.jar \
>   1000

And here is my error:

Spark assembly has been built with Hive, including
Datanucleus jars on classpath
java.lang.ClassNotFoundException: mllib.perf.TesRunner
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
hadoop@localhost:/usr/local/hadoop/spark$

Thanks in Advance

-- 
Best Regards,
Sreeharsha Eedupuganti
Data Engineer
innData Analytics Private Limited


Spark 1.6 udf/udaf alternatives in dataset?

2016-01-10 Thread Muthu Jayakumar
Hello there,

While looking at the features of Dataset, it seems to provide an alternative
to UDFs and UDAFs. Any documentation or sample code snippet would be helpful
for rewriting existing UDFs as a Dataset mapping step.
Also, while extracting a value into a Dataset using the as[U] method, how
could I specify a custom encoder/translation to a case class (where I don't
have the same column-name mapping or the same data-type mapping)?

Please advise,
Muthu
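
For reference, a rough Spark 1.6 sketch of the two pieces: a typed map in place
of a UDF, and renaming/casting columns before as[U] when the schema does not
line up with the case class. The case class, file name, and column names below
are made up for illustration:

import sqlContext.implicits._

// Hypothetical case class; by default as[U] matches on column names.
case class Person(name: String, age: Int)

// Placeholder input; any DataFrame works here.
val df = sqlContext.read.json("people.json")

// When column names or types don't line up with the case class,
// alias and cast in a select before calling as[U].
val people = df.select($"full_name".as("name"), $"age".cast("int").as("age")).as[Person]

// A plain Scala function over the case class plays the role of a UDF.
val upperNames = people.map(p => p.name.toUpperCase)
upperNames.show()

For the UDAF side, the typed aggregation API
(org.apache.spark.sql.expressions.Aggregator) is the usual pointer, but it was
still experimental around 1.6, so it is worth checking the docs for the exact
version in use.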


pre-install 3-party Python package on spark cluster

2016-01-10 Thread taotao.li
I have a Spark cluster, from machine-1 to machine-100, and machine-1 acts as
the master.

Then one day my program needs to use a third-party Python package which is not
installed on every machine of the cluster.

So here comes my problem: to make that third-party Python package usable on
the master and slaves, should I manually ssh to every machine and use pip to
install the package?

I believe there should be some deploy scripts or other tooling to make this
graceful, but I can't find anything after googling.







Re: [discuss] dropping Python 2.6 support

2016-01-10 Thread Dmitry Kniazev
Sasha, it is more complicated than that: many RHEL 6 OS utilities rely on
Python 2.6, and upgrading it to 2.7 breaks the system. For large enterprises,
migrating to another server OS means re-certifying (re-testing) hundreds of
applications, so yes, they do prefer to stay where they are until the benefits
of migrating outweigh the overhead. Long story short: you cannot simply upgrade
the built-in Python 2.6 in RHEL 6, and it will take years for enterprises to
migrate to RHEL 7.

Having said that, I don't think that it is a problem though, because Python 2.6 
and Python 2.7 can easily co-exist in the same environment. For example, we use 
virtualenv to run Spark with Python 2.7 and do not touch system Python 2.6.

Thank you,
Dmitry

09.01.2016, 06:36, "Sasha Kacanski" :
> +1
> Companies that use the stock Python (2.6) in Red Hat will need to upgrade or install
> a fresh version, which is a total of 3.5 minutes, so no issues ...
>
> On Tue, Jan 5, 2016 at 2:17 AM, Reynold Xin  wrote:
>> Does anybody here care about us dropping support for Python 2.6 in Spark 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json 
>> parsing) when compared with Python 2.7. Some libraries that Spark depends on 
>> stopped supporting 2.6. We can still convince the library maintainers to 
>> support 2.6, but it will be extra work. I'm curious if anybody still uses 
>> Python 2.6 to run Spark.
>>
>> Thanks.
>
> --
> Aleksandar Kacanski




Re: Create a n x n graph given only the vertices no

2016-01-10 Thread praveen S
Is it possible in GraphX to create/generate an n x n graph given only the
vertices?
On 8 Jan 2016 23:57, "praveen S"  wrote:

> Is it possible in graphx to create/generate a graph n x n given n
> vertices?
>


Re: Create a n x n graph given only the vertices no

2016-01-10 Thread Prem Sure
You mean without edge data? I don't think so. The other way is possible, by
calling fromEdges on Graph (this would assign vertices mentioned by edges a
default value). Please share your need/requirement in detail if possible.
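
For the specific case of a complete (fully connected) graph over n known
vertices, one option is to generate every ordered pair of vertex ids as an edge
and build the graph from both RDDs. A rough sketch, where the vertex attributes
and the edge value 1 are placeholders and the cartesian product limits this to
modest n:

import org.apache.spark.graphx.{Edge, Graph}

val n = 5L

// Vertices with a placeholder String attribute.
val vertices = sc.parallelize(0L until n).map(id => (id, s"v$id"))
val ids = vertices.map(_._1)

// Every ordered pair of distinct ids becomes an edge.
val edges = ids.cartesian(ids)
  .filter { case (src, dst) => src != dst }
  .map { case (src, dst) => Edge(src, dst, 1) }

val complete = Graph(vertices, edges)
println(complete.numEdges)   // n * (n - 1) directed edges

Graph.fromEdges(edges, defaultValue) would also work here if only the generated
edges are kept, since it creates the vertices itself with a default attribute.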



On Sun, Jan 10, 2016 at 10:19 PM, praveen S  wrote:

> Is it possible in graphx to create/generate graph of n x n given only the
> vertices.
> On 8 Jan 2016 23:57, "praveen S"  wrote:
>
>> Is it possible in graphx to create/generate a graph n x n given n
>> vertices?
>>
>


Negative Number of Workers used memory in Spark UI

2016-01-10 Thread Ricky
In the Spark UI, the workers' used memory shows a negative number, as in the
following picture:

Spark version: 1.4.0
How can I solve this problem? I appreciate your help!

(Attachment 3526FD5F@8B5ABE15.9A0C9356.png: screenshot of the Spark UI showing the negative used-memory value)


parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-10 Thread Gavin Yue
Hey,

I am trying to convert a bunch of JSON files into parquet, which would
output over 7000 parquet files. But that is too many files, so I want to
repartition based on id down to 3000.

But I got a GC error like the one in this thread:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call
write.parquet, I can still see the 3000 jobs run after the parquet write,
and they fail due to GC.

Basically, repartition has never succeeded for me. Are there any other
settings which could be optimized?

Thanks,
Gavin
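
For what it is worth, a minimal sketch of one variant of this, assuming the
goal is simply fewer output files: set the summary-metadata flag on the Hadoop
configuration (that is where the parquet writer reads it from) and use
coalesce, which avoids the full shuffle that repartition(3000) performs. Paths
are placeholders:

// Disable the _metadata / _common_metadata summary files.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Placeholder paths.
val df = sqlContext.read.json("hdfs:///input/json/")

// coalesce only merges existing partitions, so it skips the shuffle that
// repartition(3000) would run. If the rows really must be clustered by id,
// a shuffle is unavoidable and executor memory / partition counts need tuning.
df.coalesce(3000).write.parquet("hdfs:///output/parquet/")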


Too many tasks killed the scheduler

2016-01-10 Thread Gavin Yue
Hey,

I have 10 days of data; each day has a parquet directory with over 7000
partitions.
So when I union the 10 days and do a count, it submits over 70K tasks.

Then the job failed silently with one container exiting with code 1. The
union of 5 or 6 days of data is fine.
In the spark-shell, it just hangs, showing: Yarn scheduler submit 7+
tasks.

I am running Spark 1.6 over Hadoop 2.7. Is there any setting I could
change to make this work?

Thanks,
Gavin
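
One pattern that keeps the task count down in this kind of job is to shrink
each day's partition count before the union. Whether it helps depends on why
the container died; a rough sketch with placeholder paths and partition counts:

// Placeholder day directories, one path per day.
val days = Seq("hdfs:///data/day01", "hdfs:///data/day02" /* ... */)

// Cut each day's ~7000 partitions down before combining, so the final count
// runs a few thousand tasks rather than ~70000.
val perDay = days.map(path => sqlContext.read.parquet(path).coalesce(300))
val all = perDay.reduce(_ unionAll _)

all.count()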


Re: pyspark: calculating row deltas

2016-01-10 Thread Femi Anthony
Can you clarify what you mean with an actual example ?

For example, if your data frame looks like this:

ID  Year   Value
1   2012   100
2   2013   101
3   2014   102

What's your desired output ?

Femi


On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter  wrote:

>
> Hi,
>
> I have a DataFrame with the columns
>
>  ID,Year,Value
>
> I'd like to create a new Column that is Value2-Value1 where the
> corresponding Year2=Year-1
>
> At the moment I am creating  a new DataFrame with renamed columns and doing
>
>DF.join(DF2, . . . .)
>
>  This looks cumbersome to me, is there a better way ?
>
> thanks
>
>
> --
> Franc
>



-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Sure, for a dataframe that looks like this

ID Year Value
 1 2012   100
 1 2013   102
 1 2014   106
 2 2012   110
 2 2013   118
 2 2014   128

I'd like to get back

ID Year Value
 1 2013 2
 1 2014 4
 2 2013 8
 2 2014 10

i.e. the Value for an (ID, Year) combination is the Value for (ID, Year) minus
the Value for (ID, Year-1).

thanks






On 10 January 2016 at 20:51, Femi Anthony  wrote:

> Can you clarify what you mean with an actual example ?
>
> For example, if your data frame looks like this:
>
> ID  Year   Value
> 1   2012   100
> 2   2013   101
> 3   2014   102
>
> What's your desired output ?
>
> Femi
>
>
> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter 
> wrote:
>
>>
>> Hi,
>>
>> I have a DataFrame with the columns
>>
>>  ID,Year,Value
>>
>> I'd like to create a new Column that is Value2-Value1 where the
>> corresponding Year2=Year-1
>>
>> At the moment I am creating  a new DataFrame with renamed columns and
>> doing
>>
>>DF.join(DF2, . . . .)
>>
>>  This looks cumbersome to me, is there a better way ?
>>
>> thanks
>>
>>
>> --
>> Franc
>>
>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>



-- 
Franc


Re: pyspark: calculating row deltas

2016-01-10 Thread Blaž Šnuderl
This can be done using spark.sql and window functions. Take a look at
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
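
For reference, a minimal window-function sketch in Scala following the example
above (the pyspark API mirrors it; the column names and placeholder source are
assumptions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Placeholder source with the ID, Year, Value columns from the example.
// Note: in Spark 1.x, window functions generally require a HiveContext.
val df = sqlContext.read.parquet("values.parquet")

// Previous year's Value within each ID, ordered by Year.
val w = Window.partitionBy("ID").orderBy("Year")

val deltas = df
  .withColumn("delta", df("Value") - lag("Value", 1).over(w))
  .where("delta is not null")   // each ID's first year has no previous row
  .select("ID", "Year", "delta")

deltas.show()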

On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter 
wrote:

>
> Sure, for a dataframe that looks like this
>
> ID Year Value
>  1 2012   100
>  1 2013   102
>  1 2014   106
>  2 2012   110
>  2 2013   118
>  2 2014   128
>
> I'd like to get back
>
> ID Year Value
>  1 2013 2
>  1 2014 4
>  2 2013 8
>  2 2014 10
>
> i.e the Value for an ID,Year combination is the Value for the ID,Year
> minus the Value for the ID,Year-1
>
> thanks
>
>
>
>
>
>
> On 10 January 2016 at 20:51, Femi Anthony  wrote:
>
>> Can you clarify what you mean with an actual example ?
>>
>> For example, if your data frame looks like this:
>>
>> ID  Year   Value
>> 1   2012   100
>> 2   2013   101
>> 3   2014   102
>>
>> What's your desired output ?
>>
>> Femi
>>
>>
>> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> I have a DataFrame with the columns
>>>
>>>  ID,Year,Value
>>>
>>> I'd like to create a new Column that is Value2-Value1 where the
>>> corresponding Year2=Year-1
>>>
>>> At the moment I am creating  a new DataFrame with renamed columns and
>>> doing
>>>
>>>DF.join(DF2, . . . .)
>>>
>>>  This looks cumbersome to me, is there a better way ?
>>>
>>> thanks
>>>
>>>
>>> --
>>> Franc
>>>
>>
>>
>>
>> --
>> http://www.femibyte.com/twiki5/bin/view/Tech/
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>
>
>
>
> --
> Franc
>


Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks

cheers

On 10 January 2016 at 22:35, Blaž Šnuderl  wrote:

> This can be done using spark.sql and window functions. Take a look at
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>
> On Sun, Jan 10, 2016 at 11:07 AM, Franc Carter 
> wrote:
>
>>
>> Sure, for a dataframe that looks like this
>>
>> ID Year Value
>>  1 2012   100
>>  1 2013   102
>>  1 2014   106
>>  2 2012   110
>>  2 2013   118
>>  2 2014   128
>>
>> I'd like to get back
>>
>> ID Year Value
>>  1 2013 2
>>  1 2014 4
>>  2 2013 8
>>  2 2014 10
>>
>> i.e the Value for an ID,Year combination is the Value for the ID,Year
>> minus the Value for the ID,Year-1
>>
>> thanks
>>
>>
>>
>>
>>
>>
>> On 10 January 2016 at 20:51, Femi Anthony  wrote:
>>
>>> Can you clarify what you mean with an actual example ?
>>>
>>> For example, if your data frame looks like this:
>>>
>>> ID  Year   Value
>>> 1   2012   100
>>> 2   2013   101
>>> 3   2014   102
>>>
>>> What's your desired output ?
>>>
>>> Femi
>>>
>>>
>>> On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter 
>>> wrote:
>>>

 Hi,

 I have a DataFrame with the columns

  ID,Year,Value

 I'd like to create a new Column that is Value2-Value1 where the
 corresponding Year2=Year-1

 At the moment I am creating  a new DataFrame with renamed columns and
 doing

DF.join(DF2, . . . .)

 This looks cumbersome to me, is there a better way ?

 thanks


 --
 Franc

>>>
>>>
>>>
>>> --
>>> http://www.femibyte.com/twiki5/bin/view/Tech/
>>> http://www.nextmatrix.com
>>> "Great spirits have always encountered violent opposition from mediocre
>>> minds." - Albert Einstein.
>>>
>>
>>
>>
>> --
>> Franc
>>
>
>


-- 
Franc


Re: adding jars - hive on spark cdh 5.4.3

2016-01-10 Thread sandeep vura
Upgrade to CDH 5.5 for Spark. It should work.

On Sat, Jan 9, 2016 at 12:17 AM, Ophir Etzion  wrote:

> It didn't work. assuming I did the right thing.
> in the properties  you could see
>
> {"key":"hive.aux.jars.path","value":"file:///data/loko/foursquare.web-hiverc/current/hadoop-hive-serde.jar,file:///data/loko/foursquare.web-hiverc/current/hadoop-hive-udf.jar","isFinal":false,"resource":"programatically"}
> which includes the jar that has the class I need but I still get
>
> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>
>
>
> On Fri, Jan 8, 2016 at 12:24 PM, Edward Capriolo 
> wrote:
>
>> You can not 'add jar' input formats and serde's. They need to be part of
>> your auxlib.
>>
>> On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion 
>> wrote:
>>
>>> I tried now. still getting
>>>
>>> 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: 
>>> hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
>>>  org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>>> Serialization trace:
>>> inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
>>> aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
>>> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>>>
>>>
>>> HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.
>>>
>>>
>>> On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure  wrote:
>>>
 did you try -- jars property in spark submit? if your jar is of huge
 size, you can pre-load the jar on all executors in a common available
 directory to avoid network IO.

 On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion 
 wrote:

> I' trying to add jars before running a query using hive on spark on
> cdh 5.4.3.
> I've tried applying the patch in
> https://issues.apache.org/jira/browse/HIVE-12045 (manually as the
> patch is done on a different hive version) but still hasn't succeeded.
>
> did anyone manage to do ADD JAR successfully with CDH?
>
> Thanks,
> Ophir
>


>>>
>>
>


Re: Best IDE Configuration

2016-01-10 Thread Ted Yu
For python, there is https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8
which was mentioned in http://search-hadoop.com/m/q3RTt2Eu941D9H9t1

FYI

On Sat, Jan 9, 2016 at 11:24 AM, Ted Yu  wrote:

> Please take a look at:
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup
>
> On Sat, Jan 9, 2016 at 11:16 AM, Jorge Machado  wrote:
>
>> Hello everyone,
>>
>>
>> I'm just wondering how you guys develop for Spark.
>>
>> For example I cannot find any decent documentation for connecting Spark
>> to Eclipse using maven or sbt.
>>
>> Is there any link around ?
>>
>>
>> Jorge
>> thanks
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>