Concatenate a string to a Column of type string in DataFrame

2015-12-12 Thread satish chandra j
Hi,
I am trying to update a column value in a DataFrame. When incrementing a
column of integer data type, the below code works:

val new_df=old_df.select(df("Int_Column")+10)

If I apply a similar approach to append a string to a column of string
datatype, as below, it does not error out but returns only
"null" values:

val new_df=old_df.select(df("String_Column")+"00:00:000")
 OR
val dt ="00:00:000"
val new_df=old_df.select(df("String_Column")+toString(dt))

Please suggest an approach to update a column value of datatype String.
Ex: a column value consists of '20-10-2015'; after the update it should be
'20-10-201500:00:000'

Note: the transformation is such that a new DataFrame has to be created from
the old DataFrame.

Regards,
Satish Chandra


Re: Re: Spark assembly in Maven repo?

2015-12-12 Thread Sean Owen
That's exactly what the various artifacts in the Maven repo are for. The
API classes for core are in the core artifact and so on. You don't need an
assembly.

On Sat, Dec 12, 2015 at 12:32 AM, Xiaoyong Zhu 
wrote:

> Yes, so our scenario is to treat the spark assembly as an “SDK” so users
> can develop Spark applications easily without downloading them. In this
> case which way do you guys think might be good?
>
>
>
> Xiaoyong
>
>
>
> *From:* fightf...@163.com [mailto:fightf...@163.com]
> *Sent:* Friday, December 11, 2015 12:08 AM
> *To:* Mark Hamstra 
> *Cc:* Xiaoyong Zhu ; Jeff Zhang ;
> user ; Zhaomin Xu ; Joe Zhang
> (SDE) 
> *Subject:* Re: Re: Spark assembly in Maven repo?
>
>
>
> Agree with you that an assembly jar is not good to publish. However, what he
> really needs is to fetch an updatable Maven jar file.
>
>
> --
>
> fightf...@163.com
>
>
>
> *From:* Mark Hamstra 
>
> *Date:* 2015-12-11 15:34
>
> *To:* fightf...@163.com
>
> *CC:* Xiaoyong Zhu ; Jeff Zhang ;
> user ; Zhaomin Xu ; Joe Zhang
> (SDE) 
>
> *Subject:* Re: RE: Spark assembly in Maven repo?
>
> No, publishing a spark assembly jar is not fine.  See the doc attached to
> https://issues.apache.org/jira/browse/SPARK-11157
> 
> and be aware that a likely goal of Spark 2.0 will be the elimination of
> assemblies.
>
>
>
> On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com 
> wrote:
>
> Using Maven to download the assembly jar is fine. I would recommend
> deploying this assembly jar to your local Maven repo, e.g. a Nexus repo, or
> more likely a snapshot repository.
>
>
> --
>
> fightf...@163.com
>
>
>
> *From:* Xiaoyong Zhu 
>
> *Date:* 2015-12-11 15:10
>
> *To:* Jeff Zhang 
>
> *CC:* user@spark.apache.org; Zhaomin Xu ; Joe Zhang
> (SDE) 
>
> *Subject:* RE: Spark assembly in Maven repo?
>
> Sorry – I didn’t make it clear. It’s actually not a “dependency” – it’s
> actually that we are building a certain plugin for IntelliJ where we want
> to distribute this jar. But since the jar is updated frequently we don't
> want to distribute it together with our plugin but we would like to
> download it via Maven.
>
>
>
> In this case what’s the recommended way?
>
>
>
> Xiaoyong
>
>
>
> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
> *Sent:* Thursday, December 10, 2015 11:03 PM
> *To:* Xiaoyong Zhu 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark assembly in Maven repo?
>
>
>
> I don't think making the assembly jar a dependency is good practice. You may
> run into jar-hell issues in that case.
>
>
>
> On Fri, Dec 11, 2015 at 2:46 PM, Xiaoyong Zhu 
> wrote:
>
> Hi Experts,
>
>
>
> We have a project which has a dependency for the following jar
>
>
>
> spark-assembly--hadoop.jar
>
> for example:
>
> spark-assembly-1.4.1.2.3.3.0-2983-hadoop2.7.1.2.3.3.0-2983.jar
>
>
>
> since this assembly might be updated in the future, I am not sure if there
> is a Maven repo that has the above spark assembly jar? Or should we create
> & upload it to Maven central?
>
>
>
> Thanks!
>
>
>
> Xiaoyong
>
>
>
>
>
>
>
> --
>
> Best Regards
>
> Jeff Zhang
>
>
>
>


Spark does not clean garbage in blockmgr folders on slaves if long running spark-shell is used

2015-12-12 Thread Alexander Pivovarov
Recently I faced an issue with Spark 1.5.2 standalone. Spark does not clean
garbage in blockmgr folders on slaves until I exit from spark-shell.
I opened spark-shell and run my spark program for several input folders.
Then I noticed that Spark uses several GBs of disk space on all slaves in
blockmgr folder, e.g.
spark/spark-xxx/executor-yyy/blockmgr-zzz

Yes, I have several RDDs in memory but according to Spark UI all RDDs use
only Memory (but not disk).
RDDs are cached at the beginning of data processing and at that time
blockmgr folders are almost empty.

So it looks like the jobs I ran in the shell produced some garbage in the
blockmgr folders, and Spark did not clean the folders after the jobs were done.
If I exit from spark-shell, the blockmgr folders are instantly cleaned.

How can I force Spark to clean the blockmgr folders without exiting the shell?
Should I use spark.cleaner.ttl setting?
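Two things that could be tried, sketched below (untested against the exact setup
above; spark.cleaner.ttl takes a duration in seconds, and cachedRdd is only a
placeholder name for one of the RDDs cached in the shell):

  # Start the shell with periodic cleaning of metadata/blocks older than 1 hour
  spark-shell --conf spark.cleaner.ttl=3600

  // Inside the long-running shell: shuffle and block files are normally removed
  // once the RDDs that produced them are unreferenced and garbage-collected on
  // the driver, so unpersisting and forcing a GC can also trigger cleanup
  cachedRdd.unpersist()
  System.gc()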


Re: Release data for spark 1.6?

2015-12-12 Thread sri hari kali charan Tummala
Hi Michael, Ted,

https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48

In the present Spark version there is a bug at line 48: checking whether a
table exists in a database using LIMIT doesn't work for all databases (SQL
Server, for example).

The best way to check whether a table exists in any database is to use
"select * from table where 1=2;" or "select 1 from table where 1=2;", which
is supported by all databases.

Can this change be implemented in Spark 1.6? It would let the
write.mode("append") bug go away.



def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
  // SQL database systems, considering "table" could also include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
}

Solution:-
def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
  // SQL database systems, considering "table" could also include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table where 1=2").executeQuery().next()).isSuccess
}



Thanks
Sri



On Wed, Dec 9, 2015 at 10:30 PM, Michael Armbrust 
wrote:

> The release date is "as soon as possible".  In order to make an Apache
> release we must present a release candidate and have 72-hours of voting by
> the PMC.  As soon as there are no known bugs, the vote will pass and 1.6
> will be released.
>
> In the mean time, I'd love support from the community testing the most
> recent release candidate.
>
> On Wed, Dec 9, 2015 at 2:19 PM, Sri  wrote:
>
>> Hi Ted,
>>
>> Thanks for the info , but there is no particular release date from my
>> understanding the package is in testing there is no release date mentioned.
>>
>> Thanks
>> Sri
>>
>>
>>
>> Sent from my iPhone
>>
>> > On 9 Dec 2015, at 21:38, Ted Yu  wrote:
>> >
>> > See this thread:
>> >
>> >
>> http://search-hadoop.com/m/q3RTtBMZpK7lEFB1/Spark+1.6.0+RC&subj=Re+VOTE+Release+Apache+Spark+1+6+0+RC1+
>> >
>> >> On Dec 9, 2015, at 1:20 PM, "kali.tumm...@gmail.com" <
>> kali.tumm...@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> does anyone know exact release data for spark 1.6 ?
>> >>
>> >> Thanks
>> >> Sri
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Release-data-for-spark-1-6-tp25654.html
>> >> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Thanks & Regards
Sri Tummala


Re: spark data frame write.mode("append") bug

2015-12-12 Thread kali.tumm...@gmail.com
Hi All, 

https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48

In the present Spark version there is a bug at line 48: checking whether a
table exists in a database using LIMIT doesn't work for all databases (SQL
Server, for example).

The best way to check whether a table exists in any database is to use
"select * from table where 1=2;" or "select 1 from table where 1=2;", which
is supported by all databases.

Can this change be implemented in Spark 1.6? It would let the
write.mode("append") bug go away.



def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
  // SQL database systems, considering "table" could also include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
}

Solution:-
def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
  // SQL database systems, considering "table" could also include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table where 1=2").executeQuery().next()).isSuccess
}



Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-data-frame-write-mode-append-bug-tp25650p25693.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark data frame write.mode("append") bug

2015-12-12 Thread sri hari kali charan Tummala
Hi All,

https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48

In the present Spark version there is a bug at line 48: checking whether a
table exists in a database using LIMIT doesn't work for all databases (SQL
Server, for example).

The best way to check whether a table exists in any database is to use
"select * from table where 1=2;" or "select 1 from table where 1=2;", which
is supported by all databases.

Can this change be implemented in Spark 1.6? It would let the
write.mode("append") bug go away.



def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
  // SQL database systems, considering "table" could also include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
}

Solution:-
def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
  // SQL database systems, considering "table" could also include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table where 1=2").executeQuery().next()).isSuccess
}



Thanks

On Wed, Dec 9, 2015 at 4:24 PM, Seongduk Cheon  wrote:

> Not sure, but I think it is a bug as of 1.5.
>
> Spark uses the LIMIT keyword to check whether a table exists:
>
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48
>
> If your database does not support the LIMIT keyword, such as SQL Server,
> Spark tries to create the table:
>
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L272-L275
>
> This issue has already been fixed and will be released in 1.6:
> https://issues.apache.org/jira/browse/SPARK-9078
>
>
> --
> Cheon
>
> 2015-12-09 22:54 GMT+09:00 kali.tumm...@gmail.com 
> :
>
>> Hi Spark Contributors,
>>
>> I am trying to append data to a target table using the
>> df.write.mode("append") functionality, but Spark throws a "table already
>> exists" exception.
>>
>> Is there a fix scheduled in a later Spark release? I am using Spark 1.5.
>>
>> val sourcedfmode=sourcedf.write.mode("append")
>> sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops)
>>
>> Full Code:-
>>
>> https://github.com/kali786516/ScalaDB/blob/master/src/main/java/com/kali/db/SaprkSourceToTargetBulkLoad.scala
>>
>> Spring Config File:-
>>
>> https://github.com/kali786516/ScalaDB/blob/master/src/main/resources/SourceToTargetBulkLoad.xml
>>
>>
>> Thanks
>> Sri
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-data-frame-write-mode-append-bug-tp25650.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


-- 
Thanks & Regards
Sri Tummala


Re: Release data for spark 1.6?

2015-12-12 Thread Ted Yu
Please take a look at SPARK-9078, which allows JDBC dialects to override the
query used for checking table existence.
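A rough sketch of what such a dialect override could look like against the
SPARK-9078 API (the getTableExistsQuery signature is as described in that JIRA
for 1.6; the dialect itself is only illustrative, not a built-in one):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative dialect: probe table existence with a query SQL Server
// also accepts, instead of "SELECT 1 FROM $table LIMIT 1".
case object SqlServerExistsDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:sqlserver")

  // Override added by SPARK-9078 (assumed signature); returns the probe
  // query used to decide whether the table already exists.
  override def getTableExistsQuery(table: String): String =
    s"SELECT 1 FROM $table WHERE 1=0"
}

JdbcDialects.registerDialect(SqlServerExistsDialect)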

> On Dec 12, 2015, at 7:12 PM, sri hari kali charan Tummala 
>  wrote:
> 
> Hi Michael, Ted, 
> 
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48
> 
> In Present spark version in line 48 there is a bug, to check whether table 
> exists in a database using limit doesnt work for all databases sql server for 
> example.
> 
> best way to check whehter table exists in any database is to use, select * 
> from table where 1=2;  or select 1 from table where 1=2; this supports all 
> the databases.
> 
> In spark 1.6 can this change be implemented, this lets  write.mode("append") 
> bug to go away.
> 
> 
> 
> def tableExists(conn: Connection, table: String): Boolean = {
> 
> // Somewhat hacky, but there isn't a good way to identify whether a table 
> exists for all
> // SQL database systems, considering "table" could also include the 
> database name.
> Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 
> 1").executeQuery().next()).isSuccess
>   }
> 
> Solution:-
> def tableExists(conn: Connection, table: String): Boolean = {
> 
> // Somewhat hacky, but there isn't a good way to identify whether a table 
> exists for all
> // SQL database systems, considering "table" could also include the 
> database name.
> Try(conn.prepareStatement(s"SELECT 1 FROM $table where 
> 1=2").executeQuery().next()).isSuccess
>   }
> 
> 
> 
> Thanks
> Sri 
> 
> 
> 
>> On Wed, Dec 9, 2015 at 10:30 PM, Michael Armbrust  
>> wrote:
>> The release date is "as soon as possible".  In order to make an Apache 
>> release we must present a release candidate and have 72-hours of voting by 
>> the PMC.  As soon as there are no known bugs, the vote will pass and 1.6 
>> will be released.
>> 
>> In the mean time, I'd love support from the community testing the most 
>> recent release candidate.
>> 
>>> On Wed, Dec 9, 2015 at 2:19 PM, Sri  wrote:
>>> Hi Ted,
>>> 
>>> Thanks for the info , but there is no particular release date from my 
>>> understanding the package is in testing there is no release date mentioned.
>>> 
>>> Thanks
>>> Sri
>>> 
>>> 
>>> 
>>> Sent from my iPhone
>>> 
>>> > On 9 Dec 2015, at 21:38, Ted Yu  wrote:
>>> >
>>> > See this thread:
>>> >
>>> > http://search-hadoop.com/m/q3RTtBMZpK7lEFB1/Spark+1.6.0+RC&subj=Re+VOTE+Release+Apache+Spark+1+6+0+RC1+
>>> >
>>> >> On Dec 9, 2015, at 1:20 PM, "kali.tumm...@gmail.com" 
>>> >>  wrote:
>>> >>
>>> >> Hi All,
>>> >>
>>> >> does anyone know exact release data for spark 1.6 ?
>>> >>
>>> >> Thanks
>>> >> Sri
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> View this message in context: 
>>> >> http://apache-spark-user-list.1001560.n3.nabble.com/Release-data-for-spark-1-6-tp25654.html
>>> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 
> 
> -- 
> Thanks & Regards
> Sri Tummala
> 


Re: Release data for spark 1.6?

2015-12-12 Thread sri hari kali charan Tummala
thanks Sean and Ted, I will wait for 1.6 to be out.

Happy Christmas to all !

Thanks
Sri

On Sat, Dec 12, 2015 at 12:18 PM, Ted Yu  wrote:

> Please take a look at SPARK-9078 which allows jdbc dialects to override
> the query for checking table existence.
>
> On Dec 12, 2015, at 7:12 PM, sri hari kali charan Tummala <
> kali.tumm...@gmail.com> wrote:
>
> Hi Michael, Ted,
>
>
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48
>
> In Present spark version in line 48 there is a bug, to check whether table
> exists in a database using limit doesnt work for all databases sql server
> for example.
>
> best way to check whehter table exists in any database is to use, select *
> from table where 1=2;  or select 1 from table where 1=2; this supports all
> the databases.
>
> In spark 1.6 can this change be implemented, this lets
>  write.mode("append") bug to go away.
>
>
>
> def tableExists(conn: Connection, table: String): Boolean = { // Somewhat
> hacky, but there isn't a good way to identify whether a table exists for all 
> //
> SQL database systems, considering "table" could also include the database
> name. Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1"
> ).executeQuery().next()).isSuccess }
>
> Solution:-
> def tableExists(conn: Connection, table: String): Boolean = { // Somewhat
> hacky, but there isn't a good way to identify whether a table exists for all 
> //
> SQL database systems, considering "table" could also include the database
> name. Try(conn.prepareStatement(s"SELECT 1 FROM $table where 1=2"
> ).executeQuery().next()).isSuccess }
>
>
>
> Thanks
> Sri
>
>
>
> On Wed, Dec 9, 2015 at 10:30 PM, Michael Armbrust 
> wrote:
>
>> The release date is "as soon as possible".  In order to make an Apache
>> release we must present a release candidate and have 72-hours of voting by
>> the PMC.  As soon as there are no known bugs, the vote will pass and 1.6
>> will be released.
>>
>> In the mean time, I'd love support from the community testing the most
>> recent release candidate.
>>
>> On Wed, Dec 9, 2015 at 2:19 PM, Sri  wrote:
>>
>>> Hi Ted,
>>>
>>> Thanks for the info , but there is no particular release date from my
>>> understanding the package is in testing there is no release date mentioned.
>>>
>>> Thanks
>>> Sri
>>>
>>>
>>>
>>> Sent from my iPhone
>>>
>>> > On 9 Dec 2015, at 21:38, Ted Yu  wrote:
>>> >
>>> > See this thread:
>>> >
>>> >
>>> http://search-hadoop.com/m/q3RTtBMZpK7lEFB1/Spark+1.6.0+RC&subj=Re+VOTE+Release+Apache+Spark+1+6+0+RC1+
>>> >
>>> >> On Dec 9, 2015, at 1:20 PM, "kali.tumm...@gmail.com" <
>>> kali.tumm...@gmail.com> wrote:
>>> >>
>>> >> Hi All,
>>> >>
>>> >> does anyone know exact release data for spark 1.6 ?
>>> >>
>>> >> Thanks
>>> >> Sri
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Release-data-for-spark-1-6-tp25654.html
>>> >> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com .
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Thanks & Regards
> Sri Tummala
>
>


-- 
Thanks & Regards
Sri Tummala


Re: Concatenate a string to a Column of type string in DataFrame

2015-12-12 Thread Yanbo Liang
Hi Satish,

You can refer to the following code snippet:
df.select(concat(col("String_Column"), lit("00:00:000")))

Yanbo
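
Spelled out with the imports it needs (a minimal sketch against the 1.5
DataFrame API; old_df and String_Column are the names from the original
question):

import org.apache.spark.sql.functions.{col, concat, lit}

// "+" on a string column falls back to numeric addition and yields null,
// whereas concat appends the literal to every row of the string column.
val new_df = old_df.select(concat(col("String_Column"), lit("00:00:000")))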

2015-12-12 16:01 GMT+08:00 satish chandra j :

> HI,
> I am trying to update a column value in DataFrame, incrementing a column
> of integer data type than the below code works
>
> val new_df=old_df.select(df("Int_Column")+10)
>
> If I implement the similar approach for appending a string to a column of
> string datatype  as below than it does not error out but returns only
> "null" values
>
> val new_df=old_df.select(df("String_Column")+"00:00:000")
>  OR
> val dt ="00:00:000"
> val new_df=old_df.select(df("String_Column")+toString(dt))
>
> Please suggest if any approach to update a column value of datatype String
> Ex: Column value consist '20-10-2015' post updating it should have
> '20-10-201500:00:000'
>
> Note: Transformation such that new DataFrame has to becreated from old
> DataFrame
>
> Regards,
> Satish Chandra
>
>
>
>


RE: Concatenate a string to a Column of type string in DataFrame

2015-12-12 Thread Satish
Hi,
Will the below-mentioned snippet work for Spark 1.4.0?

Thanks for your inputs

Regards,
Satish 
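
If functions.concat turns out not to be available in 1.4.0, one possible
fallback (an untested sketch, assuming a HiveContext is in use so the SQL
concat function resolves) is to push the concatenation into a SQL expression
via selectExpr:

// Fallback sketch: let the SQL parser handle concat instead of the Scala helper
val new_df = old_df.selectExpr("concat(String_Column, '00:00:000') AS String_Column")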

-Original Message-
From: "Yanbo Liang" 
Sent: ‎12-‎12-‎2015 20:54
To: "satish chandra j" 
Cc: "user" 
Subject: Re: Concatenate a string to a Column of type string in DataFrame

Hi Satish,


You can refer the following code snippet: 
df.select(concat(col("String_Column"), lit("00:00:000")))


Yanbo


2015-12-12 16:01 GMT+08:00 satish chandra j :

HI,
I am trying to update a column value in DataFrame, incrementing a column of 
integer data type than the below code works


val new_df=old_df.select(df("Int_Column")+10)


If I implement the similar approach for appending a string to a column of 
string datatype  as below than it does not error out but returns only "null" 
values


val new_df=old_df.select(df("String_Column")+"00:00:000") 

 OR
val dt ="00:00:000"
val new_df=old_df.select(df("String_Column")+toString(dt))


Please suggest if any approach to update a column value of datatype String
Ex: Column value consist '20-10-2015' post updating it should have 
'20-10-201500:00:000'


Note: Transformation such that new DataFrame has to becreated from old DataFrame


Regards,
Satish Chandra




 

How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
My goal is to use HProf to profile where the bottleneck is.
Is there any way to do this without modifying and rebuilding the Spark source
code?

I've tried to add "
-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/out.hprof"
to the spark-class script, but it only profiles the CPU usage of the
org.apache.spark.deploy.SparkSubmit class and cannot provide insight into
other classes like BlockManager or user classes.

Any suggestions? Thanks a lot!

Best Regards,
Jia


Has the format of a Spark jar file changed in 1.5?

2015-12-12 Thread Steve Lewis
I have been using my own code to build the jar file I use for spark-submit.
In 1.4 I could simply add all class and resource files I found on the
classpath to the jar, and add all jars on the classpath into a directory
called lib inside the jar file.
In 1.5 I see that resources and classes in jars under the lib directory are
not being found, and I am forced to add them at the top level.
Has something changed recently in the structure of Spark jar files or in how
the class loader works? I find little documentation on the structure of a
Spark jar used with spark-submit.
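
For context, a "flat" application jar for spark-submit typically looks like the
listing below when inspected with jar tf; the stock JVM classloader does not
read classes out of nested jars, so entries left under lib/*.jar are not
visible unless something unpacks (shades) them to the top level (the entries
shown here are only illustrative):

  $ jar tf my-spark-app.jar
  META-INF/MANIFEST.MF
  com/example/MyJob.class            <- application classes at the top level
  org/joda/time/DateTime.class       <- dependency classes merged in, not lib/joda-time.jar
  reference.conf                     <- resources also at the top level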


Re: How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Ted Yu
Have you tried adding the option below through spark.executor.extraJavaOptions?

Cheers
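
For example (a sketch only; the agent string is the one from the original
mail, the output file names and application class are placeholders, and the
same agent can be attached to the driver via spark.driver.extraJavaOptions):

  spark-submit \
    --conf "spark.executor.extraJavaOptions=-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/executor.hprof" \
    --conf "spark.driver.extraJavaOptions=-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/driver.hprof" \
    --class com.example.MyApp myapp.jar

Each executor JVM then writes its own hprof file on its local filesystem.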

> On Dec 13, 2015, at 3:36 AM, Jia Zou  wrote:
> 
> My goal is to use hprof to profile where the bottleneck is.
> Is there anyway to do this without modifying and rebuilding Spark source code.
> 
> I've tried to add 
> "-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/out.hprof"
>  to spark-class script, but it can only profile the CPU usage of the  
> org.apache.spark.deploy.SparkSubmit class, and can not provide insights for 
> other classes like BlockManager, and user classes.
> 
> Any suggestions? Thanks a lot!
> 
> Best Regards,
> Jia

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
Hi, Ted, it works, thanks a lot for your help!

--Jia

On Sat, Dec 12, 2015 at 3:01 PM, Ted Yu  wrote:

> Have you tried adding the option below through
> spark.executor.extraJavaOptions ?
>
> Cheers
>
> > On Dec 13, 2015, at 3:36 AM, Jia Zou  wrote:
> >
> > My goal is to use hprof to profile where the bottleneck is.
> > Is there anyway to do this without modifying and rebuilding Spark source
> code.
> >
> > I've tried to add
> "-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/out.hprof"
> to spark-class script, but it can only profile the CPU usage of the
> org.apache.spark.deploy.SparkSubmit class, and can not provide insights for
> other classes like BlockManager, and user classes.
> >
> > Any suggestions? Thanks a lot!
> >
> > Best Regards,
> > Jia
>


Re: spark data frame write.mode("append") bug

2015-12-12 Thread Michael Armbrust
If you want to contribute to the project, open a JIRA/PR:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Sat, Dec 12, 2015 at 3:13 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
>
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48
>
> In Present spark version in line 48 there is a bug, to check whether table
> exists in a database using limit doesnt work for all databases sql server
> for example.
>
> best way to check whehter table exists in any database is to use, select *
> from table where 1=2;  or select 1 from table where 1=2; this supports all
> the databases.
>
> In spark 1.6 can this change be implemented, this lets
> write.mode("append")
> bug to go away.
>
>
>
> def tableExists(conn: Connection, table: String): Boolean = {
>
> // Somewhat hacky, but there isn't a good way to identify whether a
> table exists for all
> // SQL database systems, considering "table" could also include the
> database name.
> Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT
> 1").executeQuery().next()).isSuccess
>   }
>
> Solution:-
> def tableExists(conn: Connection, table: String): Boolean = {
>
> // Somewhat hacky, but there isn't a good way to identify whether a
> table exists for all
> // SQL database systems, considering "table" could also include the
> database name.
> Try(conn.prepareStatement(s"SELECT 1 FROM $table where
> 1=2").executeQuery().next()).isSuccess
>   }
>
>
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-data-frame-write-mode-append-bug-tp25650p25693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Classpath problem trying to use DataFrames

2015-12-12 Thread Ricky
I encountered similar problems using HiveContext. When the code prints the
classloader, it has changed from the AppClassLoader to a MutableClassLoader.




------------------ Original Message ------------------
From: Harsh J 
Date: 2015-12-12 12:09
To: Christopher Brady , user 
Subject: Re: Classpath problem trying to use DataFrames



Do you have all your hive jars listed in the classpath.txt / 
SPARK_DIST_CLASSPATH env., specifically the hive-exec jar? Is the location of 
that jar also the same on all the distributed hosts?

Passing an explicit executor classpath string may also help overcome this
(replace HIVE_BASE_DIR with the root of your Hive installation):


--conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*"
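
A slightly fuller sketch of that idea (the class name and jar are placeholders;
and since the stack trace in this thread is raised on the driver side, the
spark.driver.extraClassPath variant may be the one that actually matters in
yarn-client mode):

  spark-submit --master yarn-client \
    --conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*" \
    --conf "spark.driver.extraClassPath=$HIVE_BASE_DIR/hive/lib/*" \
    --class com.example.HiveCount my-app.jar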

On Sat, Dec 12, 2015 at 6:32 AM Christopher Brady 
 wrote:

I'm trying to run a basic "Hello world" type example using DataFrames
 with Hive in yarn-client mode. My code is:
 
 JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app");
 HiveContext sqlContext = new HiveContext(sc.sc());
 sqlContext.sql("SELECT * FROM my_table").count();
 
 The exception I get on the driver is:
 java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.plan.TableDesc
 
 There are no exceptions on the executors.
 
 That class is definitely on the classpath of the driver, and it runs
 without errors in local mode. I haven't been able to find any similar
 errors on google. Does anyone know what I'm doing wrong?
 
 The full stack trace is included below:
 
 java.lang.NoClassDefFoundError: Lorg/apache/hadoop/hive/ql/plan/TableDesc;
  at java.lang.Class.getDeclaredFields0(Native Method)
  at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
  at java.lang.Class.getDeclaredField(Class.java:1946)
  at
 java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
  at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
  at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
  at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.io.ObjectStreamClass.(ObjectStreamClass.java:468)
  at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
  at
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
  at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject