java.lang.ClassNotFoundException

2015-08-08 Thread Yasemin Kaya
Hi,

I have a small Spark program and I am getting an error that I don't
understand.
My code is at https://gist.github.com/yaseminn/522a75b863ad78934bc3.
I am using Spark 1.3.
Submitting with: bin/spark-submit --class MonthlyAverage --master local[4]
weather.jar


error:

~/spark-1.3.1-bin-hadoop2.4$ bin/spark-submit --class MonthlyAverage
--master local[4] weather.jar
java.lang.ClassNotFoundException: MonthlyAverage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties


Please help me ASAP.

yasemin
-- 
hiç ender hiç


Re: Spark on YARN

2015-08-08 Thread Sandy Ryza
Hi Jem,

Do they fail with any particular exception?  Does YARN just never end up
giving them resources?  Does an application master start?  If so, what is
in its logs?  If not, is there anything suspicious in the YARN ResourceManager logs?

-Sandy

On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker jem.tuc...@gmail.com wrote:

 Hi,

 I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new
 user to own and run a testing environment; however, when using this user,
 applications I submit to YARN never begin to run, even though they are the
 exact same applications that succeed under another user.

 Has anyone seen anything like this before?

 Thanks,

 Jem



Re: Spark on YARN

2015-08-08 Thread Jem Tucker
Hi Sandy,

The application doesn't fail: it gets accepted by YARN, but the application
master never starts and the application state never changes to RUNNING. I
have checked the ResourceManager and NodeManager logs and nothing jumps
out.

Thanks

Jem





Pagination on big table, splitting joins

2015-08-08 Thread Gaspar Muñoz
Hi,

I have two different parts in my system.

1. A batch application that every x minutes runs SQL queries joining several
tables containing millions of rows to build an entity, and sends those
entities to Kafka.
2. A streaming application that processes the data from Kafka.

The whole system is working now, but I want to improve the performance of
the batch part: if I have 100 million entities, I currently send them all to
Kafka in a single foreach pass, which makes no sense for the downstream
streaming application. I want to send them to Kafka in chunks of 10 million
events, for example.

Imagine I have a query like:

*select ... from table 1 left outer join table 2 on ... left outer join
table 3 on ... left outer join table 4 on ...*

My target is to do *pagination* on table 1: take 10 million rows into a
separate RDD, do the joins and send to Kafka, then take the next 10 million
and do the same, and so on. I have all tables in Parquet format in HDFS.

I am thinking of using the *toLocalIterator* method, something like the
snippet below, but I have doubts about memory and parallelism, and surely
there is a better way to do it.

rdd.toLocalIterator.grouped(1000).foreach { seq =>
  val rdd: RDD[(String, Int)] = sc.parallelize(seq)
  // Do the processing
}
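
For comparison, a minimal sketch that keeps the chunking distributed instead
of pulling rows to the driver, assuming Spark 1.4+ and that table 1 has a
numeric surrogate key column named id (the path, column name and page size
below are hypothetical):

import org.apache.spark.sql.functions.{col, max}

// Page through table 1 by key ranges; each page stays a distributed DataFrame.
val table1 = sqlContext.read.parquet("/hdfs/path/table1")
val maxId = table1.agg(max("id")).first().getLong(0)
val pageSize = 10000000L

(0L to maxId by pageSize).foreach { start =>
  val page = table1.filter(col("id") >= start && col("id") < start + pageSize)
  // do the left outer joins with tables 2..4 on `page`, then send the result to Kafka
}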

What do you think?

Regards.

-- 

Gaspar Muñoz
@gmunozsoria


http://www.stratio.com/
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd (https://twitter.com/StratioBD)


Re: java.lang.ClassNotFoundException

2015-08-08 Thread Ted Yu
Have you tried including the package name in the class name?

Thanks
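
For example, if MonthlyAverage is declared inside a package, spark-submit
needs the fully qualified class name. A minimal sketch, assuming a
hypothetical package com.example.weather:

package com.example.weather

import org.apache.spark.{SparkConf, SparkContext}

object MonthlyAverage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MonthlyAverage"))
    // ... job logic ...
    sc.stop()
  }
}

and then:

bin/spark-submit --class com.example.weather.MonthlyAverage --master local[4] weather.jar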





Re: Spark on YARN

2015-08-08 Thread Jem Tucker
Hi Dustin,

Yes, there are enough resources available; the same application run by a
different user works fine, so I think it is something to do with
permissions, but I can't work out where.

Thanks,

Jem
On Sat, 8 Aug 2015 at 17:35, Dustin Cote dc...@cloudera.com wrote:

 Hi Jem,

 At the top of the RM web UI, do you see any available resources to spawn
 the application master container?






 --
 Dustin Cote
 Customer Operations Engineer
 http://www.cloudera.com




Re: java.lang.ClassNotFoundException

2015-08-08 Thread Yasemin Kaya
Thanks Ted, I solved it :)




-- 
hiç ender hiç


Re: Spark on YARN

2015-08-08 Thread Shushant Arora
Which scheduler is your cluster using? Check the scheduler tab in the RM UI
and look at your user's maximum vcore limit. If that user's other
applications currently occupy all of its vcores, that would explain why no
more vcores are allocated for this user, while the same application runs
fine for another user whose maximum vcore limit has not been reached.
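
If the cluster runs the Fair Scheduler (the CDH default), per-user and
per-queue limits live in the fair-scheduler allocation file. An illustrative
sketch only; the user name, queue name and limits below are hypothetical:

<?xml version="1.0"?>
<allocations>
  <!-- Cap on concurrent applications for the test user. -->
  <user name="testuser">
    <maxRunningApps>5</maxRunningApps>
  </user>
  <!-- Resource cap on the queue the test user submits to. -->
  <queue name="testing">
    <maxResources>40960 mb,16 vcores</maxResources>
  </queue>
</allocations>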




Re: DataFrame column structure change

2015-08-08 Thread Raghavendra Pandey
You can use the struct function from the org.apache.spark.sql.functions
object to combine two columns into a struct column.
Something like:

import org.apache.spark.sql.functions.struct

val nestedCol = struct(df("d"), df("e")).as("newCol")
df.select(df("a"), df("b"), df("c"), nestedCol)
On Aug 7, 2015 3:14 PM, Rishabh Bhardwaj rbnex...@gmail.com wrote:

 I am currently doing it by creating a new data frame out of the fields to
 be nested and then joining it with the original DF.
 I am looking for a more optimized solution here.

 On Fri, Aug 7, 2015 at 2:06 PM, Rishabh Bhardwaj rbnex...@gmail.com
 wrote:

 Hi all,

 I want to create some nesting structure from the existing columns of
 the dataframe.
 For that, I am trying to transform a DF in the following way, but I
 couldn't manage to do it.

 scala> df.printSchema
 root
  |-- a: string (nullable = true)
  |-- b: string (nullable = true)
  |-- c: string (nullable = true)
  |-- d: string (nullable = true)
  |-- e: string (nullable = true)
  |-- f: string (nullable = true)

 *To*

 scala> newDF.printSchema
 root
  |-- a: string (nullable = true)
  |-- b: string (nullable = true)
  |-- c: string (nullable = true)
  |-- newCol: struct (nullable = true)
  |    |-- d: string (nullable = true)
  |    |-- e: string (nullable = true)


  Any help is appreciated.

 Regards,
 Rishabh.





Spark SQL jobs and their partitions

2015-08-08 Thread Raghavendra Pandey
I have complex transformation requirements that I am implementing using
dataframes. They involve a lot of joins, including with a Cassandra table.
I was wondering how I can debug the jobs and stages queued by Spark SQL the
way I can for RDDs.

In one case, Spark SQL creates more than 1.7 million (17 lakh) tasks for 2 GB
of data, even though I have set spark.sql.shuffle.partitions to 32.
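
For reference, a minimal sketch of setting that property on a Spark 1.x
SQLContext (the value 32 comes from the message above):

sqlContext.setConf("spark.sql.shuffle.partitions", "32")

Note that spark.sql.shuffle.partitions only controls the number of partitions
after a shuffle (joins and aggregations); the number of tasks reading the
source data still follows the input splits.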

Raghav


How to create DataFrame from a binary file?

2015-08-08 Thread unk1102
Hi, how do we create a DataFrame from a binary file stored in HDFS? I was
thinking of using

JavaPairRDD<String, PortableDataStream> pairRdd =
    javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
JavaRDD<PortableDataStream> javardd = pairRdd.values();

I can see that PortableDataStream has a method called toArray which can
convert it into a byte array. I was thinking that if I have a JavaRDD<byte[]>
I could call the following and get a DataFrame:

DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd, Byte.class);

Please guide me, I am new to Spark. I have my own custom binary format, and I
was thinking that if I can convert it into a DataFrame using binary
operations, then I don't need to create my own custom Hadoop input format.
Am I on the right track? Will reading binary data into a DataFrame scale?
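
One possible approach, sketched in Scala for brevity: wrap each file's bytes
in a Row and supply an explicit schema using BinaryType, since
createDataFrame(javaBinRdd, Byte.class) expects a JavaBean-style class rather
than raw byte arrays. The path and column names below are hypothetical:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BinaryType, StringType, StructField, StructType}

// One row per file: (file path, full file contents as a byte array).
val rows = sc.binaryFiles("/hdfs/path/to/binfiles")
  .map { case (path, stream) => Row(path, stream.toArray()) }

val schema = StructType(Seq(
  StructField("path", StringType, nullable = false),
  StructField("bytes", BinaryType, nullable = false)))

val binDataFrame = sqlContext.createDataFrame(rows, schema)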






Re: Spark master driver UI: How to keep it after process finished?

2015-08-08 Thread Andrew Or
Hi Saif,

You need to run your application with `spark.eventLog.enabled` set to true.
Then if you are using standalone mode, you can view the Master UI at port
8080. Otherwise, you may start a history server through
`sbin/start-history-server.sh`, which by default starts the history UI at
port 18080.

For more information on how to set this up, visit:
http://spark.apache.org/docs/latest/monitoring.html

-Andrew
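
A minimal sketch of the relevant spark-defaults.conf entries, assuming the
history server reads event logs from a shared directory (the HDFS path is
hypothetical):

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///user/spark/applicationHistory
spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory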


2015-08-07 13:16 GMT-07:00 François Pelletier newslett...@francoispelletier.org:


 look at
 spark.history.ui.port, if you use standalone
 spark.yarn.historyServer.address, if you use YARN

 in your Spark config file

 Mine is located at
 /etc/spark/conf/spark-defaults.conf

 If you use Apache Ambari you can find these settings in the Spark / Configs
 / Advanced spark-defaults tab.
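
 For example, hypothetical entries in spark-defaults.conf (the host and port
 are placeholders):

 spark.history.ui.port             18080
 spark.yarn.historyServer.address  historyserver.example.com:18080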

 François


 On 2015-08-07 15:58, saif.a.ell...@wellsfargo.com wrote:

 Hello, thank you, but that port is unreachable for me. Can you please
 share where I can find the equivalent port in my environment?



 Thank you

 Saif



 *From:* François Pelletier [mailto:newslett...@francoispelletier.org]
 *Sent:* Friday, August 07, 2015 4:38 PM
 *To:* user@spark.apache.org
 *Subject:* Re: Spark master driver UI: How to keep it after process
 finished?



 Hi, all spark processes are saved in the Spark History Server

 look at your host on port 18080 instead of 4040

 François

 On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:

 Hi,



 A silly question here. The Driver Web UI dies when the spark-submit
 program finishes. I would like some time to analyze after the program ends,
 as the page does not refresh itself; when I hit F5 I lose all the info.



 Thanks,

 Saif









Re: Schema change on Spark Hive (Parquet file format) table not working

2015-08-08 Thread sim
Yes, I've found a number of problems with metadata management in Spark SQL. 

One core issue is SPARK-9764 (https://issues.apache.org/jira/browse/SPARK-9764).
Related issues are SPARK-9342 (https://issues.apache.org/jira/browse/SPARK-9342),
SPARK-9761 (https://issues.apache.org/jira/browse/SPARK-9761) and SPARK-9762
(https://issues.apache.org/jira/browse/SPARK-9762).

I've also observed a case where, after an exception in ALTER TABLE, Spark
SQL thought a table had 0 rows while, in fact, all the data was still there.
I was not able to reproduce this one reliably so I did not create a JIRA
issue for it.

Let's vote for these issues and get them resolved.






Re: Spark inserting into parquet files with different schema

2015-08-08 Thread sim
Adam, did you find a solution for this?


