Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz
Hi David,

Can you also try with Spark 1.3 if possible? I believe there was a 2x
improvement on K-Means between 1.2 and 1.3.

Thanks,
Burak



On Sat, Mar 28, 2015 at 9:04 PM, davidshen84  wrote:

> Hi Jao,
>
> Sorry to pop up this old thread. I have the same problem you did, and I
> want to know if you have figured out how to improve k-means on Spark.
>
> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
> cluster has 7 executors, each has 8 cores...
>
> If I set k=5000 which is the required value for my task, the job goes on
> forever...
>
>
> Thanks,
> David
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread davidshen84
Hi Jao,

Sorry to pop up this old thread. I have the same problem you did, and I
want to know if you have figured out how to improve k-means on Spark.

I am using Spark 1.2.0. My data set is about 270k vectors, each has about
350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
cluster has 7 executors, each has 8 cores...

If I set k=5000 which is the required value for my task, the job goes on
forever...


Thanks,
David




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
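
As a hedged aside for large k: in 1.2.x a large share of the time often goes
into the default k-means|| initialization, so comparing against random
initialization is a cheap experiment. A minimal sketch (parameter values are
illustrative; `vectors` is assumed to be a cached
RDD[org.apache.spark.mllib.linalg.Vector]):

import org.apache.spark.mllib.clustering.KMeans

// Illustrative only: random initialization skips the expensive k-means||
// seeding step, which can dominate runtime when k is large (e.g. k = 5000).
val model = new KMeans()
  .setK(5000)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.RANDOM)   // default is KMeans.K_MEANS_PARALLEL
  .run(vectors)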



Re: Add partition support in saveAsParquet

2015-03-28 Thread Michael Armbrust
This is something we are hoping to support in Spark 1.4.  We'll post more
information to JIRA when there is a design.

On Thu, Mar 26, 2015 at 11:22 PM, Jianshi Huang 
wrote:

> Hi,
>
> Anyone has similar request?
>
> https://issues.apache.org/jira/browse/SPARK-6561
>
> When we save a DataFrame into Parquet files, we also want to have it
> partitioned.
>
> The proposed API looks like this:
>
> def saveAsParquet(path: String, partitionColumns: Seq[String])
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
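
Until that lands, a hedged workaround sketch (Spark 1.3 DataFrame API; the
helper name is hypothetical and it assumes a low-cardinality partition column)
is to write one Parquet directory per distinct column value, mimicking
Hive-style partitioning by hand:

import org.apache.spark.sql.DataFrame

// Hypothetical helper: one output directory per distinct value of the column.
def saveAsPartitionedParquet(df: DataFrame, path: String, partitionColumn: String): Unit = {
  // Collect the distinct partition values to the driver (assumes there are few of them).
  val values = df.select(partitionColumn).distinct().collect().map(_.get(0))
  values.foreach { v =>
    df.filter(df(partitionColumn) === v)
      .saveAsParquetFile(s"$path/$partitionColumn=$v")
  }
}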


Re: [spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-28 Thread Michael Armbrust
In this case I'd probably just store it as a String.  Our casting rules
(which come from Hive) are such that when you use a string as a number or
boolean it will be cast to the desired type.

Thanks for the PR btw :)

On Fri, Mar 27, 2015 at 2:31 PM, Eran Medan  wrote:

> Hi everyone,
>
> I had a lot of questions today, sorry if I'm spamming the list, but I
> thought it's better than posting all questions in one thread. Let me know
> if I should throttle my posts ;)
>
> Here is my question:
>
> When I try to have a case class that has Any in it (e.g. I have a
> property map and values can be either String, Int or Boolean, and since we
> don't have union types, Any is the closest thing)
>
> When I try to register such an RDD as a table in 1.2.1 (or convert to
> DataFrame in 1.3 and then register as a table)
>
> I get this weird exception:
>
> Exception in thread "main" scala.MatchError: Any (of class
> scala.reflect.internal.Types$ClassNoArgsTypeRef) at
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
>
> Which from my interpretation simply means that Any is not a valid type
> that Spark SQL can support in its schema.
>
> I already sent a pull request  to
> solve the cryptic exception but my question is - *is there a way to
> support an "Any" type in Spark SQL?*
>
> disclaimer - also posted at
> http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql
>
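
A minimal sketch of the suggestion above, assuming Spark 1.3 and illustrative
names: store the heterogeneous values as String and rely on the Hive-style
casting rules when a value is used as a number or boolean.

// Sketch only: property values kept as String in the case class.
case class Property(key: String, value: String)

import sqlContext.implicits._
val props = sc.parallelize(Seq(
  Property("age", "42"),
  Property("active", "true"),
  Property("name", "alice"))).toDF()
props.registerTempTable("props")

// "42" is compared numerically because the expression expects a number.
sqlContext.sql("SELECT key FROM props WHERE value > 40").show()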


Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Michal Klos
Got it, thanks. Making sure everything is idempotent is definitely a
critical piece for peace of mind.

On Sat, Mar 28, 2015 at 1:47 PM, Aaron Davidson  wrote:

> Note that speculation is off by default to avoid these kinds of unexpected
> issues.
>
> On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran 
> wrote:
>
>>
>> It's worth adding that there's no guarantee that re-evaluated work will
>> be on the same host as before, and in the case of node failure, it is not
>> guaranteed to be elsewhere.
>>
>> This means things that depend on host-local information are going to
>> generate different numbers even if there are no other side effects. Random
>> number generation for seeding RDD.sample() would be a case in point here.
>>
>> There's also the fact that if you enable speculative execution, then
>> operations may be repeated —even in the absence of any failure. If you are
>> doing side effect work, or don't have an output committer whose actions are
>> guaranteed to be atomic then you want to turn that option off.
>>
>> > On 27 Mar 2015, at 19:46, Patrick Wendell  wrote:
>> >
>> > If you invoke this, you will get at-least-once semantics on failure.
>> > For instance, if a machine dies in the middle of executing the foreach
>> > for a single partition, that will be re-executed on another machine.
>> > It could even fully complete on one machine, but the machine dies
>> > immediately before reporting the result back to the driver.
>> >
>> > This means you need to make sure the side-effects are idempotent, or
>> > use some transactional locking. Spark's own output operations, such as
>> > saving to Hadoop, use such mechanisms. For instance, in the case of
>> > Hadoop it uses the OutputCommitter classes.
>> >
>> > - Patrick
>> >
>> > On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos 
>> wrote:
>> >> Hi Spark group,
>> >>
>> >> We haven't been able to find clear descriptions of how Spark handles
>> the
>> >> resiliency of RDDs in relationship to executing actions with
>> side-effects.
>> >> If you do an `rdd.foreach(someSideEffect)`, then you are doing a
>> side-effect
>> >> for each element in the RDD. If a partition goes down -- the resiliency
>> >> rebuilds the data, but does it keep track of how far it got in the
>> >> partition's set of data, or will it start from the beginning again? So will
>> >> it do at-least-once execution of foreach closures or at-most-once?
>> >>
>> >> thanks,
>> >> Michal
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
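
A hedged sketch of the idempotency advice above (the key-value client and the
record fields are hypothetical): derive the write key deterministically from
the data, so a re-executed partition overwrites earlier writes instead of
duplicating them.

rdd.foreachPartition { records =>
  val store = KeyValueClient.connect()   // hypothetical client, opened once per partition
  try {
    records.foreach { rec =>
      // Same key on retry => the put overwrites rather than appending a duplicate.
      store.put(s"record-${rec.id}", rec.payload)
    }
  } finally {
    store.close()
  }
}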


Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Aaron Davidson
Note that speculation is off by default to avoid these kinds of unexpected
issues.

On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran 
wrote:

>
> It's worth adding that there's no guarantee that re-evaluated work will
> be on the same host as before, and in the case of node failure, it is not
> guaranteed to be elsewhere.
>
> This means things that depend on host-local information are going to
> generate different numbers even if there are no other side effects. Random
> number generation for seeding RDD.sample() would be a case in point here.
>
> There's also the fact that if you enable speculative execution, then
> operations may be repeated —even in the absence of any failure. If you are
> doing side effect work, or don't have an output committer whose actions are
> guaranteed to be atomic then you want to turn that option off.
>
> > On 27 Mar 2015, at 19:46, Patrick Wendell  wrote:
> >
> > If you invoke this, you will get at-least-once semantics on failure.
> > For instance, if a machine dies in the middle of executing the foreach
> > for a single partition, that will be re-executed on another machine.
> > It could even fully complete on one machine, but the machine dies
> > immediately before reporting the result back to the driver.
> >
> > This means you need to make sure the side-effects are idempotent, or
> > use some transactional locking. Spark's own output operations, such as
> > saving to Hadoop, use such mechanisms. For instance, in the case of
> > Hadoop it uses the OutputCommitter classes.
> >
> > - Patrick
> >
> > On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos 
> wrote:
> >> Hi Spark group,
> >>
> >> We haven't been able to find clear descriptions of how Spark handles the
> >> resiliency of RDDs in relationship to executing actions with
> side-effects.
> >> If you do an `rdd.foreach(someSideEffect)`, then you are doing a
> side-effect
> >> for each element in the RDD. If a partition goes down -- the resiliency
> >> rebuilds the data, but does it keep track of how far it got in the
> >> partition's set of data, or will it start from the beginning again? So will
> >> it do at-least-once execution of foreach closures or at-most-once?
> >>
> >> thanks,
> >> Michal
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-28 Thread Michael Stone
I've also been having trouble running 1.3.0 on HDP. The 
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
configuration directive seems to work with pyspark, but does not propagate 
when using spark-shell. (That is, everything works fine with pyspark, 
and spark-shell fails with the "bad substitution" message.)


Mike Stone

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
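
A commonly reported workaround (hedged; untested here) is to set the hdp.version
system property for both the driver and the YARN AM in conf/spark-defaults.conf,
so that spark-shell picks it up as well:

spark.driver.extraJavaOptions    -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions   -Dhdp.version=2.2.0.0-2041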



Re: Spark-submit not working when application jar is in hdfs

2015-03-28 Thread Ted Yu
Looking at SparkSubmit#addJarToClasspath():

uri.getScheme match {
  case "file" | "local" =>
...
  case _ =>
printWarning(s"Skip remote jar $uri.")

It seems hdfs scheme is not recognized.

FYI

On Thu, Feb 26, 2015 at 6:09 PM, dilm  wrote:

> I'm trying to run a spark application using bin/spark-submit. When I
> reference my application jar inside my local filesystem, it works. However,
> when I copied my application jar to a directory in hdfs, i get the
> following
> exception:
>
> Warning: Skip remote jar
> hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
> java.lang.ClassNotFoundException: com.example.SimpleApp
>
> Here's the comand:
>
> $ ./bin/spark-submit --class com.example.SimpleApp --master local
> hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar
>
> I'm using hadoop version 2.6.0, spark version 1.2.1
>
> In the official documentation, it states that: "application-jar:
> Path to a bundled jar including your application and all dependencies. The
> URL must be globally visible inside of your cluster, for instance, an
> *hdfs:// path* or a file:// path that is present on all nodes." I'm
> thinking
> maybe this is a valid bug?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
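
A hedged workaround, if YARN is available: in yarn-cluster mode the application
jar is localized by YARN itself, so an hdfs:// path is accepted, while in local
or standalone client mode keeping the jar on the local filesystem avoids the
warning. For example (sketch only, reusing the command above):

./bin/spark-submit --class com.example.SimpleApp --master yarn-cluster \
  hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar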


Re: Spark-submit not working when application jar is in hdfs

2015-03-28 Thread rrussell25
Hi, did you resolve this issue or just work around it by keeping your
application jar local?  Running into the same issue with 1.3.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840p22272.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Understanding Spark Memory distribution

2015-03-28 Thread Wisely Chen
Hi Ankur

If your hardware is OK, it looks like a config problem. Can you show me
your spark-env.sh or the JVM config?

Thanks

Wisely Chen

2015-03-28 15:39 GMT+08:00 Ankur Srivastava :

> Hi Wisely,
> I have 26 GB for the driver, and the master is running on m3.2xlarge machines.
>
> I see OOM errors on workers even though they are running with 26 GB of memory.
>
> Thanks
>
> On Fri, Mar 27, 2015, 11:43 PM Wisely Chen  wrote:
>
>> Hi
>>
>> In broadcast, Spark will collect the whole 3 GB object on the master node
>> and broadcast it to each slave. It is a very common situation that the master
>> node doesn't have enough memory.
>>
>> What is your master node settings?
>>
>> Wisely Chen
>>
>> Ankur Srivastava  wrote on Saturday, March 28, 2015:
>>
>> I have increased the "spark.storage.memoryFraction" to 0.4 but I still
>>> get OOM errors on Spark Executor nodes
>>>
>>>
>>> 15/03/27 23:19:51 INFO BlockManagerMaster: Updated info of block
>>> broadcast_5_piece10
>>>
>>> 15/03/27 23:19:51 INFO TorrentBroadcast: Reading broadcast variable 5
>>> took 2704 ms
>>>
>>> 15/03/27 23:19:52 INFO MemoryStore: ensureFreeSpace(672530208) called
>>> with curMem=2484698683, maxMem=9631778734
>>>
>>> 15/03/27 23:19:52 INFO MemoryStore: Block broadcast_5 stored as values
>>> in memory (estimated size 641.4 MB, free 6.0 GB)
>>>
>>> 15/03/27 23:34:02 WARN AkkaUtils: Error sending message in 1 attempts
>>>
>>> java.util.concurrent.TimeoutException: Futures timed out after [30
>>> seconds]
>>>
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>>
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>>
>>> at
>>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>>
>>> at
>>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>>
>>> at scala.concurrent.Await$.result(package.scala:107)
>>>
>>> at
>>> org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)
>>>
>>> at
>>> org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:407)
>>>
>>> 15/03/27 23:34:02 ERROR Executor: Exception in task 7.0 in stage 2.0
>>> (TID 4007)
>>>
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1986)
>>>
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>>
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>>
>>> Thanks
>>>
>>> Ankur
>>>
>>> On Fri, Mar 27, 2015 at 2:52 PM, Ankur Srivastava <
>>> ankur.srivast...@gmail.com> wrote:
>>>
 Hi All,

 I am running a spark cluster on EC2 instances of type: m3.2xlarge. I
 have given 26gb of memory with all 8 cores to my executors. I can see that
 in the logs too:

 *15/03/27 21:31:06 INFO AppClient$ClientActor: Executor added:
 app-20150327213106-/0 on worker-20150327212934-10.x.y.z-40128
 (10.x.y.z:40128) with 8 cores*

 I am not caching any RDD so I have set "spark.storage.memoryFraction"
 to 0.2. I can see on SparkUI under executors tab Memory used is 0.0/4.5 GB.

 I am now confused with these logs?

 *15/03/27 21:31:08 INFO BlockManagerMasterActor: Registering block
 manager 10.77.100.196:58407  with 4.5 GB RAM,
 BlockManagerId(4, 10.x.y.z, 58407)*

 I am broadcasting a large object of 3 gb and after that when I am
 creating an RDD, I see logs which show this 4.5 GB memory getting full and
 then I get OOM.

 How can I make block manager use more memory?

 Is there any other fine tuning I need to do for broadcasting large
 objects?

 And does broadcast variable use cache memory or rest of the heap?


 Thanks

 Ankur

>>>
>>>
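
For reference, a hedged sketch of the knobs discussed in this thread (the values
are illustrative, not recommendations):

import org.apache.spark.SparkConf

// Illustrative only. Driver memory normally has to be set before the JVM starts
// (e.g. via spark-submit --driver-memory); it is listed here just to name the knob.
val conf = new SparkConf()
  .setAppName("broadcast-memory-sketch")
  .set("spark.driver.memory", "26g")           // the driver holds the 3 GB object before broadcasting it
  .set("spark.executor.memory", "26g")
  .set("spark.storage.memoryFraction", "0.4")  // broadcast blocks are kept by the BlockManager, so they count against this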


[Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-28 Thread Nathan Marin
Hi,

I’ve been trying to use Spark Streaming for my real-time analysis
application using the Kafka Stream API on a cluster (using the yarn
version) of 6 executors with 4 dedicated cores and 8192mb of dedicated
RAM.

The thing is, my application should run 24/7, but disk usage keeps
growing. This leads to exceptions when Spark tries to write to a file
system where no space is left.

Here are some graphs showing the disk space remaining on a node where
my application is deployed:
http://i.imgur.com/vdPXCP0.png
The "drops" occurred on a 3 minute interval.

The Disk Usage goes back to normal once I kill my application:
http://i.imgur.com/ERZs2Cj.png

The persistence level of my RDD is MEMORY_AND_DISK_SER_2, but even
when I tried MEMORY_ONLY_SER_2 the same thing happened (this mode
shouldn't even allow spark to write on disk, right?).

My question is: how can I force Spark (Streaming?) to remove whatever
it stores immediately after it has been processed? The disk doesn't look
like it is being cleaned up (even though the memory is), even though I
call the rdd.unpersist() method on each RDD after it is processed.

Here’s a sample of my application code:
http://pastebin.com/K86LE1J6

Maybe something is wrong in my app too?

Thanks for your help,
NM




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Disk-not-being-cleaned-up-during-runtime-after-RDD-being-processed-tp22271.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
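
A hedged sketch of settings that influence cleanup in a long-running Spark 1.x
streaming job (the names are real configuration keys, but whether they address
this particular leak is untested here):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-cleanup-sketch")
  .set("spark.streaming.unpersist", "true")   // drop generated RDDs automatically once out of scope (default)
  .set("spark.cleaner.ttl", "3600")           // 1.x: force periodic cleanup of old metadata and blocks (seconds)

val ssc = new StreamingContext(conf, Seconds(10))
ssc.remember(Seconds(120))                    // keep only the last two minutes of RDDs around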

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Ted Yu
See
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

I haven't tried the SQL statements in above blog myself.

Cheers

On Sat, Mar 28, 2015 at 5:39 AM, Vincent He 
wrote:

> Thanks for your information. I have read it; I can run the samples with Scala
> or Python, but with the spark-sql shell I cannot get an example running
> successfully. Can you give me an example I can run with "./bin/spark-sql"
> without writing any code? Thanks
>
> On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu  wrote:
>
>> Please take a look at
>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> Cheers
>>
>>
>>
>> > On Mar 28, 2015, at 5:08 AM, Vincent He 
>> wrote:
>> >
>> >
>> > I am learning Spark SQL and trying the spark-sql examples. I am running the following
>> code, but I got exception "ERROR CliDriver:
>> org.apache.spark.sql.AnalysisException: cannot recognize input near
>> 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17", I have two
>> questions,
>> > 1. Do we have a list of the statement supported in spark-sql ?
>> > 2. Does spark-sql shell support hiveql ? If yes, how to set?
>> >
>> > The example I tried:
>> > CREATE TEMPORARY TABLE jsonTable
>> > USING org.apache.spark.sql.json
>> > OPTIONS (
>> >   path "examples/src/main/resources/people.json"
>> > )
>> > SELECT * FROM jsonTable
>> > The exception I got,
>> > > CREATE TEMPORARY TABLE jsonTable
>> >  > USING org.apache.spark.sql.json
>> >  > OPTIONS (
>> >  >   path "examples/src/main/resources/people.json"
>> >  > )
>> >  > SELECT * FROM jsonTable
>> >  > ;
>> > 15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY
>> TABLE jsonTable
>> > USING org.apache.spark.sql.json
>> > OPTIONS (
>> >   path "examples/src/main/resources/people.json"
>> > )
>> > SELECT * FROM jsonTable
>> > NoViableAltException(241@[654:1: ddlStatement : (
>> createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement |
>> createTableStatement | dropTableStatement | truncateTableStatement |
>> alterStatement | descStatement | showStatement | metastoreCheck |
>> createViewStatement | dropViewStatement | createFunctionStatement |
>> createMacroStatement | createIndexStatement | dropIndexStatement |
>> dropFunctionStatement | dropMacroStatement | analyzeStatement |
>> lockStatement | unlockStatement | lockDatabase | unlockDatabase |
>> createRoleStatement | dropRoleStatement | grantPrivileges |
>> revokePrivileges | showGrants | showRoleGrants | showRolePrincipals |
>> showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
>> > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
>> > at org.antlr.runtime.DFA.predict(DFA.java:144)
>> > at
>> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
>> > at
>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
>> > at
>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
>> > at
>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
>> > at
>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>> > at
>> org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
>> > at
>> org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
>> > at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>> > at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>> > at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>> > at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>> >at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> >at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> > at
>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> > at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> > at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> > at
>> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>> > at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> > at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> > at
>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> > at
>> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>> > at
>> scala.util.parsing.combinator.Parsers$$anon$2
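
One hedged observation on the failing example in this thread: in the "Parsing
command:" log line, the CREATE TEMPORARY TABLE ... USING statement and the
SELECT were parsed as a single command, so it may simply be a matter of
terminating each statement with its own semicolon in the spark-sql shell, e.g.:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
);

SELECT * FROM jsonTable;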

input size too large | Performance issues with Spark

2015-03-28 Thread nsareen
Hi All,

I'm facing performance issues with my Spark implementation. Briefly
investigating the WebUI logs, I noticed that my RDD size is 55 GB, the
Shuffle Write is 10 GB and the Input Size is 200 GB. The application is a web
application which does predictive analytics, so we keep most of our data in
memory. This observation was for only 30 minutes of usage of the application by a
single user. We anticipate at least 10-15 users of the application sending
requests in parallel, which makes me a bit nervous.

One constraint we have is that we do not have too many nodes in the cluster;
we may end up with 3-4 machines at best, but they can be scaled up
vertically, each having 24 cores / 512 GB RAM etc., which can allow us to make
a virtual 10-15 node cluster.

Even then, the input size and shuffle write are too high for my liking. Any
suggestions in this regard will be greatly appreciated, as there aren't many
resources on the net for handling performance issues such as these.

Some pointers on my application's data structures & design 

1) The RDD is a JavaPairRDD, with the Key a CustomPOJO containing 3-4
HashMaps and the Value containing 1 HashMap
2) Data is loaded via JDBCRDD during application startup, which also tends
to take a lot of time, since we massage the data once it is fetched from DB
and then save it as JavaPairRDD.
3) Most of the data is structured, but we are still using JavaPairRDD and have
not explored the option of Spark SQL yet.
4) We have only one SparkContext which caters to all the requests coming
into the application from various users.
5) During a single user session, the user can send 3-4 parallel stages consisting
of Map / Group By / Join / Reduce, etc.
6) We have to change the RDD structure using different types of group-by
operations, since the user can drill down / drill up through the data
(aggregation at a higher / lower level). This is where we make use of
group-bys, but there is a cost associated with this.
7) We have observed that the initial RDDs we create have 40-odd
partitions, but after some stage executions like group-bys the partition count
increases to 200 or so; this was odd, and we haven't figured out why it
happens.

In summary, we want to use Spark to give us the capability to process our
in-memory data structures very fast, as well as to scale to a larger volume when
required in the future.
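
Regarding the group-by cost mentioned in point 6, a hedged sketch of the kind of
change that usually shrinks shuffle write (shown in Scala; the Java API has the
same methods, and `pairs` is a hypothetical RDD of key/amount pairs):

import org.apache.spark.SparkContext._

val partitions = 40   // pin the shuffle partition count instead of letting it drift to 200

// groupByKey-style: ships every record across the network, then aggregates
val perKeyTotalSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: pre-aggregates on the map side, so far less shuffle write
val perKeyTotalFast = pairs.reduceByKey(_ + _, partitions)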



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can't access file in spark, but can in hadoop

2015-03-28 Thread Ted Yu
Thanks for the follow-up, Dale.

bq. hdp 2.3.1

Minor correction: should be hdp 2.1.3

Cheers

On Sat, Mar 28, 2015 at 2:28 AM, Johnson, Dale  wrote:

>  Actually I did figure this out eventually.
>
>  I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1).  Spark
> bundles the org/apache/hadoop/hdfs/… classes along with the spark-assembly
> jar.  This turns out to introduce a small incompatibility with hdp 2.3.1.
> I carved these classes out of the jar, and put a distro-provided jar into
> the class path for the hdfs classes, and this fixed the problem.
>
>  Ideally there would be an exclusion in the pom to deal with this.
>
>  Dale.
>
>   From: Zhan Zhang 
> Date: Friday, March 27, 2015 at 4:28 PM
> To: "Johnson, Dale" 
> Cc: Ted Yu , user 
>
> Subject: Re: Can't access file in spark, but can in hadoop
>
> Probably a Guava version conflict issue. What Spark version did you use,
> and which Hadoop version was it compiled against?
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Mar 27, 2015, at 12:13 PM, Johnson, Dale  wrote:
>
>  Yes, I could recompile the hdfs client with more logging, but I don’t
> have the day or two to spare right this week.
>
>  One more thing about this, the cluster is Horton Works 2.1.3 [.0]
>
>  They seem to have a claim of supporting spark on Horton Works 2.2
>
>  Dale.
>
>   From: Ted Yu 
> Date: Thursday, March 26, 2015 at 4:54 PM
> To: "Johnson, Dale" 
> Cc: user 
> Subject: Re: Can't access file in spark, but can in hadoop
>
>   Looks like the following assertion failed:
>   Preconditions.checkState(storageIDsCount == locs.size());
>
>  locs is List
> Can you enhance the assertion to log more information ?
>
>  Cheers
>
> On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson  wrote:
>
>> There seems to be a special kind of "corrupted according to Spark" state
>> of
>> file in HDFS.  I have isolated a set of files (maybe 1% of all files I
>> need
>> to work with) which are producing the following stack dump when I try to
>> sc.textFile() open them.  When I try to open directories, most large
>> directories contain at least one file of this type.  Curiously, the
>> following two lines fail inside of a Spark job, but not inside of a Scoobi
>> job:
>>
>> val conf = new org.apache.hadoop.conf.Configuration
>> val fs = org.apache.hadoop.fs.FileSystem.get(conf)
>>
>> The stack trace follows:
>>
>> 15/03/26 14:22:43 INFO yarn.ApplicationMaster: Final app status: FAILED,
>> exitCode: 15, (reason: User class threw exception: null)
>> Exception in thread "Driver" java.lang.IllegalStateException
>> at
>>
>> org.spark-project.guava.common.base.Preconditions.checkState(Preconditions.java:133)
>> at
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:673)
>> at
>>
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1100)
>> at
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1118)
>> at
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1251)
>> at
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1354)
>> at
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1363)
>> at
>>
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:518)
>> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
>> at
>>
>> org.apache.hadoop.hdfs.DistributedFileSystem$15.(DistributedFileSystem.java:738)
>> at
>>
>> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:727)
>> at
>> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1662)
>> at org.apache.hadoop.fs.FileSystem$5.(FileSystem.java:1724)
>> at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:1721)
>> at
>>
>> com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1125)
>> at
>>
>> com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1123)
>> at
>>
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>>
>> com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$.main(SpellQuery.scala:1123)
>> at
>> com.ebay.ss.niffler.miner.speller.SpellQueryLaunch.main(SpellQuery.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>>
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
>> 15/03/26 14:22:43 INFO yarn.ApplicationMaster: Invoking sc

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Steve Loughran

It's worth adding that there's no guarantee that re-evaluated work will be on 
the same host as before, and in the case of node failure, it is not guaranteed 
to be elsewhere.

This means things that depend on host-local information are going to generate 
different numbers even if there are no other side effects. Random number 
generation for seeding RDD.sample() would be a case in point here. 

There's also the fact that if you enable speculative execution, then operations 
may be repeated —even in the absence of any failure. If you are doing side 
effect work, or don't have an output committer whose actions are guaranteed to 
be atomic then you want to turn that option off.

> On 27 Mar 2015, at 19:46, Patrick Wendell  wrote:
> 
> If you invoke this, you will get at-least-once semantics on failure.
> For instance, if a machine dies in the middle of executing the foreach
> for a single partition, that will be re-executed on another machine.
> It could even fully complete on one machine, but the machine dies
> immediately before reporting the result back to the driver.
> 
> This means you need to make sure the side-effects are idempotent, or
> use some transactional locking. Spark's own output operations, such as
> saving to Hadoop, use such mechanisms. For instance, in the case of
> Hadoop it uses the OutputCommitter classes.
> 
> - Patrick
> 
> On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos  wrote:
>> Hi Spark group,
>> 
>> We haven't been able to find clear descriptions of how Spark handles the
>> resiliency of RDDs in relationship to executing actions with side-effects.
>> If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect
>> for each element in the RDD. If a partition goes down -- the resiliency
>> rebuilds the data, but does it keep track of how far it got in the
>> partition's set of data, or will it start from the beginning again? So will
>> it do at-least-once execution of foreach closures or at-most-once?
>> 
>> thanks,
>> Michal
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Custom edge partitioning in graphX

2015-03-28 Thread arpp
Hi all,
I am working with Spark 1.0.0, mainly for GraphX, and wish to
apply some custom partitioning strategies to the edge list of the graph.
I have generated an edge list file which has the partition number after the
source and destination id on each line. Initially I load the
unannotated graph using GraphLoader, then load the annotated file and
apply the following:


val unpartitionedGraph = GraphLoader.edgeListFile(sc, fname,
  minEdgePartitions = numEPart).cache()
val graph = Graph(unpartitionedGraph.vertices,
  partitionCustom(unpartitionedGraph.edges))
// The above construction is a workaround for Spark 1.0.0

def partitionCustom[ED](edges: RDD[Edge[ED]]): RDD[Edge[ED]] = {
  edges.map(e => (customPartition(e.srcId, e.dstId), e))
    .partitionBy(new HashPartitioner(numPartitions))
    .mapPartitions(_.map(_._2), preservesPartitioning = true)
}

def customPartition(src: VertexId, dst: VertexId): PartitionID = {
  // search for the src and dest line in the loaded annotated file
  // read the third element of that line and return it
}

But this method is inefficient, as it requires loading the same data multiple
times, and it is also slow, as I am performing a large number of searches on really
huge edge list files.
Please suggest some efficient ways of doing this. Also please note that I am
stuck with spark 1.0.0 as I am only a user of the cluster available.

Regards,
Arpit Kumar
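
A hedged sketch of an alternative that avoids the per-edge search: parse the
precomputed partition id directly from the annotated file, so it travels with
each edge. It assumes lines of the form "srcId dstId partitionId" (whitespace
separated); `annotatedFname` and `numPartitions` are placeholders, and the edge
attribute 1 is a dummy value.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.graphx.{Edge, Graph}

val annotatedEdges = sc.textFile(annotatedFname)
  .map { line =>
    val Array(src, dst, part) = line.trim.split("\\s+")
    // Key each edge by its precomputed partition id so HashPartitioner honours it.
    (part.toInt, Edge(src.toLong, dst.toLong, 1))
  }
  .partitionBy(new HashPartitioner(numPartitions))
  .mapPartitions(_.map(_._2), preservesPartitioning = true)

val graph = Graph.fromEdges(annotatedEdges, defaultValue = 1)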



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Custom-edge-partitioning-in-graphX-tp22269.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Vincent He
Thanks for your information. I have read it; I can run the samples with Scala
or Python, but with the spark-sql shell I cannot get an example running
successfully. Can you give me an example I can run with "./bin/spark-sql"
without writing any code? Thanks

On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu  wrote:

> Please take a look at
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Cheers
>
>
>
> > On Mar 28, 2015, at 5:08 AM, Vincent He 
> wrote:
> >
> >
> > I am learning Spark SQL and trying the spark-sql examples. I am running the following
> code, but I got exception "ERROR CliDriver:
> org.apache.spark.sql.AnalysisException: cannot recognize input near
> 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17", I have two
> questions,
> > 1. Do we have a list of the statement supported in spark-sql ?
> > 2. Does spark-sql shell support hiveql ? If yes, how to set?
> >
> > The example I tried:
> > CREATE TEMPORARY TABLE jsonTable
> > USING org.apache.spark.sql.json
> > OPTIONS (
> >   path "examples/src/main/resources/people.json"
> > )
> > SELECT * FROM jsonTable
> > The exception I got,
> > > CREATE TEMPORARY TABLE jsonTable
> >  > USING org.apache.spark.sql.json
> >  > OPTIONS (
> >  >   path "examples/src/main/resources/people.json"
> >  > )
> >  > SELECT * FROM jsonTable
> >  > ;
> > 15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY
> TABLE jsonTable
> > USING org.apache.spark.sql.json
> > OPTIONS (
> >   path "examples/src/main/resources/people.json"
> > )
> > SELECT * FROM jsonTable
> > NoViableAltException(241@[654:1: ddlStatement : (
> createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement |
> createTableStatement | dropTableStatement | truncateTableStatement |
> alterStatement | descStatement | showStatement | metastoreCheck |
> createViewStatement | dropViewStatement | createFunctionStatement |
> createMacroStatement | createIndexStatement | dropIndexStatement |
> dropFunctionStatement | dropMacroStatement | analyzeStatement |
> lockStatement | unlockStatement | lockDatabase | unlockDatabase |
> createRoleStatement | dropRoleStatement | grantPrivileges |
> revokePrivileges | showGrants | showRoleGrants | showRolePrincipals |
> showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
> > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> > at org.antlr.runtime.DFA.predict(DFA.java:144)
> > at
> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
> > at
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
> > at
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
> > at
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
> > at
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
> > at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
> > at
> org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
> > at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> > at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> > at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> > at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> >at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> >at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> > at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> > at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> > at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> > at
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> > at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> > at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> > at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> > at
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> > at
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> > at
> scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> > at
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> > at
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Ted Yu
Please take a look at 
https://spark.apache.org/docs/latest/sql-programming-guide.html

Cheers



> On Mar 28, 2015, at 5:08 AM, Vincent He  wrote:
> 
> 
> I am learning Spark SQL and trying the spark-sql examples. I am running the following code, 
> but I got exception "ERROR CliDriver: org.apache.spark.sql.AnalysisException: 
> cannot recognize input near 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; 
> line 1 pos 17", I have two questions,
> 1. Do we have a list of the statement supported in spark-sql ?
> 2. Does spark-sql shell support hiveql ? If yes, how to set?
> 
> The example I tried:
> CREATE TEMPORARY TABLE jsonTable
> USING org.apache.spark.sql.json
> OPTIONS (
>   path "examples/src/main/resources/people.json"
> )
> SELECT * FROM jsonTable
> The exception I got,
> > CREATE TEMPORARY TABLE jsonTable
>  > USING org.apache.spark.sql.json
>  > OPTIONS (
>  >   path "examples/src/main/resources/people.json"
>  > )
>  > SELECT * FROM jsonTable
>  > ;
> 15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY TABLE 
> jsonTable
> USING org.apache.spark.sql.json
> OPTIONS (
>   path "examples/src/main/resources/people.json"
> )
> SELECT * FROM jsonTable
> NoViableAltException(241@[654:1: ddlStatement : ( createDatabaseStatement | 
> switchDatabaseStatement | dropDatabaseStatement | createTableStatement | 
> dropTableStatement | truncateTableStatement | alterStatement | descStatement 
> | showStatement | metastoreCheck | createViewStatement | dropViewStatement | 
> createFunctionStatement | createMacroStatement | createIndexStatement | 
> dropIndexStatement | dropFunctionStatement | dropMacroStatement | 
> analyzeStatement | lockStatement | unlockStatement | lockDatabase | 
> unlockDatabase | createRoleStatement | dropRoleStatement | grantPrivileges | 
> revokePrivileges | showGrants | showRoleGrants | showRolePrincipals | 
> showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
> at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
> at 
> org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
> at 
> org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
> at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$o

Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Vincent He
I am learning Spark SQL and trying the spark-sql examples. I am running the following
code, but I got exception "ERROR CliDriver:
org.apache.spark.sql.AnalysisException: cannot recognize input near
'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17", I have two
questions,
1. Do we have a list of the statement supported in spark-sql ?
2. Does spark-sql shell support hiveql ? If yes, how to set?

The example I tried:
CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)
SELECT * FROM jsonTable
The exception I got,
> CREATE TEMPORARY TABLE jsonTable
 > USING org.apache.spark.sql.json
 > OPTIONS (
 >   path "examples/src/main/resources/people.json"
 > )
 > SELECT * FROM jsonTable
 > ;
15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY TABLE
jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)
SELECT * FROM jsonTable
NoViableAltException(241@[654:1: ddlStatement : ( createDatabaseStatement |
switchDatabaseStatement | dropDatabaseStatement | createTableStatement |
dropTableStatement | truncateTableStatement | alterStatement |
descStatement | showStatement | metastoreCheck | createViewStatement |
dropViewStatement | createFunctionStatement | createMacroStatement |
createIndexStatement | dropIndexStatement | dropFunctionStatement |
dropMacroStatement | analyzeStatement | lockStatement | unlockStatement |
lockDatabase | unlockDatabase | createRoleStatement | dropRoleStatement |
grantPrivileges | revokePrivileges | showGrants | showRoleGrants |
showRolePrincipals | showRoles | grantRole | revokeRole | setRole |
showCurrentRole );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at
org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
at
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
at
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
at
org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at
scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at
org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at
org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
   at
scala.util.parsing.combinat


How to add all combinations of items rated by user and difference between the ratings?

2015-03-28 Thread anishm
The input file is of format: userid, movieid, rating
From this, I plan to extract all possible combinations of movies and the
difference between their ratings for each user.

(movie1, movie2),(rating(movie1)-rating(movie2))

This should be computed for each user in the dataset. Finally, I
would like to find the average disagreement per movie pair.

(movie1, movie2), average difference between ratings

How do I do the same in python?

I did write code for Hadoop Streaming, but I am having a really hard time
converting it to Spark-compatible code.
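
A hedged sketch of the pairwise computation (written in Scala here; the same
groupByKey / flatMap / reduceByKey operations exist in PySpark, and the input
path is a placeholder):

import org.apache.spark.SparkContext._

// Parse "userid,movieid,rating" lines into (user, (movie, rating)) pairs.
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, movie, rating) = line.split(',')
  (user, (movie, rating.toDouble))
}

// Per user, emit every unordered movie pair with the rating difference.
val pairDiffs = ratings.groupByKey().flatMap { case (_, movies) =>
  val list = movies.toSeq
  for {
    (m1, r1) <- list
    (m2, r2) <- list
    if m1 < m2
  } yield ((m1, m2), r1 - r2)
}

// Average difference per movie pair across all users.
val avgDiff = pairDiffs
  .mapValues(d => (d, 1))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }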



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-all-combinations-of-items-rated-by-user-and-difference-between-the-ratings-tp22268.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
My vector dimension is 360 or so. The data count is about 270k. My
driver has 2.9 GB of memory. I attached a screenshot of the current executor status.
I submitted this job with "--master yarn-cluster". I have a total of 7
worker nodes; one of them acts as the driver. In the screenshot, you can see
that all worker nodes have loaded some data, but the driver has not loaded
any data.

But the funny thing is, when I log on to the driver and check its CPU and
memory status, I see one java process using about 18% of the CPU and
about 1.6 GB of memory.

[image: Inline image 1]

On Sat, Mar 28, 2015 at 7:06 PM Reza Zadeh  wrote:

> How many dimensions does your data have? The size of the k-means model is
> k * d, where d is the dimension of the data.
>
> Since you're using k=1000, if your data has dimension higher than say,
> 10,000, you will have trouble, because k*d doubles have to fit in the
> driver.
>
> Reza
>
> On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen  wrote:
>
>> I have put more detail of my problem at http://stackoverflow.com/
>> questions/29295420/spark-kmeans-computation-cannot-be-distributed
>>
>> It is really appreciate if you can help me take a look at this problem. I
>> have tried various settings and ways to load/partition my data, but I just
>> cannot get rid that long pause.
>>
>>
>> Thanks,
>> David
>>
>>
>>
>>
>>
>> Xi Shen
>> about.me/davidshen
>>
>> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen  wrote:
>>
>>> Yes, I have done repartition.
>>>
>>> I tried to repartition to the number of cores in my cluster. Not
>>> helping...
>>> I tried to repartition to the number of centroids (k value). Not
>>> helping...
>>>
>>>
>>> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley 
>>> wrote:
>>>
 Can you try specifying the number of partitions when you load the data
 to equal the number of executors?  If your ETL changes the number of
 partitions, you can also repartition before calling KMeans.


 On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen  wrote:

> Hi,
>
> I have a large data set, and I expects to get 5000 clusters.
>
> I load the raw data, convert them into DenseVector; then I did
> repartition and cache; finally I give the RDD[Vector] to KMeans.train().
>
> Now the job is running, and data are loaded. But according to the
> Spark UI, all data are loaded onto one executor. I checked that executor,
> and its CPU workload is very low. I think it is using only 1 of the 8
> cores. And all other 3 executors are at rest.
>
> Did I miss something? Is it possible to distribute the workload to all
> 4 executors?
>
>
> Thanks,
> David
>
>

>>
>
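
A hedged sketch of the suggestion in this thread (paths and counts are
illustrative): set the partition count when loading so the ~270k vectors are
spread across all executors, cache before training, and note that k * d for
k = 1000, d = 360 is small enough to fit comfortably on the driver.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val numPartitions = 6 * 8   // executors * cores per executor

val data = sc.textFile("hdfs:///path/to/vectors.csv", numPartitions)   // hypothetical path
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .repartition(numPartitions)
  .cache()

val model = KMeans.train(data, 1000, 20)   // k = 1000, maxIterations = 20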


Re: Can't access file in spark, but can in hadoop

2015-03-28 Thread Johnson, Dale
Actually I did figure this out eventually.

I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1).  Spark bundles 
the org/apache/hadoop/hdfs/… classes along with the spark-assembly jar.  This 
turns out to introduce a small incompatibility with hdp 2.3.1.  I carved these 
classes out of the jar, and put a distro-provided jar into the class path for 
the hdfs classes, and this fixed the problem.

Ideally there would be an exclusion in the pom to deal with this.

Dale.

From: Zhan Zhang <zzh...@hortonworks.com>
Date: Friday, March 27, 2015 at 4:28 PM
To: "Johnson, Dale" <daljohn...@ebay.com>
Cc: Ted Yu <yuzhih...@gmail.com>, user <user@spark.apache.org>
Subject: Re: Can't access file in spark, but can in hadoop

Probably guava version conflicts issue. What spark version did you use, and 
which hadoop version it compile against?

Thanks.

Zhan Zhang

On Mar 27, 2015, at 12:13 PM, Johnson, Dale <daljohn...@ebay.com> wrote:

Yes, I could recompile the hdfs client with more logging, but I don’t have the 
day or two to spare right this week.

One more thing about this: the cluster is Hortonworks 2.1.3[.0].

They claim to support Spark on Hortonworks 2.2.

Dale.

From: Ted Yu <yuzhih...@gmail.com>
Date: Thursday, March 26, 2015 at 4:54 PM
To: "Johnson, Dale" <daljohn...@ebay.com>
Cc: user <user@spark.apache.org>
Subject: Re: Can't access file in spark, but can in hadoop

Looks like the following assertion failed:
  Preconditions.checkState(storageIDsCount == locs.size());

locs is List
Can you enhance the assertion to log more information?

Cheers

On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson <daljohn...@ebay.com> wrote:
There seems to be a special kind of "corrupted according to Spark" state of
file in HDFS.  I have isolated a set of files (maybe 1% of all files I need
to work with) which are producing the following stack dump when I try to
sc.textFile() open them.  When I try to open directories, most large
directories contain at least one file of this type.  Curiously, the
following two lines fail inside of a Spark job, but not inside of a Scoobi
job:

val conf = new org.apache.hadoop.conf.Configuration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)

The stack trace follows:

15/03/26 14:22:43 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception: null)
Exception in thread "Driver" java.lang.IllegalStateException
at
org.spark-project.guava.common.base.Preconditions.checkState(Preconditions.java:133)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:673)
at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1100)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1118)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1251)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1354)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1363)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:518)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
at
org.apache.hadoop.hdfs.DistributedFileSystem$15.(DistributedFileSystem.java:738)
at
org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:727)
at 
org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1662)
at org.apache.hadoop.fs.FileSystem$5.(FileSystem.java:1724)
at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:1721)
at
com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1125)
at
com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1123)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$.main(SpellQuery.scala:1123)
at
com.ebay.ss.niffler.miner.speller.SpellQueryLaunch.main(SpellQuery.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
15/03/26 14:22:43 INFO yarn.ApplicationMaster: Invoking sc stop from
shutdown hook

It appears to have found the three copies of the given HDFS block, but is
performing some sort of validation with them before giving them back to
spark to schedule the job.  But there is

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Reza Zadeh
How many dimensions does your data have? The size of the k-means model is k
* d, where d is the dimension of the data.

Since you're using k=1000, if your data has dimension higher than say,
10,000, you will have trouble, because k*d doubles have to fit in the
driver.
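
As a quick sanity check for your own numbers, here is a rough back-of-the-envelope
sketch (doubles are 8 bytes each; when runs > 1, that many copies of the centers
are kept):

# Approximate size of the k-means model (the cluster centers) held on the driver.
def kmeans_model_mb(k, d, runs=1):
    return k * d * runs * 8 / (1024.0 ** 2)

print(kmeans_model_mb(1000, 10000))  # ~76 MB
print(kmeans_model_mb(5000, 360))    # ~14 MB, the numbers mentioned elsewhere in this thread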

Reza

On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen  wrote:

> I have put more detail of my problem at
> http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
>
> It is really appreciate if you can help me take a look at this problem. I
> have tried various settings and ways to load/partition my data, but I just
> cannot get rid that long pause.
>
>
> Thanks,
> David
>
>
>
>
>
> [image: --]
> Xi Shen
> [image: http://]about.me/davidshen
> 
>   
>
> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen  wrote:
>
>> Yes, I have done repartition.
>>
>> I tried to repartition to the number of cores in my cluster. Not
>> helping...
>> I tried to repartition to the number of centroids (k value). Not
>> helping...
>>
>>
>> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley 
>> wrote:
>>
>>> Can you try specifying the number of partitions when you load the data
>>> to equal the number of executors?  If your ETL changes the number of
>>> partitions, you can also repartition before calling KMeans.
>>>
>>>
>>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen  wrote:
>>>
 Hi,

 I have a large data set, and I expects to get 5000 clusters.

 I load the raw data, convert them into DenseVector; then I did
 repartition and cache; finally I give the RDD[Vector] to KMeans.train().

 Now the job is running, and data are loaded. But according to the Spark
 UI, all data are loaded onto one executor. I checked that executor, and its
 CPU workload is very low. I think it is using only 1 of the 8 cores. And
 all other 3 executors are at rest.

 Did I miss something? Is it possible to distribute the workload to all
 4 executors?


 Thanks,
 David


>>>
>


Re: Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
This is from my Hive installation

-sh-4.1$ ls /apache/hive/lib  | grep derby

derby-10.10.1.1.jar

derbyclient-10.10.1.1.jar

derbynet-10.10.1.1.jar


-sh-4.1$ ls /apache/hive/lib  | grep datanucleus

datanucleus-api-jdo-3.2.6.jar

datanucleus-core-3.2.10.jar

datanucleus-rdbms-3.2.9.jar


-sh-4.1$ ls /apache/hive/lib  | grep mysql

mysql-connector-java-5.0.8-bin.jar

-sh-4.1$


$ hive --version

Hive 0.13.0.2.1.3.6-2

Subversion
git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6
-r 87da9430050fb9cc429d79d95626d26ea382b96c


$



On Sat, Mar 28, 2015 at 1:05 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I tried with a different version of driver but same error
>
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
> --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
> */home/dvasthimal/spark1.3/mysql-connector-java-5.0.8-bin.jar* --files
> $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory 4g
> --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
> --executor-cores 1 --queue hdmi-express --class
> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>
> On Sat, Mar 28, 2015 at 12:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
> wrote:
>
>> This is what am seeing
>>
>>
>>
>> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
>> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
>> --jars
>> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
>> --files $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory
>> 4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
>> --executor-cores 1 --queue hdmi-express --class
>> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
>> startDate=2015-02-16 endDate=2015-02-16
>> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
>> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>>
>>
>> Caused by:
>> org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException:
>> The specified datastore driver ("com.mysql.jdbc.Driver") was not found in
>> the CLASSPATH. Please check your CLASSPATH specification, and the name of
>> the driver.
>>
>>
>>
>>
>> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
>> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
>> --jars
>> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
>> *--files $SPARK_HOME/conf/hive-site.xml  --num-executors 1
>> --driver-memory 4g --driver-java-options "-XX:MaxPermSize=2G"
>> --executor-memory 2g --executor-cores 1 --queue hdmi-express --class
>> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
>> startDate=2015-02-16 endDate=2015-02-16
>> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
>> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>>
>>
>> Caused by: java.sql.SQLException: No suitable driver found for
>> jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB
>> at java.sql.DriverManager.getConnection(DriverManager.java:596)
>>
>>
>> Looks like the driver jar that i got in is not correct,
>>
>> On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>> wrote:
>>
>>> Could someone please share the spark-submit command that shows their
>>> mysql jar containing driver class used to connect to Hive MySQL meta store.
>>>
>>> Even after including it through
>>>
>>>  --driver-class-path
>>> /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
>>> OR (AND)
>>>  --jars /home/dvasth

Re: Understanding Spark Memory distribution

2015-03-28 Thread Ankur Srivastava
Hi Wisely,
I have 26 GB for the driver, and the master is running on an m3.2xlarge
machine.

I see OOM errors on the workers even though they are running with 26 GB of
memory.
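
For reference, the settings discussed in this thread look roughly like the
sketch below (written in PySpark only for brevity; the values are illustrative
and the broadcast object is a small stand-in for the real ~3 GB one):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("broadcast-test")
        .set("spark.executor.memory", "26g")
        .set("spark.storage.memoryFraction", "0.4"))
# Driver memory has to be passed to spark-submit (--driver-memory 26g);
# setting spark.driver.memory here is too late, the driver JVM is already up.

sc = SparkContext(conf=conf)

# Small stand-in for the large object that gets broadcast to every executor.
big_object = {i: str(i) for i in range(10 ** 5)}
bc = sc.broadcast(big_object)

rdd = sc.parallelize(range(1000), 100)
hits = rdd.filter(lambda x: x in bc.value).count()
print(hits)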

Thanks

On Fri, Mar 27, 2015, 11:43 PM Wisely Chen  wrote:

> Hi
>
> In broadcast, spark will collect the whole 3gb object into master node and
> broadcast to each slaves. It is very common situation that the master node
> don't have enough memory .
>
> What is your master node settings?
>
> Wisely Chen
>
> Ankur Srivastava  於 2015年3月28日 星期六寫道:
>
> I have increased the "spark.storage.memoryFraction" to 0.4 but I still
>> get OOM errors on Spark Executor nodes
>>
>>
>> 15/03/27 23:19:51 INFO BlockManagerMaster: Updated info of block
>> broadcast_5_piece10
>>
>> 15/03/27 23:19:51 INFO TorrentBroadcast: Reading broadcast variable 5
>> took 2704 ms
>>
>> 15/03/27 23:19:52 INFO MemoryStore: ensureFreeSpace(672530208) called
>> with curMem=2484698683, maxMem=9631778734
>>
>> 15/03/27 23:19:52 INFO MemoryStore: Block broadcast_5 stored as values in
>> memory (estimated size 641.4 MB, free 6.0 GB)
>>
>> 15/03/27 23:34:02 WARN AkkaUtils: Error sending message in 1 attempts
>>
>> java.util.concurrent.TimeoutException: Futures timed out after [30
>> seconds]
>>
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>
>> at
>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>
>> at
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>
>> at scala.concurrent.Await$.result(package.scala:107)
>>
>> at
>> org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)
>>
>> at
>> org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:407)
>>
>> 15/03/27 23:34:02 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID
>> 4007)
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1986)
>>
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>
>> Thanks
>>
>> Ankur
>>
>> On Fri, Mar 27, 2015 at 2:52 PM, Ankur Srivastava <
>> ankur.srivast...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am running a spark cluster on EC2 instances of type: m3.2xlarge. I
>>> have given 26gb of memory with all 8 cores to my executors. I can see that
>>> in the logs too:
>>>
>>> *15/03/27 21:31:06 INFO AppClient$ClientActor: Executor added:
>>> app-20150327213106-/0 on worker-20150327212934-10.x.y.z-40128
>>> (10.x.y.z:40128) with 8 cores*
>>>
>>> I am not caching any RDD so I have set "spark.storage.memoryFraction" to
>>> 0.2. I can see on SparkUI under executors tab Memory used is 0.0/4.5 GB.
>>>
>>> I am now confused with these logs?
>>>
>>> *15/03/27 21:31:08 INFO BlockManagerMasterActor: Registering block
>>> manager 10.77.100.196:58407  with 4.5 GB RAM,
>>> BlockManagerId(4, 10.x.y.z, 58407)*
>>>
>>> I am broadcasting a large object of 3 gb and after that when I am
>>> creating an RDD, I see logs which show this 4.5 GB memory getting full and
>>> then I get OOM.
>>>
>>> How can I make block manager use more memory?
>>>
>>> Is there any other fine tuning I need to do for broadcasting large
>>> objects?
>>>
>>> And does broadcast variable use cache memory or rest of the heap?
>>>
>>>
>>> Thanks
>>>
>>> Ankur
>>>
>>
>>


Re: Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
I tried with a different version of the driver, but I get the same error.

./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
--jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
*/home/dvasthimal/spark1.3/mysql-connector-java-5.0.8-bin.jar* --files
$SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory 4g
--driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
--executor-cores 1 --queue hdmi-express --class
com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-02-16 endDate=2015-02-16
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2

On Sat, Mar 28, 2015 at 12:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> This is what am seeing
>
>
>
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
> --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
> --files $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory
> 4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
> --executor-cores 1 --queue hdmi-express --class
> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>
>
> Caused by:
> org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException:
> The specified datastore driver ("com.mysql.jdbc.Driver") was not found in
> the CLASSPATH. Please check your CLASSPATH specification, and the name of
> the driver.
>
>
>
>
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
> --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
> *--files $SPARK_HOME/conf/hive-site.xml  --num-executors 1
> --driver-memory 4g --driver-java-options "-XX:MaxPermSize=2G"
> --executor-memory 2g --executor-cores 1 --queue hdmi-express --class
> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>
>
> Caused by: java.sql.SQLException: No suitable driver found for
> jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB
> at java.sql.DriverManager.getConnection(DriverManager.java:596)
>
>
> Looks like the driver jar that i got in is not correct,
>
> On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
> wrote:
>
>> Could someone please share the spark-submit command that shows their
>> mysql jar containing driver class used to connect to Hive MySQL meta store.
>>
>> Even after including it through
>>
>>  --driver-class-path
>> /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
>> OR (AND)
>>  --jars /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
>>
>> I keep getting "Suitable driver not found for
>>
>>
>> Command
>> 
>>
>> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
>> */home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar*:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
>> --jars
>> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
>> */home/dvasthi

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
I have put more detail of my problem at
http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed

It would be really appreciated if you could help me take a look at this
problem. I have tried various settings and ways to load/partition my data, but
I just cannot get rid of that long pause.
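
For completeness, the load / repartition / train pattern I have been trying
looks roughly like the sketch below (PySpark here for brevity; the path, the
parsing and the partition count are placeholders):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-test")

# Load with an explicit number of partitions (roughly executors * cores),
# parse each line into a list of floats, and cache before training.
raw = sc.textFile("hdfs:///path/to/vectors", 48)
data = (raw.map(lambda line: [float(x) for x in line.split(",")])
           .repartition(48)
           .cache())

model = KMeans.train(data, 5000, maxIterations=20,
                     initializationMode="random")
print(model.clusterCenters[:1])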


Thanks,
David





Xi Shen
about.me/davidshen

On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen  wrote:

> Yes, I have done repartition.
>
> I tried to repartition to the number of cores in my cluster. Not helping...
> I tried to repartition to the number of centroids (k value). Not helping...
>
>
> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley 
> wrote:
>
>> Can you try specifying the number of partitions when you load the data to
>> equal the number of executors?  If your ETL changes the number of
>> partitions, you can also repartition before calling KMeans.
>>
>>
>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen  wrote:
>>
>>> Hi,
>>>
>>> I have a large data set, and I expects to get 5000 clusters.
>>>
>>> I load the raw data, convert them into DenseVector; then I did
>>> repartition and cache; finally I give the RDD[Vector] to KMeans.train().
>>>
>>> Now the job is running, and data are loaded. But according to the Spark
>>> UI, all data are loaded onto one executor. I checked that executor, and its
>>> CPU workload is very low. I think it is using only 1 of the 8 cores. And
>>> all other 3 executors are at rest.
>>>
>>> Did I miss something? Is it possible to distribute the workload to all 4
>>> executors?
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>


Re: Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
This is what I am seeing:



./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
--jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
--files $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory
4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
--executor-cores 1 --queue hdmi-express --class
com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-02-16 endDate=2015-02-16
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


Caused by:
org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException:
The specified datastore driver ("com.mysql.jdbc.Driver") was not found in
the CLASSPATH. Please check your CLASSPATH specification, and the name of
the driver.




./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
--jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
*--files $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory
4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
--executor-cores 1 --queue hdmi-express --class
com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-02-16 endDate=2015-02-16
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


Caused by: java.sql.SQLException: No suitable driver found for jdbc:mysql://
db_host_name.vip.ebay.com:3306/HDB
at java.sql.DriverManager.getConnection(DriverManager.java:596)


Looks like the driver jar that I put in is not correct.

On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Could someone please share the spark-submit command that shows their mysql
> jar containing driver class used to connect to Hive MySQL meta store.
>
> Even after including it through
>
>  --driver-class-path
> /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
> OR (AND)
>  --jars /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
>
> I keep getting "Suitable driver not found for
>
>
> Command
> 
>
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
> */home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar*:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
> --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
> */home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.ja*r --files
> $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory 4g
> --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
> --executor-cores 1 --queue hdmi-express --class
> com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
> startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
> Logs
> 
>
> Caused by: java.sql.SQLException: No suitable driver found for
> jdbc:mysql://hostname:3306/HDB
> at java.sql.DriverManager.getConnection(DriverManager.java:596)
> at java.sql.DriverManager.getConnection(DriverManager.java:187)
> at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
> at com.jolbox.bonecp.BoneCP.(BoneCP.java:416)
> ... 68 more
> ...
> ...
>
> 15/03/27 23:56:08 INFO yarn.Client: Uploading resource
> file:/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar -> hdfs://
> apollo-phx-nn.vip.ebay.com:8020/user/dvasthimal/.sparkStaging/application_1426715280024_119815/mysql-connector-java-5.1.34.jar
>
> ...
>
> ...
>
>
>
> -sh-4.1$ jar -tvf ../mysql-connector-java-5.1.34.jar | grep Driver
> 61 Fri Oct 17 08:05:36 GMT

Re: Can spark sql read existing tables created in hive

2015-03-28 Thread ๏̯͡๏
Yes, I am using yarn-cluster, and I did add it via --files. I still get the
"No suitable driver found" error.

Please share a spark-submit command that shows how the MySQL jar containing
the driver class used to connect to the Hive MySQL metastore is included.

Even after including it through

 --driver-class-path
/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
OR (AND)
 --jars /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar

I keep getting the "No suitable driver found for" error.


Command


./bin/spark-submit -v --master yarn-cluster --driver-class-path
*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar*:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
--jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.ja*r --files
$SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory 4g
--driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
--executor-cores 1 --queue hdmi-express --class
com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-02-16 endDate=2015-02-16
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
Logs


Caused by: java.sql.SQLException: No suitable driver found for
jdbc:mysql://hostname:3306/HDB
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.(BoneCP.java:416)
... 68 more
...
...

15/03/27 23:56:08 INFO yarn.Client: Uploading resource
file:/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar ->
hdfs://apollo-NN:8020/user/dvasthimal/.sparkStaging/application_1426715280024_119815/mysql-connector-java-5.1.34.jar

...

...



-sh-4.1$ jar -tvf ../mysql-connector-java-5.1.34.jar | grep Driver
61 Fri Oct 17 08:05:36 GMT-07:00 2014 META-INF/services/java.sql.Driver
  3396 Fri Oct 17 08:05:22 GMT-07:00 2014
com/mysql/fabric/jdbc/FabricMySQLDriver.class
*   692 Fri Oct 17 08:05:22 GMT-07:00 2014 com/mysql/jdbc/Driver.class*
  1562 Fri Oct 17 08:05:20 GMT-07:00 2014
com/mysql/jdbc/NonRegisteringDriver$ConnectionPhantomReference.class
 17817 Fri Oct 17 08:05:20 GMT-07:00 2014
com/mysql/jdbc/NonRegisteringDriver.class
   690 Fri Oct 17 08:05:24 GMT-07:00 2014
com/mysql/jdbc/NonRegisteringReplicationDriver.class
   731 Fri Oct 17 08:05:24 GMT-07:00 2014
com/mysql/jdbc/ReplicationDriver.class
   336 Fri Oct 17 08:05:24 GMT-07:00 2014 org/gjt/mm/mysql/Driver.class
You have new mail in /var/spool/mail/dvasthimal
-sh-4.1$ cat conf/hive-site.xml | grep Driver
  <name>javax.jdo.option.ConnectionDriverName</name>
*  <value>com.mysql.jdbc.Driver</value>*
  <description>Driver class name for a JDBC metastore</description>
-sh-4.1$
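
For what it's worth, the Spark side of the job boils down to querying a
metastore-backed table through HiveContext; a minimal PySpark equivalent of
what I am trying to run looks like this (database and table names are
placeholders):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-read")
sqlContext = HiveContext(sc)  # expects hive-site.xml, e.g. shipped via --files

# Reading an existing Hive table; this is where the metastore (MySQL)
# connection is opened and where the "No suitable driver" error shows up.
df = sqlContext.sql("SELECT * FROM some_db.some_table LIMIT 10")
df.show()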

-- 
Deepak


On Sat, Mar 28, 2015 at 1:06 AM, Michael Armbrust 
wrote:

> Are you running on yarn?
>
>  - If you are running in yarn-client mode, set HADOOP_CONF_DIR to
> /etc/hive/conf/ (or the directory where your hive-site.xml is located).
>  - If you are running in yarn-cluster mode, the easiest thing to do is to
> add--files=/etc/hive/conf/hive-site.xml (or the path for your
> hive-site.xml) to your spark-submit script.
>
> On Fri, Mar 27, 2015 at 5:42 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
> wrote:
>
>> I can recreate tables but what about data. It looks like this is a
>> obvious feature that Spark SQL must be having. People will want to
>> transform tons of data stored in HDFS through Hive from Spark SQL.
>>
>> Spark programming guide suggests its possible.
>>
>>
>> Spark SQL also supports reading and writing data stored in Apache Hive
>> .   Configuration of Hive is done by
>> placing your hive-site.xml file in conf/.
>> https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
>>
>> For some reason its not working.
>>
>>
>> On Fri, Mar 27, 2015 at 3:35 PM, Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>>> Seems Spark SQL accesses some more columns apart from those created by
>>> hive.
>>>
>>> You can always recreate the tables, you would need to execute the table
>>> creation scripts but it would be good to avoid recreation.
>>>
>>> On Fri, Mar 27, 2015 at 3:20 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>>> wrote:
>>>
 I did copy hive-conf.xml form Hive installation into spark-home/conf.
 IT does have all the meta store connection details, host, username, passwd,
 driver and others.



 Snippet
 ==


 

 
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://host.vip.company.com:3306/HDB</value>
 


Spark - Hive Metastore MySQL driver

2015-03-28 Thread ๏̯͡๏
Could someone please share a spark-submit command that shows how the MySQL jar
containing the driver class used to connect to the Hive MySQL metastore is
included.

Even after including it through

 --driver-class-path
/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
OR (AND)
 --jars /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar

I keep getting the "No suitable driver found for" error.


Command


./bin/spark-submit -v --master yarn-cluster --driver-class-path
*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar*:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
--jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.ja*r --files
$SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory 4g
--driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g
--executor-cores 1 --queue hdmi-express --class
com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-02-16 endDate=2015-02-16
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
Logs


Caused by: java.sql.SQLException: No suitable driver found for
jdbc:mysql://hostname:3306/HDB
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.(BoneCP.java:416)
... 68 more
...
...

15/03/27 23:56:08 INFO yarn.Client: Uploading resource
file:/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar -> hdfs://
apollo-phx-nn.vip.ebay.com:8020/user/dvasthimal/.sparkStaging/application_1426715280024_119815/mysql-connector-java-5.1.34.jar

...

...



-sh-4.1$ jar -tvf ../mysql-connector-java-5.1.34.jar | grep Driver
61 Fri Oct 17 08:05:36 GMT-07:00 2014 META-INF/services/java.sql.Driver
  3396 Fri Oct 17 08:05:22 GMT-07:00 2014
com/mysql/fabric/jdbc/FabricMySQLDriver.class
*   692 Fri Oct 17 08:05:22 GMT-07:00 2014 com/mysql/jdbc/Driver.class*
  1562 Fri Oct 17 08:05:20 GMT-07:00 2014
com/mysql/jdbc/NonRegisteringDriver$ConnectionPhantomReference.class
 17817 Fri Oct 17 08:05:20 GMT-07:00 2014
com/mysql/jdbc/NonRegisteringDriver.class
   690 Fri Oct 17 08:05:24 GMT-07:00 2014
com/mysql/jdbc/NonRegisteringReplicationDriver.class
   731 Fri Oct 17 08:05:24 GMT-07:00 2014
com/mysql/jdbc/ReplicationDriver.class
   336 Fri Oct 17 08:05:24 GMT-07:00 2014 org/gjt/mm/mysql/Driver.class
You have new mail in /var/spool/mail/dvasthimal
-sh-4.1$ cat conf/hive-site.xml | grep Driver
  <name>javax.jdo.option.ConnectionDriverName</name>
*  <value>com.mysql.jdbc.Driver</value>*
  <description>Driver class name for a JDBC metastore</description>
-sh-4.1$

-- 
Deepak