send this email to subscribe

2016-06-14 Thread Kun Liu
-- 
  Kun Liu
  M.S. in Computer Science
  New York University
  Phone: (917) 864-1016


Re: Spark Assembly jar ?

2016-06-14 Thread Egor Pahomov
It's strange to me that having and supporting a fat jar was never an important
thing. Our scenario is this: we have a big application where Spark is just
another library for data processing. So we cannot create a small jar and
feed it to the Spark scripts - we need to call Spark from the application, and
having the fat jar as a Maven dependency is perfect for that. We have some Spark
installed on the cluster (whatever Cloudera put there), but we often need to
patch Spark for our needs, so we have to bring everything with us. Different
departments use different Spark versions, so we cannot easily share jars on the
cluster. Yes, there are some disadvantages, but the flexibility of changing and
deploying our own Spark build outweighs them.

So we will probably patch the poms as usual to create a fat jar.
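
For illustration only: the thread above patches Maven poms, but the same fat-jar
idea looks roughly like this with sbt-assembly; the plugin and Spark versions and
the merge rules below are assumptions, not the poster's actual build.

// project/plugins.sbt -- plugin version is an assumption
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt -- bundle Spark (or a patched Spark) into the application jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-preview"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop signature files
  case _                             => MergeStrategy.first    // naive conflict resolution
}

Running "sbt assembly" then produces a single jar that can ship with the application.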

2016-06-14 12:23 GMT-07:00 Reynold Xin :

> You just need to run normal packaging and all the scripts are now setup to
> run without the assembly jars.
>
>
> On Tuesday, June 14, 2016, Franklyn D'souza 
> wrote:
>
>> Just wondering where the spark-assembly jar has gone in 2.0. i've been
>> reading that its been removed but i'm not sure what the new workflow is .
>>
>


-- 


Sincerely yours
Egor Pakhomov


Re: Utilizing YARN AM RPC port field

2016-06-14 Thread Mingyu Kim
Thanks for the pointers, Steve!

 

The first option sounds like the most light-weight and non-disruptive option
among them. So, we can add a configuration that enables socket initialization;
if it is enabled, the Spark AM will create a ServerSocket and set it on
SparkContext.
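
A rough sketch of that idea (the wiring is hypothetical, not an existing Spark API):

import java.net.ServerSocket

// Hypothetical sketch: bind on port 0 so the OS picks a free ephemeral port,
// register that port with the ResourceManager instead of the dummy 0,
// and later expose the socket (or just the port) via SparkContext.
val socket = new ServerSocket(0)
val amRpcPort = socket.getLocalPort  // value to report when registering the AM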

 

If there are no objections, I can file a bug and find time to tackle it myself. 

 

Mingyu

 

From: Steve Loughran 
Date: Tuesday, June 14, 2016 at 4:55 AM
To: Mingyu Kim 
Cc: "dev@spark.apache.org" , Matt Cheah 

Subject: Re: Utilizing YARN AM RPC port field

 

 

On 14 Jun 2016, at 01:30, Mingyu Kim  wrote:

 

Hi all,

 

YARN provides a way for an ApplicationMaster to register an RPC port so that a 
client outside the YARN cluster can reach the application for any RPCs, but 
Spark’s YARN AMs simply register a dummy port number of 0. (See 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L74)
 This is useful for the long-running Spark application usecases where jobs are 
submitted via a form of RPC to an already started Spark context running in YARN 
cluster mode. Spark job server 
(https://github.com/spark-jobserver/spark-jobserver) and Livy 
(https://github.com/cloudera/hue/tree/master/apps/spark/java) are good 
open-source examples of these usecases. The current work-around is to have the 
Spark AM make a call back to a configured URL with the port number of the RPC 
server for the client to communicate with the AM.

 

Utilizing YARN AM RPC port allows the port number reporting to be done in a 
secure way (i.e. With AM RPC port field and Kerberized YARN cluster, you don’t 
need to re-invent a way to verify the authenticity of the port number 
reporting.) and removes the callback from YARN cluster back to a client, which 
means you can operate YARN in a low-trust environment and run other client 
applications behind a firewall.

 

A couple of proposals for utilizing YARN AM RPC port I have are, (Note that you 
cannot simply pre-configure the port number and pass it to Spark AM via 
configuration because of potential port conflicts on the YARN node)

 

· Start-up an empty Jetty server during Spark AM initialization, set 
the port number when registering AM with RM, and pass a reference to the Jetty 
server into the Spark application (e.g. through SparkContext) for the 
application to dynamically add servlet/resources to the Jetty server.

· Have an optional static method in the main class (e.g. 
initializeRpcPort()) which optionally sets up a RPC server and returns the RPC 
port. Spark AM can call this method, register the port number to RM and 
continue on with invoking the main method. I don’t see this making a good API, 
though.

 

I’m curious to hear what other people think. Would this be useful for anyone? 
What do you think about the proposals? Please feel free to suggest other ideas. 
Thanks!

 

 

It's a recurrent irritation of mine that you can't ever change the HTTP/RPC 
ports of a YARN AM after launch; it creates a complex startup state where you 
can't register until your IPC endpoints are up.

 

Tactics

 

-Create a socket on an empty port, register it, hand off the port to the RPC 
setup code as the chosen port. Ideally, support a range to scan, so that 
systems which only open a specific range of ports, e.g. 6500-6800 can have 
those ports only scanned. We've done this in other projects.
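
As a rough illustration of the range-scan tactic above (not existing Spark or
YARN code; the range is just an example):

import java.io.IOException
import java.net.ServerSocket

// Try each port in a configured range until one binds; the bound port is
// what would then be registered with the ResourceManager.
def bindInRange(start: Int, end: Int): ServerSocket = {
  var port = start
  while (port <= end) {
    try return new ServerSocket(port)
    catch { case _: IOException => port += 1 }  // port in use, try the next one
  }
  throw new IOException(s"no free port in range $start-$end")
}

val socket = bindInRange(6500, 6800)
println(s"registering AM RPC port ${socket.getLocalPort}")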

 

-serve up the port binding info via a REST API off the AM web; clients hit the 
(HEAD/GET only RM Proxy), ask for the port, work on it. Nonstandard; could be 
extensible with other binding information. (TTL of port caching, )

 

-Use the YARN-913 ZK based registry to register/look up bindings. This is used 
in various YARN apps to register service endpoints (RPC, REST); there's work 
ongoing for DNS support. This would allow you to use DNS against a specific DNS 
server to get the endpoints. Works really well with containerized deployments 
where the apps come up with per-container IP addresses and fixed ports.

Although you couldn't get the latter into the spark-yarn code itself (needs 
Hadoop 2.6+), you can plug in support via the extension point implemented in 
SPARK-11314. I've actually thought of doing that for a while... just been too 
busy.

 

-Just fix the bit of the YARN api that forces you to know your endpoints in 
advance. People will appreciate it, though it will take a while to trickle 
downstream.

 

 

 

 





Re: spark-ec2 scripts with spark-2.0.0-preview

2016-06-14 Thread Shivaram Venkataraman
Can you open an issue on https://github.com/amplab/spark-ec2 ?  I
think we should be able to escape the version string and pass the
2.0.0-preview through the scripts

Shivaram

On Tue, Jun 14, 2016 at 12:07 PM, Sunil Kumar
 wrote:
> Hi,
>
> The spark-ec2 scripts are missing from spark-2.0.0-preview. Is there a
> workaround available? I tried to change the ec2 scripts to accommodate
> spark-2.0.0... If I call the release spark-2.0.0-preview, then it barfs
> because the command line argument --spark-version=spark-2.0.0-preview
> gets translated to spark-2.0.0-preiew (-v is taken as a switch)... If I call
> the release spark-2.0.0, then it can't find it in AWS, since it looks for
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-bin-hadoop2.4.tgz
> instead of
> http://s3.amazonaws.com/spark-related-packages/spark-2.0.0-preview-bin-hadoop2.4.tgz
>
> Any ideas on how to make this work ? How can I tweak/hack the code to look
> for spark-2.0.0-preview in spark-related-packages ?
>
> thanks
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Custom receiver to connect MySQL database

2016-06-14 Thread dvlpr
I have tried that, but it gives me an error because something is missing in my
code.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Custom-receiver-to-connect-MySQL-database-tp17895p17912.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Assembly jar ?

2016-06-14 Thread Reynold Xin
You just need to run normal packaging and all the scripts are now set up to
run without the assembly jars.

On Tuesday, June 14, 2016, Franklyn D'souza 
wrote:

> Just wondering where the spark-assembly jar has gone in 2.0. i've been
> reading that its been removed but i'm not sure what the new workflow is .
>


Spark Assembly jar ?

2016-06-14 Thread Franklyn D'souza
Just wondering where the spark-assembly jar has gone in 2.0. I've been
reading that it's been removed, but I'm not sure what the new workflow is.


Re: Custom receiver to connect MySQL database

2016-06-14 Thread Matthias Niehoff
You must add an output operation to your normal stream application that uses
the receiver. Calling print() on the DStream will do the job.
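
For completeness, a minimal sketch of wiring a custom receiver into a streaming
application with an output operation; the batch interval and constructor
arguments below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("mysql-receiver-demo")
val ssc = new StreamingContext(conf, Seconds(10))

// Register the receiver; the resulting DStream needs at least one output operation.
val rows = ssc.receiverStream(new CustomReceiver("jdbc:mysql://host/db", "user", "pass"))
rows.print()  // without an output operation you get "No output operations registered"

ssc.start()
ssc.awaitTermination()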

2016-06-14 9:29 GMT+02:00 dvlpr :

> Hi folks,
> I have written some codes for custom receiver to get data from MySQL db.
> Belowed code:
>
> class CustomReceiver(url: String, username: String, password: String)
> extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
>
>  case class customer(c_sk: Int, c_add_sk: Int, c_first: String)
>
>   def onStart() {
> // Start the thread that receives data over a connection
> new Thread("MySQL Receiver") {
>   override def run() { receive() }
> }.start()
>   }
>
>   def onStop() {
>// There is nothing much to do as the thread calling receive()
>// is designed to stop by itself isStopped() returns false
>   }
>
>   private def receive() {
> Class.forName("com.mysql.jdbc.Driver").newInstance()
> val con = DriverManager.getConnection(url, username, password)
> }
>
> while executing this code i am getting an error: Exception in thread "main"
> java.lang.IllegalArgumentException: requirement failed: No output
> operations
> registered, so nothing to execute
>
> Please help me to solve my problem ?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Custom-receiver-to-connect-MySQL-database-tp17895.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
www.more4fi.de

Registered office: Solingen | HRB 25917 | Wuppertal Local Court
Executive Board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Supervisory Board: Patric Fedlmeier (Chairman) . Klaus Jäger . Jürgen Schütz

This e-mail, including any attached files, contains confidential and/or legally
protected information. If you are not the intended recipient or have received
this e-mail in error, please notify the sender immediately and delete this
e-mail and any attached files. The unauthorized copying, use or opening of
attached files, as well as the unauthorized forwarding of this e-mail, is not
permitted.


Re: Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Michael Armbrust
NoSuchMethodError always means that you are compiling against a different
classpath than is available at runtime, so it sounds like you are on the
right track.  The project is not abandoned, we're just busy with the
release.  It would be great if you could open a pull request.
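
For reference, a hedged sketch of the kind of sbt dependency pinning that avoids
this class of NoSuchMethodError; the actual settings in spark-perf's
SparkTestsBuild.scala differ, and the json4s version is an assumption that would
need to match what Spark 2.0 actually ships.

// Sketch only: keep compile-time versions aligned with the runtime classpath.
val sparkVersion = sys.props.getOrElse("spark.version", "2.0.0-preview")

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",  // use the cluster's Spark at runtime
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
  "org.json4s"       %% "json4s-jackson" % "3.2.11"                // assumed to match Spark 2.0's json4s
)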

On Tue, Jun 14, 2016 at 4:56 AM, Adam Roberts  wrote:

> Fixed the below problem, grepped for spark.version, noticed some instances
> of 1.5.2 being declared, changed to 2.0.0-preview in
> spark-tests/project/SparkTestsBuild.scala
>
> Next one to fix is:
> 16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
> Exception in thread "main" java.lang.NoSuchMethodError:
> org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;
>
> I'm going to log this and further progress under "Issues" for the project
> itself (probably need to change org.json4s version in
> SparkTestsBuild.scala, now I know this file is super important), so the
> emails here will at least point people there.
>
> Cheers,
>
>
>
>
>
>
>
> From:Adam Roberts/UK/IBM@IBMGB
> To:dev 
> Date:14/06/2016 12:18
> Subject:Databricks SparkPerf with Spark 2.0
> --
>
>
>
> Hi, I'm working on having "SparkPerf" (
> https://github.com/databricks/spark-perf) run with Spark 2.0, noticed a
> few pull requests not yet accepted so concerned this project's been
> abandoned - it's proven very useful in the past for quality assurance as we
> can easily exercise lots of Spark functions with a cluster (perhaps
> exposing problems that don't surface with the Spark unit tests).
>
> I want to use Scala 2.11.8 and Spark 2.0.0 so I'm making my way through
> various files, currently faced with a NoSuchMethod exception
>
> NoSuchMethodError:
> org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
> at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137)
>
> class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
>  override def runTest(rdd: RDD[_], reduceTasks: Int) {
>rdd.asInstanceOf[RDD[(String, String)]]
>  .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _,
> reduceTasks).count()
>  }
> }
>
> Grepping shows
> ./spark-tests/target/streams/compile/incCompileSetup/$global/streams/inc_compile_2.10:/home/aroberts/Desktop/spark-perf/spark-tests/src/main/scala/spark/perf/KVDataTest.scala
> -> rddToPairRDDFunctions
>
> The scheduling-throughput tests complete fine but the problem here is seen
> with agg-by-key (and likely other modules to fix owing to API changes
> between 1.x and 2.x which I guess is the cause of the above problem).
>
> Has anybody already made good progress here? Would like to work together
> and get this available for everyone, I'll be churning through it either
> way. Will be looking at HiBench also.
>
> Next step for me is to use sbt -Dspark.version=2.0.0 (2.0.0-preview?) and
> work from there, although I figured the prep tests stage would do this for
> me (how else is it going to build?).
>
> Cheers,
>
>
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>


Re: Return binary mode in ThriftServer

2016-06-14 Thread lalit sharma
+1 for bringing this back.

 Binary mode needs to be present for working with data visualization tools.

--Regards,
Lalit
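
For context, the visualization and SQL tools mentioned in this thread connect
over the HiveServer2 binary Thrift transport via JDBC, roughly like the sketch
below; host, port (10000 is only the conventional default) and credentials are
placeholders.

import java.sql.DriverManager

// Minimal HiveServer2 JDBC client over the binary transport (sketch).
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()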

On Tue, Jun 14, 2016 at 7:07 PM, Raymond Honderdors <
raymond.honderd...@sizmek.com> wrote:

> I experienced something similar using Spark+MicroStrategy
>
> I reformatted the commands file locally and recompiled Spark 2.0; the
> issue was resolved, but I am not sure I made the change in the correct
> direction.
>
> 1.SPARK-14947 
>
>
>
>
>
>
>
>
>
> *Raymond Honderdors *
>
> *Team Lead Analytics BI*
>
> *Business Intelligence Developer *
>
> *raymond.honderd...@sizmek.com  *
>
> *T +972.7325.3569*
>
> *Herzliya*
>
>
>
> *From:* Chris Fregly [mailto:ch...@fregly.com]
> *Sent:* Tuesday, June 14, 2016 4:06 PM
> *To:* Reynold Xin 
> *Cc:* Egor Pahomov ; dav...@databricks.com;
> dev@spark.apache.org
> *Subject:* Re: Return binary mode in ThriftServer
>
>
>
> +1 on bringing it back.  causing all sorts of problems on my end that was
> not obvious without digging in
>
>
>
> I was having problems building spark, as well, with the
> --hive-thriftserver flag.  also thought I was doing something wrong on my
> end.
>
>
> On Jun 13, 2016, at 9:11 PM, Reynold Xin  wrote:
>
> Thanks for the email. Things like this (and bugs) are exactly the reason
> the preview releases exist. It seems like enough people have run into
> problem with this one that maybe we should just bring it back for backward
> compatibility.
>
> On Monday, June 13, 2016, Egor Pahomov  wrote:
>
> In May due to the SPARK-15095 binary mode was "removed" (code is there,
> but you can not turn it on) from Spark-2.0. In 1.6.1 binary was default and
> in 2.0.0-preview it was removed. It's really annoying:
>
>- I can not use Tableau+Spark anymore
>- I need to change connection URL in SQL client for every analyst in
>my organization. And with Squirrel I experiencing problems with that.
>- We have parts of infrastructure, which connected to data
>infrastructure though ThriftServer. And of course format was binary.
>
> I've created a ticket to get binary back(
> https://issues.apache.org/jira/browse/SPARK-15934), but that's not the
> point. I've experienced this problem a month ago, but haven't done anything
> about it, because I believed, that I'm stupid and doing something wrong.
> But documentation was release recently and it contained no information
> about this new thing and it made me digging.
>
>
>
> Most of what I describe is just annoying, but Tableau+Spark new
> incompatibility I believe is big deal. Maybe I'm wrong and there are ways
> to make things work, it's just I wouldn't expect move to 2.0.0 to be so
> time consuming.
>
>
>
> My point: Do we have any guidelines regarding doing such radical things?
>
>
>
> --
>
>
>
>
> Sincerely yours
> Egor Pakhomov
>
>


RE: Return binary mode in ThriftServer

2016-06-14 Thread Raymond Honderdors
I experienced something similar using Spark+MicroStrategy.
I reformatted the commands file locally and recompiled Spark 2.0; the issue
was resolved, but I am not sure I made the change in the correct direction.
1.SPARK-14947




Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya

From: Chris Fregly [mailto:ch...@fregly.com]
Sent: Tuesday, June 14, 2016 4:06 PM
To: Reynold Xin 
Cc: Egor Pahomov ; dav...@databricks.com; 
dev@spark.apache.org
Subject: Re: Return binary mode in ThriftServer

+1 on bringing it back.  causing all sorts of problems on my end that was not 
obvious without digging in

I was having problems building spark, as well, with the --hive-thriftserver 
flag.  also thought I was doing something wrong on my end.

On Jun 13, 2016, at 9:11 PM, Reynold Xin 
> wrote:
Thanks for the email. Things like this (and bugs) are exactly the reason the 
preview releases exist. It seems like enough people have run into problem with 
this one that maybe we should just bring it back for backward compatibility.

On Monday, June 13, 2016, Egor Pahomov 
> wrote:
In May, due to SPARK-15095, binary mode was "removed" from Spark 2.0 (the code is 
there, but you can not turn it on). In 1.6.1 binary was the default, and in 
2.0.0-preview it was removed. It's really annoying:

  *   I can not use Tableau+Spark anymore.
  *   I need to change the connection URL in the SQL client for every analyst in my 
organization. And with Squirrel I am experiencing problems with that.
  *   We have parts of our infrastructure which are connected to the data 
infrastructure through the ThriftServer. And of course the format was binary.
I've created a ticket to get binary back 
(https://issues.apache.org/jira/browse/SPARK-15934), but that's not the 
point. I experienced this problem a month ago, but hadn't done anything 
about it, because I believed that I was just being stupid and doing something wrong. 
But the documentation was released recently and contained no information about this 
change, which made me start digging.

Most of what I describe is just annoying, but the new Tableau+Spark incompatibility 
is, I believe, a big deal. Maybe I'm wrong and there are ways to make things work; 
it's just that I wouldn't expect the move to 2.0.0 to be this time consuming.

My point: do we have any guidelines regarding doing such radical things?

--
Sincerely yours
Egor Pakhomov



Re: Return binary mode in ThriftServer

2016-06-14 Thread Chris Fregly
+1 on bringing it back. It's causing all sorts of problems on my end that were not 
obvious without digging in.

I was having problems building Spark, as well, with the --hive-thriftserver 
flag. I also thought I was doing something wrong on my end.

> On Jun 13, 2016, at 9:11 PM, Reynold Xin  wrote:
> 
> Thanks for the email. Things like this (and bugs) are exactly the reason the 
> preview releases exist. It seems like enough people have run into problem 
> with this one that maybe we should just bring it back for backward 
> compatibility. 
> 
>> On Monday, June 13, 2016, Egor Pahomov  wrote:
>> In May due to the SPARK-15095 binary mode was "removed" (code is there, but 
>> you can not turn it on) from Spark-2.0. In 1.6.1 binary was default and in 
>> 2.0.0-preview it was removed. It's really annoying: 
>> I can not use Tableau+Spark anymore
>> I need to change connection URL in SQL client for every analyst in my 
>> organization. And with Squirrel I experiencing problems with that.
>> We have parts of infrastructure, which connected to data infrastructure 
>> though ThriftServer. And of course format was binary.
>> I've created a ticket to get binary 
>> back(https://issues.apache.org/jira/browse/SPARK-15934), but that's not the 
>> point. I've experienced this problem a month ago, but haven't done anything 
>> about it, because I believed, that I'm stupid and doing something wrong. But 
>> documentation was release recently and it contained no information about 
>> this new thing and it made me digging. 
>> 
>> Most of what I describe is just annoying, but Tableau+Spark new 
>> incompatibility I believe is big deal. Maybe I'm wrong and there are ways to 
>> make things work, it's just I wouldn't expect move to 2.0.0 to be so time 
>> consuming. 
>> 
>> My point: Do we have any guidelines regarding doing such radical things?
>> 
>> -- 
>> Sincerely yours
>> Egor Pakhomov


Re: [YARN] Small fix for yarn.Client to use buildPath (not Path.SEPARATOR)

2016-06-14 Thread Jacek Laskowski
Hi Steve and Sean,

Didn't expect such a warm welcome from Sean and you! Since I'm with
Spark on YARN these days, let me see what I can do to make it nicer.
Thanks!

I'm going to change Spark to use buildPath first. And then propose
another patch to use Environment.CLASS_PATH_SEPARATOR instead. And
only then I could work on
https://issues.apache.org/jira/browse/YARN-5247. Is this about
changing the annotation(s) only?

Thanks for your support!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 14, 2016 at 1:44 PM, Steve Loughran  wrote:
>
> if you want to be able to build up CPs on windows to run on a Linux cluster, 
> or vice-versa, you really need to be using the 
> Environment.CLASS_PATH_SEPARATOR field, "". This is expanded in the 
> cluster, not in the client
>
> Although tagged as @Public, @Unstable, it's been in there since YARN-1824 & 
> Hadoop 2.4; things rely on it. If someone wants to fix that by submitting a 
> patch to YARN-5247; I'll review it.
>
>> On 13 Jun 2016, at 20:06, Sean Owen  wrote:
>>
>> Yeah it does the same thing anyway. It's fine to consistently use the
>> method. I think there's an instance in ClientSuite that can use it.
>>
>> On Mon, Jun 13, 2016 at 6:50 PM, Jacek Laskowski  wrote:
>>> Hi,
>>>
>>> Just noticed that yarn.Client#populateClasspath uses Path.SEPARATOR
>>> [1] to build a CLASSPATH entry while another similar-looking line uses
>>> buildPath method [2].
>>>
>>> Could a pull request with a change to use buildPath at [1] be
>>> accepted? I'm always confused how to fix such small changes.
>>>
>>> [1] 
>>> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1298
>>> [2] Path.SEPARATOR
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: tpcds q1 - java.lang.NegativeArraySizeException

2016-06-14 Thread Ovidiu-Cristian MARCU
I confirm the same exception for other queries as well. I was able to reproduce 
it many times.
Queries 1, 3 and 5 failed with the same exception. Queries 2 and 4 are running 
ok.

I am using TPCDSQueryBenchmark and I have used the following settings:

spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")

 spark.executor.memory                  102g
 spark.executor.extraJavaOptions        -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ObjectAlignmentInBytes=32
 spark.executor.cores                   16
 spark.driver.maxResultSize             32g
 spark.default.parallelism              128
 spark.sql.shuffle.partitions           128
 spark.sql.parquet.compression.codec    snappy
 spark.sql.optimizer.maxIterations      500
 spark.sql.autoBroadcastJoinThreshold   41943040
 spark.shuffle.file.buffer              64k
 spark.akka.frameSize                   128
 spark.shuffle.manager                  sort


> On 14 Jun 2016, at 00:12, Sameer Agarwal  wrote:
> 
> I'm unfortunately not able to reproduce this on master. Does the query always 
> fail deterministically?
> 
> On Mon, Jun 13, 2016 at 12:54 PM, Ovidiu-Cristian MARCU 
> > 
> wrote:
> Yes, commit ad102af 
> 
>> On 13 Jun 2016, at 21:25, Reynold Xin > > wrote:
>> 
>> Did you try this on master?
>> 
>> 
>> On Mon, Jun 13, 2016 at 11:26 AM, Ovidiu-Cristian MARCU 
>> > 
>> wrote:
>> Hi,
>> 
>> Running the first query of tpcds on a standalone setup (4 nodes, tpcds2 
>> generated for scale 10 and transformed in parquet under hdfs)  it results in 
>> one exception [1].
>> Close to this problem I found this issue 
>> https://issues.apache.org/jira/browse/SPARK-12089 
>>  but it seems to be 
>> solved.
>> 
>> Running the second query is successful.
>> 
>> OpenJDK 64-Bit Server VM 1.7.0_101-b00 on Linux 3.2.0-4-amd64
>> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> TPCDS Snappy:          Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
>> q2                           4512 / 8142          0.0        61769.4        1.0X
>> 
>> Best,
>> Ovidiu
>> 
>> [1]
>> WARN TaskSetManager: Lost task 17.0 in stage 80.0 (TID 4469, 172.16.96.70): 
>> java.lang.NegativeArraySizeException
>>  at 
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
>>  at 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>>  Source)
>>  at 
>> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>  at 
>> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$3$$anon$2.hasNext(WholeStageCodegenExec.scala:386)
>>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>>  at 
>> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>>  at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
>>  at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>>  at 
>> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>>  at 
>> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>>  at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>>  at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>> 
>> ERROR TaskSetManager: Task 17 in stage 80.0 failed 4 times; aborting job
>> 
>> Driver stacktrace:
>>  at org.apache.spark.scheduler.DAGScheduler.org 
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>  at 
>> 

Re: Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Adam Roberts
Fixed the below problem, grepped for spark.version, noticed some instances 
of 1.5.2 being declared, changed to 2.0.0-preview in 
spark-tests/project/SparkTestsBuild.scala

Next one to fix is:
16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
Exception in thread "main" java.lang.NoSuchMethodError: 
org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;

I'm going to log this and further progress under "Issues" for the project 
itself (probably need to change org.json4s version in 
SparkTestsBuild.scala, now I know this file is super important), so the 
emails here will at least point people there.

Cheers,







From:   Adam Roberts/UK/IBM@IBMGB
To: dev 
Date:   14/06/2016 12:18
Subject:Databricks SparkPerf with Spark 2.0



Hi, I'm working on having "SparkPerf" (
https://github.com/databricks/spark-perf) run with Spark 2.0, noticed a 
few pull requests not yet accepted so concerned this project's been 
abandoned - it's proven very useful in the past for quality assurance as 
we can easily exercise lots of Spark functions with a cluster (perhaps 
exposing problems that don't surface with the Spark unit tests). 

I want to use Scala 2.11.8 and Spark 2.0.0 so I'm making my way through 
various files, currently faced with a NoSuchMethod exception 

NoSuchMethodError: 
org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
 
at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137) 

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
rdd.asInstanceOf[RDD[(String, String)]]
  .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _, 
reduceTasks).count()
  } 
}

Grepping shows
./spark-tests/target/streams/compile/incCompileSetup/$global/streams/inc_compile_2.10:/home/aroberts/Desktop/spark-perf/spark-tests/src/main/scala/spark/perf/KVDataTest.scala
 
-> rddToPairRDDFunctions 

The scheduling-throughput tests complete fine but the problem here is seen 
with agg-by-key (and likely other modules to fix owing to API changes 
between 1.x and 2.x which I guess is the cause of the above problem). 

Has anybody already made good progress here? Would like to work together 
and get this available for everyone, I'll be churning through it either 
way. Will be looking at HiBench also. 

Next step for me is to use sbt -Dspark.version=2.0.0 (2.0.0-preview?) and 
work from there, although I figured the prep tests stage would do this for 
me (how else is it going to build?). 

Cheers, 




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU



Re: Utilizing YARN AM RPC port field

2016-06-14 Thread Steve Loughran

On 14 Jun 2016, at 01:30, Mingyu Kim 
> wrote:

Hi all,

YARN provides a way for an ApplicationMaster to register an RPC port so that a 
client outside the YARN cluster can reach the application for any RPCs, but 
Spark’s YARN AMs simply register a dummy port number of 0. (See 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L74)
 This is useful for the long-running Spark application usecases where jobs are 
submitted via a form of RPC to an already started Spark context running in YARN 
cluster mode. Spark job server 
(https://github.com/spark-jobserver/spark-jobserver) and Livy 
(https://github.com/cloudera/hue/tree/master/apps/spark/java) are good 
open-source examples of these usecases. The current work-around is to have the 
Spark AM make a call back to a configured URL with the port number of the RPC 
server for the client to communicate with the AM.

Utilizing YARN AM RPC port allows the port number reporting to be done in a 
secure way (i.e. With AM RPC port field and Kerberized YARN cluster, you don’t 
need to re-invent a way to verify the authenticity of the port number 
reporting.) and removes the callback from YARN cluster back to a client, which 
means you can operate YARN in a low-trust environment and run other client 
applications behind a firewall.

A couple of proposals for utilizing YARN AM RPC port I have are, (Note that you 
cannot simply pre-configure the port number and pass it to Spark AM via 
configuration because of potential port conflicts on the YARN node)

• Start-up an empty Jetty server during Spark AM initialization, set 
the port number when registering AM with RM, and pass a reference to the Jetty 
server into the Spark application (e.g. through SparkContext) for the 
application to dynamically add servlet/resources to the Jetty server.
• Have an optional static method in the main class (e.g. 
initializeRpcPort()) which optionally sets up a RPC server and returns the RPC 
port. Spark AM can call this method, register the port number to RM and 
continue on with invoking the main method. I don’t see this making a good API, 
though.

I’m curious to hear what other people think. Would this be useful for anyone? 
What do you think about the proposals? Please feel free to suggest other ideas. 
Thanks!


It's a recurrent irritation of mine that you can't ever change the HTTP/RPC 
ports of a YARN AM after launch; it creates a complex startup state where you 
can't register until your IPC endpoints are up.

Tactics

-Create a socket on an empty port, register it, hand off the port to the RPC 
setup code as the chosen port. Ideally, support a range to scan, so that 
systems which only open a specific range of ports, e.g. 6500-6800 can have 
those ports only scanned. We've done this in other projects.

-serve up the port binding info via a REST API off the AM web; clients hit the 
(HEAD/GET only RM Proxy), ask for the port, work on it. Nonstandard; could be 
extensible with other binding information. (TTL of port caching, )

-Use the YARN-913 ZK based registry to register/look up bindings. This is used 
in various YARN apps to register service endpoints (RPC, REST); there's work 
ongoing for DNS support. This would allow you to use DNS against a specific DNS 
server to get the endpoints. Works really well with containerized deployments 
where the apps come up with per-container IP addresses and fixed ports.
Although you couldn't get the latter into the spark-yarn code itself (needs 
Hadoop 2.6+), you can plug in support via the extension point implemented in 
SPARK-11314. I've actually thought of doing that for a while... just been too 
busy.

-Just fix the bit of the YARN api that forces you to know your endpoints in 
advance. People will appreciate it, though it will take a while to trickle 
downstream.






Re: [YARN] Small fix for yarn.Client to use buildPath (not Path.SEPARATOR)

2016-06-14 Thread Steve Loughran

if you want to be able to build up CPs on windows to run on a Linux cluster, or 
vice-versa, you really need to be using the Environment.CLASS_PATH_SEPARATOR 
field, "". This is expanded in the cluster, not in the client

Although tagged as @Public, @Unstable, it's been in there since YARN-1824 & 
Hadoop 2.4; things rely on it. If someone wants to fix that by submitting a 
patch to YARN-5247; I'll review it.
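
For readers following along, a rough sketch of what a buildPath-style helper does
(illustrative only, not the actual yarn.Client code); the thread's point is that
the components of one classpath entry are joined in a single place, while the
separator between entries should be the cluster-side token so a Windows client
can submit to a Linux cluster and vice versa.

// Illustrative only -- not Spark's actual implementation.
def buildPath(components: String*): String =
  components.filter(_.nonEmpty).mkString("/")

// One CLASSPATH entry built from path components; "{{PWD}}" stands for YARN's
// cluster-side expansion of the container's working directory.
val confEntry = buildPath("{{PWD}}", "__spark_conf__")  // "{{PWD}}/__spark_conf__"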

> On 13 Jun 2016, at 20:06, Sean Owen  wrote:
> 
> Yeah it does the same thing anyway. It's fine to consistently use the
> method. I think there's an instance in ClientSuite that can use it.
> 
> On Mon, Jun 13, 2016 at 6:50 PM, Jacek Laskowski  wrote:
>> Hi,
>> 
>> Just noticed that yarn.Client#populateClasspath uses Path.SEPARATOR
>> [1] to build a CLASSPATH entry while another similar-looking line uses
>> buildPath method [2].
>> 
>> Could a pull request with a change to use buildPath at [1] be
>> accepted? I'm always confused how to fix such small changes.
>> 
>> [1] 
>> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1298
>> [2] Path.SEPARATOR
>> 
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Adam Roberts
Hi, I'm working on having "SparkPerf" (
https://github.com/databricks/spark-perf) run with Spark 2.0, noticed a 
few pull requests not yet accepted so concerned this project's been 
abandoned - it's proven very useful in the past for quality assurance as 
we can easily exercise lots of Spark functions with a cluster (perhaps 
exposing problems that don't surface with the Spark unit tests).

I want to use Scala 2.11.8 and Spark 2.0.0 so I'm making my way through 
various files, currently faced with a NoSuchMethod exception

NoSuchMethodError: 
org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
 
at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137) 

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
rdd.asInstanceOf[RDD[(String, String)]]
  .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _, 
reduceTasks).count()
  }
}

Grepping shows
./spark-tests/target/streams/compile/incCompileSetup/$global/streams/inc_compile_2.10:/home/aroberts/Desktop/spark-perf/spark-tests/src/main/scala/spark/perf/KVDataTest.scala
 
-> rddToPairRDDFunctions 

The scheduling-throughput tests complete fine but the problem here is seen 
with agg-by-key (and likely other modules to fix owing to API changes 
between 1.x and 2.x which I guess is the cause of the above problem).

Has anybody already made good progress here? Would like to work together 
and get this available for everyone, I'll be churning through it either 
way. Will be looking at HiBench also.

Next step for me is to use sbt -Dspark.version=2.0.0 (2.0.0-preview?) and 
work from there, although I figured the prep tests stage would do this for 
me (how else is it going to build?).

Cheers,




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Custom receiver to connect MySQL database

2016-06-14 Thread dvlpr
Hi folks,
I have written some code for a custom receiver to get data from a MySQL db.
Below is the code:

class CustomReceiver(url: String, username: String, password: String)
extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

 case class customer(c_sk: Int, c_add_sk: Int, c_first: String)
 
  def onStart() {
// Start the thread that receives data over a connection
new Thread("MySQL Receiver") {
  override def run() { receive() }
}.start()
  }

  def onStop() {
   // There is nothing much to do as the thread calling receive()
   // is designed to stop by itself isStopped() returns false
  }

  private def receive() {
Class.forName("com.mysql.jdbc.Driver").newInstance()
val con = DriverManager.getConnection(url, username, password)
}

While executing this code I am getting an error: Exception in thread "main"
java.lang.IllegalArgumentException: requirement failed: No output operations
registered, so nothing to execute

Please help me to solve my problem?
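
A hedged sketch of what the missing body of receive() might look like inside the
CustomReceiver above, once it actually reads rows and hands them to Spark via
store(); the table and column names are hypothetical.

// Sketch only: query the table and push each row to Spark via store().
private def receive() {
  var con: java.sql.Connection = null
  try {
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    con = java.sql.DriverManager.getConnection(url, username, password)
    val rs = con.createStatement().executeQuery(
      "SELECT c_sk, c_add_sk, c_first FROM customer")  // hypothetical table
    while (!isStopped() && rs.next()) {
      store(s"${rs.getInt(1)},${rs.getInt(2)},${rs.getString(3)}")  // hand the row to Spark
    }
    restart("Done reading, restarting receiver to poll again")
  } catch {
    case e: Exception => restart("Error receiving data from MySQL", e)
  } finally {
    if (con != null) con.close()
  }
}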



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Custom-receiver-to-connect-MySQL-database-tp17895.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org