Re: how to implement my own datasource?

2015-06-25 Thread 诺铁
Thank you guys, I'll read the examples and give it a try. On Fri, Jun 26, 2015 at 2:47 AM, jimfcarroll wrote: > > I'm not sure if this is what you're looking for, but we have several custom > RDD implementations for internal data format/partitioning schemes. > > The Spark API is really simple and consis

Re: External shuffle service over YARN

2015-06-25 Thread Sandy Ryza
Hi Yash, One of the main advantages is that, if you turn dynamic allocation on, and executors are discarded, your application is still able to get at the shuffle data that they wrote out. -Sandy On Thu, Jun 25, 2015 at 11:08 PM, yash datta wrote: > Hi devs, > > Can someone point out if there a
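For reference, a minimal sketch of the configuration involved, assuming Spark 1.4 on YARN (property names as documented; values illustrative). First in spark-defaults.conf:

    # Enable dynamic allocation plus the external shuffle service,
    # so shuffle files outlive discarded executors
    spark.dynamicAllocation.enabled  true
    spark.shuffle.service.enabled    true

and in yarn-site.xml, to run the service inside each NodeManager:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>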

Spark for distributed dbms cluster

2015-06-25 Thread louis.hust
Hi all, for now Spark sits on top of Hadoop. I want to use a database cluster instead of Hadoop, so that the data is distributed across the databases in the cluster. I want to know whether Spark is suitable for this situation. Any ideas will be appreciated!
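One existing hook for this kind of setup is org.apache.spark.rdd.JdbcRDD, which lets each partition pull its own range of rows from a database. A minimal sketch (connection URL, table, and bounds are hypothetical):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    val sc = new SparkContext(new SparkConf().setAppName("db-cluster"))

    // Spark substitutes each partition's lower/upper bounds into the two
    // '?' placeholders, giving one ranged query per partition
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://db-node-1/mydb"),
      "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
      1L, 1000000L, 10,
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

Note this parallelizes reads against the database but does not make Spark's scheduler aware of data locality; co-locating Spark workers with the database nodes is a separate concern.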

Re: GitHub spam from naver user

2015-06-25 Thread Sean Owen
Ultimately I think it's GitHub that has to act to ban the user. I've already asked for him to be blocked from the apache org on GitHub. On Fri, Jun 26, 2015, 1:47 AM Reynold Xin wrote: > We might be able to ask ASF infra to help. Can you create a ticket? > > > On Thu, Jun 25, 2015 at 3:22 AM, Sea

External shuffle service over YARN

2015-06-25 Thread yash datta
Hi devs, Can someone point out if there are any distinct advantages of using the external shuffle service on YARN (it runs in the node manager as an auxiliary service, https://issues.apache.org/jira/browse/SPARK-3797) instead of the default execution in the executor containers? Please also mention if y

Re: custom REST port from spark-defaults.conf

2015-06-25 Thread giive chen
Hi Niranda, I think "spark.master.rest.port" is what you want. Wisely Chen On Tue, Jun 23, 2015 at 2:03 PM, Niranda Perera wrote: > > Hi, > > is there a configuration setting to set a custom port number for the master > REST URL? Can that be included in the spark-defaults.conf? > > cheers > --
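For reference, that setting in spark-defaults.conf would look like this (6066 is the default; the value shown is illustrative):

    # spark-defaults.conf
    spark.master.rest.port   6066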

RE: Force inner join to shuffle the smallest table

2015-06-25 Thread Ulanov, Alexander
The problem is that it shuffles the wrong table, which even compressed won't fit on my disk. Actually, I found the source of the problem, although I could not reproduce it with synthetic data (but it remains true for my original data: big table 2B rows, small table 500K): When I do a join on two fields li
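For context, the shape of the join in question, sketched in the DataFrame API (the DataFrames and column names here are hypothetical):

    // An inner join keyed on two fields; 'big' (~2B rows) and 'small'
    // (~500K rows) stand in for the tables described above
    val joined = big.join(
      small,
      big("campaign") === small("campaign") && big("day") === small("day"),
      "inner")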

Re: GitHub spam from naver user

2015-06-25 Thread Reynold Xin
We might be able to ask ASF infra to help. Can you create a ticket? On Thu, Jun 25, 2015 at 3:22 AM, Sean Owen wrote: > Lots of spam like this popping up: > https://github.com/apache/spark/pull/6972#issuecomment-115130412 > > I've already reported this to GitHub to get it stopped and tried to >

Re: how to implement my own datasource?

2015-06-25 Thread jimfcarroll
I'm not sure if this is what you're looking for, but we have several custom RDD implementations for internal data format/partitioning schemes. The Spark API is really simple and consists primarily of being able to implement 3 simple things: 1) You need a class that extends RDD that's lightweight
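A minimal sketch of that pattern, assuming the standard RDD subclassing API (class and field names are illustrative):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class MyFormatPartition(val index: Int, val path: String) extends Partition

    class MyFormatRDD(sc: SparkContext, paths: Seq[String])
        extends RDD[String](sc, Nil) {

      // 1) describe how the data splits into partitions
      override protected def getPartitions: Array[Partition] =
        Array.tabulate[Partition](paths.length)(i => new MyFormatPartition(i, paths(i)))

      // 2) produce an iterator over one partition's records
      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        val part = split.asInstanceOf[MyFormatPartition]
        scala.io.Source.fromFile(part.path).getLines() // stand-in for a real reader
      }
    }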

Re: how to implement my own datasource?

2015-06-25 Thread Michael Armbrust
I'd suggest looking at the avro data source as an example implementation: https://github.com/databricks/spark-avro I also gave a talk a while ago: https://www.youtube.com/watch?v=GQSNJAzxOr8 > Hi, You can connect via JDBC as described in https://spark.apache.org/docs/latest/sql-programming-guide
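The shape of what spark-avro implements, as a minimal sketch (the relation class, option name, and single-column schema are illustrative, not spark-avro's actual code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Spark SQL looks for a class named DefaultSource in the package
    // passed to .format(...)
    class DefaultSource extends RelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation =
        new MyRelation(parameters("path"))(sqlContext)
    }

    class MyRelation(path: String)(@transient val sqlContext: SQLContext)
        extends BaseRelation with TableScan {
      override def schema: StructType =
        StructType(StructField("line", StringType) :: Nil)
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.textFile(path).map(Row(_))
    }

It would then load the same way spark-avro does, e.g. sqlContext.read.format("com.example.mysource").load("/some/path").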

Re: Problem with version compatibility

2015-06-25 Thread jimfcarroll
Yana and Sean, Thanks for the feedback. I can get it to work a number of ways; I'm just wondering if there's a preferred means. One last question: is there a reason the deployed Spark install doesn't contain the same version of several classes as the Maven dependency? Is this intentional? Thank

Visualize Spark-SQL query plans

2015-06-25 Thread Raajay
Hello, I am trying to understand the code base of Spark SQL, especially the Query Analyzer part. I understand that currently (as of Spark 1.4), the SQL module generates a set of physical plans but executes only the first in the list (ref: core/src/main/scala/org/apache/spark/sql/SQLContext.scala
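For poking at those plans from the REPL, a short sketch (the query itself is hypothetical):

    val df = sqlContext.sql("SELECT a, COUNT(*) FROM t GROUP BY a")
    df.explain(true)                         // parsed, analyzed, optimized, physical
    println(df.queryExecution.analyzed)      // output of the Analyzer
    println(df.queryExecution.optimizedPlan) // after Catalyst's optimizer rules
    println(df.queryExecution.executedPlan)  // the physical plan that actually runs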

Verifying Empirically Number of Performance-Heavy Threads and Parallelism

2015-06-25 Thread jcai
Hello, I am doing some performance testing on Spark. 1. I would like to verify Spark's parallelism empirically. For example, I would like to determine whether the number of "performance-heavy" threads is equal to SPARK_WORKER_CORES in standalone mode at a given moment. Essentially, I want to ensu
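One rough empirical check, sketched under the assumption that executor task-launch threads are what you want to count: run a job and record which JVM threads actually execute tasks.

    // Count the distinct task-launch threads that ran our partitions
    val threadNames = sc.parallelize(1 to 1000, 64).mapPartitions { iter =>
      iter.foreach(_ => ())                    // the (trivial) per-record work
      Iterator(Thread.currentThread().getName) // e.g. "Executor task launch worker-3"
    }.distinct().collect()

    threadNames.foreach(println)

A jstack of the executor process while a CPU-bound job runs is another way to see how many of those threads are actually busy.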

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-25 Thread Andrew Ash
I would guess that many tickets targeted at 1.4.1 were set that way during the tail end of the 1.4.0 voting process as people realized they wouldn't make the .0 release in time. In that case, they were likely aiming for a 1.4.x release, not necessarily 1.4.1 specifically. Maybe creating a "1.4.x"

Re: Problem with version compatibility

2015-06-25 Thread Yana Kadiyska
Jim, I do something similar to you. I mark all dependencies as provided and then make sure to drop the same version of spark-assembly in my war as I have on the executors. I don't remember if dropping it in server/lib works; I think I ran into an issue with that. Would love to know "best practices" wh

Re: Problem with version compatibility

2015-06-25 Thread Sean Owen
Try putting your same Mesos assembly on the classpath of your client then, to emulate what spark-submit does. Don't merely put it on the classpath; also make sure nothing else from Spark is coming from your app. In 1.4 there is the 'launcher' API which makes programmatic acc
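A sketch of that launcher API (paths, master URL, and class names are hypothetical):

    import org.apache.spark.launcher.SparkLauncher

    // Launches a real spark-submit under the hood, so the driver gets the
    // standard classpath rather than whatever the embedding app has loaded
    val spark: Process = new SparkLauncher()
      .setSparkHome("/opt/spark")
      .setMaster("mesos://master:5050")
      .setAppResource("/path/to/analytics-app.jar")
      .setMainClass("com.example.Main")
      .launch()
    spark.waitFor()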

Re: Problem with version compatibility

2015-06-25 Thread jimfcarroll
Ah. I've avoided using spark-submit primarily because our use of Spark is as part of an analytics library that's meant to be embedded in other applications with their own lifecycle management. One of those applications is a REST app running in Tomcat, which will make the use of spark-submit difficul

Re: Python UDF performance at large scale

2015-06-25 Thread Justin Uang
Sweet, filed here: https://issues.apache.org/jira/browse/SPARK-8632 On Thu, Jun 25, 2015 at 3:05 AM Davies Liu wrote: > I'm thinking that the batched synchronous version will be too slow > (with a small batch size) or easy to OOM (with a large batch size). If > it's not that hard, you can give it a

Re: Problem with version compatibility

2015-06-25 Thread Sean Owen
Yes, spark-submit adds all this for you. You don't bring Spark classes into your app. On Thu, Jun 25, 2015, 4:01 PM jimfcarroll wrote: > Hi Sean, > > I'm packaging Spark with my (standalone) driver app using Maven. Any > assemblies that are used on the Mesos workers through extending the > classpath

[SQL] codegen on wide dataset throws StackOverflow

2015-06-25 Thread Peter Rudenko
Hi, I have a small but very wide dataset (2000 columns). I'm trying to optimize a DataFrame pipeline for it, since it performs very poorly compared to the equivalent RDD operations. With spark.sql.codegen=true it throws a StackOverflow: 15/06/25 16:27:16 INFO CacheManager: Partition rdd_12_3 not found, computing it 1
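For anyone reproducing, the flag in question is set like this (before 1.5, codegen is opt-in):

    // Opt in to expression codegen for the SQL/DataFrame pipeline
    sqlContext.setConf("spark.sql.codegen", "true")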

Various forks

2015-06-25 Thread Iulian Dragoș
Could someone point me to the source of the Spark fork used to build genjavadoc-plugin? Even more important would be to know the reasoning behind this fork. Ironically, this hinders my attempts at removing another fork, the Spark REPL fork (and the upgrade to Scala 2.11.7). See here

Re: Problem with version compatibility

2015-06-25 Thread jimfcarroll
Hi Sean, I'm packaging Spark with my (standalone) driver app using Maven. Any assemblies that are used on the Mesos workers, whether through extending the classpath or providing the jars from the driver (via the SparkConf), aren't packaged with Spark (it seems obvious that would be a mistake). I need, for exa
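A sketch of the embedded-driver setup being described (master URL and jar path are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://master:5050")
      .setAppName("embedded-analytics")
      .setJars(Seq("/path/to/my-assembly.jar")) // shipped to the workers
    val sc = new SparkContext(conf)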

GitHub spam from naver user

2015-06-25 Thread Sean Owen
Lots of spam like this popping up: https://github.com/apache/spark/pull/6972#issuecomment-115130412 I've already reported this to GitHub to get it stopped and tried to contact the user, FYI.

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-25 Thread Sean Owen
That makes sense to me -- there's an urgent fix to get out. I missed that part. Not that it really matters, but was that expressed elsewhere? I know we tend to start the RC process even when a few more changes are still in progress, to get a first wave or two of testing done early, knowing that the

Re: Problem with version compatibility

2015-06-25 Thread Sean Owen
-dev +user That all sounds fine, except: are you packaging Spark classes with your app? That's the bit I'm wondering about. You would mark it as a 'provided' dependency in Maven. On Thu, Jun 25, 2015 at 5:12 AM, jimfcarroll wrote: > Hi Sean, > > I'm running a Mesos cluster. My driver app is built
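The 'provided' scope Sean describes would look like this in the pom (artifact and version are illustrative):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.4.0</version>
      <!-- compiled against, but not packaged; the cluster supplies it -->
      <scope>provided</scope>
    </dependency>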

Re: how to implement my own datasource?

2015-06-25 Thread Juan Rodríguez Hortalá
Hi, you can connect via JDBC as described in https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases. Another option is using HadoopRDD and NewHadoopRDD to connect to databases compatible with Hadoop, like HBase; some examples can be found in chapter 5 of "Learning
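A sketch of the documented JDBC route (URL and table name are hypothetical):

    // Returns the table as a DataFrame; add the partitionColumn, lowerBound,
    // upperBound, and numPartitions options for parallel reads
    val jdbcDF = sqlContext.read.format("jdbc")
      .options(Map(
        "url" -> "jdbc:postgresql://db-host/mydb",
        "dbtable" -> "schema.tablename"))
      .load()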

Re: Python UDF performance at large scale

2015-06-25 Thread Davies Liu
I'm thinking that the batched synchronous version will be too slow (with a small batch size) or easy to OOM (with a large batch size). If it's not that hard, you can give it a try. On Wed, Jun 24, 2015 at 4:39 PM, Justin Uang wrote: > Correct, I was running with a batch size of about 100 when I did t