Cluster to Cluster communication

2017-02-08 Thread Vasu Gourabathina
All, This is a theoretical question at this point in time. I wanted to pose it before spending too much time figuring it out. Apologies in advance if this is not the right forum for this question. Use-case: - Migration from one cluster manager to another (for ex. Spark stand-alone to

Re: does persistence required for single action ?

2017-02-08 Thread Jon Gregg
Hard to say without more context around where your job is stalling, what file sizes you're working with etc. Best answer would be to test and see, but in general for simple DAGs, I find that not persisting anything typically runs the fastest. If I persist anything it would be rdd6 because it took

Issues launching job dynamically in Java

2017-02-08 Thread yohann jardin
Hello everyone, I'm trying to develop a WebService launching jobs. The WebService is based on Tomcat, and I'm working with Spark 2.1.0. The SparkLauncher provides two methods to launch the job: first SparkLauncher.launch(), and SparkLauncher.startApplication(SparkAppHandle.Listener...

Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread Xiaomeng Wan
You could also try pivot. On 7 February 2017 at 16:13, Everett Anderson wrote: > > > On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust > wrote: > >> I think the fastest way is likely to use a combination of conditionals >> (when / otherwise),
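To make the pivot suggestion concrete, here is a minimal pure-Python sketch of what "un-exploding" via a pivot does: collapse multiple (id, key, value) rows into one row per id, with each key spread out into its own column. The column names are hypothetical, and in Spark SQL this would correspond roughly to `df.groupBy("id").pivot("key").agg(first("value"))`.

```python
def pivot(rows):
    """rows: list of (id, key, value) tuples -> {id: {key: value}}."""
    out = {}
    for rid, key, value in rows:
        # Each distinct key becomes a "column" of the id's single row.
        out.setdefault(rid, {})[key] = value
    return out

exploded = [
    (1, "home_phone", "555-1111"),
    (1, "work_phone", "555-2222"),
    (2, "home_phone", "555-3333"),
]

denormalized = pivot(exploded)
# {1: {'home_phone': '555-1111', 'work_phone': '555-2222'},
#  2: {'home_phone': '555-3333'}}
```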

Union of DStream and RDD

2017-02-08 Thread Amit Sela
Hi all, I'm looking to union a DStream and RDD into a single stream. One important note is that the RDD has to be added to the DStream just once. Ideas ? Thanks, Amit

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-08 Thread Gourav Sengupta
Hi, I am not quite sure of your use case here, but I would use spark-submit and submit sequential jobs as steps to an EMR cluster. Regards, Gourav On Wed, Feb 8, 2017 at 11:10 AM, Cosmin Posteuca wrote: > I tried to run some test on EMR on yarn cluster mode. > > I

Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread Everett Anderson
On Wed, Feb 8, 2017 at 1:14 PM, ayan guha wrote: > Will a sql solution will be acceptable? > I'm very curious to see how it'd be done in raw SQL if you're up for it! I think the 2 programmatic solutions so far are viable, though, too. (By the way, thanks everyone for the

Re: Dynamic resource allocation to Spark on Mesos

2017-02-08 Thread Michael Gummelt
Sun, are you using marathon to run the shuffle service? On Tue, Feb 7, 2017 at 7:36 PM, Sun Rui wrote: > Yi Jan, > > We have been using Spark on Mesos with dynamic allocation enabled, which > works and improves the overall cluster utilization. > > In terms of job, do you

Re: Un-exploding / denormalizing Spark SQL help

2017-02-08 Thread ayan guha
Will a SQL solution be acceptable? On Thu, 9 Feb 2017 at 4:01 am, Xiaomeng Wan wrote: > You could also try pivot. > > On 7 February 2017 at 16:13, Everett Anderson > wrote: > > > > On Tue, Feb 7, 2017 at 2:21 PM, Michael Armbrust

Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread u...@moosheimer.com
Dear devs, is it possible to use Spark 2.0.2 Scala 2.11 and consume messages from kafka server 0.10.0.2 running on Scala 2.10? I tried this the last two days by using createDirectStream and can't get any messages out of Kafka?! I'm using HDP 2.5.3 running kafka_2.10-0.10.0.2.5.3.0-37 and Spark

Structured Streaming. S3 To Google BigQuery

2017-02-08 Thread Sam Elamin
Hi All Thank you all for the amazing support! I have written a BigQuery connector for structured streaming that you can find here I just tweeted about it and would really appreciate it if you

Re: Spark 2 - Creating datasets from dataframes with extra columns

2017-02-08 Thread Don Drake
Please see: https://issues.apache.org/jira/browse/SPARK-19477 Thanks. -Don On Wed, Feb 8, 2017 at 6:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > i checked it, it seems is a bug. do you create a jira now plesae? > > ---Original--- > *From:* "Don Drake" > *Date:* 2017/2/7

Does Spark consider the free space of hard drive of the data nodes?

2017-02-08 Thread Benyi Wang
We are trying to add 6 spare servers to our existing cluster. Those machines have more CPU cores and more memory. Unfortunately, 3 of them can only use 2.5” hard drives, and the total size of each node is about 7TB. The other 3 nodes can only have 3.5” hard drives, but have 48TB each. In addition,

Strange behavior with 'not' and filter pushdown

2017-02-08 Thread Alexi Kostibas
Hi, I have an application where I’m filtering data with SparkSQL with simple WHERE clauses. I also want the ability to show the unmatched rows for any filter, and so am wrapping the previous clause in `NOT()` to get the inverse. Example: Filter: username is not null Inverse filter:
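The surprise with wrapping a filter in NOT() often comes from SQL's three-valued logic rather than pushdown alone. A small pure-Python sketch (simulating SQL NULL as None, purely for illustration) shows why a predicate and its NOT() need not partition the data between them:

```python
# SQL three-valued logic: a predicate can be True, False, or NULL
# (unknown). A WHERE clause keeps a row only when the predicate is
# True, so rows where it is NULL are dropped by BOTH the filter and
# its NOT() inverse.

def sql_not(v):
    return None if v is None else (not v)

def eq(x, y):
    # SQL equality: any comparison involving NULL yields NULL.
    return None if (x is None or y is None) else (x == y)

rows = ["alice", None, "bob"]

matched = [r for r in rows if eq(r, "alice") is True]
inverse = [r for r in rows if sql_not(eq(r, "alice")) is True]
# matched == ["alice"], inverse == ["bob"]: the NULL row appears in
# neither set, so matched + inverse != rows.
```

This is why an inverse filter built with NOT() can silently lose NULL rows; whether a pushed-down Parquet filter also drops NULLs is a separate question worth testing.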

[Spark 2.1.0] Spark SQL return correct count, but NULL on all fields

2017-02-08 Thread Babak Alipour
Hi everyone, I'm using Spark with HiveSupport enabled, the data is stored in parquet format in a fixed location. I just downloaded Spark 2.1.0 and it broke Spark-SQL queries. I can do count(*) and it returns the correct count, but all columns show as "NULL". It worked fine on 1.6 & 2.0.x. I'm

Counting things in Spark Structured Streaming

2017-02-08 Thread Timothy Chan
I would like to count running totals for events coming in since a given date for a given user. How would I go about doing this? For example, we have user data coming in, we'd like to score that data, then keep running totals on that score, since a given date. Specifically, I always want to score
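As a concrete (hypothetical) sketch of the desired behavior, here is a pure-Python version of per-user running totals for events on or after a start date; the field names and scoring are assumptions. In Structured Streaming this corresponds roughly to a groupBy on the user with a sum aggregation in update output mode.

```python
from datetime import date

def running_totals(events, since):
    """events: iterable of (user, event_date, score) in arrival order.
    Yields (user, running_total) after each qualifying event."""
    totals = {}
    for user, when, score in events:
        if when >= since:  # only count events since the given date
            totals[user] = totals.get(user, 0) + score
            yield user, totals[user]

stream = [
    ("u1", date(2017, 2, 1), 3),
    ("u1", date(2017, 2, 2), 4),
    ("u2", date(2017, 1, 1), 9),  # before the cutoff, ignored
]
out = list(running_totals(stream, date(2017, 2, 1)))
# [("u1", 3), ("u1", 7)]
```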

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-08 Thread Egor Pahomov
Jacek, you mean http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter ? I do not understand how to use it, since it passes every value separately, not every partition. And adding to the table value by value would not work 2017-02-07 12:10 GMT-08:00 Jacek

Re: Union of DStream and RDD

2017-02-08 Thread Egor Pahomov
Just guessing here, but would http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources "*Queue of RDDs as a Stream*" work? Basically create DStream from your RDD and than union with other DStream. 2017-02-08 12:32 GMT-08:00 Amit Sela : > Hi all, >
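Conceptually, a DStream is a sequence of micro-batches, so the queueStream suggestion amounts to modelling the one-time RDD as a single extra batch merged into the ongoing stream exactly once. A tiny pure-Python sketch of that idea (plain lists stand in for RDDs, an assumption for illustration); note that, as the follow-up in this thread points out, queueStream-based streams do not support checkpointing.

```python
import itertools

def one_shot_union(one_time_batch, batches):
    """Yield the one-time batch first, then the ongoing stream's batches.
    The one-time batch is consumed exactly once."""
    return itertools.chain([one_time_batch], batches)

stream = iter([[4, 5], [6]])  # ongoing micro-batches
merged = list(one_shot_union([1, 2, 3], stream))
# [[1, 2, 3], [4, 5], [6]]
```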

Re: Dynamic resource allocation to Spark on Mesos

2017-02-08 Thread Sun Rui
Michael, No. We directly launch the external shuffle service by specifying a larger heap size than default on each worker node. It is observed that the processes are quite stable. > On Feb 9, 2017, at 05:21, Michael Gummelt wrote: > > Sun, are you using marathon to run

Re: [Spark-SQL] Hive support is required to select over the following tables

2017-02-08 Thread Egor Pahomov
Just guessing here, but have you built your Spark with "-Phive"? By the way, which version of Zeppelin? 2017-02-08 5:13 GMT-08:00 Daniel Haviv : > Hi, > I'm using Spark 2.1.0 on Zeppelin. > > I can successfully create a table but when I try to select from it I fail: >
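For reference, a hedged sketch of building Spark 2.x with the Hive profiles enabled (standard profile names from the Spark build documentation; adjust versions and flags to your environment):

```shell
# Build Spark with Hive and Hive Thrift Server support.
./build/mvn -Phive -Phive-thriftserver -DskipTests clean package
```

Pre-built Spark distributions from the download page generally include Hive support already; a locally built Spark without these profiles will raise "Hive support is required" errors like the one above.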

Re: Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread Cody Koeninger
Pretty sure there was no 0.10.0.2 release of apache kafka. If that's a hortonworks modified version you may get better results asking in a hortonworks specific forum. Scala version of kafka shouldn't be relevant either way though. On Wed, Feb 8, 2017 at 5:30 PM, u...@moosheimer.com

Spark stream parallel streaming

2017-02-08 Thread Udbhav Agarwal
Hi, I am using Spark Streaming for processing messages from Kafka for real-time analytics. I am trying to fine-tune my streaming process. Currently my Spark Streaming system reads a batch of messages from a Kafka topic and processes each message one at a time. I have set properties in spark
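For context, a hedged sketch of properties sometimes tuned for Spark Streaming throughput (the names are real Spark settings; the values are illustrative only, not recommendations):

```
# Allow jobs from more than one batch to run at once (experimental).
spark.streaming.concurrentJobs 2
# Cap ingest rate per Kafka partition for the direct stream.
spark.streaming.kafka.maxRatePerPartition 1000
# Let Spark adapt the ingest rate to the processing speed.
spark.streaming.backpressure.enabled true
```

Per-message parallelism within a batch is usually governed by the number of partitions in the RDDs, not by these settings.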

MultiLabelBinarizer

2017-02-08 Thread Madabhattula Rajesh Kumar
Hi, Is there an equivalent of the below preprocessing function in Spark ML? from sklearn.preprocessing import MultiLabelBinarizer Regards, Rajesh

Re: MultiLabelBinarizer

2017-02-08 Thread Georg Heiler
I believe only http://stackoverflow.com/questions/34167105/using-spark-mls-onehotencoder-on-multiple-columns is currently possible, i.e. using multiple StringIndexers and then multiple OneHotEncoders, one per column. Madabhattula Rajesh Kumar wrote on Thu., 9 Feb 2017 at
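To make the goal concrete, here is a pure-Python sketch of what sklearn's MultiLabelBinarizer does: map each row's set of labels to a 0/1 vector over the sorted vocabulary of all labels seen. The label values are made up for illustration; in Spark ML, as noted above, this is typically approximated with a StringIndexer plus OneHotEncoder per column.

```python
def multilabel_binarize(rows):
    """rows: list of label lists -> (sorted classes, 0/1 indicator matrix)."""
    classes = sorted({label for row in rows for label in row})
    matrix = [[1 if c in set(row) else 0 for c in classes] for row in rows]
    return classes, matrix

classes, matrix = multilabel_binarize([["sci-fi", "thriller"], ["comedy"]])
# classes == ['comedy', 'sci-fi', 'thriller']
# matrix  == [[0, 1, 1], [1, 0, 0]]
```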

[ANNOUNCE] Apache SystemML 0.12.0-incubating released.

2017-02-08 Thread Arvind Surve
The Apache SystemML team is pleased to announce the release of Apache SystemML version 0.12.0-incubating. Apache SystemML provides declarative large-scale machine learning (ML) that aims at a flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from

Re: Union of DStream and RDD

2017-02-08 Thread Amit Sela
Not with checkpointing. On Thu, Feb 9, 2017, 04:58 Egor Pahomov wrote: > Just guessing here, but would > http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources > "*Queue of RDDs as a Stream*" work? Basically create DStream from your > RDD and

Dataset count on database or parquet

2017-02-08 Thread Rohit Verma
Hi, Which of the following is the better approach for too many values in the database?

final Dataset dataset = spark.sqlContext().read()
    .format("jdbc")
    .option("url", params.getJdbcUrl())
    .option("driver", params.getDriver())

rdd save to orc file happened problems

2017-02-08 Thread 446463...@qq.com
Hi: when I ran my program on Spark today, I met an error that had not occurred before. I picked the stack snippet below ``` 17/02/08 17:24:14 ERROR InsertIntoHadoopFsRelation: Aborting job. org.apache.spark.SparkException:

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-08 Thread Cosmin Posteuca
I tried to run some tests on EMR in yarn cluster mode. I have a cluster with 16 cores (8 processors with 2 threads each). If I run one job (using 5 cores) it takes 90 seconds; if I run 2 jobs simultaneously, both finish in 170 seconds; if I run 3 jobs simultaneously, all three finish in 240 seconds. If I
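Some quick arithmetic on the timings reported above: three concurrent 5-core jobs use 15 of the 16 cores, so perfectly isolated jobs would each still take about 90 seconds. The measured wall times instead suggest contention (or serialized scheduling), which is what the replies about YARN queue and scheduler configuration address.

```python
single = 90                      # seconds for 1 job alone
measured = {1: 90, 2: 170, 3: 240}  # wall time for N concurrent jobs

# Speedup vs. running the same N jobs sequentially (ideal would be N).
speedup = {n: (n * single) / t for n, t in measured.items()}
# e.g. 3 jobs: 270s of sequential work done in 240s -> speedup 1.125,
# far below the ideal 3.0, so the jobs are not running fully in parallel.
```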

JavaBean serialization with cyclic bean attributes

2017-02-08 Thread Pascal Stammer
Hi, we have a small problem with DataFrame creation. In our current project we have a fairly complex data model corresponding to documents and the word positions in them. Over the last month we refactored our architecture to use Neo4j for persistence. Before that we used PostgreSQL and

FINAL REMINDER: CFP for ApacheCon closes February 11th

2017-02-08 Thread Rich Bowen
Dear Apache Enthusiast, This is your FINAL reminder that the Call for Papers (CFP) for ApacheCon Miami is closing this weekend - February 11th. This is your final opportunity to submit a talk for consideration at this event. This year, we are running several mini conferences in conjunction with

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-08 Thread Jörn Franke
The resource management in yarn cluster mode is YARN's task, so it depends on how you configured the queues and the scheduler there. > On 8 Feb 2017, at 12:10, Cosmin Posteuca wrote: > > I tried to run some test on EMR on yarn cluster mode. > > I have a cluster with

Re: Spark 2 - Creating datasets from dataframes with extra columns

2017-02-08 Thread 萝卜丝炒饭
I checked it; it seems to be a bug. Could you create a JIRA now, please? ---Original--- From: "Don Drake" Date: 2017/2/7 01:26:59 To: "user"; Subject: Re: Spark 2 - Creating datasets from dataframes with extra columns This seems like a bug to me, the schemas

[Spark-SQL] Hive support is required to select over the following tables

2017-02-08 Thread Daniel Haviv
Hi, I'm using Spark 2.1.0 on Zeppelin. I can successfully create a table but when I try to select from it I fail: spark.sql("create table foo (name string)") res0: org.apache.spark.sql.DataFrame = [] spark.sql("select * from foo") org.apache.spark.sql.AnalysisException: Hive support is required

Practical configuration to run LSH in Spark 2.1.0

2017-02-08 Thread nguyen duc Tuan
Hi everyone, since Spark 2.1.0 introduces LSH ( http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing), we want to use LSH to find approximate nearest neighbors. Basically, we have a dataset with about 7M rows. We want to use cosine distance to measure the similarity