Re: how to merge dataframe write output files

2016-11-10 Thread Jorge Sánchez
Do you have the logs of the containers? This seems like a memory issue. 2016-11-10 7:28 GMT+00:00 lk_spark : > hi, all: > when I call the api df.write.parquet, there are a lot of small files: how > can I merge them into one file? I tried df.coalesce(1).write.parquet, but > it
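
For illustration, a minimal Scala sketch of the coalesce approach being discussed; the paths and the partition count are made up, and coalesce(1) removes all write parallelism, so a small value greater than 1 is usually the better trade-off:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("merge-small-files").getOrCreate()
    val df = spark.read.parquet("/tmp/original_out")     // hypothetical input path

    // Fewer partitions -> fewer output files. The value 20 is purely illustrative.
    df.coalesce(20).write.mode("overwrite").parquet("/tmp/merged_out")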

RE: Strongly Connected Components

2016-11-10 Thread Shreya Agarwal
Yesterday's run died sometime during the night, without any errors. Today, I am running it using GraphFrames instead. It is still spawning new tasks, so there is progress. From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Thursday, November 10, 2016 7:50 PM To: user@spark.apache.org;
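
For reference, a minimal sketch of how strongly connected components is typically invoked through GraphFrames (assuming a SparkSession named spark; the vertex/edge data and the iteration count are placeholders, not the poster's actual job):

    import org.graphframes.GraphFrame

    // Vertices need an "id" column; edges need "src" and "dst" columns.
    val vertices = spark.createDataFrame(Seq((1L, "a"), (2L, "b"))).toDF("id", "name")
    val edges    = spark.createDataFrame(Seq((1L, 2L), (2L, 1L))).toDF("src", "dst")

    val g = GraphFrame(vertices, edges)
    // maxIter bounds the iterative algorithm; 10 is just an example value.
    val components = g.stronglyConnectedComponents.maxIter(10).run()
    components.show()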

Re: Strongly Connected Components

2016-11-10 Thread Felix Cheung
It is possible it is dead. Could you check the Spark UI to see if there is any progress? _ From: Shreya Agarwal > Sent: Thursday, November 10, 2016 12:45 AM Subject: RE: Strongly Connected Components To:

Re: Newbie question - Best way to bootstrap with Spark

2016-11-10 Thread jggg777
A couple options: (1) You can start locally by downloading Spark to your laptop: http://spark.apache.org/downloads.html , then jump into the Quickstart docs: http://spark.apache.org/docs/latest/quick-start.html (2) There is a free Databricks community edition that runs on AWS:

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Silvio Fiorito
You want to look at the bucketBy option when you save the master file out. That way it will be pre-partitioned by the join column, eliminating the shuffle on the larger file. From: Stuart White Date: Thursday, November 10, 2016 at 8:39 PM To: Jörn Franke
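
A sketch of the bucketBy suggestion (master and transactions stand for the two DataFrames; the bucket count, column, and table names are hypothetical); note that bucketed output has to go through saveAsTable rather than a plain path:

    // Write the large "master" side bucketed and sorted on the join key once...
    master.write
      .bucketBy(200, "cust_id")          // 200 buckets on the join column -- example value
      .sortBy("cust_id")
      .saveAsTable("master_bucketed")    // bucketBy requires saveAsTable, not .parquet(path)

    // ...then later jobs joining on that key avoid shuffling the bucketed side.
    val joined = spark.table("master_bucketed").join(transactions, "cust_id")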

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-10 Thread Shixiong(Ryan) Zhu
Yeah, the KafkaRDD cannot be reused. It's better to document it. On Thu, Nov 10, 2016 at 8:26 AM, Ivan von Nagy wrote: > Ok, I have split the KafkaRDD logic so each uses its own group and bumped > the poll.ms to 10 seconds. Anything less than 2 seconds on the poll.ms > ends up

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Stuart White
Yes. In my original question, when I said I wanted to pre-sort the master file, I should have said "pre-sort and pre-partition the file". Years ago, I did this with Hadoop MapReduce. I pre-sorted/partitioned the master file into N partitions. Then, when a transaction file would arrive, I would

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Jörn Franke
Can you split the files beforehand into several files (e.g. by the column you do the join on)? > On 10 Nov 2016, at 23:45, Stuart White wrote: > > I have a large "master" file (~700m records) that I frequently join smaller > "transaction" files to. (The transaction

Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
Sure, will look into the tests. Thank you so much for your time!

Re: Correct SparkLauncher usage

2016-11-10 Thread Marcelo Vanzin
Sorry, it's kinda hard to give any more feedback from just the info you provided. I'd start with some working code like this from Spark's own unit tests:

Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
All I want to do is submit a job, keep getting its state as soon as it changes, and come out once the job is over. I'm sorry to pester you with questions; I'm having a bit of a tough time making this work.

Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
Yeah, that definitely makes sense. I was just trying to make it work somehow. The problem is that it's not calling the listeners at all, so I'm unable to do anything. I just wanted to cross-check it by looping inside. But I get the point, thank you for that! I'm on YARN (cluster mode).

Re: Correct SparkLauncher usage

2016-11-10 Thread Marcelo Vanzin
On Thu, Nov 10, 2016 at 2:43 PM, Mohammad Tariq wrote: > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("Spark App Id [" + handle.getAppId() + "]. State [" + > handle.getState() + "]"); > while(!handle.getState().isFinal()) {

Joining to a large, pre-sorted file

2016-11-10 Thread Stuart White
I have a large "master" file (~700m records) that I frequently join smaller "transaction" files to. (The transaction files have tens of millions of records, so they are too large for a broadcast join). I would like to pre-sort the master file, write it to disk, and then, in subsequent jobs, read the file

Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
Hi Marcelo, After a few changes I got it working. However, I could not understand one thing. I need to call Thread.sleep() and then get the state explicitly in order to make it work. Also, no matter what I do, my launcher program doesn't call stateChanged() or infoChanged(). Here is my code:

Anyone using ProtoBuf for Kafka messages with Spark Streaming for processing?

2016-11-10 Thread shyla deshpande
We are using ProtoBuf for Kafka messages with Spark Streaming because ProtoBuf is already being used in the system. Some sample code and reading material for using ProtoBuf for Kafka messages with Spark Streaming would be helpful. Thanks
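
Not from the thread, but one common pattern is to keep the Kafka value as raw bytes and parse the protobuf message inside the stream. In this sketch, Event stands for a protobuf-generated class, ssc is an existing StreamingContext, and the broker/topic names are placeholders:

    import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",                  // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[ByteArrayDeserializer], // keep the payload as bytes
      "group.id"           -> "proto-consumer"
    )

    val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
      ssc, PreferConsistent, Subscribe[String, Array[Byte]](Seq("events"), kafkaParams))

    // Event is a protobuf-generated class (hypothetical); parseFrom(Array[Byte]) is the
    // standard method protoc generates for Java messages.
    val events = stream.map(record => Event.parseFrom(record.value()))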

Re: Access_Remote_Kerberized_Cluster_Through_Spark

2016-11-10 Thread KhajaAsmath Mohammed
Hi Ajay, I was able to resolve it by adding the YARN user principal. Here is the complete code: def main(args: Array[String]) { // create Spark context with Spark configuration val cmdLine = Parse.commandLine(args) val configFile = cmdLine.getOptionValue("c") val propertyConfiguration

Re: UDF with column value comparison fails with PySpark

2016-11-10 Thread Perttu Ranta-aho
So it was something obvious, thanks! -Perttu On Thu, 10 November 2016 at 21.19, Davies Liu wrote: > On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho > wrote: > > Hello, > > > > I want to create a UDF which modifies one column value depending on >

Re: type-safe join in the new DataSet API?

2016-11-10 Thread Michael Armbrust
You can groupByKey and then cogroup. On Thu, Nov 10, 2016 at 10:44 AM, Yang wrote: > the new DataSet API is supposed to provide type safety and type checks at > compile time https://spark.apache.org/docs/latest/structured- >
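
A small sketch of the groupByKey-then-cogroup suggestion, with made-up case classes standing in for the poster's types:

    import org.apache.spark.sql.SparkSession

    case class Person(id: Long, name: String)
    case class Order(personId: Long, amount: Double)

    val spark = SparkSession.builder.master("local[*]").appName("typed-cogroup").getOrCreate()
    import spark.implicits._

    val people = Seq(Person(1, "Ann"), Person(2, "Bob")).toDS()
    val orders = Seq(Order(1, 9.5), Order(1, 3.0)).toDS()

    // Typed "join": group each Dataset by the key, then combine the matching groups.
    val joined = people.groupByKey(_.id).cogroup(orders.groupByKey(_.personId)) {
      (id, ps, os) =>
        val rights = os.toSeq                    // iterators can only be traversed once
        for (p <- ps; o <- rights) yield (p, o)  // (Person, Order) pairs, checked at compile time
    }
    joined.show()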

Re: UDF with column value comparison fails with PySpark

2016-11-10 Thread Davies Liu
On Thu, Nov 10, 2016 at 11:14 AM, Perttu Ranta-aho wrote: > Hello, > > I want to create a UDF which modifies one column value depending on the value > of some other column. But the Python version of the code always fails in the column > value comparison. Below are simple examples; the Scala

UDF with column value comparison fails with PySpark

2016-11-10 Thread Perttu Ranta-aho
Hello, I want to create a UDF which modifies one column value depending on the value of some other column. But the Python version of the code always fails in the column value comparison. Below are simple examples; the Scala version works as expected but the Python version throws an exception. Am I missing something

type-safe join in the new DataSet API?

2016-11-10 Thread Yang
The new Dataset API is supposed to provide type safety and type checks at compile time https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations It does this indeed in a lot of places, but I found it still doesn't have a type-safe join: val ds1 =

Spark Streaming: question on sticky session across batches ?

2016-11-10 Thread Manish Malhotra
Hello Spark Devs/Users, I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every batch (say 2 mins), data needs to go to the same reducer node after grouping by key. The underlying storage is Cassandra, not HDFS. This is a map-reduce job, where we are also trying to use the
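
Spark does not promise to pin a partition to a particular node, but reusing a single partitioner instance at least keeps the key-to-partition mapping identical from batch to batch. A rough sketch (key/value types and the partition count are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.dstream.DStream

    // One partitioner reused for every batch -> the same key always maps to the same
    // partition id. Which executor runs that partition is still up to the scheduler.
    val partitioner = new HashPartitioner(64)   // example partition count

    def groupPerBatch(events: DStream[(String, String)]): DStream[(String, Iterable[String])] =
      events.transform(rdd => rdd.groupByKey(partitioner))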

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-10 Thread Ivan von Nagy
Ok, I have split the KafkaRDD logic so each uses its own group and bumped the poll.ms to 10 seconds. Anything less than 2 seconds on the poll.ms ends up with a timeout and an exception, so I am still perplexed by that one. The new error I am getting now is a `ConcurrentModificationException` when
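
For context, a sketch of the two changes described (values are illustrative; the poll timeout key shown is, as far as I recall, the spark-streaming-kafka-0-10 setting):

    import java.util.UUID
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf

    // A distinct group.id per KafkaRDD / stream, so the consumers don't interfere.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",                // hypothetical broker
      "group.id"           -> s"etl-${UUID.randomUUID()}",  // unique group per logical consumer
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer]
    )

    // Raise the consumer poll timeout (milliseconds) well above the default.
    val conf = new SparkConf()
      .set("spark.streaming.kafka.consumer.poll.ms", "10000")  // ~10 seconds, as in the thread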

will spark aggregate and treeaggregate cause a shuffle action?

2016-11-10 Thread codlife
Hi Users: I am in doubt about whether Spark's aggregate and treeAggregate cause a shuffle action. If not, how does Spark do the combine operation internally? Any tips are appreciated, thank you!

Re: If we run sc.textfile(path,xxx) many times, will the elements be the same in each partition

2016-11-10 Thread Prashant Sharma
+user -dev Since the same hash-based partitioner is in action by default, in my understanding the same partitioning will happen every time. Thanks, On Nov 10, 2016 7:13 PM, "WangJianfei" wrote: > Hi Devs: > If I run sc.textFile(path, xxx) many times, will the

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread Cody Koeninger
The basic structured streaming source for Kafka is already committed to master; build it and try it out. If you're already using Kafka, I don't really see much point in trying to put Akka in between it and Spark. On Nov 10, 2016 02:25, "vincent gromakowski" wrote:
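
For reference, a minimal sketch of the structured streaming Kafka source being referred to (assuming a SparkSession named spark; broker and topic are placeholders):

    // Read Kafka as a streaming DataFrame; key and value arrive as binary columns.
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .load()

    val decoded = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")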

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-10 Thread Mich Talebzadeh
Sounds like the only option Informatica has for Hadoop is a connector to Hive, and as I read it, it connects to the Hive Thrift server. The tool is called Informatica Cloud Connector and it is an add-on, which means it is not part of the standard Informatica offering. Anyway, if we can use Informatica to

RE: Re:RE: how to merge dataframe write output files

2016-11-10 Thread Mendelson, Assaf
As people stated, when you coalesce to 1 partition you basically lose all parallelism; however, you can coalesce to a different value. If, for example, you coalesce to 20, then you can parallelize up to 20 different tasks. You have a total of 4 executors, with 2 cores each. This means that
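
To make the trade-off concrete, a small sketch contrasting the two ways to end up with roughly 20 output files (df stands for the DataFrame being written; the paths are made up):

    // coalesce narrows existing partitions without a full shuffle: cheap, but the
    // resulting partitions can be unevenly sized.
    df.coalesce(20).write.parquet("/tmp/out_coalesce")

    // repartition does a full shuffle: more expensive, but gives evenly sized
    // partitions and keeps up to 20 tasks runnable on the 4 x 2-core executors.
    df.repartition(20).write.parquet("/tmp/out_repartition")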

RE: Strongly Connected Components

2016-11-10 Thread Shreya Agarwal
Bump. Anyone? It's been running for 10 hours now. No results. From: Shreya Agarwal Sent: Tuesday, November 8, 2016 9:05 PM To: user@spark.apache.org Subject: Strongly Connected Components Hi, I am running this on a graph with >5B edges and >3B edges and have 2 questions - 1. What is the

Re: SparkLauncher 2.0.1 version working inconsistently in yarn-client mode

2016-11-10 Thread Elkhan Dadashov
Thanks Marcelo. I changed the code to use a CountDownLatch, and it works as expected. ...final CountDownLatch countDownLatch = new CountDownLatch(1); SparkAppListener sparkAppListener = new SparkAppListener(countDownLatch); SparkAppHandle appHandle =
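
A fuller sketch of the CountDownLatch pattern being described (app jar, main class, and master are placeholders): the listener only signals state changes, and the waiting happens in the caller's thread:

    import java.util.concurrent.CountDownLatch
    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val latch = new CountDownLatch(1)

    val listener = new SparkAppHandle.Listener {
      override def stateChanged(handle: SparkAppHandle): Unit = {
        println(s"State changed to ${handle.getState}")
        if (handle.getState.isFinal) latch.countDown()  // signal only -- never block in the callback
      }
      override def infoChanged(handle: SparkAppHandle): Unit = ()
    }

    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // hypothetical jar
      .setMainClass("com.example.MyApp")    // hypothetical main class
      .setMaster("yarn")
      .startApplication(listener)

    latch.await()                           // wait until the application reaches a final state
    println(s"Final state: ${handle.getState}")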

RE: Re:RE: how to merge dataframe write output files

2016-11-10 Thread Shreya Agarwal
Your coalesce should technically work. One thing to check would be overhead memory; you should configure it as 10% of executor memory. Also, you might need to increase maxResultSize. The data looks fine for the cluster unless your join yields >6G worth of data. A few things to try - 1.
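
A sketch of the two settings mentioned, with illustrative values (on YARN in Spark 2.0 the overhead key is, to my recollection, spark.yarn.executor.memoryOverhead, specified in MB):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Off-heap/overhead allowance per executor, roughly 10% of executor memory.
      .set("spark.yarn.executor.memoryOverhead", "2048")  // MB; example for a ~20 GB executor
      // Allow larger results to be collected back to the driver.
      .set("spark.driver.maxResultSize", "4g")            // example value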

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread vincent gromakowski
I have already integrated common actors. I am also interested, especially to see how we can achieve end-to-end back pressure. 2016-11-10 8:46 GMT+01:00 shyla deshpande : > I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, > Spark Streaming and

Re: Unable to launch Python Web Application on Spark Cluster

2016-11-10 Thread Daniel van der Ende
Hi Anjali, It would help to see the code. But more importantly: why do you want to deploy a web application on a Spark cluster? Spark is meant for distributed, in-memory computations. I don't know what your application is doing, but it would make more sense to run it outside the Spark cluster,

Re:RE: how to merge dataframe write output files

2016-11-10 Thread lk_spark
Thank you for the reply, Shreya. It's because the files are too small and HDFS doesn't like small files. For your question: yes, I want to create an external table on the parquet file folder. And how do I use fragmented files as you mention? The test case is as below: bin/spark-shell --master yarn

Fwd: Unable to launch Python Web Application on Spark Cluster

2016-11-10 Thread anjali gautam
-- Forwarded message -- From: anjali gautam Date: Thu, Nov 10, 2016 at 12:01 PM Subject: Unable to launch Python Web Application on Spark Cluster To: user@spark.apache.org Hello Everyone, I have developed a web application (say abc) in Python using