Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread namesuperwood
It seems this patch is not suitable for our problem. https://github.com/apache/spark/compare/master...koeninger:SPARK-17147 wood.super Original message From: namesuperwood namesuperw...@gmail.com To: Justin Miller justin.mil...@protectwise.com Cc: user u...@spark.apache.org; Cody koeninger c...@koeninger.org

Questions about using pyspark 2.1.1 pushing data to kafka

2018-01-23 Thread hsy...@gmail.com
I have questions about using pyspark 2.1.1 to push data to Kafka. I don't see any pyspark streaming API to write data directly to Kafka; if there is one, or an example, please point me to the right page. I implemented my own way, which uses a global Kafka producer and pushes the data picked from
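A common workaround, sketched below under stated assumptions (the third-party kafka-python package, a broker at localhost:9092, and the topic name are all illustrative, not an official PySpark API): create one producer per partition inside foreachPartition instead of a single global producer on the driver.

    # Sketch only: writing RDD/DStream records to Kafka from PySpark 2.1.x,
    # assuming kafka-python is installed and a broker is reachable at
    # localhost:9092 (both are illustrative assumptions).
    from kafka import KafkaProducer

    def send_partition(records):
        # One producer per partition avoids serializing a global producer
        # from the driver to the executors.
        producer = KafkaProducer(bootstrap_servers="localhost:9092")
        for record in records:
            producer.send("my-topic", str(record).encode("utf-8"))
        producer.flush()
        producer.close()

    # For a DStream, apply the same function to each micro-batch RDD:
    # stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))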

Question about accumulator

2018-01-23 Thread hsy...@gmail.com
I have a small application like this: acc = sc.accumulator(5) def t_f(x,): global acc sleep(5) acc += x def f(x): global acc thread = Thread(target = t_f, args = (x,)) thread.start() # thread.join() # without this it doesn't work rdd = sc.parallelize([1,2,4,1])
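For reference, a self-contained sketch of the pattern being described (names and values are illustrative only): accumulator updates made by a background thread inside a task are only propagated if the thread finishes before the task function returns, which is why join() matters.

    # Sketch of the accumulator-plus-thread pattern described above.
    # Without thread.join(), the task can finish before the background
    # thread has added to the accumulator, and the update is lost.
    from threading import Thread
    from time import sleep
    from pyspark import SparkContext

    sc = SparkContext(appName="accumulator-thread-sketch")
    acc = sc.accumulator(5)

    def slow_add(x):
        sleep(5)
        acc.add(x)

    def f(x):
        t = Thread(target=slow_add, args=(x,))
        t.start()
        t.join()  # wait for the update; removing this loses updates

    sc.parallelize([1, 2, 4, 1]).foreach(f)
    print(acc.value)  # 5 + 1 + 2 + 4 + 1 = 13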

Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread namesuperwood
Yes. My spark streaming application works with an uncompacted topic. I will check the patch. wood.super Original message From: Justin Miller justin.mil...@protectwise.com To: namesuperwood namesuperw...@gmail.com Cc: user u...@spark.apache.org; Cody koeninger c...@koeninger.org Sent: Jan 24, 2018 (Wed) 14:23 Subject: Re:

Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread Justin Miller
We appear to be kindred spirits; I’ve recently run into the same issue. Are you running compacted topics? I’ve run into this issue on non-compacted topics as well; it happens rarely but is still a pain. You might check out this patch and the related Spark Streaming Kafka ticket:

uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread namesuperwood
Hi all, kafka version: kafka_2.11-0.11.0.2, spark version: 2.0.1. A topic-partition "adn-tracking,15" in Kafka whose earliest offset is 1255644602 and latest offset is 1271253441. While starting a Spark Streaming job to process the data from the topic, we got an exception with "Got wrong record

Re: Spark Tuning Tool

2018-01-23 Thread Raj Adyanthaya
It's very interesting and I do agree that it will get a lot of traction once made open source. On Mon, Jan 22, 2018 at 9:01 PM, Rohit Karlupia wrote: > Hi, > > I have been working on making the performance tuning of spark applications > bit easier. We have just released the

Re: write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
How can I write a parquet file with min/max statistics? 2018-01-24 10:30 GMT+09:00 Stephen Joung : > Hi, I am trying to use spark sql filter push down. and specially want to > use row group skipping with parquet file. > > And I guessed that I need parquet file with statistics

write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
Hi, I am trying to use Spark SQL filter push down, and especially want to use row group skipping with a parquet file. And I guessed that I need a parquet file with min/max statistics. On the spark master branch I tried to write a single column with "a", "b", "c" to parquet file f1 scala>
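As a rough illustration of the goal (not a confirmation that min/max statistics are written for binary/string columns by any particular Parquet version, which is the open question here), a minimal PySpark sketch that writes a single string column and reads it back with a filter, relying on spark.sql.parquet.filterPushdown (enabled by default) to push the predicate down to the Parquet reader; the path and column name are placeholders.

    # Sketch: write a single string column to Parquet, then read it back with
    # a filter so Spark can push the predicate down to the Parquet reader.
    # Whether row groups are actually skipped depends on the writer producing
    # min/max statistics for the (binary) string column.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-minmax-sketch").getOrCreate()
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")  # default: true

    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col1"])
    df.write.mode("overwrite").parquet("/tmp/f1")

    # The filter below shows up as a pushed filter; check the physical plan
    # (PushedFilters) in the explain() output to confirm.
    spark.read.parquet("/tmp/f1").filter("col1 > 'b'").explain(True)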

Re: Spark Tuning Tool

2018-01-23 Thread Mich Talebzadeh
looking good. do we have a downloadable version of this product? I assume it will be installed on one of the edge nodes? regards, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark Tuning Tool

2018-01-23 Thread रविशंकर नायर
Very good job; in fact a missing link has been addressed. Any plan for porting to GitHub? I would like to contribute. Best, RS On Tue, Jan 23, 2018 at 12:01 AM, Rohit Karlupia wrote: > Hi, > > I have been working on making the performance tuning of spark applications > bit

Re: Spark Tuning Tool

2018-01-23 Thread manish ranjan
This is awesome work, Rohit. Not only as a user, but I will also be super interested in contributing to solving this pain point of my daily work. Manish ~Manish On Mon, Jan 22, 2018 at 9:21 PM, lucas.g...@gmail.com wrote: > I'd be very interested in anything I can send

Re: S3 token times out during data frame "write.csv"

2018-01-23 Thread Vasyl Harasymiv
It is about 400 million rows. S3 automatically chunks the file on its end while writing, so that's fine, e.g. it creates the same file name with alphanumeric suffixes. However, the write session expires due to token expiration. On Tue, Jan 23, 2018 at 5:03 PM, Jörn Franke

Re: S3 token times out during data frame "write.csv"

2018-01-23 Thread Jörn Franke
How large is the file? If it is very large then you should have several partitions for the output anyway. This is also important in case you need to read again from S3 - having several files there enables parallel reading. > On 23. Jan 2018, at 23:58, Vasyl Harasymiv
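A minimal sketch of that suggestion, assuming the df DataFrame and s3_location path from the original question; the partition count is only illustrative and should be tuned so each part file stays small and the upload finishes well within the token's lifetime.

    # Sketch: split the output into many part files before writing to S3.
    # 200 is an illustrative partition count, not a recommendation.
    df.repartition(200).write.csv(s3_location)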

S3 token times out during data frame "write.csv"

2018-01-23 Thread Vasyl Harasymiv
Hi Spark Community, Saving a data frame into a file on S3 using: *df.write.csv(s3_location)* If run for longer than 30 mins, the following error persists: *The provided token has expired. (Service: Amazon S3; Status Code: 400; Error Code: ExpiredToken;)* Potentially, because there is a

Re: I can't save DataFrame from running Spark locally

2018-01-23 Thread Toy
Thanks, I get this error when I switched to s3a:// Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V at

Re: I can't save DataFrame from running Spark locally

2018-01-23 Thread Patrick Alwell
Spark cannot read locally from S3 without an S3a protocol; you’ll more than likely need a local copy of the data or you’ll need to utilize the proper jars to enable S3 communication from the edge to the datacenter.
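For reference, a hedged sketch of what "the proper jars" and s3a configuration might look like when running locally. The package version, paths, bucket, and credentials below are placeholders; the key point is that hadoop-aws must be paired with the aws-java-sdk version it was built against, since a mismatch is a classic cause of the NoSuchMethodError on TransferManager seen above.

    # Sketch: configuring a local SparkSession to talk to S3 over s3a.
    # Example launch (version illustrative, must match the local Hadoop build):
    #   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 my_app.py
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("local-s3a-sketch")
             .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
             .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
             .getOrCreate())

    df = spark.read.text("logs/*.log")  # placeholder local input path
    df.write.mode("overwrite").orc("s3a://my-bucket/output/")  # placeholder bucket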

I can't save DataFrame from running Spark locally

2018-01-23 Thread Toy
Hi, First of all, my Spark application runs fine in AWS EMR. However, I'm trying to run it locally to debug an issue. My application just parses log files, converts them to a DataFrame, then converts to ORC and saves to S3. However, when I run locally I get this error: java.io.IOException:

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
That sounds like it might fit the bill. I'll take a look - thanks! DR On Mon, Jan 22, 2018 at 11:26 PM, vermanurag wrote: > Looking at the description of the problem, window functions may solve your issue. It > allows operation over a window that can include records
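A minimal sketch of the window-function suggestion (the column names and window definition are made up for illustration): each row can reference an aggregate computed over a window of neighboring rows, which avoids hand-maintaining a cache while iterating over the DataFrame.

    # Sketch: a window function lets each row see data from surrounding rows
    # instead of relying on a hand-rolled in-memory cache.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("window-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("u1", 1, 10.0), ("u1", 2, 20.0), ("u2", 1, 5.0)],
        ["user_id", "seq", "value"])

    # Running sum of `value` per user, ordered by `seq`, over all prior rows.
    w = (Window.partitionBy("user_id")
               .orderBy("seq")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    df.withColumn("running_value", F.sum("value").over(w)).show()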

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
Thanks, but broadcast variables won't achieve what I'm looking to do. I'm not trying to just share a one-time set of data across the cluster. Rather, I'm trying to set up a small cache of info that's constantly being updated based on the records in the dataframe. DR On Mon, Jan 22, 2018 at

Re: Spark querying C* in Scala

2018-01-23 Thread Conconscious
Hi list, val dfs = spark   .read   .format("org.apache.spark.sql.cassandra")   .options(Map("cluster" -> "helloCassandra",   "spark.cassandra.connection.host" -> "127.0.0.1",   "spark.input.fetch.size_in_rows" -> "10",   "spark.cassandra.input.consistency.level" -> "ONE",   "table" ->
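For completeness, a hedged PySpark equivalent of the read shown above, using the DataStax spark-cassandra-connector data source; the keyspace, table, and host are placeholders, and the connector package must be on the classpath (e.g. via --packages).

    # Sketch: reading a Cassandra table into a DataFrame through the
    # spark-cassandra-connector; keyspace, table, and host are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-read-sketch")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="my_table")
          .load())

    df.show()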

[Structured streaming] Merging streaming with semi-static datasets

2018-01-23 Thread Christiaan Ras
Hi, I’m currently doing some tests with Structured Streaming and I’m wondering how I can merge the streaming dataset with a more-or-less static dataset (from a JDBC source). By more-or-less static I mean a dataset which does not change that often and could be cached by Spark for a while. It is
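One common shape for this, sketched below with placeholders (the JDBC URL, table, Kafka topic, and join keys are all illustrative): read the JDBC table as a plain static DataFrame and join it with the streaming DataFrame. How often the static side is re-read or cached is exactly the detail being asked about, so the sketch only shows the join itself.

    # Sketch: joining a streaming DataFrame with a static DataFrame read from
    # JDBC. URL, table, topic, and key columns are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-static-join-sketch").getOrCreate()

    # More-or-less static side, read from a JDBC source.
    static_df = (spark.read
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://db-host:5432/mydb")
                 .option("dbtable", "reference_data")
                 .load())

    # Streaming side (a Kafka source, purely for illustration).
    stream_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "localhost:9092")
                 .option("subscribe", "events")
                 .load()
                 .selectExpr("CAST(value AS STRING) AS event_key"))

    # Inner stream-static join on a shared key (ref_key is assumed to exist
    # in reference_data).
    joined = stream_df.join(static_df, stream_df.event_key == static_df.ref_key)

    query = (joined.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()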

Re: external shuffle service in mesos

2018-01-23 Thread igor.berman
Hi Susan, yes, I agree with you regarding resource accounting. Imho, in this case the shuffle service must run on the node no matter what resources are available (the same way we don't account for resources that the "system" takes - the mesos agent, the OS itself, and any other process running on the same machine). One

Re: [Spark structured streaming] Use of (flat)mapgroupswithstate takes long time

2018-01-23 Thread Christiaan Ras
Hi TD, Thanks for taking the time to review my question. Answers to your questions: - How many tasks are being launched in the reduce stage (that is, the stage after the shuffle, that is computing mapGroupsWithState) In the dashboard I count 200 tasks in the stage containing: Exchange ->
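For context (an assumption about where the 200 comes from, not a confirmed diagnosis for this job): 200 is the default value of spark.sql.shuffle.partitions, which sets the number of tasks in the post-shuffle stage that computes (flat)mapGroupsWithState, and it can be tuned if the default is a poor fit for the data volume.

    # The stage after the shuffle runs spark.sql.shuffle.partitions tasks
    # (default 200); 50 below is purely illustrative, assuming `spark` is
    # the active SparkSession.
    spark.conf.set("spark.sql.shuffle.partitions", "50")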