Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread namesuperwood
It seems this patch is not suitable for our problem. https://github.com/apache/spark/compare/master...koeninger:SPARK-17147 wood.super Original message From: namesuperwood namesuperw...@gmail.com To: Justin Miller justin.mil...@protectwise.com Cc: user u...@spark.apache.org; Cody koeninger c...@koeninger.org

Questions about using pyspark 2.1.1 pushing data to kafka

2018-01-23 Thread hsy...@gmail.com
I have questions about using pyspark 2.1.1 to push data to Kafka. I don't see any pyspark streaming API to write data directly to Kafka; if there is one, or an example, please point me to the right page. I implemented my own way, which uses a global Kafka producer and pushes the data picked from
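A common workaround, sketched below under stated assumptions (the third-party kafka-python package, a broker at localhost:9092, and the topic name are all illustrative, not an official PySpark API): create one producer per partition inside foreachPartition instead of a single global producer on the driver.

    # Sketch only: writing RDD/DStream records to Kafka from PySpark 2.1.x,
    # assuming kafka-python is installed and a broker is reachable at
    # localhost:9092 (both are illustrative assumptions).
    from kafka import KafkaProducer

    def send_partition(records):
        # One producer per partition avoids serializing a global producer
        # from the driver to the executors.
        producer = KafkaProducer(bootstrap_servers="localhost:9092")
        for record in records:
            producer.send("my-topic", str(record).encode("utf-8"))
        producer.flush()
        producer.close()

    # For a DStream, apply the same function to each micro-batch RDD:
    # stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))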

Question about accumulator

2018-01-23 Thread hsy...@gmail.com
I have a small application like this: acc = sc.accumulator(5) def t_f(x,): global acc sleep(5) acc += x def f(x): global acc thread = Thread(target = t_f, args = (x,)) thread.start() # thread.join() # without this it doesn't work rdd = sc.parallelize([1,2,4,1])
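For reference, a self-contained sketch of the pattern being described (names and values are illustrative only): accumulator updates made by a background thread inside a task are only propagated if the thread finishes before the task function returns, which is why join() matters.

    # Sketch of the accumulator-plus-thread pattern described above.
    # Without thread.join(), the task can finish before the background
    # thread has added to the accumulator, and the update is lost.
    from threading import Thread
    from time import sleep
    from pyspark import SparkContext

    sc = SparkContext(appName="accumulator-thread-sketch")
    acc = sc.accumulator(5)

    def slow_add(x):
        sleep(5)
        acc.add(x)

    def f(x):
        t = Thread(target=slow_add, args=(x,))
        t.start()
        t.join()  # wait for the update; removing this loses updates

    sc.parallelize([1, 2, 4, 1]).foreach(f)
    print(acc.value)  # 5 + 1 + 2 + 4 + 1 = 13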

Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread namesuperwood
Yes. My spark streaming application works with an uncompacted topic. I will check the patch. wood.super Original message From: Justin Miller justin.mil...@protectwise.com To: namesuperwood namesuperw...@gmail.com Cc: user u...@spark.apache.org; Cody koeninger c...@koeninger.org Sent: Jan 24, 2018 (Wed) 14:23 Subject: Re:

Re: uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread Justin Miller
We appear to be kindred spirits; I’ve recently run into the same issue. Are you running compacted topics? I’ve run into this issue on non-compacted topics as well; it happens rarely but is still a pain. You might check out this patch and the related Spark Streaming Kafka ticket:

uncontinuous offset in kafka will cause the spark streaming failure

2018-01-23 Thread namesuperwood
Hi all, kafka version: kafka_2.11-0.11.0.2, spark version: 2.0.1. A topic-partition "adn-tracking,15" in Kafka whose earliest offset is 1255644602 and latest offset is 1271253441. While starting a Spark Streaming job to process the data from the topic, we got an exception with "Got wrong record

Re: Spark Tuning Tool

2018-01-23 Thread Raj Adyanthaya
It's very interesting and I do agree that it will get a lot of traction once made open source. On Mon, Jan 22, 2018 at 9:01 PM, Rohit Karlupia wrote: > Hi, > > I have been working on making the performance tuning of spark applications > bit easier. We have just released the

Re: write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
How can I write a parquet file with min/max statistics? 2018-01-24 10:30 GMT+09:00 Stephen Joung : > Hi, I am trying to use spark sql filter push down. and specially want to > use row group skipping with parquet file. > > And I guessed that I need parquet file with statistics

write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
Hi, I am trying to use Spark SQL filter push down, and especially want to use row group skipping with a parquet file. And I guessed that I need a parquet file with min/max statistics. On the spark master branch I tried to write a single column with "a", "b", "c" to parquet file f1 scala>
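As a rough illustration of the goal (not a confirmation that min/max statistics are written for binary/string columns by any particular Parquet version, which is the open question here), a minimal PySpark sketch that writes a single string column and reads it back with a filter, relying on spark.sql.parquet.filterPushdown (enabled by default) to push the predicate down to the Parquet reader; the path and column name are placeholders.

    # Sketch: write a single string column to Parquet, then read it back with
    # a filter so Spark can push the predicate down to the Parquet reader.
    # Whether row groups are actually skipped depends on the writer producing
    # min/max statistics for the (binary) string column.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-minmax-sketch").getOrCreate()
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")  # default: true

    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col1"])
    df.write.mode("overwrite").parquet("/tmp/f1")

    # The filter below shows up as a pushed filter; check the physical plan
    # (PushedFilters) in the explain() output to confirm.
    spark.read.parquet("/tmp/f1").filter("col1 > 'b'").explain(True)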

Re: Spark Tuning Tool

2018-01-23 Thread Mich Talebzadeh
looking good. do we have a downloadable version of this product? I assume it will be installed on one of the edge nodes? regards, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark Tuning Tool

2018-01-23 Thread रविशंकर नायर
Very good job; in fact a missing link has been addressed. Any plan for porting to GitHub? I would like to contribute. Best, RS On Tue, Jan 23, 2018 at 12:01 AM, Rohit Karlupia wrote: > Hi, > > I have been working on making the performance tuning of spark applications > bit

Re: Spark Tuning Tool

2018-01-23 Thread manish ranjan
This is awesome work, Rohit. Not only as a user, but I will also be super interested in contributing to solving this pain point of my daily work. Manish ~Manish On Mon, Jan 22, 2018 at 9:21 PM, lucas.g...@gmail.com wrote: > I'd be very interested in anything I can send

Re: S3 token times out during data frame "write.csv"

2018-01-23 Thread Vasyl Harasymiv
It is about 400 million rows. S3 automatically chunks the file on its end while writing, so that's fine, e.g. it creates the same file name with alphanumeric suffixes. However, the write session expires due to token expiration. On Tue, Jan 23, 2018 at 5:03 PM, Jörn Franke

Re: S3 token times out during data frame "write.csv"

2018-01-23 Thread Jörn Franke
How large is the file? If it is very large then you should have several partitions for the output anyway. This is also important in case you need to read again from S3 - having several files there enables parallel reading. > On 23. Jan 2018, at 23:58, Vasyl Harasymiv
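A minimal sketch of that suggestion, assuming the df DataFrame and s3_location path from the original question; the partition count is only illustrative and should be tuned so each part file stays small and the upload finishes well within the token's lifetime.

    # Sketch: split the output into many part files before writing to S3.
    # 200 is an illustrative partition count, not a recommendation.
    df.repartition(200).write.csv(s3_location)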

S3 token times out during data frame "write.csv"

2018-01-23 Thread Vasyl Harasymiv
Hi Spark Community, Saving a data frame into a file on S3 using: *df.write.csv(s3_location)* If run for longer than 30 mins, the following error persists: *The provided token has expired. (Service: Amazon S3; Status Code: 400; Error Code: ExpiredToken;)* Potentially, because there is a

Re: I can't save DataFrame from running Spark locally

2018-01-23 Thread Toy
Thanks, I get this error when I switched to s3a:// Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V at

Re: I can't save DataFrame from running Spark locally

2018-01-23 Thread Patrick Alwell
Spark cannot read locally from S3 without an S3a protocol; you’ll more than likely need a local copy of the data or you’ll need to utilize the proper jars to enable S3 communication from the edge to the datacenter.
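For reference, a hedged sketch of what "the proper jars" and s3a configuration might look like when running locally. The package version, paths, bucket, and credentials below are placeholders; the key point is that hadoop-aws must be paired with the aws-java-sdk version it was built against, since a mismatch is a classic cause of the NoSuchMethodError on TransferManager seen above.

    # Sketch: configuring a local SparkSession to talk to S3 over s3a.
    # Example launch (version illustrative, must match the local Hadoop build):
    #   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 my_app.py
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("local-s3a-sketch")
             .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
             .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
             .getOrCreate())

    df = spark.read.text("logs/*.log")  # placeholder local input path
    df.write.mode("overwrite").orc("s3a://my-bucket/output/")  # placeholder bucket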

I can't save DataFrame from running Spark locally

2018-01-23 Thread Toy
Hi, First of all, my Spark application runs fine in AWS EMR. However, I'm trying to run it locally to debug an issue. My application just parses log files, converts them to a DataFrame, then converts to ORC and saves to S3. However, when I run locally I get this error: java.io.IOException:

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
That sounds like it might fit the bill. I'll take a look - thanks! DR On Mon, Jan 22, 2018 at 11:26 PM, vermanurag wrote: > Looking at the description of the problem, window functions may solve your issue. It > allows operation over a window that can include records
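A minimal sketch of the window-function suggestion (the column names and window definition are made up for illustration): each row can reference an aggregate computed over a window of neighboring rows, which avoids hand-maintaining a cache while iterating over the DataFrame.

    # Sketch: a window function lets each row see data from surrounding rows
    # instead of relying on a hand-rolled in-memory cache.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("window-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("u1", 1, 10.0), ("u1", 2, 20.0), ("u2", 1, 5.0)],
        ["user_id", "seq", "value"])

    # Running sum of `value` per user, ordered by `seq`, over all prior rows.
    w = (Window.partitionBy("user_id")
               .orderBy("seq")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    df.withColumn("running_value", F.sum("value").over(w)).show()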

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
Thanks, but broadcast variables won't achieve what I'm looking to do. I'm not trying to just share a one-time set of data across the cluster. Rather, I'm trying to set up a small cache of info that's constantly being updated based on the records in the dataframe. DR On Mon, Jan 22, 2018 at

Re: Spark querying C* in Scala

2018-01-23 Thread Conconscious
Hi list, val dfs = spark   .read   .format("org.apache.spark.sql.cassandra")   .options(Map("cluster" -> "helloCassandra",   "spark.cassandra.connection.host" -> "127.0.0.1",   "spark.input.fetch.size_in_rows" -> "10",   "spark.cassandra.input.consistency.level" -> "ONE",   "table" ->
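For completeness, a hedged PySpark equivalent of the read shown above, using the DataStax spark-cassandra-connector data source; the keyspace, table, and host are placeholders, and the connector package must be on the classpath (e.g. via --packages).

    # Sketch: reading a Cassandra table into a DataFrame through the
    # spark-cassandra-connector; keyspace, table, and host are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-read-sketch")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="my_table")
          .load())

    df.show()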

[Structured streaming] Merging streaming with semi-static datasets

2018-01-23 Thread Christiaan Ras
Hi, I’m currently doing some tests with Structured Streaming and I’m wondering how I can merge the streaming dataset with a more-or-less static dataset (from a JDBC source). By more-or-less static I mean a dataset which does not change that often and could be cached by Spark for a while. It is
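One common shape for this, sketched below with placeholders (the JDBC URL, table, Kafka topic, and join keys are all illustrative): read the JDBC table as a plain static DataFrame and join it with the streaming DataFrame. How often the static side is re-read or cached is exactly the detail being asked about, so the sketch only shows the join itself.

    # Sketch: joining a streaming DataFrame with a static DataFrame read from
    # JDBC. URL, table, topic, and key columns are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-static-join-sketch").getOrCreate()

    # More-or-less static side, read from a JDBC source.
    static_df = (spark.read
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://db-host:5432/mydb")
                 .option("dbtable", "reference_data")
                 .load())

    # Streaming side (a Kafka source, purely for illustration).
    stream_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "localhost:9092")
                 .option("subscribe", "events")
                 .load()
                 .selectExpr("CAST(value AS STRING) AS event_key"))

    # Inner stream-static join on a shared key (ref_key is assumed to exist
    # in reference_data).
    joined = stream_df.join(static_df, stream_df.event_key == static_df.ref_key)

    query = (joined.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()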

Re: external shuffle service in mesos

2018-01-23 Thread igor.berman
Hi Susan, yes, I agree with you regarding resource accounting. Imho, in this case the shuffle service must run on the node no matter what resources are available (the same way we don't account for resources that the "system" takes - the mesos agent, the OS itself, and any other process running on the same machine). One

Re: [Spark structured streaming] Use of (flat)mapgroupswithstate takes long time

2018-01-23 Thread Christiaan Ras
Hi TD, Thanks for taking the time to review my question. Answers to your questions: - How many tasks are being launched in the reduce stage (that is, the stage after the shuffle, that is computing mapGroupsWithState) In the dashboard I count 200 tasks in the stage containing: Exchange ->
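For context (an assumption about where the 200 comes from, not a confirmed diagnosis for this job): 200 is the default value of spark.sql.shuffle.partitions, which sets the number of tasks in the post-shuffle stage that computes (flat)mapGroupsWithState, and it can be tuned if the default is a poor fit for the data volume.

    # The stage after the shuffle runs spark.sql.shuffle.partitions tasks
    # (default 200); 50 below is purely illustrative, assuming `spark` is
    # the active SparkSession.
    spark.conf.set("spark.sql.shuffle.partitions", "50")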