It seems this patch is not suitable for our problem.
https://github.com/apache/spark/compare/master...koeninger:SPARK-17147
wood.super
Original Message
From: namesuperwood <namesuperw...@gmail.com>
To: Justin Miller <justin.mil...@protectwise.com>
Cc: user <u...@spark.apache.org>; Cody Koeninger <c...@koeninger.org>
I have questions about using PySpark 2.1.1 to push data to Kafka.
I don't see any PySpark Streaming API to write data directly to Kafka; if
there is one, or an example, please point me to the right page.
I implemented my own approach, which uses a global Kafka producer and pushes the
data picked from
I have a small application like this
from time import sleep
from threading import Thread

acc = sc.accumulator(5)

def t_f(x):
    global acc
    sleep(5)
    acc += x

def f(x):
    global acc
    thread = Thread(target=t_f, args=(x,))
    thread.start()
    # thread.join()  # without this it doesn't work

rdd = sc.parallelize([1, 2, 4, 1])
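On the Kafka question: PySpark 2.1.1 has no built-in API for writing a stream back to Kafka, so writes usually go through a plain Kafka client inside foreachPartition, creating one producer per partition on the executors instead of a single global producer on the driver. A minimal sketch, assuming the third-party kafka-python package and placeholder broker/topic names:

from kafka import KafkaProducer  # third-party kafka-python package, not part of Spark

def send_partition(records):
    # one producer per partition, created on the executor that owns the partition
    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # assumed broker address
    for record in records:
        producer.send("my-topic", str(record).encode("utf-8"))    # assumed topic name
    producer.flush()
    producer.close()

rdd.foreachPartition(send_partition)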
Yes. My Spark Streaming application works with an uncompacted topic. I will check
the patch.
wood.super
Original Message
From: Justin Miller <justin.mil...@protectwise.com>
To: namesuperwood <namesuperw...@gmail.com>
Cc: user <u...@spark.apache.org>; Cody Koeninger <c...@koeninger.org>
Sent: January 24, 2018 (Wednesday) 14:23
Subject: Re:
We appear to be kindred spirits; I’ve recently run into the same issue. Are you
running compacted topics? I’ve run into this issue on non-compacted topics as
well; it happens rarely but is still a pain. You might check out this patch and
the related Spark Streaming Kafka ticket:
Hi all
Kafka version: kafka_2.11-0.11.0.2
Spark version: 2.0.1
A topic-partition "adn-tracking,15" in Kafka whose earliest offset is 1255644602
and latest offset is 1271253441.
While starting a Spark Streaming job to process the data from the topic, we got an
exception with "Got wrong record
It's very interesting, and I do agree that it will get a lot of traction once
made open source.
On Mon, Jan 22, 2018 at 9:01 PM, Rohit Karlupia wrote:
> Hi,
>
> I have been working on making the performance tuning of Spark applications
> a bit easier. We have just released the
How can I write a Parquet file with min/max statistics?
2018-01-24 10:30 GMT+09:00 Stephen Joung :
> Hi, I am trying to use Spark SQL filter pushdown, and especially want to
> use row-group skipping with Parquet files.
>
> And I guessed that I need a Parquet file with statistics
Hi, I am trying to use Spark SQL filter pushdown, and especially want to
use row-group skipping with Parquet files.
And I guessed that I need a Parquet file with min/max statistics.
On the Spark master branch, I tried to write a single column with "a", "b", "c"
to a Parquet file f1:
scala>
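For reference, Parquet normally records min/max statistics per row group, and Spark's Parquet source can use a pushed-down filter to skip row groups based on them. A minimal PySpark sketch of the same experiment; the path and column name are assumptions:

# write a single string column to a Parquet file (assumed path /tmp/f1)
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])
df.write.mode("overwrite").parquet("/tmp/f1")

# the filter should appear under PushedFilters in the physical plan; whether row
# groups are actually skipped depends on the statistics the writer produced
spark.read.parquet("/tmp/f1").filter("value = 'b'").explain()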
Looking good. Do we have a downloadable version of this product? I assume
it will be installed on one of the edge nodes?
Regards,
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Very good job; in fact, a missing link has been addressed.
Any plan on porting it to GitHub? I would like to contribute.
Best,
RS
On Tue, Jan 23, 2018 at 12:01 AM, Rohit Karlupia wrote:
> Hi,
>
> I have been working on making the performance tuning of Spark applications
> a bit
This is awesome work, Rohit. Not only as a user, but I will also be super
interested in contributing to solving this pain point of my daily work.
Manish
~Manish
On Mon, Jan 22, 2018 at 9:21 PM, lucas.g...@gmail.com
wrote:
> I'd be very interested in anything I can send
It is about 400 million rows. S3 automatically chunks the file on its end
while writing, so that's fine; i.e., it creates the same file name with
alphanumeric suffixes.
However, the write session expires due to token expiration.
On Tue, Jan 23, 2018 at 5:03 PM, Jörn Franke
How large is the file?
If it is very large then you should have several partitions for the
output anyway. This is also important in case you need to read again from S3: having
several files there enables parallel reading.
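To make that suggestion concrete, the output can be split into several files by repartitioning before the write. A minimal sketch; the partition count is an assumption:

# more, smaller output files finish faster and can be read back from S3 in parallel
df.repartition(200).write.csv(s3_location)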
> On 23. Jan 2018, at 23:58, Vasyl Harasymiv
Hi Spark Community,
Saving a data frame into a file on S3 using:
df.write.csv(s3_location)
If it runs for longer than 30 minutes, the following error persists:
The provided token has expired. (Service: Amazon S3; Status Code: 400;
Error Code: ExpiredToken;)
Potentially, because there is a
Thanks, I get this error when I switched to s3a://
Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError:
com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
at
Spark cannot read locally from S3 without the s3a protocol; you’ll more than
likely need a local copy of the data, or you’ll need to utilize the proper jars
to enable S3 communication from the edge to the datacenter.
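As an illustration of pulling in the proper jars, a local session can be pointed at s3a with something like the sketch below; the hadoop-aws version and credential values are assumptions, and the artifact version must match the local Hadoop client libraries (mismatched aws-sdk/hadoop-aws jars are a common cause of NoSuchMethodError like the one above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("local-s3-debug")
    # assumed version; it must line up with the Hadoop client libraries on the classpath
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder
    .getOrCreate())

logs = spark.read.text("s3a://your-bucket/path/to/logs/")         # assumed bucket/path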
Hi,
First of all, my Spark application runs fine in AWS EMR. However, I'm
trying to run it locally to debug an issue. My application just
parses log files, converts them to a DataFrame, then converts to ORC and saves to
S3. However, when I run it locally I get this error:
java.io.IOException:
That sounds like it might fit the bill. I'll take a look - thanks!
DR
On Mon, Jan 22, 2018 at 11:26 PM, vermanurag
wrote:
> Looking at the description of the problem, window functions may solve your issue. They
> allow operation over a window that can include records
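As a sketch of the window-function suggestion (not the original poster's code), a running aggregate can be computed per key directly from the records instead of through an external cache; the column names are assumptions:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assumed columns: "key", "ts", "value"
w = (Window.partitionBy("key").orderBy("ts")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

running = df.withColumn("running_total", F.sum("value").over(w))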
Thanks, but broadcast variables won't achieve what I'm looking to do. I'm
not trying to just share a one-time set of data across the cluster.
Rather, I'm trying to set up a small cache of info that's constantly being
updated based on the records in the dataframe.
DR
On Mon, Jan 22, 2018 at
Hi list,
val dfs = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "cluster" -> "helloCassandra",
    "spark.cassandra.connection.host" -> "127.0.0.1",
    "spark.input.fetch.size_in_rows" -> "10",
    "spark.cassandra.input.consistency.level" -> "ONE",
    "table" ->
Hi,
I’m currently doing some tests with Structured Streaming and I’m wondering how
I can merge the streaming dataset with a more-or-less static dataset (from a
JDBC source).
By more-or-less static I mean a dataset which does not change that often and could
be cached by Spark for a while. It is
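One common approach is a stream-static join: read the JDBC table as a batch DataFrame and join the stream against it. A minimal sketch; the JDBC connection details, the Kafka source, and the join key are all assumptions:

# static side: a batch DataFrame from JDBC (placeholder URL, table, and credentials)
static_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "reference_table")
    .option("user", "user").option("password", "password")
    .load())

# streaming side: a Kafka source is assumed here, but any streaming source works
events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"))

# stream-static join; the static plan is typically re-evaluated each micro-batch
# unless it is cached, which is one way to control how often it refreshes
joined = events.join(static_df, "key")  # assumes the JDBC table also has a "key" column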
Hi Susan,
yes, agree with you regarding resource accounting. IMHO, in this case the
shuffle service must run on the node no matter what resources are available (the same
as we don't account for resources that the "system" takes: the Mesos agent, the OS
itself, and any other process that is running on the same machine).
One
Hi TD,
Thanks for taking the time to review my question.
Answers to your questions:
- How many tasks are being launched in the reduce stage (that is, the stage
after the shuffle, that is computing mapGroupsWithState)
In the dashboard I count 200 tasks in the stage containing: Exchange ->
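As a side note on the 200 tasks: that is the default value of spark.sql.shuffle.partitions, which sets the number of partitions used by shuffle stages such as the Exchange feeding mapGroupsWithState. It can be changed per session; the value below is just an assumption:

spark.conf.set("spark.sql.shuffle.partitions", "50")  # assumed value; the default is 200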