Hi,
I am using spark-sql-2.4.1v for streaming in my PoC.
How do I refresh a dataframe loaded from an HDFS/Cassandra table every time
a new batch of the stream is processed? What is the general practice for
handling this kind of scenario?
Below is the StackOverflow link for more details.
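For what it's worth, one common pattern (a hedged sketch, not from the
thread; all paths, topics, and column names below are hypothetical) is to
unpersist and reload the static dataframe inside foreachBatch, so every
micro-batch joins against fresh data:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("refresh-demo").getOrCreate()

    // Mutable reference to the lookup data, reloaded on every micro-batch.
    var lookupDf: DataFrame = spark.read.parquet("/data/lookup") // hypothetical path
    lookupDf.cache()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()

    stream.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
      // Drop the cached copy and reload so this batch sees current contents.
      lookupDf.unpersist()
      lookupDf = spark.read.parquet("/data/lookup")
      lookupDf.cache()
      batch.join(lookupDf, Seq("key"), "left")
        .write.mode("append").parquet("/data/out")      // hypothetical sink
    }.start().awaitTermination()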
"left join" complains and tells me I need to turn on
"spark.sql.crossJoin.enabled=true".
But when I persist one dataframe, it runs fine.
Why do you have to "persist"?
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product
for INNER join between logical plans
SELECT * FROM
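For anyone hitting the same error, a minimal sketch of the two workarounds
discussed here (the dataframes are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("crossjoin-demo").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "v1")
    val right = Seq((1, "x")).toDF("id", "v2")

    // Workaround 1: tell the optimizer that cartesian products are allowed.
    spark.conf.set("spark.sql.crossJoin.enabled", "true")

    // Workaround 2 (what the poster observed): persisting one side
    // materializes it, which changes the analyzed plan so the
    // implicit-cartesian-product check may no longer fire.
    val cached = right.persist()
    left.join(cached, Seq("id"), "left").show()

Persisting likely helps because the cached side becomes an InMemoryRelation,
which blocks the plan rewrite that was making the join condition look trivial
to the cartesian-product check.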
Hi Hichame,
Thanks a lot. I forked it. There is a lot of code; I need some documentation
to guide me on which part I should start from.
On Thu, Sep 5, 2019 at 1:30 PM Hichame El Khalfi
wrote:
> Hey David,
>
> You can find the source code on GitHub:
> https://github.com/apache/spark
>
> Hope this helps,
>
>
Hey David,
You can find the source code on GitHub:
https://github.com/apache/spark
Hope this helps,
Hichame
From: zhou10...@gmail.com
Sent: September 5, 2019 4:11 PM
To: user@spark.apache.org
Subject: Start point to read source codes
Hi,
I want to read the source code. Is there any doc, wiki, or book that
introduces the source code?
Thanks in advance.
David
Gabor,
Thanks for the quick response and for sharing about Spark 3.0. We need to use
Spark Streaming (KafkaUtils.createDirectStream) rather than Structured
Streaming, following this document:
https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html, and
re-iterating the issue again for
Stop using collect for this purpose. Either continue your further
processing in Spark (maybe you need to use streaming), or sink the data to
something that can accept it (GCS/S3/Azure
Storage/Redshift/Elasticsearch/whatever), and have the further processing
read from that sink.
On Thu, Sep 5,
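A hedged sketch of the "sink instead of collect" suggestion (paths and
bucket names are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("sink-demo").getOrCreate()

    val df = spark.read.parquet("/data/input") // hypothetical input

    // Instead of df.collect() pulling everything onto the driver, write the
    // result to durable storage and let the downstream application read it.
    df.write.mode("overwrite").parquet("s3a://my-bucket/output") // hypothetical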
Hi.
I have been trying to collect a large dataset (about 2 GB in size, 30
columns, more than a million rows) onto the driver side. I am aware that
collecting such a huge dataset isn't suggested; however, the application
within which the Spark driver is running requires that data.
While collecting
Hello experts,
I have a quick question: which API allows me to read image files or binary
files (for SparkSession.readStream) from a local/Hadoop file system in
Spark 2.3?
I have been browsing the following documentation and googling for it and
didn't find a good example or documentation:
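For context, a hedged note: as far as I know there is no built-in
image/binary file source in Spark 2.3. The binaryFile source arrived in
Spark 3.0 and, being a file-based source, can also be used with readStream;
a sketch assuming Spark 3.0+ (the directory path is hypothetical):

    // Spark 3.0+ only; each row carries path, modificationTime, length, content.
    val images = spark.readStream
      .format("binaryFile")
      .option("pathGlobFilter", "*.png") // read only PNG files
      .load("hdfs:///data/images")       // hypothetical directory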
Hi Sathi,
Thanks for the quick reply. So this (a list of some epoch times in the IN
clause) was part of a 30-day aggregation already. As per our input-to-output
aggregation ratio, our cardinality is too high, so we require some query
tuning, as we can't assign additional resources for this
Hi,
Let me share the relevant part of the Spark 3.0 documentation (Structured
Streaming, not the DStreams you mentioned, but still relevant):
  Option: kafka.group.id
  Value type: string
  Default: none
  Query type: streaming and batch
  Meaning: The Kafka group id to use in Kafka consumer while reading from
  Kafka. Use this with caution. By default, each query
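A hedged sketch of using that option (requires Spark 3.0+; broker, topic,
and group names are hypothetical):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")
      .option("subscribe", "secured-topic")
      .option("kafka.group.id", "authorized-group") // the pre-authorized group
      .load()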
What I can immediately think of is:
as you are doing IN in the WHERE clause for a series of timestamps, you could
consider breaking them up and, for each epoch timestamp, loading your results
into an intermediate staging table, then doing a final aggregate from that
table, keeping the GROUP BY
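A hedged sketch of the staging-table idea (table, column, and timestamp
values are hypothetical):

    // Run a narrow aggregate per epoch timestamp, appending to a staging
    // table (assumed to already exist).
    val epochs = Seq(1567641600L, 1567728000L) // hypothetical epoch timestamps
    epochs.foreach { ts =>
      spark.sql(
        s"""INSERT INTO staging_agg
           |SELECT key, COUNT(*) AS cnt
           |FROM events
           |WHERE event_ts = $ts
           |GROUP BY key""".stripMargin)
    }
    // Final aggregate over the much smaller staging table.
    spark.sql("SELECT key, SUM(cnt) AS total FROM staging_agg GROUP BY key")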
Hello all,
We have one use case where we are aggregating billions of rows, and it does a
huge shuffle.
Example:
As per the 'Job' tab on the YARN UI, when the input size is around 350 GB,
the shuffle size is >3 TB. This pushes non-DFS usage beyond the warning
limit and thus affects the entire cluster.
It seems we need
Hi Team,
We have a secured Kafka cluster (which only allows consuming from
pre-configured, authorized consumer groups). There is a scenario where we
want to use Spark streaming to consume from the secured Kafka cluster, so we
have decided to use spark-streaming-kafka-0-10
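A hedged sketch of wiring createDirectStream to a secured cluster (broker,
topic, group id, and security settings are hypothetical placeholders; the
API itself is from the linked 0-10 integration guide):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9093",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "authorized-group",  // must be the pre-authorized group
      "security.protocol" -> "SASL_SSL", // cluster-specific security settings
      "auto.offset.reset" -> "latest"
    )

    val ssc = new StreamingContext(sc, Seconds(10)) // sc: existing SparkContext
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("secured-topic"), kafkaParams))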
Using SQL, is it possible to query a column's metadata?
Thanks,
Kyunam
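A hedged sketch of two ways to get at column metadata (table and column
names are hypothetical): DESCRIBE exposes the comment via SQL, and the full
Metadata object is reachable from the schema in Scala:

    spark.sql("DESCRIBE TABLE my_table").show() // col_name, data_type, comment
    val meta = spark.table("my_table").schema("my_col").metadata
    println(meta.json) // the column's metadata as JSON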