I have a 1TB dataset with 100 columns. The first column is a user_id, and there
are about 1000 unique user_ids in this 1TB dataset.
The use case: I want to train an ML model for each user_id on that user's
records (approximately 1GB of records per user). Say the ML model is a
Decision Tree. But it is not
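One naive way to frame this, as a rough sketch only (assuming the data is already
loaded as a DataFrame `df`, and that the `label` column and assembled `features`
vector column are hypothetical names): collect the distinct user_ids, filter per
user, and fit a separate spark.ml Decision Tree on each ~1GB slice.

  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.sql.functions.col

  // One model per user_id, fit on that user's filtered slice of the dataset.
  val userIds = df.select("user_id").distinct().collect().map(_.get(0))

  val models = userIds.map { uid =>
    val perUser = df.filter(col("user_id") === uid)   // roughly 1GB of records
    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")        // hypothetical label column
      .setFeaturesCol("features")  // hypothetical assembled feature vector
    uid -> dt.fit(perUser)
  }.toMap

Each fit here is its own Spark job over a filtered slice, so this is simple but
not necessarily the most efficient way to run 1000 trainings.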
Hi Burak,
Hurray, so you finally made Delta open source :)
I always wanted to ask TD: is there any chance we could get the
streaming graphs back in the Spark UI? That would just be wonderful.
Hi Shubham,
There is always an easier way and a super fancy way to solve problems;
filtering data before
Hello - if anyone has any ideas on this PySpark/PyCharm error on SO, please let
me know.
https://stackoverflow.com/questions/56028402/java-util-nosuchelementexception-key-not-found-pyspark-driver-callback-host
Hi,
I have an Oracle table which has
a column with schema DATA_DATE DATE, holding values like 31-MAR-02.
I am trying to retrieve data from Oracle using spark-sql 2.4.1. I
tried to set the JDBC options as below:
.option("lowerBound", "2002-03-31 00:00:00");
.option("upperBound",
Thanks
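For reference, a minimal sketch of a partitioned JDBC read on such a DATE
column, with placeholder connection details and a hypothetical upper bound;
Spark 2.4 accepts date/timestamp bounds when partitionColumn, lowerBound,
upperBound and numPartitions are all set together:

  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // placeholder URL
    .option("dbtable", "MY_SCHEMA.MY_TABLE")                // placeholder table
    .option("user", "scott")                                // placeholder credentials
    .option("password", "tiger")
    .option("partitionColumn", "DATA_DATE")
    .option("lowerBound", "2002-03-31 00:00:00")
    // hypothetical upper bound; use plain dates (e.g. "2019-05-08") instead if
    // the column is inferred as DateType rather than TimestampType
    .option("upperBound", "2019-05-08 00:00:00")
    .option("numPartitions", "8")
    .load()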
On Wed, May 8, 2019 at 10:36 AM Felix Cheung wrote:
> You could
>
> df.filter(col("c") === "c1").write().partitionBy("c").save
>
> It could hit some data skew problems but might work for you
>
>
>
> --
> *From:* Burak Yavuz
> *Sent:* Tuesday, May 7, 2019 9:35:10
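For completeness, a minimal runnable form of the snippet quoted above, with a
hypothetical output path; it writes only the rows matching one partition value,
laid out under c=c1/:

  import org.apache.spark.sql.functions.col

  df.filter(col("c") === "c1")
    .write
    .partitionBy("c")
    .mode("overwrite")                         // or "append", depending on the use case
    .save("/tmp/output/per_partition_table")   // placeholder path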
Hi,
I have a dataframe reading messages from a Kafka topic and another
dataframe created from a Hive table. When I join (inner) the stream with the
Hive table, I get the assertion error below.
java.lang.AssertionError: assertion failed: No plan for HiveTableRelation
When I do an explain on the
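For reference, a minimal sketch of the kind of stream-static inner join being
described, with hypothetical broker, topic, table and column names; reading the
Hive table assumes a session built with enableHiveSupport():

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("stream-static-join")
    .enableHiveSupport()   // Hive catalog support is needed to read Hive tables at all
    .getOrCreate()

  val kafkaDf = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
    .option("subscribe", "input_topic")                 // placeholder
    .load()
    .selectExpr("CAST(key AS STRING) AS join_key", "CAST(value AS STRING) AS payload")

  val hiveDf = spark.table("my_db.my_table")   // placeholder Hive table with a join_key column

  val joined = kafkaDf.join(hiveDf, Seq("join_key"))   // inner stream-static join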
Hi guys,
I'm reading the Spark source code. When I read
org.apache.spark.sql.catalyst.util.ArrayBasedMapData and
org.apache.spark.sql.catalyst.expressions.UnsafeMapData, I can't understand
how they support hash lookup. Is there anything I'm missing?
Hi Austin,
A few questions:
1. What is the partition count of the Kafka topics used for the input and
output data?
2. In the write stream, I would recommend using "trigger" with a defined
interval if you prefer a micro-batching strategy (see the sketch after this list),
3. along with defining "maxOffsetsPerTrigger" in
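A minimal sketch of points 2 and 3, with hypothetical broker, topic and
checkpoint names: cap the records pulled per micro-batch on the Kafka source
and fire micro-batches on a fixed processing-time interval.

  import org.apache.spark.sql.streaming.Trigger

  val input = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
    .option("subscribe", "input_topic")                 // placeholder
    .option("maxOffsetsPerTrigger", "10000")            // point 3: bound offsets read per micro-batch
    .load()

  val query = input.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output_topic")                         // placeholder
    .option("checkpointLocation", "/tmp/checkpoints/demo")   // placeholder
    .trigger(Trigger.ProcessingTime("30 seconds"))           // point 2: defined micro-batch interval
    .start()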