Train ML models on each partition

2019-05-08 Thread Qian He
I have a 1TB dataset with 100 columns. The first column is a user_id, and there are about 1000 unique user_ids in this 1TB dataset. The use case: I want to train an ML model for each user_id on that user's records (approximately 1GB of records per user). Say the ML model is a Decision Tree. But it is not
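A common pattern for this (on Spark 3.x) is `groupBy().applyInPandas()`, which hands each user's records to a plain pandas function on an executor. The sketch below assumes scikit-learn is available on the executors, that the data has a `label` column, and that each user's ~1GB partition fits in executor memory; column names and the input path are illustrative assumptions, not from the thread.

```python
# Sketch: train one decision tree per user_id with groupBy().applyInPandas().
import pandas as pd

def train_tree(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receives ALL records for one user_id as a pandas DataFrame,
    fits a decision tree, and returns a one-row summary."""
    from sklearn.tree import DecisionTreeClassifier
    features = pdf.drop(columns=["user_id", "label"])
    model = DecisionTreeClassifier()
    model.fit(features, pdf["label"])
    acc = model.score(features, pdf["label"])  # training accuracy
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "train_acc": [acc]})

def run_on_spark(input_path: str):
    # Driver-side wiring; requires a Spark 3.x session (assumed).
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, DoubleType)
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(input_path)
    out_schema = StructType([StructField("user_id", StringType()),
                             StructField("train_acc", DoubleType())])
    # Each group (one user_id) is passed to train_tree as a pandas DataFrame.
    return df.groupBy("user_id").applyInPandas(train_tree, out_schema)
```

On Spark 2.4 (current at the time of the thread) the equivalent mechanism was a `pandas_udf` with `PandasUDFType.GROUPED_MAP`; the per-group training function is the same either way. Persisting the fitted models (rather than a metric) would typically be done by pickling each model to shared storage inside `train_tree`.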

Re: Static partitioning in partitionBy()

2019-05-08 Thread Gourav Sengupta
Hi Burak, Hurray, so you finally made Delta open source :) I always thought of asking TD: is there any chance we could get the streaming graphs back in the Spark UI? It would just be wonderful. Hi Shubham, there is always an easier way and a super fancy way to solve problems; filtering data before

pyspark on pycharm error

2019-05-08 Thread karan alang
Hello - does anyone have any ideas on this PySpark/PyCharm error posted on SO? Please let me know. https://stackoverflow.com/questions/56028402/java-util-nosuchelementexception-key-not-found-pyspark-driver-callback-host

IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] while using spark-sql-2.4.1v to read data from oracle

2019-05-08 Thread Shyam P
Hi, I have an Oracle table with a column whose schema is DATA_DATE DATE, holding values like 31-MAR-02. I am trying to retrieve data from Oracle using spark-sql 2.4.1. I tried to set the JdbcOptions as below: .option("lowerBound", "2002-03-31 00:00:00"); .option("upperBound",
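The error usually means the partition bounds are being parsed as timestamps, so they must be in `yyyy-MM-dd HH:mm:ss` form and the partition column must map to a timestamp type. A hedged sketch of a JDBC read against such a table follows; the URL, table, credentials, and upper bound are assumptions for illustration, not values from the thread.

```python
# Sketch: partitioned JDBC read of an Oracle DATE column (Spark 2.4.x).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # assumed URL
      .option("dbtable", "MY_TABLE")                          # assumed table
      .option("user", "scott")
      .option("password", "tiger")
      .option("partitionColumn", "DATA_DATE")
      # Bounds must parse with the format the error message names:
      .option("lowerBound", "2002-03-31 00:00:00")
      .option("upperBound", "2019-05-08 00:00:00")
      .option("numPartitions", "10")
      # Oracle DATE carries a time component; mapping it to TIMESTAMP
      # on the driver side often avoids this format mismatch.
      .option("oracle.jdbc.mapDateToTimestamp", "true")
      .load())
```

Note that date/timestamp partition columns are only supported for JDBC partitioning from Spark 2.4 onward, so this pattern matches the 2.4.1 version in the subject line.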

Re: Static partitioning in partitionBy()

2019-05-08 Thread Shubham Chaurasia
Thanks On Wed, May 8, 2019 at 10:36 AM Felix Cheung wrote: > You could > > df.filter(col(“c”) = “c1”).write().partitionBy(“c”).save > > It could hit some data skew problems but might work for you > > > > -- > *From:* Burak Yavuz > *Sent:* Tuesday, May 7, 2019 9:35:10
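Felix's suggestion, written out as runnable PySpark (the quoted snippet above mixes Scala and pseudocode), looks roughly like the sketch below. The column name `c`, value `c1`, and output path are the thread's placeholders or assumptions.

```python
# Sketch: write one static partition at a time by filtering first.
from pyspark.sql import functions as F

(df.filter(F.col("c") == "c1")   # keep only the partition being written
   .write
   .partitionBy("c")             # still lay the files out as c=c1/
   .mode("append")               # append so other partitions are untouched
   .parquet("/path/to/output"))  # assumed output path
```

Because only one partition value survives the filter, every task writes into the same `c=c1/` directory, which is where the skew Felix mentions can appear; a `repartition()` before the write is a common mitigation.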

HiveTableRelation Assertion Error | Joining Stream with Hive table

2019-05-08 Thread Manjunath N
Hi, I have a dataframe reading messages from a Kafka topic, and another dataframe created from a Hive table. When I join (inner) the stream with the Hive table, I get the assertion error below: java.lang.AssertionError: assertion failed: No plan for HiveTableRelation When I do an explain on the
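A frequent cause of "No plan for HiveTableRelation" is a session built without Hive support, so the analyzer has no rule to resolve the Hive relation. A minimal sketch of a stream-static inner join with Hive support enabled is below; the broker, topic, and table names are assumptions, not from the thread.

```python
# Sketch: stream-static inner join between a Kafka stream and a Hive table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()   # without this, HiveTableRelation cannot be planned
         .getOrCreate())

stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # assumed
             .option("subscribe", "events")                      # assumed topic
             .load()
             .selectExpr("CAST(key AS STRING) AS join_key", "value"))

hive_df = spark.table("mydb.dim_table")  # assumed Hive table

# Stream-static joins keep the static side as a batch relation on each micro-batch.
joined = stream_df.join(hive_df, stream_df["join_key"] == hive_df["key"], "inner")
```

If Hive support is already enabled, materializing the Hive side (e.g. via a checkpoint or a snapshot table) is another workaround people report for this assertion.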

How does org.apache.spark.sql.catalyst.util.MapData support hash lookup?

2019-05-08 Thread Shawn Yang
Hi guys, I'm reading the Spark source code. When I read org.apache.spark.sql.catalyst.util.ArrayBasedMapData and org.apache.spark.sql.catalyst.expressions.UnsafeMapData, I can't understand how they support hash lookup. Is there anything I'm missing?

Re: Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-08 Thread Akshay Bhardwaj
Hi Austin, a few questions: 1. What is the partition count of the Kafka topics used for input and output data? 2. In the write stream, I recommend using "trigger" with a defined interval if you prefer a micro-batching strategy, 3. along with defining "maxOffsetsPerTrigger" in
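Points 2 and 3 of Akshay's advice can be sketched as follows: `maxOffsetsPerTrigger` on the Kafka source caps how many records each micro-batch reads, and a processing-time trigger fixes the batch cadence. Broker addresses, topic names, the checkpoint path, and the specific interval/offset values are assumptions for illustration.

```python
# Sketch: rate-limited Kafka-to-Kafka structured streaming query.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # assumed
          .option("subscribe", "input-topic")                 # assumed
          .option("maxOffsetsPerTrigger", "10000")  # cap records per micro-batch
          .load())

query = (stream.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "output-topic")                     # assumed
         .option("checkpointLocation", "/tmp/ckpt")           # assumed path
         .trigger(processingTime="30 seconds")  # fixed micro-batch interval
         .start())
```

Bounding each batch this way keeps batch durations predictable, which makes the per-batch metrics in the logs much easier to interpret when diagnosing the performance behavior described in the thread.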