How to control batch size while reading from HDFS files?

2019-03-22 Thread kant kodali
Hi All, what determines the batch size when reading files from HDFS? I am trying to read files from HDFS and ingest them into Kafka using Spark Structured Streaming 2.3.1. I get an error saying the Kafka batch size is too big and that I need to increase max.request.size. Sure, I can increase it but
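For reference, a minimal sketch of both knobs, assuming a file-source stream feeding a Kafka sink (the paths, broker address, and topic below are placeholders, not from the original message): maxFilesPerTrigger caps how many files each micro-batch reads, and producer settings such as max.request.size are passed through with a "kafka." prefix.

  // Sketch only: placeholder paths, broker, and topic.
  val df = spark.readStream
    .option("maxFilesPerTrigger", "10")   // files consumed per micro-batch
    .text("hdfs:///path/to/input")        // placeholder input path

  df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")                 // placeholder
    .option("kafka.max.request.size", (10 * 1024 * 1024).toString)  // raise producer limit
    .option("topic", "mytopic")                                     // placeholder
    .option("checkpointLocation", "/tmp/ckpt")                      // required by the Kafka sink
    .start()

Note that maxFilesPerTrigger bounds the number of files per batch, not the bytes per Kafka request, so a single oversized record can still trip max.request.size.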

Re: Cross Join

2019-03-22 Thread kathy Harayama
Hello, I am using 2.4 and it works: scala> val df_A=Seq(("1", 10.0),("2",20.0),("3",30.0),("4",40.0),("5",50.0),("6",60.0),("7",70.0),("8",80.0),("9",90.0),("10",10.0)).toDF("id","val"); df_A: org.apache.spark.sql.DataFrame = [id: string, val: double] scala> val df_B=Seq(("11",
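The df_B definition above is truncated; for completeness, a sketch of the rest of the example with assumed values:

  scala> val df_B = Seq(("11", 110.0), ("12", 120.0)).toDF("id", "val")
  df_B: org.apache.spark.sql.DataFrame = [id: string, val: double]

  scala> df_A.crossJoin(df_B).count()
  res0: Long = 20   // 10 rows x 2 rows

The explicit crossJoin method (available since Spark 2.1) avoids the need to enable spark.sql.crossJoin.enabled, which only gates implicit cross joins.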

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread kathy Harayama
Hi Lian, since you are using repartition(1), do you want to decrease the number of partitions? If so, have you tried coalesce instead? Kathleen On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang wrote: > Hi, > > Writing a csv to HDFS takes about 1 hour: > > >
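For illustration, the same write with coalesce(1), which narrows to one partition without the full shuffle that repartition(1) triggers (a Scala sketch of the PySpark line quoted above; csvPath stands in for the original csv variable):

  df.coalesce(1)
    .write
    .format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .save(csvPath)

One caveat: coalesce(1) can also collapse the parallelism of the upstream stages, so it is not always faster.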

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread Apostolos N. Papadopoulos
Is it also slow when you do not repartition (i.e., when you get multiple output files)? Also, did you try simply saveAsTextFile? And before the repartition, how many partitions are there? a. On 22/3/19 23:34, Lian Jiang wrote: Hi, Writing a csv to HDFS takes about 1 hour:
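A quick way to run those checks, sketched in Scala (the output paths are placeholders):

  // how many partitions exist before any repartitioning
  println(df.rdd.getNumPartitions)

  // write without forcing a single partition: one output file per partition
  df.write.mode("overwrite").option("header", "true").csv("hdfs:///tmp/out")

  // the saveAsTextFile variant goes through the RDD API
  df.rdd.map(_.mkString(",")).saveAsTextFile("hdfs:///tmp/out_text")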

writing a small csv to HDFS is super slow

2019-03-22 Thread Lian Jiang
Hi, Writing a CSV to HDFS takes about 1 hour: df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv) The generated CSV file is only about 150 KB. The job uses 3 containers (13 cores, 23 GB memory). Other people have similar issues but I don't
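As an aside, Spark 2.x ships a built-in CSV source, so the external com.databricks.spark.csv package is not needed; a Scala sketch of the equivalent write (csvPath stands in for the original csv variable):

  df.repartition(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(csvPath)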

Re: Java Heap Space error - Spark ML

2019-03-22 Thread Apostolos N. Papadopoulos
What is the size of your data and of your cluster? Are you using spark-submit or an IDE, and what Spark version are you using? Try spark-submit and increase the memory of the driver or the executors. a. On 22/3/19 17:19, KhajaAsmath Mohammed wrote: Hi, I am getting the below exception when
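For example, the memory flags on spark-submit look like this (the class and jar names are placeholders):

  spark-submit \
    --class com.example.KMeansJob \
    --driver-memory 8g \
    --executor-memory 8g \
    kmeans-job.jar

Setting spark.driver.memory inside the application has no effect in client mode, since the driver JVM has already started by then, which is one reason to prefer spark-submit flags over IDE runs.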

Re: Manually reading parquet files.

2019-03-22 Thread Wenchen Fan
Try `val encoder = RowEncoder(df.schema).resolveAndBind()`? On Thu, Mar 21, 2019 at 5:39 PM Long, Andrew wrote: > Thanks a ton for the help! > > Is there a standardized way of converting the internal row to rows? > > I've tried this but I'm getting an exception > > val encoder =
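For context, resolveAndBind() binds the encoder to the schema so it can deserialize; in Spark 2.x the conversion back to external rows then looks roughly like this (internalRow is a placeholder for whatever InternalRow is in hand):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.encoders.RowEncoder

  val encoder = RowEncoder(df.schema).resolveAndBind()
  val row: Row = encoder.fromRow(internalRow)   // InternalRow -> external Row

Note this is internal API: Spark 3.0 replaced fromRow/toRow with createDeserializer()/createSerializer().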

Re: spark sql error

2019-03-22 Thread Wenchen Fan
Did you include the whole error message? On Fri, Mar 22, 2019 at 12:45 AM 563280193 <563280...@qq.com> wrote: > Hi, > I ran a Spark SQL query like this: > > select imei, tag, product_id, > sum(case when succ1>=1 then 1 else 0 end) as succ, > sum(case when fail1>=1 and succ1=0 then 1

Java Heap Space error - Spark ML

2019-03-22 Thread KhajaAsmath Mohammed
Hi, I am getting the below exception when using Spark KMeans. Any solutions from the experts would be really helpful. val kMeans = new KMeans().setK(reductionCount).setMaxIter(30) val kMeansModel = kMeans.fit(df) The error occurs when calling kMeans.fit: Exception in thread "main"
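A minimal sketch of the usual first step, assuming rawDf is a placeholder for a DataFrame with a features column (reductionCount comes from the snippet above):

  import org.apache.spark.ml.clustering.KMeans

  // caching avoids recomputing the input across k-means iterations
  val df = rawDf.select("features").cache()

  val kMeans = new KMeans().setK(reductionCount).setMaxIter(30)
  val kMeansModel = kMeans.fit(df)

The driver must also hold the k x featureDim matrix of cluster centers, so a large K over high-dimensional features can exhaust the driver heap even when the executors are fine.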

Re: spark sql error

2019-03-22 Thread Mingcong Han
Hi, It seems that there is a syntax error in your SQL. You should use `select ... from ... group by ... having sum(...)=0 and sum(...)>0` instead of `where sum(...)=0...`. I think aggregate functions are not allowed in WHERE.
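Applying that suggestion to the query from the original post gives roughly the following (the GROUP BY columns and the second HAVING condition are assumptions, since the original message is truncated):

  val result = spark.sql("""
    select imei, tag, product_id,
           sum(case when succ1 >= 1 then 1 else 0 end) as succ,
           sum(case when fail1 >= 1 and succ1 = 0 then 1 else 0 end) as fail,
           count(*) as cnt
    from t_tbl
    group by imei, tag, product_id
    having sum(case when succ1 >= 1 then 1 else 0 end) = 0
       and sum(case when fail1 >= 1 and succ1 = 0 then 1 else 0 end) > 0
  """)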

spark sql error

2019-03-22 Thread 563280193
Hi, I ran a Spark SQL query like this: select imei, tag, product_id, sum(case when succ1>=1 then 1 else 0 end) as succ, sum(case when fail1>=1 and succ1=0 then 1 else 0 end) as fail, count(*) as cnt from t_tbl where sum(case when succ1>=1 then 1 else 0 end)=0 and sum(case when
