How to control batch size while reading from HDFS files?

2019-03-22 Thread kant kodali
Hi All, what determines the batch size when reading files from HDFS? I am trying to read files from HDFS and ingest them into Kafka using Spark Structured Streaming 2.3.1. I get an error saying the Kafka batch size is too big and that I need to increase max.request.size. Sure, I can increase it but
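For reference, a minimal sketch of both knobs, assuming a file-source stream feeding a Kafka sink (the paths, broker address, and topic below are placeholders, not from the original message): maxFilesPerTrigger caps how many files each micro-batch reads, and producer settings such as max.request.size are passed through with a "kafka." prefix.

  // Sketch only: placeholder paths, broker, and topic.
  val df = spark.readStream
    .option("maxFilesPerTrigger", "10")   // files consumed per micro-batch
    .text("hdfs:///path/to/input")        // placeholder input path

  df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")                 // placeholder
    .option("kafka.max.request.size", (10 * 1024 * 1024).toString)  // raise producer limit
    .option("topic", "mytopic")                                     // placeholder
    .option("checkpointLocation", "/tmp/ckpt")                      // required by the Kafka sink
    .start()

Note that maxFilesPerTrigger bounds the number of files per batch, not the bytes per Kafka request, so a single oversized record can still trip max.request.size.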

Re: Cross Join

2019-03-22 Thread kathy Harayama
Hello, I am using 2.4 and it works: scala> val df_A=Seq(("1", 10.0),("2",20.0),("3",30.0),("4",40.0),("5",50.0),("6",60.0),("7",70.0),("8",80.0),("9",90.0),("10",10.0)).toDF("id","val"); df_A: org.apache.spark.sql.DataFrame = [id: string, val: double] scala> val df_B=Seq(("11",
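The df_B definition above is truncated; for completeness, a sketch of the rest of the example with assumed values:

  scala> val df_B = Seq(("11", 110.0), ("12", 120.0)).toDF("id", "val")
  df_B: org.apache.spark.sql.DataFrame = [id: string, val: double]

  scala> df_A.crossJoin(df_B).count()
  res0: Long = 20   // 10 rows x 2 rows

The explicit crossJoin method (available since Spark 2.1) avoids the need to enable spark.sql.crossJoin.enabled, which only gates implicit cross joins.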

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread kathy Harayama
Hi Lian, since you are using repartition(1), do you want to decrease the number of partitions? If so, have you tried coalesce instead? Kathleen On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang wrote: > Hi, > > Writing a csv to HDFS takes about 1 hour: > > >
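For illustration, the same write with coalesce(1), which narrows to one partition without the full shuffle that repartition(1) triggers (a Scala sketch of the PySpark line quoted above; csvPath stands in for the original csv variable):

  df.coalesce(1)
    .write
    .format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .save(csvPath)

One caveat: coalesce(1) can also collapse the parallelism of the upstream stages, so it is not always faster.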

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread Apostolos N. Papadopoulos
Is it also slow when you do not repartition (i.e., when you get multiple output files)? Also, did you try simply saveAsTextFile? And before the repartition, how many partitions are there? a. On 22/3/19 23:34, Lian Jiang wrote: Hi, Writing a csv to HDFS takes about 1 hour:
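A quick way to run those checks, sketched in Scala (the output paths are placeholders):

  // how many partitions exist before any repartitioning
  println(df.rdd.getNumPartitions)

  // write without forcing a single partition: one output file per partition
  df.write.mode("overwrite").option("header", "true").csv("hdfs:///tmp/out")

  // the saveAsTextFile variant goes through the RDD API
  df.rdd.map(_.mkString(",")).saveAsTextFile("hdfs:///tmp/out_text")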

writing a small csv to HDFS is super slow

2019-03-22 Thread Lian Jiang
Hi, Writing a CSV to HDFS takes about 1 hour: df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv) The generated CSV file is only about 150 KB. The job uses 3 containers (13 cores, 23 GB memory). Other people have similar issues but I don't
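As an aside, Spark 2.x ships a built-in CSV source, so the external com.databricks.spark.csv package is not needed; a Scala sketch of the equivalent write (csvPath stands in for the original csv variable):

  df.repartition(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(csvPath)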

Re: Java Heap Space error - Spark ML

2019-03-22 Thread Apostolos N. Papadopoulos
What is the size of your data and of your cluster? Are you using spark-submit or an IDE, and what Spark version are you using? Try spark-submit and increase the memory of the driver or the executors. a. On 22/3/19 17:19, KhajaAsmath Mohammed wrote: Hi, I am getting the below exception when
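For example, the memory flags on spark-submit look like this (the class and jar names are placeholders):

  spark-submit \
    --class com.example.KMeansJob \
    --driver-memory 8g \
    --executor-memory 8g \
    kmeans-job.jar

Setting spark.driver.memory inside the application has no effect in client mode, since the driver JVM has already started by then, which is one reason to prefer spark-submit flags over IDE runs.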

Re: Manually reading parquet files.

2019-03-22 Thread Wenchen Fan
Try `val encoder = RowEncoder(df.schema).resolveAndBind()`? On Thu, Mar 21, 2019 at 5:39 PM Long, Andrew wrote: > Thanks a ton for the help! > > Is there a standardized way of converting the internal row to rows? > > I've tried this but I'm getting an exception > > val encoder =
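For context, resolveAndBind() binds the encoder to the schema so it can deserialize; in Spark 2.x the conversion back to external rows then looks roughly like this (internalRow is a placeholder for whatever InternalRow is in hand):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.encoders.RowEncoder

  val encoder = RowEncoder(df.schema).resolveAndBind()
  val row: Row = encoder.fromRow(internalRow)   // InternalRow -> external Row

Note this is internal API: Spark 3.0 replaced fromRow/toRow with createDeserializer()/createSerializer().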

Re: spark sql error

2019-03-22 Thread Wenchen Fan
Did you include the whole error message? On Fri, Mar 22, 2019 at 12:45 AM 563280193 <563280...@qq.com> wrote: > Hi, > I ran a Spark SQL query like this: > > select imei, tag, product_id, > sum(case when succ1>=1 then 1 else 0 end) as succ, > sum(case when fail1>=1 and succ1=0 then 1

Java Heap Space error - Spark ML

2019-03-22 Thread KhajaAsmath Mohammed
Hi, I am getting the below exception when using Spark KMeans. Any solutions from the experts would be really helpful. val kMeans = new KMeans().setK(reductionCount).setMaxIter(30) val kMeansModel = kMeans.fit(df) The error occurs when calling kMeans.fit: Exception in thread "main"
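A minimal sketch of the usual first step, assuming rawDf is a placeholder for a DataFrame with a features column (reductionCount comes from the snippet above):

  import org.apache.spark.ml.clustering.KMeans

  // caching avoids recomputing the input across k-means iterations
  val df = rawDf.select("features").cache()

  val kMeans = new KMeans().setK(reductionCount).setMaxIter(30)
  val kMeansModel = kMeans.fit(df)

The driver must also hold the k x featureDim matrix of cluster centers, so a large K over high-dimensional features can exhaust the driver heap even when the executors are fine.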

Re: spark sql error

2019-03-22 Thread Mingcong Han
Hi, It seems that there is a syntax error in your SQL. You should use `select ... from ... group by ... having sum(...)=0 and sum(...)>0` instead of `where sum(...)=0...`. I think aggregate functions are not allowed in WHERE.
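Applying that suggestion to the query from the original post gives roughly the following (the GROUP BY columns and the second HAVING condition are assumptions, since the original message is truncated):

  val result = spark.sql("""
    select imei, tag, product_id,
           sum(case when succ1 >= 1 then 1 else 0 end) as succ,
           sum(case when fail1 >= 1 and succ1 = 0 then 1 else 0 end) as fail,
           count(*) as cnt
    from t_tbl
    group by imei, tag, product_id
    having sum(case when succ1 >= 1 then 1 else 0 end) = 0
       and sum(case when fail1 >= 1 and succ1 = 0 then 1 else 0 end) > 0
  """)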

spark sql error

2019-03-22 Thread 563280193
Hi, I ran a Spark SQL query like this: select imei, tag, product_id, sum(case when succ1>=1 then 1 else 0 end) as succ, sum(case when fail1>=1 and succ1=0 then 1 else 0 end) as fail, count(*) as cnt from t_tbl where sum(case when succ1>=1 then 1 else 0 end)=0 and sum(case when
