Re: Rdd - zip with index

2021-03-23 Thread ayan guha
Best case is to use a DataFrame; df.columns will automatically give you the column names. Are you sure your file is indeed CSV? Maybe it is easier if you share the code? On Wed, 24 Mar 2021 at 2:12 pm, Sean Owen wrote: > It would split 10GB of CSV into multiple partitions by default, unless > it's
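
A minimal sketch of the DataFrame approach described above, assuming a PySpark session and a hypothetical file path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the CSV with its header row; Spark takes the column names from it.
    df = spark.read.option("header", True).csv("/data/input.csv")  # hypothetical path
    print(df.columns)  # header-derived column names, no zipWithIndex needed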

Re: Rdd - zip with index

2021-03-23 Thread Sean Owen
It would split 10GB of CSV into multiple partitions by default, unless it's gzipped. Something else is going on here. On Tue, Mar 23, 2021 at 10:04 PM "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com> wrote: > I'm not a Spark core developer and do not want to confuse you but it seems
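
One quick way to check this point — a sketch assuming a PySpark shell and a hypothetical path:

    df = spark.read.option("header", True).csv("/data/input.csv")  # hypothetical path

    # A plain 10GB CSV splits into many input partitions (roughly 128MB each by
    # default); a gzipped file is not splittable and yields a single partition.
    print(df.rdd.getNumPartitions())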

Re: Rdd - zip with index

2021-03-23 Thread Yuri Oleynikov (יורי אולייניקוב)
I’m not a Spark core developer and do not want to confuse you, but it seems logical to me that just reading from a single file (no matter what format the file is in) gives no parallelism unless you do a repartition by some column just after the CSV load; but if you're saying you've already tried
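
A sketch of the repartition-after-load idea, assuming the spark session from earlier; the path and column name are made-up examples:

    df = spark.read.option("header", True).csv("/data/input.csv")  # hypothetical path

    # Redistribute rows across executors by a column; 'id' is a hypothetical column.
    df = df.repartition(200, "id")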

Re: Rdd - zip with index

2021-03-23 Thread Sean Owen
I don't think that would change partitioning? Try .repartition(). It isn't necessary to write it out, let alone in Avro. On Tue, Mar 23, 2021 at 8:45 PM "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com> wrote: > Hi, Mohammed > I think that the reason that only one executor is running
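
The plain .repartition() suggestion as a sketch, assuming the df loaded above; the partition count is an arbitrary example:

    # Increase parallelism in place; no need to write the data out first, let alone as Avro.
    df = df.repartition(64)  # 64 is an arbitrary example count
    print(df.rdd.getNumPartitions())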

Re: Rdd - zip with index

2021-03-23 Thread KhajaAsmath Mohammed
So Spark by default doesn't split the large 10GB file when loaded? Sent from my iPhone > On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (יורי אולייניקוב) > wrote: > > Hi, Mohammed > I think that the reason that only one executor is running and has a single > partition is because you have

Re: Rdd - zip with index

2021-03-23 Thread Yuri Oleynikov (יורי אולייניקוב)
Hi, Mohammed. I think that the reason only one executor is running, with a single partition, is that you have a single file that might be read/loaded into memory. In order to achieve better parallelism I'd suggest splitting the CSV file. Another question: why are you using an RDD?

Rdd - zip with index

2021-03-23 Thread KhajaAsmath Mohammed
Hi, I have a 10GB file that should be loaded into a Spark DataFrame. The file is CSV with a header, and we were using rdd.zipWithIndex to get the column names and convert to Avro accordingly. I am assuming this is why it is taking a long time: only one executor runs and it never achieves parallelism. Is there an easy
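
For context, a sketch of the RDD pattern the question describes (zipWithIndex to peel off the header row); the path is hypothetical and this is not necessarily the poster's exact code:

    rdd = spark.sparkContext.textFile("/data/input.csv")  # hypothetical path

    # zipWithIndex tags each line with its position; index 0 is the header line.
    indexed = rdd.zipWithIndex()
    header = indexed.filter(lambda x: x[1] == 0).keys().first().split(",")
    data = indexed.filter(lambda x: x[1] > 0).keys()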

Re: Spark on your Oracle Data Warehouse

2021-03-23 Thread Mich Talebzadeh
Hi, I just posted some material on using Spark with Oracle. If you want to do distributed processing against any DW of your choice, be it Oracle, Hive or BigQuery, in my experience it is best to create Spark DataFrames on top of the underlying storage, either through JDBC or a Spark API (Hive or BigQuery).
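
A minimal sketch of the JDBC route, with hypothetical connection details and table name:

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")  # hypothetical URL
          .option("dbtable", "sales")  # hypothetical table
          .option("user", "scott")
          .option("password", "tiger")
          .option("driver", "oracle.jdbc.OracleDriver")
          .load())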

Re: [Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-23 Thread Mich Talebzadeh
Hi, I did some investigation on this and created a DataFrame on top of the underlying view in the Oracle database. Let's assume that our Oracle view is just a normal view, as opposed to a materialized view, something like the below, where both sales and costs are FACT tables: CREATE OR REPLACE FORCE
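
A sketch of reading such a view from Spark through JDBC, here via the query option; the view name and connection details are hypothetical:

    profits_df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")  # hypothetical
          .option("query", "SELECT * FROM profits_view")  # hypothetical view name
          .option("user", "scott")
          .option("password", "tiger")
          .option("driver", "oracle.jdbc.OracleDriver")
          .load())
    profits_df.printSchema()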

Re: Issue while consuming message in kafka using structured streaming

2021-03-23 Thread Sachit Murarka
Hi Team, I am facing this issue again. I am using Spark 3.0.1 with Python. Could you please suggest why it reports the error below: Current Committed Offsets: {KafkaV2[Subscribe[my-topic]]: {"my-topic":{"1":1498,"0":1410}}} Current Available Offsets: {KafkaV2[Subscribe[my-topic]]:
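
For reference, a minimal structured-streaming Kafka reader of the shape the error message implies; the bootstrap servers and checkpoint path are hypothetical:

    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "my-topic")
          .load())

    query = (df.writeStream.format("console")
             .option("checkpointLocation", "/tmp/chk")  # committed offsets are tracked here
             .start())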

Spark learning for beginner and certification

2021-03-23 Thread Kishore Kumar
Hi Team, I am looking to learn Apache Spark and to do the certification. I am new to the Spark framework. Kindly help with guidelines and complete details on how to proceed. Thanks, Kishore Kumar

Spark on your Oracle Data Warehouse

2021-03-23 Thread Harish Butani
I have been developing 'Spark on Oracle', a project to provide better integration of Spark into an Oracle Data Warehouse. You can read about it at https://hbutani.github.io/spark-on-oracle/blog/Spark_on_Oracle_Blog.html The key features are Catalog Integration, translation and pushdown of Spark

Re: Spark History Server log files questions

2021-03-23 Thread German Schiavon
Hey! I don't think you can do selective removals; I have never heard of it, but who knows. You can refer here to see all the available options -> https://spark.apache.org/docs/latest/monitoring.html. In my experience, having 4 days' worth of logs is enough; usually if something fails you check it
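
For reference, the event-log cleaner settings from the monitoring page linked above go in the history server's spark-defaults.conf; a sketch keeping roughly 4 days of logs:

    spark.history.fs.cleaner.enabled   true
    spark.history.fs.cleaner.maxAge    4d   # delete event logs older than ~4 days
    spark.history.fs.cleaner.interval  1d   # how often the cleaner runs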