DataSourceReader and SupportPushDownFilters for Short types

2018-09-08 Thread Hugh Hyndman
Hi, This is my first message to the Apache Spark digest. In a custom data source reader I am implementing, I noticed that I do not receive pushdown filters for datatypes such as ShortType, ByteType, and BooleanType. I do get filters for types: IntegerType, LongType, FloatType, DoubleType, Dat
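To illustrate the behavior described above, here is a minimal pure-Python sketch (not the actual Spark `SupportsPushDownFilters` API, which is a Java/Scala interface) of a reader partitioning candidate filters into "pushed" and "post-scan" sets by column type — mirroring the observation that filters on Short/Byte/Boolean columns are kept back while Integer/Long/Float/Double ones are pushed. The type names and the `split_filters` helper are illustrative only.

```python
# Illustrative sketch only: mimic a reader deciding which filters to push down.
# Type names follow Spark's SQL type names; the helper itself is hypothetical.
PUSHABLE_TYPES = {"IntegerType", "LongType", "FloatType", "DoubleType", "StringType"}

def split_filters(filters):
    """Given (column, dtype, predicate) triples, return (pushed, post_scan)."""
    pushed, post_scan = [], []
    for col, dtype, pred in filters:
        # Filters on unsupported types stay in Spark's post-scan evaluation.
        (pushed if dtype in PUSHABLE_TYPES else post_scan).append((col, dtype, pred))
    return pushed, post_scan
```

In the real DataSourceV2 API, `pushFilters` returns the filters the source could *not* handle, and Spark re-applies those after the scan.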

Re: How to debug Spark job

2018-09-08 Thread Marco Mistroni
Hi. Might sound like dumb advice, but try to break your process apart. Sounds like you are doing ETL; start basic with just E and T, and do the changes that result in issues. If there is no problem, add the load step. Enable Spark logging so that you can post the error message to the list. I think you can have a look
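The advice above — run extract and transform first, add load only once those succeed, with logging at each stage — can be sketched as a small driver skeleton. The `run_pipeline` helper and stage names are hypothetical, purely to show the shape of the approach:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl-debug")

def run_pipeline(extract, transform, load=None):
    """Run E and T first; pass a load callable only after E+T are known-good."""
    data = extract()
    log.info("extract ok: %d records", len(data))
    data = transform(data)
    log.info("transform ok: %d records", len(data))
    if load is not None:
        load(data)  # only enabled once the earlier stages are debugged
        log.info("load ok")
    return data
```

Usage: start with `run_pipeline(extract, transform)`, and add the `load` argument once the first two stages run clean; the stage-level log lines narrow down where a failure happens.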

RE: [K8S] Driver and Executor Logging

2018-09-08 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi, Provide the following options in spark-defaults.conf and make sure the log4j.properties file is available on the driver and executor: spark.driver.extraJavaOptions -Dlog4j.configuration=file:/log4j.properties spark.executor.extraJavaOptions -Dlog4j.configuration=file:/log4j.properties Regards
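For readability, the settings from the message above can be laid out as a spark-defaults.conf fragment (the `/log4j.properties` path is the one given in the message; adjust it to wherever the file actually lives in your driver and executor images):

```properties
spark.driver.extraJavaOptions    -Dlog4j.configuration=file:/log4j.properties
spark.executor.extraJavaOptions  -Dlog4j.configuration=file:/log4j.properties
```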

Re: Error in show()

2018-09-08 Thread Prakash Joshi
Please check the specific error lines of the text file. Chances are a few columns are not properly delimited in specific rows. Regards Prakash On Fri, Sep 7, 2018, 3:41 AM dimitris plakas wrote: > Hello everyone, I am new in Pyspark and i am facing an issue. Let me > explain what exactly is the
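A quick way to act on this suggestion is to scan the raw file for rows whose column count differs from the first row's. This `bad_rows` helper is hypothetical, just to show the check:

```python
# Hypothetical helper: report row indices whose column count differs from
# the expected count -- the "not properly delimited" situation above.
def bad_rows(lines, delimiter=",", expected=None):
    rows = [line.rstrip("\n").split(delimiter) for line in lines]
    if expected is None:
        expected = len(rows[0])  # assume the first row (header) is correct
    return [i for i, row in enumerate(rows) if len(row) != expected]
```

Running it over the input file before handing the data to Spark points directly at the malformed lines.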

Re: How to retrieve data from nested JSON using dataframe

2018-09-08 Thread Mich Talebzadeh
Hi Tony, Try something like below with two class definitions case class Address(building: String, coord: Array[Double], street: String, zipcode: String) case class Restaurant(address: Address, borough: String, cuisine: String, name: String) val dfRestaurants = Seq(Restaurant(Address("1480", Array

How to retrieve data from nested JSON using dataframe

2018-09-08 Thread 阎志涛
Hi, All, I am using Spark 2.1 and want to do a data transformation for a nested JSON. I tried to read it using a dataframe but failed. Following is the schema of the dataframe: root |-- deviceid: string (nullable = true) |-- app: struct (nullable = true) ||-- appList: array (nullable = true) |||--
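The schema above has one nested array (`app.appList`) under a struct. As a plain-Python sketch of the flattening that is usually wanted here (field names taken from the schema in the message; the `flatten` helper itself is hypothetical):

```python
# Illustrative sketch: flatten one level of a record shaped like the schema
# above (deviceid plus app.appList), producing one row per array element.
def flatten(record):
    rows = []
    for app in (record.get("app") or {}).get("appList", []):
        rows.append({"deviceid": record["deviceid"], "app": app})
    return rows
```

In Spark itself the equivalent is typically dot-notation plus `explode`, e.g. something like `df.select("deviceid", explode("app.appList"))` with `from pyspark.sql.functions import explode`.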

Re: Error in show()

2018-09-08 Thread Sonal Goyal
It says serialization error - could there be a column value which is not getting parsed as int in one of the rows 31-60? The relevant Python code in serializers.py which is throwing the error is def read_int(stream): length = stream.read(4) if not length: raise EOFError return

Re: [External Sender] How to debug Spark job

2018-09-08 Thread Sonal Goyal
You could also try to profile your program on the executor or driver by using jvisualvm or yourkit to see if there is any memory/cpu optimization you could do. Thanks, Sonal Nube Technologies On Fri, Sep 7, 2018 at 6:35 PM, James
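To attach jvisualvm to an executor JVM as suggested, one common route is enabling remote JMX via executor JVM options. A sketch of the relevant spark-defaults.conf line, assuming an unauthenticated setup on a trusted network (the port number is arbitrary, and with multiple executors per host a fixed port will clash):

```properties
spark.executor.extraJavaOptions  -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9178 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
```

jvisualvm can then connect to `executor-host:9178` and sample CPU and memory.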