Re: statefulStreaming checkpointing too often

2017-06-02 Thread Tathagata Das
There are two kinds of checkpointing going on here: metadata and data. The 100-second interval you have configured is the data checkpointing (expensive, large data), where the RDD data is written to HDFS. The 10-second one is the metadata checkpointing (cheap, small data), where the metadata of the
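For concreteness, the two intervals are configured in different places in the DStream API; a minimal sketch, assuming a StreamingContext `ssc` and a stateful stream `stateStream` (e.g. from `updateStateByKey`) — both names are placeholders:

```scala
import org.apache.spark.streaming.Seconds

// Metadata checkpointing: enabled by setting the checkpoint directory;
// small bookkeeping data is written there every batch.
ssc.checkpoint("hdfs:///checkpoints/myapp")

// Data (RDD) checkpointing: the interval at which the actual state RDDs
// are written to HDFS can be set per stream, e.g. every 100 seconds.
stateStream.checkpoint(Seconds(100))
```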

Re: Spark 2.1 - Inferring schema of dataframe after reading json files not during

2017-06-02 Thread vaquar khan
You can add a filter, or replace null with a value like 0 or a string: df.na.fill(0, Seq("y")) Regards, Vaquar khan On Jun 2, 2017 11:25 AM, "Alonso Isidoro Roman" wrote: not sure if this can help you, but you can infer programmatically the schema providing a json schema file,
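A sketch of the two options mentioned, assuming a DataFrame `df` with a nullable numeric column "y" (both the DataFrame and the column name are placeholders):

```scala
import org.apache.spark.sql.functions.col

// Option 1: replace nulls in column "y" with 0.
val filled = df.na.fill(0, Seq("y"))

// Option 2: filter out the rows where "y" is null.
val nonNull = df.filter(col("y").isNotNull)
```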

Re: An Architecture question on the use of virtualised clusters

2017-06-02 Thread Gene Pang
As Vincent mentioned earlier, I think Alluxio can work for this. You can mount your (potentially remote) storage systems to Alluxio, and deploy Alluxio co-located with the compute cluster. The computation framework will

Re: Spark 2.1 - Inferring schema of dataframe after reading json files not during

2017-06-02 Thread Alonso Isidoro Roman
Not sure if this can help you, but you can programmatically infer the schema by providing a JSON schema file: val path: Path = new Path(schema_parquet) val fileSystem = path.getFileSystem(sc.hadoopConfiguration) val inputStream: FSDataInputStream = fileSystem.open(path) val schema_json =
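Continuing that idea, once the schema JSON has been read into a string it can be turned back into a `StructType` and applied on read; a sketch, assuming a SparkSession `spark` and a string `schemaJson` holding the JSON produced by `StructType.json` (the data path is a placeholder):

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Rebuild the schema object from its JSON representation.
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Apply it explicitly instead of letting Spark infer one.
val df = spark.read.schema(schema).json("path/to/data.json")
```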

Spark SQL, formatting timezone in UTC

2017-06-02 Thread yohann jardin
Hello everyone, I'm having a hard time with time zones. I have a Long representing a timestamp: 149636160, and I want the output to be 2017-06-02 00:00:00. Based on https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html, the only function that helps formatting a
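Outside of Spark SQL, the formatting itself can be pinned to UTC with `java.time`; a sketch — the timestamp in the post looks truncated, so 1496361600 seconds (2017-06-02 00:00:00 UTC) is assumed here for illustration:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Format an epoch-seconds Long as "yyyy-MM-dd HH:mm:ss" explicitly in UTC,
// independent of the JVM's default time zone.
def formatUtc(epochSeconds: Long): String = {
  val fmt = DateTimeFormatter
    .ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneOffset.UTC)
  fmt.format(Instant.ofEpochSecond(epochSeconds))
}

println(formatUtc(1496361600L)) // 2017-06-02 00:00:00
```

In Spark itself, `from_unixtime` formats using the JVM's time zone, so a common workaround is wrapping logic like the above in a UDF, or starting the driver and executor JVMs with UTC as the default time zone.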

Re: Number Of Partitions in RDD

2017-06-02 Thread neil90
Cluster mode with HDFS? Or local mode? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Number-Of-Partitions-in-RDD-tp28730p28737.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark 2.1 - Inferring schema of dataframe after reading json files not during

2017-06-02 Thread Aseem Bansal
When we read files in Spark it infers the schema. We have the option to not infer the schema. Is there a way to ask Spark to infer the schema again, just like when reading JSON? The reason we want to get this done is that we have a problem in our data files. We have a json file containing this
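One way to get inference back after the fact is to round-trip through the JSON reader; a sketch for Spark 2.1, assuming a SparkSession `spark` and a Dataset `raw` holding one JSON document per row as a string (`raw` is a placeholder):

```scala
import spark.implicits._

// Re-run Spark's JSON schema inference over the JSON strings
// (in Spark 2.1, read.json accepts an RDD[String]).
val reInferred = spark.read.json(raw.as[String].rdd)
reInferred.printSchema()
```

Note this re-infers the schema from the data itself rather than reusing anything recorded at the original read.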

Re: Number Of Partitions in RDD

2017-06-02 Thread Vikash Pareek
Spark 1.6.1