Re:

2020-03-01 Thread Wim Van Leuven
Hey Hamish, I don't think there is an 'automatic fix' for this problem ... Are you reading those as partitions of a single dataset? Or are you processing them individually? Since your incoming data apparently isn't stable, you should implement a preprocessing step on each file to check and, if
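
A minimal sketch of such a per-file schema check, assuming plain Parquet files sitting directly under one HDFS directory; the path and the "first file as reference" rule are illustrative assumptions, not something stated in the thread:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object SchemaCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("schema-check").getOrCreate()
    val dir = new Path("hdfs:///data/incoming")   // placeholder directory
    val fs  = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // List the Parquet files and take the first one's schema as the reference.
    val files = fs.listStatus(dir)
      .map(_.getPath.toString)
      .filter(_.endsWith(".parquet"))
    val reference: StructType = spark.read.parquet(files.head).schema

    // Reading only the schema touches the Parquet footer, not the data,
    // so this stays cheap even with thousands of files.
    val mismatched = files.filter { f =>
      spark.read.parquet(f).schema != reference
    }
    mismatched.foreach(f => println(s"schema mismatch: $f"))
    spark.stop()
  }
}
```

Files flagged here could then be quarantined or repaired before the main job reads the directory as a whole.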

[no subject]

2020-03-01 Thread Hamish Whittal
Hi there, I have an HDFS directory with thousands of files. It seems that some of them - and I don't know which ones - have a problem with their schema and it's causing my Spark application to fail with this error: Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column
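
One way to find which files are the problem, sketched below under assumptions (placeholder directory path): read each file individually and catch whichever reads fail. `df.rdd.count()` is used to force a full decode of the rows rather than Parquet's metadata-only count.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import scala.util.{Failure, Success, Try}

object FindBadParquetFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("find-bad-parquet").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val files = fs.listStatus(new Path("hdfs:///data/incoming"))
      .map(_.getPath.toString)
      .filter(_.endsWith(".parquet"))

    files.foreach { f =>
      // Force a full scan of this single file and record any failure.
      Try(spark.read.parquet(f).rdd.count()) match {
        case Success(_) => ()
        case Failure(e) => println(s"bad file: $f -> ${e.getMessage}")
      }
    }
    spark.stop()
  }
}
```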

command line build fail and warnings compare to IDE build success

2020-03-01 Thread Zahid Rahman
Hi, When I run the same word count program from the command line and from the IDE, I get two different sets of logging messages. The command-line execution even reports BUILD FAILED, although the program actually succeeds, producing the word count directory and the resulting output file. I have

How to collect Spark dataframe write metrics

2020-03-01 Thread Manjunath Shetty H
Hi all, Basically my use case is to validate the DataFrame row count before and after writing to HDFS. Is this even a good practice? Or should I rely on Spark for guaranteed writes? If it is a good practice to follow, then how do I get the DataFrame-level write metrics? Any pointers would
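
A minimal sketch of the count-before/count-after validation described above, with placeholder paths; note that reading the data back costs an extra scan. If that is too expensive, task-level output metrics such as recordsWritten are also exposed through a SparkListener, but the simple version looks like this:

```scala
import org.apache.spark.sql.SparkSession

object ValidateWriteCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("validate-write").getOrCreate()

    val df = spark.read.parquet("hdfs:///data/source")   // placeholder input
    val expected = df.count()                            // rows we intend to write

    df.write.mode("overwrite").parquet("hdfs:///data/target")

    // Re-read the written data and compare the counts.
    val actual = spark.read.parquet("hdfs:///data/target").count()
    require(actual == expected,
      s"row count mismatch: wrote $expected, read back $actual")
    spark.stop()
  }
}
```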

Re: Counting streaks in stateful structured streaming

2020-03-01 Thread Zahid Rahman
From this paragraph it appears the answer to your query is YES. Page 334 of Spark: The Definitive Guide states: "Stream processing is the act of *continuously incorporating new data* to compute a result. In stream processing, the input data is *unbounded* and has *no predetermined beginning or

Counting streaks in stateful structured streaming

2020-03-01 Thread nimrod
Hi all, I have a stream with a list of values and I want to count consecutive values over a period of 1 minute (even if it spans samples). Can it be done at all? Nimrod -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
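
A minimal sketch of a streak counter using mapGroupsWithState. The Event/StreakState/StreakResult types, the socket source, the keying by an id, the "same value repeated" streak rule, and the one-minute processing-time timeout are all illustrative assumptions, not details from the original post:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(key: String, value: Int)
case class StreakState(lastValue: Int, streak: Int)
case class StreakResult(key: String, streak: Int)

object StreakCounter {
  def updateStreak(key: String, events: Iterator[Event],
                   state: GroupState[StreakState]): StreakResult = {
    if (state.hasTimedOut) {
      // No new data for this key within a minute: drop the state, report zero.
      state.remove()
      StreakResult(key, 0)
    } else {
      var s = state.getOption.getOrElse(StreakState(Int.MinValue, 0))
      events.foreach { e =>
        // Extend the streak when the value repeats, otherwise restart it.
        s = if (e.value == s.lastValue) s.copy(streak = s.streak + 1)
            else StreakState(e.value, 1)
      }
      state.update(s)
      state.setTimeoutDuration("1 minute")  // state expires after a minute of inactivity
      StreakResult(key, s.streak)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("streaks").master("local[*]").getOrCreate()
    import spark.implicits._

    // Socket source used purely as a stand-in input; lines like "k1,5".
    val events = spark.readStream.format("socket")
      .option("host", "localhost").option("port", "9999").load()
      .as[String]
      .map { line => val Array(k, v) = line.split(","); Event(k, v.trim.toInt) }

    val streaks = events
      .groupByKey(_.key)
      .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateStreak _)

    streaks.writeStream.outputMode("update").format("console").start().awaitTermination()
  }
}
```

mapGroupsWithState requires the update output mode, and the per-key state here is only the last value and the current streak length, so it stays small regardless of how much data flows through.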