Hey Hamish,
I don't think there is an 'automatic fix' for this problem ...
Are you reading those as partitions of a single dataset, or are you
processing them individually?
Since your incoming data is apparently not stable, you should implement a
preprocessing step on each file to check and, if
Hi there,
I have an HDFS directory with thousands of files. It seems that some of
them - and I don't know which ones - have a problem with their schema,
and it's causing my Spark application to fail with this error:
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet
column
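One way to locate the offending files is to try reading each one on its own and collect the failures. A minimal sketch: the scanner below is plain Python and takes any reader callable that raises on a bad file — with PySpark (assumed, not shown here) that callable could be something like `lambda p: spark.read.parquet(p).schema`.

```python
# Sketch: partition a list of file paths into readable / unreadable by
# attempting to read each one individually. `read_fn` is any callable
# that raises on a bad file (e.g. a per-file Parquet schema read).
def find_bad_files(paths, read_fn):
    good, bad = [], []
    for path in paths:
        try:
            read_fn(path)            # force schema resolution for this file
            good.append(path)
        except Exception as exc:     # a schema mismatch surfaces here
            bad.append((path, str(exc)))
    return good, bad
```

Once the bad files are identified, you can quarantine or repair them before reading the whole directory as a single dataset.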
Hi,
When I run the same word count program from the command line and from the
IDE, I get two different logging messages.
The command-line execution even says BUILD FAILED, although the program
actually succeeds, producing the word count directory and the resulting
output file.
I have
Hi all,
Basically my use case is to validate the DataFrame row count before and
after writing to HDFS. Is this even a good practice? Or should I rely on
Spark for guaranteed writes?
If it is a good practice to follow, then how do I get the DataFrame-level
write metrics?
Any pointers would
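A simple version of this check is to count the rows before the write, read the written data back, and compare. A minimal sketch, written here as a generic helper over three callables so it stays self-contained; the commented Spark calls are the assumed real usage, not shown running:

```python
# Sketch of a count-based write validation. With Spark the callables
# would be (assumed, illustrative):
#   count_in_fn  = df.count
#   write_fn     = lambda: df.write.mode("overwrite").parquet(path)
#   count_out_fn = lambda: spark.read.parquet(path).count()
def validated_write(count_in_fn, write_fn, count_out_fn):
    n_in = count_in_fn()     # rows about to be written
    write_fn()               # perform the write
    n_out = count_out_fn()   # rows read back from the target
    if n_in != n_out:
        raise ValueError(
            f"row count mismatch: {n_in} written, {n_out} read back")
    return n_out
```

Note that each `count()` triggers an extra Spark job, so if the input DataFrame is expensive to recompute, caching it before the first count would avoid computing it twice.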
From this paragraph it appears the answer to your query is YES.
Page 334 of Spark: The Definitive Guide states:
"Stream processing is the act of *continuously incorporating new data* to
compute a result. In stream processing,
the input data is *unbounded* and has *no predetermined beginning or
Hi all,
I have a stream with a list of values, and I want to count consecutive
values over a period of 1 minute (even if a run spans samples). Can it be
done at all?
Nimrod
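The run-counting part can be sketched in plain Python: the longest run of consecutive equal values in a sequence. To make it work across micro-batches in Structured Streaming you would keep this counter as per-key state (e.g. with arbitrary stateful processing such as `flatMapGroupsWithState`), keyed by a 1-minute event-time window — that wiring is assumed here, not shown.

```python
# Longest run of consecutive equal values in a sequence. In a streaming
# job, `cur`/`prev` would live in the per-key state that survives
# between micro-batches (assumed, not shown here).
def longest_run(values):
    best = cur = 0
    prev = object()                 # sentinel: equals nothing
    for v in values:
        cur = cur + 1 if v == prev else 1
        best = max(best, cur)
        prev = v
    return best
```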
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/