Re: Cannot read case-sensitive Glue table backed by Parquet

2020-01-17 Thread oripwk
Sorry, but my original solution is incorrect:

1. Glue Crawlers are not supposed to set the spark.sql.sources.schema.*
properties; Spark SQL should. In Spark 2.4 the default for
spark.sql.hive.caseSensitiveInferenceMode is INFER_AND_SAVE, which means that
Spark infers the schema from the underlying files and alters the table to add
the spark.sql.sources.schema.* properties to SERDEPROPERTIES. In our case,
Spark failed to do so because of an "IllegalArgumentException: Can not create a
Path from an empty string" exception, which occurs because the Hive Database
class instance has an empty locationUri property string, which in turn is
because the Glue database has no Location property set. Once the schema is
saved, Spark reads it from the table.
2. There could be a way around this: setting INFER_ONLY, which should only
infer the schema from the files and not attempt to alter the table
SERDEPROPERTIES (see the configuration sketch below). However, this doesn't
work because of a Spark bug, where the inferred schema is then lowercased [1].

[1]
https://github.com/apache/spark/blob/c1b6fe479649c482947dfce6b6db67b159bd78a3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L284
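
For context, here is a minimal sketch (Scala, Spark 2.4) of where this setting
is configured; the application and table names are made up for illustration:

// Hedged sketch: choosing the schema-inference mode discussed above.
// INFER_AND_SAVE (the 2.4 default) infers the case-sensitive schema from the
// Parquet files and tries to save it back to the metastore table;
// INFER_ONLY infers it per session without altering the table.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("glue-case-sensitivity")
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
  .getOrCreate()

// With a Glue database that has no Location, INFER_AND_SAVE fails with the
// "Can not create a Path from an empty string" error, and INFER_ONLY runs into
// the lowercasing bug [1], so neither mode helped in our case.
spark.table("my_database.my_table").printSchema()   // table name is hypothetical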







Re: Cannot read case-sensitive Glue table backed by Parquet

2020-01-17 Thread oripwk



This bug happens because the Glue table's SERDEPROPERTIES is missing two
important properties:

spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0

To solve the problem, I had to add those two properties via the Glue console
(I couldn't do it with ALTER TABLE …).

I guess this is a bug with Glue crawlers, which do not set these properties
when creating the table.
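
For what it's worth, here is a rough Scala sketch of how the values for those
two properties could be generated from the Parquet files themselves and then
pasted into the Glue console; the S3 path is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("glue-schema-props").getOrCreate()

// Read the table's Parquet files directly to recover the case-sensitive schema.
// The path is hypothetical; use the table's actual S3 location.
val schemaJson = spark.read.parquet("s3://my-bucket/my-table/").schema.json

// For a schema small enough to fit into a single part:
//   spark.sql.sources.schema.numParts = 1
//   spark.sql.sources.schema.part.0   = <the JSON printed below>
// (Spark splits very large schemas across part.0, part.1, ... and sets
// numParts accordingly.)
println(schemaJson)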







Cannot read case-sensitive Glue table backed by Parquet

2020-01-16 Thread oripwk
Spark version: 2.4.2 on Amazon EMR 5.24.0

I have a Glue Catalog table backed by an S3 Parquet directory. The Parquet
files have case-sensitive column names (like lastModified). No matter what I
do, I get lowercase column names (lastmodified) when reading the Glue Catalog
table with Spark:
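
For illustration, a read along these lines (database and table names are
hypothetical) shows the lowercased schema:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("glue-table-read")
  .enableHiveSupport()
  .getOrCreate()

// Reading through the Glue/Hive catalog returns lowercase column names
// (lastmodified), even though the Parquet files contain lastModified.
spark.table("my_database.my_table").printSchema()

// Reading the same files directly from S3 preserves the case-sensitive names.
spark.read.parquet("s3://my-bucket/my-table/").printSchema()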



[1]
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
[2] https://spark.apache.org/docs/latest/sql-data-sources-parquet.html







Re: How to reduceByKeyAndWindow in Structured Streaming?

2018-07-30 Thread oripwk
Thanks guys, it really helps.






Using Spark Streaming for analyzing changing data

2018-07-30 Thread oripwk


We have a use case with a stream of events, where every event has an ID, its
current state, and a timestamp:

…
111,ready,1532949947
111,offline,1532949955
111,ongoing,1532949955
111,offline,1532949973
333,offline,1532949981
333,ongoing,1532949987
…

We want to ask questions about the current state of the *whole dataset*,
from the beginning of time, such as:
  "how many items are now in ongoing state"

(but bear in mind that there are more complicated questions, and all of them
are asking about the _current_ state of the dataset, from the beginning of
time)

I haven't found any simple, performant way of doing it.

The ways I've found are:
1. Using mapGroupsWithState, where I groupByKey on the ID and always keep the
state of the latest event by timestamp (a sketch of this approach follows
below)
2. Using groupByKey on the ID and keeping only the event whose timestamp is the
latest

Neither method is good: the first involves state, which means checkpointing,
memory, etc., and the second involves shuffling and sorting.
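
As mentioned above, a minimal sketch of approach 1 (Scala, Structured
Streaming); the socket source and the field names are assumptions for
illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(id: String, state: String, ts: Long)

val spark = SparkSession.builder().appName("latest-state-per-id").getOrCreate()
import spark.implicits._

// Parse lines like "111,ready,1532949947"; the source is illustrative.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String]
  .map { line =>
    val Array(id, state, ts) = line.split(",")
    Event(id, state, ts.toLong)
  }

// For each ID, keep whichever event has the newest timestamp.
val latest = events
  .groupByKey(_.id)
  .mapGroupsWithState[Event, Event](GroupStateTimeout.NoTimeout) {
    (id: String, batch: Iterator[Event], state: GroupState[Event]) =>
      val newest = (state.getOption.iterator ++ batch).maxBy(_.ts)
      state.update(newest)
      newest
  }

// A dashboard query such as "how many items are currently ongoing" could then
// be served from whatever sink this stream is written to.
val query = latest.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .start()

query.awaitTermination()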

We will have a lot of such queries in order to populate a real-time
dashboard.

I wonder, as a general question, what is the correct way to process this
type of data in Spark Streaming?







How to reduceByKeyAndWindow in Structured Streaming?

2018-06-28 Thread oripwk
In Structured Streaming, there's the notion of event-time windowing:



However, this is not quite the same as DStreams' windowing operations: in
Structured Streaming, windowing groups the data into fixed time windows, and
every event in a time window is associated with its group:
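
For reference, a hedged sketch of that fixed-window grouping (Scala; the rate
source and column names are assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("windowed-counts").getOrCreate()
import spark.implicits._

// The built-in rate source emits columns (timestamp, value).
val events = spark.readStream
  .format("rate").option("rowsPerSecond", 10).load()

// Count events per fixed 10-minute event-time window; every event falls into
// exactly one window group.
val windowedCounts = events
  .groupBy(window($"timestamp", "10 minutes"))
  .count()

windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()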


In DStreams, on the other hand, it simply outputs all the data within a
limited time window (the last 10 minutes, for example).
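
For contrast, a hedged sketch of that DStream-style behaviour: a sliding count
over the last 10 minutes, recomputed every minute (the socket source and the
keying are assumptions for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-window")
val ssc = new StreamingContext(conf, Seconds(10))

// Key each "id,state,timestamp" line by its state field and count per key
// over a 10-minute window that slides every minute.
val counts = ssc.socketTextStream("localhost", 9999)
  .map(line => (line.split(",")(1), 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(10), Minutes(1))

counts.print()
ssc.start()
ssc.awaitTermination()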

The question was also asked here, if that makes it clearer.

How can the latter be achieved in Structured Streaming?


