Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17902#discussion_r141120298 --- Diff: common/kvstore/src/main/java/org/apache/spark/kvstore/LevelDBTypeInfo.java --- @@ -0,0 +1,516 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.kvstore; + +import java.lang.reflect.Array; +import java.lang.reflect.Field; +import java.lang.reflect.Method; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.util.Collection; +import java.util.HashMap; +import java.util.Map; +import static java.nio.charset.StandardCharsets.UTF_8; + +import com.google.common.base.Preconditions; +import com.google.common.base.Throwables; +import org.iq80.leveldb.WriteBatch; + +/** + * Holds metadata about app-specific types stored in LevelDB. Serves as a cache for data collected + * via reflection, to make it cheaper to access it multiple times. + * + * <p> + * The hierarchy of keys stored in LevelDB looks roughly like the following. This hierarchy ensures + * that iteration over indices is easy, and that updating values in the store is not overly + * expensive. Of note, indices choose using more disk space (one value per key) instead of keeping + * lists of pointers, which would be more expensive to update at runtime. + * </p> + * + * <p> + * Indentation defines when a sub-key lives under a parent key. In LevelDB, this means the full + * key would be the concatenation of everything up to that point in the hierarchy, with each + * component separated by a NULL byte. + * </p> + * + * <pre> + * +TYPE_NAME + * NATURAL_INDEX + * +NATURAL_KEY + * - + * -NATURAL_INDEX + * INDEX_NAME + * +INDEX_VALUE + * +NATURAL_KEY + * -INDEX_VALUE + * .INDEX_VALUE + * CHILD_INDEX_NAME + * +CHILD_INDEX_VALUE + * NATURAL_KEY_OR_DATA + * - + * -INDEX_NAME + * </pre> + * + * <p> + * Entity data (either the entity's natural key or a copy of the data) is stored in all keys + * that end with "+<something>". A count of all objects that match a particular top-level index + * value is kept at the end marker ("-<something>"). A count is also kept at the natural index's end + * marker, to make it easy to retrieve the number of all elements of a particular type. + * </p> + * + * <p> + * To illustrate, given a type "Foo", with a natural index and a second index called "bar", you'd + * have these keys and values in the store for two instances, one with natural key "key1" and the + * other "key2", both with value "yes" for "bar": + * </p> + * + * <pre> + * Foo __main__ +key1 [data for instance 1] + * Foo __main__ +key2 [data for instance 2] + * Foo __main__ - [count of all Foo] + * Foo bar +yes +key1 [instance 1 key or data, depending on index type] --- End diff -- Because otherwise, when reading from the index, it's not that easy to parse the object's key from the leveldb key so that you can retrieve the object itself. You also have to store something, leveldb doesn't allow you to store null.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org