RE: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-16 Thread Stevens, Clay
Perhaps you also need to upgrade Scala?

Clay Stevens

From: Hanyu Huang 
Sent: Wednesday, 15 November, 2023 1:15 AM
To: user@spark.apache.org
Subject: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

Our job originally ran on Spark 3.3.1 and Apache Iceberg 1.2.0. Since we 
upgraded to Spark 3.4.1 and Iceberg 1.3.1, jobs have started to fail 
frequently. We tried upgrading only Iceberg without upgrading Spark, and the 
job did not report an error.


Detailed description:

When we execute this function, which writes data to the Iceberg table:

def appendToIcebergTable(targetTable: String, df: DataFrame): Unit = {
  _logger.warn(s"Append data to $targetTable")

  val (targetCols, sourceCols) = matchDFSchemaWithTargetTable(targetTable, df)
  df.createOrReplaceTempView("_temp")
  spark.sql(s"""
    INSERT INTO $targetTable ($targetCols) SELECT $sourceCols FROM _temp
  """)
  _logger.warn(s"Done append data to $targetTable")
  getIcebergLastAppendCountVerbose(targetTable)
}


The error is reported as follows:

Caused by: java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:208)
  at org.apache.spark.sql.execution.ColumnarToRowExec.<init>(Columnar.scala:72)
  ... 191 more


Reading the Spark source code, we found where the assertion fails:


case class ColumnarToRowExec(child: SparkPlan) extends ColumnarToRowTransition
  with CodegenSupport {
  // supportsColumnar requires to be only called on driver side, see also SPARK-37779.
  assert(Utils.isInRunningSparkTask || child.supportsColumnar)

  override def output: Seq[Attribute] = child.output

  override def outputPartitioning: Partitioning = child.outputPartitioning

  override def outputOrdering: Seq[SortOrder] = child.outputOrdering

But we can't find the root cause, so we are seeking help from the community. 
If more log information is required, please let me know.

Thanks


Efficient cosine similarity computation

2019-09-23 Thread Stevens, Clay
There are several ways I can compute the cosine similarity between a Spark ML 
vector and each ML vector in a Spark DataFrame column, then sort for the 
highest results.  However, I can't come up with a method that is faster than 
replacing the `/data/` in a Spark ML Word2Vec model and then using 
`.findSynonyms()`.  The problem is that the Word2Vec model is held entirely in 
the driver, which can cause memory issues if the data set I want to compare 
against gets too big.

1. Is there a more efficient method than the ones I have shown below?
2. Could the data for the Word2Vec model be distributed across the cluster?
3. Could the `.findSynonyms()` [Scala 
code](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L571toL619)
 be modified to make a Spark SQL function that can operate efficiently over a 
whole Spark DataFrame?


Methods I have tried:

#1 rdd function:
```
# vecIn = vector of same dimensions as 'vectors' column
def cosSim(row, vecIn):
    return tuple(
        Vectors.dense(
            Vectors.dense(row.vectors.dot(vecIn)) /
            (Vectors.dense(np.sqrt(row.vectors.dot(row.vectors))) *
             Vectors.dense(np.sqrt(vecIn.dot(vecIn))))
        ).toArray().tolist())

df.rdd.map(lambda row: cosSim(row, vecIn)).toDF(['CosSim']).show(truncate=False)
```
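
The per-row computation above can be checked against a plain NumPy version. 
This is only an illustrative sketch of the same cosine-similarity math; 
`vecs` and `vec_in` are stand-ins for the DataFrame's `vectors` column and 
the query vector, not names from the code above:

```python
import numpy as np

def cos_sim(v, vec_in):
    # cosine similarity: dot product divided by the product of the L2 norms
    return float(np.dot(v, vec_in) /
                 (np.sqrt(np.dot(v, v)) * np.sqrt(np.dot(vec_in, vec_in))))

vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # stand-in for the column
vec_in = np.array([1.0, 0.0])                          # stand-in query vector
sims = [cos_sim(v, vec_in) for v in vecs]
# an identical vector scores 1.0, an orthogonal one scores 0.0
```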

#2  `.toIndexedRowMatrix().columnSimilarities()` then filter the results (not 
shown):

```
spark.createDataFrame(
    IndexedRowMatrix(df.rdd.map(lambda row: row.vectors.toArray()))
        .toBlockMatrix()
        .transpose()
        .toIndexedRowMatrix()
        .columnSimilarities()
        .entries)
```
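
The transpose in method #2 is needed because `columnSimilarities()` computes 
cosine similarities between matrix *columns*; transposing makes the original 
row vectors into columns. A NumPy sketch of the underlying math (illustrative 
data, not the distributed `columnSimilarities` implementation itself):

```python
import numpy as np

# three row vectors; rows 0 and 1 are parallel, so their cosine similarity is 1
rows = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 1.0]])

# normalize every row to unit length; the Gram matrix then holds
# all pairwise cosine similarities between the original rows
unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
pairwise = unit @ unit.T  # pairwise[i, j] = cos(rows[i], rows[j])
```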

#3 replace Word2Vec model `/data/` with my own, then load 'revised' model and 
use `.findSynonyms()`:
```
df_words_vectors.schema
## StructType(List(StructField(word,StringType,true),StructField(vector,ArrayType(FloatType,true),true)))

df_words_vectors.write.parquet("exiting_Word2Vec_model/data/", mode='overwrite')

new_Word2Vec_model = Word2VecModel.load("exiting_Word2Vec_model")

## vecIn = vector of same dimensions as 'vector' column in DataFrame saved over Word2Vec model /data/
new_Word2Vec_model.findSynonyms(vecIn, 20).show()
```
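
What makes method #3 fast is that `findSynonyms` boils down to one 
matrix–vector product against the word-vector matrix followed by a top-k 
sort, all on the driver, which is also why it is driver-memory-bound. A NumPy 
sketch of that idea (names and data are illustrative, not the MLlib 
internals):

```python
import numpy as np

def find_synonyms(word_vecs, words, vec_in, k):
    # cosine similarity of the query against every word vector, then top-k
    unit_rows = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    unit_q = vec_in / np.linalg.norm(vec_in)
    sims = unit_rows @ unit_q          # one matrix-vector product
    top = np.argsort(-sims)[:k]        # indices of the k highest similarities
    return [(words[i], float(sims[i])) for i in top]

words = ["apple", "pear", "car"]
word_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
synonyms = find_synonyms(word_vecs, words, np.array([1.0, 0.0]), 2)
```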


Clay Stevens