[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liu updated SPARK-37325:
------------------------
Description: 
{code:java}
schema = StructType([
    StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # DataFrame to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
Hi, I have two DataFrames, and when I try the method above, I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code}
I would be really thankful for any help.

PS: I think the method itself is fine; I created sample data and verified it successfully. However, the error appears when I use the real data. I checked the data but can't figure it out. Does anyone know what causes it?
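For context on the error: a SCALAR pandas_udf receives each batch of input rows as a pd.Series and must return a Series of the same length, one result per input row. The UDF above returns pd.Series(num), which always has length 1, so any batch with more than one row (here 100) triggers the RuntimeError. A minimal pandas-only sketch of the per-row counting logic such a UDF would need, where fname_series and the sample values are hypothetical stand-ins for pd_fname['fname']:

```python
import pandas as pd

# Hypothetical stand-in for pd_fname['fname'] from the report.
fname_series = pd.Series(["alpha.txt", "beta.txt", "alphabet.csv"])

def match_counts(words: pd.Series) -> pd.Series:
    # One count per input row: the output length equals the input
    # length, which is what a SCALAR pandas_udf requires.
    return words.apply(
        lambda w: int(fname_series.str.contains(w, regex=False).sum())
    )

batch = pd.Series(["alpha", "beta", "gamma"])
print(list(match_counts(batch)))  # -> [2, 1, 0]
```

This would also explain why small sample data appeared to work: with a single-row batch, the length-1 result happens to match the expected length.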
> Result vector from pandas_udf was not the required length
> ---------------------------------------------------------
>
>                 Key: SPARK-37325
>                 URL: https://issues.apache.org/jira/browse/SPARK-37325
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>       Environment: 1
>          Reporter: liu
>          Priority: Major
>
> {code:java}
> schema = StructType([
>     StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>     my_Series = pd_fname.squeeze()  # DataFrame to Series
>     num = int(my_Series.str.contains(word.array[0]).sum())
>     return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>
> Hi, I have two DataFrames, and when I try the method above, I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length:
> expected 100, got 1{code}
> I would be really thankful for any help.
>
> PS: I think the method itself is fine; I created sample data and verified it successfully. However, the error appears when I use the real data. I checked the data but can't figure it out. Does anyone know what causes it?

-- 
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org