[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks what it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks what it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone know what it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks what it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone know what it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{}} {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} {{}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{}} {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} {{}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {{}} > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > {{}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {{RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the really data it came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > schema = StructType([ > StructField("node", StringType()) > ]) > {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} > {{rdd = rdd.map(lambda obj: \{'node': obj})}} > {{df_node = spark.createDataFrame(rdd, schema=schema)}} > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} > {{def udf_match(word: pd.Series) -> pd.Series:}} > {{ my_Series = pd_fname.squeeze() # dataframe to Series}} > {{ num = int(my_Series.str.contains(word.array[0]).sum())}} > return pd.Series(num) > {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} > Hi, I have two dataframe, and I try above method, however, I get this > {{RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the really data it came. I checked the data, can't figure out, does anyone thinks where it cause? was: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the really data it came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > schema = StructType([ > StructField("node", StringType()) > ]) > {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} > {{rdd = rdd.map(lambda obj: \{'node': obj})}} > {{df_node = spark.createDataFrame(rdd, schema=schema)}} > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} > {{def udf_match(word: pd.Series) -> pd.Series:}} > {{ my_Series = pd_fname.squeeze() # dataframe to Series}} > {{ num = int(my_Series.str.contains(word.array[0]).sum())}} > return pd.Series(num) > {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} > Hi, I have two dataframe, and I try above method, however, I get this > {{RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the really data it came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the really data it came. I checked the data, can't figure out, does anyone thinks where it cause? was: {{schema = StructType([ StructField("node", StringType()) ])}} {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} {{}} {{}} {{df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas()}} {{}} {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze()# dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} {{return pd.Series(num)}} {{}} {{}} {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the really data it came. I checked the data, can't figure out, does anyone thinks where it cause? > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > schema = StructType([ > StructField("node", StringType()) > ]) > {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} > {{rdd = rdd.map(lambda obj: \{'node': obj})}} > {{df_node = spark.createDataFrame(rdd, schema=schema)}} > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} > {{def udf_match(word: pd.Series) -> pd.Series:}} > {{ my_Series = pd_fname.squeeze() # dataframe to Series}} > {{ num = int(my_Series.str.contains(word.array[0]).sum())}} > return pd.Series(num) > {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} > Hi, I have two dataframe, and I try above method, however, I get this > {{RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the really data it came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org