[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two DataFrames, and when I try the method above I get this: {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} I would really appreciate any help. PS: I think the method itself is fine; I created sample data of the same shape and it verified successfully, but with the real data the error appears. I checked the data and can't figure out what causes it. Does anyone know the cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause?
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks what it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
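A likely explanation and a minimal sketch of a fix (hedged: inferred from the error message and the pandas_udf contract, not verified against the reporter's data): a scalar pandas_udf receives each batch of rows as a pd.Series and must return a pd.Series of the same length, one value per input row. pd.Series(num) always has length 1, so it only happens to pass when a batch contains a single row, which would explain why small sample data worked while the real data, batched at 100 rows, raised "expected 100, got 1". Computing one count per element keeps the lengths aligned; df_node and pd_fname are assumed to exist exactly as in the report:
{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

# pd_fname is the driver-side pandas DataFrame from the report.
my_series = pd_fname.squeeze()  # DataFrame -> Series of fnames

@pandas_udf(IntegerType())
def udf_match(words: pd.Series) -> pd.Series:
    # Return one count per input word so the output length matches the batch length.
    return words.map(lambda w: int(my_series.str.contains(w).sum()))

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
The same idea also works with the older @pandas_udf(IntegerType(), PandasUDFType.SCALAR) form; the type-hint style above is just the Spark 3.x-preferred spelling.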
[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972 ] liu edited comment on SPARK-37325 at 11/17/21, 7:19 AM: [~hyukjin.kwon] I don't think it will work. I have to make it clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want to use {code:java} @pandas_udf{code} for it's faster, however, it has some unknown error. I can solve it by {code:java} @udf{code} like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} was (Author: JIRAUSER280209): [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want to use {code:java} @pandas_udf{code} for it's faster, however, it has some unknown error. I can solve it by {code:java} @udf{code} like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
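An aside on the speed concern, not taken from the thread: in both the @udf and @pandas_udf variants the lookup data (pd_fname) is captured in the function's closure and shipped with every task. Broadcasting it once is the usual way to trim that overhead. A rough sketch under the same assumptions as above (spark, df_node and pd_fname already defined as in the report):
{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

# Ship the lookup Series to the executors once instead of with every task.
bc_fnames = spark.sparkContext.broadcast(pd_fname.squeeze())

@pandas_udf(IntegerType())
def udf_match(words: pd.Series) -> pd.Series:
    fnames = bc_fnames.value  # pandas Series of fnames, already on the executor
    return words.map(lambda w: int(fnames.str.contains(w).sum()))

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}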
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone know what it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks what it cause? 
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone know what it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972 ] liu edited comment on SPARK-37325 at 11/17/21, 7:18 AM: [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want to use {code:java} @pandas_udf{code} for it's faster, however, it has some unknown error. I can solve it by {code:java} @udf{code} like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} was (Author: JIRAUSER280209): [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want to use {code:java} @pandas_udf{code} for it's faster, however, it has some unknown error. I can solve it by {code:java} @udf{code} like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972 ] liu edited comment on SPARK-37325 at 11/17/21, 7:16 AM: [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want to use {code:java} @pandas_udf{code} for it's faster, however, it has some unknown error. I can solve it by {code:java} @udf{code} like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} was (Author: JIRAUSER280209): [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want to use {code:java} @pandas_udf{code} for it's faster, however, it have some unknown error. I can solve it by {code:java} @udf{code} like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? 
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972 ] liu edited comment on SPARK-37325 at 11/17/21, 7:14 AM: [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want to use {code:java} @pandas_udf{code} for it's faster, however, it have some unknown error. I can solve it by {code:java} @udf{code} like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} was (Author: JIRAUSER280209): [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have to dataframe, one spark dataframe and one pd.dataframe, I want to use @pandas_udf for it's faster, however, it have some unknown error. I can solve it by @udf like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? 
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972 ] liu commented on SPARK-37325: - [~hyukjin.kwon] I don't think it will work. I have to make the questions clear, I have to dataframe, one spark dataframe and one pd.dataframe, I want to use @pandas_udf for it's faster, however, it have some unknown error. I can solve it by @udf like below, but it's slow: {code:java} @udf(returnType=IntegerType()) def udf_match1(word): my_Series = pd_fname.squeeze() # dataframe to Series num = my_Series.str.contains(word).sum() return int(num){code} > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > schema = StructType([ > StructField("node", StringType()) > ]) > {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} > {{rdd = rdd.map(lambda obj: \{'node': obj})}} > {{df_node = spark.createDataFrame(rdd, schema=schema)}} > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} > {{def udf_match(word: pd.Series) -> pd.Series:}} > {{ my_Series = pd_fname.squeeze() # dataframe to Series}} > {{ num = int(my_Series.str.contains(word.array[0]).sum())}} > return pd.Series(num) > {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} > Hi, I have two dataframe, and I try above method, however, I get this > {{RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{}} {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} {{}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? 
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{}} {code:java} RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1{code} {{}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? 
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {{}} > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > {{}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: {code:java} schema = StructType([ StructField("node", StringType()) ]) rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") rdd = rdd.map(lambda obj: {'node': obj}) df_node = spark.createDataFrame(rdd, schema=schema) df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() @pandas_udf(IntegerType(), PandasUDFType.SCALAR) def udf_match(word: pd.Series) -> pd.Series: my_Series = pd_fname.squeeze() # dataframe to Series num = int(my_Series.str.contains(word.array[0]).sum()) return pd.Series(num) df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) {code} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? 
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > > Hi, I have two dataframe, and I try above method, however, I get this > {{RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37353) Resolved attribute(s) region_id#7993 missing from lm_tam_code#2666,ld_month#2697,geo_sub_district_id#2677 when doing left join
Anubhav Tarar created SPARK-37353: - Summary: Resolved attribute(s) region_id#7993 missing from lm_tam_code#2666,ld_month#2697,geo_sub_district_id#2677 when doing left join Key: SPARK-37353 URL: https://issues.apache.org/jira/browse/SPARK-37353 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0 Reporter: Anubhav Tarar For this Code i m getting error var df = totalTimeFootfallDf.join(dwellTimeDf,Seq(groupByKeys: _*),"left_outer") org.apache.spark.sql.AnalysisException: Resolved attribute(s) region_id#7993 missing from lm_tam_code#2666,ld_month#2697,geo_sub_district_id#2677,lm_subldm_id#2652,geo_province_name_th#2689,lm_amp_code#2665,lm_cat_name_t#2675,lm_prov_code#2664,lm_sub_name_t#2672,lm_name_branch#2694,region_id#422,lm_update_dttm#2669,mkt_region_name_en#2692,lm_radius#369,geo_province_id#2679,lm_name#2655,geo_province_name_en#2684,lm_name_fld_t#2656,geo_region_name_en#2686,geo_district_name_th#2688,geo_sub_region_id#2680,ld_day#2698,lm_master#2653,lm_location_t#2660,lm_zone#2668,lm_latitude#2671,lm_branch_t#2658,geo_sub_district_name_th#2687,lm_cat_id#2674,geo_district_id#2678,lm_symbol#2649,lm_id#2648,lm_name_t#2654,lm_cat_name#2676,geo_district_name_en#2683,lm_name_fld#2657,lm_master_branch#2663,lm_longitude#2670,lm_branch#2659,mkt_sub_region_id#2690,lm_sub_cat_id#2650,geo_sub_region_name_en#2685,lm_sub_cat_name#316,mkt_sub_region_name_en#2691,grid_id#2695,lm_floor#2667,lm_location#2661,geo_sub_district_name_en#2682,lm_master_branch_t#2662,data_date#2693,lm_ldmtag#2651,ld_year#2696 in operator !Project [lm_id#2648, lm_symbol#2649, lm_sub_cat_id#2650, lm_ldmtag#2651, lm_subldm_id#2652, lm_master#2653, lm_name_t#2654, lm_name#2655, lm_name_fld_t#2656, lm_name_fld#2657, lm_branch_t#2658, lm_branch#2659, lm_location_t#2660, lm_location#2661, lm_master_branch_t#2662, lm_master_branch#2663, lm_prov_code#2664, lm_amp_code#2665, lm_tam_code#2666, lm_floor#2667, lm_zone#2668, lm_update_dttm#2669, lm_longitude#2670, lm_latitude#2671, ... 28 more fields]. Attribute(s) with the same name appear in the operation: region_id. Please check if the right attribute(s) are used.;; Join LeftOuter -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
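A hedged note, not a confirmed fix for this particular plan: "Resolved attribute(s) ... missing" with duplicate-named attributes usually appears when both join inputs derive from the same parent plan, so a column reference resolves against the wrong lineage. The common workaround is to rebuild one side's columns so its attributes get fresh ids before joining. The report's snippet is Scala; the sketch below expresses the same idea in PySpark for consistency with the other examples, reusing the report's DataFrame and key names (groupByKeys assumed to be a list of join column names):
{code:python}
from pyspark.sql import functions as F

# Re-alias every column on one side so the analyzer sees fresh attribute ids
# instead of references tied to the shared parent plan.
dwell_fresh = dwellTimeDf.select([F.col(c).alias(c) for c in dwellTimeDf.columns])

df = totalTimeFootfallDf.join(dwell_fresh, groupByKeys, "left_outer")
{code}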
[jira] [Commented] (SPARK-37270) Incorrect result of filter using isNull condition

[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444956#comment-17444956 ] Apache Spark commented on SPARK-37270: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/34627 > Incorect result of filter using isNull condition > > > Key: SPARK-37270 > URL: https://issues.apache.org/jira/browse/SPARK-37270 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Tomasz Kus >Priority: Major > Labels: correctness > > Simple code that allows to reproduce this issue: > {code:java} > val frame = Seq((false, 1)).toDF("bool", "number") > frame > .checkpoint() > .withColumn("conditions", when(col("bool"), "I am not null")) > .filter(col("conditions").isNull) > .show(false){code} > Although "conditions" column is null > {code:java} > +-+--+--+ > |bool |number|conditions| > +-+--+--+ > |false|1 |null | > +-+--+--+{code} > empty result is shown. > Execution plans: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#252) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Optimized Logical Plan == > LocalRelation , [bool#124, number#125, conditions#252] > == Physical Plan == > LocalTableScan , [bool#124, number#125, conditions#252] > {code} > After removing checkpoint proper result is returned and execution plans are > as follow: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#256) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Optimized Logical Plan == > LocalRelation [bool#124, number#125, conditions#256] > == Physical Plan == > LocalTableScan [bool#124, number#125, conditions#256] > {code} > It seems that the most important difference is LogicalRDD -> LocalRelation > There are following ways (workarounds) to retrieve correct result: > 1) remove checkpoint > 2) add explicit .otherwise(null) to when > 3) add checkpoint() or cache() just before filter > 4) downgrade to Spark 3.1.2 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
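For readers hitting the same behavior, a sketch of workaround (2) from the list above — adding an explicit .otherwise — rendered in PySpark rather than the report's Scala (the checkpoint directory path is an arbitrary assumption):
{code:python}
from pyspark.sql import functions as F

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # required before checkpoint()

frame = spark.createDataFrame([(False, 1)], ["bool", "number"])

result = (frame
    .checkpoint()
    # Workaround (2): spell out the else branch instead of relying on the implicit null.
    .withColumn("conditions", F.when(F.col("bool"), "I am not null").otherwise(F.lit(None)))
    .filter(F.col("conditions").isNull()))

result.show(truncate=False)  # returns the row with conditions = null as expected
{code}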
[jira] [Assigned] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage
[ https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37352: Assignee: Apache Spark > Silence the `index_col` advice in `to_spark()` for internal usage > - > > Key: SPARK-37352 > URL: https://issues.apache.org/jira/browse/SPARK-37352 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > The advice warning for `to_spark()` - when the `index_col` parameter is not > specified - issuing too much message e.g. when user runs the plotting > functions, so we want to silence the warning message when it's used as an > internal purpose. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
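For context, a hedged sketch of the user-facing advice in question (illustrative only; the exact message text and the versions that emit it are not taken from this ticket): pandas API on Spark advises passing index_col to to_spark() because the pandas index is otherwise lost, and this ticket is about not emitting that advice when to_spark() is called internally, e.g. from the plotting functions:
{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})

# Calling to_spark() without index_col can trigger the advice that the index is lost.
sdf_default = psdf.to_spark()

# Passing index_col keeps the index as a regular column and avoids the advice.
sdf = psdf.to_spark(index_col="index")
sdf.show()
{code}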
[jira] [Assigned] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage
[ https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37352: Assignee: (was: Apache Spark) > Silence the `index_col` advice in `to_spark()` for internal usage > - > > Key: SPARK-37352 > URL: https://issues.apache.org/jira/browse/SPARK-37352 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The advice warning for `to_spark()` - when the `index_col` parameter is not > specified - issuing too much message e.g. when user runs the plotting > functions, so we want to silence the warning message when it's used as an > internal purpose. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage
[ https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444953#comment-17444953 ] Apache Spark commented on SPARK-37352: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/34626 > Silence the `index_col` advice in `to_spark()` for internal usage > - > > Key: SPARK-37352 > URL: https://issues.apache.org/jira/browse/SPARK-37352 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The advice warning for `to_spark()` - when the `index_col` parameter is not > specified - issuing too much message e.g. when user runs the plotting > functions, so we want to silence the warning message when it's used as an > internal purpose. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liu updated SPARK-37325: Description: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the real data error came. I checked the data, can't figure out, does anyone thinks where it cause? was: schema = StructType([ StructField("node", StringType()) ]) {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} {{rdd = rdd.map(lambda obj: \{'node': obj})}} {{df_node = spark.createDataFrame(rdd, schema=schema)}} df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") pd_fname = df_fname.select('fname').toPandas() {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} {{def udf_match(word: pd.Series) -> pd.Series:}} {{ my_Series = pd_fname.squeeze() # dataframe to Series}} {{ num = int(my_Series.str.contains(word.array[0]).sum())}} return pd.Series(num) {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} Hi, I have two dataframe, and I try above method, however, I get this {{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}} it will be really thankful, if there is any helps PS: for the method itself, I think there is no problem, I create same sample data to verify it successfully, however, when I use the really data it came. I checked the data, can't figure out, does anyone thinks where it cause? 
> Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > schema = StructType([ > StructField("node", StringType()) > ]) > {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}} > {{rdd = rdd.map(lambda obj: \{'node': obj})}} > {{df_node = spark.createDataFrame(rdd, schema=schema)}} > df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}} > {{def udf_match(word: pd.Series) -> pd.Series:}} > {{ my_Series = pd_fname.squeeze() # dataframe to Series}} > {{ num = int(my_Series.str.contains(word.array[0]).sum())}} > return pd.Series(num) > {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}} > Hi, I have two dataframe, and I try above method, however, I get this > {{RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1}} > it will be really thankful, if there is any helps > > PS: for the method itself, I think there is no problem, I create same sample > data to verify it successfully, however, when I use the real data error came. > I checked the data, can't figure out, > does anyone thinks where it cause? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage
[ https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37352: Description: The advice warning for `to_spark()` - when the `index_col` parameter is not specified - issuing too much message e.g. when user runs the plotting functions, so we want to silence the warning message when it's used as an internal purpose. (was: The advice warning for `index_col` in `to_spark()` issuing too much message when user runs the plotting functions, so we want to silence the warning message when it's used as an internal purpose.) > Silence the `index_col` advice in `to_spark()` for internal usage > - > > Key: SPARK-37352 > URL: https://issues.apache.org/jira/browse/SPARK-37352 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The advice warning for `to_spark()` - when the `index_col` parameter is not > specified - issuing too much message e.g. when user runs the plotting > functions, so we want to silence the warning message when it's used as an > internal purpose. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage
Haejoon Lee created SPARK-37352: --- Summary: Silence the `index_col` advice in `to_spark()` for internal usage Key: SPARK-37352 URL: https://issues.apache.org/jira/browse/SPARK-37352 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Haejoon Lee The advice warning for `index_col` in `to_spark()` emits too many messages when a user runs the plotting functions, so we want to silence the warning when `to_spark()` is called for internal purposes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37279: Assignee: Apache Spark > Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs > --- > > Key: SPARK-37279 > URL: https://issues.apache.org/jira/browse/SPARK-37279 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Implements the support of DayTimeIntervalType in: > - Python UDFs > - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37279: Assignee: (was: Apache Spark) > Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs > --- > > Key: SPARK-37279 > URL: https://issues.apache.org/jira/browse/SPARK-37279 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of DayTimeIntervalType in: > - Python UDFs > - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444944#comment-17444944 ] Apache Spark commented on SPARK-37279: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34614 > Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs > --- > > Key: SPARK-37279 > URL: https://issues.apache.org/jira/browse/SPARK-37279 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of DayTimeIntervalType in: > - Python UDFs > - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
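Once this sub-task lands, Python `datetime.timedelta` values should map to DayTimeIntervalType in createDataFrame/toPandas (without Arrow) and in plain Python UDFs. The following is a rough sketch of the intended usage, not code from the linked pull request; it assumes an active SparkSession `spark` as in the other snippets in this digest.

{code:python}
import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DayTimeIntervalType

schema = StructType([StructField("delta", DayTimeIntervalType())])
df = spark.createDataFrame([(datetime.timedelta(days=1, hours=2),)], schema=schema)

# A plain Python UDF that receives and returns datetime.timedelta.
@F.udf(returnType=DayTimeIntervalType())
def double_interval(td):
    return td * 2

df.select(double_interval("delta").alias("doubled")).show(truncate=False)
pdf = df.toPandas()  # the interval column should come back as timedelta values
{code}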
[jira] [Assigned] (SPARK-37281) Support DayTimeIntervalType in Py4J
[ https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37281: Assignee: (was: Apache Spark) > Support DayTimeIntervalType in Py4J > --- > > Key: SPARK-37281 > URL: https://issues.apache.org/jira/browse/SPARK-37281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This PR adds support for DayTimeIntervalType in Py4J. For example, > functions.lit() on a DayTimeIntervalType value should work. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37281) Support DayTimeIntervalType in Py4J
[ https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444943#comment-17444943 ] Apache Spark commented on SPARK-37281: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34625 > Support DayTimeIntervalType in Py4J > --- > > Key: SPARK-37281 > URL: https://issues.apache.org/jira/browse/SPARK-37281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This PR adds support for DayTimeIntervalType in Py4J. For example, > functions.lit() on a DayTimeIntervalType value should work. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37281) Support DayTimeIntervalType in Py4J
[ https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37281: Assignee: Apache Spark > Support DayTimeIntervalType in Py4J > --- > > Key: SPARK-37281 > URL: https://issues.apache.org/jira/browse/SPARK-37281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > This PR adds support for DayTimeIntervalType in Py4J. For example, > functions.lit() on a DayTimeIntervalType value should work. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
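Py4J support here means a Python `datetime.timedelta` can be passed directly to `functions.lit` and become a DayTimeIntervalType literal. A hedged sketch of what should work once this lands, again assuming an active SparkSession `spark`:

{code:python}
import datetime
from pyspark.sql import functions as F

# lit() on a timedelta should produce a DayTimeIntervalType literal column.
df = spark.range(1).select(F.lit(datetime.timedelta(hours=3, minutes=30)).alias("delta"))
df.printSchema()  # delta: interval day to second
{code}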
[jira] (SPARK-37277) Support DayTimeIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37277 ] Hyukjin Kwon deleted comment on SPARK-37277: -- was (Author: apachespark): User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34614 > Support DayTimeIntervalType in Arrow > > > Key: SPARK-37277 > URL: https://issues.apache.org/jira/browse/SPARK-37277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of DayTimeIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-37277) Support DayTimeIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-37277: -- > Support DayTimeIntervalType in Arrow > > > Key: SPARK-37277 > URL: https://issues.apache.org/jira/browse/SPARK-37277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of DayTimeIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37277) Support DayTimeIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37277. -- Resolution: Fixed > Support DayTimeIntervalType in Arrow > > > Key: SPARK-37277 > URL: https://issues.apache.org/jira/browse/SPARK-37277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of DayTimeIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
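With the Arrow path in place, DayTimeIntervalType columns should surface as `timedelta64[ns]` values inside pandas UDFs and in Arrow-enabled toPandas. A minimal sketch under that assumption, not taken from the resolved pull request:

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DayTimeIntervalType

# A scalar pandas UDF over a DayTimeIntervalType column; each batch
# arrives as a pandas Series of timedelta64[ns] values.
@pandas_udf(DayTimeIntervalType())
def add_one_hour(deltas: pd.Series) -> pd.Series:
    return deltas + pd.Timedelta(hours=1)
{code}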
[jira] [Resolved] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37325. -- Resolution: Invalid > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > Hi, I have two dataframes, and when I try the method above I get this: > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > I would be really thankful for any help. > > PS: I think the method itself is fine; I created the same kind of sample > data and verified it successfully, but the error only appears with the real data. > I checked the data and can't figure it out. > Does anyone have an idea what causes it? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444938#comment-17444938 ] Hyukjin Kwon commented on SPARK-37348: -- The question should be how often it is used in the Python world, and how common it is. There are many expressions that do not exist in Python, Scala, etc. (e.g., regex_* expressions), and they are left unimplemented on purpose. > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Priority: Minor > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus are to use > F.expr("pmod(A, B)") or to create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither is optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
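Until a native function exists, the `F.expr` route from the ticket works today; the sketch below contrasts it with the `%` operator on a throwaway DataFrame (the column name is made up, and an active SparkSession `spark` is assumed). As a side remark, if I read it correctly the helper quoted above returns `divisor` instead of 0 when a negative dividend divides evenly; the usual workaround is `((dividend % divisor) + divisor) % divisor`.

{code:python}
from pyspark.sql import functions as F

df = spark.createDataFrame([(-1,), (7,)], ["a"])

df.select(
    "a",
    (F.col("a") % F.lit(2)).alias("remainder"),   # -1 % 2 -> -1
    F.expr("pmod(a, 2)").alias("modulus"),        # pmod(-1, 2) -> 1
).show()
{code}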
[jira] [Commented] (SPARK-37325) Result vector from pandas_udf was not the required length
[ https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444939#comment-17444939 ] Hyukjin Kwon commented on SPARK-37325: -- Again, please consider using DataFrame.mapInPandas. Questions are better directed to the mailing list. > Result vector from pandas_udf was not the required length > - > > Key: SPARK-37325 > URL: https://issues.apache.org/jira/browse/SPARK-37325 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: 1 >Reporter: liu >Priority: Major > > {code:java} > schema = StructType([ > StructField("node", StringType()) > ]) > rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt") > rdd = rdd.map(lambda obj: {'node': obj}) > df_node = spark.createDataFrame(rdd, schema=schema) > df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet") > pd_fname = df_fname.select('fname').toPandas() > @pandas_udf(IntegerType(), PandasUDFType.SCALAR) > def udf_match(word: pd.Series) -> pd.Series: > my_Series = pd_fname.squeeze() # dataframe to Series > num = int(my_Series.str.contains(word.array[0]).sum()) > return pd.Series(num) > df = df_node.withColumn("match_fname_num", udf_match(df_node["node"])) > {code} > Hi, I have two dataframes, and when I try the method above I get this: > {code:java} > RuntimeError: Result vector from pandas_udf was not the required length: > expected 100, got 1{code} > I would be really thankful for any help. > > PS: I think the method itself is fine; I created the same kind of sample > data and verified it successfully, but the error only appears with the real data. > I checked the data and can't figure it out. > Does anyone have an idea what causes it? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
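For reference, the `DataFrame.mapInPandas` approach suggested above avoids the length constraint entirely, because the mapped function may yield any number of rows per batch. The following is only a rough sketch of how the reporter's counting could be expressed with it; `fnames` is a made-up stand-in for the `pd_fname` lookup data and `df_node` refers to the DataFrame from the report.

{code:python}
import pandas as pd

fnames = pd.Series(["alpha_file", "beta_file", "alpha_beta"])  # stand-in lookup data

def match_fnames(batches):
    # batches is an iterator of pandas DataFrames, one per input batch.
    for pdf in batches:
        pdf["match_fname_num"] = pdf["node"].apply(
            lambda w: int(fnames.str.contains(w, regex=False).sum())
        )
        yield pdf

df = df_node.mapInPandas(match_fnames, schema="node string, match_fname_num int")
{code}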
[jira] [Commented] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes
[ https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444937#comment-17444937 ] Hyukjin Kwon commented on SPARK-37350: -- Spark 2.4.x is EOL. can you check if the same issue persists in Spark 3.x? > EventLoggingListener keep logging errors after hdfs restart all datanodes > - > > Key: SPARK-37350 > URL: https://issues.apache.org/jira/browse/SPARK-37350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1 >Reporter: Shefron Yudy >Priority: Major > > I saw the error in SparkThriftServer process's log when I restart all > datanodes of HDFS , The logs as follows: > {code:java} > 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] > scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception > java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] > are bad. Aborting... > at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) > {code} > The eventLog will be available normally if I restart the SparkThriftServer, I > suggest that the EventLoggingListener's dfs writer and hadoopDataStream > should reconnect after all datanodes stop and then start later。 > {code:java} > /** Log the event as JSON. */ > private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = > false) { > val eventJson = JsonProtocol.sparkEventToJson(event) > // scalastyle:off println > writer.foreach(_.println(compact(render(eventJson > // scalastyle:on println > if (flushLogger) { > writer.foreach(_.flush()) > hadoopDataStream.foreach(_.hflush()) > } > if (testing) { > loggedEvents += eventJson > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37334) pandas `convert_dtypes` method support
[ https://issues.apache.org/jira/browse/SPARK-37334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37334: - Parent: SPARK-37197 Issue Type: Sub-task (was: New Feature) > pandas `convert_dtypes` method support > -- > > Key: SPARK-37334 > URL: https://issues.apache.org/jira/browse/SPARK-37334 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Ali Amin-Nejad >Priority: Minor > > Support for the {{convert_dtypes}} method as part of the new pandas API in > pyspark? > [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
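For readers unfamiliar with the requested method: in pandas, `convert_dtypes()` re-infers each column into the newest nullable extension dtypes, and the ask here is to mirror that in pandas-on-Spark. A plain-pandas illustration of the behaviour being requested (not pyspark code):

{code:python}
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"]})
print(pdf.dtypes)                   # a: float64 (NaN forces float), b: object
print(pdf.convert_dtypes().dtypes)  # a: Int64 (nullable), b: string
{code}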
[jira] [Updated] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37348: - Parent: (was: SPARK-37197) Issue Type: Bug (was: Sub-task) > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Priority: Minor > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus is to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37348: - Labels: (was: newbie) > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Priority: Minor > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus is to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37348: - Parent: SPARK-37197 Issue Type: Sub-task (was: New Feature) > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Priority: Minor > Labels: newbie > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus is to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37348: - Issue Type: Improvement (was: Bug) > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Priority: Minor > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus is to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37230) Document DataFrame.mapInArrow
[ https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444934#comment-17444934 ] Apache Spark commented on SPARK-37230: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34624 > Document DataFrame.mapInArrow > - > > Key: SPARK-37230 > URL: https://issues.apache.org/jira/browse/SPARK-37230 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37230) Document DataFrame.mapInArrow
[ https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444933#comment-17444933 ] Apache Spark commented on SPARK-37230: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34624 > Document DataFrame.mapInArrow > - > > Key: SPARK-37230 > URL: https://issues.apache.org/jira/browse/SPARK-37230 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37230) Document DataFrame.mapInArrow
[ https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37230: Assignee: Apache Spark > Document DataFrame.mapInArrow > - > > Key: SPARK-37230 > URL: https://issues.apache.org/jira/browse/SPARK-37230 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37230) Document DataFrame.mapInArrow
[ https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37230: Assignee: (was: Apache Spark) > Document DataFrame.mapInArrow > - > > Key: SPARK-37230 > URL: https://issues.apache.org/jira/browse/SPARK-37230 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
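For the documentation task above: `DataFrame.mapInArrow`, added for Spark 3.3, maps a function over an iterator of `pyarrow.RecordBatch` objects per partition and takes an output schema, much like `mapInPandas` but without the pandas conversion. A small hedged sketch of the API being documented, assuming an active SparkSession `spark`:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

def double_id(batches):
    # batches is an iterator of pyarrow.RecordBatch for one partition.
    for batch in batches:
        doubled = pc.multiply(batch.column(0), 2)  # column 0 is "id"
        yield pa.RecordBatch.from_arrays([doubled], names=["id"])

df = spark.range(5).mapInArrow(double_id, schema="id long")
df.show()
{code}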
[jira] [Commented] (SPARK-37347) Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly
[ https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444919#comment-17444919 ] Apache Spark commented on SPARK-37347: -- User 'lk246' has created a pull request for this issue: https://github.com/apache/spark/pull/34623 > Spark Thrift Server (STS) driver full GC because timeoutExecutor is not > shut down correctly > -- > > Key: SPARK-37347 > URL: https://issues.apache.org/jira/browse/SPARK-37347 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: liukai >Priority: Major > > When spark.sql.thriftServer.queryTimeout or > java.sql.Statement.setQueryTimeout is set to a value greater than 0, > SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running > queries, introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But the > timeoutExecutor is not shut down correctly when the statement finishes; it is > only shut down on timeout. When the timeout is set to a long duration, for example > 1 hour, the long-running STS driver suffers full GC and the application is not > available for a long time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes
[ https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shefron Yudy updated SPARK-37350: - Description: I saw the error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {code:java} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {code} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer and hadoopDataStream should reconnect after all datanodes stop and then start later。 {code:java} /** Log the event as JSON. */ private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) { val eventJson = JsonProtocol.sparkEventToJson(event) // scalastyle:off println writer.foreach(_.println(compact(render(eventJson // scalastyle:on println if (flushLogger) { writer.foreach(_.flush()) hadoopDataStream.foreach(_.hflush()) } if (testing) { loggedEvents += eventJson } } {code} was: I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {code:java} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {code} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer and hadoopDataStream should reconnect after all datanodes stop and then start later。 {code:java} /** Log the event as JSON. 
*/ private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) { val eventJson = JsonProtocol.sparkEventToJson(event) // scalastyle:off println writer.foreach(_.println(compact(render(eventJson // scalastyle:on println if (flushLogger) { writer.foreach(_.flush()) hadoopDataStream.foreach(_.hflush()) } if (testing) { loggedEvents += eventJson } } {code} > EventLoggingListener keep logging errors after hdfs restart all datanodes > - > > Key: SPARK-37350 > URL: https://issues.apache.org/jira/browse/SPARK-37350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1 >Reporter: Shefron Yudy >Priority: Major > > I saw the error in SparkThriftServer process's log when I restart all > datanodes of HDFS , The logs as follows: > {code:java} > 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] > scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception > java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] > are bad. Aborting... > at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) > {code} > The eventLog will be available normally if I restart the SparkThriftServer, I > suggest that the EventLoggingListener's dfs writer and hadoopDataStream > should reconnect after all datanodes stop and then start later。 > {code:java} > /** Log the event as JSON. */ > private def
[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes
[ https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shefron Yudy updated SPARK-37350: - Description: I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {code:java} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {code} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer and hadoopDataStream should reconnect after all datanodes stop and then start later。 {code:java} /** Log the event as JSON. */ private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) { val eventJson = JsonProtocol.sparkEventToJson(event) // scalastyle:off println writer.foreach(_.println(compact(render(eventJson // scalastyle:on println if (flushLogger) { writer.foreach(_.flush()) hadoopDataStream.foreach(_.hflush()) } if (testing) { loggedEvents += eventJson } } {code} was: I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {code:java} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {code} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer should reconnect after all datanodes stop and then start later。 {code:java} /** Log the event as JSON. 
*/ private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) { val eventJson = JsonProtocol.sparkEventToJson(event) // scalastyle:off println writer.foreach(_.println(compact(render(eventJson // scalastyle:on println if (flushLogger) { writer.foreach(_.flush()) hadoopDataStream.foreach(_.hflush()) } if (testing) { loggedEvents += eventJson } } {code} > EventLoggingListener keep logging errors after hdfs restart all datanodes > - > > Key: SPARK-37350 > URL: https://issues.apache.org/jira/browse/SPARK-37350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1 >Reporter: Shefron Yudy >Priority: Major > > I saw the same error in SparkThriftServer process's log when I restart all > datanodes of HDFS , The logs as follows: > {code:java} > 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] > scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception > java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] > are bad. Aborting... > at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) > {code} > The eventLog will be available normally if I restart the SparkThriftServer, I > suggest that the EventLoggingListener's dfs writer and hadoopDataStream > should reconnect after all datanodes stop and then start later。 > {code:java} > /** Log the event as JSON. */ > private def logEvent(event:
[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes
[ https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shefron Yudy updated SPARK-37350: - Description: I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {code:java} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {code} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer should reconnect after all datanodes stop and then start later。 {code:java} /** Log the event as JSON. */ private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) { val eventJson = JsonProtocol.sparkEventToJson(event) // scalastyle:off println writer.foreach(_.println(compact(render(eventJson // scalastyle:on println if (flushLogger) { writer.foreach(_.flush()) hadoopDataStream.foreach(_.hflush()) } if (testing) { loggedEvents += eventJson } } {code} was: I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {code:java} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {code} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer should reconnect after all datanodes stop and then start later。 > EventLoggingListener keep logging errors after hdfs restart all datanodes > - > > Key: SPARK-37350 > URL: https://issues.apache.org/jira/browse/SPARK-37350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1 >Reporter: Shefron Yudy >Priority: Major > > I saw the same error in SparkThriftServer process's log when I restart all > datanodes of HDFS , The logs as follows: > {code:java} > 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] > scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception > java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] > are bad. Aborting... 
> at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) > {code} > The eventLog will be available normally if I restart the SparkThriftServer, I > suggest that the EventLoggingListener's dfs writer should reconnect after all > datanodes stop and then start later。 > {code:java} > /** Log the event as JSON. */ > private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = > false) { > val eventJson = JsonProtocol.sparkEventToJson(event) > // scalastyle:off println > writer.foreach(_.println(compact(render(eventJson > // scalastyle:on println > if (flushLogger) { > writer.foreach(_.flush()) > hadoopDataStream.foreach(_.hflush()) > } > if (testing) { > loggedEvents += eventJson > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (SPARK-37351) Supports write data flow control
melin created SPARK-37351: - Summary: Supports write data flow control Key: SPARK-37351 URL: https://issues.apache.org/jira/browse/SPARK-37351 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.2.0 Reporter: melin Hive table data is often written to a relational database, generally an online production database. If the write speed has no traffic control, it can easily affect the stability of the online system. It is recommended to add traffic control parameters. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-24907) Migrate JDBC data source to DataSource API v2
[ https://issues.apache.org/jira/browse/SPARK-24907 ] melin deleted comment on SPARK-24907: --- was (Author: melin): The hive table data is written to a relational database, generally an online production database. If the writing speed has no traffic control, it can easily affect the stability of the online system. It is recommended to add traffic control parameters > Migrate JDBC data source to DataSource API v2 > - > > Key: SPARK-24907 > URL: https://issues.apache.org/jira/browse/SPARK-24907 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Teng Peng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes
[ https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shefron Yudy updated SPARK-37350: - Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1 (was: Spark-2.4.0 Hadoop-3.0.0 Hive-2.1.1) > EventLoggingListener keep logging errors after hdfs restart all datanodes > - > > Key: SPARK-37350 > URL: https://issues.apache.org/jira/browse/SPARK-37350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1 >Reporter: Shefron Yudy >Priority: Major > > I saw the same error in SparkThriftServer process's log when I restart all > datanodes of HDFS , The logs as follows: > {code:java} > 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] > scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception > java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] > are bad. Aborting... > at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) > {code} > The eventLog will be available normally if I restart the SparkThriftServer, I > suggest that the EventLoggingListener's dfs writer should reconnect after all > datanodes stop and then start later。 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes
[ https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shefron Yudy updated SPARK-37350: - Description: I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {code:java} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {code} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer should reconnect after all datanodes stop and then start later。 was: I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {panel:title=My title} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {panel} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer should reconnect after all datanodes stop and then start later。 > EventLoggingListener keep logging errors after hdfs restart all datanodes > - > > Key: SPARK-37350 > URL: https://issues.apache.org/jira/browse/SPARK-37350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Spark-2.4.0 > Hadoop-3.0.0 > Hive-2.1.1 >Reporter: Shefron Yudy >Priority: Major > > I saw the same error in SparkThriftServer process's log when I restart all > datanodes of HDFS , The logs as follows: > {code:java} > 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] > scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception > java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] > are bad. Aborting... 
> at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) > {code} > The eventLog will be available normally if I restart the SparkThriftServer, I > suggest that the EventLoggingListener's dfs writer should reconnect after all > datanodes stop and then start later。 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24907) Migrate JDBC data source to DataSource API v2
[ https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444916#comment-17444916 ] melin commented on SPARK-24907: --- The hive table data is written to a relational database, generally an online production database. If the writing speed has no traffic control, it can easily affect the stability of the online system. It is recommended to add traffic control parameters > Migrate JDBC data source to DataSource API v2 > - > > Key: SPARK-24907 > URL: https://issues.apache.org/jira/browse/SPARK-24907 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Teng Peng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes
Shefron Yudy created SPARK-37350: Summary: EventLoggingListener keep logging errors after hdfs restart all datanodes Key: SPARK-37350 URL: https://issues.apache.org/jira/browse/SPARK-37350 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Environment: Spark-2.4.0 Hadoop-3.0.0 Hive-2.1.1 Reporter: Shefron Yudy I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS , The logs as follows: {panel:title=My title} 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) {panel} The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer should reconnect after all datanodes stop and then start later。 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35345) Add BloomFilter Benchmark test for Parquet
[ https://issues.apache.org/jira/browse/SPARK-35345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35345. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34594 [https://github.com/apache/spark/pull/34594] > Add BloomFilter Benchmark test for Parquet > -- > > Key: SPARK-35345 > URL: https://issues.apache.org/jira/browse/SPARK-35345 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > Fix For: 3.3.0 > > > Currently, we only have BloomFilter Benchmark test for ORC. Will add one for > Parquet too. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35345) Add BloomFilter Benchmark test for Parquet
[ https://issues.apache.org/jira/browse/SPARK-35345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-35345: Assignee: Huaxin Gao > Add BloomFilter Benchmark test for Parquet > -- > > Key: SPARK-35345 > URL: https://issues.apache.org/jira/browse/SPARK-35345 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > > Currently, we only have BloomFilter Benchmark test for ORC. Will add one for > Parquet too. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37341: --- Assignee: Cheng Su > Avoid unnecessary buffer and copy in full outer sort merge join > --- > > Key: SPARK-37341 > URL: https://issues.apache.org/jira/browse/SPARK-37341 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > > FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers > input rows, even when rows from both sides do have matched keys > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641] > ). This is unnecessary, as we can just output the row with smaller join > keys, and only buffer when both sides have matched keys. This would save us > from unnecessary copy and buffer, when both join sides have a lot of rows not > matched with each other. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37341. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34612 [https://github.com/apache/spark/pull/34612] > Avoid unnecessary buffer and copy in full outer sort merge join > --- > > Key: SPARK-37341 > URL: https://issues.apache.org/jira/browse/SPARK-37341 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers > input rows, even when rows from both sides do have matched keys > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641] > ). This is unnecessary, as we can just output the row with smaller join > keys, and only buffer when both sides have matched keys. This would save us > from unnecessary copy and buffer, when both join sides have a lot of rows not > matched with each other. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28242) DataStreamer keeps logging errors even after fixing writeStream output sink
[ https://issues.apache.org/jira/browse/SPARK-28242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444900#comment-17444900 ] Shefron Yudy commented on SPARK-28242: -- I saw the same error in SparkThriftServer process's log when I restart all datanodes of HDFS with the env Spark-2.4.0 and Hadoop-3.0.0), The logs as follows: 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]] are bad. Aborting... at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) The eventLog will be available normally if I restart the SparkThriftServer, I suggest that the EventLoggingListener's dfs writer should reconnect after all datanodes stop and then start later。 > DataStreamer keeps logging errors even after fixing writeStream output sink > --- > > Key: SPARK-28242 > URL: https://issues.apache.org/jira/browse/SPARK-28242 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Hadoop 2.8.4 > >Reporter: Miquel Canes >Priority: Minor > Labels: bulk-closed > > I have been testing what happens to a running structured streaming that is > writing to HDFS when all datanodes are down/stopped or all cluster is down > (including namenode) > So I created a structured stream from kafka to a File output sink to HDFS and > tested some scenarios. > We used a very simple streamings: > {code:java} > spark.readStream() > .format("kafka") > .option("kafka.bootstrap.servers", "kafka.server:9092...") > .option("subscribe", "test_topic") > .load() > .select(col("value").cast(DataTypes.StringType)) > .writeStream() > .format("text") > .option("path", "HDFS/PATH") > .option("checkpointLocation", "checkpointPath") > .start() > .awaitTermination();{code} > > After stopping all the datanodes the process starts logging the error that > datanodes are bad. > That's correct... > {code:java} > 2019-07-03 15:55:00 [spark-listener-group-eventLog] ERROR > org.apache.spark.scheduler.AsyncEventQueue:91 - Listener EventLoggingListener > threw an exception java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.2.12.202:50010,DS-d2fba01b-28eb-4fe4-baaa-4072102a2172,DISK]] > are bad. Aborting... at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1530) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657) > {code} > The problem is that even after starting again the datanodes the process keeps > logging the same error all the time. > We checked and the WriteStream to HDFS recovered successfully after starting > the datanodes and the output sink worked again without problems. > I have been trying some different HDFS configurations to be sure it's not a > client config related problem but with no clue about how to fix it. 
> It seems that something is stuck indefinitely in an error loop. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36231. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34314 [https://github.com/apache/spark/pull/34314] > Support arithmetic operations of Series containing Decimal(np.nan) > --- > > Key: SPARK-36231 > URL: https://issues.apache.org/jira/browse/SPARK-36231 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > > Arithmetic operations of Series containing Decimal(np.nan) raise > java.lang.NullPointerException in driver. An example is shown as below: > {code:java} > >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), > >>> decimal.Decimal(np.nan)]) > >>> psser = ps.from_pandas(pser) > >>> pser + 1 > 0 2 > 1 3 > 2 NaN > >>> psser + 1 > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at > 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107) > at > org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68) > Caused by: java.lang.NullPointerException > at >
[jira] [Resolved] (SPARK-36000) Support creation and operations of ps.Series/Index with Decimal('NaN')
[ https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36000. -- Fix Version/s: 3.3.0 Resolution: Done > Support creation and operations of ps.Series/Index with Decimal('NaN') > -- > > Key: SPARK-36000 > URL: https://issues.apache.org/jira/browse/SPARK-36000 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > > The creation and operations of ps.Series/Index with Decimal('NaN') don't > work as expected. > That might be due to an underlying PySpark limitation. > Please refer to sub-tasks for issues detected. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36000) Support creation and operations of ps.Series/Index with Decimal('NaN')
[ https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36000: Assignee: Yikun Jiang > Support creation and operations of ps.Series/Index with Decimal('NaN') > -- > > Key: SPARK-36000 > URL: https://issues.apache.org/jira/browse/SPARK-36000 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Yikun Jiang >Priority: Major > > The creation and operations of ps.Series/Index with Decimal('NaN') don't > work as expected. > That might be due to an underlying PySpark limitation. > Please refer to sub-tasks for issues detected. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)
[ https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36231: Assignee: Yikun Jiang > Support arithmetic operations of Series containing Decimal(np.nan) > --- > > Key: SPARK-36231 > URL: https://issues.apache.org/jira/browse/SPARK-36231 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Yikun Jiang >Priority: Major > > Arithmetic operations of Series containing Decimal(np.nan) raise > java.lang.NullPointerException in driver. An example is shown as below: > {code:java} > >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), > >>> decimal.Decimal(np.nan)]) > >>> psser = ps.from_pandas(pser) > >>> pser + 1 > 0 2 > 1 3 > 2 NaN > >>> psser + 1 > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at > org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629) > at > 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141) > at > org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113) > at > org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107) > at > org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at >
[jira] [Assigned] (SPARK-37345) Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
[ https://issues.apache.org/jira/browse/SPARK-37345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-37345: --- Assignee: Yuming Wang > Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS > -- > > Key: SPARK-37345 > URL: https://issues.apache.org/jira/browse/SPARK-37345 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > {noformat} > Exception in thread "main" java.lang.IllegalArgumentException: Can't get > Kerberos realm > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:50) > at > org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:397) > at > org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:397) > at > org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:418) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:423) > at > org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81) > at > org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala) > Caused by: java.lang.IllegalAccessException: class > org.apache.hadoop.security.authentication.util.KerberosUtil cannot access > class sun.security.krb5.Config (in module java.security.jgss) because module > java.security.jgss does not export sun.security.krb5 to unnamed module > @3a0baae5 > at > java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392) > at > java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674) > at java.base/java.lang.reflect.Method.invoke(Method.java:560) > at > org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85) > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63) > ... 9 more > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37345) Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
[ https://issues.apache.org/jira/browse/SPARK-37345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-37345. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34615 [https://github.com/apache/spark/pull/34615] > Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS > -- > > Key: SPARK-37345 > URL: https://issues.apache.org/jira/browse/SPARK-37345 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.3.0 > > > {noformat} > Exception in thread "main" java.lang.IllegalArgumentException: Can't get > Kerberos realm > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:50) > at > org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:397) > at > org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:397) > at > org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:418) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:423) > at > org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81) > at > org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala) > Caused by: java.lang.IllegalAccessException: class > org.apache.hadoop.security.authentication.util.KerberosUtil cannot access > class sun.security.krb5.Config (in module java.security.jgss) because module > java.security.jgss does not export sun.security.krb5 to unnamed module > @3a0baae5 > at > java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392) > at > java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674) > at java.base/java.lang.reflect.Method.invoke(Method.java:560) > at > org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85) > at > org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63) > ... 9 more > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
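For anyone hitting the IllegalAccessException above on a module-system JDK before this fix is picked up, the workaround is the usual JPMS option naming the module and package from the error message. The exact option string that SPARK-37345 adds to DEFAULT_MODULE_OPTIONS is in the linked pull request; the line below is only an illustration of its likely shape, passed for example through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions:
{noformat}
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
{noformat}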
[jira] [Commented] (SPARK-37329) File system delegation tokens are leaked
[ https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444890#comment-17444890 ] Wei-Chiu Chuang commented on SPARK-37329: - I should also note that this affects not just KMS, but any file system implementation (HDFS, Ozone, perhaps S3) with delegation token support. > File system delegation tokens are leaked > > > Key: SPARK-37329 > URL: https://issues.apache.org/jira/browse/SPARK-37329 > Project: Spark > Issue Type: Bug > Components: Security, YARN >Affects Versions: 2.4.0 >Reporter: Wei-Chiu Chuang >Priority: Major > > On a very busy Hadoop cluster (with HDFS at-rest encryption) we found KMS > accumulated millions of delegation tokens that are not cancelled even after > jobs are finished, and KMS goes out of memory within a day because of the > delegation token leak. > We were able to reproduce the bug in a smaller test cluster, and realized that > when a Spark job starts, it acquires two delegation tokens, and only one is > cancelled properly after the job finishes. The other one is left over and > lingers around for up to 7 days (the default Hadoop delegation token lifetime). > YARN handles the lifecycle of a delegation token properly if its renewer is > 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation > token with the job issuer as the renewer, simply to get the token renewal > interval. The token is then ignored but not cancelled. > Proposal: cancel the delegation token immediately after the token renewal > interval is obtained. > Environment: CDH6.3.2 (based on Apache Spark 2.4.0) but the bug has probably existed > since day 1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37349) Improve SQL Rest API Parsing
[ https://issues.apache.org/jira/browse/SPARK-37349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444891#comment-17444891 ] Yian Liou commented on SPARK-37349: --- Will be working on a PR. > Improve SQL Rest API Parsing > > > Key: SPARK-37349 > URL: https://issues.apache.org/jira/browse/SPARK-37349 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yian Liou >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-31440 added improvements to the SQL > Rest API. This Jira aims to add further enhancements to the parsing of > the incoming data by accounting for the `StageIds` and `TaskIds` fields introduced > in Spark 3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37349) Improve SQL Rest API Parsing
Yian Liou created SPARK-37349: - Summary: Improve SQL Rest API Parsing Key: SPARK-37349 URL: https://issues.apache.org/jira/browse/SPARK-37349 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yian Liou https://issues.apache.org/jira/browse/SPARK-31440 added improvements to the SQL Rest API. This Jira aims to add further enhancements to the parsing of the incoming data by accounting for the `StageIds` and `TaskIds` fields introduced in Spark 3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
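For context on what is being parsed here, SPARK-31440 exposed SQL execution data through the application REST API, and a client can already pull it like this. The host, port, application id, and the response field names below are placeholders rather than the definitive schema; extending that schema with StageIds and TaskIds is exactly what SPARK-37349 is about:
{code:python}
import json
from urllib.request import urlopen

# Placeholders: point these at a live Spark UI / history server and application.
BASE_URL = "http://localhost:4040/api/v1"
APP_ID = "app-00000000000000-0000"

# The /sql endpoint lists SQL executions; details=true asks for per-node metrics.
with urlopen(f"{BASE_URL}/applications/{APP_ID}/sql?details=true") as response:
    executions = json.load(response)

for execution in executions:
    # Field names are illustrative; inspect the JSON your Spark version returns.
    print(execution.get("id"), execution.get("status"), execution.get("duration"))
{code}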
[jira] [Assigned] (SPARK-37340) Display StageIds in Operators for SQL UI
[ https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37340: Assignee: (was: Apache Spark) > Display StageIds in Operators for SQL UI > > > Key: SPARK-37340 > URL: https://issues.apache.org/jira/browse/SPARK-37340 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Yian Liou >Priority: Major > > This proposes a more generalized solution of > https://issues.apache.org/jira/browse/SPARK-30209, where a stageId-> operator > mapping is done with the following algorithm. > 1. Read SparkGraph to get every Node's name and respective AccumulatorIDs. > 2. Gets each stage's AccumulatorIDs. > 3. Maps Operators to stages by checking for non-zero intersection of Step 1 > and 2's AccumulatorIDs. > 4. Connect SparkGraphNodes to respective StageIDs for rendering in SQL UI. > As a result, some operators without max metrics values will also have > stageIds in the UI. This Jira also aims to add minor enhancements to the SQL > UI tab. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37340) Display StageIds in Operators for SQL UI
[ https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37340: Assignee: Apache Spark > Display StageIds in Operators for SQL UI > > > Key: SPARK-37340 > URL: https://issues.apache.org/jira/browse/SPARK-37340 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Yian Liou >Assignee: Apache Spark >Priority: Major > > This proposes a more generalized solution of > https://issues.apache.org/jira/browse/SPARK-30209, where a stageId-> operator > mapping is done with the following algorithm. > 1. Read SparkGraph to get every Node's name and respective AccumulatorIDs. > 2. Gets each stage's AccumulatorIDs. > 3. Maps Operators to stages by checking for non-zero intersection of Step 1 > and 2's AccumulatorIDs. > 4. Connect SparkGraphNodes to respective StageIDs for rendering in SQL UI. > As a result, some operators without max metrics values will also have > stageIds in the UI. This Jira also aims to add minor enhancements to the SQL > UI tab. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37340) Display StageIds in Operators for SQL UI
[ https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444764#comment-17444764 ] Apache Spark commented on SPARK-37340: -- User 'yliou' has created a pull request for this issue: https://github.com/apache/spark/pull/34622 > Display StageIds in Operators for SQL UI > > > Key: SPARK-37340 > URL: https://issues.apache.org/jira/browse/SPARK-37340 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Yian Liou >Priority: Major > > This proposes a more generalized solution of > https://issues.apache.org/jira/browse/SPARK-30209, where a stageId-> operator > mapping is done with the following algorithm. > 1. Read SparkGraph to get every Node's name and respective AccumulatorIDs. > 2. Gets each stage's AccumulatorIDs. > 3. Maps Operators to stages by checking for non-zero intersection of Step 1 > and 2's AccumulatorIDs. > 4. Connect SparkGraphNodes to respective StageIDs for rendering in SQL UI. > As a result, some operators without max metrics values will also have > stageIds in the UI. This Jira also aims to add minor enhancements to the SQL > UI tab. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
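The change itself lives in Spark's Scala UI code, but the intersection step in the algorithm quoted above is easy to sketch. The dictionaries below are invented stand-ins for the accumulator-ID sets that the real code reads from the SparkPlanGraph nodes and from each stage's metrics:
{code:python}
def map_operators_to_stages(node_accums, stage_accums):
    """Map each plan node to the stages whose accumulator IDs overlap with its own.

    node_accums:  node name -> set of accumulator IDs taken from the plan graph
    stage_accums: stage ID  -> set of accumulator IDs reported by that stage
    """
    return {
        node: sorted(stage for stage, accs in stage_accums.items() if ids & accs)
        for node, ids in node_accums.items()
    }

# Toy data: the Exchange operator only shares accumulator 7 with stage 1.
print(map_operators_to_stages(
    {"HashAggregate": {1, 2}, "Exchange": {7}},
    {0: {1, 2, 3}, 1: {7, 8}},
))
# -> {'HashAggregate': [0], 'Exchange': [1]}
{code}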
[jira] [Commented] (SPARK-37219) support AS OF syntax
[ https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444669#comment-17444669 ] Apache Spark commented on SPARK-37219: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34621 > support AS OF syntax > > > Key: SPARK-37219 > URL: https://issues.apache.org/jira/browse/SPARK-37219 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > > https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel > Delta Lake time travel allows users to query an older snapshot of a Delta > table. To query an older version of a table, users need to specify a version > or timestamp in a SELECT statement using AS OF syntax, as follows: > SELECT * FROM default.people10m VERSION AS OF 0; > SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'; > This ticket is opened to add AS OF syntax to Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN
[ https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros resolved SPARK-35672. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34120 [https://github.com/apache/spark/pull/34120] > Spark fails to launch executors with very large user classpath lists on YARN > > > Key: SPARK-35672 > URL: https://issues.apache.org/jira/browse/SPARK-35672 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.1.2 > Environment: Linux RHEL7 > Spark 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.3.0 > > > When running Spark on YARN, the {{user-class-path}} argument to > {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to > executor processes. The argument is specified once for each JAR, and the URIs > are fully-qualified, so the paths can be quite long. With large user JAR > lists (say 1000+), this can result in system-level argument length limits > being exceeded, typically manifesting as the error message: > {code} > /bin/bash: Argument list too long > {code} > A [Google > search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22] > indicates that this is not a theoretical problem and afflicts real users, > including ours. This issue was originally observed on Spark 2.3, but has been > confirmed to exist in the master branch as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Schwab updated SPARK-37348: --- Description: Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns -1. However, the modulus is often desired instead of the remainder. There is a [PMOD() function in Spark SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. So at the moment, the two options for getting the modulus is to use F.expr("pmod(A, B)"), or create a helper function such as: {code:java} def pmod(dividend, divisor): return F.when(dividend < 0, (dividend % divisor) + divisor).otherwise(dividend % divisor){code} Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. was: Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns -1. However, the modulus is often desired instead of the remainder. There is a PMOD() function in Spark SQL, but not in PySpark. So at the moment, the two options for getting the modulus is to use F.expr("pmod(A, B)"), or create a helper function such as: {code:java} def pmod(dividend, divisor): return F.when(dividend < 0, (dividend % divisor) + divisor).otherwise(dividend % divisor){code} Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Priority: Minor > Labels: newbie > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a [PMOD() function in Spark > SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in > PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. > So at the moment, the two options for getting the modulus is to use > F.expr("pmod(A, B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37348) PySpark pmod function
[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444636#comment-17444636 ] Tim Schwab commented on SPARK-37348: Also, this has been the case for over 6 years: https://github.com/apache/spark/pull/6783. > PySpark pmod function > - > > Key: SPARK-37348 > URL: https://issues.apache.org/jira/browse/SPARK-37348 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Tim Schwab >Priority: Minor > Labels: newbie > > Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns > -1. However, the modulus is often desired instead of the remainder. > > There is a PMOD() function in Spark SQL, but not in PySpark. So at the > moment, the two options for getting the modulus is to use F.expr("pmod(A, > B)"), or create a helper function such as: > > {code:java} > def pmod(dividend, divisor): > return F.when(dividend < 0, (dividend % divisor) + > divisor).otherwise(dividend % divisor){code} > > > Neither are optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37348) PySpark pmod function
Tim Schwab created SPARK-37348: -- Summary: PySpark pmod function Key: SPARK-37348 URL: https://issues.apache.org/jira/browse/SPARK-37348 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.2.0 Reporter: Tim Schwab Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns -1. However, the modulus is often desired instead of the remainder. There is a PMOD() function in Spark SQL, but not in PySpark. So at the moment, the two options for getting the modulus are to use F.expr("pmod(A, B)"), or create a helper function such as: {code:java} def pmod(dividend, divisor): return F.when(dividend < 0, (dividend % divisor) + divisor).otherwise(dividend % divisor){code} Neither is optimal - pmod should be native to PySpark as it is in Spark SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
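A quick way to see the difference described in SPARK-37348, using the existing SQL pmod function through expr as the current workaround (a minimal sketch with a throwaway one-row DataFrame):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1).select(
    (F.lit(-1) % F.lit(2)).alias("jvm_remainder"),    # -1, the JVM-style remainder
    F.expr("pmod(-1, 2)").alias("positive_modulus"),  # 1, the desired modulus
)
df.show()
{code}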
[jira] [Assigned] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37181: Assignee: Apache Spark > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Assignee: Apache Spark >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
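A small illustration of the gap reported in SPARK-37181; the file path is a placeholder, and the point is only that pandas-on-Spark currently needs the Windows code page name where plain pandas accepts the latin-1 alias:
{code:python}
import pandas as pd
import pyspark.pandas as ps

# Plain pandas accepts the 'latin-1' alias directly.
pdf = pd.read_csv("/path/to/data.csv", encoding="latin-1")

# pandas-on-Spark, per the report, rejects 'latin-1'; the nearly identical
# (but not byte-for-byte equivalent) Windows code page is the workaround.
psdf = ps.read_csv("/path/to/data.csv", encoding="Windows-1252")
{code}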
[jira] [Assigned] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed
[ https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37209: Assignee: (was: Apache Spark) > YarnShuffleIntegrationSuite and other two similar cases in > `resource-managers` test failed > --- > > Key: SPARK-37209 > URL: https://issues.apache.org/jira/browse/SPARK-37209 > Project: Spark > Issue Type: Bug > Components: Tests, YARN >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > Attachments: failed-unit-tests.log, success-unit-tests.log > > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will successful. > > Execute : > # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will failed. > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # Delete assembly/target/scala-2.12/jars manually > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will failed. 
> > The error stack is : > {code:java} > 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: > User class threw exception: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor > 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:216) > at > org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) > at scala.collection.immutable.List.flatMap(List.scala:293) > at scala.collection.immutable.List.flatMap(List.scala:79) > at > org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) > at > org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) > at > org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) > at > com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) > at > org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) > at > org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) > at > org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) > at > org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) > at > org.apache.spark.shuffle.ShufflePartitionPairsWriter.write(ShufflePartitionPairsWriter.scala:59) > at > org.apache.spark.util.collection.WritablePartitionedIterator.writeNext(WritablePartitionedPairCollection.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.$anonfun$writePartitionedMapOutput$1(ExternalSorter.scala:772) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) > at > org.apache.spark.util.collection.ExternalSorter.writePartitionedMapOutput(ExternalSorter.scala:775) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at >
[jira] [Commented] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed
[ https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444603#comment-17444603 ] Apache Spark commented on SPARK-37209: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/34620 > YarnShuffleIntegrationSuite and other two similar cases in > `resource-managers` test failed > --- > > Key: SPARK-37209 > URL: https://issues.apache.org/jira/browse/SPARK-37209 > Project: Spark > Issue Type: Bug > Components: Tests, YARN >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > Attachments: failed-unit-tests.log, success-unit-tests.log > > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will successful. > > Execute : > # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will failed. > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # Delete assembly/target/scala-2.12/jars manually > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will failed. 
> > The error stack is : > {code:java} > 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: > User class threw exception: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor > 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:216) > at > org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) > at scala.collection.immutable.List.flatMap(List.scala:293) > at scala.collection.immutable.List.flatMap(List.scala:79) > at > org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) > at > org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) > at > org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) > at > com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) > at > org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) > at > org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) > at > org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) > at > org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) > at > org.apache.spark.shuffle.ShufflePartitionPairsWriter.write(ShufflePartitionPairsWriter.scala:59) > at > org.apache.spark.util.collection.WritablePartitionedIterator.writeNext(WritablePartitionedPairCollection.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.$anonfun$writePartitionedMapOutput$1(ExternalSorter.scala:772) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) > at > org.apache.spark.util.collection.ExternalSorter.writePartitionedMapOutput(ExternalSorter.scala:775) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at >
[jira] [Assigned] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37181: Assignee: (was: Apache Spark) > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444604#comment-17444604 ] Apache Spark commented on SPARK-37181: -- User 'pralabhkumar' has created a pull request for this issue: https://github.com/apache/spark/pull/34619 > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed
[ https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37209: Assignee: Apache Spark > YarnShuffleIntegrationSuite and other two similar cases in > `resource-managers` test failed > --- > > Key: SPARK-37209 > URL: https://issues.apache.org/jira/browse/SPARK-37209 > Project: Spark > Issue Type: Bug > Components: Tests, YARN >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > Attachments: failed-unit-tests.log, success-unit-tests.log > > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will successful. > > Execute : > # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will failed. > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # Delete assembly/target/scala-2.12/jars manually > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will failed. 
> > The error stack is : > {code:java} > 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: > User class threw exception: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor > 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:216) > at > org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) > at scala.collection.immutable.List.flatMap(List.scala:293) > at scala.collection.immutable.List.flatMap(List.scala:79) > at > org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) > at > org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) > at > org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) > at > com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) > at > org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) > at > org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) > at > org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) > at > org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) > at > org.apache.spark.shuffle.ShufflePartitionPairsWriter.write(ShufflePartitionPairsWriter.scala:59) > at > org.apache.spark.util.collection.WritablePartitionedIterator.writeNext(WritablePartitionedPairCollection.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.$anonfun$writePartitionedMapOutput$1(ExternalSorter.scala:772) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) > at > org.apache.spark.util.collection.ExternalSorter.writePartitionedMapOutput(ExternalSorter.scala:775) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at >
[jira] [Assigned] (SPARK-37219) support AS OF syntax
[ https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37219: --- Assignee: Huaxin Gao > support AS OF syntax > > > Key: SPARK-37219 > URL: https://issues.apache.org/jira/browse/SPARK-37219 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel > Delta Lake time travel allows users to query an older snapshot of a Delta > table. To query an older version of a table, users need to specify a version > or timestamp in a SELECT statement using AS OF syntax, as follows: > SELECT * FROM default.people10m VERSION AS OF 0; > SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'; > This ticket is opened to add AS OF syntax to Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37219) support AS OF syntax
[ https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37219. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34497 [https://github.com/apache/spark/pull/34497] > support AS OF syntax > > > Key: SPARK-37219 > URL: https://issues.apache.org/jira/browse/SPARK-37219 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > > https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel > Delta Lake time travel allows users to query an older snapshot of a Delta > table. To query an older version of a table, users need to specify a version > or timestamp in a SELECT statement using AS OF syntax, as follows: > SELECT * FROM default.people10m VERSION AS OF 0; > SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'; > This ticket is opened to add AS OF syntax to Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37346) Link migration guide for structured stream.
[ https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444585#comment-17444585 ] Apache Spark commented on SPARK-37346: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34618 > Link migration guide for structured stream. > --- > > Key: SPARK-37346 > URL: https://issues.apache.org/jira/browse/SPARK-37346 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Link migration guide to each project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37346) Link migration guide for structured stream.
[ https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37346: Assignee: (was: Apache Spark) > Link migration guide for structured stream. > --- > > Key: SPARK-37346 > URL: https://issues.apache.org/jira/browse/SPARK-37346 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Link migration guide to each project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37346) Link migration guide for structured stream.
[ https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37346: Assignee: Apache Spark > Link migration guide for structured stream. > --- > > Key: SPARK-37346 > URL: https://issues.apache.org/jira/browse/SPARK-37346 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > Link migration guide to each project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37346) Link migration guide for struct stream.
[ https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37346: -- Summary: Link migration guide for struct stream. (was: Link migration guide to each project.) > Link migration guide for struct stream. > --- > > Key: SPARK-37346 > URL: https://issues.apache.org/jira/browse/SPARK-37346 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Link migration guide to each project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37346) Link migration guide for structured stream.
[ https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37346: -- Summary: Link migration guide for structured stream. (was: Link migration guide for struct stream.) > Link migration guide for structured stream. > --- > > Key: SPARK-37346 > URL: https://issues.apache.org/jira/browse/SPARK-37346 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Link migration guide to each project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed
[ https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444502#comment-17444502 ] Yang Jie edited comment on SPARK-37209 at 11/16/21, 1:31 PM: - After some investigation, I found that this issue maybe related to `hadoop-3.x`, when use `hadoop-2.7` profile, the above test can be successful: {code:java} mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite Discovery starting. Discovery completed in 259 milliseconds. Run starting. Expected test count is: 1 YarnShuffleIntegrationSuite: - external shuffle service Run completed in 30 seconds, 765 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed.{code} It seems that when testing with hadoop-2.7, the result of executing `Utils.isTesting` is true, which helps test case to ignore the `NoClassDefFoundError` in the test, but when testing with hadoop-3.2, the result of executing `Utils.isTesting` is false. But I haven't investigated the root cause with hadoop-3.2 cc [~hyukjin.kwon] [~dongjoon] [~srowen] was (Author: luciferyang): After some investigation, I found that this issue maybe related to `hadoop-3.x`, when use `hadoop-2.7` profile, the above test can be successful: {code:java} mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite Discovery starting. Discovery completed in 259 milliseconds. Run starting. Expected test count is: 1 YarnShuffleIntegrationSuite: - external shuffle service Run completed in 30 seconds, 765 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed.{code} It seems that when testing with hadoop-2.7, the result of executing `Utils.isTesting` on the executor side is true, which helps test case to ignore the `NoClassDefFoundError` in the test, but when testing with hadoop-3.2, the result of executing `Utils.isTesting` on the executor side is false. But I haven't investigated the root cause with hadoop-3.2 cc [~hyukjin.kwon] [~dongjoon] [~srowen] > YarnShuffleIntegrationSuite and other two similar cases in > `resource-managers` test failed > --- > > Key: SPARK-37209 > URL: https://issues.apache.org/jira/browse/SPARK-37209 > Project: Spark > Issue Type: Bug > Components: Tests, YARN >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > Attachments: failed-unit-tests.log, success-unit-tests.log > > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test will successful. 
> > Execute : > # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test fails. > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # Delete assembly/target/scala-2.12/jars manually > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test fails. > > The error stack is : > {code:java} > 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: > User class threw exception: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor > 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:216) > at >
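For context on the comment above, a minimal sketch of how such a testing flag can be resolved, assuming it is driven by the `SPARK_TESTING` environment variable and/or the `spark.testing` system property; the exact check inside `Utils.isTesting` may differ between Spark versions:
{code:scala}
// Hedged sketch, not the actual org.apache.spark.util.Utils source. It assumes
// the testing flag is derived from the SPARK_TESTING env var or the
// "spark.testing" system property. If the executor JVM is launched without
// either of them, the flag is false there even though the driver-side test
// run sets it, which would match the hadoop-3.2 behavior reported above.
object IsTestingSketch {
  def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") ||
      sys.props.get("spark.testing").exists(_.toBoolean)

  def main(args: Array[String]): Unit =
    println(s"isTesting = $isTesting")
}
{code}
Comparing this flag on the driver and on the executors for the hadoop-2.7 and hadoop-3.2 runs would show whether the property is simply not propagated to executor JVMs in the failing configuration.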
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444386#comment-17444386 ] Wei Zhang edited comment on SPARK-18105 at 11/16/21, 1:16 PM: -- In our case, it is strongly related to `spark.file.transferTo`. Setting `spark.file.transferTo` to false has saved our lives. We have been tortured by this problem for almost a year, and we got hundreds of tasks with this `FetchFailedException: Decompression error: Corrupted block detected` error every day. Recently the problem has been heating up since we are using Spark in more scenarios. One of the stack traces I got is: {code:java} org.apache.spark.shuffle.FetchFailedException: Decompression error: Corrupted block detected at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:569) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:474) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:66) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357) at org.apache.spark.rdd.RDD.iterator(RDD.scala:308) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Decompression error: Corrupted block detected at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:111) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.spark.util.Utils$$anonfun$1.apply$mcZ$sp(Utils.scala:402) at org.apache.spark.util.Utils$$anonfun$1.apply(Utils.scala:397) at org.apache.spark.util.Utils$$anonfun$1.apply(Utils.scala:397) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405) at org.apache.spark.util.Utils$.copyStreamUpTo(Utils.scala:409) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:467) ... 32 more{code} We have Spark versions 2.2.2, 2.4.7, 3.0.1 and 3.1.2, and all of them have this issue. I tried the methods in this post and others; nothing made the non-reproducible error disappear. After weeks of struggle and going through the code with one of our almost-reproducible cases, we were able to find that `spark.file.transferTo` is the one. We turned off this flag in our cluster, and this error has literally disappeared; none of these errors any more. Normally we wouldn't suspect flags like `spark.file.transferTo`, in that the only difference is whether you are using `FileStream` or
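To make the workaround described in the comment above concrete, a hedged sketch of disabling the flag when building a session; the key `spark.file.transferTo` is taken from the comment itself, so verify it against the Spark version actually in use:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch only: sets the flag reported above so Spark falls back to a plain
// stream copy instead of NIO transferTo when merging shuffle files.
// Equivalent to passing --conf spark.file.transferTo=false to spark-submit.
object TransferToWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("transferTo-workaround")
      .setIfMissing("spark.master", "local[*]") // placeholder master for a local run
      .set("spark.file.transferTo", "false")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    try {
      // a small shuffle-heavy job stands in for the real workload
      spark.range(0L, 1000000L)
        .selectExpr("id % 10 AS bucket")
        .groupBy("bucket")
        .count()
        .show()
    } finally {
      spark.stop()
    }
  }
}
{code}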
[jira] [Assigned] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37342: Assignee: Chao Sun > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37342. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34613 [https://github.com/apache/spark/pull/34613] > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37347) Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly
[ https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37347: Assignee: (was: Apache Spark) > Spark Thrift Server (STS) driver full GC because timeoutExecutor is not > shut down correctly > -- > > Key: SPARK-37347 > URL: https://issues.apache.org/jira/browse/SPARK-37347 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: liukai >Priority: Major > > When spark.sql.thriftServer.queryTimeout or > java.sql.Statement.setQueryTimeout is set to a value greater than 0, > SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running > queries, introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But > the timeoutExecutor is not shut down correctly when the statement finishes; it > is only shut down when the timeout fires. When the timeout is set to a long > duration, for example 1 hour, the long-running STS driver goes into full GC > and the application is not available for a long time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
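The fix this report implies follows a standard cleanup pattern. A minimal, hedged sketch in plain JDK scheduling is below; the names runWithTimeout, runStatement and cancelQuery are illustrative, not the actual SparkExecuteStatementOperation API. The point is that the scheduled timeout executor is shut down as soon as the statement completes, instead of living until the timeout fires.
{code:scala}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Sketch of the general pattern only, under the assumptions stated above.
object TimeoutExecutorSketch {
  def runWithTimeout(timeoutSeconds: Long)(runStatement: () => Unit): Unit = {
    val timeoutExecutor: ScheduledExecutorService =
      Executors.newSingleThreadScheduledExecutor()
    try {
      if (timeoutSeconds > 0) {
        // schedule a cancellation task, mirroring the per-statement timeout
        timeoutExecutor.schedule(new Runnable {
          override def run(): Unit = cancelQuery()
        }, timeoutSeconds, TimeUnit.SECONDS)
      }
      runStatement()
    } finally {
      // Without this shutdown, every finished statement would leave its
      // scheduled task (and captured state) alive for up to timeoutSeconds,
      // which is the accumulation the long-running driver suffers from.
      timeoutExecutor.shutdownNow()
    }
  }

  private def cancelQuery(): Unit =
    println("query cancelled after timeout") // placeholder cancellation hook

  def main(args: Array[String]): Unit =
    runWithTimeout(timeoutSeconds = 5) { () => println("statement finished") }
}
{code}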
[jira] [Updated] (SPARK-37347) Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly
[ https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-37347: Target Version/s: (was: 3.1.2, 3.2.0) > Spark Thrift Server (STS) driver full GC because timeoutExecutor is not > shut down correctly > -- > > Key: SPARK-37347 > URL: https://issues.apache.org/jira/browse/SPARK-37347 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: liukai >Priority: Major > > When spark.sql.thriftServer.queryTimeout or > java.sql.Statement.setQueryTimeout is set to a value greater than 0, > SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running > queries, introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But > the timeoutExecutor is not shut down correctly when the statement finishes; it > is only shut down when the timeout fires. When the timeout is set to a long > duration, for example 1 hour, the long-running STS driver goes into full GC > and the application is not available for a long time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37347) Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly
[ https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-37347: Fix Version/s: (was: 3.1.3) (was: 3.2.1) > Spark Thrift Server (STS) driver full GC because timeoutExecutor is not > shut down correctly > -- > > Key: SPARK-37347 > URL: https://issues.apache.org/jira/browse/SPARK-37347 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: liukai >Priority: Major > > When spark.sql.thriftServer.queryTimeout or > java.sql.Statement.setQueryTimeout is set to a value greater than 0, > SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running > queries, introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But > the timeoutExecutor is not shut down correctly when the statement finishes; it > is only shut down when the timeout fires. When the timeout is set to a long > duration, for example 1 hour, the long-running STS driver goes into full GC > and the application is not available for a long time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37347) Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly
[ https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444507#comment-17444507 ] Apache Spark commented on SPARK-37347: -- User 'lk246' has created a pull request for this issue: https://github.com/apache/spark/pull/34617 > Spark Thrift Server (STS) driver full GC because timeoutExecutor is not > shut down correctly > -- > > Key: SPARK-37347 > URL: https://issues.apache.org/jira/browse/SPARK-37347 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: liukai >Priority: Major > > When spark.sql.thriftServer.queryTimeout or > java.sql.Statement.setQueryTimeout is set to a value greater than 0, > SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running > queries, introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But > the timeoutExecutor is not shut down correctly when the statement finishes; it > is only shut down when the timeout fires. When the timeout is set to a long > duration, for example 1 hour, the long-running STS driver goes into full GC > and the application is not available for a long time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37347) Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly
[ https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37347: Assignee: Apache Spark > Spark Thrift Server (STS) driver full GC because timeoutExecutor is not > shut down correctly > -- > > Key: SPARK-37347 > URL: https://issues.apache.org/jira/browse/SPARK-37347 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.2.0 >Reporter: liukai >Assignee: Apache Spark >Priority: Major > > When spark.sql.thriftServer.queryTimeout or > java.sql.Statement.setQueryTimeout is set to a value greater than 0, > SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running > queries, introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But > the timeoutExecutor is not shut down correctly when the statement finishes; it > is only shut down when the timeout fires. When the timeout is set to a long > duration, for example 1 hour, the long-running STS driver goes into full GC > and the application is not available for a long time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37209) YarnShuffleIntegrationSuite and two other similar cases in `resource-managers` test failed
[ https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444502#comment-17444502 ] Yang Jie commented on SPARK-37209: -- After some investigation, I found that this issue may be related to `hadoop-3.x`; when using the `hadoop-2.7` profile, the above test succeeds: {code:java} mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite Discovery starting. Discovery completed in 259 milliseconds. Run starting. Expected test count is: 1 YarnShuffleIntegrationSuite: - external shuffle service Run completed in 30 seconds, 765 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed.{code} It seems that when testing with hadoop-2.7, the result of executing `Utils.isTesting` on the executor side is true, which helps the test case ignore the `NoClassDefFoundError` in the test, but when testing with hadoop-3.2, the result of executing `Utils.isTesting` on the executor side is false. But I haven't investigated the root cause with hadoop-3.2 yet. > YarnShuffleIntegrationSuite and two other similar cases in > `resource-managers` test failed > --- > > Key: SPARK-37209 > URL: https://issues.apache.org/jira/browse/SPARK-37209 > Project: Spark > Issue Type: Bug > Components: Tests, YARN >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > Attachments: failed-unit-tests.log, success-unit-tests.log > > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test succeeds. > > Execute : > # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test fails. > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # Delete assembly/target/scala-2.12/jars manually > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test fails. 
> > The error stack is : > {code:java} > 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: > User class threw exception: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor > 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:216) > at > org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) > at scala.collection.immutable.List.flatMap(List.scala:293) > at scala.collection.immutable.List.flatMap(List.scala:79) > at > org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) > at > org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) > at > org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) > at > com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) > at > org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) > at > org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266) > at > org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) > at >
[jira] [Comment Edited] (SPARK-37209) YarnShuffleIntegrationSuite and two other similar cases in `resource-managers` test failed
[ https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444502#comment-17444502 ] Yang Jie edited comment on SPARK-37209 at 11/16/21, 12:10 PM: -- After some investigation, I found that this issue may be related to `hadoop-3.x`; when using the `hadoop-2.7` profile, the above test succeeds: {code:java} mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite Discovery starting. Discovery completed in 259 milliseconds. Run starting. Expected test count is: 1 YarnShuffleIntegrationSuite: - external shuffle service Run completed in 30 seconds, 765 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed.{code} It seems that when testing with hadoop-2.7, the result of executing `Utils.isTesting` on the executor side is true, which helps the test case ignore the `NoClassDefFoundError` in the test, but when testing with hadoop-3.2, the result of executing `Utils.isTesting` on the executor side is false. But I haven't investigated the root cause with hadoop-3.2 yet. cc [~hyukjin.kwon] [~dongjoon] [~srowen] was (Author: luciferyang): After some investigation, I found that this issue may be related to `hadoop-3.x`; when using the `hadoop-2.7` profile, the above test succeeds: {code:java} mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite Discovery starting. Discovery completed in 259 milliseconds. Run starting. Expected test count is: 1 YarnShuffleIntegrationSuite: - external shuffle service Run completed in 30 seconds, 765 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed.{code} It seems that when testing with hadoop-2.7, the result of executing `Utils.isTesting` on the executor side is true, which helps the test case ignore the `NoClassDefFoundError` in the test, but when testing with hadoop-3.2, the result of executing `Utils.isTesting` on the executor side is false. But I haven't investigated the root cause with hadoop-3.2 yet. > YarnShuffleIntegrationSuite and two other similar cases in > `resource-managers` test failed > --- > > Key: SPARK-37209 > URL: https://issues.apache.org/jira/browse/SPARK-37209 > Project: Spark > Issue Type: Bug > Components: Tests, YARN >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > Attachments: failed-unit-tests.log, success-unit-tests.log > > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test succeeds. 
> > Execute : > # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test fails. > > Execute : > # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud > -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl > -Pkubernetes -Phive > # Delete assembly/target/scala-2.12/jars manually > # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl resource-managers/yarn > The test fails. > > The error stack is : > {code:java} > 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: > User class threw exception: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor > 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:216) > at >
[jira] [Created] (SPARK-37347) Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly
liukai created SPARK-37347: -- Summary: Spark Thrift Server (STS) driver full GC because timeoutExecutor is not shut down correctly Key: SPARK-37347 URL: https://issues.apache.org/jira/browse/SPARK-37347 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.1, 3.1.0 Reporter: liukai Fix For: 3.1.3, 3.2.1 When spark.sql.thriftServer.queryTimeout or java.sql.Statement.setQueryTimeout is set to a value greater than 0, SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running queries, introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But the timeoutExecutor is not shut down correctly when the statement finishes; it is only shut down when the timeout fires. When the timeout is set to a long duration, for example 1 hour, the long-running STS driver goes into full GC and the application is not available for a long time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
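For reference, a hedged sketch of the client side of the scenario described in this report: a JDBC client connecting to the Thrift Server and setting a one-hour statement timeout via java.sql.Statement.setQueryTimeout. The host, URL, and query are placeholders, not taken from the report, and it assumes the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath.
{code:scala}
import java.sql.DriverManager

// Illustrative client only; per the report above, each such statement would
// keep its server-side timeoutExecutor alive for the full hour even after
// the query finishes.
object StsQueryTimeoutClient {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver") // assumed driver class
    val conn = DriverManager.getConnection("jdbc:hive2://sts-host:10000/default")
    try {
      val stmt = conn.createStatement()
      stmt.setQueryTimeout(3600) // 1 hour, as in the example in the report
      val rs = stmt.executeQuery("SELECT 1")
      while (rs.next()) println(rs.getInt(1))
    } finally {
      conn.close()
    }
  }
}
{code}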
[jira] [Commented] (SPARK-37346) Link migration guide to each project.
[ https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1775#comment-1775 ] angerszhu commented on SPARK-37346: --- Will raise a PR soon. > Link migration guide to each project. > - > > Key: SPARK-37346 > URL: https://issues.apache.org/jira/browse/SPARK-37346 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Link migration guide to each project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org