[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?
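Note: a SCALAR pandas UDF receives a whole batch of rows as a pd.Series and must return a pd.Series of the same length; {{return pd.Series(num)}} always has length 1, which is what the "expected 100, got 1" message points at. A minimal, illustrative sketch of a length-preserving variant of the UDF above (not necessarily the intended fix; it assumes pd_fname is small enough to be captured by the closure, as in the snippet):
{code:java}
@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # single-column DataFrame -> Series of fnames
    # Return one count per input value so the output length matches the input batch.
    return word.apply(lambda w: int(my_Series.str.contains(w).sum()))
{code}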

  was:
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframe, and I try above method, however, I get this
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the real data error came. I 
checked the data, can't figure out,

does anyone thinks where it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972
 ] 

liu edited comment on SPARK-37325 at 11/17/21, 7:19 AM:


[~hyukjin.kwon]  I don't think that will work. To make it clear: I have two
dataframes, one Spark DataFrame and one pandas DataFrame, and I want to use 
{code:java}
@pandas_udf{code}
 because it is faster; however, it fails with an error I don't understand.

I can solve it with 
{code:java}
@udf{code}
 as shown below, but it is slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
   my_Series = pd_fname.squeeze() # dataframe to Series
   num = my_Series.str.contains(word).sum()
   return int(num){code}
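A quick way to see the difference without Spark: @udf is called once per row, while a SCALAR pandas_udf gets a whole batch as a pd.Series and must hand back a Series of the same length. A small local check in plain pandas (the data values here are made up stand-ins for pd_fname):
{code:java}
import pandas as pd

# Stand-ins for the objects in the snippets above (values are made up).
pd_fname = pd.DataFrame({"fname": ["alpha one", "beta two", "alphabet"]})
my_Series = pd_fname.squeeze()                 # single-column DataFrame -> Series

batch = pd.Series(["alpha", "beta", "gamma"])  # a hypothetical input batch
out = batch.apply(lambda w: int(my_Series.str.contains(w).sum()))

# A SCALAR pandas UDF must preserve the batch length; pd.Series(num) would not.
assert len(out) == len(batch)
print(out.tolist())  # [2, 1, 0]
{code}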


was (Author: JIRAUSER280209):
[~hyukjin.kwon]  I don't think it will work. I have to make the questions 
clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want 
to use 
{code:java}
@pandas_udf{code}
 for it's faster, however, it has some unknown error.

I can solve it by 
{code:java}
@udf{code}
 like below, but it's slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
   my_Series = pd_fname.squeeze() # dataframe to Series
   num = my_Series.str.contains(word).sum()
   return int(num){code}

> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?

  was:
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframe, and I try above method, however, I get this
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the real data error came. I 
checked the data, can't figure out,

does anyone thinks what it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972
 ] 

liu edited comment on SPARK-37325 at 11/17/21, 7:18 AM:


[~hyukjin.kwon]  I don't think that will work. To make the question clear: I have
two dataframes, one Spark DataFrame and one pandas DataFrame, and I want to use 
{code:java}
@pandas_udf{code}
 because it is faster; however, it fails with an error I don't understand.

I can solve it with 
{code:java}
@udf{code}
 as shown below, but it is slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
   my_Series = pd_fname.squeeze() # dataframe to Series
   num = my_Series.str.contains(word).sum()
   return int(num){code}


was (Author: JIRAUSER280209):
[~hyukjin.kwon]  I don't think it will work. I have to make the questions 
clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want 
to use 
{code:java}
@pandas_udf{code}
 for it's faster, however, it has some unknown error.

I can solve it by 
{code:java}
@udf{code}
 like below, but it's slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
my_Series = pd_fname.squeeze() # dataframe to Series
num = my_Series.str.contains(word).sum()
return int(num){code}

> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972
 ] 

liu edited comment on SPARK-37325 at 11/17/21, 7:16 AM:


[~hyukjin.kwon]  I don't think that will work. To make the question clear: I have
two dataframes, one Spark DataFrame and one pandas DataFrame, and I want to use 
{code:java}
@pandas_udf{code}
 because it is faster; however, it fails with an error I don't understand.

I can solve it with 
{code:java}
@udf{code}
 as shown below, but it is slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
   my_Series = pd_fname.squeeze() # dataframe to Series
   num = my_Series.str.contains(word).sum()
   return int(num){code}


was (Author: JIRAUSER280209):
[~hyukjin.kwon]  I don't think it will work. I have to make the questions 
clear, I have two dataframe, one spark dataframe and one pd.dataframe, I want 
to use 
{code:java}
@pandas_udf{code}

 for it's faster, however, it have some unknown error.

I can solve it by 
{code:java}
@udf{code}

 like below, but it's slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
my_Series = pd_fname.squeeze() # dataframe to Series
num = my_Series.str.contains(word).sum()
return int(num){code}

> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?

  was:
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
     return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframe, and I try above method, however, I get this
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the real data error came. I 
checked the data, can't figure out,

does anyone thinks where it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Comment Edited] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972
 ] 

liu edited comment on SPARK-37325 at 11/17/21, 7:14 AM:


[~hyukjin.kwon]  I don't think that will work. To make the question clear: I have
two dataframes, one Spark DataFrame and one pandas DataFrame, and I want to use 
{code:java}
@pandas_udf{code}
 because it is faster; however, it fails with an error I don't understand.

I can solve it with 
{code:java}
@udf{code}
 as shown below, but it is slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
   my_Series = pd_fname.squeeze() # dataframe to Series
   num = my_Series.str.contains(word).sum()
   return int(num){code}


was (Author: JIRAUSER280209):
[~hyukjin.kwon]  I don't think it will work. I have to make the questions 
clear, I have to dataframe, one spark dataframe and one pd.dataframe, I want to 
use @pandas_udf
 for it's faster, however, it have some unknown error.

I can solve it by @udf
 like below, but it's slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
my_Series = pd_fname.squeeze() # dataframe to Series
num = my_Series.str.contains(word).sum()
return int(num){code}

> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?

  was:
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)
df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()
@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
     return pd.Series(num)
df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframe, and I try above method, however, I get this
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}

it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the real data error came. I 
checked the data, can't figure out,

does anyone thinks where it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Commented] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444972#comment-17444972
 ] 

liu commented on SPARK-37325:
-

[~hyukjin.kwon]  I don't think that will work. To make the question clear: I have
two dataframes, one Spark DataFrame and one pandas DataFrame, and I want to use
@pandas_udf because it is faster; however, it fails with an error I don't understand.

I can solve it with @udf as shown below, but it is slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
   my_Series = pd_fname.squeeze() # dataframe to Series
   num = my_Series.str.contains(word).sum()
   return int(num){code}

> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)
df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()
@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)
df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?

  was:
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)
df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()
@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
     return pd.Series(num)
df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframe, and I try above method, however, I get this
{{}}
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
{{}}
it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the real data error came. I 
checked the data, can't figure out,

does anyone thinks where it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)
df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()
@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)
df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?

  was:
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)
df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()
@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
     return pd.Series(num)
df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframe, and I try above method, however, I get this
{{RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1}}
it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the real data error came. I 
checked the data, can't figure out,

does anyone thinks where it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)
df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()
@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)
df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}
 

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?

  was:
schema = StructType([
StructField("node", StringType())
])
{{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}}
{{rdd = rdd.map(lambda obj: \{'node': obj})}}
{{df_node = spark.createDataFrame(rdd, schema=schema)}}

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

{{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}}
{{def udf_match(word: pd.Series) -> pd.Series:}}
{{  my_Series = pd_fname.squeeze() # dataframe to Series}}
{{  num = int(my_Series.str.contains(word.array[0]).sum())}}
     return pd.Series(num)

{{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}}

Hi, I have two dataframe, and I try above method, however, I get this
{{RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1}}
it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the real data error came. I 
checked the data, can't figure out,

does anyone thinks where it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
>  
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>  
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Created] (SPARK-37353) Resolved attribute(s) region_id#7993 missing from lm_tam_code#2666,ld_month#2697,geo_sub_district_id#2677 when doing left join

2021-11-16 Thread Anubhav Tarar (Jira)
Anubhav Tarar created SPARK-37353:
-

 Summary:  Resolved attribute(s) region_id#7993 missing from 
lm_tam_code#2666,ld_month#2697,geo_sub_district_id#2677 when doing left join
 Key: SPARK-37353
 URL: https://issues.apache.org/jira/browse/SPARK-37353
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Anubhav Tarar


For this code I am getting an error:

var df = totalTimeFootfallDf.join(dwellTimeDf,Seq(groupByKeys: _*),"left_outer")


org.apache.spark.sql.AnalysisException: Resolved attribute(s) region_id#7993 
missing from 
lm_tam_code#2666,ld_month#2697,geo_sub_district_id#2677,lm_subldm_id#2652,geo_province_name_th#2689,lm_amp_code#2665,lm_cat_name_t#2675,lm_prov_code#2664,lm_sub_name_t#2672,lm_name_branch#2694,region_id#422,lm_update_dttm#2669,mkt_region_name_en#2692,lm_radius#369,geo_province_id#2679,lm_name#2655,geo_province_name_en#2684,lm_name_fld_t#2656,geo_region_name_en#2686,geo_district_name_th#2688,geo_sub_region_id#2680,ld_day#2698,lm_master#2653,lm_location_t#2660,lm_zone#2668,lm_latitude#2671,lm_branch_t#2658,geo_sub_district_name_th#2687,lm_cat_id#2674,geo_district_id#2678,lm_symbol#2649,lm_id#2648,lm_name_t#2654,lm_cat_name#2676,geo_district_name_en#2683,lm_name_fld#2657,lm_master_branch#2663,lm_longitude#2670,lm_branch#2659,mkt_sub_region_id#2690,lm_sub_cat_id#2650,geo_sub_region_name_en#2685,lm_sub_cat_name#316,mkt_sub_region_name_en#2691,grid_id#2695,lm_floor#2667,lm_location#2661,geo_sub_district_name_en#2682,lm_master_branch_t#2662,data_date#2693,lm_ldmtag#2651,ld_year#2696
 in operator !Project [lm_id#2648, lm_symbol#2649, lm_sub_cat_id#2650, 
lm_ldmtag#2651, lm_subldm_id#2652, lm_master#2653, lm_name_t#2654, 
lm_name#2655, lm_name_fld_t#2656, lm_name_fld#2657, lm_branch_t#2658, 
lm_branch#2659, lm_location_t#2660, lm_location#2661, lm_master_branch_t#2662, 
lm_master_branch#2663, lm_prov_code#2664, lm_amp_code#2665, lm_tam_code#2666, 
lm_floor#2667, lm_zone#2668, lm_update_dttm#2669, lm_longitude#2670, 
lm_latitude#2671, ... 28 more fields]. Attribute(s) with the same name appear 
in the operation: region_id. Please check if the right attribute(s) are used.;;
Join LeftOuter






[jira] [Commented] (SPARK-37270) Incorrect result of filter using isNull condition

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444956#comment-17444956
 ] 

Apache Spark commented on SPARK-37270:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/34627

> Incorrect result of filter using isNull condition
> 
>
> Key: SPARK-37270
> URL: https://issues.apache.org/jira/browse/SPARK-37270
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Tomasz Kus
>Priority: Major
>  Labels: correctness
>
> Simple code that reproduces this issue:
> {code:java}
>  val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false){code}
> Although the "conditions" column is null
> {code:java}
>  +-+--+--+
> |bool |number|conditions|
> +-+--+--+
> |false|1     |null      |
> +-+--+--+{code}
> an empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
> == Optimized Logical Plan ==
> LocalRelation , [bool#124, number#125, conditions#252]
> == Physical Plan ==
> LocalTableScan , [bool#124, number#125, conditions#252]
>  {code}
> After removing the checkpoint the proper result is returned and the execution
> plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END 
> AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
>  {code}
> It seems that the most important difference is LogicalRDD ->  LocalRelation
> The following workarounds retrieve the correct result:
> 1) remove checkpoint
> 2) add explicit .otherwise(null) to when (a PySpark sketch follows below)
> 3) add checkpoint() or cache() just before filter
> 4) downgrade to Spark 3.1.2
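For reference, a PySpark rendering of workaround 2 (Python is used here for consistency with the other snippets in this digest; the original reproduction is Scala, and the checkpoint directory is an illustrative path):
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # required before checkpoint()

frame = spark.createDataFrame([(False, 1)], ["bool", "number"]).checkpoint()

# Workaround 2 from the list above: spell out .otherwise(None) instead of relying
# on the implicit NULL branch of when().
result = (frame
          .withColumn("conditions",
                      F.when(F.col("bool"), "I am not null").otherwise(F.lit(None)))
          .filter(F.col("conditions").isNull()))
result.show(truncate=False)
{code}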






[jira] [Assigned] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37352:


Assignee: Apache Spark

> Silence the `index_col` advice in `to_spark()` for internal usage
> -
>
> Key: SPARK-37352
> URL: https://issues.apache.org/jira/browse/SPARK-37352
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> The advice warning for `to_spark()` - when the `index_col` parameter is not
> specified - issues too many messages, e.g. when the user runs the plotting
> functions, so we want to silence the warning message when `to_spark()` is used
> for internal purposes.






[jira] [Assigned] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37352:


Assignee: (was: Apache Spark)

> Silence the `index_col` advice in `to_spark()` for internal usage
> -
>
> Key: SPARK-37352
> URL: https://issues.apache.org/jira/browse/SPARK-37352
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The advice warning for `to_spark()` - when the `index_col` parameter is not
> specified - issues too many messages, e.g. when the user runs the plotting
> functions, so we want to silence the warning message when `to_spark()` is used
> for internal purposes.






[jira] [Commented] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444953#comment-17444953
 ] 

Apache Spark commented on SPARK-37352:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/34626

> Silence the `index_col` advice in `to_spark()` for internal usage
> -
>
> Key: SPARK-37352
> URL: https://issues.apache.org/jira/browse/SPARK-37352
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The advice warning for `to_spark()` - when the `index_col` parameter is not
> specified - issues too many messages, e.g. when the user runs the plotting
> functions, so we want to silence the warning message when `to_spark()` is used
> for internal purposes.






[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
{code:java}
schema = StructType([
StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
  my_Series = pd_fname.squeeze() # dataframe to Series
  num = int(my_Series.str.contains(word.array[0]).sum())
  return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
{code}

Hi, I have two dataframes, and when I try the method above I get this:
{code:java}
RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1{code}
I would be really thankful for any help.

 

PS: I don't think the method itself is the problem: I created sample data of the
same shape and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure it out.

Does anyone know what causes it?

  was:
schema = StructType([
StructField("node", StringType())
])
{{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}}
{{rdd = rdd.map(lambda obj: \{'node': obj})}}
{{df_node = spark.createDataFrame(rdd, schema=schema)}}

df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

{{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}}
{{def udf_match(word: pd.Series) -> pd.Series:}}
{{  my_Series = pd_fname.squeeze() # dataframe to Series}}
{{  num = int(my_Series.str.contains(word.array[0]).sum())}}
     return pd.Series(num)

{{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}}


Hi, I have two dataframe, and I try above method, however, I get this
{{RuntimeError: Result vector from pandas_udf was not the required length: 
expected 100, got 1}}
it will be really thankful, if there is any helps

 

PS: for the method itself, I think there is no problem, I create same sample 
data to verify it successfully, however, when I use the really data it came. I 
checked the data, can't figure out,

does anyone thinks where it cause?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
> {code:java}
> schema = StructType([
> StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>   my_Series = pd_fname.squeeze() # dataframe to Series
>   num = int(my_Series.str.contains(word.array[0]).sum())
>   return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
> Hi, I have two dataframes, and when I try the method above I get this:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1{code}
> I would be really thankful for any help.
>  
> PS: I don't think the method itself is the problem: I created sample data of the
> same shape and it ran successfully, but the error appears when I use the real data.
> I checked the data and can't figure it out.
> Does anyone know what causes it?






[jira] [Updated] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage

2021-11-16 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37352:

Description: The advice warning for `to_spark()` - shown when the `index_col` 
parameter is not specified - issues too many messages, e.g. when the user runs the 
plotting functions, so we want to silence the warning message when `to_spark()` is 
used for internal purposes.  (was: The advice warning for `index_col` in `to_spark()` 
issuing too much message when user runs the plotting functions, so we want to 
silence the warning message when it's used as an internal purpose.)

> Silence the `index_col` advice in `to_spark()` for internal usage
> -
>
> Key: SPARK-37352
> URL: https://issues.apache.org/jira/browse/SPARK-37352
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The advice warning for `to_spark()` - when the `index_col` parameter is not 
> specified - issuing too much message e.g. when user runs the plotting 
> functions, so we want to silence the warning message when it's used as an 
> internal purpose.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37352) Silence the `index_col` advice in `to_spark()` for internal usage

2021-11-16 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-37352:
---

 Summary: Silence the `index_col` advice in `to_spark()` for 
internal usage
 Key: SPARK-37352
 URL: https://issues.apache.org/jira/browse/SPARK-37352
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Haejoon Lee


The advice warning for `index_col` in `to_spark()` issues too many messages when the 
user runs the plotting functions, so we want to silence the warning message when 
`to_spark()` is used for internal purposes.
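
For context, a user-facing sketch of the two call styles (illustrative only; the 
column and index names are made up):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})

# No index_col given: the advice is issued and the index is dropped.
sdf = psdf.to_spark()

# index_col given: the index is kept as a column and there is nothing to advise.
sdf_with_index = psdf.to_spark(index_col="idx")
{code}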



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37279:


Assignee: Apache Spark

> Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
> ---
>
> Key: SPARK-37279
> URL: https://issues.apache.org/jira/browse/SPARK-37279
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Implements the support of DayTimeIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas when Arrow is disabled
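
A hedged sketch of what this support is expected to look like from the Python side 
once implemented, assuming DayTimeIntervalType maps to datetime.timedelta:

{code:python}
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DayTimeIntervalType

# createDataFrame: a timedelta value becomes a DayTimeIntervalType column.
df = spark.createDataFrame([(datetime.timedelta(days=1, hours=2),)], ["d"])

# A plain Python UDF declared to return DayTimeIntervalType.
double_interval = udf(lambda d: d * 2, DayTimeIntervalType())
df.select(double_interval("d").alias("d2")).toPandas()
{code}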



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37279:


Assignee: (was: Apache Spark)

> Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
> ---
>
> Key: SPARK-37279
> URL: https://issues.apache.org/jira/browse/SPARK-37279
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of DayTimeIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas when Arrow is disabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444944#comment-17444944
 ] 

Apache Spark commented on SPARK-37279:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34614

> Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
> ---
>
> Key: SPARK-37279
> URL: https://issues.apache.org/jira/browse/SPARK-37279
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of DayTimeIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas when Arrow is disabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37281) Support DayTimeIntervalType in Py4J

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37281:


Assignee: (was: Apache Spark)

> Support DayTimeIntervalType in Py4J
> ---
>
> Key: SPARK-37281
> URL: https://issues.apache.org/jira/browse/SPARK-37281
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This PR adds support for DayTimeIntervalType in Py4J. For example, 
> functions.lit() with a DayTimeIntervalType value should work.
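
A hedged illustration of what that means in practice, assuming the Python-side value 
is a datetime.timedelta:

{code:python}
import datetime
from pyspark.sql import functions as F

# Once supported, lit() accepts a timedelta and produces a
# DayTimeIntervalType (interval day to second) literal column.
df = spark.range(1).select(F.lit(datetime.timedelta(hours=3)).alias("i"))
df.printSchema()
{code}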



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37281) Support DayTimeIntervalType in Py4J

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444943#comment-17444943
 ] 

Apache Spark commented on SPARK-37281:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34625

> Support DayTimeIntervalType in Py4J
> ---
>
> Key: SPARK-37281
> URL: https://issues.apache.org/jira/browse/SPARK-37281
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This PR adds support for DayTimeIntervalType in Py4J. For example, 
> functions.lit() with a DayTimeIntervalType value should work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37281) Support DayTimeIntervalType in Py4J

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37281:


Assignee: Apache Spark

> Support DayTimeIntervalType in Py4J
> ---
>
> Key: SPARK-37281
> URL: https://issues.apache.org/jira/browse/SPARK-37281
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> This PR adds support for DayTimeIntervalType in Py4J. For example, 
> functions.lit() with a DayTimeIntervalType value should work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37277) Support DayTimeIntervalType in Arrow

2021-11-16 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37277 ]


Hyukjin Kwon deleted comment on SPARK-37277:
--

was (Author: apachespark):
User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34614

> Support DayTimeIntervalType in Arrow
> 
>
> Key: SPARK-37277
> URL: https://issues.apache.org/jira/browse/SPARK-37277
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of DayTimeIntervalType in Arrow code path:
> - pandas UDFs
> - pandas functions APIs
> - createDataFrame/toPandas when Arrow is enabled
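
Roughly, and only as a sketch of the intended behaviour once this lands: the Arrow 
path would let a pandas UDF consume and produce the type as timedelta64[ns] values.

{code:python}
import datetime
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DayTimeIntervalType

@pandas_udf(DayTimeIntervalType())
def add_one_hour(d: pd.Series) -> pd.Series:
    # The column arrives as timedelta64[ns]; return the same dtype.
    return d + pd.Timedelta(hours=1)

df = spark.createDataFrame([(datetime.timedelta(minutes=30),)], ["d"])
df.select(add_one_hour("d")).toPandas()
{code}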



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-37277) Support DayTimeIntervalType in Arrow

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-37277:
--

> Support DayTimeIntervalType in Arrow
> 
>
> Key: SPARK-37277
> URL: https://issues.apache.org/jira/browse/SPARK-37277
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of DayTimeIntervalType in Arrow code path:
> - pandas UDFs
> - pandas functions APIs
> - createDataFrame/toPandas when Arrow is enabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37277) Support DayTimeIntervalType in Arrow

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37277.
--
Resolution: Fixed

> Support DayTimeIntervalType in Arrow
> 
>
> Key: SPARK-37277
> URL: https://issues.apache.org/jira/browse/SPARK-37277
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of DayTimeIntervalType in Arrow code path:
> - pandas UDFs
> - pandas functions APIs
> - createDataFrame/toPandas when Arrow is enabled



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37325.
--
Resolution: Invalid

> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
> schema = StructType([
> StructField("node", StringType())
> ])
> {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}}
> {{rdd = rdd.map(lambda obj: \{'node': obj})}}
> {{df_node = spark.createDataFrame(rdd, schema=schema)}}
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}}
> {{def udf_match(word: pd.Series) -> pd.Series:}}
> {{  my_Series = pd_fname.squeeze() # dataframe to Series}}
> {{  num = int(my_Series.str.contains(word.array[0]).sum())}}
>      return pd.Series(num)
> {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}}
> Hi, I have two dataframe, and I try above method, however, I get this
> {{RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1}}
> it will be really thankful, if there is any helps
>  
> PS: for the method itself, I think there is no problem, I create same sample 
> data to verify it successfully, however, when I use the really data it came. 
> I checked the data, can't figure out,
> does anyone thinks where it cause?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444938#comment-17444938
 ] 

Hyukjin Kwon commented on SPARK-37348:
--

The question should be how often it is used in the Python world, and how common it 
is. There are many expressions that do not exist in the Python, Scala, etc. APIs 
(e.g., regex_* expressions), but they are left unimplemented on purpose.

> PySpark pmod function
> -
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Tim Schwab
>Priority: Minor
>
> Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns 
> -1. However, the modulus is often desired instead of the remainder.
>  
> There is a [PMOD() function in Spark 
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in 
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
>  So at the moment, the two options for getting the modulus is to use 
> F.expr("pmod(A, B)"), or create a helper function such as:
>  
> {code:java}
> def pmod(dividend, divisor):
> return F.when(dividend < 0, (dividend % divisor) + 
> divisor).otherwise(dividend % divisor){code}
>  
>  
> Neither are optimal - pmod should be native to PySpark as it is in Spark SQL.
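
For concreteness, a small sketch of the two workarounds described above (the column 
name and divisor are made up):

{code:python}
from pyspark.sql import functions as F

df = spark.createDataFrame([(-1,), (5,)], ["a"])

# Workaround 1: go through the SQL function with expr().
df.select(F.expr("pmod(a, 2)").alias("m1")).show()

# Workaround 2: the helper from the description above.
def pmod(dividend, divisor):
    return F.when(dividend < 0, (dividend % divisor) + divisor) \
            .otherwise(dividend % divisor)

df.select(pmod(F.col("a"), F.lit(2)).alias("m2")).show()
{code}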



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444939#comment-17444939
 ] 

Hyukjin Kwon commented on SPARK-37325:
--

Again, please consider using DataFrame.mapInPandas. Questions like this are better 
directed to the mailing list.
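
For reference, a hedged sketch of the suggested DataFrame.mapInPandas approach applied 
to the use case quoted below (illustration only, reusing the reporter's names):

{code:python}
def count_matches(batches):
    # pd_fname is the driver-side pandas DataFrame from the report,
    # captured in the closure and shipped to the executors.
    fname_series = pd_fname.squeeze()
    for pdf in batches:
        pdf["match_fname_num"] = pdf["node"].apply(
            lambda w: int(fname_series.str.contains(w).sum()))
        yield pdf

df = df_node.mapInPandas(count_matches, schema="node string, match_fname_num int")
{code}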

> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
> schema = StructType([
> StructField("node", StringType())
> ])
> {{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}}
> {{rdd = rdd.map(lambda obj: \{'node': obj})}}
> {{df_node = spark.createDataFrame(rdd, schema=schema)}}
> df_fname =spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> {{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}}
> {{def udf_match(word: pd.Series) -> pd.Series:}}
> {{  my_Series = pd_fname.squeeze() # dataframe to Series}}
> {{  num = int(my_Series.str.contains(word.array[0]).sum())}}
>      return pd.Series(num)
> {{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}}
> Hi, I have two dataframe, and I try above method, however, I get this
> {{RuntimeError: Result vector from pandas_udf was not the required length: 
> expected 100, got 1}}
> it will be really thankful, if there is any helps
>  
> PS: for the method itself, I think there is no problem, I create same sample 
> data to verify it successfully, however, when I use the really data it came. 
> I checked the data, can't figure out,
> does anyone thinks where it cause?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes

2021-11-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444937#comment-17444937
 ] 

Hyukjin Kwon commented on SPARK-37350:
--

Spark 2.4.x is EOL. Can you check whether the same issue persists in Spark 3.x?

> EventLoggingListener keep logging errors after hdfs restart all datanodes
> -
>
> Key: SPARK-37350
> URL: https://issues.apache.org/jira/browse/SPARK-37350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1
>Reporter: Shefron Yudy
>Priority: Major
>
> I saw the error in SparkThriftServer process's log when I restart all 
> datanodes of HDFS , The logs as follows:
> {code:java}
> 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
> scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
>  are bad. Aborting...
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> The eventLog will be available normally if I restart the SparkThriftServer, I 
> suggest that the EventLoggingListener's dfs writer and hadoopDataStream 
> should reconnect after all datanodes stop and then start later。
> {code:java}
>   /** Log the event as JSON. */
>   private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = 
> false) {
> val eventJson = JsonProtocol.sparkEventToJson(event)
> // scalastyle:off println
> writer.foreach(_.println(compact(render(eventJson
> // scalastyle:on println
> if (flushLogger) {
>   writer.foreach(_.flush())
>   hadoopDataStream.foreach(_.hflush())
> }
> if (testing) {
>   loggedEvents += eventJson
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37334) pandas `convert_dtypes` method support

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37334:
-
Parent: SPARK-37197
Issue Type: Sub-task  (was: New Feature)

> pandas `convert_dtypes` method support
> --
>
> Key: SPARK-37334
> URL: https://issues.apache.org/jira/browse/SPARK-37334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Ali Amin-Nejad
>Priority: Minor
>
> Support for the {{convert_dtypes}} method as part of the new pandas API in 
> pyspark?
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html]
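
For readers unfamiliar with the pandas method, a short plain-pandas illustration of 
what is being requested (the ask is an equivalent for pyspark.pandas DataFrames):

{code:python}
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"]})
print(df.dtypes)                 # a: float64, b: object

converted = df.convert_dtypes()  # infer the best nullable extension dtypes
print(converted.dtypes)          # a: Int64, b: string
{code}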



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37348:
-
Parent: (was: SPARK-37197)
Issue Type: Bug  (was: Sub-task)

> PySpark pmod function
> -
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Tim Schwab
>Priority: Minor
>
> Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns 
> -1. However, the modulus is often desired instead of the remainder.
>  
> There is a [PMOD() function in Spark 
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in 
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
>  So at the moment, the two options for getting the modulus is to use 
> F.expr("pmod(A, B)"), or create a helper function such as:
>  
> {code:java}
> def pmod(dividend, divisor):
> return F.when(dividend < 0, (dividend % divisor) + 
> divisor).otherwise(dividend % divisor){code}
>  
>  
> Neither are optimal - pmod should be native to PySpark as it is in Spark SQL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37348:
-
Labels:   (was: newbie)

> PySpark pmod function
> -
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Tim Schwab
>Priority: Minor
>
> Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns 
> -1. However, the modulus is often desired instead of the remainder.
>  
> There is a [PMOD() function in Spark 
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in 
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
>  So at the moment, the two options for getting the modulus is to use 
> F.expr("pmod(A, B)"), or create a helper function such as:
>  
> {code:java}
> def pmod(dividend, divisor):
> return F.when(dividend < 0, (dividend % divisor) + 
> divisor).otherwise(dividend % divisor){code}
>  
>  
> Neither are optimal - pmod should be native to PySpark as it is in Spark SQL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37348:
-
Parent: SPARK-37197
Issue Type: Sub-task  (was: New Feature)

> PySpark pmod function
> -
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Tim Schwab
>Priority: Minor
>  Labels: newbie
>
> Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns 
> -1. However, the modulus is often desired instead of the remainder.
>  
> There is a [PMOD() function in Spark 
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in 
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
>  So at the moment, the two options for getting the modulus is to use 
> F.expr("pmod(A, B)"), or create a helper function such as:
>  
> {code:java}
> def pmod(dividend, divisor):
> return F.when(dividend < 0, (dividend % divisor) + 
> divisor).otherwise(dividend % divisor){code}
>  
>  
> Neither are optimal - pmod should be native to PySpark as it is in Spark SQL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37348:
-
Issue Type: Improvement  (was: Bug)

> PySpark pmod function
> -
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Tim Schwab
>Priority: Minor
>
> Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns 
> -1. However, the modulus is often desired instead of the remainder.
>  
> There is a [PMOD() function in Spark 
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in 
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
>  So at the moment, the two options for getting the modulus is to use 
> F.expr("pmod(A, B)"), or create a helper function such as:
>  
> {code:java}
> def pmod(dividend, divisor):
> return F.when(dividend < 0, (dividend % divisor) + 
> divisor).otherwise(dividend % divisor){code}
>  
>  
> Neither are optimal - pmod should be native to PySpark as it is in Spark SQL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37230) Document DataFrame.mapInArrow

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444934#comment-17444934
 ] 

Apache Spark commented on SPARK-37230:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34624

> Document DataFrame.mapInArrow
> -
>
> Key: SPARK-37230
> URL: https://issues.apache.org/jira/browse/SPARK-37230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37230) Document DataFrame.mapInArrow

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444933#comment-17444933
 ] 

Apache Spark commented on SPARK-37230:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34624

> Document DataFrame.mapInArrow
> -
>
> Key: SPARK-37230
> URL: https://issues.apache.org/jira/browse/SPARK-37230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37230) Document DataFrame.mapInArrow

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37230:


Assignee: Apache Spark

> Document DataFrame.mapInArrow
> -
>
> Key: SPARK-37230
> URL: https://issues.apache.org/jira/browse/SPARK-37230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37230) Document DataFrame.mapInArrow

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37230:


Assignee: (was: Apache Spark)

> Document DataFrame.mapInArrow
> -
>
> Key: SPARK-37230
> URL: https://issues.apache.org/jira/browse/SPARK-37230
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37347) Spark Thrift Server (STS) driver fullFC becourse of timeoutExecutor not shutdown correctly

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444919#comment-17444919
 ] 

Apache Spark commented on SPARK-37347:
--

User 'lk246' has created a pull request for this issue:
https://github.com/apache/spark/pull/34623

> Spark Thrift Server (STS) driver fullFC becourse of timeoutExecutor not 
> shutdown correctly
> --
>
> Key: SPARK-37347
> URL: https://issues.apache.org/jira/browse/SPARK-37347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: liukai
>Priority: Major
>
> When spark.sql.thriftServer.queryTimeout or 
> java.sql.Statement.setQueryTimeout is setted >0 , 
> SparkExecuteStatementOperation add timeoutExecutor to kill time-consumed 
> query in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]. But 
> timeoutExecutor is not shutdown correctly when statement is finished, it can 
> only be shutdown when timeout. When we set timeout to a long time for example 
> 1 hour, the long-running STS driver will FullGC and the application is not 
> available for  a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes

2021-11-16 Thread Shefron Yudy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shefron Yudy updated SPARK-37350:
-
Description: 
I saw this error in the SparkThriftServer process's log when I restarted all the 
datanodes of HDFS. The log is as follows:

{code:java}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{code}

The event log only becomes available again if I restart the SparkThriftServer. I 
suggest that the EventLoggingListener's DFS writer and hadoopDataStream reconnect 
after all the datanodes have been stopped and later started again.


{code:java}
  /** Log the event as JSON. */
  private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) 
{
val eventJson = JsonProtocol.sparkEventToJson(event)
// scalastyle:off println
writer.foreach(_.println(compact(render(eventJson))))
// scalastyle:on println
if (flushLogger) {
  writer.foreach(_.flush())
  hadoopDataStream.foreach(_.hflush())
}
if (testing) {
  loggedEvents += eventJson
}
  }
{code}


  was:
I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:

{code:java}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{code}

The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer and hadoopDataStream should 
reconnect after all datanodes stop and then start later。


{code:java}
  /** Log the event as JSON. */
  private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) 
{
val eventJson = JsonProtocol.sparkEventToJson(event)
// scalastyle:off println
writer.foreach(_.println(compact(render(eventJson
// scalastyle:on println
if (flushLogger) {
  writer.foreach(_.flush())
  hadoopDataStream.foreach(_.hflush())
}
if (testing) {
  loggedEvents += eventJson
}
  }
{code}



> EventLoggingListener keep logging errors after hdfs restart all datanodes
> -
>
> Key: SPARK-37350
> URL: https://issues.apache.org/jira/browse/SPARK-37350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1
>Reporter: Shefron Yudy
>Priority: Major
>
> I saw the error in SparkThriftServer process's log when I restart all 
> datanodes of HDFS , The logs as follows:
> {code:java}
> 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
> scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
>  are bad. Aborting...
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> The eventLog will be available normally if I restart the SparkThriftServer, I 
> suggest that the EventLoggingListener's dfs writer and hadoopDataStream 
> should reconnect after all datanodes stop and then start later。
> {code:java}
>   /** Log the event as JSON. */
>   private def 

[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes

2021-11-16 Thread Shefron Yudy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shefron Yudy updated SPARK-37350:
-
Description: 
I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:

{code:java}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{code}

The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer and hadoopDataStream should 
reconnect after all datanodes stop and then start later。


{code:java}
  /** Log the event as JSON. */
  private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) 
{
val eventJson = JsonProtocol.sparkEventToJson(event)
// scalastyle:off println
writer.foreach(_.println(compact(render(eventJson
// scalastyle:on println
if (flushLogger) {
  writer.foreach(_.flush())
  hadoopDataStream.foreach(_.hflush())
}
if (testing) {
  loggedEvents += eventJson
}
  }
{code}


  was:
I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:

{code:java}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{code}

The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer should reconnect after all 
datanodes stop and then start later。


{code:java}
  /** Log the event as JSON. */
  private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) 
{
val eventJson = JsonProtocol.sparkEventToJson(event)
// scalastyle:off println
writer.foreach(_.println(compact(render(eventJson
// scalastyle:on println
if (flushLogger) {
  writer.foreach(_.flush())
  hadoopDataStream.foreach(_.hflush())
}
if (testing) {
  loggedEvents += eventJson
}
  }
{code}



> EventLoggingListener keep logging errors after hdfs restart all datanodes
> -
>
> Key: SPARK-37350
> URL: https://issues.apache.org/jira/browse/SPARK-37350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1
>Reporter: Shefron Yudy
>Priority: Major
>
> I saw the same error in SparkThriftServer process's log when I restart all 
> datanodes of HDFS , The logs as follows:
> {code:java}
> 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
> scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
>  are bad. Aborting...
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> The eventLog will be available normally if I restart the SparkThriftServer, I 
> suggest that the EventLoggingListener's dfs writer and hadoopDataStream 
> should reconnect after all datanodes stop and then start later。
> {code:java}
>   /** Log the event as JSON. */
>   private def logEvent(event: 

[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes

2021-11-16 Thread Shefron Yudy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shefron Yudy updated SPARK-37350:
-
Description: 
I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:

{code:java}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{code}

The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer should reconnect after all 
datanodes stop and then start later。


{code:java}
  /** Log the event as JSON. */
  private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = false) 
{
val eventJson = JsonProtocol.sparkEventToJson(event)
// scalastyle:off println
writer.foreach(_.println(compact(render(eventJson
// scalastyle:on println
if (flushLogger) {
  writer.foreach(_.flush())
  hadoopDataStream.foreach(_.hflush())
}
if (testing) {
  loggedEvents += eventJson
}
  }
{code}


  was:
I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:

{code:java}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{code}

The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer should reconnect after all 
datanodes stop and then start later。


> EventLoggingListener keep logging errors after hdfs restart all datanodes
> -
>
> Key: SPARK-37350
> URL: https://issues.apache.org/jira/browse/SPARK-37350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1
>Reporter: Shefron Yudy
>Priority: Major
>
> I saw the same error in SparkThriftServer process's log when I restart all 
> datanodes of HDFS , The logs as follows:
> {code:java}
> 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
> scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
>  are bad. Aborting...
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> The eventLog will be available normally if I restart the SparkThriftServer, I 
> suggest that the EventLoggingListener's dfs writer should reconnect after all 
> datanodes stop and then start later。
> {code:java}
>   /** Log the event as JSON. */
>   private def logEvent(event: SparkListenerEvent, flushLogger: Boolean = 
> false) {
> val eventJson = JsonProtocol.sparkEventToJson(event)
> // scalastyle:off println
> writer.foreach(_.println(compact(render(eventJson
> // scalastyle:on println
> if (flushLogger) {
>   writer.foreach(_.flush())
>   hadoopDataStream.foreach(_.hflush())
> }
> if (testing) {
>   loggedEvents += eventJson
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (SPARK-37351) Supports write data flow control

2021-11-16 Thread melin (Jira)
melin created SPARK-37351:
-

 Summary: Supports write data flow control
 Key: SPARK-37351
 URL: https://issues.apache.org/jira/browse/SPARK-37351
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: melin


Hive table data is written to a relational database, generally an online 
production database. If the write speed has no traffic control, it can easily 
affect the stability of the online system. It is recommended to add traffic 
control parameters.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2021-11-16 Thread melin (Jira)


[ https://issues.apache.org/jira/browse/SPARK-24907 ]


melin deleted comment on SPARK-24907:
---

was (Author: melin):
The hive table data is written to a relational database, generally an online 
production database. If the writing speed has no traffic control, it can easily 
affect the stability of the online system. It is recommended to add traffic 
control parameters

> Migrate JDBC data source to DataSource API v2
> -
>
> Key: SPARK-24907
> URL: https://issues.apache.org/jira/browse/SPARK-24907
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Teng Peng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes

2021-11-16 Thread Shefron Yudy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shefron Yudy updated SPARK-37350:
-
Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1  (was: Spark-2.4.0
Hadoop-3.0.0
Hive-2.1.1)

> EventLoggingListener keep logging errors after hdfs restart all datanodes
> -
>
> Key: SPARK-37350
> URL: https://issues.apache.org/jira/browse/SPARK-37350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark-2.4.0、Hadoop-3.0.0、Hive-2.1.1
>Reporter: Shefron Yudy
>Priority: Major
>
> I saw the same error in SparkThriftServer process's log when I restart all 
> datanodes of HDFS , The logs as follows:
> {code:java}
> 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
> scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
>  are bad. Aborting...
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> The eventLog will be available normally if I restart the SparkThriftServer, I 
> suggest that the EventLoggingListener's dfs writer should reconnect after all 
> datanodes stop and then start later。



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes

2021-11-16 Thread Shefron Yudy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shefron Yudy updated SPARK-37350:
-
Description: 
I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:

{code:java}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{code}

The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer should reconnect after all 
datanodes stop and then start later。

  was:
I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:


{panel:title=My title}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{panel}


The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer should reconnect after all 
datanodes stop and then start later。


> EventLoggingListener keep logging errors after hdfs restart all datanodes
> -
>
> Key: SPARK-37350
> URL: https://issues.apache.org/jira/browse/SPARK-37350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark-2.4.0
> Hadoop-3.0.0
> Hive-2.1.1
>Reporter: Shefron Yudy
>Priority: Major
>
> I saw the same error in SparkThriftServer process's log when I restart all 
> datanodes of HDFS , The logs as follows:
> {code:java}
> 2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
> scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
>  are bad. Aborting...
> at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
> at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
> {code}
> The eventLog will be available normally if I restart the SparkThriftServer, I 
> suggest that the EventLoggingListener's dfs writer should reconnect after all 
> datanodes stop and then start later。



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2021-11-16 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444916#comment-17444916
 ] 

melin commented on SPARK-24907:
---

Hive table data is written to a relational database, generally an online 
production database. If the write speed has no traffic control, it can easily 
affect the stability of the online system. It is recommended to add traffic 
control parameters.

> Migrate JDBC data source to DataSource API v2
> -
>
> Key: SPARK-24907
> URL: https://issues.apache.org/jira/browse/SPARK-24907
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Teng Peng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37350) EventLoggingListener keep logging errors after hdfs restart all datanodes

2021-11-16 Thread Shefron Yudy (Jira)
Shefron Yudy created SPARK-37350:


 Summary: EventLoggingListener keep logging errors after hdfs 
restart all datanodes
 Key: SPARK-37350
 URL: https://issues.apache.org/jira/browse/SPARK-37350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
 Environment: Spark-2.4.0
Hadoop-3.0.0
Hive-2.1.1
Reporter: Shefron Yudy


I saw the same error in SparkThriftServer process's log when I restart all 
datanodes of HDFS , The logs as follows:


{panel:title=My title}
2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
{panel}


The eventLog will be available normally if I restart the SparkThriftServer, I 
suggest that the EventLoggingListener's dfs writer should reconnect after all 
datanodes stop and then start later。



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35345) Add BloomFilter Benchmark test for Parquet

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35345.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34594
[https://github.com/apache/spark/pull/34594]

> Add BloomFilter Benchmark test for Parquet
> --
>
> Key: SPARK-35345
> URL: https://issues.apache.org/jira/browse/SPARK-35345
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Trivial
> Fix For: 3.3.0
>
>
> Currently, we only have BloomFilter Benchmark test for ORC. Will add one for 
> Parquet too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35345) Add BloomFilter Benchmark test for Parquet

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35345:


Assignee: Huaxin Gao

> Add BloomFilter Benchmark test for Parquet
> --
>
> Key: SPARK-35345
> URL: https://issues.apache.org/jira/browse/SPARK-35345
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Trivial
>
> Currently, we only have BloomFilter Benchmark test for ORC. Will add one for 
> Parquet too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join

2021-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37341:
---

Assignee: Cheng Su

> Avoid unnecessary buffer and copy in full outer sort merge join
> ---
>
> Key: SPARK-37341
> URL: https://issues.apache.org/jira/browse/SPARK-37341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers 
> input rows even when the rows from the two sides do not have matching keys 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641]
>  ). This is unnecessary: we can simply output the row with the smaller join 
> keys, and only buffer when both sides have matching keys (an illustrative 
> sketch follows below). This saves unnecessary copying and buffering when the 
> two join sides have many rows that do not match each other.
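
For illustration only, a rough Python sketch of the proposed merge step (not the 
actual Scala code in SortMergeJoinExec): rows whose keys do not match are emitted 
immediately with a null other side, and buffering happens only for equal keys.
{code:python}
# Rough sketch of one full outer sort-merge pass over two inputs sorted by key.
# Each input is a list of (key, row) pairs; None stands for the null side.
def full_outer_merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            out.append((left[i][1], None))   # no match: emit immediately, nothing buffered
            i += 1
        elif lk > rk:
            out.append((None, right[j][1]))  # no match: emit immediately, nothing buffered
            j += 1
        else:
            # Equal keys: buffer the two matching groups and emit their cross product.
            li, rj = i, j
            while i < len(left) and left[i][0] == lk:
                i += 1
            while j < len(right) and right[j][0] == rk:
                j += 1
            for _, lrow in left[li:i]:
                for _, rrow in right[rj:j]:
                    out.append((lrow, rrow))
    out.extend((row, None) for _, row in left[i:])
    out.extend((None, row) for _, row in right[j:])
    return out
{code}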



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join

2021-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37341.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34612
[https://github.com/apache/spark/pull/34612]

> Avoid unnecessary buffer and copy in full outer sort merge join
> ---
>
> Key: SPARK-37341
> URL: https://issues.apache.org/jira/browse/SPARK-37341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers 
> input rows even when the rows from the two sides do not have matching keys 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641]
>  ). This is unnecessary: we can simply output the row with the smaller join 
> keys, and only buffer when both sides have matching keys. This saves 
> unnecessary copying and buffering when the two join sides have many rows that 
> do not match each other.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28242) DataStreamer keeps logging errors even after fixing writeStream output sink

2021-11-16 Thread Shefron Yudy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444900#comment-17444900
 ] 

Shefron Yudy commented on SPARK-28242:
--

I saw the same error in the SparkThriftServer process's log when I restarted all 
datanodes of HDFS (with Spark-2.4.0 and Hadoop-3.0.0). The logs are as 
follows:

2021-11-16 13:52:11,044 ERROR [spark-listener-group-eventLog] 
scheduler.AsyncEventQueue:Listener EventLoggingListener threw an exception 
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[10.121.23.101:1019,DS-90cb8066-8e5c-443f-804b-20c3ad01851b,DISK]]
 are bad. Aborting... 
at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
 
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeErrorOrExternalError(DataStreamer.java:1256)
 
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)

The event log becomes available again if I restart the SparkThriftServer. I 
suggest that the EventLoggingListener's DFS writer should reconnect after all 
datanodes stop and later start again.

> DataStreamer keeps logging errors even after fixing writeStream output sink
> ---
>
> Key: SPARK-28242
> URL: https://issues.apache.org/jira/browse/SPARK-28242
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Hadoop 2.8.4
>  
>Reporter: Miquel Canes
>Priority: Minor
>  Labels: bulk-closed
>
> I have been testing what happens to a running structured streaming query that is 
> writing to HDFS when all datanodes are down/stopped or the whole cluster is down 
> (including the namenode).
> So I created a structured stream from Kafka to a file output sink on HDFS and 
> tested some scenarios.
> We used a very simple streaming query:
> {code:java}
> spark.readStream()
> .format("kafka")
> .option("kafka.bootstrap.servers", "kafka.server:9092...")
> .option("subscribe", "test_topic")
> .load()
> .select(col("value").cast(DataTypes.StringType))
> .writeStream()
> .format("text")
> .option("path", "HDFS/PATH")
> .option("checkpointLocation", "checkpointPath")
> .start()
> .awaitTermination();{code}
>  
> After stopping all the datanodes the process starts logging the error that 
> datanodes are bad.
> That's correct...
> {code:java}
> 2019-07-03 15:55:00 [spark-listener-group-eventLog] ERROR 
> org.apache.spark.scheduler.AsyncEventQueue:91 - Listener EventLoggingListener 
> threw an exception java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.2.12.202:50010,DS-d2fba01b-28eb-4fe4-baaa-4072102a2172,DISK]]
>  are bad. Aborting... at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1530) 
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
> {code}
> The problem is that even after the datanodes are started again, the process keeps 
> logging the same error all the time.
> We checked, and the WriteStream to HDFS recovered successfully after starting 
> the datanodes; the output sink worked again without problems.
> I have been trying some different HDFS configurations to be sure it's not a 
> client-config-related problem, but I have no clue how to fix it.
> It seems that something is stuck indefinitely in an error loop.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36231.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34314
[https://github.com/apache/spark/pull/34314]

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> 

[jira] [Resolved] (SPARK-36000) Support creation and operations of ps.Series/Index with Decimal('NaN')

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36000.
--
Fix Version/s: 3.3.0
   Resolution: Done

> Support creation and operations of ps.Series/Index with Decimal('NaN')
> --
>
> Key: SPARK-36000
> URL: https://issues.apache.org/jira/browse/SPARK-36000
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> The creation and operations of ps.Series/Index with Decimal('NaN') don't 
> work as expected.
> That might be due to underlying PySpark limitations.
> Please refer to the sub-tasks for the issues detected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36000) Support creation and operations of ps.Series/Index with Decimal('NaN')

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36000:


Assignee: Yikun Jiang

> Support creation and operations of ps.Series/Index with Decimal('NaN')
> --
>
> Key: SPARK-36000
> URL: https://issues.apache.org/jira/browse/SPARK-36000
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
>
> The creation and operations of ps.Series/Index with Decimal('NaN') don't 
> work as expected.
> That might be due to underlying PySpark limitations.
> Please refer to the sub-tasks for the issues detected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36231) Support arithmetic operations of Series containing Decimal(np.nan)

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36231:


Assignee: Yikun Jiang

> Support arithmetic operations of Series containing Decimal(np.nan) 
> ---
>
> Key: SPARK-36231
> URL: https://issues.apache.org/jira/browse/SPARK-36231
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
>
> Arithmetic operations on a Series containing Decimal(np.nan) raise 
> java.lang.NullPointerException in the driver. An example is shown below:
> {code:java}
> >>> pser = pd.Series([decimal.Decimal(1.0), decimal.Decimal(2.0), 
> >>> decimal.Decimal(np.nan)])
> >>> psser = ps.from_pandas(pser)
> >>> pser + 1
> 0 2
>  1 3
>  2 NaN
> >>> psser + 1
>  Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1084)
>  at scala.Option.foreach(Option.scala:407)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1084)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2446)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2208)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$5(Dataset.scala:3648)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3652)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3629)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1(Dataset.scala:3629)
>  at 
> org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$1$adapted(Dataset.scala:3628)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$2(SocketAuthServer.scala:139)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1(SocketAuthServer.scala:141)
>  at 
> org.apache.spark.security.SocketAuthServer$.$anonfun$serveToStream$1$adapted(SocketAuthServer.scala:136)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:113)
>  at 
> org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:107)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$4(SocketAuthServer.scala:68)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:68)
>  Caused by: java.lang.NullPointerException
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> 

[jira] [Assigned] (SPARK-37345) Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS

2021-11-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-37345:
---

Assignee: Yuming Wang

> Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
> --
>
> Key: SPARK-37345
> URL: https://issues.apache.org/jira/browse/SPARK-37345
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {noformat}
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get 
> Kerberos realm
>   at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:50)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:397)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:397)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:418)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:423)
>   at 
> org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
>   at 
> org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.IllegalAccessException: class 
> org.apache.hadoop.security.authentication.util.KerberosUtil cannot access 
> class sun.security.krb5.Config (in module java.security.jgss) because module 
> java.security.jgss does not export sun.security.krb5 to unnamed module 
> @3a0baae5
>   at 
> java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
>   at 
> java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:560)
>   at 
> org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85)
>   at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
>   ... 9 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37345) Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS

2021-11-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-37345.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34615
[https://github.com/apache/spark/pull/34615]

> Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
> --
>
> Key: SPARK-37345
> URL: https://issues.apache.org/jira/browse/SPARK-37345
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> {noformat}
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get 
> Kerberos realm
>   at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:50)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:397)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:397)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:418)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:423)
>   at 
> org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
>   at 
> org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.IllegalAccessException: class 
> org.apache.hadoop.security.authentication.util.KerberosUtil cannot access 
> class sun.security.krb5.Config (in module java.security.jgss) because module 
> java.security.jgss does not export sun.security.krb5 to unnamed module 
> @3a0baae5
>   at 
> java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
>   at 
> java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:560)
>   at 
> org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85)
>   at 
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
>   ... 9 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37329) File system delegation tokens are leaked

2021-11-16 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444890#comment-17444890
 ] 

Wei-Chiu Chuang commented on SPARK-37329:
-

I should also note that this affects not just KMS, but any file system 
implementation (HDFS, Ozone, perhaps S3) with delegation token support.

> File system delegation tokens are leaked
> 
>
> Key: SPARK-37329
> URL: https://issues.apache.org/jira/browse/SPARK-37329
> Project: Spark
>  Issue Type: Bug
>  Components: Security, YARN
>Affects Versions: 2.4.0
>Reporter: Wei-Chiu Chuang
>Priority: Major
>
> On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that KMS 
> accumulated millions of delegation tokens that are not cancelled even after 
> jobs are finished, and KMS goes out of memory within a day because of the 
> delegation token leak.
> We were able to reproduce the bug in a smaller test cluster, and realized that 
> when a Spark job starts, it acquires two delegation tokens, and only one is 
> cancelled properly after the job finishes. The other one is left over and 
> lingers around for up to 7 days (the default Hadoop delegation token lifetime).
> YARN handles the lifecycle of a delegation token properly if its renewer is 
> 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation 
> token with the job issuer as the renewer, simply to get the token renewal 
> interval. The token is then ignored but not cancelled.
> Proposal: cancel the delegation token immediately after the token renewal 
> interval is obtained.
> Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has probably 
> been there since day 1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37349) Improve SQL Rest API Parsing

2021-11-16 Thread Yian Liou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444891#comment-17444891
 ] 

Yian Liou commented on SPARK-37349:
---

I will be working on a PR.

> Improve SQL Rest API Parsing
> 
>
> Key: SPARK-37349
> URL: https://issues.apache.org/jira/browse/SPARK-37349
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yian Liou
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-31440 added improvements to the SQL 
> REST API. This Jira aims to further enhance parsing of the incoming data by 
> accounting for the `StageIds` and `TaskIds` fields introduced in Spark 3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37349) Improve SQL Rest API Parsing

2021-11-16 Thread Yian Liou (Jira)
Yian Liou created SPARK-37349:
-

 Summary: Improve SQL Rest API Parsing
 Key: SPARK-37349
 URL: https://issues.apache.org/jira/browse/SPARK-37349
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yian Liou


https://issues.apache.org/jira/browse/SPARK-31440 added improvements to the SQL 
REST API. This Jira aims to further enhance parsing of the incoming data by 
accounting for the `StageIds` and `TaskIds` fields introduced in Spark 3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37340) Display StageIds in Operators for SQL UI

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37340:


Assignee: (was: Apache Spark)

> Display StageIds in Operators for SQL UI
> 
>
> Key: SPARK-37340
> URL: https://issues.apache.org/jira/browse/SPARK-37340
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yian Liou
>Priority: Major
>
> This proposes a more generalized solution than 
> https://issues.apache.org/jira/browse/SPARK-30209, where a stageId -> operator 
> mapping is built with the following algorithm (a small sketch of step 3 follows 
> below).
>  1. Read the SparkGraph to get every node's name and its accumulator IDs.
>  2. Get each stage's accumulator IDs.
>  3. Map operators to stages by checking for a non-empty intersection between the 
> accumulator IDs from steps 1 and 2.
>  4. Connect SparkGraphNodes to their stage IDs for rendering in the SQL UI.
> As a result, some operators without max metrics values will also have 
> stage IDs in the UI. This Jira also aims to add minor enhancements to the SQL 
> UI tab.
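
For illustration only, a tiny sketch of the intersection check in step 3; the 
dict/set shapes below are hypothetical stand-ins for what the SparkGraph nodes 
and stage data actually expose:
{code:python}
# Hypothetical shapes: node_accums maps operator name -> set of accumulator IDs,
# stage_accums maps stage ID -> set of accumulator IDs seen in that stage.
def map_operators_to_stages(node_accums, stage_accums):
    mapping = {}
    for node, accs in node_accums.items():
        # An operator belongs to every stage whose accumulator IDs overlap its own.
        mapping[node] = sorted(
            stage_id
            for stage_id, stage_accs in stage_accums.items()
            if accs & stage_accs  # non-empty intersection (step 3)
        )
    return mapping

# Example: {"HashAggregate": {1, 2}} against {0: {2, 5}, 1: {9}} -> {"HashAggregate": [0]}
{code}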



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37340) Display StageIds in Operators for SQL UI

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37340:


Assignee: Apache Spark

> Display StageIds in Operators for SQL UI
> 
>
> Key: SPARK-37340
> URL: https://issues.apache.org/jira/browse/SPARK-37340
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yian Liou
>Assignee: Apache Spark
>Priority: Major
>
> This proposes a more generalized solution than 
> https://issues.apache.org/jira/browse/SPARK-30209, where a stageId -> operator 
> mapping is built with the following algorithm.
>  1. Read the SparkGraph to get every node's name and its accumulator IDs.
>  2. Get each stage's accumulator IDs.
>  3. Map operators to stages by checking for a non-empty intersection between the 
> accumulator IDs from steps 1 and 2.
>  4. Connect SparkGraphNodes to their stage IDs for rendering in the SQL UI.
> As a result, some operators without max metrics values will also have 
> stage IDs in the UI. This Jira also aims to add minor enhancements to the SQL 
> UI tab.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37340) Display StageIds in Operators for SQL UI

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444764#comment-17444764
 ] 

Apache Spark commented on SPARK-37340:
--

User 'yliou' has created a pull request for this issue:
https://github.com/apache/spark/pull/34622

> Display StageIds in Operators for SQL UI
> 
>
> Key: SPARK-37340
> URL: https://issues.apache.org/jira/browse/SPARK-37340
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yian Liou
>Priority: Major
>
> This proposes a more generalized solution than 
> https://issues.apache.org/jira/browse/SPARK-30209, where a stageId -> operator 
> mapping is built with the following algorithm.
>  1. Read the SparkGraph to get every node's name and its accumulator IDs.
>  2. Get each stage's accumulator IDs.
>  3. Map operators to stages by checking for a non-empty intersection between the 
> accumulator IDs from steps 1 and 2.
>  4. Connect SparkGraphNodes to their stage IDs for rendering in the SQL UI.
> As a result, some operators without max metrics values will also have 
> stage IDs in the UI. This Jira also aims to add minor enhancements to the SQL 
> UI tab.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37219) support AS OF syntax

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444669#comment-17444669
 ] 

Apache Spark commented on SPARK-37219:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34621

> support AS OF syntax
> 
>
> Key: SPARK-37219
> URL: https://issues.apache.org/jira/browse/SPARK-37219
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>
> https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
> Delta Lake time travel allows users to query an older snapshot of a Delta 
> table. To query an older version of a table, a user specifies a version 
> or timestamp in a SELECT statement using the AS OF syntax, as follows:
> SELECT * FROM default.people10m VERSION AS OF 0;
> SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58';
> This ticket is opened to add the AS OF syntax to Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-11-16 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-35672.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34120
[https://github.com/apache/spark/pull/34120]

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0
>
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Tim Schwab (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Schwab updated SPARK-37348:
---
Description: 
Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns -1. 
However, the modulus is often desired instead of the remainder.

 

There is a [PMOD() function in Spark 
SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in 
PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
 So at the moment, the two options for getting the modulus are to use 
F.expr("pmod(A, B)") or to create a helper function such as:
 
{code:java}
def pmod(dividend, divisor):
return F.when(dividend < 0, (dividend % divisor) + 
divisor).otherwise(dividend % divisor){code}
 
 
Neither is optimal: pmod should be native to PySpark, as it is in Spark SQL.

  was:
Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns -1. 
However, the modulus is often desired instead of the remainder.

 

There is a PMOD() function in Spark SQL, but not in PySpark. So at the moment, 
the two options for getting the modulus is to use F.expr("pmod(A, B)"), or 
create a helper function such as:
 
{code:java}
def pmod(dividend, divisor):
return F.when(dividend < 0, (dividend % divisor) + 
divisor).otherwise(dividend % divisor){code}
 
 
Neither are optimal - pmod should be native to PySpark as it is in Spark SQL.


> PySpark pmod function
> -
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Tim Schwab
>Priority: Minor
>  Labels: newbie
>
> Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns 
> -1. However, the modulus is often desired instead of the remainder.
>  
> There is a [PMOD() function in Spark 
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in 
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
>  So at the moment, the two options for getting the modulus are to use 
> F.expr("pmod(A, B)") or to create a helper function such as:
>  
> {code:java}
> def pmod(dividend, divisor):
> return F.when(dividend < 0, (dividend % divisor) + 
> divisor).otherwise(dividend % divisor){code}
>  
>  
> Neither is optimal: pmod should be native to PySpark, as it is in Spark SQL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Tim Schwab (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444636#comment-17444636
 ] 

Tim Schwab commented on SPARK-37348:


Also, this has been the case for over 6 years: 
https://github.com/apache/spark/pull/6783.

> PySpark pmod function
> -
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Tim Schwab
>Priority: Minor
>  Labels: newbie
>
> Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns 
> -1. However, the modulus is often desired instead of the remainder.
>  
> There is a PMOD() function in Spark SQL, but not in PySpark. So at the 
> moment, the two options for getting the modulus are to use F.expr("pmod(A, B)") 
> or to create a helper function such as:
>  
> {code:java}
> def pmod(dividend, divisor):
> return F.when(dividend < 0, (dividend % divisor) + 
> divisor).otherwise(dividend % divisor){code}
>  
>  
> Neither is optimal: pmod should be native to PySpark, as it is in Spark SQL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37348) PySpark pmod function

2021-11-16 Thread Tim Schwab (Jira)
Tim Schwab created SPARK-37348:
--

 Summary: PySpark pmod function
 Key: SPARK-37348
 URL: https://issues.apache.org/jira/browse/SPARK-37348
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Tim Schwab


Because Spark is built on the JVM, in PySpark, F.lit(-1) % F.lit(2) returns -1. 
However, the modulus is often desired instead of the remainder.

 

There is a PMOD() function in Spark SQL, but not in PySpark. So at the moment, 
the two options for getting the modulus are to use F.expr("pmod(A, B)") or to 
create a helper function such as:
 
{code:java}
def pmod(dividend, divisor):
return F.when(dividend < 0, (dividend % divisor) + 
divisor).otherwise(dividend % divisor){code}
 
 
Neither is optimal: pmod should be native to PySpark, as it is in Spark SQL (an 
illustrative variant of the helper follows below).
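
A variant of such a helper, shown only as a sketch, that also stays non-negative 
when the remainder is already zero for a negative dividend (e.g. -4 % 2):
{code:python}
from pyspark.sql import functions as F

def pmod(dividend, divisor):
    # ((a % n) + n) % n is non-negative for either sign of the dividend,
    # including the case where a % n is already 0.
    return ((dividend % divisor) + divisor) % divisor

# Usage sketch: df.withColumn("m", pmod(F.col("a"), F.lit(7)))
{code}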



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37181:


Assignee: Apache Spark

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Assignee: Apache Spark
>Priority: Major
>
> In regular pandas, you can say {{read_csv(encoding='latin-1')}}. This encoding 
> is not recognized in pyspark.pandas; you have to use Windows-1252 instead, 
> which is almost the same but not identical (an illustrative workaround follows below).
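
An illustrative sketch of the current workaround, assuming a hypothetical file 
path and that the encoding keyword is passed through to Spark's CSV reader:
{code:python}
import pandas as pd
import pyspark.pandas as ps

# Regular pandas accepts the 'latin-1' alias directly.
pdf = pd.read_csv("/data/users.csv", encoding="latin-1")        # hypothetical path

# pyspark.pandas currently needs the Windows-1252 name instead, which is
# almost (but not exactly) the same code page.
psdf = ps.read_csv("/data/users.csv", encoding="windows-1252")  # hypothetical path
{code}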



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37209:


Assignee: (was: Apache Spark)

> YarnShuffleIntegrationSuite  and other two similar cases in 
> `resource-managers` test failed
> ---
>
> Key: SPARK-37209
> URL: https://issues.apache.org/jira/browse/SPARK-37209
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-unit-tests.log, success-unit-tests.log
>
>
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn
> The test will succeed.
>  
> Execute :
>  # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>  # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # Delete assembly/target/scala-2.12/jars manually
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> The error stack is :
> {code:java}
> 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times,
>  most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 
> 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
> at 
> org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
> at scala.collection.immutable.List.flatMap(List.scala:293)
> at scala.collection.immutable.List.flatMap(List.scala:79)
> at 
> org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
> at 
> org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
> at 
> org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
> at 
> com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
> at 
> org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
> at 
> org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
> at 
> org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76)
> at 
> org.apache.spark.shuffle.ShufflePartitionPairsWriter.write(ShufflePartitionPairsWriter.scala:59)
> at 
> org.apache.spark.util.collection.WritablePartitionedIterator.writeNext(WritablePartitionedPairCollection.scala:83)
> at 
> org.apache.spark.util.collection.ExternalSorter.$anonfun$writePartitionedMapOutput$1(ExternalSorter.scala:772)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
> at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedMapOutput(ExternalSorter.scala:775)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at 
> 

[jira] [Commented] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444603#comment-17444603
 ] 

Apache Spark commented on SPARK-37209:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34620

> YarnShuffleIntegrationSuite  and other two similar cases in 
> `resource-managers` test failed
> ---
>
> Key: SPARK-37209
> URL: https://issues.apache.org/jira/browse/SPARK-37209
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-unit-tests.log, success-unit-tests.log
>
>
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn
> The test will succeed.
>  
> Execute :
>  # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>  # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # Delete assembly/target/scala-2.12/jars manually
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> The error stack is :
> {code:java}
> 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times,
>  most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 
> 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
> at 
> org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
> at scala.collection.immutable.List.flatMap(List.scala:293)
> at scala.collection.immutable.List.flatMap(List.scala:79)
> at 
> org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
> at 
> org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
> at 
> org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
> at 
> com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
> at 
> org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
> at 
> org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
> at 
> org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76)
> at 
> org.apache.spark.shuffle.ShufflePartitionPairsWriter.write(ShufflePartitionPairsWriter.scala:59)
> at 
> org.apache.spark.util.collection.WritablePartitionedIterator.writeNext(WritablePartitionedPairCollection.scala:83)
> at 
> org.apache.spark.util.collection.ExternalSorter.$anonfun$writePartitionedMapOutput$1(ExternalSorter.scala:772)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
> at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedMapOutput(ExternalSorter.scala:775)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at 
> 

[jira] [Assigned] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37181:


Assignee: (was: Apache Spark)

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas, you can say {{read_csv(encoding='latin-1')}}. This encoding 
> is not recognized in pyspark.pandas; you have to use Windows-1252 instead, 
> which is almost the same but not identical.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444604#comment-17444604
 ] 

Apache Spark commented on SPARK-37181:
--

User 'pralabhkumar' has created a pull request for this issue:
https://github.com/apache/spark/pull/34619

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas, you can say {{read_csv(encoding='latin-1')}}. This encoding 
> is not recognized in pyspark.pandas; you have to use Windows-1252 instead, 
> which is almost the same but not identical.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37209:


Assignee: Apache Spark

> YarnShuffleIntegrationSuite  and other two similar cases in 
> `resource-managers` test failed
> ---
>
> Key: SPARK-37209
> URL: https://issues.apache.org/jira/browse/SPARK-37209
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
> Attachments: failed-unit-tests.log, success-unit-tests.log
>
>
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn
> The test will succeed.
>  
> Execute :
>  # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>  # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # Delete assembly/target/scala-2.12/jars manually
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> The error stack is :
> {code:java}
> 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times,
>  most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 
> 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
> at 
> org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
> at scala.collection.immutable.List.flatMap(List.scala:293)
> at scala.collection.immutable.List.flatMap(List.scala:79)
> at 
> org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
> at 
> org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
> at 
> org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
> at 
> com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
> at 
> org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
> at 
> org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
> at 
> org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76)
> at 
> org.apache.spark.shuffle.ShufflePartitionPairsWriter.write(ShufflePartitionPairsWriter.scala:59)
> at 
> org.apache.spark.util.collection.WritablePartitionedIterator.writeNext(WritablePartitionedPairCollection.scala:83)
> at 
> org.apache.spark.util.collection.ExternalSorter.$anonfun$writePartitionedMapOutput$1(ExternalSorter.scala:772)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
> at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedMapOutput(ExternalSorter.scala:775)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at 
> 

[jira] [Assigned] (SPARK-37219) support AS OF syntax

2021-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37219:
---

Assignee: Huaxin Gao

> support AS OF syntax
> 
>
> Key: SPARK-37219
> URL: https://issues.apache.org/jira/browse/SPARK-37219
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
> Delta Lake time travel allows users to query an older snapshot of a Delta 
> table. To query an older version of a table, the user needs to specify a 
> version or timestamp in a SELECT statement using the AS OF syntax, as follows:
> SELECT * FROM default.people10m VERSION AS OF 0;
> SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58';
> This ticket is opened to add the AS OF syntax to Spark.
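> A hedged sketch of how the same queries might be issued through the SQL API once the syntax lands (the table name and values come from the Databricks example above; nothing here reflects the final Spark implementation):
> {code:python}
> # Illustrative only: runs the two time-travel queries above via spark.sql.
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> spark.sql("SELECT * FROM default.people10m VERSION AS OF 0").show()
> spark.sql("SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'").show()
> {code}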



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37219) support AS OF syntax

2021-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37219.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34497
[https://github.com/apache/spark/pull/34497]

> support AS OF syntax
> 
>
> Key: SPARK-37219
> URL: https://issues.apache.org/jira/browse/SPARK-37219
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>
> https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
> Delta Lake time travel allows users to query an older snapshot of a Delta 
> table. To query an older version of a table, the user needs to specify a 
> version or timestamp in a SELECT statement using the AS OF syntax, as follows:
> SELECT * FROM default.people10m VERSION AS OF 0;
> SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58';
> This ticket is opened to add the AS OF syntax to Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37346) Link migration guide for structured stream.

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444585#comment-17444585
 ] 

Apache Spark commented on SPARK-37346:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34618

> Link migration guide for structured stream.
> ---
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37346) Link migration guide for structured stream.

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37346:


Assignee: (was: Apache Spark)

> Link migration guide for structured stream.
> ---
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37346) Link migration guide for structured stream.

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37346:


Assignee: Apache Spark

> Link migration guide for structured stream.
> ---
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37346) Link migration guide for struct stream.

2021-11-16 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37346:
--
Summary: Link migration guide for struct stream.  (was: Link migration 
guide to each project.)

> Link migration guide for struct stream.
> ---
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37346) Link migration guide for structured stream.

2021-11-16 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37346:
--
Summary: Link migration guide for structured stream.  (was: Link migration 
guide for struct stream.)

> Link migration guide for structured stream.
> ---
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed

2021-11-16 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444502#comment-17444502
 ] 

Yang Jie edited comment on SPARK-37209 at 11/16/21, 1:31 PM:
-

After some investigation, I found that this issue may be related to 
`hadoop-3.x`; when the `hadoop-2.7` profile is used, the above test passes:

 
{code:java}
mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am
 mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none 
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite

Discovery starting.
Discovery completed in 259 milliseconds.
Run starting. Expected test count is: 1
YarnShuffleIntegrationSuite:
- external shuffle service
Run completed in 30 seconds, 765 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.{code}
It seems that when testing with hadoop-2.7, `Utils.isTesting` evaluates to 
true, which lets the test case ignore the `NoClassDefFoundError`, but when 
testing with hadoop-3.2, `Utils.isTesting` evaluates to false.

 

But I haven't investigated the root cause with hadoop-3.2

 

cc [~hyukjin.kwon] [~dongjoon] [~srowen] 

 

 


was (Author: luciferyang):
After some investigation, I found that this issue may be related to 
`hadoop-3.x`; when the `hadoop-2.7` profile is used, the above test passes:

 
{code:java}
mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am
 mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none 
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite

Discovery starting.
Discovery completed in 259 milliseconds.
Run starting. Expected test count is: 1
YarnShuffleIntegrationSuite:
- external shuffle service
Run completed in 30 seconds, 765 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.{code}
It seems that when testing with hadoop-2.7, `Utils.isTesting` evaluates to 
true on the executor side, which lets the test case ignore the 
`NoClassDefFoundError`, but when testing with hadoop-3.2, `Utils.isTesting` 
evaluates to false on the executor side.

 

But I haven't investigated the root cause with hadoop-3.2

 

cc [~hyukjin.kwon] [~dongjoon] [~srowen] 

 

 

> YarnShuffleIntegrationSuite  and other two similar cases in 
> `resource-managers` test failed
> ---
>
> Key: SPARK-37209
> URL: https://issues.apache.org/jira/browse/SPARK-37209
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-unit-tests.log, success-unit-tests.log
>
>
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn
> The test will succeed.
>  
> Execute :
>  # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>  # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # Delete assembly/target/scala-2.12/jars manually
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> The error stack is :
> {code:java}
> 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times,
>  most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 
> 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
> at 
> 

[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-11-16 Thread Wei Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444386#comment-17444386
 ] 

Wei Zhang edited comment on SPARK-18105 at 11/16/21, 1:16 PM:
--

In our case, it is strongly related to `spark.file.transferTo`. Setting 
`spark.file.transferTo` to false saved us.

We have been struggling with this problem for almost a year, seeing hundreds of 
tasks a day fail with this `FetchFailedException: Decompression error: 
Corrupted block detected` error. The pressure has recently grown because we are 
using Spark in more and more scenarios.

One of the stack traces I got is:
{code:java}
org.apache.spark.shuffle.FetchFailedException: Decompression error: Corrupted block detected
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:569)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:474)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:66)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Decompression error: Corrupted block detected
    at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:111)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.spark.util.Utils$$anonfun$1.apply$mcZ$sp(Utils.scala:402)
    at org.apache.spark.util.Utils$$anonfun$1.apply(Utils.scala:397)
    at org.apache.spark.util.Utils$$anonfun$1.apply(Utils.scala:397)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
    at org.apache.spark.util.Utils$.copyStreamUpTo(Utils.scala:409)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:467)
    ... 32 more
{code}
We run Spark 2.2.2, 2.4.7, 3.0.1, and 3.1.2, and all of them have this issue. I 
tried the methods in this post and elsewhere, but nothing made the 
hard-to-reproduce error disappear. After weeks of struggle and going through the 
code with one of our almost-reproducible cases, we found that 
`spark.file.transferTo` is the culprit. We turned this flag off across our 
cluster and the error has disappeared completely; we have not seen it since.
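A minimal sketch of this workaround, assuming the session is built in PySpark (the application name is illustrative; `spark.file.transferTo` is the flag discussed in this comment):
{code:python}
# Disable spark.file.transferTo when building the session; only the config key
# comes from this report, the rest is illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("transferTo-workaround")
    .config("spark.file.transferTo", "false")
    .getOrCreate()
)
{code}
The same flag can also be passed at submit time, e.g. --conf spark.file.transferTo=false.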

 

Normally we wouldn't suspect a flag like `spark.file.transferTo`, in that the 
only difference is whether you are using `FileStream` or 

[jira] [Assigned] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37342:


Assignee: Chao Sun

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37342.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34613
[https://github.com/apache/spark/pull/34613]

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37347) Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not shut down correctly

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37347:


Assignee: (was: Apache Spark)

> Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not 
> shut down correctly
> --
>
> Key: SPARK-37347
> URL: https://issues.apache.org/jira/browse/SPARK-37347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: liukai
>Priority: Major
>
> When spark.sql.thriftServer.queryTimeout or 
> java.sql.Statement.setQueryTimeout is set to a value greater than 0, 
> SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running 
> queries (introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]). 
> But the timeoutExecutor is not shut down correctly when the statement 
> finishes; it is only shut down when the timeout fires. When we set the timeout 
> to a long duration, for example 1 hour, the long-running STS driver will 
> FullGC and the application is unavailable for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37347) Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not shut down correctly

2021-11-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-37347:

Target Version/s:   (was: 3.1.2, 3.2.0)

> Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not 
> shut down correctly
> --
>
> Key: SPARK-37347
> URL: https://issues.apache.org/jira/browse/SPARK-37347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: liukai
>Priority: Major
>
> When spark.sql.thriftServer.queryTimeout or 
> java.sql.Statement.setQueryTimeout is set to a value greater than 0, 
> SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running 
> queries (introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]). 
> But the timeoutExecutor is not shut down correctly when the statement 
> finishes; it is only shut down when the timeout fires. When we set the timeout 
> to a long duration, for example 1 hour, the long-running STS driver will 
> FullGC and the application is unavailable for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37347) Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not shut down correctly

2021-11-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-37347:

Fix Version/s: (was: 3.1.3)
   (was: 3.2.1)

> Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not 
> shut down correctly
> --
>
> Key: SPARK-37347
> URL: https://issues.apache.org/jira/browse/SPARK-37347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: liukai
>Priority: Major
>
> When spark.sql.thriftServer.queryTimeout or 
> java.sql.Statement.setQueryTimeout is set to a value greater than 0, 
> SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running 
> queries (introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]). 
> But the timeoutExecutor is not shut down correctly when the statement 
> finishes; it is only shut down when the timeout fires. When we set the timeout 
> to a long duration, for example 1 hour, the long-running STS driver will 
> FullGC and the application is unavailable for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37347) Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not shut down correctly

2021-11-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444507#comment-17444507
 ] 

Apache Spark commented on SPARK-37347:
--

User 'lk246' has created a pull request for this issue:
https://github.com/apache/spark/pull/34617

> Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not 
> shut down correctly
> --
>
> Key: SPARK-37347
> URL: https://issues.apache.org/jira/browse/SPARK-37347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: liukai
>Priority: Major
>
> When spark.sql.thriftServer.queryTimeout or 
> java.sql.Statement.setQueryTimeout is set to a value greater than 0, 
> SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running 
> queries (introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]). 
> But the timeoutExecutor is not shut down correctly when the statement 
> finishes; it is only shut down when the timeout fires. When we set the timeout 
> to a long duration, for example 1 hour, the long-running STS driver will 
> FullGC and the application is unavailable for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37347) Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not shut down correctly

2021-11-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37347:


Assignee: Apache Spark

> Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not 
> shut down correctly
> --
>
> Key: SPARK-37347
> URL: https://issues.apache.org/jira/browse/SPARK-37347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: liukai
>Assignee: Apache Spark
>Priority: Major
>
> When spark.sql.thriftServer.queryTimeout or 
> java.sql.Statement.setQueryTimeout is set to a value greater than 0, 
> SparkExecuteStatementOperation adds a timeoutExecutor to kill long-running 
> queries (introduced in [SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]). 
> But the timeoutExecutor is not shut down correctly when the statement 
> finishes; it is only shut down when the timeout fires. When we set the timeout 
> to a long duration, for example 1 hour, the long-running STS driver will 
> FullGC and the application is unavailable for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed

2021-11-16 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444502#comment-17444502
 ] 

Yang Jie commented on SPARK-37209:
--

After some investigation, I found that this issue may be related to 
`hadoop-3.x`; when the `hadoop-2.7` profile is used, the above test passes:

 
{code:java}
mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am
 mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none 
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite

Discovery starting.
Discovery completed in 259 milliseconds.
Run starting. Expected test count is: 1
YarnShuffleIntegrationSuite:
- external shuffle service
Run completed in 30 seconds, 765 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.{code}
It seems that when testing with hadoop-2.7, `Utils.isTesting` evaluates to 
true on the executor side, which lets the test case ignore the 
`NoClassDefFoundError`, but when testing with hadoop-3.2, `Utils.isTesting` 
evaluates to false on the executor side.

 

But I haven't investigated the root cause with hadoop-3.2

 

 

> YarnShuffleIntegrationSuite  and other two similar cases in 
> `resource-managers` test failed
> ---
>
> Key: SPARK-37209
> URL: https://issues.apache.org/jira/browse/SPARK-37209
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-unit-tests.log, success-unit-tests.log
>
>
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn
> The test will succeed.
>  
> Execute :
>  # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>  # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # Delete assembly/target/scala-2.12/jars manually
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> The error stack is :
> {code:java}
> 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times,
>  most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 
> 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
> at 
> org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
> at scala.collection.immutable.List.flatMap(List.scala:293)
> at scala.collection.immutable.List.flatMap(List.scala:79)
> at 
> org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
> at 
> org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
> at 
> org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
> at 
> com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
> at 
> org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
> at 
> org.apache.spark.serializer.KryoSerializationStream.(KryoSerializer.scala:266)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
> at 
> 

[jira] [Comment Edited] (SPARK-37209) YarnShuffleIntegrationSuite and other two similar cases in `resource-managers` test failed

2021-11-16 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444502#comment-17444502
 ] 

Yang Jie edited comment on SPARK-37209 at 11/16/21, 12:10 PM:
--

After some investigation, I found that this issue may be related to 
`hadoop-3.x`; when the `hadoop-2.7` profile is used, the above test passes:

 
{code:java}
mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am
 mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none 
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite

Discovery starting.
Discovery completed in 259 milliseconds.
Run starting. Expected test count is: 1
YarnShuffleIntegrationSuite:
- external shuffle service
Run completed in 30 seconds, 765 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.{code}
It seems that when testing with hadoop-2.7, `Utils.isTesting` evaluates to 
true on the executor side, which lets the test case ignore the 
`NoClassDefFoundError`, but when testing with hadoop-3.2, `Utils.isTesting` 
evaluates to false on the executor side.

 

But I haven't investigated the root cause with hadoop-3.2

 

cc [~hyukjin.kwon] [~dongjoon] [~srowen] 

 

 


was (Author: luciferyang):
After some investigation, I found that this issue may be related to 
`hadoop-3.x`; when the `hadoop-2.7` profile is used, the above test passes:

 
{code:java}
mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -am
 mvn test -pl resource-managers/yarn -Pyarn -Phadoop-2.7 -Dtest=none 
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite

Discovery starting.
Discovery completed in 259 milliseconds.
Run starting. Expected test count is: 1
YarnShuffleIntegrationSuite:
- external shuffle service
Run completed in 30 seconds, 765 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.{code}
It seems that when testing with hadoop-2.7, `Utils.isTesting` evaluates to 
true on the executor side, which lets the test case ignore the 
`NoClassDefFoundError`, but when testing with hadoop-3.2, `Utils.isTesting` 
evaluates to false on the executor side.

 

But I haven't investigated the root cause with hadoop-3.2

 

 

> YarnShuffleIntegrationSuite  and other two similar cases in 
> `resource-managers` test failed
> ---
>
> Key: SPARK-37209
> URL: https://issues.apache.org/jira/browse/SPARK-37209
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-unit-tests.log, success-unit-tests.log
>
>
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn
> The test will succeed.
>  
> Execute :
>  # build/mvn clean -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
>  # build/mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> Execute :
>  # build/mvn clean package -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud 
> -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl 
> -Pkubernetes -Phive
>  # Delete assembly/target/scala-2.12/jars manually
>  # build/mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl resource-managers/yarn 
> The test will fail.
>  
> The error stack is :
> {code:java}
> 21/11/04 19:48:52.159 main ERROR Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 4 times,
>  most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 
> 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
> at 
> 

[jira] [Created] (SPARK-37347) Spark Thrift Server (STS) driver FullGC because timeoutExecutor is not shut down correctly

2021-11-16 Thread liukai (Jira)
liukai created SPARK-37347:
--

 Summary: Spark Thrift Server (STS) driver FullGC because timeoutExecutor 
is not shut down correctly
 Key: SPARK-37347
 URL: https://issues.apache.org/jira/browse/SPARK-37347
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.1, 3.1.0
Reporter: liukai
 Fix For: 3.1.3, 3.2.1


When spark.sql.thriftServer.queryTimeout or java.sql.Statement.setQueryTimeout 
is set to a value greater than 0, SparkExecuteStatementOperation adds a 
timeoutExecutor to kill long-running queries (introduced in 
[SPARK-26533|https://issues.apache.org/jira/browse/SPARK-26533]). But the 
timeoutExecutor is not shut down correctly when the statement finishes; it is 
only shut down when the timeout fires. When we set the timeout to a long 
duration, for example 1 hour, the long-running STS driver will FullGC and the 
application is unavailable for a long time.
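As a rough illustration of the leak pattern described above (a per-statement timer that is only cleaned up when it fires), here is a toy sketch; it is not Spark code, and every name in it is made up:
{code:python}
# Toy sketch, not Spark code: each "statement" starts a cancellation timer but
# never cancels it on completion, so with a long timeout every finished
# statement keeps a live timer (and whatever state it captured).
import threading

live_timers = []

def run_statement(query, timeout_s):
    timer = threading.Timer(timeout_s, lambda: print(f"timed out: {query!r}"))
    timer.start()
    live_timers.append(timer)      # bug analogue: never cancelled when the statement finishes
    return f"result of {query}"    # the statement itself finishes quickly

for i in range(3):
    run_statement(f"SELECT {i}", timeout_s=3600)

print(sum(t.is_alive() for t in live_timers), "timers still alive")  # prints 3

# The fix analogue: shut the timer down as soon as the statement completes.
for t in live_timers:
    t.cancel()
{code}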



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37346) Link migration guide to each project.

2021-11-16 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1775#comment-1775
 ] 

angerszhu commented on SPARK-37346:
---

Will raise a PR soon.

> Link migration guide to each project.
> -
>
> Key: SPARK-37346
> URL: https://issues.apache.org/jira/browse/SPARK-37346
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Link migration guide to each project.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


