[jira] [Assigned] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23486:


Assignee: (was: Apache Spark)

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.
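For illustration, a minimal Python sketch of the intended behaviour (the helper names here are hypothetical; the actual rule is Catalyst's {{LookupFunctions}} in Scala): each distinct function name is checked against the catalog exactly once and the result is cached for the rest of the query.

{code:python}
# Sketch only: `catalog_function_exists` and the name list are hypothetical stand-ins
# for the external-catalog lookup and the unresolved functions found in the plan.
def check_functions_once(unresolved_function_names, catalog_function_exists):
    checked = {}  # function name -> whether the catalog knows it
    for name in unresolved_function_names:
        if name not in checked:
            # one metastore round trip per distinct name instead of per invocation
            checked[name] = catalog_function_exists(name)
        if not checked[name]:
            raise ValueError("Undefined function: '%s'" % name)
{code}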






[jira] [Assigned] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23486:


Assignee: Apache Spark

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.






[jira] [Commented] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394422#comment-16394422
 ] 

Apache Spark commented on SPARK-23486:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/20795

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.






[jira] [Commented] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-10 Thread Jooseong Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394399#comment-16394399
 ] 

Jooseong Kim commented on SPARK-23618:
--

In function build, "local BUILD_ARGS" effectively creates an array of one 
element where the first and only element is an empty string, so 
"${BUILD_ARGS[@]}" expands to "" and passes an extra argument to docker.

> docker-image-tool.sh Fails While Building Image
> ---
>
> Key: SPARK-23618
> URL: https://issues.apache.org/jira/browse/SPARK-23618
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ninad Ingole
>Priority: Major
>
> I am trying to build the Kubernetes image for version 2.3.0 using 
> {code:java}
> ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
> {code}
> which fails with the following docker build error:
> {code:java}
> "docker build" requires exactly 1 argument.
> See 'docker build --help'.
> Usage: docker build [OPTIONS] PATH | URL | - [flags]
> Build an image from a Dockerfile
> {code}
>  
> I am executing the command from within the Spark distribution directory. 
> Please let me know what the issue is.
>  






[jira] [Commented] (SPARK-23646) pyspark DataFrameWriter ignores customized settings?

2018-03-10 Thread Chuan-Heng Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394397#comment-16394397
 ] 

Chuan-Heng Hsiao commented on SPARK-23646:
--

Thanks.

I'll post to the dev mailing list as well.

> pyspark DataFrameWriter ignores customized settings?
> 
>
> Key: SPARK-23646
> URL: https://issues.apache.org/jira/browse/SPARK-23646
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Chuan-Heng Hsiao
>Priority: Major
>
> I am using spark-2.2.1-bin-hadoop2.7 with stand-alone mode.
> (python version: 3.5.2 from ubuntu 16.04)
> I intended to have DataFrame write to hdfs with customized block-size but 
> failed.
> However, the corresponding rdd can successfully write with the customized 
> block-size.
>  
>  
> The following is the test code:
> (dfs.namenode.fs-limits.min-block-size has been set as 131072 in hdfs)
>  
>  
> ##
> # init
> ##
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
>  
> import hdfs
> from hdfs import InsecureClient
> import os
>  
> import numpy as np
> import pandas as pd
> import logging
>  
> os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
>  
> block_size = 512 * 1024
>  
> conf = SparkConf().setAppName("DCSSpark").setMaster("spark://spark1:7077").set('spark.cores.max', 20).set("spark.executor.cores", 10).set("spark.executor.memory", "10g").set("spark.hadoop.dfs.blocksize", str(block_size)).set("spark.hadoop.dfs.block.size", str(block_size))
>  
> spark = SparkSession.builder.config(conf=conf).getOrCreate()
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", 
> block_size)
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", 
> block_size)
>  
> ##
> # main
> ##
>  # create DataFrame
> df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
>  
> # save using DataFrameWriter, resulting 128MB-block-size
> df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
>  
> # save using rdd, resulting 512k-block-size
> client = InsecureClient('http://spark1:50070')
> client.delete('/tmp/temp_with_rrd', recursive=True)
> df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')






[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394395#comment-16394395
 ] 

Hyukjin Kwon commented on SPARK-23645:
--

Hm .. I think named arguments do not work on the Scala side either. Also, it 
sounds like the same thing applies to a normal udf too. From a very quick look, 
I think the real difficulty is properly supporting the case where positional 
and keyword arguments are actually mixed.

Sounds good to do if the change is minimal, but if the change is big, I doubt 
this is something we should support. Documenting this might be good enough for 
now.

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial. In this case,
>     # we should avoid wrapping the attributes from the wrapped function to the wrapper
>     # function. So, we take out these attribute names from the default names to set and
>     # then manually assign it after being wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a,b): return a*b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()  # fail with ~no stacktrace thanks to wrapper helper
>
> ---
> TypeError Traceback (most recent call last)
>  in ()
> ----> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()
>
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
>  
> *discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF 
> to be called as such, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column{code}
>  has to be in the right order based on the functions defined argument inputs, 
> or the function will return incorrect results. so, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> requested by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies 
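A minimal sketch of point (a), assuming Python 3's {{inspect.signature}} (which sidesteps the python2/python3 issue in (b)); {{reorder_cols}} is a hypothetical helper, not part of pyspark:

{code:python}
import inspect

def reorder_cols(func, args, kwargs):
    """Merge *args and **kwargs into the positional order declared by func."""
    sig = inspect.signature(func)
    bound = sig.bind(*args, **kwargs)   # raises TypeError on unknown or duplicate names
    bound.apply_defaults()
    return tuple(bound.arguments[name] for name in sig.parameters)

def ok(a, b):
    return a * b

# keyword arguments supplied out of order still come back in declaration order
print(reorder_cols(ok, (), {'b': 'b', 'a': 'id'}))   # ('id', 'b')
{code}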






[jira] [Commented] (SPARK-23646) pyspark DataFrameWriter ignores customized settings?

2018-03-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394392#comment-16394392
 ] 

Hyukjin Kwon commented on SPARK-23646:
--

This sounds more like a question. I would recommend asking it on the dev 
mailing list first before filing an issue here.

> pyspark DataFrameWriter ignores customized settings?
> 
>
> Key: SPARK-23646
> URL: https://issues.apache.org/jira/browse/SPARK-23646
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Chuan-Heng Hsiao
>Priority: Major
>
> I am using spark-2.2.1-bin-hadoop2.7 with stand-alone mode.
> (python version: 3.5.2 from ubuntu 16.04)
> I intended to have DataFrame write to hdfs with customized block-size but 
> failed.
> However, the corresponding rdd can successfully write with the customized 
> block-size.
>  
>  
> The following is the test code:
> (dfs.namenode.fs-limits.min-block-size has been set as 131072 in hdfs)
>  
>  
> ##
> # init
> ##
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
>  
> import hdfs
> from hdfs import InsecureClient
> import os
>  
> import numpy as np
> import pandas as pd
> import logging
>  
> os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
>  
> block_size = 512 * 1024
>  
> conf = SparkConf().setAppName("DCSSpark").setMaster("spark://spark1:7077").set('spark.cores.max', 20).set("spark.executor.cores", 10).set("spark.executor.memory", "10g").set("spark.hadoop.dfs.blocksize", str(block_size)).set("spark.hadoop.dfs.block.size", str(block_size))
>  
> spark = SparkSession.builder.config(conf=conf).getOrCreate()
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", 
> block_size)
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", 
> block_size)
>  
> ##
> # main
> ##
>  # create DataFrame
> df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
>  
> # save using DataFrameWriter, resulting 128MB-block-size
> df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
>  
> # save using rdd, resulting 512k-block-size
> client = InsecureClient('http://spark1:50070')
> client.delete('/tmp/temp_with_rrd', recursive=True)
> df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')






[jira] [Created] (SPARK-23646) pyspark DataFrameWriter ignores customized settings?

2018-03-10 Thread Chuan-Heng Hsiao (JIRA)
Chuan-Heng Hsiao created SPARK-23646:


 Summary: pyspark DataFrameWriter ignores customized settings?
 Key: SPARK-23646
 URL: https://issues.apache.org/jira/browse/SPARK-23646
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1
Reporter: Chuan-Heng Hsiao


I am using spark-2.2.1-bin-hadoop2.7 with stand-alone mode.
(python version: 3.5.2 from ubuntu 16.04)
I intended to have DataFrame write to hdfs with customized block-size but 
failed.
However, the corresponding rdd can successfully write with the customized 
block-size.
 
 
The following is the test code:
(dfs.namenode.fs-limits.min-block-size has been set as 131072 in hdfs)
 
 
##
# init
##
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
 
import hdfs
from hdfs import InsecureClient
import os
 
import numpy as np
import pandas as pd
import logging
 
os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
 
block_size = 512 * 1024
 
conf = SparkConf().setAppName("DCSSpark").setMaster("spark://spark1:7077").set('spark.cores.max', 20).set("spark.executor.cores", 10).set("spark.executor.memory", "10g").set("spark.hadoop.dfs.blocksize", str(block_size)).set("spark.hadoop.dfs.block.size", str(block_size))
 
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", 
block_size)
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", 
block_size)
 
##
# main
##
 # create DataFrame

df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
 
# save using DataFrameWriter, resulting 128MB-block-size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
 
# save using rdd, resulting 512k-block-size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
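One way to check which block size actually took effect (a sketch reusing the {{client}} and paths from the snippet above; {{status()}} comes from the same hdfs library and reports the per-file block size in bytes):

{code:python}
# Sketch: list the files written by each save above and print their HDFS block size.
for base in ('/tmp/temp_with_df', '/tmp/temp_with_rrd'):
    for name in client.list(base):
        info = client.status('%s/%s' % (base, name))
        if info['type'] == 'FILE':
            print(base, name, info['blockSize'])
{code}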






[jira] [Commented] (SPARK-23560) A joinWith followed by groupBy requires extra shuffle

2018-03-10 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394369#comment-16394369
 ] 

Bruce Robbins commented on SPARK-23560:
---

A simpler example that seems to reproduce this issue (without joinWith or join) 
is the following:

{noformat}
val df = Seq((20, 33), (20, 44), (99, 20), (33, 33), (-44, 99)).toDF("id1", 
"id2")

val result1 = df
  .select(struct('id1 as 'id1, 'id2 as 'id2) as 'x)
  .repartition($"x.id1")
  .groupBy($"x.id1")
  .count

result1.explain
== Physical Plan ==
*(2) HashAggregate(keys=[x#11.id1#25], functions=[count(1)])
+- Exchange hashpartitioning(x#11.id1#25, 200)
   +- *(1) HashAggregate(keys=[x#11.id1 AS x#11.id1#25], 
functions=[partial_count(1)])
  +- Exchange hashpartitioning(x#11.id1, 200)
 +- LocalTableScan [x#11]

val result2 = df
  .repartition('id1)
  .groupBy('id1).count

result2.explain
== Physical Plan ==
*(1) HashAggregate(keys=[id1#5], functions=[count(1)])
+- *(1) HashAggregate(keys=[id1#5], functions=[partial_count(1)])
   +- Exchange hashpartitioning(id1#5, 200)
  +- LocalTableScan [id1#5]
{noformat}

Seems joinWith is relevant only in the sense that it creates a struct.
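The same reproduction transliterated to pyspark (assuming an active SparkSession; the expectation is the double Exchange in the first plan, as in the Scala output above):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(20, 33), (20, 44), (99, 20), (33, 33), (-44, 99)], ["id1", "id2"])

# Grouping on a struct field: the plan shows two Exchange hashpartitioning steps.
result1 = (df.select(struct(col("id1").alias("id1"), col("id2").alias("id2")).alias("x"))
             .repartition(col("x.id1"))
             .groupBy(col("x.id1"))
             .count())
result1.explain()

# Grouping on the plain column: only one Exchange.
result2 = df.repartition(col("id1")).groupBy(col("id1")).count()
result2.explain()
{code}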

> A joinWith followed by groupBy requires extra shuffle
> -
>
> Key: SPARK-23560
> URL: https://issues.apache.org/jira/browse/SPARK-23560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: debian 8.9, macos x high sierra
>Reporter: Bruce Robbins
>Priority: Major
>
> Depending on the size of the input, a joinWith followed by a groupBy requires 
> more shuffles than a join followed by a groupBy.
> For example, here's a joinWith on two CSV files, followed by a groupBy:
> {noformat}
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("id1", LongType) :: StructField("id2", 
> LongType) :: Nil)
> val df1 = spark.read.schema(schema).csv("ds1.csv")
> val df2 = spark.read.schema(schema).csv("ds2.csv")
> val result1 = df1.joinWith(df2, df1.col("id1") === 
> df2.col("id2")).groupBy("_1.id1").count
> result1.explain
> == Physical Plan ==
> *(6) HashAggregate(keys=[_1#8.id1#19L], functions=[count(1)])
> +- Exchange hashpartitioning(_1#8.id1#19L, 200)
>+- *(5) HashAggregate(keys=[_1#8.id1 AS _1#8.id1#19L], 
> functions=[partial_count(1)])
>   +- *(5) Project [_1#8]
>  +- *(5) SortMergeJoin [_1#8.id1], [_2#9.id2], Inner
> :- *(2) Sort [_1#8.id1 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(_1#8.id1, 200)
> : +- *(1) Project [named_struct(id1, id1#0L, id2, id2#1L) AS 
> _1#8]
> :+- *(1) FileScan csv [id1#0L,id2#1L] Batched: false, 
> Format: CSV, Location: InMemoryFileIndex[file:.../ds1.csv], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> +- *(4) Sort [_2#9.id2 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(_2#9.id2, 200)
>   +- *(3) Project [named_struct(id1, id1#4L, id2, id2#5L) AS 
> _2#9]
>  +- *(3) FileScan csv [id1#4L,id2#5L] Batched: false, 
> Format: CSV, Location: InMemoryFileIndex[file:...ds2.csv], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> {noformat}
> Using join, there is one less shuffle:
> {noformat}
> val result2 = df1.join(df2,  df1.col("id1") === 
> df2.col("id2")).groupBy(df1("id1")).count
> result2.explain
> == Physical Plan ==
> *(5) HashAggregate(keys=[id1#0L], functions=[count(1)])
> +- *(5) HashAggregate(keys=[id1#0L], functions=[partial_count(1)])
>+- *(5) Project [id1#0L]
>   +- *(5) SortMergeJoin [id1#0L], [id2#5L], Inner
>  :- *(2) Sort [id1#0L ASC NULLS FIRST], false, 0
>  :  +- Exchange hashpartitioning(id1#0L, 200)
>  : +- *(1) Project [id1#0L]
>  :+- *(1) Filter isnotnull(id1#0L)
>  :   +- *(1) FileScan csv [id1#0L] Batched: false, Format: 
> CSV, Location: InMemoryFileIndex[file:.../ds1.csv], PartitionFilters: [], 
> PushedFilters: [IsNotNull(id1)], ReadSchema: struct
>  +- *(4) Sort [id2#5L ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(id2#5L, 200)
>+- *(3) Project [id2#5L]
>   +- *(3) Filter isnotnull(id2#5L)
>  +- *(3) FileScan csv [id2#5L] Batched: false, Format: 
> CSV, Location: InMemoryFileIndex[file:...ds2.csv], PartitionFilters: [], 
> PushedFilters: [IsNotNull(id2)], ReadSchema: struct
> {noformat}
> -The extra exchange is reflected in the run time of the query.- Actually, I 
> recant this bit. In my particular tests, the extra exchange has negligible 
> impact on run time. All the difference is in stage 2.
> My tests were on inputs 

[jira] [Updated] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23645:
--
Description: 
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper

---
TypeError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, 
returnType='bigint')(a='id',b='b')).show()

TypeError: wrapper() got an unexpected keyword argument 'a'{code}
 

 

*discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. so, the challenge here is to:

(a) make sure to reconstruct the proper order of the full args/kwargs

--> args first, and then kwargs (not in the order passed but in the order 
requested by the fn)

(b) handle python2 and python3 `inspect` module inconsistencies 

  was:
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
 

 

*discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, 

[jira] [Updated] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23645:
--
Description: 
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
 

 

*discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. so, the challenge here is to:

(a) make sure to reconstruct the proper order of the full args/kwargs

--> args first, and then kwargs (not in the order passed but in the order 
requested by the fn)

(b) handle python2 and python3 `inspect` module inconsistencies 

  was:
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
discourse: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. 


> pandas_udf can not be called with keyword arguments
> ---
>
> Key: 

[jira] [Updated] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23645:
--
Description: 
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
discourse: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. 

  was:
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}


> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
> judf = self._judf
> sc = SparkContext._active_spark_context
> return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
> # This function is for improving the online help system in the interactive 
> interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the 
> docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
> """
> Wrap 

[jira] [Created] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)
Stu (Michael Stewart) created SPARK-23645:
-

 Summary: pandas_udf can not be called with keyword arguments
 Key: SPARK-23645
 URL: https://issues.apache.org/jira/browse/SPARK-23645
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
 Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
OpenJDK 64-Bit Server VM, 1.8.0_141
Reporter: Stu (Michael Stewart)


pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
    judf = self._judf
    sc = SparkContext._active_spark_context
    return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
    """
    Wrap this udf with a function and attach docstring from func
    """

    # It is possible for a callable instance without __name__ attribute or/and
    # __module__ attribute to be wrapped here. For example, functools.partial. In this case,
    # we should avoid wrapping the attributes from the wrapped function to the wrapper
    # function. So, we take out these attribute names from the default names to set and
    # then manually assign it after being wrapped.
    assignments = tuple(
        a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != '__module__')

    @functools.wraps(self.func, assigned=assignments)
    def wrapper(*args):
        return self(*args)

...{code}






[jira] [Commented] (SPARK-23644) SHS with proxy doesn't show applications

2018-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394243#comment-16394243
 ] 

Apache Spark commented on SPARK-23644:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20794

> SHS with proxy doesn't show applications
> 
>
> Key: SPARK-23644
> URL: https://issues.apache.org/jira/browse/SPARK-23644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> The History server supports being consumed via a proxy using the 
> {{spark.ui.proxyBase}} property. Although it works fine if you access the 
> proxy using a link which ends with "/", it doesn't show any application if 
> the URL accessed doesn't end with "/". E.g. if you access SHS using 
> {{https://yourproxy.whatever:1234/path/to/historyserver/}} it works fine, but 
> if you access it using 
> {{https://yourproxy.whatever:1234/path/to/historyserver}} no application is 
> shown.
> The cause of this is that the call to the REST API to get the list of the 
> application is a relative path call. So in the second case, instead of 
> performing a GET to 
> {{https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications}},
>  it performs a call to 
> {{https://yourproxy.whatever:1234/path/to/api/v1/applications}}.






[jira] [Assigned] (SPARK-23644) SHS with proxy doesn't show applications

2018-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23644:


Assignee: (was: Apache Spark)

> SHS with proxy doesn't show applications
> 
>
> Key: SPARK-23644
> URL: https://issues.apache.org/jira/browse/SPARK-23644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> The History server supports being consumed via a proxy using the 
> {{spark.ui.proxyBase}} property. Although it works fine if you access the 
> proxy using a link which ends with "/", it doesn't show any application if 
> the URL accessed doesn't end with "/". E.g. if you access SHS using 
> {{https://yourproxy.whatever:1234/path/to/historyserver/}} it works fine, but 
> if you access it using 
> {{https://yourproxy.whatever:1234/path/to/historyserver}} no application is 
> shown.
> The cause of this is that the call to the REST API to get the list of the 
> application is a relative path call. So in the second case, instead of 
> performing a GET to 
> {{https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications}},
>  it performs a call to 
> {{https://yourproxy.whatever:1234/path/to/api/v1/applications}}.






[jira] [Assigned] (SPARK-23644) SHS with proxy doesn't show applications

2018-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23644:


Assignee: Apache Spark

> SHS with proxy doesn't show applications
> 
>
> Key: SPARK-23644
> URL: https://issues.apache.org/jira/browse/SPARK-23644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> The History server supports being consumed via a proxy using the 
> {{spark.ui.proxyBase}} property. Although it works fine if you access the 
> proxy using a link which ends with "/", it doesn't show any application if 
> the URL accessed doesn't end with "/". E.g. if you access SHS using 
> {{https://yourproxy.whatever:1234/path/to/historyserver/}} it works fine, but 
> if you access it using 
> {{https://yourproxy.whatever:1234/path/to/historyserver}} no application is 
> shown.
> The cause of this is that the call to the REST API to get the list of the 
> application is a relative path call. So in the second case, instead of 
> performing a GET to 
> {{https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications}},
>  it performs a call to 
> {{https://yourproxy.whatever:1234/path/to/api/v1/applications}}.






[jira] [Created] (SPARK-23644) SHS with proxy doesn't show applications

2018-03-10 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23644:
---

 Summary: SHS with proxy doesn't show applications
 Key: SPARK-23644
 URL: https://issues.apache.org/jira/browse/SPARK-23644
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 2.3.0
Reporter: Marco Gaido


The History server supports being consumed via a proxy using the 
{{spark.ui.proxyBase}} property. Although it works fine if you access the proxy 
using a link which ends with "/", it doesn't show any application if the URL 
accessed doesn't end with "/". E.g. if you access SHS using 
{{https://yourproxy.whatever:1234/path/to/historyserver/}} it works fine, but 
if you access it using 
{{https://yourproxy.whatever:1234/path/to/historyserver}} no application is 
shown.

The cause of this is that the call to the REST API to get the list of 
applications is a relative-path call. So in the second case, instead of 
performing a GET to 
{{https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications}}, 
it performs a call to 
{{https://yourproxy.whatever:1234/path/to/api/v1/applications}}.
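The relative-path behaviour can be seen outside Spark with Python's urllib (the proxy URLs are the hypothetical ones from above):

{code:python}
from urllib.parse import urljoin

with_slash = "https://yourproxy.whatever:1234/path/to/historyserver/"
without_slash = "https://yourproxy.whatever:1234/path/to/historyserver"

# Relative resolution keeps the last path segment only when the base ends with "/".
print(urljoin(with_slash, "api/v1/applications"))
# -> https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications
print(urljoin(without_slash, "api/v1/applications"))
# -> https://yourproxy.whatever:1234/path/to/api/v1/applications
{code}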






[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory

2018-03-10 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23643:
---
Description: The hashSeed method allocates a 64-byte buffer and puts only 8 
bytes of the seed parameter into it. The other bytes are always zero and could 
easily be excluded from the hash calculation.  (was: The setSeed method 
allocates a 64-byte buffer and puts only 8 bytes of the seed parameter into it. 
The other bytes are always zero and could easily be excluded from the hash 
calculation.)

> XORShiftRandom.hashSeed allocates unnecessary memory
> 
>
> Key: SPARK-23643
> URL: https://issues.apache.org/jira/browse/SPARK-23643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the 
> seed parameter into it. Other bytes are always zero and could be easily 
> excluded from hash calculation.






[jira] [Commented] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory

2018-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394205#comment-16394205
 ] 

Apache Spark commented on SPARK-23643:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20793

> XORShiftRandom.hashSeed allocates unnecessary memory
> 
>
> Key: SPARK-23643
> URL: https://issues.apache.org/jira/browse/SPARK-23643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> The setSeed method allocates 64 bytes buffer and puts only 8 bytes of the 
> seed parameter into it. Other bytes are always zero and could be easily 
> excluded from hash calculation.






[jira] [Assigned] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory

2018-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23643:


Assignee: Apache Spark

> XORShiftRandom.hashSeed allocates unnecessary memory
> 
>
> Key: SPARK-23643
> URL: https://issues.apache.org/jira/browse/SPARK-23643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Trivial
>
> The setSeed method allocates 64 bytes buffer and puts only 8 bytes of the 
> seed parameter into it. Other bytes are always zero and could be easily 
> excluded from hash calculation.






[jira] [Assigned] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory

2018-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23643:


Assignee: (was: Apache Spark)

> XORShiftRandom.hashSeed allocates unnecessary memory
> 
>
> Key: SPARK-23643
> URL: https://issues.apache.org/jira/browse/SPARK-23643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> The setSeed method allocates 64 bytes buffer and puts only 8 bytes of the 
> seed parameter into it. Other bytes are always zero and could be easily 
> excluded from hash calculation.






[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory

2018-03-10 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23643:
---
Summary: XORShiftRandom.hashSeed allocates unnecessary memory  (was: 
XORShiftRandom.setSeed allocates unnecessary memory)

> XORShiftRandom.hashSeed allocates unnecessary memory
> 
>
> Key: SPARK-23643
> URL: https://issues.apache.org/jira/browse/SPARK-23643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> The setSeed method allocates 64 bytes buffer and puts only 8 bytes of the 
> seed parameter into it. Other bytes are always zero and could be easily 
> excluded from hash calculation.






[jira] [Created] (SPARK-23643) XORShiftRandom.setSeed allocates unnecessary memory

2018-03-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23643:
--

 Summary: XORShiftRandom.setSeed allocates unnecessary memory
 Key: SPARK-23643
 URL: https://issues.apache.org/jira/browse/SPARK-23643
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Maxim Gekk


The setSeed method allocates a 64-byte buffer and puts only 8 bytes of the seed 
parameter into it. The other bytes are always zero and could easily be excluded 
from the hash calculation.
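A rough Python illustration of the waste described above (the exact buffer layout is an assumption; the actual code is Scala/Java): only the first 8 bytes vary with the seed, while the remaining 56 are constant zeros that the hash function still has to process.

{code:python}
import struct

seed = 123456789
full_buffer = struct.pack('>q', seed) + b'\x00' * 56   # 64-byte buffer as described
seed_bytes = struct.pack('>q', seed)                    # the 8 bytes that actually vary

# The trailing zeros carry no information about the seed, so hashing just the
# 8 seed bytes would do the same job on an eighth of the input.
print(len(full_buffer), len(seed_bytes))                # 64 8
{code}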






[jira] [Closed] (SPARK-23605) Conflicting dependencies for janino in 2.3.0

2018-03-10 Thread Tao Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Liu closed SPARK-23605.
---

> Conflicting dependencies for janino in 2.3.0
> 
>
> Key: SPARK-23605
> URL: https://issues.apache.org/jira/browse/SPARK-23605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tao Liu
>Priority: Minor
>  Labels: maven
> Attachments: pom.xml
>
>
> spark-catalyst_2.11 2.3.0 has both a janino 2.7.8 and a commons-compiler 
> 3.0.8 dependency which are conflicting with one another resulting in 
> ClassNotFoundExceptions.
> java.lang.ClassNotFoundException: 
> org.codehaus.janino.InternalCompilerException
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1421)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1369)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection$lzycompute(ExpressionEncoder.scala:264)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection(ExpressionEncoder.scala:264)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:468)
>   at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:507)
> 






[jira] [Commented] (SPARK-23605) Conflicting dependencies for janino in 2.3.0

2018-03-10 Thread Tao Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394197#comment-16394197
 ] 

Tao Liu commented on SPARK-23605:
-

Thanks [~kiszk], It was the spring-boot-dependencies/pom.xml that was causing 
it. Migrating to 2.0.0.RELEASE fixes the issue.

> Conflicting dependencies for janino in 2.3.0
> 
>
> Key: SPARK-23605
> URL: https://issues.apache.org/jira/browse/SPARK-23605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tao Liu
>Priority: Minor
>  Labels: maven
> Attachments: pom.xml
>
>
> spark-catalyst_2.11 2.3.0 has both a janino 2.7.8 and a commons-compiler 
> 3.0.8 dependency which are conflicting with one another resulting in 
> ClassNotFoundExceptions.
> java.lang.ClassNotFoundException: 
> org.codehaus.janino.InternalCompilerException
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1421)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1369)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection$lzycompute(ExpressionEncoder.scala:264)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection(ExpressionEncoder.scala:264)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:468)
>   at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:507)
> 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23605) Conflicting dependencies for janino in 2.3.0

2018-03-10 Thread Tao Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Liu resolved SPARK-23605.
-
Resolution: Not A Bug

> Conflicting dependencies for janino in 2.3.0
> 
>
> Key: SPARK-23605
> URL: https://issues.apache.org/jira/browse/SPARK-23605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tao Liu
>Priority: Minor
>  Labels: maven
> Attachments: pom.xml
>
>
> spark-catalyst_2.11 2.3.0 has both a janino 2.7.8 and a commons-compiler 
> 3.0.8 dependency; the two conflict with one another, resulting in 
> ClassNotFoundExceptions.
> java.lang.ClassNotFoundException: 
> org.codehaus.janino.InternalCompilerException
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1421)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1494)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1369)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1325)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection$lzycompute(ExpressionEncoder.scala:264)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.extractProjection(ExpressionEncoder.scala:264)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:468)
>   at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:507)
> 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-03-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23173.
--
   Resolution: Fixed
     Assignee: Michał Świtakowski
Fix Version/s: 2.4.0
               2.3.1

Fixed in https://github.com/apache/spark/pull/20694

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Assignee: Michał Świtakowski
>Priority: Major
>  Labels: release-notes
> Fix For: 2.3.1, 2.4.0
>
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check whether a resulting struct 
> respects the nullability of the schema. This leads to very weird problems in 
> consuming expressions. In our case, Parquet writing would produce an illegal 
> Parquet file.
> There are roughly two solutions here:
>  # Assume that each field in the schema passed to {{from_json}} is nullable, 
> and ignore the nullability information set in the passed schema.
>  # Validate the object at runtime, and fail execution if the data is null 
> where we are not expecting it.
> I am currently slightly in favor of option 1, since it is the more 
> performant option and a lot easier to implement.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]
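
As an editor's aside (a sketch, not the fix in the PR above): the mismatch can 
be worked around on the caller side in the spirit of option 1, by relaxing 
nullability on the schema handed to {{from_json}}. The session setup and 
sample JSON below are illustrative assumptions.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("from-json-nullability").getOrCreate()
import spark.implicits._

// A schema that claims both fields are non-nullable.
val strictSchema = StructType(Seq(
  StructField("a", IntegerType, nullable = false),
  StructField("b", StringType, nullable = false)))

// Option-1-style workaround: treat every top-level field as nullable before parsing.
val relaxedSchema = StructType(strictSchema.fields.map(_.copy(nullable = true)))

val df = Seq("""{"a": 1}""").toDF("json")

// With the strict schema the parsed struct still carries b = null, contradicting
// the declared nullability; the relaxed schema at least matches what the data
// can actually contain.
df.select(
  from_json($"json", strictSchema).as("strict"),
  from_json($"json", relaxedSchema).as("relaxed")
).show(false)
{code}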



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15473) CSV fails to write and read back empty dataframe

2018-03-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15473.
--
Resolution: Cannot Reproduce

Yup, I just double-checked on the master branch too. Let me leave this resolved.
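
For completeness, a self-contained version of that re-check is sketched below; 
the temp directory, session setup, and the explicit schema on read (to 
sidestep schema inference over an empty directory) are illustrative 
assumptions rather than part of the original report.

{code}
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-15473-recheck").getOrCreate()

val path = Files.createTempDirectory("spark-15473").resolve("csv").toFile

val emptyDf = spark.range(10).filter(_ => false)
emptyDf.write.format("csv").save(path.getCanonicalPath)

// Per the resolution above, the original "Can not create a Path from an empty
// string" failure no longer reproduces when reading the result back.
val copyEmptyDf = spark.read.format("csv").schema(emptyDf.schema).load(path.getCanonicalPath)
copyEmptyDf.show()
{code}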

> CSV fails to write and read back empty dataframe
> 
>
> Key: SPARK-15473
> URL: https://issues.apache.org/jira/browse/SPARK-15473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently the CSV data source fails to write and read back empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).filter(_ => false)
> emptyDf.write
>   .format("csv")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("csv")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Can not create a Path from an empty string
> java.lang.IllegalArgumentException: Can not create a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:987)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:987)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:178)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:178)
>   at scala.Option.map(Option.scala:146)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls to 
> {{WriterContainer.writeRows()}}).
> Maybe it should be able to write and read back the header (the schema) as 
> well as the empty data.
> For Parquet and JSON this works, but for CSV it does not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23510) Support read data from Hive 2.2 and Hive 2.3 metastore

2018-03-10 Thread R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394117#comment-16394117
 ] 

R commented on SPARK-23510:
---

[~q79969786] - can you add a fix version of 2.3.1 to this? I would like this 
in the next Spark release.

> Support read data from Hive 2.2 and Hive 2.3 metastore
> --
>
> Key: SPARK-23510
> URL: https://issues.apache.org/jira/browse/SPARK-23510
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org