[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtil will overwrite user config

2017-08-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139590#comment-16139590
 ] 

Saisai Shao commented on SPARK-21819:
-

Then I think there should be no issue in Spark, right? [~KSLaskfla].

>  UserGroupInformation initialization in SparkHadoopUtil will overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark 2.1.0/2.1.1 (I checked the GitHub source of 2.2.0; it exists there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
> Attachments: yarnsparkutil.jpg
>
>
> When submitting a job to YARN from Java or Scala code, the initialization of 
> SparkHadoopUtil will overwrite the configuration in UGI, which may not be 
> expected if the UGI has already been initialized with customized XMLs that 
> are not on the classpath (e.g. via cfg4j, which can load configuration from 
> GitHub, a database, etc.). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario: my YARN cluster is kerberized, and my configuration is set to use 
> Kerberos for Hadoop security. However, after the initialization of 
> SparkHadoopUtil, the authenticationMethod in UGI is updated to "simple" (my 
> XMLs are not on the classpath), which leads to the failure below:
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at 

[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtil will overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139586#comment-16139586
 ] 

Keith Sun commented on SPARK-21819:
---

[~vanzin], the third option works for my use case: I could add different 
default resources based on the input (simple or kerberos), as in the sketch below.
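
A minimal sketch of that approach, assuming the "third option" means registering the 
kerberized XML files as Hadoop default resources before Spark initializes UGI (file 
names, principal and keytab path below are illustrative only, not taken from this issue):

{code}
// Sketch only: make the custom kerberized XMLs visible to every Configuration
// that Spark (and SparkHadoopUtil) will create, then log in before building the session.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

object KerberosBootstrap {
  def main(args: Array[String]): Unit = {
    // Assumed file names; they must be on the application classpath.
    Configuration.addDefaultResource("kerberos-core-site.xml")
    Configuration.addDefaultResource("kerberos-yarn-site.xml")

    // Re-apply the configuration to UGI and log in before Spark touches UGI.
    UserGroupInformation.setConfiguration(new Configuration())
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")

    val spark = SparkSession.builder()
      .appName("kerberos-yarn-client")
      .master("yarn")
      .getOrCreate()
    try {
      spark.range(10).count()
    } finally {
      spark.stop()
    }
  }
}
{code}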



>  UserGroupInformation initialization in SparkHadoopUtil will overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark 2.1.0/2.1.1 (I checked the GitHub source of 2.2.0; it exists there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
> Attachments: yarnsparkutil.jpg
>
>
> When submitting a job to YARN from Java or Scala code, the initialization of 
> SparkHadoopUtil will overwrite the configuration in UGI, which may not be 
> expected if the UGI has already been initialized with customized XMLs that 
> are not on the classpath (e.g. via cfg4j, which can load configuration from 
> GitHub, a database, etc.). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario: my YARN cluster is kerberized, and my configuration is set to use 
> Kerberos for Hadoop security. However, after the initialization of 
> SparkHadoopUtil, the authenticationMethod in UGI is updated to "simple" (my 
> XMLs are not on the classpath), which leads to the failure below:
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  SIMPLE authentication is not enabled.  

[jira] [Resolved] (SPARK-21805) disable R vignettes code on Windows

2017-08-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21805.
--
  Resolution: Fixed
Assignee: Felix Cheung
   Fix Version/s: 2.3.0
  2.2.1
Target Version/s: 2.2.1, 2.3.0

> disable R vignettes code on Windows
> ---
>
> Key: SPARK-21805
> URL: https://issues.apache.org/jira/browse/SPARK-21805
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21745) Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.

2017-08-23 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-21745:
--
Description: 
This is a refactoring of {{ColumnVector}} hierarchy and related classes.

# make {{ColumnVector}} read-only
# introduce {{WritableColumnVector}} with write interface
# remove {{ReadOnlyColumnVector}}

  was:
This is a refactoring of {{ColumnVector}} hierarchy and related classes.

# make {{ColumnVector}} read-only
# introduce {{MutableColumnVector}} with write interface
# remove {{ReadOnlyColumnVector}}

Summary: Refactor ColumnVector hierarchy to make ColumnVector read-only 
and to introduce WritableColumnVector.  (was: Refactor ColumnVector hierarchy 
to make ColumnVector read-only and to introduce MutableColumnVector.)

> Refactor ColumnVector hierarchy to make ColumnVector read-only and to 
> introduce WritableColumnVector.
> -
>
> Key: SPARK-21745
> URL: https://issues.apache.org/jira/browse/SPARK-21745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>
> This is a refactoring of {{ColumnVector}} hierarchy and related classes.
> # make {{ColumnVector}} read-only
> # introduce {{WritableColumnVector}} with write interface
> # remove {{ReadOnlyColumnVector}}
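
A rough sketch of the hierarchy described above, with a couple of illustrative method 
signatures only (the real classes have many more methods; this is not the actual Spark 
code):

{code}
// Illustrative only: a read-only base type plus a subclass that adds the write interface.
abstract class ColumnVector {
  def isNullAt(rowId: Int): Boolean
  def getInt(rowId: Int): Int
  def getDouble(rowId: Int): Double
}

abstract class WritableColumnVector extends ColumnVector {
  def putNull(rowId: Int): Unit
  def putInt(rowId: Int, value: Int): Unit
  def putDouble(rowId: Int, value: Double): Unit
}
{code}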



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21807) The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100

2017-08-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21807.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> The getAliasedConstraints function  in LogicalPlan will take a long time when 
> number of expressions is greater than 100 
> 
>
> Key: SPARK-21807
> URL: https://issues.apache.org/jira/browse/SPARK-21807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: eaton
>Assignee: eaton
> Fix For: 2.3.0
>
>
> The getAliasedConstraints function in LogicalPlan.scala clones the 
> expression set whenever an element is added,
> and this takes a long time.
> Before the change, the cost of getAliasedConstraints is:
> 100 expressions:  41 seconds
> 150 expressions:  466 seconds
> The test is like this:
> test("getAliasedConstraints") {
>   val expressionNum = 150
>   val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")())
>   val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())
>   val beginTime = System.currentTimeMillis()
>   val expressions = aggPlan.validConstraints
>   println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms")
>   // The size of the aliased expression set is n * (n - 1) / 2 + n
>   assert(expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum)
> }
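
For intuition, a small illustration of the cost pattern described above, using plain 
Scala sets rather than the actual ExpressionSet (an assumed simplification, not the 
Spark code or the eventual fix):

{code}
// Cloning the accumulated set on every add makes n insertions O(n^2) overall.
def cloneOnEachAdd(items: Seq[Int]): Set[Int] =
  items.foldLeft(Set.empty[Int]) { (acc, i) =>
    val cloned = acc.map(identity) // full copy of everything accumulated so far
    cloned + i
  }

// Accumulating into one mutable set and converting once at the end is roughly O(n).
def buildOnce(items: Seq[Int]): Set[Int] = {
  val buf = scala.collection.mutable.HashSet.empty[Int]
  items.foreach(buf += _)
  buf.toSet
}
{code}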



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-08-23 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139507#comment-16139507
 ] 

Takuya Ueshin commented on SPARK-21190:
---

[~icexelloss]
We can know the length of the input from the {{RecordBatch}} and pass it as a hint to 
the actual function, so users can create output of exactly that length.

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2017-08-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139502#comment-16139502
 ] 

Saisai Shao commented on SPARK-17321:
-

1. If NM recovery is enabled, then YARN will provide a recovery path. This 
recovery path is used by any aux-service running on YARN (Tez, MR, Spark, ...) 
and by the NM itself to store state, so the user/YARN should guarantee the 
availability of this path; if not, the NM itself will fail to restart. So, as a 
conclusion, if NM recovery is enabled we should always use the recovery path.

2. Yes, we will never use the NM local dirs, whether NM recovery is enabled or not. 
Previously we needed to support Hadoop 2.6-, which has no recovery path, so we 
chose a local dir instead. Since we now only support 2.6+, there is no reason to 
keep using an NM local dir.

3. The memory overhead should not be large, since it only stores some 
application/executor information. Also, when you use the external shuffle service in 
standalone and Mesos modes, it always uses memory, so I don't think it is a big 
problem.

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0, 2.1.1
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed some 
> Spark applications fail randomly due to the YarnShuffleService.
> From the log I found
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use the other good 
> disks if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2017-08-23 Thread lishuming (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139488#comment-16139488
 ] 

lishuming commented on SPARK-17321:
---

[~jerryshao] I agree with what you said; however, there are some questions in my 
mind:

1. The current selection strategy is somewhat puzzling, because both 
`yarn.nodemanager.local-dirs` and the NM recovery path are available for storing 
the leveldb, so we could always pick an available disk and avoid the disk 
problem (https://github.com/apache/spark/pull/18905#issuecomment-323287272); see 
the sketch after this list.

2. If, as [~jerryshao] said, `yarn.nodemanager.local-dirs` should not be used 
whether NM recovery is enabled or not, am I right?

3. Can someone check whether, if we don't use leveldb, a ShuffleService that uses 
a `Map` would affect the NM's memory or anything else?
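
A minimal sketch of the writability check mentioned in point 1, assuming a plain list 
of candidate directories (this is not the actual YarnShuffleService code):

{code}
import java.io.File
import java.nio.file.Files

// Return the first candidate directory that exists and accepts a real write probe.
def firstWritableDir(candidates: Seq[String]): Option[File] =
  candidates.iterator.map(new File(_)).find { dir =>
    try {
      dir.isDirectory && {
        val probe = Files.createTempFile(dir.toPath, ".probe", null)
        Files.delete(probe)
        true
      }
    } catch {
      case _: Exception => false // read-only or otherwise broken disk
    }
  }
{code}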

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0, 2.1.1
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed some 
> Spark applications fail randomly due to the YarnShuffleService.
> From the log I found
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use the other good 
> disks if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-23 Thread lishuming (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139482#comment-16139482
 ] 

lishuming commented on SPARK-21733:
---

[~1028344...@qq.com] Maybe you should check the executor's log to find some 
exceptions...

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-08-23 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139479#comment-16139479
 ] 

Wenchen Fan commented on SPARK-15689:
-

Yea, `LogicalPlan` is an internal concept and we can't use it in a stable API.

As for making data source instances immutable, one problem is that immutability 
cannot be guaranteed by Java/Scala; we can only document it and ask users to 
follow the convention, so we can't be 100% safe anyway. Another point is that I think 
implementing a data source immutably is hard, especially when a data source implements 
many push-down interfaces, as it needs to copy around a lot of state.
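
As a toy example of what "copying state around" means in an immutable style, each 
push-down call below returns a new instance carrying the accumulated state (the names 
are made up for illustration and are not the proposed V2 API):

{code}
object ImmutableScanExample {
  // Every push-down returns a copy instead of mutating the scan in place.
  case class Scan(
      table: String,
      pushedFilters: Seq[String] = Nil,
      requiredColumns: Seq[String] = Nil) {

    def withFilter(f: String): Scan = copy(pushedFilters = pushedFilters :+ f)
    def withColumns(cols: Seq[String]): Scan = copy(requiredColumns = cols)
  }

  // Each intermediate Scan stays valid and thread-safe, but every new push-down
  // interface must remember to carry all existing fields over when copying.
  val scan: Scan = Scan("events")
    .withFilter("ts > '2017-08-23'")
    .withColumns(Seq("user_id", "ts"))
}
{code}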

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency on 
> DataFrame/SQLContext, making data source API compatibility depend on 
> the upper-level API. The current data source API is also row oriented only 
> and has to go through an expensive conversion from external data types to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21660) Yarn ShuffleService failed to start when the chosen directory becomes read-only

2017-08-23 Thread lishuming (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139471#comment-16139471
 ] 

lishuming commented on SPARK-21660:
---

Sorry, this is a dup of 
[SPARK-17321|https://issues.apache.org/jira/browse/SPARK-17321], I will comment 
there.

> Yarn ShuffleService failed to start when the chosen directory becomes read-only
> --
>
> Key: SPARK-21660
> URL: https://issues.apache.org/jira/browse/SPARK-21660
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 2.1.1
>Reporter: lishuming
>
> h3. Background
> In our production environment, disks turn read-only almost once 
> a month. The current strategy of the YARN ShuffleService for choosing an available 
> directory (disk) to store the shuffle info (DB) is as 
> below (https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L340):
> 1. If the NodeManager's recoveryPath is not empty and the shuffle DB exists in the 
> recoveryPath, return the recoveryPath;
> 2. If the recoveryPath is empty and the shuffle DB exists in 
> `yarn.nodemanager.local-dirs`, set the recoveryPath to the existing DB path and 
> return that path;
> 3. If the recoveryPath is not empty (and the shuffle DB does not exist in it) and the 
> shuffle DB exists in `yarn.nodemanager.local-dirs`, move the existing shuffle DB to 
> the recoveryPath and return that path;
> 4. If none of the above hit, choose the first disk of 
> `yarn.nodemanager.local-dirs` as the recoveryPath.
> None of the strategies above consider whether the chosen disk (directory) is 
> writable, so in our environment we hit exceptions like:
> {code:java}
> 2017-06-25 07:15:43,512 ERROR org.apache.spark.network.util.LevelDBProvider: 
> error opening leveldb file /mnt/dfs/12/yarn/local/registeredExecutors.ldb. 
> Creating new file, will not be able to recover state for existing applications
> at 
> org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
> 2017-06-25 07:15:43,514 WARN org.apache.spark.network.util.LevelDBProvider: 
> error deleting /mnt/dfs/12/yarn/local/registeredExecutors.ldb
> 2017-06-25 07:15:43,515 INFO org.apache.hadoop.service.AbstractService: 
> Service spark_shuffle failed in state INITED; cause: java.io.IOException: 
> Unable to create state store
> at 
> org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
> at 
> org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
> {code}
> h3. Consideration
> 1. In many production environments, `yarn.nodemanager.local-dirs` has 
> more than one disk, so we can use a better selection strategy to avoid the 
> problem above;
> 2. Can we add a check that the chosen DB directory is writable, to 
> avoid the problem above?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21816) The comment of class ExchangeCoordinator contains a typo and a context error

2017-08-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-21816:
--

> The comment of class ExchangeCoordinator contains a typo and a context error
> -
>
> Key: SPARK-21816
> URL: https://issues.apache.org/jira/browse/SPARK-21816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Trivial
>
> The given example in the comment of class ExchangeCoordinator has four 
> post-shuffle partitions, but the current comment says “three”.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21816) The comment of class ExchangeCoordinator contains a typo and a context error

2017-08-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21816.
--
Resolution: Invalid

> The comment of class ExchangeCoordinator contains a typo and a context error
> -
>
> Key: SPARK-21816
> URL: https://issues.apache.org/jira/browse/SPARK-21816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Trivial
>
> The given example in the comment of class ExchangeCoordinator has four 
> post-shuffle partitions, but the current comment says “three”.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21816) The comment of class ExchangeCoordinator contains a typo and a context error

2017-08-23 Thread lufei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufei closed SPARK-21816.
-
Resolution: Fixed

> The comment of class ExchangeCoordinator contains a typo and a context error
> -
>
> Key: SPARK-21816
> URL: https://issues.apache.org/jira/browse/SPARK-21816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Trivial
>
> The given example in the comment of class ExchangeCoordinator has four 
> post-shuffle partitions, but the current comment says “three”.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21816) The comment of class ExchangeCoordinator contains a typo and a context error

2017-08-23 Thread lufei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139455#comment-16139455
 ] 

lufei commented on SPARK-21816:
---

[~hyukjin.kwon], I'm sorry for this, I will close this issue immediately. 
Thanks.

> The comment of class ExchangeCoordinator contains a typo and a context error
> -
>
> Key: SPARK-21816
> URL: https://issues.apache.org/jira/browse/SPARK-21816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Trivial
>
> The given example in the comment of class ExchangeCoordinator has four 
> post-shuffle partitions, but the current comment says “three”.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-23 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139364#comment-16139364
 ] 

Weichen Xu commented on SPARK-21770:


Hmm... `normalizeToProbabilitiesInPlace` is only effective when we 
use class instance counts as the `rawPrediction`.
So I guess the question is: when we have an empty instance set, all the counts == 
0; in this case, what probability should we assume for each class? A uniform 
distribution is a reasonable assumption, I think.
But I am not sure whether this change will cause issues in other places.
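
A minimal sketch of the proposed behaviour, written against a plain Array[Double] 
rather than the actual MLlib vector type (illustration only, not the real 
implementation):

{code}
// Normalize raw predictions to probabilities in place; if every entry is 0
// (e.g. an empty set of instance counts), fall back to a uniform 1/n distribution.
def normalizeToProbabilitiesInPlace(raw: Array[Double]): Unit = {
  val sum = raw.sum
  if (sum != 0.0) {
    var i = 0
    while (i < raw.length) { raw(i) /= sum; i += 1 }
  } else {
    java.util.Arrays.fill(raw, 1.0 / raw.length)
  }
}
{code}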

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139229#comment-16139229
 ] 

Sean Owen commented on SPARK-19954:
---

Might be a mistake about exactly what other change resolved this. 

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is the result of a union 
> operation, the join produces data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-08-23 Thread Adam Heinermann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139058#comment-16139058
 ] 

Adam Heinermann commented on SPARK-19954:
-

How is an issue that affects version 2.1.0 resolved as a duplicate of an 
unrelated issue that was fixed in version 2.1.0?

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is the result of a union 
> operation, the join produces data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-08-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139023#comment-16139023
 ] 

yuhao yang commented on SPARK-21535:


Thanks for the comments.

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm){code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, the current implementation 
> consumes extra driver memory to hold all the trained models, which is not 
> necessary and often leads to memory exceptions for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation 
> so that a used model can be collected by GC, avoiding the unnecessary OOM 
> exceptions.
> E.g. when the grid search space is 12, the old implementation needs to hold all 12 
> trained models in driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time; the previous 
> model can be cleared by GC.
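
A rough sketch of the proposed training loop (simplified, not the actual patch), 
assuming estimator/evaluator/ParamMap objects of the kind CrossValidator already holds:

{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Fit one ParamMap at a time so each trained model becomes garbage-collectable
// before the next one is trained, instead of holding all models at once.
def sequentialMetrics[M <: Model[M]](
    est: Estimator[M],
    eval: Evaluator,
    epm: Array[ParamMap],
    training: Dataset[_],
    validation: Dataset[_]): Array[Double] =
  epm.map { paramMap =>
    val model = est.fit(training, paramMap)
    eval.evaluate(model.transform(validation, paramMap))
  }
{code}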



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-23 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138948#comment-16138948
 ] 

Jakub Nowacki commented on SPARK-21752:
---

OK, I did one more test and, indeed, on the newest version 2.2.0 (and also 
2.1.1) all three configs work fine; though I'm pretty sure one did not work at 
least once, maybe that was a coincidence. I investigated further, and when I 
rolled back to 2.0.2, which I have on a different setup, only 
{{PYSPARK_SUBMIT_ARGS}} worked reliably and the other ones didn't; maybe in 
this version the {{config}} ones work non-deterministically. Thus, 
this seems to be an issue for versions up to 2.0.2; for the newer ones it 
seems to work, but I'm not sure if it does all the time.

The first question is whether there is a way to check if the behaviour that works for 
2.1.1 and 2.2.0 could also stop working on occasion. Also, should we still have some 
documentation for the safer way of configuring this?

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating a new {{SparkSession}}, as getting new 
> packages into an existing {{SparkSession}} indeed doesn't make sense. Thus this 
> will only work with bare Python, Scala or Java, and not in {{pyspark}} or 
> {{spark-shell}}, as they create the session automatically; in that case one 
> would need to use the {{--packages}} option. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Higgs resolved SPARK-21817.

Resolution: Invalid

This was caused by a change in a stable/evolving interface which previously 
accepted null. This should continue to accept null, so it will be fixed in 
HDFS-12344.

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and hence causes a NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138905#comment-16138905
 ] 

Ewan Higgs commented on SPARK-21817:


{quote}
Ewan: do a patch there with a new test method (where?) & I'll review it.
{quote}
Sure.

Sorry for the bug report on Spark, all. I'll fix in HDFS.

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and hence causes a NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17771) Allow start-master/slave scripts to start in the foreground

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-17771.
-
Resolution: Duplicate

> Allow start-master/slave scripts to start in the foreground
> ---
>
> Key: SPARK-17771
> URL: https://issues.apache.org/jira/browse/SPARK-17771
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: Mike Ihbe
>Priority: Minor
>
> Based on conversation from this thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Running-Spark-master-slave-instances-in-non-Daemon-mode-td19172.html
> Some scheduler solutions like Nomad have a simple fork-exec execution model, 
> and the daemonization causes the scheduler to lose track of the spark 
> process. I'm proposing adding a SPARK_NO_DAEMONIZE environment variable that 
> will trigger a switch in ./sbin/spark-daemon.sh to run the process in the 
> foreground. 
> This opens a question about whether or not to rename the bash script, but I 
> think that's a potentially breaking change that we should avoid at this point.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17771) Allow start-master/slave scripts to start in the foreground

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-17771:
---

> Allow start-master/slave scripts to start in the foreground
> ---
>
> Key: SPARK-17771
> URL: https://issues.apache.org/jira/browse/SPARK-17771
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: Mike Ihbe
>Priority: Minor
>
> Based on conversation from this thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Running-Spark-master-slave-instances-in-non-Daemon-mode-td19172.html
> Some scheduler solutions like Nomad have a simple fork-exec execution model, 
> and the daemonization causes the scheduler to lose track of the spark 
> process. I'm proposing adding a SPARK_NO_DAEMONIZE environment variable that 
> will trigger a switch in ./sbin/spark-daemon.sh to run the process in the 
> foreground. 
> This opens a question about whether or not to rename the bash script, but I 
> think that's a potentially breaking change that we should avoid at this point.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2017-08-23 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138856#comment-16138856
 ] 

Evan Chan commented on SPARK-12449:
---

Andrew and others:

Is there a plan to make this CatalystSource available or contribute it back to 
Spark somehow?





> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17891) SQL-based three column join loses first column

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-17891.
-
Resolution: Duplicate

> SQL-based three column join loses first column
> --
>
> Key: SPARK-17891
> URL: https://issues.apache.org/jira/browse/SPARK-17891
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Eli Miller
> Attachments: test.tgz
>
>
> Hi all,
> I hope that this is not a known issue (I haven't had any luck finding 
> anything similar in Jira or the mailing lists but I could be searching with 
> the wrong terms). I just started to experiment with Spark SQL and am seeing 
> what appears to be a bug. When using Spark SQL to join two tables with a 
> three column inner join, the first column join is ignored. The example code 
> that I have starts with two tables *T1*:
> {noformat}
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  1|  2|  3|  4|
> +---+---+---+---+
> {noformat}
> and *T2*:
> {noformat}
> +---+---+---+---+
> |  b|  c|  d|  e|
> +---+---+---+---+
> |  2|  3|  4|  5|
> | -2|  3|  4|  6|
> |  2| -3|  4|  7|
> +---+---+---+---+
> {noformat}
> Joining *T1* to *T2* on *b*, *c* and *d* (in that order):
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
> {code}
> results in the following (note that *T1.b* != *T2.b* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2| -2|  3|  3|  4|  4|  6|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Switching the predicate order to *c*, *b* and *d*:
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
> {code}
> yields different results (now *T1.c* != *T2.c* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2|  2|  3| -3|  4|  4|  7|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Is this expected?
> I started to research this a bit and one thing that jumped out at me was the 
> ordering of the HashedRelationBroadcastMode concatenation in the plan (this 
> is from the *b*, *c*, *d* ordering):
> {noformat}
> ...
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
> +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
>:- *Project [a#0, b#1, c#2, d#3]
>:  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
>: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
> struct
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 
> 32) | (cast(input[2, int, true] as bigint) & 4294967295
>   +- *Project [b#9, c#10, d#11, e#12]
>  +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
> +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
> struct]
> {noformat}
> If this concatenated byte array is ever truncated to 64 bits in a comparison, 
> the leading column will be lost and could result in this behavior.
> I will attach my example code and data. Please let me know if I can provide 
> any other details.
> Best regards,
> Eli
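
(A standalone sketch of the 64-bit truncation Eli describes above, mirroring the packing expression in the plan; the object and method names are illustrative, not Spark code.)

{code}
object PackingDemo {
  // Same structure as the HashedRelationBroadcastMode expression in the plan:
  // shiftleft(shiftleft(b, 32) | (c & 0xFFFFFFFF), 32) | (d & 0xFFFFFFFF)
  def pack(b: Int, c: Int, d: Int): Long =
    (((b.toLong << 32) | (c.toLong & 0xFFFFFFFFL)) << 32) | (d.toLong & 0xFFFFFFFFL)

  def main(args: Array[String]): Unit = {
    // The leading column b is shifted past bit 63 and lost, so rows differing
    // only in b pack to the same 64-bit key:
    println(pack(2, 3, 4) == pack(-2, 3, 4))  // true  -> b is ignored
    println(pack(2, 3, 4) == pack(2, -3, 4))  // false -> c and d still matter
  }
}
{code}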



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18656) org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles requires too much memory in case of many columns

2017-08-23 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138853#comment-16138853
 ] 

poplav commented on SPARK-18656:


[~barrybecker4], any more insights into this? I am going to have to do this 
for about 3000 columns. Are you chunking your queries into smaller batches, 
like running multipleApproxQuantiles on batches of 100 at a time?
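
(For reference, a rough sketch of chunking the computation into batches of columns, assuming the multi-column {{approxQuantile}} overload available since Spark 2.2; the batch size and names are placeholders.)

{code}
import org.apache.spark.sql.DataFrame

// Approximate quantiles for many columns, computed in batches so each call
// (and its memory footprint) stays bounded.
def quantilesInBatches(
    df: DataFrame,
    cols: Seq[String],
    probs: Array[Double],
    relErr: Double,
    batchSize: Int = 100): Map[String, Array[Double]] = {
  cols.grouped(batchSize).flatMap { batch =>
    val res = df.stat.approxQuantile(batch.toArray, probs, relErr)
    batch.zip(res)
  }.toMap
}
{code}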

> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles 
> requires too much memory in case of many columns
> --
>
> Key: SPARK-18656
> URL: https://issues.apache.org/jira/browse/SPARK-18656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sina Sohangir
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.sql.execution.stat.StatFunctions#multipleApproxQuantiles
> is implemented in a way that causes an out-of-memory error in cases where 
> the number of columns is high.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17891) SQL-based three column join loses first column

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-17891:
---

> SQL-based three column join loses first column
> --
>
> Key: SPARK-17891
> URL: https://issues.apache.org/jira/browse/SPARK-17891
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Eli Miller
> Attachments: test.tgz
>
>
> Hi all,
> I hope that this is not a known issue (I haven't had any luck finding 
> anything similar in Jira or the mailing lists but I could be searching with 
> the wrong terms). I just started to experiment with Spark SQL and am seeing 
> what appears to be a bug. When using Spark SQL to join two tables with a 
> three column inner join, the first column join is ignored. The example code 
> that I have starts with two tables *T1*:
> {noformat}
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  1|  2|  3|  4|
> +---+---+---+---+
> {noformat}
> and *T2*:
> {noformat}
> +---+---+---+---+
> |  b|  c|  d|  e|
> +---+---+---+---+
> |  2|  3|  4|  5|
> | -2|  3|  4|  6|
> |  2| -3|  4|  7|
> +---+---+---+---+
> {noformat}
> Joining *T1* to *T2* on *b*, *c* and *d* (in that order):
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
> {code}
> results in the following (note that *T1.b* != *T2.b* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2| -2|  3|  3|  4|  4|  6|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Switching the predicate order to *c*, *b* and *d*:
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
> {code}
> yields different results (now *T1.c* != *T2.c* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2|  2|  3| -3|  4|  4|  7|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Is this expected?
> I started to research this a bit and one thing that jumped out at me was the 
> ordering of the HashedRelationBroadcastMode concatenation in the plan (this 
> is from the *b*, *c*, *d* ordering):
> {noformat}
> ...
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
> +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
>:- *Project [a#0, b#1, c#2, d#3]
>:  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
>: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
> struct
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 
> 32) | (cast(input[2, int, true] as bigint) & 4294967295
>   +- *Project [b#9, c#10, d#11, e#12]
>  +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
> +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
> struct]
> {noformat}
> If this concatenated byte array is ever truncated to 64 bits in a comparison, 
> the leading column will be lost and could result in this behavior.
> I will attach my example code and data. Please let me know if I can provide 
> any other details.
> Best regards,
> Eli



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18594) Name Validation of Databases/Tables

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18594:
--
Fix Version/s: 2.1.0

> Name Validation of Databases/Tables
> ---
>
> Key: SPARK-18594
> URL: https://issues.apache.org/jira/browse/SPARK-18594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Currently, the name validation checks are limited to table creation. It is 
> enforced by the Analyzer rule `PreWriteCheck`.
> However, table renaming and database creation have the same issues. It makes 
> more sense to do the checks in `SessionCatalog`. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Updated] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19307:
--
Fix Version/s: 2.1.1
   2.2.0

> SPARK-17387 caused ignorance of conf object passed to SparkContext:
> ---
>
> Key: SPARK-19307
> URL: https://issues.apache.org/jira/browse/SPARK-19307
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: yuriy_hupalo
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19307.patch
>
>
> After the SPARK-17387 patch was applied, the SparkConf object is ignored when 
> launching SparkContext programmatically via Python from spark-submit:
> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128:
> When running a Python SparkContext(conf=xxx) from spark-submit,
> conf is set but conf._jconf is None, so the conf object passed as an argument 
> is ignored (and used only when launching java_gateway).
> how to fix:
> python/pyspark/context.py:132
> {code:title=python/pyspark/context.py:132}
> if conf is not None and conf._jconf is not None:
> # conf has been initialized in JVM properly, so use conf 
> directly. This represent the
> # scenario that JVM has been launched before SparkConf is created 
> (e.g. SparkContext is
> # created and then stopped, and we create a new SparkConf and new 
> SparkContext again)
> self._conf = conf
> else:
> self._conf = SparkConf(_jvm=SparkContext._jvm)
> + if conf:
> + for key, value in conf.getAll():
> + self._conf.set(key,value)
> + print(key,value)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:

2017-08-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138840#comment-16138840
 ] 

Dongjoon Hyun commented on SPARK-19307:
---

Hi, [~irinatruong].
Yes, it's available in 2.1.1. Maybe the release note is missing due to the 
missing Fix Version.

> SPARK-17387 caused ignorance of conf object passed to SparkContext:
> ---
>
> Key: SPARK-19307
> URL: https://issues.apache.org/jira/browse/SPARK-19307
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: yuriy_hupalo
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19307.patch
>
>
> After the SPARK-17387 patch was applied, the SparkConf object is ignored when 
> launching SparkContext programmatically via Python from spark-submit:
> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128:
> When running a Python SparkContext(conf=xxx) from spark-submit,
> conf is set but conf._jconf is None, so the conf object passed as an argument 
> is ignored (and used only when launching java_gateway).
> how to fix:
> python/pyspark/context.py:132
> {code:title=python/pyspark/context.py:132}
> if conf is not None and conf._jconf is not None:
> # conf has been initialized in JVM properly, so use conf 
> directly. This represent the
> # scenario that JVM has been launched before SparkConf is created 
> (e.g. SparkContext is
> # created and then stopped, and we create a new SparkConf and new 
> SparkContext again)
> self._conf = conf
> else:
> self._conf = SparkConf(_jvm=SparkContext._jvm)
> + if conf:
> + for key, value in conf.getAll():
> + self._conf.set(key,value)
> + print(key,value)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18415) Weird Plan Output when CTE used in RunnableCommand

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18415:
--
Fix Version/s: 2.1.0

> Weird Plan Output when CTE used in RunnableCommand
> --
>
> Key: SPARK-18415
> URL: https://issues.apache.org/jira/browse/SPARK-18415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Currently, when CTE is used in RunnableCommand, the Analyzer does not replace 
> the logical node `With`. The child plan of RunnableCommand is not resolved. 
> However, the output of the `With` plan node looks very confusing.
> For example, 
> {code}
> sql("CREATE VIEW cte_view AS WITH w AS (SELECT 1 AS n) SELECT n FROM 
> w").explain()
> {code}
> The output is like
> {code}
> ExecutedCommand
>+- CreateViewCommand `w`, WITH w AS (SELECT 1 AS n) SELECT n FROM w, 
> false, false, org.apache.spark.sql.execution.command.PersistedView$@2251b87b
>  +- 'With [(w,SubqueryAlias w
>+- Project [1 AS n#16]
>   +- OneRowRelation$
>)]
> +- 'Project ['n]
>+- 'UnresolvedRelation `w`
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21102) Refresh command is too aggressive in parsing

2017-08-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21102:

Fix Version/s: 2.3.0

> Refresh command is too aggressive in parsing
> 
>
> Key: SPARK-21102
> URL: https://issues.apache.org/jira/browse/SPARK-21102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Anton Okolnychyi
>  Labels: starter
> Fix For: 2.3.0
>
>
> SQL REFRESH command parsing is way too aggressive:
> {code}
> | REFRESH TABLE tableIdentifier
> #refreshTable
> | REFRESH .*?  
> #refreshResource
> {code}
> We should change it so it takes the whole string (without space), or a quoted 
> string.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-08-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138829#comment-16138829
 ] 

Dongjoon Hyun commented on SPARK-20754:
---

Hi, [~smilegator].
Could you set [~q79969786] as `Assignee` of this issue? Thanks!

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
> Fix For: 2.3.0
>
>
> We already have the impl of the following functions. We can add the function 
> alias to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION
> {noformat} 
> Returns the position of char in the source string (POSITION(char IN source)).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21102) Refresh command is too aggressive in parsing

2017-08-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21102:
---

Assignee: Anton Okolnychyi

> Refresh command is too aggressive in parsing
> 
>
> Key: SPARK-21102
> URL: https://issues.apache.org/jira/browse/SPARK-21102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Anton Okolnychyi
>  Labels: starter
>
> SQL REFRESH command parsing is way too aggressive:
> {code}
> | REFRESH TABLE tableIdentifier
> #refreshTable
> | REFRESH .*?  
> #refreshResource
> {code}
> We should change it so it takes the whole string (without space), or a quoted 
> string.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6761) Approximate quantile

2017-08-23 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138809#comment-16138809
 ] 

poplav edited comment on SPARK-6761 at 8/23/17 6:22 PM:


Question:  Say I have a DataFrame of 1000 columns.  I want approximate 
quantiles for all 1000 columns of that DataFrame.  I am seeing that this method 
takes in a parameter for one column, thus I am having to map over all 1000 
columns and run this sequentially.  Is it possible for this to accept a 
sequence of columns and improve performance?

[Edit: never mind, I just saw import 
org.apache.spark.sql.execution.stat.StatFunctions.multipleApproxQuantiles, but 
it looks like there are performance issues when running on many columns; see 
https://issues.apache.org/jira/browse/SPARK-18656]


was (Author: poplav):
Question:  Say I have a DataFrame of 1000 columns.  I want approximate 
quantiles for all 1000 columns of that DataFrame.  I am seeing that this method 
takes in a parameter for one column, thus I am having to map over all 1000 
columns and run this sequentially.  Is it possible for this to accept a 
sequence of columns and improve performance?

> Approximate quantile
> 
>
> Key: SPARK-6761
> URL: https://issues.apache.org/jira/browse/SPARK-6761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> See mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Approximate-rank-based-statistics-median-95-th-percentile-etc-for-Spark-td11414.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20754:
--
Fix Version/s: 2.3.0

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
> Fix For: 2.3.0
>
>
> We already have the impl of the following functions. We can add the function 
> alias to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION
> {noformat} 
> Returns the position of char in the source string (POSITION(char IN source)).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20953) Add hash map metrics to aggregate and join

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20953:
--
Fix Version/s: 2.3.0

> Add hash map metrics to aggregate and join
> --
>
> Key: SPARK-20953
> URL: https://issues.apache.org/jira/browse/SPARK-20953
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
> Fix For: 2.3.0
>
>
> It would be useful if we could identify hash map collision issues early on.
> We should add an average hash map probe metric to the aggregate and hash join 
> operators and report it. If the average probe count is greater than a specific 
> (configurable) threshold, we should log an error at runtime.
> The primary classes to look at are UnsafeFixedWidthAggregationMap, 
> HashAggregateExec, HashedRelation, HashJoin.
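
(To make the proposed metric concrete, a toy sketch of a linear-probing lookup that tracks the average number of probes; this is illustrative only, not the UnsafeFixedWidthAggregationMap or HashedRelation code.)

{code}
// Toy open-addressing map that reports the average probes per lookup.
// A high average indicates hash collisions / clustering.
class ProbeCountingMap(capacity: Int) {
  private val keys = new Array[String](capacity)
  private val values = new Array[Int](capacity)
  private var totalProbes = 0L
  private var lookups = 0L

  private def slot(k: String): Int = (k.hashCode & Int.MaxValue) % capacity

  def put(k: String, v: Int): Unit = {      // assumes capacity is never exceeded
    var i = slot(k)
    while (keys(i) != null && keys(i) != k) i = (i + 1) % capacity
    keys(i) = k; values(i) = v
  }

  def get(k: String): Option[Int] = {
    var i = slot(k)
    var probes = 1
    while (keys(i) != null && keys(i) != k) { i = (i + 1) % capacity; probes += 1 }
    totalProbes += probes; lookups += 1
    if (keys(i) == k) Some(values(i)) else None
  }

  // The kind of metric the issue proposes to report and compare to a threshold.
  def avgProbes: Double = if (lookups == 0) 0.0 else totalProbes.toDouble / lookups
}
{code}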



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5

2017-08-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138821#comment-16138821
 ] 

Dongjoon Hyun commented on SPARK-19571:
---

Hi, [~hyukjin.kwon].
Could you set `Fix Version`? Thanks!

> tests are failing to run on Windows with another instance Derby error with 
> Hadoop 2.6.5
> ---
>
> Key: SPARK-19571
> URL: https://issues.apache.org/jira/browse/SPARK-19571
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Hyukjin Kwon
>
> Between 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master
> https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db
> And
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master
> https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a
> Something changed (not likely caused by R) such that tests running on 
> Windows are consistently failing with
> {code}
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 
> C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown
>  Source)
>   at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown 
> Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown
>  Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
>   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
> Source)
>   at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown 
> Source)
> {code}
> Since we run appveyor only when there are R changes, it is a bit harder to 
> track down which change specifically caused this.
> We also can't run appveyor on branch-2.1, so it could also be broken there.
> This could be a blocker, since it could fail tests for the R release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21256) Add WithSQLConf to Catalyst Test

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21256:
--
Fix Version/s: 2.3.0

> Add WithSQLConf to Catalyst Test
> 
>
> Key: SPARK-21256
> URL: https://issues.apache.org/jira/browse/SPARK-21256
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Add WithSQLConf to the Catalyst module.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6761) Approximate quantile

2017-08-23 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138809#comment-16138809
 ] 

poplav commented on SPARK-6761:
---

Question:  Say I have a DataFrame of 1000 columns.  I want approximate 
quantiles for all 1000 columns of that DataFrame.  I am seeing that this method 
takes in a parameter for one column, thus I am having to map over all 1000 
columns and run this sequentially.  Is it possible for this to accept a 
sequence of columns and improve performance?
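
(A sketch of the sequential per-column approach described above; column names and probabilities are illustrative. Each call is a separate pass over the data, which is why accepting a sequence of columns in one call would help.)

{code}
import org.apache.spark.sql.DataFrame

// One approxQuantile call per column: 1000 columns means 1000 separate jobs.
def perColumnQuantiles(df: DataFrame, cols: Seq[String]): Map[String, Array[Double]] =
  cols.map { c =>
    c -> df.stat.approxQuantile(c, Array(0.25, 0.5, 0.75), 0.01)
  }.toMap
{code}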

> Approximate quantile
> 
>
> Key: SPARK-6761
> URL: https://issues.apache.org/jira/browse/SPARK-6761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> See mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Approximate-rank-based-statistics-median-95-th-percentile-etc-for-Spark-td11414.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18539:
--
Fix Version/s: 2.2.0

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.2.0
>
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at 

[jira] [Updated] (SPARK-21578) Add JavaSparkContextSuite

2017-08-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21578:
--
Fix Version/s: 2.3.0

> Add JavaSparkContextSuite
> -
>
> Key: SPARK-21578
> URL: https://issues.apache.org/jira/browse/SPARK-21578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> Due to SI-8479, 
> [SPARK-1093|https://issues.apache.org/jira/browse/SPARK-21578] introduced 
> redundant [SparkContext 
> constructors|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181].
>  However, [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is already 
> fixed in Scala 2.10.5 and Scala 2.11.1. 
> The real reason to provide this constructor is that Java code can access 
> `SparkContext` directly. It's Scala behavior, SI-4278. So, this PR adds an 
> explicit testsuite, `JavaSparkContextSuite`  to prevent future regression, 
> and fixes the outdated comment, too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138685#comment-16138685
 ] 

Steve Loughran commented on SPARK-21817:


API is tagged as stable/evolving; it's clearly in use downstream, so strictly, 
yep, broken, and easily fixed. Just not a codepath tested in hdfs

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and causes an NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138683#comment-16138683
 ] 

Steve Loughran commented on SPARK-21817:


I think it's a regression in HDFS-6984; the superclass handles permissions 
being null, but the modified LocatedFileStatus ctor doesn't.

Ewan: do a patch there with a new test method (where?) & I'll review it.

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and causes an NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138680#comment-16138680
 ] 

Yanbo Liang commented on SPARK-21770:
-

[~srowen] Of course, we should understand what outputs [0, 0, 0]. If it's 
impossible to be all-zero, then there is nothing to do. 

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries
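
(A minimal sketch of the proposed behavior in plain Scala; this is not the actual normalizeToProbabilitiesInPlace implementation.)

{code}
// Normalize raw scores to probabilities; an all-zero raw vector maps to a
// uniform 1/n distribution instead of dividing by zero.
def normalizeToProbabilities(raw: Array[Double]): Array[Double] = {
  val sum = raw.sum
  if (sum == 0.0) Array.fill(raw.length)(1.0 / raw.length)
  else raw.map(_ / sum)
}

// normalizeToProbabilities(Array(0.0, 0.0, 0.0)) == Array(1.0/3, 1.0/3, 1.0/3)
{code}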



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138672#comment-16138672
 ] 

Marcelo Vanzin commented on SPARK-21817:


Not sure if it counts as a regression since the behavior wasn't really 
documented as far as I can tell... but it's definitely a behavior change, since 
permission was allowed to be null before.

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and causes an NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138664#comment-16138664
 ] 

Steve Loughran commented on SPARK-21817:


This a regression in HDFS?

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and causes an NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21501) Spark shuffle index cache size should be memory based

2017-08-23 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-21501.
---
   Resolution: Fixed
 Assignee: Sanket Reddy
Fix Version/s: 2.3.0

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Sanket Reddy
> Fix For: 2.3.0
>
>
> Right now the spark shuffle service has a cache for index files. It is based 
> on a # of files cached (spark.shuffle.service.index.cache.entries). This can 
> cause issues if people have a lot of reducers because the size of each entry 
> can fluctuate based on the # of reducers. 
> We saw an issue with a job that had 17 reducers, and it caused the NM with the 
> Spark shuffle service to use 700-800 MB of memory by itself.
> We should change this cache to be memory based and only allow a certain 
> amount of memory to be used. When I say memory based, I mean the cache should 
> have a limit of, say, 100 MB.
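
(One common way to bound a cache by memory instead of entry count is a weigher, e.g. with Guava's CacheBuilder; the 100 MB budget, key/value types and size estimate below are placeholders, not the actual shuffle-service change.)

{code}
import com.google.common.cache.{CacheBuilder, Weigher}

case class IndexInfo(data: Array[Byte])  // stand-in for a cached index entry

// Cache bounded by approximate total bytes rather than number of entries.
val indexCache = CacheBuilder.newBuilder()
  .maximumWeight(100L * 1024 * 1024)     // ~100 MB budget
  .weigher(new Weigher[String, IndexInfo] {
    override def weigh(key: String, value: IndexInfo): Int = value.data.length
  })
  .build[String, IndexInfo]()
{code}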



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21807) The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100

2017-08-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21807:

Target Version/s: 2.3.0

> The getAliasedConstraints function  in LogicalPlan will take a long time when 
> number of expressions is greater than 100 
> 
>
> Key: SPARK-21807
> URL: https://issues.apache.org/jira/browse/SPARK-21807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: eaton
>Assignee: eaton
>
> The getAliasedConstraints function in LogicalPlan.scala will clone the 
> expression set each time an element is added,
> which takes a long time.
> Before the modification, the cost of getAliasedConstraints is:
> 100 expressions:  41 seconds
> 150 expressions:  466 seconds
> The test is like this:
> test("getAliasedConstraints") {
> val expressionNum = 150
> val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), 
> s"cnt$i")())
> val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())
> val beginTime = System.currentTimeMillis()
> val expressions = aggPlan.validConstraints
> println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms")
> // The size of Aliased expression is n * (n - 1) / 2 + n
> assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + 
> expressionNum)
> }
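
(To make the cost concrete, a toy sketch unrelated to the LogicalPlan code itself: re-copying an immutable collection on every insertion is quadratic in the number of elements, whereas accumulating into a builder and converting once is linear.)

{code}
// Quadratic: every step rebuilds (clones) the accumulated set before adding.
def cloneEachTime(n: Int): Set[Int] =
  (1 to n).foldLeft(Set.empty[Int]) { (acc, i) => acc.map(identity) + i }

// Linear: accumulate into a mutable builder and convert once at the end.
def buildOnce(n: Int): Set[Int] = {
  val b = Set.newBuilder[Int]
  (1 to n).foreach(b += _)
  b.result()
}
{code}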



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21807) The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100

2017-08-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21807:
---

Assignee: eaton

> The getAliasedConstraints function  in LogicalPlan will take a long time when 
> number of expressions is greater than 100 
> 
>
> Key: SPARK-21807
> URL: https://issues.apache.org/jira/browse/SPARK-21807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: eaton
>Assignee: eaton
>
> The getAliasedConstraints function in LogicalPlan.scala will clone the 
> expression set each time an element is added,
> which takes a long time.
> Before the modification, the cost of getAliasedConstraints is:
> 100 expressions:  41 seconds
> 150 expressions:  466 seconds
> The test is like this:
> test("getAliasedConstraints") {
> val expressionNum = 150
> val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), 
> s"cnt$i")())
> val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())
> val beginTime = System.currentTimeMillis()
> val expressions = aggPlan.validConstraints
> println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms")
> // The size of Aliased expression is n * (n - 1) / 2 + n
> assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + 
> expressionNum)
> }



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-08-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138607#comment-16138607
 ] 

Reynold Xin commented on SPARK-15689:
-

Not the author but my guess is that the other approach wouldn't be stable,
neither source nor binary. It relies on internal logical plans.

Note that it would be possible to incorporate the other approach in this as
well, if we add a function that takes in a logical plan and returns a
logical plan.


Wenchen, one thing I have been thinking is whether we should make such user
implementations of data sources immutable. That is, all methods (e.g. Push
filters) would return a new instance of the source. It would be safer. I
don't know if it is worth the complexity though.
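
(A rough sketch of the immutability idea; the class and method names are invented for illustration and are not part of the proposed API.)

{code}
// Hypothetical immutable reader: pushing filters returns a new instance
// instead of mutating shared state.
final case class MyScan(table: String, pushedFilters: Seq[String] = Nil) {
  def pushFilters(filters: Seq[String]): MyScan =
    copy(pushedFilters = pushedFilters ++ filters)
}

val base = MyScan("events")
val filtered = base.pushFilters(Seq("ts > '2017-08-01'"))
// `base` is untouched, so the planner can branch or retry without side effects.
{code}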




> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-08-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138602#comment-16138602
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

The email I got from CRAN is pasted below. There are three points there -- one 
about `attach`, one about the Description field and one about the vignettes.

{code}
Thanks, we see:


* checking R code for possible problems ... NOTE
Found the following calls to attach():
File 'SparkR/R/DataFrame.R':
  attach(newEnv, pos = pos, name = name, warn.conflicts = warn.conflicts)
See section 'Good practice' in '?attach'.

The Description field should not start with "The SparkR package". Simply start 
"Provides ".



and thenm

* checking re-building of vignette outputs ...
and nothing happens. Apparently you expect some installed Hadoop or SPark 
software for running the vignettes?

But there is no SystemRequeirements field?


Please fix and resubmit.
{code}

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138552#comment-16138552
 ] 

Marcelo Vanzin commented on SPARK-21817:


That HDFS change is only in Hadoop 3, which is not officially supported by 
Spark as of yet... in any case I have a patch for this internally that seems to 
be working with our tests:

{code}
+val perm = new FsPermission(FsAction.READ_WRITE, FsAction.NONE, 
FsAction.NONE)
 val locations = fs.getFileBlockLocations(f, 0, f.getLen)
 val lfs = new LocatedFileStatus(f.getLen, f.isDirectory, 
f.getReplication, f.getBlockSize,
-  f.getModificationTime, 0, null, null, null, null, f.getPath, 
locations)
+  f.getModificationTime, 0, perm, null, null, null, f.getPath, 
locations)
 if (f.isSymlink) {
{code}


> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and causes an NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138550#comment-16138550
 ] 

Marcelo Vanzin commented on SPARK-21819:


There are a few ways to control which Hadoop configuration Spark will use:

- the HADOOP_CONF_DIR env variable (or YARN_CONF_DIR)
- setting {{spark.hadoop.=value}} in Spark's config
- calling {{Configuration.addDefaultResource}} before instantiating a 
SparkContext

Any other method is not supported, because there's no API in Spark to provide 
your own custom-crafted configuration object.
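
(For example, the second option can be expressed entirely through SparkConf; a minimal sketch, where the security keys are standard Hadoop properties and the values are placeholders.)

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Every "spark.hadoop.*" entry is copied into the Hadoop Configuration that
// Spark builds, so it is in effect when UserGroupInformation is initialized.
val conf = new SparkConf()
  .setAppName("kerberized-app")
  .set("spark.hadoop.hadoop.security.authentication", "kerberos")
  .set("spark.hadoop.hadoop.security.authorization", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}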

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
> Attachments: yarnsparkutil.jpg
>
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> 

[jira] [Commented] (SPARK-15689) Data source API v2

2017-08-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138509#comment-16138509
 ] 

Andrew Ash commented on SPARK-15689:


Can the authors of this document add a section contrasting the approach with 
the one from https://issues.apache.org/jira/browse/SPARK-12449 ?  In that 
approach the data source receives an entire arbitrary plan rather than just 
parts of the plan.

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that the current data source API (v1) suffers from both problems 1 and 2. 
> It has a wide surface with dependencies on DataFrame/SQLContext, which ties 
> data source API compatibility to the upper-level API. It is also row-oriented 
> only and has to go through an expensive conversion from external to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21814) build spark current master can not use hive metadatamysql

2017-08-23 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138075#comment-16138075
 ] 

xinzhang edited comment on SPARK-21814 at 8/23/17 3:32 PM:
---

Thanks for your reply.
(I will delete this an hour later, or maybe later.)


was (Author: zhangxin0112zx):
Thanks your reply.
(I will del this one hour later)

> build spark current master can not use hive metadatamysql
> -
>
> Key: SPARK-21814
> URL: https://issues.apache.org/jira/browse/SPARK-21814
> Project: Spark
>  Issue Type: Question
>  Components: Build, SQL
>Affects Versions: 2.2.0
>Reporter: xinzhang
>
> Hi. I built the Spark (master) source code myself and the build succeeded.
> I used the command:
> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr 
> -Phive -Phive-thriftserver -Pyarn
> But when I used 'spark-sql' to connect to the metastore (I put my Hive 
> hive-site.xml into $SPARK_HOME/conf/), it does not seem to work. It always 
> connects using Derby (my hive-site.xml uses MySQL as the metastore DB).
> I cannot tell what the cause of the problem is.
> Is my build command right? If not, which command should I use to build the 
> project myself?
>  Any suggestions would be helpful.
> the spark source code's last commit is :
> [root@node3 spark]# git log
> commit be72b157ea13ea116c5178a9e41e37ae24090f72
> Author: gatorsmile 
> Date:   Tue Aug 22 17:54:39 2017 +0800
> [SPARK-21803][TEST] Remove the HiveDDLCommandSuite
> 
> ## What changes were proposed in this pull request?
> We do not have any Hive-specific parser. It does not make sense to keep a 
> parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. 
> This PR is to
> 
> ## How was this patch tested?
> N/A
> 
> Author: gatorsmile 
> 
> Closes #19015 from gatorsmile/combineDDL.
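
One way to check which catalog implementation the built distribution actually uses (a sketch only; run in spark-shell from the new build, and note the config lookup shown is an assumption about what is visible in the session's SparkConf):

{code:java}
// "hive" means the Hive-enabled code path is active for this session;
// "in-memory" means Hive support is not being used at all.
spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory")

// With Hive support enabled, this should list the databases from the metastore
// configured in $SPARK_HOME/conf/hive-site.xml rather than a local Derby one.
spark.sql("show databases").show()
{code}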



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2017-08-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138501#comment-16138501
 ] 

Andrew Ash commented on SPARK-12449:


Relevant slides: 
https://www.slideshare.net/SparkSummit/the-pushdown-of-everything-by-stephan-kessler-and-santiago-mola

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> filters and projections to be pushed down, pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow this kind of processing in the source.
> We propose adding a new interface {{CatalystSource}} that allows the 
> processing of arbitrary logical plans to be deferred to the data source. We 
> have already presented the details at Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.
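
For context, a minimal sketch of the v1-style pushdown the description refers to (the relation class below is hypothetical; only column pruning and filter pushdown reach the source):

{code:java}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation: Spark hands over only the required columns and the
// pushable filters; aggregates, joins, etc. stay in Spark, which is what the
// proposed CatalystSource interface aims to change.
class ExampleRelation(override val sqlContext: SQLContext,
                      override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // A real implementation would translate `requiredColumns` and `filters`
    // into the external engine's query language here.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}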



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Sun updated SPARK-21819:
--
Attachment: yarnsparkutil.jpg

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
> Attachments: yarnsparkutil.jpg
>
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at org.apache.hadoop.ipc.Client.call(Client.java:1426)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>  

[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138461#comment-16138461
 ] 

Keith Sun commented on SPARK-21819:
---

[~jerryshao], I have just attached the UGI update in SparkHadoopUtil (yarnsparkutil.jpg).

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
> Attachments: yarnsparkutil.jpg
>
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at 

[jira] [Commented] (SPARK-11248) Spark hivethriftserver is using the wrong user to while getting HDFS permissions

2017-08-23 Thread wuchang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138443#comment-16138443
 ] 

wuchang commented on SPARK-11248:
-

+1
I have hit exactly the same problem; my Spark version is 2.0.0. The Spark thrift 
server uses a different username from the one I am connecting with via beeline, 
and that causes a permission error.

> Spark hivethriftserver is using the wrong user to while getting HDFS 
> permissions
> 
>
> Key: SPARK-11248
> URL: https://issues.apache.org/jira/browse/SPARK-11248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 2.1.1, 2.2.0
>Reporter: Trystan Leftwich
>
> While running Spark as a HiveThriftServer via YARN, Spark will use the user 
> running the thrift server, rather than the user connecting via JDBC, to check 
> HDFS permissions.
> i.e.
> In HDFS the perms are
> rwx--   3 testuser testuser /user/testuser/table/testtable
> And i connect via beeline as user testuser
> beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p ''
> If i try to hit that table
> select count(*) from test_table;
> I get the following error
> Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch 
> table test_table. java.security.AccessControlException: Permission denied: 
> user=hive, access=READ, 
> inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> (state=,code=0)
> I have the following set in hive-site.xml, so it should be using the 
> correct user.
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> This works correctly in Hive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:24 PM:
---

[~hyukjin.kwon]: Sounds great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581 to be merged. 
Thanks a lot :)


was (Author: crkumaresh24):
[~hyukjin.kwon]: Sounds great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-23 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138408#comment-16138408
 ] 

zakaria hili commented on SPARK-21799:
--

[~Siddharth Murching], sorry about that.
I think the best solution is to check the cache status on the DataFrame instead 
of the RDD. I believe some contributors are already working on it; I can give a 
hand to resolve this issue if you want.
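
A minimal sketch of the suggested check (assuming a DataFrame {{df}}; {{df.storageLevel}} is available since Spark 2.1 per SPARK-16063):

{code:java}
import org.apache.spark.storage.StorageLevel

// Decide whether MLlib needs to persist the input based on the DataFrame's
// own storage level rather than df.rdd.getStorageLevel.
val handlePersistence = df.storageLevel == StorageLevel.NONE
{code}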


> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.getStorageLevel` with df.storageLevel` in 
> MLlib algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R commented on SPARK-21820:
--

[~hyukjin.kwon]: Sound great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:13 PM:
---

[~hyukjin.kwon]: Sounds great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)


was (Author: crkumaresh24):
[~hyukjin.kwon]: Sound great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138404#comment-16138404
 ] 

Keith Sun commented on SPARK-21819:
---

As the UGI is static and shared by all components, is it feasible to check the 
existing configuration in the UGI before overwriting it?
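
A minimal sketch of that idea (the guard condition is only illustrative, not the actual Spark code):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object UgiGuard {
  // Only replace the process-wide UGI configuration when no secure
  // configuration (e.g. Kerberos) has been applied yet.
  def setConfigurationIfUnset(conf: Configuration): Unit = {
    if (!UserGroupInformation.isSecurityEnabled()) {
      UserGroupInformation.setConfiguration(conf)
    }
  }
}
{code}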

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at 

[jira] [Updated] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21820:
-
Component/s: (was: Spark Core)
 SQL

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138379#comment-16138379
 ] 

Hyukjin Kwon commented on SPARK-21820:
--

I have investigated this newline handling a few times before. For example, 
https://github.com/apache/spark/pull/18304#discussion_r122142421.

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138365#comment-16138365
 ] 

Hyukjin Kwon commented on SPARK-21820:
--

I think the preferable format is {{format("csv")}} for the built-in Spark CSV 
reader. {{.format("com.databricks.spark.csv")}} basically refers to the 
third-party CSV library in the Databricks repository, which is not meant for 
Spark 2.x, although we had to make some changes within Spark to choose Spark's 
internal implementation in such cases, e.g., SPARK-20590. Let's avoid reporting 
JIRAs with {{"com.databricks.spark.csv"}} in the future to prevent confusion.

For {{multiLine}} in CSV, the newline depends on the OS, whereas the TEXT, JSON 
and CSV datasources by default handle newlines from both Windows and Linux via 
Hadoop's library, to my knowledge. I proposed a change for a configurable 
newline - https://github.com/apache/spark/pull/18581. I guess that will address 
this problem as well.
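
A minimal way to exercise the built-in reader with {{multiLine}} (a sketch only; the path is just the one from the report):

{code:java}
// Same options as in the report, but using the built-in CSV source.
val csvFile = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .load("/home/kumar/Desktop/windows_CRLF.csv")

csvFile.schema.fieldNames
{code}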

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138362#comment-16138362
 ] 

Sean Owen commented on SPARK-21820:
---

The code you're using isn't in Spark though. It's been migrated to Spark, but 
that's not what you're having trouble with.

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21172) EOFException reached end of stream in UnsafeRowSerializer

2017-08-23 Thread Lasantha Fernando (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138361#comment-16138361
 ] 

Lasantha Fernando commented on SPARK-21172:
---

I've encountered the same issue with Spark 2.1.1 as well. I am hitting this 
continuously for a particular job and cannot finish the job at all. I can get 
results if I split the job into smaller jobs and run them separately, but I 
need to run the whole job together. I've tried the following parameter 
modifications, without any luck (a sketch of how they were applied follows the 
list below).

* Change the {{spark.sql.shuffle.partitions}}
* Change {{spark.files.fetchTimeout}}
* Increase {{spark.shuffle.file.buffer}}
* Increase {{spark.unsafe.sorter.spill.reader.buffer.size}}
* Change {{spark.io.compression.codec}}
* Enable/disable {{spark.shuffle.compress}}
* Enable/disable {{spark.shuffle.spill.compress}}
* Enable/disable {{spark.file.transferTo}}
* Change the garbage collector to use parallel GC 
{{spark.executor.extraJavaOptions  -XX:ParallelGCThreads=4 -XX:+UseParallelGC}}
* Enable/disable off heap memory {{spark.memory.offHeap.enabled}}

I have also been trying some of the suggestions from the slide deck 
[here|https://www.slideshare.net/databricks/tuning-apache-spark-for-largescale-workloads-gaoxiang-liu-and-sital-kedia]
 and some suggestions from SPARK-4105, but none of them has resolved the issue 
so far.
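
A sketch of how these parameters were combined (the values below are only examples, not recommendations):

{code:java}
// Illustrative only: the parameters listed above expressed as a SparkConf.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "2000")
  .set("spark.files.fetchTimeout", "120s")
  .set("spark.shuffle.file.buffer", "1m")
  .set("spark.unsafe.sorter.spill.reader.buffer.size", "1m")
  .set("spark.io.compression.codec", "lz4")
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.spill.compress", "true")
  .set("spark.file.transferTo", "false")
  .set("spark.memory.offHeap.enabled", "false")
  .set("spark.executor.extraJavaOptions",
    "-XX:ParallelGCThreads=4 -XX:+UseParallelGC")
{code}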

> EOFException reached end of stream in UnsafeRowSerializer
> -
>
> Key: SPARK-21172
> URL: https://issues.apache.org/jira/browse/SPARK-21172
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.1
>Reporter: liupengcheng
>  Labels: shuffle
>
> A Spark SQL job failed because of the following exception. This seems like a 
> bug in the shuffle stage.
> The shuffle read size for a single task is tens of GB.
> {code}
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:264)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException: reached end of stream after reading 9034374 
> bytes; 1684891936 bytes expected
>   at 
> org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:259)
>   ... 8 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2017-08-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138347#comment-16138347
 ] 

Thomas Graves commented on SPARK-17321:
---

Yes, that sounds good. It wouldn't hurt to verify the second point: the NM 
should throw an exception on container launch because it cannot itself record 
the container start and thus cannot recover.

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0, 2.1.1
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed 
> some Spark applications failing randomly due to the YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use the other good 
> disks if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-08-23 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138340#comment-16138340
 ] 

Li Jin commented on SPARK-21190:


[~ueshin], thanks for the summary. +1 for this API.

The 0-parameter UDF is a bit confusing, though. Where does "size" come from?


> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:21 PM:
---

[~hyukjin.kwon]: Could you please help us here? This issue occurs after we 
moved to "multiLine" as "true".


was (Author: crkumaresh24):
[~hyukjin.kwon]: Could you please help us here ?This issue after we moved to 
"multiLine" as "true"

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:20 PM:
---

[~sowen] This is an issue with Spark's databricks CSV reading. I could not find 
a matching component in the JIRA filter. Could you please help me pick the 
proper component for this bug?


was (Author: crkumaresh24):
@Sean Owen: This is an issue with spark databricks-CSV reading. I could not 
find any such option in the filter. Could you please help me what could be the 
proper component for this bug ?

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> Attached is the CSV used in the commands below for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327
 ] 

Kumaresh C R commented on SPARK-21820:
--

@Sean Owen: This is an issue with Spark's databricks CSV reading. I could not 
find a matching component in the JIRA filter. Could you please help me pick the 
proper component for this bug?

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> Attached is the CSV used in the commands below for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324
 ] 

Kumaresh C R commented on SPARK-21820:
--

[~hyukjin.kwon]: Could you please help us here? This issue appeared after we 
moved to "multiLine" set to "true".

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> Attached is the CSV used in the commands below for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138322#comment-16138322
 ] 

Sean Owen commented on SPARK-21820:
---

You need to use the built-in Spark CSV support if you're reporting an issue 
with Spark.
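
For reference, a minimal sketch of the same read using the built-in CSV data 
source (shown here in PySpark for brevity; the file path and option values are 
simply the ones from the report, and {{spark}} is the session provided by the 
pyspark shell):

{code}
# Built-in CSV data source: "header", "inferSchema" and "multiLine" are regular
# options on spark.read, so no external CSV package is involved.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("multiLine", "true")
      .csv("/home/kumar/Desktop/windows_CRLF.csv"))

print(df.columns)
{code}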

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> Attached is the CSV used in the commands below for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-21820:
-
Attachment: windows_CRLF.csv

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> Attached is the CSV used in the commands below for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)
Kumaresh C R created SPARK-21820:


 Summary: csv option "multiLine" as "true" not parsing windows line 
feed (CR LF) properly
 Key: SPARK-21820
 URL: https://issues.apache.org/jira/browse/SPARK-21820
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Kumaresh C R


With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
multiLine=false, it parses properly. Could you please help here?

Attached is the CSV used in the commands below for your reference.

scala> val csvFile = 
spark.read.format("com.databricks.spark.csv").option("header", 
"true").option("inferSchema", "true").option("parserLib", 
"univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
string ... 1 more field]

scala> csvFile.schema.fieldNames
res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)

scala> val csvFile = 
spark.read.format("com.databricks.spark.csv").option("header", 
"true").option("inferSchema", "true").option("parserLib", 
"univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
string ... 1 more field]

scala> csvFile.schema.fieldNames
")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2017-08-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138320#comment-16138320
 ] 

Saisai Shao commented on SPARK-17321:
-

We're facing the same issue. I think the YARN shuffle service should work like 
this:

* If NM recovery is not enabled, then Spark would not persist data into 
leveldb. In that case the YARN shuffle service can still serve requests but 
loses the ability to recover, which is fine because a failure of the NM kills 
the containers as well as the applications.
* If NM recovery is enabled, then the user or YARN should guarantee that the 
recovery path is reliable, because the recovery path is also crucial for the 
NM itself to recover.

What do you think, [~tgraves]?

I'm currently working on the first item, avoiding persisting data into 
leveldb, to see whether this is a feasible solution.

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0, 2.1.1
>Reporter: yunjiong zhao
>
> We run spark on yarn, after enabled spark dynamic allocation, we notice some 
> spark application failed randomly due to YarnShuffleService.
> From log I found
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> Which caused by the first disk in yarn.nodemanager.local-dirs was broken.
> If we enabled spark.yarn.shuffle.stopOnFailure(SPARK-16505) we might lost 
> hundred nodes which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good 
> disks if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138308#comment-16138308
 ] 

Saisai Shao commented on SPARK-21819:
-

I'm not sure whether Spark exposes a user API to pass a {{Configuration}} to 
Spark on YARN.

One possible solution is to set Hadoop configurations via Spark conf keys with 
the "spark.hadoop." prefix; SparkHadoopUtil will pick them up and copy them 
into the Hadoop Configuration, which will then be honored by the YARN client.
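
A minimal PySpark sketch of that approach (hypothetical illustration only: the 
security keys below are examples, and the same "spark.hadoop." prefix applies 
when the SparkConf is built from Java or Scala):

{code}
from pyspark.sql import SparkSession

# Every "spark.hadoop.*" entry is copied by SparkHadoopUtil into the Hadoop
# Configuration it builds, which is the object UserGroupInformation is then
# configured with, so the YARN client sees these settings.
spark = (SparkSession.builder
         .master("yarn")
         .appName("kerberized-app")
         .config("spark.hadoop.hadoop.security.authentication", "kerberos")
         .config("spark.hadoop.hadoop.security.authorization", "true")
         .getOrCreate())
{code}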

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> 

[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138302#comment-16138302
 ] 

Sean Owen commented on SPARK-21819:
---

Would it suffice to add this configuration somehow after Spark initializes but 
before it is required?

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at 

[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138300#comment-16138300
 ] 

Saisai Shao commented on SPARK-21819:
-

I think the problem here is that the {{Configuration}} object created in the 
user code cannot be leveraged by {{YARN#client}}; {{YARN#client}} creates a 
{{Configuration}} object from the default configurations, so it isn't aware of 
the security settings and still issues RPCs without Kerberos.

This doesn't look like an issue in Spark; the way you're writing the code and 
submitting the application may not match what a normal Spark application is 
expected to do.

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> 

[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2017-08-23 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138299#comment-16138299
 ] 

Maciej Szymkiewicz commented on SPARK-12157:


[~felixcheung] IMHO it is not worth fixing. It doesn't look like a common 
problem and once you know it exists, it is trivial to address explicitly in the 
user code.
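
For example, a minimal workaround in user code (a sketch based on the argmax 
UDF from the description below) is to cast the numpy scalar to a plain Python 
int before returning it:

{code}
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Returning int(...) instead of a raw numpy.int64 avoids the pickling error,
# since the plain Python int matches the declared IntegerType.
argmax = F.udf(lambda x: int(np.argmax(x)), T.IntegerType())
{code}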

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138253#comment-16138253
 ] 

Ewan Higgs commented on SPARK-21817:


{quote}Can this be accomplished with a change that's still compatible with 
2.6?{quote}
Yes, I believe it should just work. The argument exists in the function call; 
{{InMemoryFileIndex}} is just passing {{null}} currently.

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in a {{null}} is no 
> longer adequate and hence causes a NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Sun updated SPARK-21819:
--
Description: 
When  submit job in Java or Scala code to ,the initialization of 
SparkHadoopUtil will trigger the configuration overwritten in UGI which may not 
be expected if the UGI has already been initialized by customized xmls which 
are not on the classpath (like the cfg4j , which could set conf from github 
code, a database etc). 

{code:java}
//it will overwrite the UGI conf which has already been initialized
class SparkHadoopUtil extends Logging {
  private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
  val conf: Configuration = newConfiguration(sparkConf)
  UserGroupInformation.setConfiguration(conf)
{code}

My scenario : My yarn cluster is kerberized, my configuration is set to use 
kerberos for hadoop security. While, after the initialzation of SparkHadoopUtil 
, the authentiationMethod in UGI is updated to "simple"(my xmls not on the 
classpath), which lead to the failure like below :

{code:java}
933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
SparkContext
Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.(SparkContext.scala:497)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
 SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client.call(Client.java:1426)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
... 22 more

{code}

Sample code :

{code:java}
Configuration hc = new  Configuration(false);
String yarnxml=String.format("%s/%s", 
ConfigLocation,"yarn-site.xml");
String corexml=String.format("%s/%s", 
ConfigLocation,"core-site.xml");
String hdfsxml=String.format("%s/%s", 
ConfigLocation,"hdfs-site.xml");
String 

[jira] [Updated] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Sun updated SPARK-21819:
--
Description: 
When  submit job in Java or Scala code to ,the initialization of 
SparkHadoopUtil will trigger the configuration overwritten in UGI which may not 
be expected if the UGI has already been initialized by customized xmls which 
are not on the classpath (like the cfg4j , which could set conf from github 
code, a database etc). 

{code:java}
//it will overwrite the UGI conf which has already been initialized
class SparkHadoopUtil extends Logging {
  private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
  val conf: Configuration = newConfiguration(sparkConf)
  UserGroupInformation.setConfiguration(conf)
{code}

My scenario : My yarn cluster is kerberized, my configuration is set to use 
kerberos for hadoop security. While, after the initialzation of SparkHadoopUtil 
, the authentiationMethod in UGI is updated to "simple"(my xmls not on the 
classpath), which lead to the failure like below :

{code:java}
933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
SparkContext
Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.(SparkContext.scala:497)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
 SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client.call(Client.java:1426)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
... 22 more

{code}

Sample code :
Configuration hc = new  Configuration(false);
String yarnxml=String.format("%s/%s", 
ConfigLocation,"yarn-site.xml");
String corexml=String.format("%s/%s", 
ConfigLocation,"core-site.xml");
String hdfsxml=String.format("%s/%s", 
ConfigLocation,"hdfs-site.xml");
String 

[jira] [Commented] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138223#comment-16138223
 ] 

Sean Owen commented on SPARK-21819:
---

Hm, can that be considered supported, though, if you've initialized it through 
some third-party library not known to Spark or Hadoop? I don't know the details 
of the mechanism you're talking about.

>  UserGroupInformation initialization in SparkHadoopUtilwill overwrite user 
> config
> -
>
> Key: SPARK-21819
> URL: https://issues.apache.org/jira/browse/SPARK-21819
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, YARN
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu14.04
> Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
> Cluster mode: Yarn client 
>Reporter: Keith Sun
>
> When  submit job in Java or Scala code to ,the initialization of 
> SparkHadoopUtil will trigger the configuration overwritten in UGI which may 
> not be expected if the UGI has already been initialized by customized xmls 
> which are not on the classpath (like the cfg4j , which could set conf from 
> github code, a database etc). 
> {code:java}
> //it will overwrite the UGI conf which has already been initialized
> class SparkHadoopUtil extends Logging {
>   private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
>   val conf: Configuration = newConfiguration(sparkConf)
>   UserGroupInformation.setConfiguration(conf)
> {code}
> My scenario : My yarn cluster is kerberized, my configuration is set to use 
> kerberos for hadoop security. While, after the initialzation of 
> SparkHadoopUtil , the authentiationMethod in UGI is updated to "simple"(my 
> xmls not on the classpath), which lead to the failure like below :
> {code:java}
> 933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
> SparkContext
> Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:497)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  SIMPLE authentication is 

[jira] [Updated] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Sun updated SPARK-21819:
--
Description: 
When  submit job in Java or Scala code to ,the initialization of 
SparkHadoopUtil will trigger the configuration overwritten in UGI which may not 
be expected if the UGI has already been initialized by customized xmls which 
are not on the classpath (like the cfg4j , which could set conf from github 
code, a database etc). 

{code:java}
//it will overwrite the UGI conf which has already been initialized
class SparkHadoopUtil extends Logging {
  private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
  val conf: Configuration = newConfiguration(sparkConf)
  UserGroupInformation.setConfiguration(conf)
{code}

My scenario : My yarn cluster is kerberized, my configuration is set to use 
kerberos for hadoop security. While, after the initialzation of SparkHadoopUtil 
, the authentiationMethod in UGI is updated to "simple"(my xmls not on the 
classpath), which lead to the failure like below :

{code:java}
933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
SparkContext
Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.(SparkContext.scala:497)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
 SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client.call(Client.java:1426)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
... 22 more

{code}


  was:
When  submit job in Java or Scala code to ,the initialization of 
SparkHadoopUtil will trigger the configuration overwritten in UGI which may not 
be expected if the UGI has already been initialized by customized xmls which 
are not on the classpath (like the cfg4j , which could set conf from github 
code, a database etc). 

{code:java}
//it will overwrite the UGI 

[jira] [Updated] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtilwill overwrite user config

2017-08-23 Thread Keith Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Sun updated SPARK-21819:
--
Environment: 
Ubuntu14.04
Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)
Cluster mode: Yarn client 

  was:
Ubuntu14.04
Spark2.10/2.11 (I checked the github of 2.20 , it exist there as well)


Description: 
When  submit job in Java or Scala code to ,the initialization of 
SparkHadoopUtil will trigger the configuration overwritten in UGI which may not 
be expected if the UGI has already been initialized by customized xmls which 
are not on the classpath (like the cfg4j , which could set conf from github 
code, a database etc). 

{code:java}
//it will overwrite the UGI conf which has already been initialized
class SparkHadoopUtil extends Logging {
  private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
  val conf: Configuration = newConfiguration(sparkConf)
  UserGroupInformation.setConfiguration(conf)
{code}

My scenario : My yarn cluster is kerberized, my configuration is set to use 
kerberos for hadoop security. While, after the initialzation of SparkHadoopUtil 
, the authentiationMethod in UGI is updated to "simple", which lead to the 
failure like below :

{code:java}
933453 [main] INFO  org.apache.spark.SparkContext  - Successfully stopped 
SparkContext
Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:209)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy16.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:154)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:60)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:153)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at SparkTest.SparkEAZDebug.main(SparkEAZDebug.java:84)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
 SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client.call(Client.java:1426)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getClusterMetrics(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
... 22 more

{code}
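
As a workaround sketch only (not the project's fix; the principal, keytab path, 
and settings below are placeholders and assumptions), the security settings can 
be mirrored into SparkConf through the spark.hadoop.* prefix so that the 
Configuration that SparkHadoopUtil builds, and pushes into UGI, still says 
"kerberos", and the custom login can be done from a keytab before the session 
is created:

{code:java}
// Hedged workaround sketch, not the project's fix. Mirroring the security
// settings through the "spark.hadoop." prefix keeps them in the Configuration
// that SparkHadoopUtil builds and pushes into UGI, so authentication is not
// flipped back to "simple". The principal and keytab path are placeholders.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val hadoopConf = new Configuration(false)
hadoopConf.set("hadoop.security.authentication", "kerberos")
hadoopConf.set("hadoop.security.authorization", "true")
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")

val sparkConf = new SparkConf()
  .setMaster("yarn")
  .set("spark.hadoop.hadoop.security.authentication", "kerberos")
  .set("spark.hadoop.hadoop.security.authorization", "true")

val spark = SparkSession.builder().config(sparkConf).getOrCreate()
{code}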


  was:
When submitting a job from Java or Scala code, the initialization of 
SparkHadoopUtil overwrites the configuration already set in UGI, which may not 
be expected if the UGI has 

[jira] [Created] (SPARK-21819) UserGroupInformation initialization in SparkHadoopUtil will overwrite user config

2017-08-23 Thread Keith Sun (JIRA)
Keith Sun created SPARK-21819:
-

 Summary:  UserGroupInformation initialization in 
SparkHadoopUtil will overwrite user config
 Key: SPARK-21819
 URL: https://issues.apache.org/jira/browse/SPARK-21819
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 2.1.1, 2.1.0
 Environment: Ubuntu 14.04
Spark 2.1.0/2.1.1 (I checked the GitHub source of 2.2.0; it exists there as well)

Reporter: Keith Sun


When submitting a job from Java or Scala code, the initialization of 
SparkHadoopUtil overwrites the configuration already set in UGI, which may not 
be expected if the UGI has been initialized from customized XMLs that are not 
on the classpath (for example via cfg4j, which can load configuration from a 
GitHub repository, a database, etc.).

{code:java}
//it will overwrite the UGI conf which has already been initialized
class SparkHadoopUtil extends Logging {
  private val sparkConf = new SparkConf(false).loadFromSystemProperties(true)
  val conf: Configuration = newConfiguration(sparkConf)
  UserGroupInformation.setConfiguration(conf)
{code}








[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138220#comment-16138220
 ] 

Sean Owen commented on SPARK-21770:
---

No - it would be better to understand what outputs [0,0,0] to normalize in the 
first place. Right?

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector whose entries are all equal to 1/n.
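
For reference, a minimal sketch of the behavior the summary proposes, written 
over a plain Array[Double] rather than the actual MLlib vector types:

{code}
// Hedged sketch of the proposed handling, not MLlib code: an all-zero raw
// prediction normalizes to a uniform 1/n vector instead of dividing by zero.
def normalizeToProbabilities(raw: Array[Double]): Array[Double] = {
  val total = raw.sum
  if (total == 0.0) Array.fill(raw.length)(1.0 / raw.length)
  else raw.map(_ / total)
}

normalizeToProbabilities(Array(0.0, 0.0, 0.0))  // Array(0.333..., 0.333..., 0.333...)
{code}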






[jira] [Created] (SPARK-21818) MultivariateOnlineSummarizer.variance can generate a negative result

2017-08-23 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-21818:
--

 Summary: MultivariateOnlineSummarizer.variance can generate a negative result
 Key: SPARK-21818
 URL: https://issues.apache.org/jira/browse/SPARK-21818
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Weichen Xu


Because of numerical error, MultivariateOnlineSummarizer.variance can generate 
a negative variance.
This is a serious bug because many algorithms in MLlib use a stddev computed as 
sqrt(variance); a negative variance yields NaN and crashes the whole algorithm.

We can reproduce this bug with the following code:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

val summarizer1 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.7)
val summarizer2 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.4)
val summarizer3 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.5)
val summarizer4 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.4)

val summarizer = summarizer1
  .merge(summarizer2)
  .merge(summarizer3)
  .merge(summarizer4)

println(summarizer.variance(0))
{code}
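
One possible mitigation, sketched below (not the committed fix), is to clamp a 
slightly negative variance to zero before any downstream sqrt:

{code}
// Hedged sketch of a defensive guard, not the committed fix: a variance that
// is negative only through floating-point error is clamped to zero, so a
// downstream stddev never becomes NaN.
def safeStd(variance: Double): Double = math.sqrt(math.max(variance, 0.0))

safeStd(-1.0e-16)  // 0.0 instead of NaN
{code}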






[jira] [Updated] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-08-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21476:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>Priority: Minor
>
> I noticed significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and that is 
> actually used in transform, is predictRaw. Broadcasting is not being used in 
> predictRaw.
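
For illustration, a generic sketch of the broadcast pattern the report asks 
for; the weight array and toy scoring below are stand-ins, not the MLlib 
internals:

{code}
// Hedged, generic sketch of the broadcast pattern; the weight array and toy
// scoring function are stand-ins for the real model state. Broadcasting ships
// the state to each executor once instead of serializing it with every task.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()

val weights: Array[Double] = Array.fill(2)(0.1)          // stand-in for large model state
val weightsBc = spark.sparkContext.broadcast(weights)

val data = spark.sparkContext.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))
val scores = data.map { features =>
  val w = weightsBc.value                                // read the broadcast copy
  features.zip(w).map { case (x, wi) => x * wi }.sum     // toy raw prediction
}
scores.collect()
{code}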






[jira] [Updated] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21817:
--
 Priority: Minor  (was: Major)
Fix Version/s: (was: 2.3.0)
   Issue Type: Improvement  (was: Bug)

That seems to only affect Hadoop 3, which isn't even released yet. Can this be 
accomplished with a change that's still compatible with 2.6?

We use PRs, not patches.
Please see http://spark.apache.org/contributing.html

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to 
> pull out the ACL and other information. Therefore, passing in {{null}} is no 
> longer adequate and causes an NPE when listing files.
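
As a small illustration only (an assumption about the shape of the change, not 
the attached patch), a concrete FsPermission can be supplied where the index 
currently passes null:

{code}
// Hedged illustration, not the attached patch: supply a concrete permission
// object instead of null, so the HDFS-6984 code path that now reads it has
// something to work with.
import org.apache.hadoop.fs.permission.FsPermission

val filePermission: FsPermission = FsPermission.getFileDefault  // instead of null
val dirPermission: FsPermission = FsPermission.getDirDefault
{code}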






[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-08-23 Thread Saurabh Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138214#comment-16138214
 ] 

Saurabh Agrawal edited comment on SPARK-21476 at 8/23/17 10:44 AM:
---

[~peng.m...@intel.com] Under what circumstances will using broadcast hamper the 
performance? Is there an intuitive explanation why it might affect the 
performance negatively? 


was (Author: sagraw):
[~peng.m...@intel.com] Under what circumstances will using broadcast hamper the 
performance? 

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I noticed significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and that is 
> actually used in transform, is predictRaw. Broadcasting is not being used in 
> predictRaw.






[jira] [Updated] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Higgs updated SPARK-21817:
---
Attachment: SPARK-21817.001.patch

Attaching a simple fix so that file listing no longer NPEs on Hadoop head (trunk).

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
> Fix For: 2.3.0
>
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to 
> pull out the ACL and other information. Therefore, passing in {{null}} is no 
> longer adequate and causes an NPE when listing files.






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-08-23 Thread Saurabh Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138214#comment-16138214
 ] 

Saurabh Agrawal commented on SPARK-21476:
-

[~peng.m...@intel.com] Under what circumstances will using broadcast hamper the 
performance? 

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I noticed significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and that is 
> actually used in transform, is predictRaw. Broadcasting is not being used in 
> predictRaw.





