[jira] [Created] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming

2017-11-21 Thread Eyal Farago (JIRA)
Eyal Farago created SPARK-22579:
---

 Summary: BlockManager.getRemoteValues and 
BlockManager.getRemoteBytes should be implemented using streaming
 Key: SPARK-22579
 URL: https://issues.apache.org/jira/browse/SPARK-22579
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 2.1.0
Reporter: Eyal Farago


When an RDD partition is cached on one executor but the task requiring it is 
running on another executor (process locality ANY), the cached partition is 
fetched via BlockManager.getRemoteValues, which delegates to 
BlockManager.getRemoteBytes; both calls are blocking.
In my use case I had a 700GB RDD spread over 1000 partitions on a 6-node 
cluster, cached to disk. Rough math shows an average partition size of 700MB.
Looking at the Spark UI it was obvious that tasks running with process locality 
'ANY' were much slower than local tasks (roughly 40 seconds vs. 8-10 minutes). I 
was able to capture thread dumps of executors executing remote tasks and got 
this stack trace:

{quote}
Thread ID: 1521
Thread Name: Executor task launch worker-1000
Thread State: WAITING
Thread Locks: Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978})
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:190)
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582)
org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550)
org.apache.spark.storage.BlockManager.get(BlockManager.scala:638)
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690)
org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote}

Digging into the code showed that the block manager first fetches all of the 
bytes (getRemoteBytes) and only then wraps them with a deserialization stream. 
This has several drawbacks:
1. Blocking: the requesting executor is blocked while the remote executor is 
serving the block.
2. Potentially large memory footprint on the requesting executor; in my use case 
700MB of raw bytes stored in a ChunkedByteBuffer.
3. Inefficient: the requesting side usually doesn't need all values at once, as 
it consumes the values via an iterator.
4. Potentially large memory footprint on the serving executor: if the block is 
cached in deserialized form, the serving executor has to serialize it into a 
ChunkedByteBuffer (BlockManager.doGetLocalBytes). This is both memory and CPU 
intensive; the memory footprint could be reduced by using a limited buffer for 
serialization, 'spilling' to the response stream.

I suggest improving this by implementing either a full streaming mechanism or 
some kind of pagination. In addition, the requesting executor should be able to 
make progress with the data it already has, blocking only when its local buffer 
is exhausted and the remote side hasn't delivered the next chunk of the stream 
(or page, in the pagination case) yet.
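
For illustration, a minimal Scala sketch of what the consuming side of such a 
streaming fetch could look like. The openRemoteBlockStream parameter is 
hypothetical (no such BlockManager API exists today), and java.io.ObjectInputStream 
merely stands in for Spark's serializer:

{code:scala}
import java.io.{EOFException, InputStream, ObjectInputStream}

// Hypothetical sketch: openRemoteBlockStream would return an InputStream backed by
// the network transfer, so the requesting executor only blocks when its local
// buffer is exhausted and the next chunk has not arrived yet.
def remoteValuesIterator[T](openRemoteBlockStream: () => InputStream): Iterator[T] = {
  // ObjectInputStream is a stand-in for Spark's SerializerInstance.deserializeStream
  val in = new ObjectInputStream(openRemoteBlockStream())
  new Iterator[T] {
    private var buffered: Option[T] = read()
    private def read(): Option[T] =
      try Some(in.readObject().asInstanceOf[T])
      catch { case _: EOFException => in.close(); None }
    override def hasNext: Boolean = buffered.isDefined
    override def next(): T = {
      val value = buffered.get
      buffered = read() // may block, but only for the next chunk, not the whole block
      value
    }
  }
}
{code}

The point is only that values become available one at a time, so a task can 
start consuming the iterator before the whole remote partition has been 
transferred.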



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-21 Thread Ran Mingxuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Mingxuan updated SPARK-22560:
-
Affects Version/s: 2.2.0

> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a java project I have to use both JavaSparkContext  and SparkSession. I 
> find the order to create them affect hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code spark job will not be able to find hive meta-store even if 
> it can discover correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-21 Thread Ran Mingxuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262093#comment-16262093
 ] 

Ran Mingxuan edited comment on SPARK-22560 at 11/22/17 7:47 AM:


I think the problem is introduced by org.apache.spark.sql.internal.SessionState, 
which is instantiated based on the SparkContext's conf. That means 
`enableHiveSupport()` won't take effect if the SparkSession is built from an 
existing SparkContext.


{code:scala}
// the option set by enableHiveSupport() is not visible here: the session state
// class is chosen from sparkContext.conf, not from the builder's options
  @InterfaceStability.Unstable
  @transient
  lazy val sessionState: SessionState = {
    parentSessionState
      .map(_.clone(this))
      .getOrElse {
        val state = SparkSession.instantiateSessionState(
          SparkSession.sessionStateClassName(sparkContext.conf),
          self)
        initialSessionOptions.foreach { case (k, v) => state.conf.setConfString(k, v) }
        state
      }
  }
{code}
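
A possible workaround, assuming the analysis above is correct: since 
`enableHiveSupport()` only sets spark.sql.catalogImplementation=hive, putting 
that setting on the SparkConf before the SparkContext is created should make the 
Hive session state visible. A minimal Scala sketch (not the reporter's Java code):

{code:scala}
// Sketch of a possible workaround, assuming SessionState is chosen from
// sparkContext.conf: carry the Hive catalog setting on the conf up front.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .setAppName("testApp")
  .set("spark.sql.catalogImplementation", "hive") // what enableHiveSupport() sets
val sc = new SparkContext(sparkConf)

// getOrCreate() reuses the already-running SparkContext, whose conf now carries
// the Hive catalog implementation, so the Hive-aware SessionState is selected.
val spark = SparkSession.builder().getOrCreate()
spark.sql("show databases").show()
{code}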



was (Author: ranmx):
I think the problem is introduced by org.apache.spark.sql.internal.SessionState 
which will take the option of sparkcontext. That means the 
`enableHiveSupport()` won't work in this case if spark session is built from 
spark context.

> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a java project I have to use both JavaSparkContext  and SparkSession. I 
> find the order to create them affect hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code spark job will not be able to find hive meta-store even if 
> it can discover correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-21 Thread Ran Mingxuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262093#comment-16262093
 ] 

Ran Mingxuan commented on SPARK-22560:
--

I think the problem is introduced by org.apache.spark.sql.internal.SessionState, 
which is instantiated based on the SparkContext's conf. That means 
`enableHiveSupport()` won't take effect if the SparkSession is built from an 
existing SparkContext.

> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a java project I have to use both JavaSparkContext  and SparkSession. I 
> find the order to create them affect hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code spark job will not be able to find hive meta-store even if 
> it can discover correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22578) CSV with quoted line breaks not correctly parsed

2017-11-21 Thread Carlos Barahona (JIRA)
Carlos Barahona created SPARK-22578:
---

 Summary: CSV with quoted line breaks not correctly parsed
 Key: SPARK-22578
 URL: https://issues.apache.org/jira/browse/SPARK-22578
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Carlos Barahona


I believe the behavior addressed in SPARK-19610 still exists. Using Spark 2.2.0, 
when attempting to read a CSV file containing a quoted newline, the resulting 
dataset contains two separate rows split along the quoted newline.

Example text:


{code:java}
4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,"Hello

World", 2,3,4,5
{code}

{code}
scala> val csvFile = spark.read.csv("file:///path")
csvFile: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 7 more fields]

scala> csvFile.first()
res2: org.apache.spark.sql.Row = [4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello]

scala> csvFile.count()
res3: Long = 2
{code}
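
A hedged note: in Spark 2.2 the quoted line-break handling from SPARK-19610 is 
only active when the multiLine option is enabled (it defaults to false), so the 
result above may simply be the line-by-line parser at work. A quick check:

{code:scala}
// Sketch: multiLine defaults to false in Spark 2.2; enabling it should keep the
// quoted newline inside a single row. "file:///path" is the reporter's placeholder.
val csvFile = spark.read
  .option("multiLine", true)
  .csv("file:///path")
csvFile.count()   // expected: 1 for the sample record above
{code}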





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22543) fix java 64kb compile error for deeply nested expressions

2017-11-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22543:

Target Version/s: 2.3.0  (was: 2.2.1, 2.3.0)

> fix java 64kb compile error for deeply nested expressions
> -
>
> Key: SPARK-22543
> URL: https://issues.apache.org/jira/browse/SPARK-22543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22548) Incorrect nested AND expression pushed down to JDBC data source

2017-11-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22548.
-
   Resolution: Fixed
 Assignee: Jia Li
Fix Version/s: 2.3.0
   2.2.1
   2.1.3

> Incorrect nested AND expression pushed down to JDBC data source
> ---
>
> Key: SPARK-22548
> URL: https://issues.apache.org/jira/browse/SPARK-22548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jia Li
>Assignee: Jia Li
> Fix For: 2.1.3, 2.2.1, 2.3.0
>
>
> Let's say I have a JDBC data source table 'foobar' with 3 rows:
> NAME               THEID
> =================  =====
> fred               1
> mary               2
> joe 'foo' "bar"    3
> This query returns an incorrect result:
> SELECT * FROM foobar WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred')
> It's supposed to return:
> fred               1
> mary               2
> But it returns:
> fred               1
> mary               2
> joe 'foo' "bar"    3
> This is because one leg of the nested AND predicate, TRIM(NAME) = 'mary', cannot 
> be pushed down but is lost during the JDBC push-down filter translation. The same 
> translation method is also called by Data Source V2. I have a fix for this issue 
> and will open a PR.
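
To make the failure mode concrete, here is a small self-contained Scala sketch 
of the translation rule the description implies; Pred, Leaf, and compile are 
made up for illustration and are not the actual Spark classes:

{code:scala}
// Illustrative sketch only (not the actual Spark translation code): an AND whose
// leg cannot be compiled must make the whole AND (and any enclosing OR)
// non-pushable, otherwise the partial fragment matches extra rows as shown above.
sealed trait Pred
case class Leaf(sql: Option[String]) extends Pred // None = not translatable (e.g. TRIM(NAME) = 'mary')
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

def compile(p: Pred): Option[String] = p match {
  case Leaf(sql) => sql
  case And(l, r) => for (lc <- compile(l); rc <- compile(r)) yield s"($lc) AND ($rc)"
  case Or(l, r)  => for (lc <- compile(l); rc <- compile(r)) yield s"($lc) OR ($rc)"
}

// The predicate from the description: (THEID > 0 AND TRIM(NAME) = 'mary') OR NAME = 'fred'
val query = Or(And(Leaf(Some("THEID > 0")), Leaf(None)), Leaf(Some("NAME = 'fred'")))
assert(compile(query).isEmpty) // nothing is pushed down; Spark evaluates the filter itself
{code}

Dropping the untranslatable leg instead (keeping only the compilable side of the 
AND) would yield "THEID > 0 OR NAME = 'fred'", which matches the third row above.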



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-21 Thread Ran Mingxuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Mingxuan updated SPARK-22560:
-
Description: 
In a Java project I have to use both JavaSparkContext and SparkSession, and I find 
that the order in which they are created affects the Hive connection.
I have built a spark job like below:

{code:java}
// wrong code
public void main(String[] args)
{
SparkConf sparkConf = new SparkConf().setAppName("testApp");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
SparkSession spark = 
SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
spark.sql("show databases").show();
}
{code}

With this code the spark job is not able to find the Hive metastore, even though 
it can discover the correct warehouse.

I have to use code like below to make things work:

{code:java}
// correct code 
public String main(String[] args)
{
SparkConf sparkConf = new SparkConf().setAppName("testApp");
SparkSession spark = 
SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
SparkContext sparkContext = spark.sparkContext();
JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
spark.sql("show databases").show();
}
{code}





  was:
In a java project I have to use both JavaSparkContext  and SparkSession. I find 
the order to create them affect hive connection.
I have built a spark job like below:

{code:java}
// wrong code
public void main(String[] args)
{
SparkConf sparkConf = new SparkConf().setAppName("testApp");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs",
 false);
sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", 
false);//
SparkSession spark = 
SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
spark.sql("show databases").show();
}
{code}

and with this code spark job will not be able to find hive meta-store even if 
it can discover correct warehouse.

I have to use code like below to make things work:

{code:java}
// correct code 
public String main(String[] args)
{
SparkConf sparkConf = new SparkConf().setAppName("testApp");
SparkSession spark = 
SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
SparkContext sparkContext = spark.sparkContext();
JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);

sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs",
 false);
sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", 
false);
spark.sql("show databases").show();
}
{code}






> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a java project I have to use both JavaSparkContext  and SparkSession. I 
> find the order to create them affect hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code spark job will not be able to find hive meta-store even if 
> it can discover correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-21 Thread Ran Mingxuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261887#comment-16261887
 ] 

Ran Mingxuan commented on SPARK-22560:
--

[~srowen] Thank you for your reply. Maybe I should declare that the hadoop 
configurations have no effect on the result and move them away from the code.

> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a java project I have to use both JavaSparkContext  and SparkSession. I 
> find the order to create them affect hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> 
> sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs",
>  false);
> sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", 
> false);//
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code spark job will not be able to find hive meta-store even if 
> it can discover correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> 
> sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs",
>  false);
> sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", 
> false);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22374) STS ran into OOM in a secure cluster

2017-11-21 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261875#comment-16261875
 ] 

Dongjoon Hyun commented on SPARK-22374:
---

Thank you for advice. I'll consider that.

> STS ran into OOM in a secure cluster
> 
>
> Key: SPARK-22374
> URL: https://issues.apache.org/jira/browse/SPARK-22374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
> Attachments: 1.png, 2.png, 3.png
>
>
> In a secure cluster, FileSystem.CACHE grows indefinitely.
> *ENVIRONMENT*
> 1. `spark.yarn.principal` and `spark.yarn.keytab` is used.
> 2. Spark Thrift Server run with `doAs` false.
> {code}
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>false</value>
> </property>
> {code}
> With 6GB (-Xmx6144m) options, `HiveConf` consumes 4GB inside FileSystem.CACHE.
> {code}
> 20,030 instances of "org.apache.hadoop.hive.conf.HiveConf", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x64001c160" occupy 4,418,101,352 
> (73.42%) bytes. These instances are referenced from one instance of 
> "java.util.HashMap$Node[]", loaded by ""
> {code}
> Please see the attached images.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or constant pool entry limits

2017-11-21 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22510:
-
Description: 
Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
pool entry limit.


  was:
Codegen can throw an exception due to the 64KB JVM bytecode limit.



> Exceptions caused by 64KB JVM bytecode or constant pool entry limits 
> -
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2017-11-21 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22510:
-
Summary: Exceptions caused by 64KB JVM bytecode or 64K constant pool entry 
limit   (was: Exceptions caused by 64KB JVM bytecode or constant pool entry 
limits )

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or constant pool entry limits

2017-11-21 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22510:
-
Summary: Exceptions caused by 64KB JVM bytecode or constant pool entry 
limits   (was: Exceptions caused by 64KB JVM bytecode limit )

> Exceptions caused by 64KB JVM bytecode or constant pool entry limits 
> -
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> Codegen can throw an exception due to the 64KB JVM bytecode limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2017-11-21 Thread Oz Ben-Ami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261808#comment-16261808
 ] 

Oz Ben-Ami commented on SPARK-22575:


We are not explicitly caching any tables, only running SQL queries. Is there a 
way to manually remove tables or partitions from cache?
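
As a hedged aside (this clears only the in-memory relation cache and may be 
unrelated to the on-disk appcache growth described in this issue), cached tables 
and views can be dropped manually via the catalog API or SQL; "some_table" below 
is just a placeholder:

{code:scala}
// Sketch: manual ways to drop cached relations ("some_table" is a placeholder).
spark.catalog.uncacheTable("some_table")   // drop a single cached table or view
spark.catalog.clearCache()                 // drop everything from the relation cache
spark.sql("CLEAR CACHE")                   // SQL equivalent of clearCache()
{code}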

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261557#comment-16261557
 ] 

Marco Gaido commented on SPARK-22575:
-

does it happen because you are caching some tables and never uncaching them?

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22500) 64KB JVM bytecode limit problem with cast

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22500.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 19730
[https://github.com/apache/spark/pull/19730]

> 64KB JVM bytecode limit problem with cast
> -
>
> Key: SPARK-22500
> URL: https://issues.apache.org/jira/browse/SPARK-22500
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{Cast}} can throw an exception due to the 64KB JVM bytecode limit when they 
> use with a lot of structure fields



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22500) 64KB JVM bytecode limit problem with cast

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22500:
---

Assignee: Kazuaki Ishizaki

> 64KB JVM bytecode limit problem with cast
> -
>
> Key: SPARK-22500
> URL: https://issues.apache.org/jira/browse/SPARK-22500
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{Cast}} can throw an exception due to the 64KB JVM bytecode limit when they 
> use with a lot of structure fields



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22374) STS ran into OOM in a secure cluster

2017-11-21 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261444#comment-16261444
 ] 

Steve Loughran commented on SPARK-22374:


We need to do something about this; it is dangerously recurrent. Even though 
the fix is {{closeAllForUGI}}, we should be able to track it better and start 
warning early and meaningfully. For example:

# record timestamps of creation
# track total cache size as an instrumented value
# start warning when it gets big
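
For reference, a minimal sketch of the {{closeAllForUGI}} cleanup mentioned 
above, assuming the per-session UserGroupInformation is available when a Thrift 
session ends; the onSessionClosed hook is illustrative, not an actual STS API:

{code:scala}
// Sketch only: closeAllForUGI is a real Hadoop API, but the surrounding hook
// (onSessionClosed) is hypothetical and just marks where the call would belong.
import java.io.IOException
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

def onSessionClosed(sessionUgi: UserGroupInformation): Unit = {
  try {
    // Drops every FileSystem instance cached for this UGI, so FileSystem.CACHE
    // (and the configurations those FileSystem objects retain) cannot grow unbounded.
    FileSystem.closeAllForUGI(sessionUgi)
  } catch {
    case _: IOException => () // best-effort cleanup; log and continue
  }
}
{code}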

> STS ran into OOM in a secure cluster
> 
>
> Key: SPARK-22374
> URL: https://issues.apache.org/jira/browse/SPARK-22374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
> Attachments: 1.png, 2.png, 3.png
>
>
> In a secure cluster, FileSystem.CACHE grows indefinitely.
> *ENVIRONMENT*
> 1. `spark.yarn.principal` and `spark.yarn.keytab` is used.
> 2. Spark Thrift Server run with `doAs` false.
> {code}
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>false</value>
> </property>
> {code}
> With 6GB (-Xmx6144m) options, `HiveConf` consumes 4GB inside FileSystem.CACHE.
> {code}
> 20,030 instances of "org.apache.hadoop.hive.conf.HiveConf", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x64001c160" occupy 4,418,101,352 
> (73.42%) bytes. These instances are referenced from one instance of 
> "java.util.HashMap$Node[]", loaded by ""
> {code}
> Please see the attached images.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22475) show histogram in DESC COLUMN command

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22475:
---

Assignee: Marco Gaido

> show histogram in DESC COLUMN command
> -
>
> Key: SPARK-22475
> URL: https://issues.apache.org/jira/browse/SPARK-22475
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22475) show histogram in DESC COLUMN command

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22475.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19774
[https://github.com/apache/spark/pull/19774]

> show histogram in DESC COLUMN command
> -
>
> Key: SPARK-22475
> URL: https://issues.apache.org/jira/browse/SPARK-22475
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative

2017-11-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22576.
---
Resolution: Not A Problem

Yes, there's nothing about HiveQL (which Spark follows) or Spark SQL that 
suggests -1 is a valid start position. It is 1-based, even. I don't even see 
that engines like MySQL support it. 
https://www.w3schools.com/sql/func_mysql_locate.asp


> Spark SQL locate returns incorrect value when the start position is negative
> 
>
> Key: SPARK-22576
> URL: https://issues.apache.org/jira/browse/SPARK-22576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22165) Type conflicts between dates, timestamps and date in partition column

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22165.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19389
[https://github.com/apache/spark/pull/19389]

> Type conflicts between dates, timestamps and date in partition column
> -
>
> Key: SPARK-22165
> URL: https://issues.apache.org/jira/browse/SPARK-22165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks we have some bugs when resolving type conflicts in partition column. 
> I found few corner cases as below:
> Case 1: timestamp should be inferred but date type is inferred.
> {code}
> val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
> df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
> spark.read.load("/tmp/foo").printSchema()
> {code}
> {code}
> root
>  |-- i: integer (nullable = true)
>  |-- ts: date (nullable = true)
> {code}
> Case 2: decimal should be inferred but integer is inferred.
> {code}
> val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
> df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
> spark.read.load("/tmp/bar").printSchema()
> {code}
> {code}
> root
>  |-- i: integer (nullable = true)
>  |-- decimal: integer (nullable = true)
> {code}
> Looks we should de-duplicate type resolution logic if possible rather than 
> separate numeric precedence-like comparison alone.
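
As a hedged aside, for the affected versions the usual ways around these 
inference corner cases are an explicit schema or disabling partition-column type 
inference; a sketch using the example path from the description:

{code:scala}
// Sketch: two workarounds for the inference corner cases above.
import org.apache.spark.sql.types._

// 1) Give the partition column an explicit type instead of relying on inference.
val schema = new StructType()
  .add("i", IntegerType)
  .add("ts", TimestampType)
spark.read.schema(schema).load("/tmp/foo").printSchema()

// 2) Or disable partition-column type inference; partition values stay strings.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.load("/tmp/foo").printSchema()
{code}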



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22165) Type conflicts between dates, timestamps and date in partition column

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22165:
---

Assignee: Hyukjin Kwon

> Type conflicts between dates, timestamps and date in partition column
> -
>
> Key: SPARK-22165
> URL: https://issues.apache.org/jira/browse/SPARK-22165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks we have some bugs when resolving type conflicts in partition column. 
> I found few corner cases as below:
> Case 1: timestamp should be inferred but date type is inferred.
> {code}
> val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
> df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
> spark.read.load("/tmp/foo").printSchema()
> {code}
> {code}
> root
>  |-- i: integer (nullable = true)
>  |-- ts: date (nullable = true)
> {code}
> Case 2: decimal should be inferred but integer is inferred.
> {code}
> val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
> df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
> spark.read.load("/tmp/bar").printSchema()
> {code}
> {code}
> root
>  |-- i: integer (nullable = true)
>  |-- decimal: integer (nullable = true)
> {code}
> Looks we should de-duplicate type resolution logic if possible rather than 
> separate numeric precedence-like comparison alone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF

2017-11-21 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261393#comment-16261393
 ] 

Bago Amirbekian commented on SPARK-20586:
-

Also, a follow-up question: are there performance implications to using the 
`deterministic` flag that we should try to avoid by restructuring the ML UDFs so 
that they don't raise inside the UDF?

> Add deterministic to ScalaUDF
> -
>
> Key: SPARK-20586
> URL: https://issues.apache.org/jira/browse/SPARK-20586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html
> Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF 
> too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we 
> only add the following two flags. 
> - deterministic: Certain optimizations should not be applied if UDF is not 
> deterministic. Deterministic UDF returns same result each time it is invoked 
> with a particular input. This determinism just needs to hold within the 
> context of a query.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF

2017-11-21 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261377#comment-16261377
 ] 

Bago Amirbekian commented on SPARK-20586:
-

Is there some documentation somewhere about the right way to use this 
`deterministic` flag?

I bring it up because in `spark.ml` we sometimes raise errors in a UDF when a 
DataFrame contains invalid data. This can cause bad behavior if the optimizer 
re-orders the UDF with other operations, so we're marking these UDFs as 
`nonDeterministic` (https://github.com/apache/spark/pull/19662), but that 
somehow seems wrong. The issue isn't that the UDFs are non-deterministic; they 
are deterministic and always raise on the same inputs.

I guess my question is
1) Is this correct usage of the non-deterministic flag or are we simply abusing 
it when we should come up with a more specific solution?
2) If this is correct usage could we rename the flag or document this type of 
usage somewhere?
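
For context, a hedged sketch of what that marking looks like with the Dataset 
API's asNondeterministic marker; validateLabel is a made-up example, not an 
actual spark.ml UDF:

{code:scala}
// Sketch: a UDF that raises on invalid input, marked non-deterministic so the
// optimizer will not re-order it past operations that are meant to run first.
import org.apache.spark.sql.functions.udf

val validateLabel = udf { label: Double =>
  require(!label.isNaN, s"Invalid label: $label")
  label
}.asNondeterministic()

// usage (hypothetical df): df.withColumn("label", validateLabel(df("label")))
{code}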

> Add deterministic to ScalaUDF
> -
>
> Key: SPARK-20586
> URL: https://issues.apache.org/jira/browse/SPARK-20586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html
> Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF 
> too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we 
> only add the following two flags. 
> - deterministic: Certain optimizations should not be applied if UDF is not 
> deterministic. Deterministic UDF returns same result each time it is invoked 
> with a particular input. This determinism just needs to hold within the 
> context of a query.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261347#comment-16261347
 ] 

Marco Gaido commented on SPARK-22576:
-

I see, but this is SAP Sybase. Why do you think Spark should behave like that? 
Also, Hive behaves like Spark, so I don't see why Spark would be supposed to 
work differently.

> Spark SQL locate returns incorrect value when the start position is negative
> 
>
> Key: SPARK-22576
> URL: https://issues.apache.org/jira/browse/SPARK-22576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative

2017-11-21 Thread Yuxin Cao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261313#comment-16261313
 ] 

Yuxin Cao commented on SPARK-22576:
---

http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1600/doc/html/san1278453073929.html

The character position at which to begin the search in the string. The first 
character is position 1. If the starting offset is negative, LOCATE returns the 
last matching string offset, rather than the first. A negative offset indicates 
how much of the end of the string to exclude from the search. The number of 
bytes excluded is calculated as ( -1 * offset ) - 1.

> Spark SQL locate returns incorrect value when the start position is negative
> 
>
> Key: SPARK-22576
> URL: https://issues.apache.org/jira/browse/SPARK-22576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting

2017-11-21 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-22577:
-

 Summary: executor page blacklist status should update with TaskSet 
level blacklisting
 Key: SPARK-22577
 URL: https://issues.apache.org/jira/browse/SPARK-22577
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.1
Reporter: Thomas Graves


Right now the executor blacklist status is only updated from the BlacklistTracker 
after a task set has finished and the blacklisting has been propagated to the 
application level. We should change that so blacklisting also shows at the task 
set level. Without this it can be very confusing for the user to see why things 
aren't running.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-21 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261292#comment-16261292
 ] 

Steve Loughran commented on SPARK-22526:


S3a uses the AWS S3 client, which uses httpclient inside. S3AInputStream 
absolutely closes that stream on close(), it does a lot of work to use abort 
when it's considered more expensive to read to the end of the GET.  I wouldn't 
jump to blame the httpclient.

However, AWS S3 does http connection recycling pooling, set in 
{{fs.s3a.connection.maximum}}; default is 15. There might be some blocking 
waiting for free connections, so try a larger value. 
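
For reference, a sketch of raising that limit from a Spark job (the value 100 is 
just an example):

{code:scala}
// Sketch: raise the S3A connection pool size. This can equivalently be passed as
// --conf spark.hadoop.fs.s3a.connection.maximum=100 on spark-submit.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.maximum", "100")
val files = spark.sparkContext.binaryFiles("s3a://test/test123.zip")
{code}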

Now, what's needed to track things down? You get to do it yourself, at least 
for now, as you are the only person reporting this.

I'd go for
* getting the thread dump as things block & see what the threads are up to
* use "netstat -p tcp" to see what network connections are live
* logging {{org.apache.hadoop.fs.s3a}} and seeing what it says. 

Before doing any of that: move off 2.7 to using the Hadoop 2.8 binaries, Hadoop 
2.8 has a lot of performance and functionality improvements which will never be 
backported to 2.7.x, including updated AWS SDK libraries.

If you do find a problem in the s3a client/AWS SDK, the response to a HADOOP- 
JIRA will be "does it go away if you upgrade?". Save time by doing that before 
anything else.

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3, using 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but then it hangs 
> indefinitely for anywhere from 5 up to 40 minutes, much like the Avro file read 
> issue (which was fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some large values, but that didn't 
> help.
> Finally I ended up enabling the Spark speculation setting, which again didn't 
> help much.
> One thing I observed is that the connection is not closed after every read of a 
> binary file from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261293#comment-16261293
 ] 

Marco Gaido commented on SPARK-22576:
-

why do you expect locate to work like this and not as it is working now?

> Spark SQL locate returns incorrect value when the start position is negative
> 
>
> Key: SPARK-22576
> URL: https://issues.apache.org/jira/browse/SPARK-22576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API

2017-11-21 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-22521:
---

Assignee: Weichen Xu

> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API
> -
>
> Key: SPARK-22521
> URL: https://issues.apache.org/jira/browse/SPARK-22521
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API

2017-11-21 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-22521.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19753
[https://github.com/apache/spark/pull/19753]

> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API
> -
>
> Key: SPARK-22521
> URL: https://issues.apache.org/jira/browse/SPARK-22521
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
> Fix For: 2.3.0
>
>
> VectorIndexerModel support handle unseen categories via handleInvalid: Python 
> API



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative

2017-11-21 Thread Yuxin Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuxin Cao updated SPARK-22576:
--
Summary: Spark SQL locate returns incorrect value when the start position 
is negative  (was: Spark SQL locate returns incorrect value when position is 
negative)

> Spark SQL locate returns incorrect value when the start position is negative
> 
>
> Key: SPARK-22576
> URL: https://issues.apache.org/jira/browse/SPARK-22576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22576) Spark SQL locate returns incorrect value when position is negative

2017-11-21 Thread Yuxin Cao (JIRA)
Yuxin Cao created SPARK-22576:
-

 Summary: Spark SQL locate returns incorrect value when position is 
negative
 Key: SPARK-22576
 URL: https://issues.apache.org/jira/browse/SPARK-22576
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yuxin Cao






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



subscribe

2017-11-21 Thread Shinichiro Abe



[jira] [Created] (SPARK-22575) Making Spark Thrift Server clean up its cache

2017-11-21 Thread Oz Ben-Ami (JIRA)
Oz Ben-Ami created SPARK-22575:
--

 Summary: Making Spark Thrift Server clean up its cache
 Key: SPARK-22575
 URL: https://issues.apache.org/jira/browse/SPARK-22575
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, SQL
Affects Versions: 2.2.0
Reporter: Oz Ben-Ami
Priority: Minor


Currently, Spark Thrift Server accumulates data in its appcache, even for old 
queries. This fills up the disk (using over 100GB per worker node) within days, 
and the only way to clear it is to restart the Thrift Server application. Even 
deleting the files directly isn't a solution, as Spark then complains about 
FileNotFound.

I asked about this on [Stack 
Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
 a few weeks ago, but it does not seem to be currently doable by configuration.

Am I missing some configuration option, or some other factor here?

Otherwise, can anyone point me to the code that handles this, so maybe I can 
try my hand at a fix?

Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-11-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260954#comment-16260954
 ] 

Apache Spark commented on SPARK-22574:
--

User 'Gschiavon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19793

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 2.0.0, 2.1.0, 2.2.0
>
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
> Dispatcher into a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these variables are checked for null, but not all of them, for example 
> appArgs and environmentVariables.
> If you don't set _appArgs_ it will cause the following error:
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This happens because the scheduler accesses the field without checking whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-11-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22574:


Assignee: Apache Spark

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0, 2.1.0, 2.2.0
>
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts 
> the Dispatcher into a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these fields are checked before use, but not all of them; for example, 
> appArgs and environmentVariables are not.
> If you don't set _appArgs_, it causes the following error:
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This happens because the scheduler accesses the field without checking whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-11-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22574:


Assignee: (was: Apache Spark)

> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 2.0.0, 2.1.0, 2.2.0
>
>
> Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts 
> the Dispatcher into a bad state and makes it inactive as a Mesos framework.
> The class CreateSubmissionRequest initialises its arguments to null as follows:
> {code:title=CreateSubmissionRequest.scala|borderStyle=solid}
>   var appResource: String = null
>   var mainClass: String = null
>   var appArgs: Array[String] = null
>   var sparkProperties: Map[String, String] = null
>   var environmentVariables: Map[String, String] = null
> {code}
> Some of these fields are checked before use, but not all of them; for example, 
> appArgs and environmentVariables are not.
> If you don't set _appArgs_, it causes the following error:
> {code:title=error|borderStyle=solid}
> 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
> Exception in thread "Thread-22" java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
> {code}
> This happens because the scheduler accesses the field without checking whether it is null.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-11-21 Thread German Schiavon Matteo (JIRA)
German Schiavon Matteo created SPARK-22574:
--

 Summary: Wrong request causing Spark Dispatcher going inactive
 Key: SPARK-22574
 URL: https://issues.apache.org/jira/browse/SPARK-22574
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Submit
Affects Versions: 2.2.0
Reporter: German Schiavon Matteo
Priority: Minor
 Fix For: 2.2.0, 2.1.0, 2.0.0


Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts 
the Dispatcher into a bad state and makes it inactive as a Mesos framework.

The class CreateSubmissionRequest initialises its arguments to null as follows:

{code:title=CreateSubmissionRequest.scala|borderStyle=solid}
  var appResource: String = null
  var mainClass: String = null
  var appArgs: Array[String] = null
  var sparkProperties: Map[String, String] = null
  var environmentVariables: Map[String, String] = null
{code}

Some of these fields are checked before use, but not all of them; for example, 
appArgs and environmentVariables are not.

If you don't set _appArgs_, it causes the following error:
{code:title=error|borderStyle=solid}
17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
Exception in thread "Thread-22" java.lang.NullPointerException
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
{code}

This happens because the scheduler accesses the field without checking whether it is null.
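
For illustration, a minimal sketch of the kind of up-front validation that would avoid the NPE. The class stub mirrors the fields quoted above; the validate helper and its messages are hypothetical, not the actual Spark code:

{code}
// Hedged sketch, not the actual fix: reject a submission whose nullable fields
// were never set, instead of letting the Mesos scheduler hit a NullPointerException.
class CreateSubmissionRequest {                      // mirrors the fields shown above
  var appResource: String = null
  var mainClass: String = null
  var appArgs: Array[String] = null
  var sparkProperties: Map[String, String] = null
  var environmentVariables: Map[String, String] = null
}

def validate(req: CreateSubmissionRequest): Unit = {
  require(req.appResource != null, "appResource must be set")
  require(req.mainClass != null, "mainClass must be set")
  require(req.appArgs != null, "appArgs must be set (use an empty array for no arguments)")
  require(req.sparkProperties != null, "sparkProperties must be set")
  require(req.environmentVariables != null, "environmentVariables must be set")
}
{code}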

 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf

2017-11-21 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260924#comment-16260924
 ] 

Li Jin commented on SPARK-22323:


I believe this is done (at least, my original intention for this). The design 
doc will be tracked in the link [~hyukjin.kwon] posted. If we want to make this 
part of the official PySpark docs, let's revisit.

> Design doc for different types of pandas_udf
> 
>
> Key: SPARK-22323
> URL: https://issues.apache.org/jira/browse/SPARK-22323
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22323) Design doc for different types of pandas_udf

2017-11-21 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin resolved SPARK-22323.

Resolution: Fixed

> Design doc for different types of pandas_udf
> 
>
> Key: SPARK-22323
> URL: https://issues.apache.org/jira/browse/SPARK-22323
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260904#comment-16260904
 ] 

Marco Gaido commented on SPARK-22516:
-

[~crkumaresh24] I can't reproduce the issue with the new file you have 
uploaded. I am running on OS X; maybe it depends on the OS:

{code}
scala> val a = spark.read.option("header","true").option("inferSchema", 
"true").option("multiLine", "true").option("comment", "c").option("parserLib", 
"univocity").csv("/Users/mgaido/Downloads/test_file_without_eof_char.csv")
a: org.apache.spark.sql.DataFrame = [abc: string, def: string]

scala> a.show
+---+---+
|abc|def|
+---+---+
|ghi|jkl|
+---+---+
{code}

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read the attached CSV file with the following parse properties:
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only if the "comment" character is the same as the first character of the 
> input dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITER
> Format configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote 

[jira] [Assigned] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion

2017-11-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22566:


Assignee: (was: Apache Spark)

> Better error message for `_merge_type` in Pandas to Spark DF conversion
> ---
>
> Key: SPARK-22566
> URL: https://issues.apache.org/jira/browse/SPARK-22566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Guilherme Berger
>Priority: Minor
>
> When creating a Spark DF from a Pandas DF without specifying a schema, schema 
> inference is used. This inference can fail when a column contains values of 
> two different types; this is ok. The problem is the error message does not 
> tell us in which column this happened.
> When this happens, it is painful to debug since the error message is too 
> vague.
> I plan on submitting a PR which fixes this, providing a better error message 
> for such cases, containing the column name (and possibly the problematic 
> values too).
> >>> spark_session.createDataFrame(pandas_df)
> File "redacted/pyspark/sql/session.py", line 541, in createDataFrame
>   rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal
>   struct = self._inferSchemaFromList(data)
> File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList
>   schema = reduce(_merge_type, map(_infer_schema, data))
> File "redacted/pyspark/sql/types.py", line 1124, in _merge_type
>   for f in a.fields]
> File "redacted/pyspark/sql/types.py", line 1118, in _merge_type
>   raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
> TypeError: Can not merge type <...> and <class 'pyspark.sql.types.StringType'>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion

2017-11-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22566:


Assignee: Apache Spark

> Better error message for `_merge_type` in Pandas to Spark DF conversion
> ---
>
> Key: SPARK-22566
> URL: https://issues.apache.org/jira/browse/SPARK-22566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Guilherme Berger
>Assignee: Apache Spark
>Priority: Minor
>
> When creating a Spark DF from a Pandas DF without specifying a schema, schema 
> inference is used. This inference can fail when a column contains values of 
> two different types; this is ok. The problem is the error message does not 
> tell us in which column this happened.
> When this happens, it is painful to debug since the error message is too 
> vague.
> I plan on submitting a PR which fixes this, providing a better error message 
> for such cases, containing the column name (and possibly the problematic 
> values too).
> >>> spark_session.createDataFrame(pandas_df)
> File "redacted/pyspark/sql/session.py", line 541, in createDataFrame
>   rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal
>   struct = self._inferSchemaFromList(data)
> File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList
>   schema = reduce(_merge_type, map(_infer_schema, data))
> File "redacted/pyspark/sql/types.py", line 1124, in _merge_type
>   for f in a.fields]
> File "redacted/pyspark/sql/types.py", line 1118, in _merge_type
>   raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
> TypeError: Can not merge type <...> and <class 'pyspark.sql.types.StringType'>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion

2017-11-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260889#comment-16260889
 ] 

Apache Spark commented on SPARK-22566:
--

User 'gberger' has created a pull request for this issue:
https://github.com/apache/spark/pull/19792

> Better error message for `_merge_type` in Pandas to Spark DF conversion
> ---
>
> Key: SPARK-22566
> URL: https://issues.apache.org/jira/browse/SPARK-22566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Guilherme Berger
>Priority: Minor
>
> When creating a Spark DF from a Pandas DF without specifying a schema, schema 
> inference is used. This inference can fail when a column contains values of 
> two different types; this is ok. The problem is the error message does not 
> tell us in which column this happened.
> When this happens, it is painful to debug since the error message is too 
> vague.
> I plan on submitting a PR which fixes this, providing a better error message 
> for such cases, containing the column name (and possibly the problematic 
> values too).
> >>> spark_session.createDataFrame(pandas_df)
> File "redacted/pyspark/sql/session.py", line 541, in createDataFrame
>   rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal
>   struct = self._inferSchemaFromList(data)
> File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList
>   schema = reduce(_merge_type, map(_infer_schema, data))
> File "redacted/pyspark/sql/types.py", line 1124, in _merge_type
>   for f in a.fields]
> File "redacted/pyspark/sql/types.py", line 1118, in _merge_type
>   raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
> TypeError: Can not merge type <...> and <class 'pyspark.sql.types.StringType'>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22552) Cannot Union multiple kafka streams

2017-11-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22552:
-

 Assignee: (was: Shixiong Zhu)
 Priority: Minor  (was: Major)
Fix Version/s: (was: 2.2.2)
   (was: 2.3.0)

> Cannot Union multiple kafka streams
> ---
>
> Key: SPARK-22552
> URL: https://issues.apache.org/jira/browse/SPARK-22552
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sachin malhotra
>Priority: Minor
>
> When unioning multiple Kafka streams, I found that the resulting dataframe 
> only contains the data from the dataframe that initiated the union, i.e. for 
> df1.union(df2) (or a chain of unions) the result will only contain 
> the rows that exist in df1.
> To be more specific, this occurs when data comes in during the same 
> micro-batch for all three streams. If you wait for each single row to be 
> processed for each stream, the union does return the right results.
> For example, if you have 3 Kafka streams and you:
> send message 1 to stream 1, WAIT for the batch to finish, send message 2 to 
> stream 2, wait for the batch to finish, send message 3 to stream 3, wait for 
> the batch to finish, then the union returns the right data.
> But if you
> send messages 1, 2 and 3, then WAIT for the batch to finish, you only receive 
> data from the first stream when unioning all three dataframes.
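
For reference, a minimal sketch of the setup described above, assuming a local SparkSession; the bootstrap servers and topic names are placeholders:

{code}
// Three Kafka source streams unioned into one query (placeholders for broker/topics).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-kafka-streams").getOrCreate()

def kafkaStream(topic: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS value")

val unioned = kafkaStream("topic1").union(kafkaStream("topic2")).union(kafkaStream("topic3"))

// Messages sent to all three topics within one micro-batch should all show up here.
val query = unioned.writeStream.format("console").start()
query.awaitTermination()
{code}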



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-21 Thread Mark Petruska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260759#comment-16260759
 ] 

Mark Petruska commented on SPARK-22572:
---

[~srowen]: Can I ask you to look at https://github.com/apache/spark/pull/19791, 
please?

> spark-shell does not re-initialize on :replay
> -
>
> Key: SPARK-22572
> URL: https://issues.apache.org/jira/browse/SPARK-22572
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Mark Petruska
>Priority: Minor
>
> Spark-shell does not run the re-initialization script when a `:replay` 
> command is issued:
> {code}
> $ ./bin/spark-shell 
> 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.1.3:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1511262066013).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f
> scala> :replay
> Replaying: sc
> :12: error: not found: value sc
>sc
>^
> scala> sc
> :12: error: not found: value sc
>sc
>^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22572:


Assignee: Apache Spark

> spark-shell does not re-initialize on :replay
> -
>
> Key: SPARK-22572
> URL: https://issues.apache.org/jira/browse/SPARK-22572
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Mark Petruska
>Assignee: Apache Spark
>Priority: Minor
>
> Spark-shell does not run the re-initialization script when a `:replay` 
> command is issued:
> {code}
> $ ./bin/spark-shell 
> 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.1.3:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1511262066013).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f
> scala> :replay
> Replaying: sc
> :12: error: not found: value sc
>sc
>^
> scala> sc
> :12: error: not found: value sc
>sc
>^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22572:


Assignee: (was: Apache Spark)

> spark-shell does not re-initialize on :replay
> -
>
> Key: SPARK-22572
> URL: https://issues.apache.org/jira/browse/SPARK-22572
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Mark Petruska
>Priority: Minor
>
> Spark-shell does not run the re-initialization script when a `:replay` 
> command is issued:
> {code}
> $ ./bin/spark-shell 
> 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.1.3:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1511262066013).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f
> scala> :replay
> Replaying: sc
> :12: error: not found: value sc
>sc
>^
> scala> sc
> :12: error: not found: value sc
>sc
>^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260756#comment-16260756
 ] 

Apache Spark commented on SPARK-22572:
--

User 'mpetruska' has created a pull request for this issue:
https://github.com/apache/spark/pull/19791

> spark-shell does not re-initialize on :replay
> -
>
> Key: SPARK-22572
> URL: https://issues.apache.org/jira/browse/SPARK-22572
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Mark Petruska
>Priority: Minor
>
> Spark-shell does not run the re-initialization script when a `:replay` 
> command is issued:
> {code}
> $ ./bin/spark-shell 
> 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.1.3:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1511262066013).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f
> scala> :replay
> Replaying: sc
> :12: error: not found: value sc
>sc
>^
> scala> sc
> :12: error: not found: value sc
>sc
>^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest

2017-11-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20133.
---
Resolution: Not A Problem

> User guide for spark.ml.stat.ChiSquareTest
> --
>
> Key: SPARK-20133
> URL: https://issues.apache.org/jira/browse/SPARK-20133
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add new user guide section for spark.ml.stat, and document ChiSquareTest.  
> This may involve adding new example scripts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260736#comment-16260736
 ] 

Sean Owen commented on SPARK-22572:
---

Yes, I think this is just how it is. The way Spark is initialized by injecting 
some code into the shell init means it's not available for replay. Have a look 
at the integration and see if you have bright ideas, but I recall concluding it 
wouldn't work.
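
A possible manual workaround after :replay (a hedged sketch, untested here, not an official fix): the context and session started at shell launch are still running in the JVM, so they can be picked up again by hand:

{code}
// Re-bind sc and spark by reusing the already-running context/session.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder.getOrCreate()
{code}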

> spark-shell does not re-initialize on :replay
> -
>
> Key: SPARK-22572
> URL: https://issues.apache.org/jira/browse/SPARK-22572
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Mark Petruska
>Priority: Minor
>
> Spark-shell does not run the re-initialization script when a `:replay` 
> command is issued:
> {code}
> $ ./bin/spark-shell 
> 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.1.3:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1511262066013).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f
> scala> :replay
> Replaying: sc
> :12: error: not found: value sc
>sc
>^
> scala> sc
> :12: error: not found: value sc
>sc
>^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22528) History service and non-HDFS filesystems

2017-11-21 Thread paul mackles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260727#comment-16260727
 ] 

paul mackles commented on SPARK-22528:
--

In case anyone else bumps into this, I received some feedback from the 
data-lake team at MSFT:

This is expected behavior, since Hadoop supports Kerberos-based identity whereas 
Data Lake supports OAuth2 / Azure Active Directory (AAD). The bridge/mapping 
between Kerberos and AAD OAuth2 is supported only in Azure HDInsight clusters 
today.
 
OAuth2 support in Hadoop is a non-trivial task and is in progress: 
https://issues.apache.org/jira/browse/HADOOP-11744
The workaround for the limitation is (specific to Data Lake):
core-site.xml
 {code}
<property>
  <name>adl.debug.override.localuserasfileowner</name>
  <value>true</value>
</property>
 {code}

What does this configuration do?
FileStatus contains the user/group information, which is associated with an 
object id from AAD. The Hadoop driver replaces the object id with the local 
Hadoop user under the context of the Hadoop process. The actual file information 
in Data Lake remains unchanged, only shadowed behind the local Hadoop user.


> History service and non-HDFS filesystems
> 
>
> Key: SPARK-22528
> URL: https://issues.apache.org/jira/browse/SPARK-22528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: paul mackles
>Priority: Minor
>
> We are using Azure Data Lake (ADL) to store our event logs. This worked fine 
> in 2.1.x but in 2.2.0, the event logs are no longer visible to the history 
> server. I tracked it down to the call to:
> {code}
> SparkHadoopUtil.get.checkAccessPermission()
> {code}
> which was added to "FSHistoryProvider" in 2.2.0.
> I was able to workaround it by:
> * setting the files on ADL to world readable
> * setting HADOOP_PROXY to the Azure objectId of the service principal that 
> owns file
> Neither of those workaround are particularly desirable in our environment. 
> That said, I am not sure how this should be addressed:
> * Is this an issue with the Azure/Hadoop bindings not setting up the user 
> context correctly so that the "checkAccessPermission()" call succeeds w/out 
> having to use the username under which the process is running?
> * Is this an issue with "checkAccessPermission()" not really accounting for 
> all of the possible FileSystem implementations? If so, I would imagine that 
> there are similar issues when using S3.
> In spite of this check, I know the files are accessible through the 
> underlying FileSystem object so it feels like the latter but I don't think 
> that the FileSystem object alone could be used to implement this check.
> Any thoughts [~jerryshao]?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22568) Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey

2017-11-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260701#comment-16260701
 ] 

Sean Owen commented on SPARK-22568:
---

To filter individually you would have to collect distinct values first. But the 
other approaches here don't require that. If you don't need all values in 
memory at once, try repartitionAndSortWithinPartitions to encounter all values 
for a key in a row.
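
For illustration, a minimal sketch of that approach; the input pairs and the partition count are made-up placeholders and it assumes the spark-shell {{sc}}:

{code}
// Partition by key and sort within partitions, so all values of a category
// arrive consecutively and can be processed without groupByKey.
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))

val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

val processed = sorted.mapPartitions { iter =>
  // consecutive records share a key, so per-category work needs no extra shuffle
  iter.map { case (category, value) => (category, value * 10) }
}
processed.collect().foreach(println)
{code}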

> Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey
> 
>
> Key: SPARK-22568
> URL: https://issues.apache.org/jira/browse/SPARK-22568
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Éderson Cássio
>  Labels: features, performance, usability
>
> Sorry for any mistakes on filling this big form... it's my first issue here :)
> Recently I had the need to separate an RDD by some categorization. I was 
> able to accomplish that in a few ways.
> First, the obvious: mapping each element to a pair, with the key being the 
> category of the element. Then, using the good ol' {{groupByKey}}.
> Following the advice to avoid {{groupByKey}}, I failed to find a more 
> efficient alternative. I ended up (a) obtaining the distinct list of 
> element categories, (b) {{collect}}ing them and (c) making a call to 
> {{filter}} for each category. Of course, before all that I {{cache}}d my 
> initial RDD.
> So I started to speculate: maybe it would be possible to make a number of 
> RDDs from an initial pair RDD _without the need to shuffle the data_. It 
> could be done by a kind of _local repartition_: first each partition is 
> split into several by key; then the master groups the partitions with the 
> same key into a new RDD. The operation would return a List or array 
> containing the new RDDs.
> It's just a conjecture; I don't know if it would be feasible in the current 
> Spark Core architecture. But it would be great if it could be done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22571) How to connect to secure(Kerberos) kafka broker using native KafkaConsumer api in Spark Streamming application

2017-11-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22571.
---
Resolution: Invalid

This should go to the mailing list.

> How to connect to secure(Kerberos) kafka broker using native KafkaConsumer 
> api in Spark Streamming application
> --
>
> Key: SPARK-22571
> URL: https://issues.apache.org/jira/browse/SPARK-22571
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.6.2
> Environment: Hortonworks cluster(2.5x) with Kafka 0.10 and Spark 1.6.2
>Reporter: Ujjal Satpathy
>
> I have been trying to create a KafkaConsumer in a Spark Streaming application in 
> order to fetch Kafka partition details from a secured (Kerberos) Kafka 
> cluster. So basically there are two consumers in total (the Spark consumer and 
> the KafkaConsumer) running in parallel in yarn-cluster mode. I am providing the 
> options below in the run command to connect to the Kafka brokers, but the 
> application is not able to create the native KafkaConsumer and throws "JAAS 
> configuration not found". The Spark consumer is working fine and successfully 
> creates the connection.
> Below is the configuration provided in the spark-submit command:
> --conf "spark.executor.extraJavaOptions 
> -Dlog4j.configuration=./log4j.properties 
> -Djava.security.auth.login.config=./consumer_jaas.conf 
> -Djava.security.krb5.conf=./consumer_krb5.conf" \
> --conf "spark.driver.extraJavaOptions 
> -Dlog4j.configuration=./log4j.properties 
> -Djava.security.auth.login.config=./consumer_jaas.conf 
> -Djava.security.krb5.conf=./consumer_krb5.conf" \
> I am running this application in yarn-cluster mode in Hortonworks cluster



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22569) Clean up caller of splitExpressions and addMutableState

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22569.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19790
[https://github.com/apache/spark/pull/19790]

> Clean up caller of splitExpressions and addMutableState
> ---
>
> Key: SPARK-22569
> URL: https://issues.apache.org/jira/browse/SPARK-22569
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Clean up the usage of `addMutableState` and `splitExpressions`



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22538) SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22538:

Fix Version/s: (was: 2.2.2)
   2.2.1

> SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame
> 
>
> Key: SPARK-22538
> URL: https://issues.apache.org/jira/browse/SPARK-22538
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark, SQL, Web UI
>Affects Versions: 2.2.0
>Reporter: MBA Learns to Code
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.1, 2.3.0
>
>
> When running the below code on PySpark v2.2.0, the cached input DataFrame df 
> disappears from SparkUI after SQLTransformer.transform(...) is called on it.
> I don't yet know whether this is only a SparkUI bug, or whether the input DataFrame 
> df is indeed unpersisted from memory. If the latter is true, this can be a 
> serious bug because any new computation using new_df would have to re-do all 
> the work leading up to df.
> {code}
> import pandas
> import pyspark
> from pyspark.ml.feature import SQLTransformer
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pandas.DataFrame(dict(x=[-1, 0, 1])))
> # after below step, SparkUI Storage shows 1 cached RDD
> df.cache(); df.count()
> # after below step, cached RDD disappears from SparkUI Storage
> new_df = SQLTransformer(statement='SELECT * FROM __THIS__').transform(df)
> {code}
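
One way to tell whether the input is really unpersisted (rather than just dropped from the UI) is to check its storage level after the transform. A hedged Scala analogue of the repro above:

{code}
// Dataset.storageLevel stays non-NONE if df is still cached after the transform.
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(3).toDF("x")

df.cache(); df.count()
assert(df.storageLevel != StorageLevel.NONE)   // cached at this point

val newDf = new SQLTransformer().setStatement("SELECT * FROM __THIS__").transform(df)
// true here would mean df was genuinely unpersisted, not just hidden from the UI
println(df.storageLevel == StorageLevel.NONE)
{code}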



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22498) 64KB JVM bytecode limit problem with concat

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22498:

Fix Version/s: (was: 2.2.2)
   2.2.1

> 64KB JVM bytecode limit problem with concat
> ---
>
> Key: SPARK-22498
> URL: https://issues.apache.org/jira/browse/SPARK-22498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{concat}} can throw an exception due to the 64KB JVM bytecode limit when 
> it is used with a lot of arguments.
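
A hedged repro sketch of the query shape that triggers this, assuming a SparkSession {{spark}}; 3000 is an arbitrary large argument count:

{code}
// A concat over a few thousand columns pushes the generated code for a single
// expression past the 64KB JVM method limit on affected versions.
import org.apache.spark.sql.functions.{col, concat}

val df = spark.range(1).toDF("x")
val args = Seq.fill(3000)(col("x").cast("string"))
df.select(concat(args: _*)).collect()
{code}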



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22549:

Fix Version/s: (was: 2.2.2)
   2.2.1

> 64KB JVM bytecode limit problem with concat_ws
> --
>
> Key: SPARK-22549
> URL: https://issues.apache.org/jira/browse/SPARK-22549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when 
> it is used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22494) Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22494:

Fix Version/s: (was: 2.2.2)
   2.2.1

> Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception
> -
>
> Key: SPARK-22494
> URL: https://issues.apache.org/jira/browse/SPARK-22494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.2.1, 2.3.0
>
>
> Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception 
> when used with a lot of arguments and/or complex expressions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22501) 64KB JVM bytecode limit problem with in

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22501:

Fix Version/s: (was: 2.2.2)
   2.2.1

> 64KB JVM bytecode limit problem with in
> ---
>
> Key: SPARK-22501
> URL: https://issues.apache.org/jira/browse/SPARK-22501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{In}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22550:

Fix Version/s: (was: 2.2.2)
   2.2.1

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22508:

Fix Version/s: (was: 2.2.2)
   2.2.1

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22499) 64KB JVM bytecode limit problem with least and greatest

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22499:

Fix Version/s: (was: 2.2.2)
   2.2.1

> 64KB JVM bytecode limit problem with least and greatest
> ---
>
> Key: SPARK-22499
> URL: https://issues.apache.org/jira/browse/SPARK-22499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> Both {{least}} and {{greatest}} can throw an exception due to the 64KB JVM 
> bytecode limit when they are used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22550:
---

Assignee: Kazuaki Ishizaki

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22550.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 19778
[https://github.com/apache/spark/pull/19778]

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
> Fix For: 2.2.2, 2.3.0
>
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22568) Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey

2017-11-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260616#comment-16260616
 ] 

Éderson Cássio commented on SPARK-22568:


Thanks for your response.
How can I filter by each distinct key without knowing them beforehand? It's common 
to determine the keys by some processing routine. The only way I could filter 
by all of them is by doing a {{collect}}; before that I don't even know 
how many there are.

> Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey
> 
>
> Key: SPARK-22568
> URL: https://issues.apache.org/jira/browse/SPARK-22568
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Éderson Cássio
>  Labels: features, performance, usability
>
> Sorry for any mistakes on filling this big form... it's my first issue here :)
> Recently I had the need to separate an RDD by some categorization. I was 
> able to accomplish that in a few ways.
> First, the obvious: mapping each element to a pair, with the key being the 
> category of the element. Then, using the good ol' {{groupByKey}}.
> Following the advice to avoid {{groupByKey}}, I failed to find a more 
> efficient alternative. I ended up (a) obtaining the distinct list of 
> element categories, (b) {{collect}}ing them and (c) making a call to 
> {{filter}} for each category. Of course, before all that I {{cache}}d my 
> initial RDD.
> So I started to speculate: maybe it would be possible to make a number of 
> RDDs from an initial pair RDD _without the need to shuffle the data_. It 
> could be done by a kind of _local repartition_: first each partition is 
> split into several by key; then the master groups the partitions with the 
> same key into a new RDD. The operation would return a List or array 
> containing the new RDDs.
> It's just a conjecture; I don't know if it would be feasible in the current 
> Spark Core architecture. But it would be great if it could be done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22573) SQL Planner is including unnecessary columns in the projection

2017-11-21 Thread Rajkishore Hembram (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajkishore Hembram updated SPARK-22573:
---
Description: 
While I was running TPC-H query 18 for benchmarking, I observed that the query 
plan for Apache Spark 2.2.0 is less efficient than in other versions of Apache 
Spark. I noticed that the other versions of Apache Spark (2.0.2 and 2.1.2) 
include only the required columns in the projections, but the query planner of 
Apache Spark 2.2.0 includes unnecessary columns in the projection for some of 
the queries, unnecessarily increasing the I/O. Because of that, Apache Spark 
2.2.0 takes more time.

[Spark 2.1.2 TPC-H Query 18 
Plan|https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
[Spark 2.2.0 TPC-H Query 18 
Plan|https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]

TPC-H Query 18
{code:java}
select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) 
from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from 
LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = 
O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by 
C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE 
desc,O_ORDERDATE
{code}


  was:
While I was running TPC-H query 18 for benchmarking, I observed that the query 
plan for Apache Spark 2.2.0 is less efficient than in other versions of Apache 
Spark. I noticed that the other versions of Apache Spark (2.0.2 and 2.1.2) 
include only the required columns in the projections, but the query planner of 
Apache Spark 2.2.0 includes unnecessary columns in the projection for some of 
the queries, unnecessarily increasing the I/O. Because of that, Apache Spark 
2.2.0 takes more time.

[https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
[https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]

TPC-H Query 18
{code:java}
select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) 
from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from 
LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = 
O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by 
C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE 
desc,O_ORDERDATE
{code}



> SQL Planner is including unnecessary columns in the projection
> --
>
> Key: SPARK-22573
> URL: https://issues.apache.org/jira/browse/SPARK-22573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rajkishore Hembram
>
> While I was running TPC-H query 18 for benchmarking, I observed that the 
> query plan produced by Apache Spark 2.2.0 is less efficient than the plans 
> produced by other versions of Apache Spark. The other versions (2.0.2 and 
> 2.1.2) include only the required columns in the projections, but the query 
> planner of Apache Spark 2.2.0 includes unnecessary columns in the projection 
> for some of the queries, which unnecessarily increases the I/O. Because of 
> this, Apache Spark 2.2.0 takes more time.
> [Spark 2.1.2 TPC-H Query 18 
> Plan|https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
> [Spark 2.2.0 TPC-H Query 18 
> Plan|https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]
> TPC-H Query 18
> {code:java}
> select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) 
> from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from 
> LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = 
> O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by 
> C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE 
> desc,O_ORDERDATE
> {code}
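One way (not from the report itself) to see which columns each version 
actually projects is to print the query plans and compare the Project nodes; 
this assumes the TPC-H tables are already registered as temporary views:

{code}
// Hypothetical sketch: q18 holds the TPC-H query 18 text quoted above.
val q18 =
  """select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY)
    |from CUSTOMER,ORDERS,LINEITEM
    |where O_ORDERKEY in (select L_ORDERKEY from LINEITEM
    |                     group by L_ORDERKEY having sum(L_QUANTITY) > 300)
    |  and C_CUSTKEY = O_CUSTKEY and O_ORDERKEY = L_ORDERKEY
    |group by C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE
    |order by O_TOTALPRICE desc,O_ORDERDATE""".stripMargin

// Prints the parsed, analyzed, optimized and physical plans; compare the
// column lists of the Project operators across Spark versions.
spark.sql(q18).explain(true)
{code}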



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22573) SQL Planner is including unnecessary columns in the projection

2017-11-21 Thread Rajkishore Hembram (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajkishore Hembram updated SPARK-22573:
---
Description: 
While I was running TPC-H query 18 for benchmarking, I observed that the query 
plan produced by Apache Spark 2.2.0 is less efficient than the plans produced 
by other versions of Apache Spark. The other versions (2.0.2 and 2.1.2) include 
only the required columns in the projections, but the query planner of Apache 
Spark 2.2.0 includes unnecessary columns in the projection for some of the 
queries, which unnecessarily increases the I/O. Because of this, Apache Spark 
2.2.0 takes more time.

[https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
[https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]

TPC-H Query 18
{code:java}
select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) 
from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from 
LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = 
O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by 
C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE 
desc,O_ORDERDATE
{code}


  was:While I was running TPC-H query 18 for benchmarking, I observed that the 
query plan produced by Apache Spark 2.2.0 is less efficient than the plans 
produced by other versions of Apache Spark. The other versions (2.0.2 and 
2.1.2) include only the required columns in the projections, but the query 
planner of Apache Spark 2.2.0 includes unnecessary columns in the projection 
for some of the queries, which unnecessarily increases the I/O. Because of 
this, Apache Spark 2.2.0 takes more time.


> SQL Planner is including unnecessary columns in the projection
> --
>
> Key: SPARK-22573
> URL: https://issues.apache.org/jira/browse/SPARK-22573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rajkishore Hembram
>
> While I was running TPC-H query 18 for benchmarking, I observed that the 
> query plan produced by Apache Spark 2.2.0 is less efficient than the plans 
> produced by other versions of Apache Spark. The other versions (2.0.2 and 
> 2.1.2) include only the required columns in the projections, but the query 
> planner of Apache Spark 2.2.0 includes unnecessary columns in the projection 
> for some of the queries, which unnecessarily increases the I/O. Because of 
> this, Apache Spark 2.2.0 takes more time.
> [https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
> [https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]
> TPC-H Query 18
> {code:java}
> select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) 
> from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from 
> LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = 
> O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by 
> C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE 
> desc,O_ORDERDATE
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22573) SQL Planner is including unnecessary columns in the projection

2017-11-21 Thread Rajkishore Hembram (JIRA)
Rajkishore Hembram created SPARK-22573:
--

 Summary: SQL Planner is including unnecessary columns in the 
projection
 Key: SPARK-22573
 URL: https://issues.apache.org/jira/browse/SPARK-22573
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Rajkishore Hembram


While I was running TPC-H query 18 for benchmarking, I observed that the query 
plan produced by Apache Spark 2.2.0 is less efficient than the plans produced 
by other versions of Apache Spark. The other versions (2.0.2 and 2.1.2) include 
only the required columns in the projections, but the query planner of Apache 
Spark 2.2.0 includes unnecessary columns in the projection for some of the 
queries, which unnecessarily increases the I/O. Because of this, Apache Spark 
2.2.0 takes more time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22508:
---

Assignee: Kazuaki Ishizaki

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.3.0, 2.2.2
>
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields.
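A hypothetical repro sketch (not from the report) of the kind of very wide 
schema that can push generated code past the 64KB method limit; the column 
count is a placeholder, and whether a given plan actually goes through 
{{GenerateUnsafeRowJoiner.create()}} depends on the Spark version and the 
physical plan chosen:

{code}
import org.apache.spark.sql.functions.{col, lit, max}

// Build a DataFrame with a few thousand columns and aggregate all of them,
// so the aggregation has to handle a very wide row schema.
val extraCols = (1 to 3000).map(i => lit(i).as(s"c$i"))
val allCols   = col("id") +: extraCols
val wide      = spark.range(100).select(allCols: _*)

val aggs = (1 to 3000).map(i => max(col(s"c$i")))
wide.groupBy("id").agg(aggs.head, aggs.tail: _*).count()
{code}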



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22508.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 19737
[https://github.com/apache/spark/pull/19737]

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
> Fix For: 2.2.2, 2.3.0
>
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-21 Thread Mark Petruska (JIRA)
Mark Petruska created SPARK-22572:
-

 Summary: spark-shell does not re-initialize on :replay
 Key: SPARK-22572
 URL: https://issues.apache.org/jira/browse/SPARK-22572
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.0
Reporter: Mark Petruska
Priority: Minor


Spark-shell does not run the re-initialization script when a `:replay` command 
is issued:

{code}
$ ./bin/spark-shell 
17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.3:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1511262066013).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f

scala> :replay
Replaying: sc
<console>:12: error: not found: value sc
   sc
   ^


scala> sc
<console>:12: error: not found: value sc
   sc
   ^

scala>
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE

2017-11-21 Thread Anurag Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260481#comment-16260481
 ] 

Anurag Agarwal commented on SPARK-18859:


Another workaround for Postgres databases is to create a dblink and query 
through it, because with dblink you have to spell out the target schema 
explicitly. This avoids confusion for the Postgres driver and helps it 
determine nullability correctly from the query (a sketch follows).
I tried the approach suggested by [~mentegy] and it works too, but I did not 
have CREATE VIEW permission on the source database, which made me look for 
another approach.
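A minimal sketch of that dblink-based workaround (the connection details, 
table names and column definition list are placeholders, and it assumes the 
dblink extension is installed on the Postgres side):

{code:title=dblink_workaround.scala|borderStyle=solid}
// Wrap the left join in a dblink call; the result comes back as a plain
// record set with an explicit column definition list, so the NOT NULL
// constraint on the base table is not carried over (per the comment above).
val query =
  """(select * from dblink(
    |    'dbname=masterdata',
    |    'select t.id, t.age, j.name
    |       from masterdata.testtable t
    |       left join masterdata.jointable j on t.id = j.id')
    |    as r(id int, age int, name text)) as testtable""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/masterdata")   // placeholder
  .option("dbtable", query)
  .load()

df.printSchema()   // name should now be reported as nullable = true
{code}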

> Catalyst codegen does not mark column as nullable when it should. Causes NPE
> 
>
> Key: SPARK-18859
> URL: https://issues.apache.org/jira/browse/SPARK-18859
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.2
>Reporter: Mykhailo Osypov
>
> When joining two tables via LEFT JOIN, columns from the right table may be 
> NULL; however, Catalyst codegen does not recognize this.
> Example:
> {code:title=schema.sql|borderStyle=solid}
> create table masterdata.testtable(
>   id int not null,
>   age int
> );
> create table masterdata.jointable(
>   id int not null,
>   name text not null
> );
> {code}
> {code:title=query_to_select.sql|borderStyle=solid}
> (select t.id, t.age, j.name from masterdata.testtable t left join 
> masterdata.jointable j on t.id = j.id) as testtable;
> {code}
> {code:title=master code|borderStyle=solid}
> val df = sqlContext
>   .read
>   .format("jdbc")
>   .option("dbTable", "query to select")
>   
>   .load
> //df generated schema
> /*
> root
>  |-- id: integer (nullable = false)
>  |-- age: integer (nullable = true)
>  |-- name: string (nullable = false)
> */
> {code}
> {code:title=Codegen|borderStyle=solid}
> /* 038 */   scan_rowWriter.write(0, scan_value);
> /* 039 */
> /* 040 */   if (scan_isNull1) {
> /* 041 */ scan_rowWriter.setNullAt(1);
> /* 042 */   } else {
> /* 043 */ scan_rowWriter.write(1, scan_value1);
> /* 044 */   }
> /* 045 */
> /* 046 */   scan_rowWriter.write(2, scan_value2);
> {code}
> Since *j.name* comes from the right table of the *left join* query, it may 
> be null. However, the generated schema doesn't reflect that (probably 
> because the column is defined as *name text not null*).
> {code:title=StackTrace|borderStyle=solid}
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22541) Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22541.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19787
[https://github.com/apache/spark/pull/19787]

> Dataframes: applying multiple filters one after another using udfs and 
> accumulators results in faulty accumulators
> --
>
> Key: SPARK-22541
> URL: https://issues.apache.org/jira/browse/SPARK-22541
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: pyspark 2.2.0, ubuntu
>Reporter: Janne K. Olesen
> Fix For: 2.3.0
>
>
> I'm using UDF filters and accumulators to keep track of filtered rows in 
> dataframes.
> If I apply multiple filters one after the other, they seem to be executed in 
> parallel rather than in sequence, which messes with the accumulators I'm 
> using to keep track of the filtered data.
> {code:title=example.py|borderStyle=solid}
> from pyspark.sql.functions import udf, col
> from pyspark.sql.types import BooleanType
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> sc = spark.sparkContext
> df = spark.createDataFrame([("a", 1, 1), ("b", 2, 2), ("c", 3, 3)], ["key", 
> "val1", "val2"])
> def __myfilter(val, acc):
>     if val < 2:
>         return True
>     else:
>         acc.add(1)
>         return False
> acc1 = sc.accumulator(0)
> acc2 = sc.accumulator(0)
> def myfilter1(val):
>     return __myfilter(val, acc1)
> def myfilter2(val):
>     return __myfilter(val, acc2)
> my_udf1 = udf(myfilter1, BooleanType())
> my_udf2 = udf(myfilter2, BooleanType())
> df.show()
> # +---+----+----+
> # |key|val1|val2|
> # +---+----+----+
> # |  a|   1|   1|
> # |  b|   2|   2|
> # |  c|   3|   3|
> # +---+----+----+
> df = df.filter(my_udf1(col("val1")))
> # df.show()
> # +---+----+----+
> # |key|val1|val2|
> # +---+----+----+
> # |  a|   1|   1|
> # +---+----+----+
> # expected acc1: 2
> # expected acc2: 0
> df = df.filter(my_udf2(col("val2")))
> # df.show()
> # +---+----+----+
> # |key|val1|val2|
> # +---+----+----+
> # |  a|   1|   1|
> # +---+----+----+
> # expected acc1: 2
> # expected acc2: 0
> df.show()
> print("acc1: %s" % acc1.value)  # expected 2, is 2 OK
> print("acc2: %s" % acc2.value)  # expected 0, is 2 !!!
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22541) Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators

2017-11-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22541:
---

Assignee: Liang-Chi Hsieh

> Dataframes: applying multiple filters one after another using udfs and 
> accumulators results in faulty accumulators
> --
>
> Key: SPARK-22541
> URL: https://issues.apache.org/jira/browse/SPARK-22541
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: pyspark 2.2.0, ubuntu
>Reporter: Janne K. Olesen
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> I'm using UDF filters and accumulators to keep track of filtered rows in 
> dataframes.
> If I apply multiple filters one after the other, they seem to be executed in 
> parallel rather than in sequence, which messes with the accumulators I'm 
> using to keep track of the filtered data.
> {code:title=example.py|borderStyle=solid}
> from pyspark.sql.functions import udf, col
> from pyspark.sql.types import BooleanType
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> sc = spark.sparkContext
> df = spark.createDataFrame([("a", 1, 1), ("b", 2, 2), ("c", 3, 3)], ["key", 
> "val1", "val2"])
> def __myfilter(val, acc):
>     if val < 2:
>         return True
>     else:
>         acc.add(1)
>         return False
> acc1 = sc.accumulator(0)
> acc2 = sc.accumulator(0)
> def myfilter1(val):
>     return __myfilter(val, acc1)
> def myfilter2(val):
>     return __myfilter(val, acc2)
> my_udf1 = udf(myfilter1, BooleanType())
> my_udf2 = udf(myfilter2, BooleanType())
> df.show()
> # +---+----+----+
> # |key|val1|val2|
> # +---+----+----+
> # |  a|   1|   1|
> # |  b|   2|   2|
> # |  c|   3|   3|
> # +---+----+----+
> df = df.filter(my_udf1(col("val1")))
> # df.show()
> # +---+----+----+
> # |key|val1|val2|
> # +---+----+----+
> # |  a|   1|   1|
> # +---+----+----+
> # expected acc1: 2
> # expected acc2: 0
> df = df.filter(my_udf2(col("val2")))
> # df.show()
> # +---+----+----+
> # |key|val1|val2|
> # +---+----+----+
> # |  a|   1|   1|
> # +---+----+----+
> # expected acc1: 2
> # expected acc2: 0
> df.show()
> print("acc1: %s" % acc1.value)  # expected 2, is 2 OK
> print("acc2: %s" % acc2.value)  # expected 0, is 2 !!!
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22564) csv reader no longer logs errors

2017-11-21 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260428#comment-16260428
 ] 

Adrian Bridgett commented on SPARK-22564:
-

I guess this should be closed as won't fix.  Maybe a better note on 
https://spark.apache.org/releases/spark-release-2-2-0.html would have been nice 
:D

> csv reader no longer logs errors
> 
>
> Key: SPARK-22564
> URL: https://issues.apache.org/jira/browse/SPARK-22564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>Priority: Minor
>
> Since upgrading from 2.0.2 to 2.2.0 we no longer see any malformed-CSV 
> warnings in the executor logs. It looks like this may be related to 
> https://issues.apache.org/jira/browse/SPARK-19949 (where 
> maxMalformedLogPerPartition was removed), as it seems to be used but, 
> AFAICT, is not set anywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22564) csv reader no longer logs errors

2017-11-21 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260424#comment-16260424
 ] 

Adrian Bridgett commented on SPARK-22564:
-

Ah, I had an old copy of CSVRelation.scala from before the refactoring left in 
that directory, which caused my confusion. Yep, it appears to be gone; I'll 
look at using columnNameOfCorruptRecord instead (a sketch follows). Thanks.
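For reference, a minimal sketch (not from this thread) of capturing malformed 
rows with {{columnNameOfCorruptRecord}} instead of relying on executor-side 
warnings; the schema, path and column names are placeholders:

{code}
import org.apache.spark.sql.types._

// Read in PERMISSIVE mode and keep the raw text of malformed rows in an
// explicit corrupt-record column, so they can be counted or logged later.
val schema = new StructType()
  .add("id", IntegerType)
  .add("value", DoubleType)
  .add("_corrupt_record", StringType)      // must be declared in the schema

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/path/to/data.csv")                // placeholder path
  .cache()   // cache so the corrupt column can be queried on its own

val bad = df.filter(df("_corrupt_record").isNotNull)
println(s"malformed rows: ${bad.count()}")
{code}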

> csv reader no longer logs errors
> 
>
> Key: SPARK-22564
> URL: https://issues.apache.org/jira/browse/SPARK-22564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>Priority: Minor
>
> Since upgrading from 2.0.2 to 2.2.0 we no longer see any malformed-CSV 
> warnings in the executor logs. It looks like this may be related to 
> https://issues.apache.org/jira/browse/SPARK-19949 (where 
> maxMalformedLogPerPartition was removed), as it seems to be used but, 
> AFAICT, is not set anywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org