[jira] [Commented] (SPARK-18451) Always set -XX:+HeapDumpOnOutOfMemoryError for Spark tests

2017-10-28 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223846#comment-16223846
 ] 

Xin Lu commented on SPARK-18451:


This is pretty easy to do if we can get the Jenkins job builder scripts changed and 
there is a place to store the dump files on the AMPLab Jenkins machines or S3.

> Always set -XX:+HeapDumpOnOutOfMemoryError for Spark tests
> --
>
> Key: SPARK-18451
> URL: https://issues.apache.org/jira/browse/SPARK-18451
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Reporter: Cheng Lian
>
> It would be nice if we always set {{-XX:+HeapDumpOnOutOfMemoryError}} and 
> {{-XX:HeapDumpPath}} for open source Spark tests, so that it would be easier 
> to investigate issues like SC-5041.
> Note:
> - We need to ensure that the heap dumps are stored in a location on Jenkins 
> that won't be automatically cleaned up.
> - It would be nice to be able to customize the heap dump output paths on a 
> per-build basis so that it's easier to find the heap dump file of any given 
> build.
> The 2nd point is optional since we can probably identify the wanted heap dump 
> files by looking at their creation timestamps.
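For illustration, a minimal sketch of how the two flags could be wired into forked test JVMs via SBT (the settings location and the dump path are assumptions, not an agreed Jenkins setup):

{code}
// Hypothetical build fragment: fork the test JVMs and pass the heap-dump flags.
// "/home/jenkins/heap-dumps" is a placeholder; the real path would need to be a
// location on the Jenkins workers that is not cleaned up automatically.
fork in Test := true
javaOptions in Test ++= Seq(
  "-XX:+HeapDumpOnOutOfMemoryError",
  "-XX:HeapDumpPath=/home/jenkins/heap-dumps"
)
{code}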






[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223844#comment-16223844
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

The ~/.cache directory is created by `install.spark` -- so it's in our control whether we 
want to "uninstall" Spark at the end of the tests?

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package, it leaves behind files in /tmp, 
> which is a violation of CRAN policy. We should instead write to the R session's 
> temporary directory (Rtmpdir). Notes from CRAN are below:
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}






[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2017-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223837#comment-16223837
 ] 

Apache Spark commented on SPARK-17788:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18251

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
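As an aside, the 17179869176 in the error matches the per-page cap in Spark's TaskMemoryManager: an on-heap page is backed by a long[] array, and a Java array holds at most 2^31 - 1 elements of 8 bytes each. A quick sketch of the arithmetic (illustrative only):

{code}
// Maximum bytes a single memory page can hold when backed by a long[] array.
val maxPageSizeBytes = ((1L << 31) - 1) * 8L
println(maxPageSizeBytes)  // 17179869176, i.e. just under 16 GB
{code}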






[jira] [Assigned] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17788:


Assignee: (was: Apache Spark)

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E






[jira] [Assigned] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17788:


Assignee: Apache Spark

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>Assignee: Apache Spark
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E






[jira] [Commented] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses

2017-10-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223835#comment-16223835
 ] 

Wenchen Fan commented on SPARK-18394:
-

resolved by https://github.com/apache/spark/pull/18959

> Executing the same query twice in a row results in CodeGenerator cache misses
> -
>
> Key: SPARK-18394
> URL: https://issues.apache.org/jira/browse/SPARK-18394
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop
>Reporter: Jonny Serencsa
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> Executing the query:
> {noformat}
> select
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> from
> lineitem_1_row
> where
> l_shipdate <= date_sub('1998-12-01', '90')
> group by
> l_returnflag,
> l_linestatus
> ;
> {noformat}
> twice (in succession) will result in CodeGenerator cache misses in BOTH 
> executions. Since the query is identical, I would expect the same code to be 
> generated. 
> It turns out the generated code is not exactly the same, resulting in cache 
> misses when performing the lookup in the CodeGenerator cache, even though the 
> code is equivalent. 
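Before the listing, a tiny self-contained illustration of why textually different but semantically equivalent code defeats a cache keyed on source text (a sketch, not Spark's actual CodeGenerator cache):

{code}
// A cache keyed on the exact source string misses whenever the text differs at
// all, even when both versions would compile to equivalent classes.
val cache = scala.collection.mutable.HashMap.empty[String, AnyRef]
def compile(src: String): AnyRef = cache.getOrElseUpdate(src, new Object)  // stand-in for real compilation

val a = compile("class Foo { /* codegen id: 1 */ }")
val b = compile("class Foo { /* codegen id: 2 */ }")  // same class, different text => cache miss
println(a eq b)  // false
{code}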
> Below is (some portion of the) generated code for two runs of the query:
> run-1
> {noformat}
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import scala.collection.Iterator;
> import org.apache.spark.sql.types.DataType;
> import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
> import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
> import org.apache.spark.sql.execution.columnar.MutableUnsafeRow;
> public SpecificColumnarIterator generate(Object[] references) {
> return new SpecificColumnarIterator();
> }
> class SpecificColumnarIterator extends 
> org.apache.spark.sql.execution.columnar.ColumnarIterator {
> private ByteOrder nativeOrder = null;
> private byte[][] buffers = null;
> private UnsafeRow unsafeRow = new UnsafeRow(7);
> private BufferHolder bufferHolder = new BufferHolder(unsafeRow);
> private UnsafeRowWriter rowWriter = new UnsafeRowWriter(bufferHolder, 7);
> private MutableUnsafeRow mutableRow = null;
> private int currentRow = 0;
> private int numRowsInBatch = 0;
> private scala.collection.Iterator input = null;
> private DataType[] columnTypes = null;
> private int[] columnIndexes = null;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor1;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor2;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor3;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor4;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor5;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor6;
> public SpecificColumnarIterator() {
> this.nativeOrder = ByteOrder.nativeOrder();
> this.buffers = new byte[7][];
> this.mutableRow = new MutableUnsafeRow(rowWriter);
> }
> public void initialize(Iterator input, DataType[] columnTypes, int[] 
> columnIndexes) {
> this.input = input;
> this.columnTypes = columnTypes;
> this.columnIndexes = columnIndexes;
> }
> public boolean hasNext() {
> if (currentRow < numRowsInBatch) {
> return true;
> }
> if (!input.hasNext()) {
> return false;
> }
> org.apache.spark.sql.execution.columnar.CachedBatch batch = 
> (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
> currentRow = 0;
> numRowsInBatch = batch.numRows();
> for (int i = 0; i < columnIndexes.length; i ++) {
> buffers[i] = batch.buffers()[columnIndexes[i]];
> }
> accessor = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[0]).order(nativeOrder));
> accessor1 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[1]).order(nativeOrder));
> accessor2 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[2]).order(nativeOrder));
> accessor3 = new 
> org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[3]).order(nativeOrder));
> accessor4 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[4]).order(nativeOrder));
> accessor5 = new 

[jira] [Updated] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21033:

Description: 
In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for pointer, 
1 `long` for key-prefix, and another 2 `long`s as the temporary buffer for 
radix sort.

In `UnsafeExternalSorter`, we set 
`DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to `1024 * 1024 * 1024 / 2`, 
hoping the max size of the pointer array to be 8 GB. However, this is wrong: 
`1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the pointer 
array before reaching this limitation, we may hit the max-page-size error.

Users may see an exception like this on large datasets:
{code}
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more 
than 17179869176 bytes
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
...
{code}

  was:
## What changes were proposed in this pull request?

In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for pointer, 
1 `long` for key-prefix, and another 2 `long`s as the temporary buffer for 
radix sort.

In `UnsafeExternalSorter`, we set the 
`DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to be `1024 * 1024 * 1024 / 2`, and 
hoping the max size of point array to be 8 GB. However this is wrong, `1024 * 
1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the point array before 
reach this limitation, we may hit the max-page-size error.

Users may see exception like this on large dataset:
{code}
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more 
than 17179869176 bytes
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
...
{code}

Setting `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to a smaller number is not 
enough, users can still set the config to a big number and trigger the too 
large page size issue. This PR fixes it by explicitly handling the too large 
page size exception in the sorter and spill.

This PR also change the type of 
`spark.shuffle.spill.numElementsForceSpillThreshold` to int, because it's only 
compared with `numRecords`, which is an int. This is an internal conf so we 
don't have a serious compatibility issue.

## How was this patch tested?

TODO


> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for 
> pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary 
> buffer for radix sort.
> In `UnsafeExternalSorter`, we set 
> `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to `1024 * 1024 * 1024 / 2`, 
> hoping the max size of the pointer array to be 8 GB. However, this is wrong: 
> `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the pointer 
> array before reaching this limitation, we may hit the max-page-size error.
> Users may see an exception like this on large datasets:
> {code}
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
> ...
> {code}
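For reference, a quick check of the 8 GB vs. 16 GB arithmetic from the description (illustrative only):

{code}
// 32 bytes of pointer-array space per record: 1 long for the record pointer,
// 1 long for the key prefix, and 2 more longs as the radix-sort buffer.
val threshold = 1024L * 1024 * 1024 / 2   // DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD
println(threshold * 32L)                  // 17179869184 bytes = 16 GB, not 8 GB
{code}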




[jira] [Updated] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21033:

Description: 
## What changes were proposed in this pull request?

In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for pointer, 
1 `long` for key-prefix, and another 2 `long`s as the temporary buffer for 
radix sort.

In `UnsafeExternalSorter`, we set 
`DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to `1024 * 1024 * 1024 / 2`, 
hoping the max size of the pointer array to be 8 GB. However, this is wrong: 
`1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the pointer 
array before reaching this limitation, we may hit the max-page-size error.

Users may see an exception like this on large datasets:
{code}
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more 
than 17179869176 bytes
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
...
{code}

Setting `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to a smaller number is not 
enough; users can still set the config to a big number and trigger the 
too-large-page-size issue. This PR fixes it by explicitly handling the 
too-large-page-size exception in the sorter and spilling.

This PR also changes the type of 
`spark.shuffle.spill.numElementsForceSpillThreshold` to int, because it's only 
compared with `numRecords`, which is an int. This is an internal conf, so we 
don't have a serious compatibility issue.

## How was this patch tested?

TODO

> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> ## What changes were proposed in this pull request?
> In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for 
> pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary 
> buffer for radix sort.
> In `UnsafeExternalSorter`, we set 
> `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to `1024 * 1024 * 1024 / 2`, 
> hoping the max size of the pointer array to be 8 GB. However, this is wrong: 
> `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the pointer 
> array before reaching this limitation, we may hit the max-page-size error.
> Users may see an exception like this on large datasets:
> {code}
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
> ...
> {code}
> Setting `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to a smaller number is not 
> enough; users can still set the config to a big number and trigger the 
> too-large-page-size issue. This PR fixes it by explicitly handling the 
> too-large-page-size exception in the sorter and spilling.
> This PR also changes the type of 
> `spark.shuffle.spill.numElementsForceSpillThreshold` to int, because it's 
> only compared with `numRecords`, which is an int. This is an internal conf, 
> so we don't have a serious compatibility issue.
> ## How was this patch tested?
> TODO






[jira] [Commented] (SPARK-22383) Generate code to directly get value of primitive type array from ColumnVector for table cache

2017-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223825#comment-16223825
 ] 

Apache Spark commented on SPARK-22383:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19601

> Generate code to directly get value of primitive type array from ColumnVector 
> for table cache
> -
>
> Key: SPARK-22383
> URL: https://issues.apache.org/jira/browse/SPARK-22383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This JIRA generates Java code that directly gets the value of a primitive-type 
> array from ColumnVector, without using an iterator, for table cache (e.g. 
> dataframe.cache). It improves runtime performance by eliminating the data copy 
> from column-oriented storage to {{InternalRow}} in the 
> {{SpecificColumnarIterator}} for primitive types.






[jira] [Assigned] (SPARK-22383) Generate code to directly get value of primitive type array from ColumnVector for table cache

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22383:


Assignee: (was: Apache Spark)

> Generate code to directly get value of primitive type array from ColumnVector 
> for table cache
> -
>
> Key: SPARK-22383
> URL: https://issues.apache.org/jira/browse/SPARK-22383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This JIRA generates Java code that directly gets the value of a primitive-type 
> array from ColumnVector, without using an iterator, for table cache (e.g. 
> dataframe.cache). It improves runtime performance by eliminating the data copy 
> from column-oriented storage to {{InternalRow}} in the 
> {{SpecificColumnarIterator}} for primitive types.






[jira] [Assigned] (SPARK-22383) Generate code to directly get value of primitive type array from ColumnVector for table cache

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22383:


Assignee: Apache Spark

> Generate code to directly get value of primitive type array from ColumnVector 
> for table cache
> -
>
> Key: SPARK-22383
> URL: https://issues.apache.org/jira/browse/SPARK-22383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> This JIRA generates Java code that directly gets the value of a primitive-type 
> array from ColumnVector, without using an iterator, for table cache (e.g. 
> dataframe.cache). It improves runtime performance by eliminating the data copy 
> from column-oriented storage to {{InternalRow}} in the 
> {{SpecificColumnarIterator}} for primitive types.






[jira] [Created] (SPARK-22383) Generate code to directly get value of primitive type array from ColumnVector for table cache

2017-10-28 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22383:


 Summary: Generate code to directly get value of primitive type 
array from ColumnVector for table cache
 Key: SPARK-22383
 URL: https://issues.apache.org/jira/browse/SPARK-22383
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


This JIRA generates Java code that directly gets the value of a primitive-type 
array from ColumnVector, without using an iterator, for table cache (e.g. 
dataframe.cache). It improves runtime performance by eliminating the data copy 
from column-oriented storage to {{InternalRow}} in the 
{{SpecificColumnarIterator}} for primitive types.
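To make the intent concrete, a small self-contained sketch of the difference in plain Scala (not the generated Java; the trait is a hypothetical stand-in for a ColumnVector-like accessor):

{code}
trait DoubleColumn { def getDouble(i: Int): Double }  // stand-in for a columnar accessor

// Per-row path: each value is materialized into a row object before it is read.
def sumViaRowCopy(rows: Iterator[Array[Double]]): Double =
  rows.map(_(0)).sum

// Direct path: read the column storage in place, with no intermediate row copy.
def sumDirect(col: DoubleColumn, numRows: Int): Double = {
  var i = 0; var s = 0.0
  while (i < numRows) { s += col.getDouble(i); i += 1 }
  s
}
{code}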






[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-28 Thread Cheburakshu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223795#comment-16223795
 ] 

Cheburakshu commented on SPARK-22277:
-

I actually meant VectorIndexer or (StringIndexer + OneHotEncoder), but that
doesn't work either. I understand your point about using DecisionTrees to find the best
features, but that gives the features only after a pass, while ChiSqSelector
will help choose them before feeding them to the ML algorithm.




> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when the chi-square selector is used versus 
> using the features directly in the decision tree classifier. 
> In the code below, I have used the chi-square selector as a pass-through, but the 
> decision tree classifier is unable to process its output. It is able to process 
> the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
> (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
> (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
> (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input 
> # will be in the output as well.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>  outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> try:
>     # Fails
>     dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except:
>     print(sys.exc_info())
> #Works
> dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
> labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> +--+---+--+
> 

[jira] [Updated] (SPARK-22382) Spark on mesos: doesn't support public IP setup for agent and master.

2017-10-28 Thread DUC LIEM NGUYEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DUC LIEM NGUYEN updated SPARK-22382:

Description: 
I've installed a system as follows:

--mesos master: private IP 10.x.x.2, public IP 35.x.x.6

--mesos slave: private IP 192.x.x.10, public IP 111.x.x.2

Now the master assigned the task successfully to the slave; however, the task 
failed. The error message is as follows:

{color:#d04437}Exception in thread "main" 17/10/11 22:38:01 ERROR 
RpcOutboxMessage: Ask timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
{color}

When I look at the environment page, spark.driver.host points to the private IP 
address of the master (10.x.x.2) instead of its public IP address (35.x.x.6). I 
looked at the Wireshark capture and indeed, there were failed TCP packets to the 
master's private IP address.

Now if I set spark.driver.bindAddress on the master to its local IP address and 
spark.driver.host on the master to its public IP address, I get the following 
message:

{color:#d04437}ERROR TaskSchedulerImpl: Lost executor 1 on 
myhostname.singnet.com.sg: Unable to create executor due to Cannot assign 
requested address.{color}

From my understanding, spark.driver.bindAddress is set for both master and 
slave, hence the slave gets the said error. Now I'm really wondering how do I 
properly set up Spark to work on this cluster over public IPs?

  was:
I've installed a system as followed:

--mesos master private IP of 10.x.x.2 , Public 35.x.x.6

--mesos slave private IP of 192.x.x.10, Public 111.x.x.2

Now the master assigned the task successfully to the slave, however, the task 
failed. The error message is as followed:

{color:#d04437}Exception in thread "main" 17/10/11 22:38:01 ERROR 
RpcOutboxMessage: Ask timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
{color}
When I look at the environment, the spark.driver.host points to the private IP 
address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I look 
at the Wireshark capture and indeed, there was failed TCP package to the master 
private IP address.

Now if I set spark.driver.bindAddress from the master to its local IP address, 
spark.driver.host from the master to its public IP address, I get the following 
message.

{color:#d04437}ERROR TaskSchedulerImpl: Lost executor 1 on 
myhostname.singnet.com.sg: Unable to create executor due to Cannot assign 
requested address.{color}

>From my understanding, the spark.driver.bindAddress set it for both master and 
>slave, hence the slave get the said error. Now I'm really wondering how do I 
>proper setup spark to work on this clustering over public IP?


> Spark on mesos: doesn't support public IP setup for agent and master. 
> --
>
> Key: SPARK-22382
> URL: https://issues.apache.org/jira/browse/SPARK-22382
> Project: Spark
>  Issue Type: Question
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: DUC LIEM NGUYEN
>
> I've installed a system as follows:
> --mesos master: private IP 10.x.x.2, public IP 35.x.x.6
> --mesos slave: private IP 192.x.x.10, public IP 111.x.x.2
> Now the master assigned the task successfully to the slave; however, the task 
> failed. The error message is as follows:
> {color:#d04437}Exception in thread "main" 17/10/11 22:38:01 ERROR 
> RpcOutboxMessage: Ask timeout before connecting successfully
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
> in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> {color}
> When I look at the environment page, spark.driver.host points to the private 
> IP address of the master (10.x.x.2) instead of its public IP address 
> (35.x.x.6). I looked at the Wireshark capture and indeed, there were failed 
> TCP packets to the master's private IP address.
> Now if I set spark.driver.bindAddress on the master to its local IP address 
> and spark.driver.host on the master to its public IP address, I get the 
> following message:
> {color:#d04437}ERROR TaskSchedulerImpl: Lost executor 1 on 
> myhostname.singnet.com.sg: Unable to create executor due to Cannot assign 
> requested address.{color}
> From my understanding, spark.driver.bindAddress is set for both master and 
> slave, hence the slave gets the said error. Now I'm really wondering how do I 
> properly set up Spark to work on this cluster over public IPs?
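For context, a hedged sketch of the two driver-side settings being discussed (addresses are the placeholders from the report; whether this combination can work across NAT on Mesos is the open question of this ticket):

{code}
import org.apache.spark.SparkConf

// bindAddress: the local interface the driver binds its listening sockets to.
// host: the address advertised to executors for connecting back to the driver.
val conf = new SparkConf()
  .set("spark.driver.bindAddress", "10.x.x.2")  // private address on the driver machine
  .set("spark.driver.host", "35.x.x.6")         // public address reachable by the agents
{code}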




[jira] [Updated] (SPARK-22382) Spark on mesos: doesn't support public IP setup for agent and master.

2017-10-28 Thread DUC LIEM NGUYEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DUC LIEM NGUYEN updated SPARK-22382:

Affects Version/s: (was: 2.1.1)
   2.1.0
  Description: 
I've installed a system as follows:

--mesos master: private IP 10.x.x.2, public IP 35.x.x.6

--mesos slave: private IP 192.x.x.10, public IP 111.x.x.2

Now the master assigned the task successfully to the slave; however, the task 
failed. The error message is as follows:

{color:#d04437}Exception in thread "main" 17/10/11 22:38:01 ERROR 
RpcOutboxMessage: Ask timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
{color}
When I look at the environment, spark.driver.host points to the private IP 
address of the master (10.x.x.2) instead of its public IP address (35.x.x.6). I 
looked at the Wireshark capture and indeed, there were failed TCP packets to the 
master's private IP address.

Now if I set spark.driver.bindAddress on the master to its local IP address and 
spark.driver.host on the master to its public IP address, I get the following 
message:

{color:#d04437}ERROR TaskSchedulerImpl: Lost executor 1 on 
myhostname.singnet.com.sg: Unable to create executor due to Cannot assign 
requested address.{color}

From my understanding, spark.driver.bindAddress is set for both master and 
slave, hence the slave gets the said error. Now I'm really wondering how do I 
properly set up Spark to work on this cluster over public IPs?

  was:
I've installed a system as followed:

--mesos master private IP of 10.x.x.2 , Public 35.x.x.6

--mesos slave private IP of 192.x.x.10, Public 111.x.x.2

Now the master assigned the task successfully to the slave, however, the task 
failed. The error message is as followed:

{color:#d04437}{{Exception in thread "main" 17/10/11 22:38:01 ERROR 
RpcOutboxMessage: Ask timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
}}{color}
When I look at the environment, the spark.driver.host points to the private IP 
address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I look 
at the Wireshark capture and indeed, there was failed TCP package to the master 
private IP address.

Now if I set spark.driver.bindAddress from the master to its local IP address, 
spark.driver.host from the master to its public IP address, I get the following 
message.

{{ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: Unable 
to create executor due to Cannot assign requested address.}}

>From my understanding, the spark.driver.bindAddress set it for both master and 
>slave, hence the slave get the said error. Now I'm really wondering how do I 
>proper setup spark to work on this clustering over public IP?


> Spark on mesos: doesn't support public IP setup for agent and master. 
> --
>
> Key: SPARK-22382
> URL: https://issues.apache.org/jira/browse/SPARK-22382
> Project: Spark
>  Issue Type: Question
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: DUC LIEM NGUYEN
>
> I've installed a system as followed:
> --mesos master private IP of 10.x.x.2 , Public 35.x.x.6
> --mesos slave private IP of 192.x.x.10, Public 111.x.x.2
> Now the master assigned the task successfully to the slave, however, the task 
> failed. The error message is as followed:
> {color:#d04437}Exception in thread "main" 17/10/11 22:38:01 ERROR 
> RpcOutboxMessage: Ask timeout before connecting successfully
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
> in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> {color}
> When I look at the environment, the spark.driver.host points to the private 
> IP address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I 
> look at the Wireshark capture and indeed, there was failed TCP package to the 
> master private IP address.
> Now if I set spark.driver.bindAddress from the master to its local IP 
> address, spark.driver.host from the master to its public IP address, I get 
> the following message.
> {color:#d04437}ERROR TaskSchedulerImpl: Lost executor 1 on 
> myhostname.singnet.com.sg: Unable to create executor due to Cannot assign 
> requested address.{color}
> From my understanding, the spark.driver.bindAddress set it for both master 
> and slave, hence the slave get the said error. Now I'm really wondering how 
> do I proper setup spark to work on this clustering over public IP?





[jira] [Updated] (SPARK-22382) Spark on mesos: doesn't support public IP setup for agent and master.

2017-10-28 Thread DUC LIEM NGUYEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DUC LIEM NGUYEN updated SPARK-22382:

Description: 
I've installed a system as follows:

--mesos master: private IP 10.x.x.2, public IP 35.x.x.6

--mesos slave: private IP 192.x.x.10, public IP 111.x.x.2

Now the master assigned the task successfully to the slave; however, the task 
failed. The error message is as follows:

{color:#d04437}{{Exception in thread "main" 17/10/11 22:38:01 ERROR 
RpcOutboxMessage: Ask timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
}}{color}
When I look at the environment, spark.driver.host points to the private IP 
address of the master (10.x.x.2) instead of its public IP address (35.x.x.6). I 
looked at the Wireshark capture and indeed, there were failed TCP packets to the 
master's private IP address.

Now if I set spark.driver.bindAddress on the master to its local IP address and 
spark.driver.host on the master to its public IP address, I get the following 
message:

{{ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: Unable 
to create executor due to Cannot assign requested address.}}

From my understanding, spark.driver.bindAddress is set for both master and 
slave, hence the slave gets the said error. Now I'm really wondering how do I 
properly set up Spark to work on this cluster over public IPs?

  was:
I've installed a system as followed:

--mesos master private IP of 10.x.x.2 , Public 35.x.x.6

--mesos slave private IP of 192.x.x.10, Public 111.x.x.2

Now the master assigned the task successfully to the slave, however, the task 
failed. The error message is as followed:

{{Exception in thread "main" 17/10/11 22:38:01 ERROR RpcOutboxMessage: Ask 
timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout}}

When I look at the environment, the spark.driver.host points to the private IP 
address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I look 
at the Wireshark capture and indeed, there was failed TCP package to the master 
private IP address.

Now if I set spark.driver.bindAddress from the master to its local IP address, 
spark.driver.host from the master to its public IP address, I get the following 
message.

{{ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: Unable 
to create executor due to Cannot assign requested address.}}

>From my understanding, the spark.driver.bindAddress set it for both master and 
>slave, hence the slave get the said error. Now I'm really wondering how do I 
>proper setup spark to work on this clustering over public IP?


> Spark on mesos: doesn't support public IP setup for agent and master. 
> --
>
> Key: SPARK-22382
> URL: https://issues.apache.org/jira/browse/SPARK-22382
> Project: Spark
>  Issue Type: Question
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: DUC LIEM NGUYEN
>
> I've installed a system as followed:
> --mesos master private IP of 10.x.x.2 , Public 35.x.x.6
> --mesos slave private IP of 192.x.x.10, Public 111.x.x.2
> Now the master assigned the task successfully to the slave, however, the task 
> failed. The error message is as followed:
> {color:#d04437}{{Exception in thread "main" 17/10/11 22:38:01 ERROR 
> RpcOutboxMessage: Ask timeout before connecting successfully
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
> in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> }}{color}
> When I look at the environment, the spark.driver.host points to the private 
> IP address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I 
> look at the Wireshark capture and indeed, there was failed TCP package to the 
> master private IP address.
> Now if I set spark.driver.bindAddress from the master to its local IP 
> address, spark.driver.host from the master to its public IP address, I get 
> the following message.
> {{ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: 
> Unable to create executor due to Cannot assign requested address.}}
> From my understanding, the spark.driver.bindAddress set it for both master 
> and slave, hence the slave get the said error. Now I'm really wondering how 
> do I proper setup spark to work on this clustering over public IP?




[jira] [Updated] (SPARK-22382) Spark on mesos: doesn't support public IP setup for agent and master.

2017-10-28 Thread DUC LIEM NGUYEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DUC LIEM NGUYEN updated SPARK-22382:

Description: 
I've installed a system as follows:

--mesos master: private IP 10.x.x.2, public IP 35.x.x.6

--mesos slave: private IP 192.x.x.10, public IP 111.x.x.2

Now the master assigned the task successfully to the slave; however, the task 
failed. The error message is as follows:

{{Exception in thread "main" 17/10/11 22:38:01 ERROR RpcOutboxMessage: Ask 
timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout}}

When I look at the environment, spark.driver.host points to the private IP 
address of the master (10.x.x.2) instead of its public IP address (35.x.x.6). I 
looked at the Wireshark capture and indeed, there were failed TCP packets to the 
master's private IP address.

Now if I set spark.driver.bindAddress on the master to its local IP address and 
spark.driver.host on the master to its public IP address, I get the following 
message:

{{ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: Unable 
to create executor due to Cannot assign requested address.}}

From my understanding, spark.driver.bindAddress is set for both master and 
slave, hence the slave gets the said error. Now I'm really wondering how do I 
properly set up Spark to work on this cluster over public IPs?

  was:
I've installed a system as followed:

--mesos master private IP of 10.x.x.2 , Public 35.x.x.6

--mesos slave private IP of 192.x.x.10, Public 111.x.x.2

Now the master assigned the task successfully to the slave, however, the task 
failed. The error message is as followed:

Exception in thread "main" 17/10/11 22:38:01 ERROR RpcOutboxMessage: Ask 
timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
When I look at the environment, the spark.driver.host points to the private IP 
address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I look 
at the Wireshark capture and indeed, there was failed TCP package to the master 
private IP address.

Now if I set spark.driver.bindAddress from the master to its local IP address, 
spark.driver.host from the master to its public IP address, I get the following 
message.

ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: Unable 
to create executor due to Cannot assign requested address.

>From my understanding, the spark.driver.bindAddress set it for both master and 
>slave, hence the slave get the said error. Now I'm really wondering how do I 
>proper setup spark to work on this clustering over public IP?


> Spark on mesos: doesn't support public IP setup for agent and master. 
> --
>
> Key: SPARK-22382
> URL: https://issues.apache.org/jira/browse/SPARK-22382
> Project: Spark
>  Issue Type: Question
>  Components: Mesos
>Affects Versions: 2.1.1
>Reporter: DUC LIEM NGUYEN
>
> I've installed a system as followed:
> --mesos master private IP of 10.x.x.2 , Public 35.x.x.6
> --mesos slave private IP of 192.x.x.10, Public 111.x.x.2
> Now the master assigned the task successfully to the slave, however, the task 
> failed. The error message is as followed:
> {{Exception in thread "main" 17/10/11 22:38:01 ERROR RpcOutboxMessage: Ask 
> timeout before connecting successfully
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
> in 120 seconds. This timeout is controlled by spark.rpc.askTimeout}}
> When I look at the environment, the spark.driver.host points to the private 
> IP address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I 
> look at the Wireshark capture and indeed, there was failed TCP package to the 
> master private IP address.
> Now if I set spark.driver.bindAddress from the master to its local IP 
> address, spark.driver.host from the master to its public IP address, I get 
> the following message.
> {{ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: 
> Unable to create executor due to Cannot assign requested address.}}
> From my understanding, the spark.driver.bindAddress set it for both master 
> and slave, hence the slave get the said error. Now I'm really wondering how 
> do I proper setup spark to work on this clustering over public IP?






[jira] [Updated] (SPARK-22382) Spark on mesos: doesn't support public IP setup for agent and master.

2017-10-28 Thread DUC LIEM NGUYEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DUC LIEM NGUYEN updated SPARK-22382:

Description: 
I've installed a system as follows:

--mesos master: private IP 10.x.x.2, public IP 35.x.x.6

--mesos slave: private IP 192.x.x.10, public IP 111.x.x.2

Now the master assigned the task successfully to the slave; however, the task 
failed. The error message is as follows:

Exception in thread "main" 17/10/11 22:38:01 ERROR RpcOutboxMessage: Ask 
timeout before connecting successfully

Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
When I look at the environment, spark.driver.host points to the private IP 
address of the master (10.x.x.2) instead of its public IP address (35.x.x.6). I 
looked at the Wireshark capture and indeed, there were failed TCP packets to the 
master's private IP address.

Now if I set spark.driver.bindAddress on the master to its local IP address and 
spark.driver.host on the master to its public IP address, I get the following 
message:

ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: Unable 
to create executor due to Cannot assign requested address.

From my understanding, spark.driver.bindAddress is set for both master and 
slave, hence the slave gets the said error. Now I'm really wondering how do I 
properly set up Spark to work on this cluster over public IPs?

> Spark on mesos: doesn't support public IP setup for agent and master. 
> --
>
> Key: SPARK-22382
> URL: https://issues.apache.org/jira/browse/SPARK-22382
> Project: Spark
>  Issue Type: Question
>  Components: Mesos
>Affects Versions: 2.1.1
>Reporter: DUC LIEM NGUYEN
>
> I've installed a system as followed:
> --mesos master private IP of 10.x.x.2 , Public 35.x.x.6
> --mesos slave private IP of 192.x.x.10, Public 111.x.x.2
> Now the master assigned the task successfully to the slave, however, the task 
> failed. The error message is as followed:
> Exception in thread "main" 17/10/11 22:38:01 ERROR RpcOutboxMessage: Ask 
> timeout before connecting successfully
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
> in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> When I look at the environment, the spark.driver.host points to the private 
> IP address of the master 10.x.x.2 instead of it public IP address 35.x.x.6. I 
> look at the Wireshark capture and indeed, there was failed TCP package to the 
> master private IP address.
> Now if I set spark.driver.bindAddress from the master to its local IP 
> address, spark.driver.host from the master to its public IP address, I get 
> the following message.
> ERROR TaskSchedulerImpl: Lost executor 1 on myhostname.singnet.com.sg: Unable 
> to create executor due to Cannot assign requested address.
> From my understanding, the spark.driver.bindAddress set it for both master 
> and slave, hence the slave get the said error. Now I'm really wondering how 
> do I proper setup spark to work on this clustering over public IP?






[jira] [Created] (SPARK-22382) Spark on mesos: doesn't support public IP setup for agent and master.

2017-10-28 Thread DUC LIEM NGUYEN (JIRA)
DUC LIEM NGUYEN created SPARK-22382:
---

 Summary: Spark on mesos: doesn't support public IP setup for agent 
and master. 
 Key: SPARK-22382
 URL: https://issues.apache.org/jira/browse/SPARK-22382
 Project: Spark
  Issue Type: Question
  Components: Mesos
Affects Versions: 2.1.1
Reporter: DUC LIEM NGUYEN






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22381) Add StringParam that supports valid options

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22381:


Assignee: (was: Apache Spark)

> Add StringParam that supports valid options
> ---
>
> Key: SPARK-22381
> URL: https://issues.apache.org/jira/browse/SPARK-22381
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> During testing with https://issues.apache.org/jira/browse/SPARK-22331, I found 
> it might be a good idea to include the possible options in a StringParam.
> A StringParam extends Param[String] and allows users to specify the valid 
> options as an Array[String] (case insensitive).
> So far it can help achieve three goals:
> 1. Make the StringParam aware of its possible options and support native 
> validation.
> 2. StringParam can list the supported options when a user inputs a wrong value.
> 3. Allow automatic unit test coverage for case-insensitive String params.
> IMO it also decreases code redundancy.
> The StringParam is designed to be completely compatible with the existing 
> Param[String], just adding the extra logic for supporting options, which 
> means we don't need to convert every Param[String] to StringParam until we 
> feel comfortable doing so.
> 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22381) Add StringParam that supports valid options

2017-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223714#comment-16223714
 ] 

Apache Spark commented on SPARK-22381:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/19599

> Add StringParam that supports valid options
> ---
>
> Key: SPARK-22381
> URL: https://issues.apache.org/jira/browse/SPARK-22381
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> During testing with https://issues.apache.org/jira/browse/SPARK-22331, I found 
> it might be a good idea to include the possible options in a StringParam.
> A StringParam extends Param[String] and allows users to specify the valid 
> options as an Array[String] (case insensitive).
> So far it can help achieve three goals:
> 1. Make the StringParam aware of its possible options and support native 
> validation.
> 2. StringParam can list the supported options when a user inputs a wrong value.
> 3. Allow automatic unit test coverage for case-insensitive String params.
> IMO it also decreases code redundancy.
> The StringParam is designed to be completely compatible with the existing 
> Param[String], just adding the extra logic for supporting options, which 
> means we don't need to convert every Param[String] to StringParam until we 
> feel comfortable doing so.
> 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22381) Add StringParam that supports valid options

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22381:


Assignee: Apache Spark

> Add StringParam that supports valid options
> ---
>
> Key: SPARK-22381
> URL: https://issues.apache.org/jira/browse/SPARK-22381
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>
> During testing with https://issues.apache.org/jira/browse/SPARK-22331, I found 
> it might be a good idea to include the possible options in a StringParam.
> A StringParam extends Param[String] and allows users to specify the valid 
> options as an Array[String] (case insensitive).
> So far it can help achieve three goals:
> 1. Make the StringParam aware of its possible options and support native 
> validation.
> 2. StringParam can list the supported options when a user inputs a wrong value.
> 3. Allow automatic unit test coverage for case-insensitive String params.
> IMO it also decreases code redundancy.
> The StringParam is designed to be completely compatible with the existing 
> Param[String], just adding the extra logic for supporting options, which 
> means we don't need to convert every Param[String] to StringParam until we 
> feel comfortable doing so.
> 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22381) Add StringParam that supports valid options

2017-10-28 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22381:
--

 Summary: Add StringParam that supports valid options
 Key: SPARK-22381
 URL: https://issues.apache.org/jira/browse/SPARK-22381
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


During testing with https://issues.apache.org/jira/browse/SPARK-22331, I found it 
might be a good idea to include the possible options in a StringParam.

A StringParam extends Param[String] and allows users to specify the valid options 
as an Array[String] (case insensitive).

So far it can help achieve three goals:
1. Make the StringParam aware of its possible options and support native 
validation.
2. StringParam can list the supported options when a user inputs a wrong value.
3. Allow automatic unit test coverage for case-insensitive String params.

IMO it also decreases code redundancy.

The StringParam is designed to be completely compatible with the existing 
Param[String], just adding the extra logic for supporting options, which means 
we don't need to convert every Param[String] to StringParam until we feel 
comfortable doing so.
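
A minimal sketch of the idea (illustrative only, not the actual PR code; the 
constructor shape and the helper method are assumptions):

{code}
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.util.Identifiable

// Sketch: a string Param that validates its value against a fixed,
// case-insensitive set of options and can report them in error messages.
class StringParam(parent: Identifiable, name: String, doc: String, val options: Array[String])
  extends Param[String](parent, name, doc,
    (value: String) => options.exists(_.equalsIgnoreCase(value))) {

  // Supported options, e.g. for building a helpful error message.
  def supportedOptions: String = options.mkString(", ")
}
{code}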





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0

2017-10-28 Thread Maziyar PANAHI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223700#comment-16223700
 ] 

Maziyar PANAHI commented on SPARK-22380:


Hi Sean,

Thanks for your reply. In fact I am looking more for a workaround for this issue 
than for upgrading this dependency in Hadoop, since the current version is 
compatible with all the builds, as you said.

Could you show me how to shade dependencies for Hadoop? Or how to add version 
3.4 and ignore 2.5 inside the Spark app? I am using the Cloudera distribution of 
Spark 2, to be precise.

Thanks,
Maziyar

> Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
> ---
>
> Key: SPARK-22380
> URL: https://issues.apache.org/jira/browse/SPARK-22380
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Deploy
>Affects Versions: 1.6.1, 2.2.0
> Environment: Cloudera 5.13.x
> Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
> And anything beyond Spark 2.2.0
>Reporter: Maziyar PANAHI
>Priority: Blocker
>
> Hi,
> This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and 
> 2.2+) due to an incompatibility between the protobuf (com.google.protobuf) 
> version used by Spark and the one used in the latest Stanford CoreNLP (3.8). 
> The version of protobuf has been set to 2.5.0 in the global properties, and 
> this is stated in the pom.xml file.
> The error that refers to this dependency:
> {code:java}
> java.lang.VerifyError: Bad type on operand stack
> Exception Details:
>   Location:
> 
> com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object;
>  @3: invokevirtual
>   Reason:
> Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current 
> frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
>   Current Frame:
> bci: @3
> flags: { }
> locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
> stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
>   Bytecode:
> 0x000: 2a2b 1cb6 0024 b0
>   at edu.stanford.nlp.simple.Document.(Document.java:433)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:118)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:126)
>   ... 56 elided
> {code}
> Is it possible to upgrade this dependency to the latest (3.4) or any 
> workaround besides manually removing protobuf-java-2.5.0.jar and adding 
> protobuf-java-3.4.0.jar?
> You can follow the discussion of how this upgrade would fix the issue:
> https://github.com/stanfordnlp/CoreNLP/issues/556
> Many thanks,
> Maziyar



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22055) Port release scripts

2017-10-28 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223229#comment-16223229
 ] 

Xin Lu edited comment on SPARK-22055 at 10/28/17 6:30 PM:
--

[~holdenk] [~joshrosen] do you guys need help with this? Is this all the JJB 
code Josh has in databricks/spark?


was (Author: xynny):
[~holdenk] [~joshrosen] do you guys need help with this?  

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1, 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0

2017-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223688#comment-16223688
 ] 

Sean Owen commented on SPARK-22380:
---

The problem is that the tightly-coupled Hadoop dependencies depend on a 2.x 
version, as far as I know. You'd have to research what's compatible with 
Hadoop. You can always shade your dependencies.
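
For example, if the newer protobuf classes live in your application jar, an 
sbt-assembly shading rule along these lines (a sketch; it assumes the 
sbt-assembly plugin, and "shadeproto" is an arbitrary package name) relocates 
them so they don't clash with the 2.5.0 classes on the cluster:

{code}
// build.sbt sketch (requires the sbt-assembly plugin)
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll
)
{code}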

> Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
> ---
>
> Key: SPARK-22380
> URL: https://issues.apache.org/jira/browse/SPARK-22380
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Deploy
>Affects Versions: 1.6.1, 2.2.0
> Environment: Cloudera 5.13.x
> Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
> And anything beyond Spark 2.2.0
>Reporter: Maziyar PANAHI
>Priority: Blocker
>
> Hi,
> This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and 
> 2.2+) due to an incompatibility between the protobuf (com.google.protobuf) 
> version used by Spark and the one used in the latest Stanford CoreNLP (3.8). 
> The version of protobuf has been set to 2.5.0 in the global properties, and 
> this is stated in the pom.xml file.
> The error that refers to this dependency:
> {code:java}
> java.lang.VerifyError: Bad type on operand stack
> Exception Details:
>   Location:
> 
> com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object;
>  @3: invokevirtual
>   Reason:
> Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current 
> frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
>   Current Frame:
> bci: @3
> flags: { }
> locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
> stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
>   Bytecode:
> 0x000: 2a2b 1cb6 0024 b0
>   at edu.stanford.nlp.simple.Document.(Document.java:433)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:118)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:126)
>   ... 56 elided
> {code}
> Is it possible to upgrade this dependency to the latest (3.4) or any 
> workaround besides manually removing protobuf-java-2.5.0.jar and adding 
> protobuf-java-3.4.0.jar?
> You can follow the discussion of how this upgrade would fix the issue:
> https://github.com/stanfordnlp/CoreNLP/issues/556
> Many thanks,
> Maziyar



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22378) Redundant nullcheck is generated for extracting value in complex types

2017-10-28 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22378:
-
Description: Redundant null check is generated in the code for extracting 
an element from complex types {{GetArrayItem}}, {{GetMapValue}}, and 
{{GetArrayStructFields}}. Since these code generations do not take care of 
{{nullable}} in {{DataType}} such as {{ArrayType}}, the generated code always 
has {{isNullAt(index)}}.
Summary: Redundant nullcheck is generated for extracting value in 
complex types  (was: Eliminate redundant nullcheck in generated code for 
extracting value in complex type)

> Redundant nullcheck is generated for extracting value in complex types
> --
>
> Key: SPARK-22378
> URL: https://issues.apache.org/jira/browse/SPARK-22378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Redundant null check is generated in the code for extracting an element from 
> complex types {{GetArrayItem}}, {{GetMapValue}}, and 
> {{GetArrayStructFields}}. Since these code generations do not take care of 
> {{nullable}} in {{DataType}} such as {{ArrayType}}, the generated code always 
> has {{isNullAt(index)}}.
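
A quick way to observe this in spark-shell (a sketch, not part of the ticket):

{code}
// Inspect the generated code for an array element lookup (GetArrayItem).
import org.apache.spark.sql.execution.debug._

val df = Seq(Seq(1, 2, 3)).toDF("a").selectExpr("a[1]")
df.debugCodegen()   // look for the isNullAt(...) check emitted for the element access
{code}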



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22370) Config values should be captured in Driver.

2017-10-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22370:
---

Assignee: Takuya Ueshin

> Config values should be captured in Driver.
> ---
>
> Key: SPARK-22370
> URL: https://issues.apache.org/jira/browse/SPARK-22370
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.3.0
>
>
> {{ArrowEvalPythonExec}} and {{FlatMapGroupsInPandasExec}} are referring to config 
> values of {{SQLConf}} inside the functions passed to 
> {{mapPartitions}}/{{mapPartitionsInternal}}, but we should capture them in the 
> Driver.
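
The general pattern is sketched below (illustrative only, not the actual Spark 
internals): read the config value once on the driver so the closure captures a 
plain value instead of touching {{SQLConf}} on the executors.

{code}
import org.apache.spark.sql.SparkSession

object CaptureConfOnDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("capture-conf").getOrCreate()

    // Evaluated on the driver; the closure below captures the plain String value.
    val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")

    val out = spark.sparkContext.parallelize(1 to 3).mapPartitions { iter =>
      iter.map(i => s"record $i, captured spark.sql.shuffle.partitions = $shufflePartitions")
    }
    out.collect().foreach(println)
    spark.stop()
  }
}
{code}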



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22370) Config values should be captured in Driver.

2017-10-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22370.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19587
[https://github.com/apache/spark/pull/19587]

> Config values should be captured in Driver.
> ---
>
> Key: SPARK-22370
> URL: https://issues.apache.org/jira/browse/SPARK-22370
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
> Fix For: 2.3.0
>
>
> {{ArrowEvalPythonExec}} and {{FlatMapGroupsInPandasExec}} are referring to config 
> values of {{SQLConf}} inside the functions passed to 
> {{mapPartitions}}/{{mapPartitionsInternal}}, but we should capture them in the 
> Driver.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity

2017-10-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223667#comment-16223667
 ] 

Wenchen Fan commented on SPARK-21158:
-

I think this is a reasonable feature request, i.e. making 
{{Catalog.listTables}} case preserving. However, it requires changing how Spark 
SQL implements case sensitivity, which is a really big change. I'd like to mark 
this ticket as "later" because the benefit is small here and we may not have 
time to do it in the near future. Any objections? cc [~smilegator] [~srowen]

> SparkSQL function SparkSession.Catalog.ListTables() does not handle spark 
> setting for case-sensitivity
> --
>
> Key: SPARK-21158
> URL: https://issues.apache.org/jira/browse/SPARK-21158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Windows 10
> IntelliJ 
> Scala
>Reporter: Kathryn McClintic
>Priority: Minor
>  Labels: easyfix, features, sparksql, windows
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When working with SQL table names in Spark SQL we have noticed some issues 
> with case-sensitivity.
> If you set the spark.sql.caseSensitive setting to true, Spark SQL stores the 
> table names the way they were provided. This is correct.
> If you set the spark.sql.caseSensitive setting to false, Spark SQL stores the 
> table names in lower case.
> Then, we use the function sqlContext.tableNames() to get all the tables in 
> our DB. We check whether this list contains(<"string of table name">) to 
> determine if we have already created a table. If case sensitivity is turned 
> off (false), this check should find the table name in the table list 
> regardless of case.
> However, it only looks for names that match the lower-case version of the 
> stored table. Therefore, if you pass in a camel-case or upper-case table name, 
> this function returns false when in fact the table does exist.
> The root cause of this issue is in the function 
> SparkSession.Catalog.ListTables()
> For example:
> In your SQL context you have five tables, and you have chosen 
> spark.sql.caseSensitive=false, so it stores your tables in lowercase: 
> carnames
> carmodels
> carnamesandmodels
> users
> dealerlocations
> When running your pipeline, you want to see if you have already created the 
> temp join table of 'carnamesandmodels'. However, you have stored it as a 
> constant which reads: CarNamesAndModels for readability.
> So you can use the function
> sqlContext.tableNames().contains("CarNamesAndModels").
> This should return true - because we know it's already created, but it will 
> currently return false since CarNamesAndModels is not in lowercase.
> The responsibility of lowercasing the name passed into the .contains method 
> should not be put on the Spark user. This should be done by Spark SQL if case 
> sensitivity is turned off.
> Proposed solutions:
> - Setting case sensitivity to false in the SQL context should make the SQL 
> context case agnostic without changing how the table name is stored
> - There should be a custom contains method for ListTables() which converts 
> the table name to lowercase before checking
> - SparkSession.Catalog.ListTables() should return the list of tables in the 
> input format instead of in all lowercase.
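
In the meantime, a user-side workaround (a spark-shell sketch) is to compare the 
names case-insensitively instead of relying on contains():

{code}
// Case-insensitive check for an existing table.
val wanted = "CarNamesAndModels"
val exists = spark.catalog.listTables().collect().exists(_.name.equalsIgnoreCase(wanted))
{code}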



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22378) Eliminate redundant nullcheck in generated code for extracting value in complex type

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22378:


Assignee: Apache Spark

> Eliminate redundant nullcheck in generated code for extracting value in 
> complex type
> 
>
> Key: SPARK-22378
> URL: https://issues.apache.org/jira/browse/SPARK-22378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22378) Eliminate redundant nullcheck in generated code for extracting value in complex type

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22378:


Assignee: (was: Apache Spark)

> Eliminate redundant nullcheck in generated code for extracting value in 
> complex type
> 
>
> Key: SPARK-22378
> URL: https://issues.apache.org/jira/browse/SPARK-22378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22378) Eliminate redundant nullcheck in generated code for extracting value in complex type

2017-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223651#comment-16223651
 ] 

Apache Spark commented on SPARK-22378:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19598

> Eliminate redundant nullcheck in generated code for extracting value in 
> complex type
> 
>
> Key: SPARK-22378
> URL: https://issues.apache.org/jira/browse/SPARK-22378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22375:


Assignee: Apache Spark

> Test script can fail if eggs are installed by setup.py during test process
> --
>
> Key: SPARK-22375
> URL: https://issues.apache.org/jira/browse/SPARK-22375
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6
>Reporter: Joel Croteau
>Assignee: Apache Spark
>Priority: Trivial
>
> Running ./dev/run-tests may install missing Python packages as part of its 
> setup process. setup.py can cache these in python/.eggs, and since the 
> lint-python script checks any file with the .py extension anywhere in the 
> Spark project, it will check files in .eggs and will fail if any of these do 
> not meet style criteria, even though they are not part of the project. 
> lint-spark should exclude python/.eggs from its search directories.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22375:


Assignee: (was: Apache Spark)

> Test script can fail if eggs are installed by setup.py during test process
> --
>
> Key: SPARK-22375
> URL: https://issues.apache.org/jira/browse/SPARK-22375
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6
>Reporter: Joel Croteau
>Priority: Trivial
>
> Running ./dev/run-tests may install missing Python packages as part of its 
> setup process. setup.py can cache these in python/.eggs, and since the 
> lint-python script checks any file with the .py extension anywhere in the 
> Spark project, it will check files in .eggs and will fail if any of these do 
> not meet style criteria, even though they are not part of the project. 
> lint-spark should exclude python/.eggs from its search directories.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process

2017-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223634#comment-16223634
 ] 

Apache Spark commented on SPARK-22375:
--

User 'xynny' has created a pull request for this issue:
https://github.com/apache/spark/pull/19597

> Test script can fail if eggs are installed by setup.py during test process
> --
>
> Key: SPARK-22375
> URL: https://issues.apache.org/jira/browse/SPARK-22375
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6
>Reporter: Joel Croteau
>Priority: Trivial
>
> Running ./dev/run-tests may install missing Python packages as part of its 
> setup process. setup.py can cache these in python/.eggs, and since the 
> lint-python script checks any file with the .py extension anywhere in the 
> Spark project, it will check files in .eggs and will fail if any of these do 
> not meet style criteria, even though they are not part of the project. 
> lint-spark should exclude python/.eggs from its search directories.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0

2017-10-28 Thread Maziyar PANAHI (JIRA)
Maziyar PANAHI created SPARK-22380:
--

 Summary: Upgrade protobuf-java (com.google.protobuf) version from 
2.5.0 to 3.4.0
 Key: SPARK-22380
 URL: https://issues.apache.org/jira/browse/SPARK-22380
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Deploy
Affects Versions: 2.2.0, 1.6.1
 Environment: Cloudera 5.13.x
Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
And anything beyond Spark 2.2.0
Reporter: Maziyar PANAHI
Priority: Blocker


Hi,

This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and 
2.2+) due to an incompatibility between the protobuf (com.google.protobuf) 
version used by Spark and the one used in the latest Stanford CoreNLP (3.8). 
The version of protobuf has been set to 2.5.0 in the global properties, and 
this is stated in the pom.xml file.

The error that refers to this dependency:

{code:java}
java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:

com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object;
 @3: invokevirtual
  Reason:
Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current 
frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
bci: @3
flags: { }
locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
0x000: 2a2b 1cb6 0024 b0

  at edu.stanford.nlp.simple.Document.(Document.java:433)
  at edu.stanford.nlp.simple.Sentence.(Sentence.java:118)
  at edu.stanford.nlp.simple.Sentence.(Sentence.java:126)
  ... 56 elided

{code}

Is it possible to upgrade this dependency to the latest (3.4) or any workaround 
besides manually removing protobuf-java-2.5.0.jar and adding 
protobuf-java-3.4.0.jar?

You can follow the discussion of how this upgrade would fix the issue:
https://github.com/stanfordnlp/CoreNLP/issues/556


Many thanks,
Maziyar





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22369) PySpark: Document methods of spark.catalog interface

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22369:


Assignee: Apache Spark

> PySpark: Document methods of spark.catalog interface
> 
>
> Key: SPARK-22369
> URL: https://issues.apache.org/jira/browse/SPARK-22369
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Assignee: Apache Spark
>
> The following methods from the {{spark.catalog}} interface are not documented.
> {code:java}
> $ pyspark
> >>> dir(spark.catalog)
> ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', 
> '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', 
> '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', 
> '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', 
> '__str__', '__subclasshook__', '__weakref__', '_jcatalog', '_jsparkSession', 
> '_reset', '_sparkSession', 'cacheTable', 'clearCache', 'createExternalTable', 
> 'createTable', 'currentDatabase', 'dropGlobalTempView', 'dropTempView', 
> 'isCached', 'listColumns', 'listDatabases', 'listFunctions', 'listTables', 
> 'recoverPartitions', 'refreshByPath', 'refreshTable', 'registerFunction', 
> 'setCurrentDatabase', 'uncacheTable']
> {code}
> As a user I would like to have these methods documented on 
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html . Old methods 
> of the SQLContext (e.g. {{pyspark.sql.SQLContext.cacheTable()}} vs. 
> {{pyspark.sql.SparkSession.catalog.cacheTable()}} or 
> {{pyspark.sql.HiveContext.refreshTable()}} vs. 
> {{pyspark.sql.SparkSession.catalog.refreshTable()}} ) should point to the new 
> method. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22369) PySpark: Document methods of spark.catalog interface

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22369:


Assignee: (was: Apache Spark)

> PySpark: Document methods of spark.catalog interface
> 
>
> Key: SPARK-22369
> URL: https://issues.apache.org/jira/browse/SPARK-22369
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>
> The following methods from the {{spark.catalog}} interface are not documented.
> {code:java}
> $ pyspark
> >>> dir(spark.catalog)
> ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', 
> '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', 
> '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', 
> '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', 
> '__str__', '__subclasshook__', '__weakref__', '_jcatalog', '_jsparkSession', 
> '_reset', '_sparkSession', 'cacheTable', 'clearCache', 'createExternalTable', 
> 'createTable', 'currentDatabase', 'dropGlobalTempView', 'dropTempView', 
> 'isCached', 'listColumns', 'listDatabases', 'listFunctions', 'listTables', 
> 'recoverPartitions', 'refreshByPath', 'refreshTable', 'registerFunction', 
> 'setCurrentDatabase', 'uncacheTable']
> {code}
> As a user I would like to have these methods documented on 
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html . Old methods 
> of the SQLContext (e.g. {{pyspark.sql.SQLContext.cacheTable()}} vs. 
> {{pyspark.sql.SparkSession.catalog.cacheTable()}} or 
> {{pyspark.sql.HiveContext.refreshTable()}} vs. 
> {{pyspark.sql.SparkSession.catalog.refreshTable()}} ) should point to the new 
> method. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22369) PySpark: Document methods of spark.catalog interface

2017-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223591#comment-16223591
 ] 

Apache Spark commented on SPARK-22369:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19596

> PySpark: Document methods of spark.catalog interface
> 
>
> Key: SPARK-22369
> URL: https://issues.apache.org/jira/browse/SPARK-22369
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>
> The following methods from the {{spark.catalog}} interface are not documented.
> {code:java}
> $ pyspark
> >>> dir(spark.catalog)
> ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', 
> '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', 
> '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', 
> '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', 
> '__str__', '__subclasshook__', '__weakref__', '_jcatalog', '_jsparkSession', 
> '_reset', '_sparkSession', 'cacheTable', 'clearCache', 'createExternalTable', 
> 'createTable', 'currentDatabase', 'dropGlobalTempView', 'dropTempView', 
> 'isCached', 'listColumns', 'listDatabases', 'listFunctions', 'listTables', 
> 'recoverPartitions', 'refreshByPath', 'refreshTable', 'registerFunction', 
> 'setCurrentDatabase', 'uncacheTable']
> {code}
> As a user I would like to have these methods documented on 
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html . Old methods 
> of the SQLContext (e.g. {{pyspark.sql.SQLContext.cacheTable()}} vs. 
> {{pyspark.sql.SparkSession.catalog.cacheTable()}} or 
> {{pyspark.sql.HiveContext.refreshTable()}} vs. 
> {{pyspark.sql.SparkSession.catalog.refreshTable()}} ) should point to the new 
> method. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22379:


Assignee: Apache Spark

> Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
> 
>
> Key: SPARK-22379
> URL: https://issues.apache.org/jira/browse/SPARK-22379
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> Looks like there is some duplication in sql/tests.py:
> {code}
> diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
> index 98afae662b4..6812da6b309 100644
> --- a/python/pyspark/sql/tests.py
> +++ b/python/pyspark/sql/tests.py
> @@ -179,6 +179,18 @@ class MyObject(object):
>  self.value = value
> +class ReusedSQLTestCase(ReusedPySparkTestCase):
> +@classmethod
> +def setUpClass(cls):
> +ReusedPySparkTestCase.setUpClass()
> +cls.spark = SparkSession(cls.sc)
> +
> +@classmethod
> +def tearDownClass(cls):
> +ReusedPySparkTestCase.tearDownClass()
> +cls.spark.stop()
> +
> +
>  class DataTypeTests(unittest.TestCase):
>  # regression test for SPARK-6055
>  def test_data_type_eq(self):
> @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
>  self.assertRaises(TypeError, struct_field.typeName)
> -class SQLTests(ReusedPySparkTestCase):
> +class SQLTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
>  os.unlink(cls.tempdir.name)
> -cls.spark = SparkSession(cls.sc)
>  cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
>  cls.df = cls.spark.createDataFrame(cls.testData)
>  @classmethod
>  def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  shutil.rmtree(cls.tempdir.name, ignore_errors=True)
>  def test_sqlcontext_reuses_sparksession(self):
> @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
>  self.assertTrue(os.path.exists(metastore_path))
> -class SQLTests2(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class SQLTests2(ReusedSQLTestCase):
>  # We can't include this test into SQLTests because we will stop class's 
> SparkContext and cause
>  # other tests failed.
> @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):
>  @unittest.skipIf(not _have_arrow, "Arrow not installed")
> -class ArrowTests(ReusedPySparkTestCase):
> +class ArrowTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
>  from datetime import datetime
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  # Synchronize default timezone between Python and Java
>  cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
> @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
>  os.environ["TZ"] = tz
>  time.tzset()
> -cls.spark = SparkSession(cls.sc)
>  cls.spark.conf.set("spark.sql.session.timeZone", tz)
>  cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>  cls.schema = StructType([
> @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  if cls.tz_prev is not None:
>  os.environ["TZ"] = cls.tz_prev
>  time.tzset()
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  def assertFramesEqual(self, df_with_arrow, df_without):
>  msg = ("DataFrame from Arrow is not equal" +
> @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
> installed")
> -class VectorizedUDFTests(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class VectorizedUDFTests(ReusedSQLTestCase):
>  def test_vectorized_udf_basic(self):
>  from pyspark.sql.functions import pandas_udf, col
> @@ -3478,16 +3466,7 @@ class 

[jira] [Commented] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223568#comment-16223568
 ] 

Apache Spark commented on SPARK-22379:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19595

> Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
> 
>
> Key: SPARK-22379
> URL: https://issues.apache.org/jira/browse/SPARK-22379
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Looks like there is some duplication in sql/tests.py:
> {code}
> diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
> index 98afae662b4..6812da6b309 100644
> --- a/python/pyspark/sql/tests.py
> +++ b/python/pyspark/sql/tests.py
> @@ -179,6 +179,18 @@ class MyObject(object):
>  self.value = value
> +class ReusedSQLTestCase(ReusedPySparkTestCase):
> +@classmethod
> +def setUpClass(cls):
> +ReusedPySparkTestCase.setUpClass()
> +cls.spark = SparkSession(cls.sc)
> +
> +@classmethod
> +def tearDownClass(cls):
> +ReusedPySparkTestCase.tearDownClass()
> +cls.spark.stop()
> +
> +
>  class DataTypeTests(unittest.TestCase):
>  # regression test for SPARK-6055
>  def test_data_type_eq(self):
> @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
>  self.assertRaises(TypeError, struct_field.typeName)
> -class SQLTests(ReusedPySparkTestCase):
> +class SQLTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
>  os.unlink(cls.tempdir.name)
> -cls.spark = SparkSession(cls.sc)
>  cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
>  cls.df = cls.spark.createDataFrame(cls.testData)
>  @classmethod
>  def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  shutil.rmtree(cls.tempdir.name, ignore_errors=True)
>  def test_sqlcontext_reuses_sparksession(self):
> @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
>  self.assertTrue(os.path.exists(metastore_path))
> -class SQLTests2(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class SQLTests2(ReusedSQLTestCase):
>  # We can't include this test into SQLTests because we will stop class's 
> SparkContext and cause
>  # other tests failed.
> @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):
>  @unittest.skipIf(not _have_arrow, "Arrow not installed")
> -class ArrowTests(ReusedPySparkTestCase):
> +class ArrowTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
>  from datetime import datetime
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  # Synchronize default timezone between Python and Java
>  cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
> @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
>  os.environ["TZ"] = tz
>  time.tzset()
> -cls.spark = SparkSession(cls.sc)
>  cls.spark.conf.set("spark.sql.session.timeZone", tz)
>  cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>  cls.schema = StructType([
> @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  if cls.tz_prev is not None:
>  os.environ["TZ"] = cls.tz_prev
>  time.tzset()
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  def assertFramesEqual(self, df_with_arrow, df_without):
>  msg = ("DataFrame from Arrow is not equal" +
> @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
> installed")
> -class VectorizedUDFTests(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class VectorizedUDFTests(ReusedSQLTestCase):
>  def test_vectorized_udf_basic(self):
>  from 

[jira] [Assigned] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22379:


Assignee: (was: Apache Spark)

> Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
> 
>
> Key: SPARK-22379
> URL: https://issues.apache.org/jira/browse/SPARK-22379
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Looks like there is some duplication in sql/tests.py:
> {code}
> diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
> index 98afae662b4..6812da6b309 100644
> --- a/python/pyspark/sql/tests.py
> +++ b/python/pyspark/sql/tests.py
> @@ -179,6 +179,18 @@ class MyObject(object):
>  self.value = value
> +class ReusedSQLTestCase(ReusedPySparkTestCase):
> +@classmethod
> +def setUpClass(cls):
> +ReusedPySparkTestCase.setUpClass()
> +cls.spark = SparkSession(cls.sc)
> +
> +@classmethod
> +def tearDownClass(cls):
> +ReusedPySparkTestCase.tearDownClass()
> +cls.spark.stop()
> +
> +
>  class DataTypeTests(unittest.TestCase):
>  # regression test for SPARK-6055
>  def test_data_type_eq(self):
> @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
>  self.assertRaises(TypeError, struct_field.typeName)
> -class SQLTests(ReusedPySparkTestCase):
> +class SQLTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
>  os.unlink(cls.tempdir.name)
> -cls.spark = SparkSession(cls.sc)
>  cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
>  cls.df = cls.spark.createDataFrame(cls.testData)
>  @classmethod
>  def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  shutil.rmtree(cls.tempdir.name, ignore_errors=True)
>  def test_sqlcontext_reuses_sparksession(self):
> @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
>  self.assertTrue(os.path.exists(metastore_path))
> -class SQLTests2(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class SQLTests2(ReusedSQLTestCase):
>  # We can't include this test into SQLTests because we will stop class's 
> SparkContext and cause
>  # other tests failed.
> @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):
>  @unittest.skipIf(not _have_arrow, "Arrow not installed")
> -class ArrowTests(ReusedPySparkTestCase):
> +class ArrowTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
>  from datetime import datetime
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  # Synchronize default timezone between Python and Java
>  cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
> @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
>  os.environ["TZ"] = tz
>  time.tzset()
> -cls.spark = SparkSession(cls.sc)
>  cls.spark.conf.set("spark.sql.session.timeZone", tz)
>  cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>  cls.schema = StructType([
> @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  if cls.tz_prev is not None:
>  os.environ["TZ"] = cls.tz_prev
>  time.tzset()
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  def assertFramesEqual(self, df_with_arrow, df_without):
>  msg = ("DataFrame from Arrow is not equal" +
> @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
> installed")
> -class VectorizedUDFTests(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class VectorizedUDFTests(ReusedSQLTestCase):
>  def test_vectorized_udf_basic(self):
>  from pyspark.sql.functions import pandas_udf, col
> @@ -3478,16 +3466,7 @@ class 

[jira] [Created] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-28 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-22379:


 Summary: Reduce duplication setUpClass and tearDownClass in 
PySpark SQL tests
 Key: SPARK-22379
 URL: https://issues.apache.org/jira/browse/SPARK-22379
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Trivial


Looks like there is some duplication in sql/tests.py:

{code}
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 98afae662b4..6812da6b309 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -179,6 +179,18 @@ class MyObject(object):
 self.value = value


+class ReusedSQLTestCase(ReusedPySparkTestCase):
+@classmethod
+def setUpClass(cls):
+ReusedPySparkTestCase.setUpClass()
+cls.spark = SparkSession(cls.sc)
+
+@classmethod
+def tearDownClass(cls):
+ReusedPySparkTestCase.tearDownClass()
+cls.spark.stop()
+
+
 class DataTypeTests(unittest.TestCase):
 # regression test for SPARK-6055
 def test_data_type_eq(self):
@@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
 self.assertRaises(TypeError, struct_field.typeName)


-class SQLTests(ReusedPySparkTestCase):
+class SQLTests(ReusedSQLTestCase):

 @classmethod
 def setUpClass(cls):
-ReusedPySparkTestCase.setUpClass()
+ReusedSQLTestCase.setUpClass()
 cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
 os.unlink(cls.tempdir.name)
-cls.spark = SparkSession(cls.sc)
 cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
 cls.df = cls.spark.createDataFrame(cls.testData)

 @classmethod
 def tearDownClass(cls):
-ReusedPySparkTestCase.tearDownClass()
-cls.spark.stop()
+ReusedSQLTestCase.tearDownClass()
 shutil.rmtree(cls.tempdir.name, ignore_errors=True)

 def test_sqlcontext_reuses_sparksession(self):
@@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
 self.assertTrue(os.path.exists(metastore_path))


-class SQLTests2(ReusedPySparkTestCase):
-
-@classmethod
-def setUpClass(cls):
-ReusedPySparkTestCase.setUpClass()
-cls.spark = SparkSession(cls.sc)
-
-@classmethod
-def tearDownClass(cls):
-ReusedPySparkTestCase.tearDownClass()
-cls.spark.stop()
+class SQLTests2(ReusedSQLTestCase):

 # We can't include this test into SQLTests because we will stop class's 
SparkContext and cause
 # other tests failed.
@@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):


 @unittest.skipIf(not _have_arrow, "Arrow not installed")
-class ArrowTests(ReusedPySparkTestCase):
+class ArrowTests(ReusedSQLTestCase):

 @classmethod
 def setUpClass(cls):
 from datetime import datetime
-ReusedPySparkTestCase.setUpClass()
+ReusedSQLTestCase.setUpClass()

 # Synchronize default timezone between Python and Java
 cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
@@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
 os.environ["TZ"] = tz
 time.tzset()

-cls.spark = SparkSession(cls.sc)
 cls.spark.conf.set("spark.sql.session.timeZone", tz)
 cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
 cls.schema = StructType([
@@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
 if cls.tz_prev is not None:
 os.environ["TZ"] = cls.tz_prev
 time.tzset()
-ReusedPySparkTestCase.tearDownClass()
-cls.spark.stop()
+ReusedSQLTestCase.tearDownClass()

 def assertFramesEqual(self, df_with_arrow, df_without):
 msg = ("DataFrame from Arrow is not equal" +
@@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):


 @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
-class VectorizedUDFTests(ReusedPySparkTestCase):
-
-@classmethod
-def setUpClass(cls):
-ReusedPySparkTestCase.setUpClass()
-cls.spark = SparkSession(cls.sc)
-
-@classmethod
-def tearDownClass(cls):
-ReusedPySparkTestCase.tearDownClass()
-cls.spark.stop()
+class VectorizedUDFTests(ReusedSQLTestCase):

 def test_vectorized_udf_basic(self):
 from pyspark.sql.functions import pandas_udf, col
@@ -3478,16 +3466,7 @@ class VectorizedUDFTests(ReusedPySparkTestCase):


 @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
-class GroupbyApplyTests(ReusedPySparkTestCase):
-@classmethod
-def setUpClass(cls):
-ReusedPySparkTestCase.setUpClass()
-cls.spark = SparkSession(cls.sc)
-
-@classmethod
-def tearDownClass(cls):
-ReusedPySparkTestCase.tearDownClass()
-cls.spark.stop()
+class 

[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed

2017-10-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223564#comment-16223564
 ] 

Hyukjin Kwon commented on SPARK-22240:
--

I am sorry, I have been super busy these past few days.

So, here is my understanding:

1. When {{multiLine}} is enabled, the input is nonsplittable - 
https://github.com/apache/spark/blob/9f6b3e65ccfa0daec31b58c5a6386b3a890c2149/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L201

{code}
object MultiLineCSVDataSource extends CSVDataSource {
  override val isSplitable: Boolean = false
{code}

So, there is always one partition per file. The schema inference code path is a 
bit convoluted, but basically it fully reads the input via {{BinaryFileRDD}}, 
roughly like {{sc.binaryFiles}}, when {{multiLine}} is enabled.

2. gzip is nonsplittable, I believe, so this is the same case as above. See 
https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L161-L171
and SPARK-15654

{code}

  override def isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean = {
if (codecFactory == null) {
  codecFactory = new CompressionCodecFactory(
sparkSession.sessionState.newHadoopConfWithOptions(options))
}
val codec = codecFactory.getCodec(path)
codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }
{code}

3. The partition calculation for file-based datasources is (roughly) done here - 
https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L420-L484
Here, input splits are (roughly) combined as an optimization.

Please correct me if I am mistaken. There might be a few minor inaccuracies, but 
I believe these are generally correct.
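
To see the {{multiLine}} effect concretely (a spark-shell sketch; the path is a 
placeholder):

{code}
// With multiLine the CSV source is nonsplittable, so one large, uncompressed file
// yields a single partition; without it, the same file can be split.
val path = "s3a://some-bucket/big.csv"   // placeholder
val single = spark.read.option("header", "true").option("multiLine", "true").csv(path)
val split  = spark.read.option("header", "true").csv(path)
println(single.rdd.getNumPartitions)     // 1 (one partition per file)
println(split.rdd.getNumPartitions)      // > 1 for a sufficiently large file
{code}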


> S3 CSV number of partitions incorrectly computed
> 
>
> Key: SPARK-22240
> URL: https://issues.apache.org/jira/browse/SPARK-22240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>Reporter: Arthur Baudry
>
> Reading CSV out of S3 using S3A protocol does not compute the number of 
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a 
> property that should be set to have the number of partitions computed 
> correctly.
> I'm aware that the .option("multiline","true") is not supported in Spark 
> 2.0.2; it's not relevant here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22335.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19570
[https://github.com/apache/spark/pull/19570]
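
For reference, a hedged sketch of the by-name alternative (assuming Spark 2.3's 
{{Dataset.unionByName}}, which matches columns by name rather than by position, 
unlike the positional {{union}} reproduced in the description below):

{code:java}
// In spark-shell (spark.implicits._ is already in scope).
case class AB(a: String, b: String)

val abDs = Seq(AB("aThing", "bThing")).toDS()
val baDs = Seq(("bThing", "aThing")).toDF("b", "a").as[AB]

abDs.union(baDs).show()        // positional union: reproduces the surprise below
abDs.unionByName(baDs).show()  // by-name union (2.3+): columns are matched by name
{code}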

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
> Fix For: 2.3.0
>
>
> I see union uses column order for a DataFrame. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case (Spark 2.2.0):
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO. And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting a Dataset of type 
> `AB`. More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect), and that it's 
> inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed Dataset instead of doing it 
> initially. So either that typing could be prioritized to happen before the 
> union, or the union of DataFrames could be done with column order taken into 
> account. Again, this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22335:


Assignee: Liang-Chi Hsieh

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> I see union uses column order for a DataFrame. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case (Spark 2.2.0):
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO. And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting a Dataset of type 
> `AB`. More importantly, I think the issue is bigger when you consider that it 
> happens even if you read from parquet (as you would expect), and that it's 
> inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed Dataset instead of doing it 
> initially. So either that typing could be prioritized to happen before the 
> union, or the union of DataFrames could be done with column order taken into 
> account. Again, this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets

2017-10-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223470#comment-16223470
 ] 

Hyukjin Kwon commented on SPARK-22351:
--

{quote}
While this was possible in Spark 1.6, it's no longer the case in Spark 2.x.
{quote}

Would you mind sharing the code? I don't think I can reproduce this in 1.6.



> Support user-created custom Encoders for Datasets
> -
>
> Key: SPARK-22351
> URL: https://issues.apache.org/jira/browse/SPARK-22351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adamos Loizou
>Priority: Minor
>
> It would be very helpful if we could easily support creating custom encoders 
> for classes in Spark SQL.
> This is to allow a user to properly define a business model using types of 
> their choice. They can then map them to Spark SQL types without being forced 
> to pollute their model with the built-in mappable types (e.g. 
> {{java.sql.Timestamp}}).
> Specifically in our case, we tend to use either the Java 8 time API or the 
> joda time API for dates instead of {{java.sql.Timestamp}}, whose API is quite 
> limited compared to the others.
> Ideally we would like to be able to have a dataset of such a class:
> {code:java}
> case class Person(name: String, dateOfBirth: org.joda.time.LocalDate)
> implicit def localDateTimeEncoder: Encoder[LocalDate] = ??? // we define 
> something that maps to Spark SQL TimestampType
> ...
> // read csv and map it to model
> val people:Dataset[Person] = spark.read.csv("/my/path/file.csv").as[Person]
> {code}
> While this was possible in Spark 1.6, it's no longer the case in Spark 2.x.
> It's also not straightforward how to support that using an 
> {{ExpressionEncoder}} (any tips would be much appreciated).
> Thanks.
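
A hedged stopgap sketch under the existing 2.x API (this is not the requested 
feature): {{Encoders.kryo}} can treat the whole domain class as an opaque binary 
column, which keeps the model free of {{java.sql.Timestamp}} at the cost of losing 
columnar Spark SQL types and any pruning/pushdown on those fields. The column 
positions in the mapping below are an assumption about the CSV layout.

{code:java}
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

case class Person(name: String, dateOfBirth: org.joda.time.LocalDate)

// Kryo-serialized (opaque) encoder: Spark SQL stores each Person as one binary column.
val personEncoder: Encoder[Person] = Encoders.kryo[Person]

val people: Dataset[Person] = spark.read
  .option("header", "true")
  .csv("/my/path/file.csv")
  .map(row => Person(row.getString(0),
    org.joda.time.LocalDate.parse(row.getString(1))))(personEncoder)
{code}

The trade-off is that {{people}} cannot be queried column-wise with Spark SQL; it 
behaves more like a collection of serialized objects.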



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed

2017-10-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223400#comment-16223400
 ] 

Steve Loughran commented on SPARK-22240:


So this partition calculation problem is independent of the filesystem? That is, 
can you replicate it on local disk or HDFS? If so, that points the problem more at 
Spark than at me, though it could be the use/doc/implementation of the FileSystem 
API calls.

> S3 CSV number of partitions incorrectly computed
> 
>
> Key: SPARK-22240
> URL: https://issues.apache.org/jira/browse/SPARK-22240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>Reporter: Arthur Baudry
>
> Reading CSV out of S3 using S3A protocol does not compute the number of 
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a 
> property that should be set to have the number of partitions computed 
> correctly.
> I'm aware that the .option("multiline","true") is not supported in Spark 
> 2.0.2; it's not relevant here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22378) Eliminate redundant nullcheck in generated code for extracting value in complex type

2017-10-28 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22378:


 Summary: Eliminate redundant nullcheck in generated code for 
extracting value in complex type
 Key: SPARK-22378
 URL: https://issues.apache.org/jira/browse/SPARK-22378
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22376) run-tests.py fails at exec-sbt if run with Python 3

2017-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223355#comment-16223355
 ] 

Sean Owen commented on SPARK-22376:
---

If the change is still compatible with Python 2, yes go ahead. 

> run-tests.py fails at exec-sbt if run with Python 3
> ---
>
> Key: SPARK-22376
> URL: https://issues.apache.org/jira/browse/SPARK-22376
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6 Python 3.6.0 Anaconda 4.4
>Reporter: Joel Croteau
>Priority: Minor
>
> Running ./dev/run-tests with python3 as the default gives this error at the 
> Building Spark stage:
> {noformat}
> 
> Building Spark
> 
> [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:  
> -Phadoop-2.6 -Phive-thriftserver -Phive -Pkafka-0-8 -Pflume -Pyarn -Pmesos 
> -Pkinesis-asl test:package streaming-kafka-0-8-assembly/assembly 
> streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly
> Traceback (most recent call last):
>   File "./dev/run-tests.py", line 622, in 
> main()
>   File "./dev/run-tests.py", line 593, in main
> build_apache_spark(build_tool, hadoop_version)
>   File "./dev/run-tests.py", line 391, in build_apache_spark
> build_spark_sbt(hadoop_version)
>   File "./dev/run-tests.py", line 344, in build_spark_sbt
> exec_sbt(profiles_and_goals)
>   File "./dev/run-tests.py", line 293, in exec_sbt
> if not sbt_output_filter.match(line):
> TypeError: cannot use a string pattern on a bytes-like object
> {noformat}
> This is because in Python 3, the stdout member of a Popen object defaults to 
> returning a byte stream, and exec_sbt tries to read it as a text stream. This 
> can be fixed by specifying universal_newlines=True when creating the Popen 
> object. I notice that the shebang at the start of run-tests.py says to run 
> with python2, so I am not sure how much of the rest of it is Python 3 
> compatible, but this is the first error I've run into. It ran with Python 3 
> because run-tests runs run-tests.py using the default Python, which on my 
> system is Python 3. Not sure whether the better solution here is to try to 
> fix Python 3 compatibility in run-tests.py or set run-tests to use Python 2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22373) Intermittent NullPointerException in org.codehaus.janino.IClass.isAssignableFrom

2017-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223348#comment-16223348
 ] 

Sean Owen commented on SPARK-22373:
---

It looks like a Janino problem, so not sure if it can be fixed in Spark, or 
worked around if it can't be reproduced.

> Intermittent NullPointerException in 
> org.codehaus.janino.IClass.isAssignableFrom
> 
>
> Key: SPARK-22373
> URL: https://issues.apache.org/jira/browse/SPARK-22373
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hortonworks distribution: HDP 2.6.2.0-205 , 
> /usr/hdp/current/spark2-client/jars/spark-core_2.11-2.1.1.2.6.2.0-205.jar
>Reporter: Dan Meany
>Priority: Minor
>
> Very occasional, and a retry works.
> Full stack:
> 17/10/27 21:06:15 ERROR Executor: Exception in task 29.0 in stage 12.0 (TID 
> 758)
> java.lang.NullPointerException
>   at org.codehaus.janino.IClass.isAssignableFrom(IClass.java:569)
>   at 
> org.codehaus.janino.UnitCompiler.isWideningReferenceConvertible(UnitCompiler.java:10347)
>   at 
> org.codehaus.janino.UnitCompiler.isMethodInvocationConvertible(UnitCompiler.java:8636)
>   at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8427)
>   at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8285)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8169)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8071)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4421)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:550)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> 

[jira] [Commented] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process

2017-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223345#comment-16223345
 ] 

Sean Owen commented on SPARK-22375:
---

Certainly, please open a pull request.

> Test script can fail if eggs are installed by setup.py during test process
> --
>
> Key: SPARK-22375
> URL: https://issues.apache.org/jira/browse/SPARK-22375
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6
>Reporter: Joel Croteau
>Priority: Trivial
>
> Running ./dev/run-tests may install missing Python packages as part of its 
> setup process. setup.py can cache these in python/.eggs, and since the 
> lint-python script checks any file with the .py extension anywhere in the 
> Spark project, it will check files in .eggs and will fail if any of these do 
> not meet the style criteria, even though they are not part of the project. 
> lint-python should exclude python/.eggs from its search directories.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21635) ACOS(2) and ASIN(2) should be null

2017-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21635.
---
Resolution: Won't Fix

> ACOS(2) and ASIN(2) should be null
> --
>
> Key: SPARK-21635
> URL: https://issues.apache.org/jira/browse/SPARK-21635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> ACOS(2) and ASIN(2) should be null; I have created a patch for Hive.
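
Since this is resolved as Won't Fix, a hedged workaround sketch (Spark returns NaN 
for out-of-domain inputs; {{nanvl}} can turn that NaN into null where Hive-like 
behaviour is wanted):

{code:java}
// In spark-shell (spark.implicits._ is already in scope).
import org.apache.spark.sql.functions.{acos, col, lit, nanvl}

val df = Seq(2.0, 0.5).toDF("x")

// acos outside [-1, 1] evaluates to NaN; nanvl replaces NaN with its second argument.
df.select(nanvl(acos(col("x")), lit(null).cast("double")).as("acos_or_null")).show()
{code}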



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13085) Add scalastyle command used in build testing

2017-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13085.
---
Resolution: Duplicate

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> For an occasional or new contributor, it is easy to get the Scala style wrong. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the Scala style violations, even when reading 
> the Scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> The ask is that the command used to check Scalastyle be presented in the log, 
> so a developer does not have to wait for the build process to find out whether 
> a pull request will pass the Scala style checks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20731) Add ability to change or omit .csv file extension in CSV Data Source

2017-10-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20731.
--
Resolution: Won't Fix

> Add ability to change or omit .csv file extension in CSV Data Source
> 
>
> Key: SPARK-20731
> URL: https://issues.apache.org/jira/browse/SPARK-20731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Mikko Kupsu
>Priority: Minor
>
> The CSV data source has the ability to change the field delimiter. If this is 
> changed, for example to TAB, then the default file extension "csv" is 
> misleading and e.g. "tsv" would be preferable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21619) Fail the execution of canonicalized plans explicitly

2017-10-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21619.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Fail the execution of canonicalized plans explicitly
> 
>
> Key: SPARK-21619
> URL: https://issues.apache.org/jira/browse/SPARK-21619
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.3.0
>
>
> Canonicalized plans are not supposed to be executed. I ran into a case in 
> which there's some code that accidentally calls execute on a canonicalized 
> plan. This patch throws a more explicit exception when that happens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22333) ColumnReference should get higher priority than timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)

2017-10-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22333.
-
   Resolution: Fixed
 Assignee: Feng Zhu
Fix Version/s: 2.3.0

> ColumnReference should get higher priority than 
> timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)
> -
>
> Key: SPARK-22333
> URL: https://issues.apache.org/jira/browse/SPARK-22333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Feng Zhu
>Assignee: Feng Zhu
> Fix For: 2.3.0
>
>
> In our cluster, there is a table "T" with a column named "current_date". 
> When we select data from this column with SQL:
> {code:sql}
> select current_date from T
> {code}
> we get the wrong answer, as the column is translated to the CURRENT_DATE() 
> function.
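
A hedged workaround sketch for the affected versions (assuming backtick-quoted 
identifiers are parsed as column references rather than as the niladic 
{{CURRENT_DATE}} function; not verified here against every version listed above):

{code:java}
// Backtick-quoting should make the parser treat current_date as an identifier:
spark.sql("SELECT `current_date` FROM T").show()

// The DataFrame API avoids the SQL keyword resolution path entirely:
spark.table("T").select(org.apache.spark.sql.functions.col("current_date")).show()
{code}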



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org