[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2017-11-26 Thread Christoph Pirkl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266430#comment-16266430
 ] 

Christoph Pirkl commented on SPARK-10969:
-

Thanks for the information, this solves my problem. You can close the ticket.

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22602) remove ColumnVector#loadBytes

2017-11-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22602.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> remove ColumnVector#loadBytes
> -
>
> Key: SPARK-22602
> URL: https://issues.apache.org/jira/browse/SPARK-22602
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22611) Spark Kinesis ProvisionedThroughputExceededException leads to dropped records

2017-11-26 Thread Richard Moorhead (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266389#comment-16266389
 ] 

Richard Moorhead commented on SPARK-22611:
--

That's correct.

The Spark Streaming UI shows a greater number of records than are actually 
instantiated by the KinesisBackedBlockRDD during actions invoked in 
`foreachRDD`. It appears that individual sequence numbers are skipped when 
throughput exceptions occur.

> Spark Kinesis ProvisionedThroughputExceededException leads to dropped records
> -
>
> Key: SPARK-22611
> URL: https://issues.apache.org/jira/browse/SPARK-22611
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Richard Moorhead
>
> I've loaded a Kinesis stream having a single shard with ~20M records and have 
> created a simple Spark Streaming application that writes those records to S3. 
> When the streaming interval is set sufficiently wide that the 2MB/s read 
> rate is violated, the receiver's KCL processes throw 
> ProvisionedThroughputExceededExceptions. While these exceptions are expected, 
> the output record counts in S3 do not match the record counts in the Spark 
> Streaming UI and, worse, the records never appear to be fetched in future 
> batches. This problem can be mitigated by setting the streaming interval to a 
> narrow window such that batches are small enough that throughput limits aren't 
> exceeded, but this isn't guaranteed in a production system.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2017-11-26 Thread Grega Kespret (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266374#comment-16266374
 ] 

Grega Kespret edited comment on SPARK-10969 at 11/27/17 5:33 AM:
-

I believe this is now possible through the {{KinesisInputDStream.Builder}} 
class added in SPARK-19911 as:

{code}
KinesisInputDStream.builder
  ...
  .kinesisCredentials(creds1)
  .dynamoDBCredentials(creds2)
  .cloudWatchCredentials(creds3)
  .build
{code}

Close the ticket?

cc [~brkyvz]


was (Author: gregak):
I believe this is now possible through the {{KinesisInputDStream.Builder}} 
class added in SPARK-19911 as:

{code}
KinesisInputDStream.builder
  ...
  .kinesisCredentials(creds1)
  .dynamoDBCredentials(creds2)
  .cloudWatchCredentials(creds3)
  .build
{code}

Close the ticket?

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2017-11-26 Thread Grega Kespret (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266374#comment-16266374
 ] 

Grega Kespret edited comment on SPARK-10969 at 11/27/17 5:31 AM:
-

I believe this is now possible through the {{KinesisInputDStream.Builder}} 
class added in SPARK-19911 as:

{code}
KinesisInputDStream.builder
  ...
  .kinesisCredentials(creds1)
  .dynamoDBCredentials(creds2)
  .cloudWatchCredentials(creds3)
  .build
{code}

Close the ticket?


was (Author: gregak):
I believe this is now possible through the {{KinesisInputDStream.Builder}} 
class as:

{code}
KinesisInputDStream.builder
  ...
  .kinesisCredentials(creds1)
  .dynamoDBCredentials(creds2)
  .cloudWatchCredentials(creds3)
  .build
{code}

Close the ticket?

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2017-11-26 Thread Grega Kespret (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266374#comment-16266374
 ] 

Grega Kespret commented on SPARK-10969:
---

I believe this is now possible through the {{KinesisInputDStream.Builder}} 
class as:

{code}
KinesisInputDStream.builder
  ...
  .kinesisCredentials(creds1)
  .dynamoDBCredentials(creds2)
  .cloudWatchCredentials(creds3)
  .build
{code}

Close the ticket?
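
For reference, a fuller invocation might look like the sketch below. It assumes 
the {{SparkAWSCredentials}} builder that shipped alongside SPARK-19911; the 
stream name, endpoint, region, and key values are illustrative placeholders, 
not anything from this ticket:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInputDStream, SparkAWSCredentials}

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-demo"), Seconds(10))

// Read-only credentials for the cross-account stream (placeholder keys).
val streamCreds = SparkAWSCredentials.builder
  .basicCredentials("STREAM_ACCESS_KEY_ID", "STREAM_SECRET_KEY")
  .build()

// Credentials in the consumer's own account, with rights to create the KCL
// lease table in DynamoDB and to publish CloudWatch metrics (placeholder keys).
val ownCreds = SparkAWSCredentials.builder
  .basicCredentials("OWN_ACCESS_KEY_ID", "OWN_SECRET_KEY")
  .build()

val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("cross-account-stream")              // placeholder
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .checkpointAppName("kinesis-demo")
  .checkpointInterval(Seconds(10))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .kinesisCredentials(streamCreds)
  .dynamoDBCredentials(ownCreds)
  .cloudWatchCredentials(ownCreds)
  .build()
{code}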

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2017-11-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266358#comment-16266358
 ] 

Reynold Xin commented on SPARK-7721:


This is really cool. I took a look, but it looks like doctests are missing? For 
example, sortWithinPartitions is labeled as missing, but there is a doctest for 
it.


> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have a test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage in Python without coverage reports 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22612) NullPointerException in AppendOnlyMap

2017-11-26 Thread Lijie Xu (JIRA)
Lijie Xu created SPARK-22612:


 Summary: NullPointerException in AppendOnlyMap
 Key: SPARK-22612
 URL: https://issues.apache.org/jira/browse/SPARK-22612
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.2, 2.1.1
Reporter: Lijie Xu


I recently encountered a NullPointerException in AppendOnlyMap while running the 
SparkPageRank example from the org.apache.spark.examples package.

{code:java}
17/11/25 16:31:13 ERROR Executor: Exception in task 30.0 in stage 10.0 (TID 417)
java.lang.NullPointerException
    at scala.Tuple2.equals(Tuple2.scala:20)
    at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:149)
    at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/11/25 16:31:13 INFO Executor: Executor is trying to kill task 24.0 in stage 10.0 (TID 409)
{code}

The corresponding AppendOnlyMap code is as follows.

{code:java}
while (true) {
  val curKey = data(2 * pos)
  if (curKey.eq(null)) {
    val newValue = updateFunc(false, null.asInstanceOf[V])
    data(2 * pos) = k
    data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
    incrementSize()
    return newValue
  } else if (k.eq(curKey) || k.equals(curKey)) {   // NullPointerException in this line (149)
    val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
    data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
    return newValue
  } else {
    val delta = i
    pos = (pos + delta) & mask
    i += 1
  }
}
{code}
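
The {{else}} branch above is quadratic probing over a power-of-two table; a 
standalone sketch of the probe walk (illustrative values, not Spark code):

{code}
// Probe positions visited for one lookup, using the same step rule as
// AppendOnlyMap.changeValue: pos advances by 1, 2, 3, ... modulo capacity.
val capacity = 8
val mask = capacity - 1
var pos = 5          // slot derived from the key's hash (illustrative)
var i = 1
(0 until 4).foreach { _ =>
  println(pos)       // prints 5, 6, 0, 3
  pos = (pos + i) & mask
  i += 1
}
{code}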









--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22611) Spark Kinesis ProvisionedThroughputExceededException leads to dropped records

2017-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266333#comment-16266333
 ] 

Sean Owen commented on SPARK-22611:
---

What's the Spark issue here -- somehow the error causes the records to be 
considered processed when they didn't make it to user code?

> Spark Kinesis ProvisionedThroughputExceededException leads to dropped records
> -
>
> Key: SPARK-22611
> URL: https://issues.apache.org/jira/browse/SPARK-22611
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Richard Moorhead
>
> I've loaded a Kinesis stream having a single shard with ~20M records and have 
> created a simple Spark Streaming application that writes those records to S3. 
> When the streaming interval is set sufficiently wide that the 2MB/s read 
> rate is violated, the receiver's KCL processes throw 
> ProvisionedThroughputExceededExceptions. While these exceptions are expected, 
> the output record counts in S3 do not match the record counts in the Spark 
> Streaming UI and, worse, the records never appear to be fetched in future 
> batches. This problem can be mitigated by setting the streaming interval to a 
> narrow window such that batches are small enough that throughput limits aren't 
> exceeded, but this isn't guaranteed in a production system.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22610) AM restart on another node sends jobs into a state of feigned death

2017-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22610.
---
Resolution: Invalid

Questions to the mailing list please

> AM restart on another node sends jobs into a state of feigned death
> 
>
> Key: SPARK-22610
> URL: https://issues.apache.org/jira/browse/SPARK-22610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Bang Xiao
>Priority: Minor
>
> I run "spark-sql  --master yarn --deploy-mode client -f 'SQLs' " in shell,  
> The application  is stuck when the AM is down and restart in other nodes. It 
> seems the driver wait for the next sql. Is this a bug?In my opinion,Either 
> the application execute the failed sql or exit with a failure when the AM 
> restart。



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22611) Spark Kinesis ProvisionedThroughputExceededException leads to dropped records

2017-11-26 Thread Richard Moorhead (JIRA)
Richard Moorhead created SPARK-22611:


 Summary: Spark Kinesis ProvisionedThroughputExceededException 
leads to dropped records
 Key: SPARK-22611
 URL: https://issues.apache.org/jira/browse/SPARK-22611
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Richard Moorhead


I've loaded a Kinesis stream having a single shard with ~20M records and have 
created a simple Spark Streaming application that writes those records to S3. 
When the streaming interval is set sufficiently wide that the 2MB/s read rate 
is violated, the receiver's KCL processes throw 
ProvisionedThroughputExceededExceptions. While these exceptions are expected, 
the output record counts in S3 do not match the record counts in the Spark 
Streaming UI and, worse, the records never appear to be fetched in future 
batches. This problem can be mitigated by setting the streaming interval to a 
narrow window such that batches are small enough that throughput limits aren't 
exceeded, but this isn't guaranteed in a production system.
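
As a back-of-envelope illustration of why the interval matters (a sketch; the 
2 MB/s figure is Kinesis' documented per-shard read limit, and the values below 
are illustrative, not from this ticket):

{code}
// Rough bound on what one batch can read from Kinesis without throttling.
val shards = 1                 // the stream described above has one shard
val batchIntervalSec = 60L     // an illustrative "wide" streaming interval
val maxBytesPerBatch = shards * 2L * 1024 * 1024 * batchIntervalSec
println(s"~${maxBytesPerBatch / (1024 * 1024)} MB per batch before " +
  "ProvisionedThroughputExceededException becomes likely")
{code}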



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22610) AM restart on another node sends jobs into a state of feigned death

2017-11-26 Thread Bang Xiao (JIRA)
Bang Xiao created SPARK-22610:
-

 Summary: AM restart on another node sends jobs into a state of feigned death
 Key: SPARK-22610
 URL: https://issues.apache.org/jira/browse/SPARK-22610
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Bang Xiao
Priority: Minor


I run "spark-sql  --master yarn --deploy-mode client -f 'SQLs' " in shell,  The 
application  is stuck when the AM is down and restart in other nodes. It seems 
the driver wait for the next sql. Is this a bug?In my opinion,Either the 
application execute the failed sql or exit with a failure when the AM restart。



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido closed SPARK-22609.
---

> Reuse CodeGeneration.nullSafeExec when possible
> ---
>
> Key: SPARK-22609
> URL: https://issues.apache.org/jira/browse/SPARK-22609
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> There are several places in the code where `CodeGeneration.nullSafeExec` 
> could be used but is not. This leaves the generated code with a lot of 
> useless blocks like:
> {code}
> if (!false) {
>   // some code here
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22609.
-
Resolution: Invalid

> Reuse CodeGeneration.nullSafeExec when possible
> ---
>
> Key: SPARK-22609
> URL: https://issues.apache.org/jira/browse/SPARK-22609
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> There are several places in the code where `CodeGeneration.nullSafeExec` 
> could be used but is not. This leaves the generated code with a lot of 
> useless blocks like:
> {code}
> if (!false) {
>   // some code here
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22601) Data load is displayed as successful on providing a non-existing HDFS file path

2017-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22601:


Assignee: (was: Apache Spark)

> Data load is displayed as successful on providing a non-existing HDFS file 
> path
> --
>
> Key: SPARK-22601
> URL: https://issues.apache.org/jira/browse/SPARK-22601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sujith
>Priority: Minor
>
> Data load is reported as successful when a non-existing HDFS file path is 
> provided, whereas for a local path a proper error message is displayed:
> create table tb2 (a string, b int);
> load data inpath 'hdfs://hacluster/data1.csv' into table tb2
> Note: data1.csv does not exist in HDFS.
> When a non-existing local file path is given, the error message "LOAD DATA 
> input path does not exist" is displayed. Attached are snapshots of the 
> behaviour in Spark 2.1 and Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22601) Data load is displayed as successful on providing a non-existing HDFS file path

2017-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22601:


Assignee: Apache Spark

> Data load is displayed as successful on providing a non-existing HDFS file 
> path
> --
>
> Key: SPARK-22601
> URL: https://issues.apache.org/jira/browse/SPARK-22601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sujith
>Assignee: Apache Spark
>Priority: Minor
>
> Data load is reported as successful when a non-existing HDFS file path is 
> provided, whereas for a local path a proper error message is displayed:
> create table tb2 (a string, b int);
> load data inpath 'hdfs://hacluster/data1.csv' into table tb2
> Note: data1.csv does not exist in HDFS.
> When a non-existing local file path is given, the error message "LOAD DATA 
> input path does not exist" is displayed. Attached are snapshots of the 
> behaviour in Spark 2.1 and Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22601) Data load is displayed as successful on providing a non-existing HDFS file path

2017-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266123#comment-16266123
 ] 

Apache Spark commented on SPARK-22601:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/19823

> Data load is displayed as successful on providing a non-existing HDFS file 
> path
> --
>
> Key: SPARK-22601
> URL: https://issues.apache.org/jira/browse/SPARK-22601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sujith
>Priority: Minor
>
> Data load is reported as successful when a non-existing HDFS file path is 
> provided, whereas for a local path a proper error message is displayed:
> create table tb2 (a string, b int);
> load data inpath 'hdfs://hacluster/data1.csv' into table tb2
> Note: data1.csv does not exist in HDFS.
> When a non-existing local file path is given, the error message "LOAD DATA 
> input path does not exist" is displayed. Attached are snapshots of the 
> behaviour in Spark 2.1 and Spark 2.2.
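
To make the report runnable, a minimal sketch (assuming a Hive-enabled 
{{SparkSession}} named {{spark}}; the path and table name are the ones from the 
description above):

{code}
// Reproduction sketch for the described behavior; data1.csv must not exist.
spark.sql("create table tb2 (a string, b int)")
spark.sql("load data inpath 'hdfs://hacluster/data1.csv' into table tb2")
// Expected (and seen for local paths): "LOAD DATA input path does not exist".
// Reported for 2.2.0: the statement is treated as successful.
{code}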



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming

2017-11-26 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266102#comment-16266102
 ] 

Eyal Farago commented on SPARK-22579:
-

[~srowen], I couldn't see a place where this data is stored for later use; even 
in the 'big fetch' scenario the file isn't reused, as 
org.apache.spark.storage.BlockManager#remoteBlockTempFileManager is only used 
to stream a big file once and never to figure out whether a download can be 
avoided. Furthermore, these blocks are also not registered with the block 
manager, even though they could serve as a 'DISK_ONLY' cache on the requesting 
executor and potentially even for other executors residing on the same node.

[~jerryshao], it seems the 'big files' code path actually uses streaming, which 
makes me think it could be slightly modified to accept a stream handler (the 
current behavior can be achieved by passing in a download stream handler).

One thing that does have the potential for trouble is error recovery when more 
than one executor can serve the block: the current implementation simply moves 
on to the next one, whereas a streaming-based approach would have to request 
the rest of the block from another executor (assuming the blocks are identical).
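
To make the stream-handler suggestion concrete, a purely hypothetical shape; 
none of these names exist in Spark, this only sketches the proposal above:

{code}
import java.io.InputStream

// Hypothetical API sketch: instead of materializing the whole block into a
// ChunkedByteBuffer, hand the caller a stream while the bytes arrive.
trait RemoteBlockStreamHandler[T] {
  def handle(blockId: String, in: InputStream): T
}

// Current behavior would fall out of a handler that just drains `in` to a
// temp file (the existing 'big fetch' path); deserialization could instead
// consume `in` directly. Failover mid-stream is the open question noted above.
def getRemoteBlockStreamed[T](blockId: String, handler: RemoteBlockStreamHandler[T]): Option[T] =
  ??? // block resolution and retry across replicas omitted in this sketch
{code}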

> BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be 
> implemented using streaming
> --
>
> Key: SPARK-22579
> URL: https://issues.apache.org/jira/browse/SPARK-22579
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.0
>Reporter: Eyal Farago
>
> When an RDD partition is cached on an executor but the task requiring it is 
> running on another executor (process locality ANY), the cached partition is 
> fetched via BlockManager.getRemoteValues, which delegates to 
> BlockManager.getRemoteBytes; both calls are blocking.
> In my use case I had a 700GB RDD spread over 1000 partitions on a 6-node 
> cluster, cached to disk. Rough math shows an average partition size of 700MB.
> Looking at the Spark UI it was obvious that tasks running with process 
> locality 'ANY' are much slower than local tasks (~40 seconds versus 8-10 
> minutes). I was able to capture thread dumps of executors executing remote 
> tasks and got this stack trace:
> {quote}Thread ID  Thread Name Thread StateThread Locks
> 1521  Executor task launch worker-1000WAITING 
> Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978})
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582)
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550)
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:638)
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690)
> org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote}
> Digging into the code showed that the block manager first fetches all bytes 
> (getRemoteBytes) and then wraps them with a deserialization stream. This has 
> several drawbacks:
> 1. Blocking: the requesting executor is blocked while the remote executor is 
> serving the block.
> 2. Potentially large memory footprint on the requesting executor: in my use 
> case, 700MB of raw bytes stored in a ChunkedByteBuffer.
> 3. Inefficient: the requesting side usually doesn't need 

[jira] [Updated] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-26 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22608:
-
Issue Type: Improvement  (was: Bug)

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Since several {{CodeGenerator.splitExpressions}} calls are used with 
> {{ctx.INPUT_ROW}}, it would be good to provide an API for this to avoid code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-26 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22608:
-
Priority: Minor  (was: Major)

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Since several {{CodeGenerator.splitExpressions}} calls are used with 
> {{ctx.INPUT_ROW}}, it would be good to provide an API for this to avoid code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22573) SQL Planner is including unnecessary columns in the projection

2017-11-26 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-22573.
-
Resolution: Duplicate

> SQL Planner is including unnecessary columns in the projection
> --
>
> Key: SPARK-22573
> URL: https://issues.apache.org/jira/browse/SPARK-22573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rajkishore Hembram
>
> While running TPC-H query 18 for benchmarking, I observed that the query 
> plan in Apache Spark 2.2.0 is less efficient than in other versions of Apache 
> Spark. Versions 2.0.2 and 2.1.2 include only the required columns in the 
> projections, but the query planner in Apache Spark 2.2.0 includes unnecessary 
> columns in the projection for some of the queries, needlessly increasing the 
> I/O and therefore the runtime.
> [Spark 2.1.2 TPC-H Query 18 
> Plan|https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
> [Spark 2.2.0 TPC-H Query 18 
> Plan|https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]
> TPC-H Query 18
> {code:java}
> select C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE, sum(L_QUANTITY)
> from CUSTOMER, ORDERS, LINEITEM
> where O_ORDERKEY in (
>     select L_ORDERKEY from LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300
>   )
>   and C_CUSTKEY = O_CUSTKEY and O_ORDERKEY = L_ORDERKEY
> group by C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
> order by O_TOTALPRICE desc, O_ORDERDATE
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22607) Set large stack size consistently for tests to avoid StackOverflowError

2017-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22607.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 19820
[https://github.com/apache/spark/pull/19820]

> Set large stack size consistently for tests to avoid StackOverflowError
> ---
>
> Key: SPARK-22607
> URL: https://issues.apache.org/jira/browse/SPARK-22607
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> I was seeing this error while testing the 2.2.1 RC:
> {code}
> OrderingSuite:
> ...
> - GenerateOrdering with ShortType
> *** RUN ABORTED ***
> java.lang.StackOverflowError: 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> ...
> {code}
> This doesn't seem to happen on Jenkins, for whatever reason. It seems like we 
> set JVM flags for tests inconsistently, and in particular, only set a 4MB 
> stack size for surefire, not scalatest-maven-plugin. Adding {{-Xss4m}} made 
> the test pass for me.
> We can also make sure that all of these pass {{-ea}} consistently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22607) Set large stack size consistently for tests to avoid StackOverflowError

2017-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22607:
--
Description: 
I was seeing this error while testing the 2.2.1 RC:

{code}
OrderingSuite:
...
- GenerateOrdering with ShortType
*** RUN ABORTED ***
java.lang.StackOverflowError: 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
...
{code}

This doesn't seem to happen on Jenkins, for whatever reason. It seems like we 
set JVM flags for tests inconsistently, and in particular, only set a 4MB stack 
size for surefire, not scalatest-maven-plugin. Adding {{-Xss4m}} made the test 
pass for me.

We can also make sure that all of these pass {{-ea}} consistently.

  was:
I was seeing this error while testing the 2.2.1 RC:

{code}
java.lang.StackOverflowError: 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
{code}

This doesn't seem to happen on Jenkins, for whatever reason. It seems like we 
set JVM flags for tests inconsistently, and in particular, only set a 4MB stack 
size for surefire, not scalatest-maven-plugin. Adding {{-Xss4m}} made the test 
pass for me.

We can also make sure that all of these pass {{-ea}} consistently.


> Set large stack size consistently for tests to avoid StackOverflowError
> ---
>
> Key: SPARK-22607
> URL: https://issues.apache.org/jira/browse/SPARK-22607
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> I was seeing this error while testing the 2.2.1 RC:
> {code}
> OrderingSuite:
> ...
> - GenerateOrdering with ShortType
> *** RUN ABORTED ***
> java.lang.StackOverflowError: 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) 
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> ...
> {code}
> This doesn't seem to happen on Jenkins, for whatever reason. It seems like we 
> set JVM flags for tests inconsistently, and in particular, only set a 4MB 
> stack size for surefire, not scalatest-maven-plugin. Adding {{-Xss4m}} made 
> the test pass for me.
> We can also make sure that all of these pass {{-ea}} consistently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22609:


Assignee: (was: Apache Spark)

> Reuse CodeGeneration.nullSafeExec when possible
> ---
>
> Key: SPARK-22609
> URL: https://issues.apache.org/jira/browse/SPARK-22609
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> There are several places in the code where `CodeGeneration.nullSafeExec` 
> could be used but is not. This leaves the generated code with a lot of 
> useless blocks like:
> {code}
> if (!false) {
>   // some code here
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266002#comment-16266002
 ] 

Apache Spark commented on SPARK-22609:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19822

> Reuse CodeGeneration.nullSafeExec when possible
> ---
>
> Key: SPARK-22609
> URL: https://issues.apache.org/jira/browse/SPARK-22609
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> There are several places in the code where `CodeGeneration.nullSafeExec` 
> could be used but is not. This leaves the generated code with a lot of 
> useless blocks like:
> {code}
> if (!false) {
>   // some code here
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22609:


Assignee: Apache Spark

> Reuse CodeGeneration.nullSafeExec when possible
> ---
>
> Key: SPARK-22609
> URL: https://issues.apache.org/jira/browse/SPARK-22609
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Trivial
>
> There are several places in the code where `CodeGeneration.nullSafeExec` 
> could be used but is not. This leaves the generated code with a lot of 
> useless blocks like:
> {code}
> if (!false) {
>   // some code here
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22609:
---

 Summary: Reuse CodeGeneration.nullSafeExec when possible
 Key: SPARK-22609
 URL: https://issues.apache.org/jira/browse/SPARK-22609
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido
Priority: Trivial


There are several places in the code where `CodeGeneration.nullSafeExec` could 
be used but is not. This leaves the generated code with a lot of useless blocks 
like:
{code}
if (!false) {
  // some code here
}
{code}
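
For context, {{nullSafeExec}} has roughly the following shape (paraphrased; 
treat the exact signature as an assumption rather than a quote of the source). 
When the input is not nullable it emits the body directly, so the dead 
{{if (!false)}} guard never appears:

{code}
// Paraphrased sketch of CodeGeneration.nullSafeExec.
def nullSafeExec(nullable: Boolean, isNull: String)(execute: String): String = {
  if (nullable) {
    s"""
       |if (!$isNull) {
       |  $execute
       |}
     """.stripMargin
  } else {
    // Non-nullable input: no null check, hence no `if (!false)` in the output.
    "\n" + execute
  }
}
{code}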



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-11-26 Thread ABHISHEK CHOUDHARY (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265993#comment-16265993
 ] 

ABHISHEK CHOUDHARY commented on SPARK-18016:


I found the same issue in the latest Spark 2.2.0 while using PySpark. The number 
of columns I am expecting is more than 50K; do you think the patch will handle 
numbers that large as well?

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection 
> has grown past JVM limit of 0xFFFF
>   at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at 

[jira] [Assigned] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22608:


Assignee: (was: Apache Spark)

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Since several {{CodeGenerator.splitExpressions}} calls are used with 
> {{ctx.INPUT_ROW}}, it would be good to provide an API for this to avoid code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265952#comment-16265952
 ] 

Apache Spark commented on SPARK-22608:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19821

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Since several {{CodeGenerator.splitExpressions}} calls are used with 
> {{ctx.INPUT_ROW}}, it would be good to provide an API for this to avoid code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22608:


Assignee: Apache Spark

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> Since several {{CodeGenerator.splitExpressions}} calls are used with 
> {{ctx.INPUT_ROW}}, it would be good to provide an API for this to avoid code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-26 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22608:
-
Summary: Avoid code duplication regarding CodeGeneration.splitExpressions() 
 (was: Add new API to CodeGeneration.splitExpressions())

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Since several {{CodeGenerator.splitExpressions}} calls are used with 
> {{ctx.INPUT_ROW}}, it would be good to provide an API for this to avoid code 
> duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22608) Add new API to CodeGeneration.splitExpressions()

2017-11-26 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22608:


 Summary: Add new API to CodeGeneration.splitExpressions()
 Key: SPARK-22608
 URL: https://issues.apache.org/jira/browse/SPARK-22608
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


Since several {{CodeGenerator.splitExpressions}} calls are used with 
{{ctx.INPUT_ROW}}, it would be good to provide an API for this to avoid code 
duplication.
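
One possible shape for such an API, as a hypothetical sketch only; the helper 
name and the parameters of the inner {{splitExpressions}} call below are 
illustrative assumptions, not the final API:

{code}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext

// Hypothetical convenience overload: default the split-function arguments to
// the current input row, so call sites stop repeating the same boilerplate.
def splitExpressionsWithCurrentInputs(
    ctx: CodegenContext,
    expressions: Seq[String]): String =
  ctx.splitExpressions(
    expressions,
    funcName = "apply",
    arguments = ("InternalRow", ctx.INPUT_ROW) :: Nil)
{code}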



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org